Adam Rumpf: Math Person

Slight Misspeller

GitHub Link

A Python module that generates slightly misspelled versions of text files.

Two different types of misspelling procedures are defined. Typographical misspelling is based on common mistakes from sloppy typing on a QWERTY keyboard. Phonological misspelling is based on some rudimentary and lazy English phonotactics and is meant to yield a pronounceable result rather than simply a string of random characters. The alterations are also meant to be minor enough for the text to remain vaguely understandable.

What is the purpose of this module? That is an excellent question.

Motivation

My high school senior prank was to turn literally every wall fixture upside-down: Every poster, every banner, every framed picture, every bulletin board notice. Everything. I snuck into the building after hours with a small group of friends, and after a couple hours of work the deed was done. By far my favorite element was the paper dolls from some sort of charity drive that was going on. Every donor got to sign a paper doll to be posted on the wall, and by the end of the drive there were hundreds of them filling an entire wall in the hallway. Fortunately for us they were only taped up, and so we were able to turn every single one of them upside-down.

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it. So the first answer to the question "Why would you spend so much time making such a dumb and useless thing?" is that it's my idea of comedy.

The second answer is the same as my answer for why I do most things: Dungeons & Dragons. I do like the simplicity of the magic system in D&D, but for a long time I've been toying with the idea of developing a custom magic system that would allow the players to create their own spells by combining magic words in the correct way. I would want my system to be intricate and precise so that writing a spell is like writing a computer program, leaving absolutely no ambiguity as to the effects of a spell, but I would also want it to be something that follows linguistic rules.

A lot of fans of the fantasy genre are interested in linguistics. J. R. R. Tolkien famously invented fantasy languages for reasons of worldbuilding and fun, and a lot of DMs do the same for use in their campaigns. I have a minor interest in linguistics and in the conlang (constructed language) community, stemming from my own efforts to develop a language-based magic system. Because of this I've probably thought more than the average person about how exactly to break words into their component parts and how to build them back up from a set of phonotactic rules, and this led me to ask myself what the phonological equivalent of a "typo" would be.

Design Notes

It would be trivially simple to write a program that just randomizes a portion of the characters in a text file, but that wasn't nearly convoluted enough for my taste. More importantly it didn't seem like it was likely to lead to funny-sounding typos, which was the entire point of this project in the first place. This is what prompted me to first consider the idea of how to algorithmically mistype a word in a way that would leave a pronounceable result, and led me to the concept of phonological misspelling. Typographical misspelling came later for the sake of completeness.

Typographical Misspelling

What I've termed "typographical misspelling" is what most people mean when they refer to misspelling in typed text: That an error was made during typing so that the wrong sequence of characters is produced. In particular this might include deleted characters, inserted extraneous characters, incorrect characters, or transposed characters.

For example, if we wanted to type the word sheep we might accidentally miss the h to end up with seep, or type w instead of s for wheep, or type the h before the s for hseep.
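The four kinds of error can each be written as a small string operation. This is only an illustrative sketch: the function names are hypothetical and are not taken from the module itself.

```python
def delete_char(word: str, i: int) -> str:
    """Drop the character at index i."""
    return word[:i] + word[i + 1:]


def insert_char(word: str, i: int, ch: str) -> str:
    """Insert an extraneous character ch before index i."""
    return word[:i] + ch + word[i:]


def replace_char(word: str, i: int, ch: str) -> str:
    """Replace the character at index i with ch."""
    return word[:i] + ch + word[i + 1:]


def transpose_chars(word: str, i: int) -> str:
    """Swap the characters at indices i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


print(delete_char("sheep", 1))        # seep
print(replace_char("sheep", 0, "w"))  # wheep
print(transpose_chars("sheep", 0))    # hseep
```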

Rather than generating these types of errors completely at random, I wanted my program to roughly model the sorts of mistakes that someone might make when they have a bad case of "stupid fingers", in which case it would make sense for any erroneous characters to be close to the intended character on the keyboard. The solution I came up with was to randomly draw inserted and replaced characters from the pool of keys adjacent to the intended one on a QWERTY keyboard, weighted according to their Euclidean distance from the intended key.
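A minimal sketch of that idea follows. Several details here are assumptions made for illustration rather than the module's actual values: the row stagger offsets, the 1.5-key adjacency radius, the inverse-distance weighting, and the names `KEY_POS`, `neighbor_weights`, and `random_neighbor`.

```python
import math
import random

# Letter rows of a QWERTY keyboard. The horizontal stagger of each row
# (in key widths) is an assumed approximation of a physical layout.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
OFFSETS = [0.0, 0.25, 0.75]

# Map each key to an (x, y) position measured in key widths.
KEY_POS = {
    ch: (col + OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def neighbor_weights(key: str, radius: float = 1.5) -> dict:
    """Return {neighbor: weight} for keys within radius of key.

    Weights are 1 / distance, so closer keys are more likely (assumed).
    """
    x0, y0 = KEY_POS[key]
    weights = {}
    for other, (x, y) in KEY_POS.items():
        if other == key:
            continue
        dist = math.hypot(x - x0, y - y0)
        if dist <= radius:
            weights[other] = 1.0 / dist
    return weights


def random_neighbor(key: str, rng=random) -> str:
    """Draw one adjacent key at random, weighted by closeness."""
    weights = neighbor_weights(key)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


print(sorted(neighbor_weights("s")))  # ['a', 'd', 'e', 'w', 'x', 'z']
```

With these assumed positions, a mistyped s is most likely to come out as a or d, which sit exactly one key width away, and somewhat less likely to come out as e.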

Here's an example of the sorts of errors produced by typographical misspelling:

Before

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it.

After

This migh not strike yuo zs themost memorable ot seunor prsanks, bu it's exactly my favoirte kind of pactical joke: Oe which is lightly surralistic and easily overloked, but when you finally notice it you ebgin to gradwlly eralize3 just how much work someone must hav pu itno it.

Phonological Misspelling

What I've termed "phonological misspelling" is essentially the equivalent of typographical misspelling, but with alterations made directly to the individual sounds (called phonemes) of the word rather than the characters themselves. At least, that's the idealized definition. English is not a phonetic language, so there isn't a one-to-one correspondence between the letters in a word's spelling and the phonemes in its pronunciation. If we had the phonetic spelling of a word, for example its IPA spelling (not that IPA) wherein each symbol represents exactly one phoneme, then we could conduct deletions, insertions, and replacements directly on the phonemes themselves, exactly like we can with letters in typographical misspelling.

For example, the English IPA spelling of sheep is /ʃip/, consisting of exactly three phonemes in sequence: A voiceless postalveolar fricative /ʃ/ (represented by the sh), a close front unrounded vowel /i/ (represented by the ee), and a voiceless bilabial plosive /p/ (represented by the p). Conducting phonological misspelling would ideally be done directly on the IPA spelling, so we might accidentally miss the final /p/ to get /ʃi/ (shee), or replace the /i/ with an open back rounded vowel /ɒ/ to get /ʃɒp/ (shop).

Here's an example of the sorts of errors produced by phonological misspelling:

Before

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it.

After

Twis might not strike you as the miost meorable of senior pranks but it's exakkly my favorite kind of batical jeke: Onu whch is slightly surreelistic and easily overlooked, bu when you finally notice it you begin to graduall reulize just how much wlrk someone mst hoave put into it.

Simply replacing phonemes at random could result in a word which is impossible to pronounce out loud, or at least which would never occur in any natural language. The rules for which phonemes are allowed to be combined in any given language are called phonotactics. English is what is known as a (C)(C)(C)V(C)(C)(C)(C)(C) language, meaning that each syllable is allowed to consist of 0-3 consonant (C) sounds at the beginning (the onset), and then one mandatory vowel (V) sound in the middle (the nucleus), and finally 0-5 consonant sounds at the end (the coda), and there are restrictions on which specific phonemes can occur within each part of the syllable.
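The syllable template on its own (ignoring the restrictions on specific phonemes) is easy to express as a regular expression over a string of C and V labels. This is just an illustration of the template; the name `fits_template` is hypothetical, and the labels are assumed to have been assigned to phonemes by hand.

```python
import re

# 0-3 onset consonants, exactly one vowel nucleus, 0-5 coda consonants.
SYLLABLE_TEMPLATE = re.compile(r"C{0,3}VC{0,5}")


def fits_template(cv_pattern: str) -> bool:
    """Check whether a string of C/V labels is a legal syllable shape."""
    return SYLLABLE_TEMPLATE.fullmatch(cv_pattern) is not None


print(fits_template("CVC"))       # True, e.g. sheep /ʃip/
print(fits_template("CCCVCCCC"))  # True, e.g. strengths /strɛŋkθs/
print(fits_template("CCCCV"))     # False, too many onset consonants
```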

For example, English does not allow a syllable to begin with a voiced velar nasal /ŋ/, the sound made by the ng in "running". That's not because saying such a thing would be impossible; it's just that it never occurs in any natural English word, and so English phonotactics include rules which prohibit it. There are words in other languages, such as the Vietnamese surname "Nguyễn" (which is by far the most common surname in Vietnam, shared by approximately 40% of the population), which do begin with that sound, although they might look a bit strange to native English speakers since, for the aforementioned reasons, no English word would ever begin with the letters ng.

Of course, my program doesn't have access to the phonetic spellings of words, so I had to come up with a way for it to approximate which parts of a word belong to which parts of which syllable based only on the spelling. The quick and dirty solution I came up with was to break the word into alternating consonant and vowel blocks, and then to assume that each syllable consisted of a vowel block followed by a consonant block (VC) with the exception of the first syllable in a word beginning with a consonant (CVC), or the last syllable in a word ending with a vowel (V). Rather than attempting to translate all of the English phonotactic rules into spelling rules, I instead decided to base my program's rules on the actual spellings of English words.
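The block-splitting heuristic might be sketched roughly like this. The names `cv_blocks` and `syllables` are hypothetical, and treating only a, e, i, o, and u as vowels (so that y counts as a consonant) is an assumption made for the sake of the example. Input is assumed to be a single alphabetic word.

```python
from itertools import groupby

VOWELS = set("aeiou")  # assumption: y is treated as a consonant


def cv_blocks(word: str) -> list:
    """Split a word into alternating consonant and vowel blocks."""
    return ["".join(group)
            for _, group in groupby(word.lower(), key=lambda ch: ch in VOWELS)]


def syllables(word: str) -> list:
    """Group blocks into (onset, nucleus, coda) triples.

    Only the first syllable can have an onset (CVC); every later
    syllable is VC, or just V if the word ends with a vowel block.
    """
    blocks = cv_blocks(word)
    result = []
    onset, i = "", 0
    if blocks and blocks[0][0] not in VOWELS:
        onset, i = blocks[0], 1
    while i < len(blocks):
        nucleus = blocks[i]
        coda = blocks[i + 1] if i + 1 < len(blocks) else ""
        result.append((onset, nucleus, coda))
        onset = ""
        i += 2
    if onset:  # the word contained no vowel block at all
        result.append((onset, "", ""))
    return result


print(cv_blocks("misspelling"))
# ['m', 'i', 'ssp', 'e', 'll', 'i', 'ng']
print(syllables("banana"))
# [('b', 'a', 'n'), ('', 'a', 'n'), ('', 'a', '')]
```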

I wrote a few small scripts to gather statistical data from a list of over 466,000 English words and compile a set of phonological misspelling rules. The scripts searched through each English word, divided it into syllable groups, and found which combinations of characters were allowed to occur in each part of the syllable. These rules then consisted of lists of forbidden 2-letter and 3-letter substrings within each part of a syllable, based on letter combinations which did not appear in any English word.
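As a rough sketch of how such a forbidden list could be derived, the snippet below collects every n-letter substring seen in a sample of syllable parts and treats everything else as forbidden. The function names are hypothetical, and the five-item `onsets` list is a toy stand-in for data that would really be extracted from the full word list.

```python
from itertools import product
import string


def observed_ngrams(parts, n: int) -> set:
    """Collect every n-letter substring that occurs in any part."""
    seen = set()
    for part in parts:
        for i in range(len(part) - n + 1):
            seen.add(part[i:i + n])
    return seen


def forbidden_ngrams(parts, n: int, alphabet: str = string.ascii_lowercase) -> set:
    """Return all n-letter combinations that never occur in parts."""
    every = {"".join(combo) for combo in product(alphabet, repeat=n)}
    return every - observed_ngrams(parts, n)


# Toy sample of onset blocks, not real statistics.
onsets = ["str", "sh", "th", "bl", "spr"]
forbidden = forbidden_ngrams(onsets, 2)
print("st" in forbidden)  # False, because "st" occurs in "str"
print("ng" in forbidden)  # True, because no sample onset contains "ng"
```

The same function covers the 3-letter case by passing `n=3`, and it would be run separately on the onset, nucleus, and coda samples to get one list per syllable part.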

Finally I also included a list of letter pairs which could potentially be grouped. Certain English letter combinations, like th, ch, sh, and ng, represent a single phoneme having no relation to the individual constituent letters, in which case it makes sense to consider the pair as a single character. Similarly, the letter q is so commonly followed by the letter u that it also made sense to consider qu as a single character. During the phonological misspelling process, letter pairs in this list have a chance to be considered as a single unit and therefore deleted, replaced, or inserted together.
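One way to picture the grouping step is as a tokenizer that walks through a word and occasionally keeps a listed pair together. This is a sketch under assumptions: the name `split_units`, the `group_prob` parameter, and its default of 0.5 are all hypothetical, and the pair list contains only the examples mentioned above.

```python
import random

# Pairs named in the text; the module's real list may be longer.
GROUPABLE = {"th", "ch", "sh", "ng", "qu"}


def split_units(word: str, group_prob: float = 0.5, rng=random) -> list:
    """Split a word into units, sometimes keeping a listed pair together."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in GROUPABLE and rng.random() < group_prob:
            units.append(pair)  # treat the pair as one unit
            i += 2
        else:
            units.append(word[i])  # fall back to a single letter
            i += 1
    return units


print(split_units("sheep", group_prob=1.0))  # ['sh', 'e', 'e', 'p']
print(split_units("sheep", group_prob=0.0))  # ['s', 'h', 'e', 'e', 'p']
```

Deleting, replacing, or inserting then operates on the elements of the returned list, so a grouped pair such as sh is handled as a whole.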