Adam Rumpf: Math Person

Slight Misspeller

GitHub Link

A Python module that generates slightly misspelled versions of text files.

Two different types of misspelling procedures are defined. Typographical misspelling is based on common mistakes from sloppy typing on a QWERTY keyboard. Phonological misspelling is based on some rudimentary and lazy English phonotactics and is meant to yield a pronounceable result rather than simply a string of random characters. The alterations are also meant to be minor enough for the text to remain vaguely understandable.

What is the purpose of this module? That is an excellent question.

Motivation

My high school senior prank was to turn literally every wall fixture upside-down: Every poster, every banner, every framed picture, every bulletin board notice. Everything. I snuck into the building after hours with a small group of friends, and after a couple hours of work the deed was done. By far my favorite element was the paper dolls from some sort of charity drive that was going on. Every donor got to sign a paper doll to be posted on the wall, and by the end of the drive there were hundreds of them filling an entire wall in the hallway. Fortunately for us they were only taped up, and so we were able to turn every single one of them upside-down.

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it. So the first answer to the question "Why would you spend so much time making such a dumb and useless thing?" is that it's my idea of comedy.

The second answer is the same as my answer for why I do most things: Dungeons & Dragons. I do like the simplicity of the magic system in D&D, but for a long time I've been toying with the idea of developing a custom magic system that would allow the players to create their own spells by combining magic words in the correct way. I would want my system to be intricate and precise so that writing a spell is like writing a computer program, leaving absolutely no ambiguity as to the effects of a spell, but I would also want it to be something that follows linguistic rules.

A lot of fans of the fantasy genre are interested in linguistics. J. R. R. Tolkien famously invented fantasy languages for reasons of worldbuilding and fun, and a lot of DMs do the same for use in their campaigns. I have a minor interest in linguistics and in the conlang (constructed language) community, stemming from my own efforts to develop a language-based magic system. Because of this I've probably thought more than the average person about how exactly to break words into their component parts and how to build them back up from a set of phonotactic rules, and this led me to ask myself what the phonological equivalent of a "typo" would be.

Design Notes

It would be trivially simple to write a program that just randomizes a portion of the characters in a text file, but that wasn't nearly convoluted enough for my taste. More importantly it didn't seem like it was likely to lead to funny-sounding typos, which was the entire point of this project in the first place. This is what prompted me to first consider the idea of how to algorithmically mistype a word in a way that would leave a pronounceable result, and led to the concept of phonological misspelling. Typographical misspelling came later for the sake of completeness.

Typographical Misspelling

What I've termed "typographical misspelling" is what most people mean when they refer to misspelling in typed text: That an error was made during typing so that the wrong sequence of characters is produced. In particular this might include deleted characters, inserted extraneous characters, incorrect characters, or transposed characters.

For example, if we wanted to type the word sheep we might accidentally miss the h to end up with seep, or type w instead of s for wheep, or type the h before the s for hseep.
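The four kinds of error can each be written as a one-line string operation. This is only an illustrative sketch; the function names are made up for this example and are not taken from the module itself.

```python
def delete_char(word, i):
    """Drop the character at index i."""
    return word[:i] + word[i + 1:]


def insert_char(word, i, c):
    """Insert the extraneous character c before index i."""
    return word[:i] + c + word[i:]


def replace_char(word, i, c):
    """Replace the character at index i with c."""
    return word[:i] + c + word[i + 1:]


def transpose_chars(word, i):
    """Swap the characters at indices i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


print(delete_char("sheep", 1))        # seep
print(replace_char("sheep", 0, "w"))  # wheep
print(transpose_chars("sheep", 0))    # hseep
```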

Rather than generating these types of errors completely at random, I wanted my program to roughly model the sorts of mistakes that someone might make when they have a bad case of "stupid fingers", in which case it would make sense for any erroneous characters to be close to the intended character on the keyboard. The solution I came up with was to randomly draw inserted and replaced characters from the pool of keys adjacent to the intended one on a QWERTY keyboard, weighted according to their Euclidean distance from the intended key.

Here's an example of the sorts of errors produced by typographical misspelling:
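A rough sketch of distance-weighted key selection might look like the following. The key coordinates, the row stagger, the adjacency radius, and the choice of 1/distance as the weight are all assumptions made for illustration; the module's actual layout data and weighting may differ.

```python
import math
import random

# Assumed layout: letter rows only, each row shifted right by a guessed stagger.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
OFFSETS = [0.0, 0.25, 0.75]
KEY_POS = {
    key: (col + OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, key in enumerate(keys)
}


def neighbors(key, radius=1.5):
    """Return {neighbor: distance} for every key within radius of key."""
    x0, y0 = KEY_POS[key]
    near = {}
    for other, (x, y) in KEY_POS.items():
        if other != key:
            dist = math.hypot(x - x0, y - y0)
            if dist <= radius:
                near[other] = dist
    return near


def nearby_key(key, rng=random):
    """Pick an adjacent key at random; closer keys get larger weights."""
    near = neighbors(key)
    keys = list(near)
    weights = [1.0 / near[k] for k in keys]  # assumed inverse-distance weight
    return rng.choices(keys, weights=weights, k=1)[0]


print(sorted(neighbors("s")))  # ['a', 'd', 'e', 'w', 'x', 'z']
print(nearby_key("s"))         # one of the six keys above
```

With this layout the neighbors of s include w, which is how sheep can turn into wheep.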

Before

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it.

After

This migh not stike yuo zs themost memorable ot senior prsanks, bu it's execty my favoirte kind of pactical joke: Oe which is lightly kurralisthic and easily overlokev, but when you finally notice it you ebgin to gradwlly realizu just how much work someone must hav pu itno it.

Phonological Misspelling

What I've termed "phonological misspelling" is essentially the equivalent of typographical misspelling, but with alterations made directly to the individual sounds (called phonemes) of the word rather than to the characters themselves. At least, that's the idealized definition. English is not a phonetic language, so there isn't a one-to-one correspondence between the letters in a word's spelling and the phonemes in its pronunciation. If we had the phonetic spelling of a word, for example its IPA spelling (not that IPA) wherein each symbol represents exactly one phoneme, then we could conduct deletions, insertions, and replacements directly on the phonemes themselves, exactly like we can with letters in typographical misspelling.

For example, the English IPA spelling of sheep is /ʃip/, consisting of exactly three phonemes in sequence: A voiceless postalveolar fricative /ʃ/ (represented by the sh), a close front unrounded vowel /i/ (represented by the ee), and a voiceless bilabial plosive /p/ (represented by the p). Conducting phonological misspelling would ideally be done directly on the IPA spelling, so we might accidentally miss the final /p/ to get /ʃi/ (shee), or replace the /i/ with an open back rounded vowel /ɒ/ to get /ʃɒp/ (shop).
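In code, this idealized version amounts to treating a word as a list of phonemes rather than a string of letters. The snippet below is a toy illustration using the sheep example, not something the module does (as discussed later, it never sees phonetic spellings).

```python
# One list entry per phoneme of /ʃip/ (sheep).
sheep = ["ʃ", "i", "p"]

# Deletion: miss the final /p/.
dropped = sheep[:-1]

# Replacement: swap the vowel /i/ for /ɒ/.
replaced = ["ɒ" if phoneme == "i" else phoneme for phoneme in sheep]

print("/" + "".join(dropped) + "/")   # /ʃi/
print("/" + "".join(replaced) + "/")  # /ʃɒp/
```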

Here's an example of the sorts of errors produced by phonological misspelling:

Before

This might not strike you as the most memorable of senior pranks, but it's exactly my favorite kind of practical joke: One which is slightly surrealistic and easily overlooked, but when you finally notice it you begin to gradually realize just how much work someone must have put into it.

After

Wis might not strike you as the miost meorable of senioer pranks, but it's exakly my falorite kind of baticeal jeke: Onu which is slightly surreelistic and easily overooked, bu when you finally notice imt you begin to graduall reulize just how much work someone musx hoave put intu it.

Simply replacing phonemes at random could result in a word which is impossible to pronounce out loud, or at least which would never occur in any natural language. The rules for which phonemes are allowed to be combined in any given language are called phonotactics. English is what is known as a (C)(C)(C)V(C)(C)(C)(C)(C) language, meaning that each syllable is allowed to consist of 0-3 consonant (C) sounds at the beginning (the onset), and then one mandatory vowel (V) sound in the middle (the nucleus), and finally 0-5 consonant sounds at the end (the coda), and there are restrictions on which specific phonemes can occur within each part of the syllable.
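The syllable shape alone (ignoring the restrictions on specific phonemes) can be expressed as a regular expression over C and V labels. This is just a sketch of the (C)(C)(C)V(C)(C)(C)(C)(C) template; the helper name is hypothetical.

```python
import re

# 0-3 onset consonants, exactly one vowel nucleus, 0-5 coda consonants.
SYLLABLE_SHAPE = re.compile(r"C{0,3}VC{0,5}")


def is_english_shape(shape):
    """Check a string of C/V labels against the English syllable template."""
    return SYLLABLE_SHAPE.fullmatch(shape) is not None


print(is_english_shape("CVC"))    # True, e.g. /ʃip/
print(is_english_shape("CCCCV"))  # False: four consonants in the onset
print(is_english_shape("CC"))     # False: no vowel nucleus
```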

For example, English does not allow a syllable to begin with a voiced velar nasal /ŋ/, the sound made by the ng in "running". That's not because saying such a thing would be impossible, it's just that it never occurs in any natural English word, and so English phonotactics include rules which prohibit it. There are words in other languages, such as the Vietnamese surname "Nguyễn" (which is by far the most common surname in Vietnam, shared by approximately 40% of the population), which do begin with that sound, although they might look a bit strange to native English speakers since, for the aforementioned reasons, no English word would ever begin with the letters ng.

Of course, my program doesn't have access to the phonetic spelling of words, so I had to come up with a way for it to approximate which parts of a word belong to which parts of which syllable based only on the spelling. The quick and dirty solution I came up with was to break the word into alternating consonant and vowel blocks, and then to assume that each syllable consisted of a vowel block followed by a consonant block (VC), with the exception of the first syllable in a word beginning with a consonant (CVC), or the last syllable in a word ending with a vowel (V). Rather than attempting to translate all of the English phonotactic rules into spelling rules, I instead decided to base my program's rules on the actual spellings of English words.
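The block-splitting heuristic described above can be sketched roughly as follows. The function names are invented for this example, and it assumes lowercase alphabetic input with only a, e, i, o, and u counted as vowels (how the module treats y is not stated here).

```python
import re

VOWELS = "aeiou"  # assumption: y is treated as a consonant


def split_blocks(word):
    """Split a lowercase word into alternating consonant and vowel blocks."""
    return re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)


def split_syllables(word):
    """Group blocks into (onset, nucleus, coda) triples using the VC rule."""
    blocks = split_blocks(word)
    syllables = []
    onset = ""
    i = 0
    # A leading consonant block becomes the onset of the first syllable (CVC).
    if blocks and blocks[0][0] not in VOWELS:
        onset = blocks[0]
        i = 1
    # Every remaining syllable is a vowel block plus the consonant block after it.
    while i < len(blocks):
        nucleus = blocks[i]
        coda = blocks[i + 1] if i + 1 < len(blocks) else ""
        syllables.append((onset, nucleus, coda))
        onset = ""
        i += 2
    # A word with no vowel block at all is kept as a lone onset.
    if onset and not syllables:
        syllables.append((onset, "", ""))
    return syllables


print(split_blocks("surrealistic"))
# ['s', 'u', 'rr', 'ea', 'l', 'i', 'st', 'i', 'c']
print(split_syllables("banana"))
# [('b', 'a', 'n'), ('', 'a', 'n'), ('', 'a', '')]
```

Note that these "syllables" are spelling-based approximations: surrealistic comes out as s-u-rr, ea-l, i-st, i-c, which is not how anyone would actually syllabify it.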

I wrote a few small scripts to gather statistical data from a list of over 466,000 English words and compile a set of phonological misspelling rules. The scripts searched through each English word, divided it into syllable groups, and found which combinations of characters were allowed to occur in each part of the syllable. These rules then consisted of lists of forbidden 2-letter and 3-letter substrings within each part of a syllable, based on letter combinations which did not appear in any English word.
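As a simplified illustration of how such a rule list could be compiled, the sketch below looks only at word-initial consonant blocks and only at 2-letter substrings, and uses a tiny stand-in word list. The function names and the sample words are assumptions for this example; the real scripts cover every syllable part and 3-letter substrings as well.

```python
import re
from itertools import product
from string import ascii_lowercase

CONSONANTS = [c for c in ascii_lowercase if c not in "aeiou"]


def leading_consonants(word):
    """Return the consonant block at the start of a word (possibly empty)."""
    return re.match(r"[^aeiou]*", word).group(0)


def forbidden_onset_pairs(words):
    """Return all consonant pairs never seen in any word-initial block."""
    seen = set()
    for word in words:
        onset = leading_consonants(word.lower())
        seen.update(onset[i:i + 2] for i in range(len(onset) - 1))
    every_pair = {a + b for a, b in product(CONSONANTS, repeat=2)}
    return every_pair - seen


# Stand-in for the full word list.
sample_words = ["sheep", "strike", "prank", "three"]
forbidden = forbidden_onset_pairs(sample_words)

print("ng" in forbidden)  # True: no sample word starts with ng
print("sh" in forbidden)  # False: sheep starts with sh
```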

Finally I also included a list of letter pairs which could potentially be grouped. Certain English letter combinations, like th, ch, sh, and ng, represent a single phoneme having no relation to the individual constituent letters, in which case it makes sense to consider the pair as a single character. Similarly, the letter q is so commonly followed by the letter u that it also made sense to consider qu as a single character. During the phonological misspelling process, letter pairs on this list have a chance to be considered as a single unit and therefore deleted, replaced, or inserted together.
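One way to picture the grouping step is as a tokenizer that sometimes keeps a listed pair together. This is a hedged sketch: the function name and the group_chance parameter (and its default of 0.5) are hypothetical, and only the five pairs named above are included.

```python
import random

GROUPS = {"th", "ch", "sh", "ng", "qu"}  # the pairs named in the text


def tokenize(word, group_chance=0.5, rng=random):
    """Split a word into units, keeping a listed pair together by chance."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in GROUPS and rng.random() < group_chance:
            units.append(pair)  # treat the pair as a single unit
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(tokenize("sheep", group_chance=1.0))  # ['sh', 'e', 'e', 'p']
print(tokenize("sheep", group_chance=0.0))  # ['s', 'h', 'e', 'e', 'p']
```

Any deletion, replacement, or insertion would then act on the resulting units rather than on individual letters, so sh could be dropped or swapped as a whole.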