Saturday, July 7, 2012

Sorting a Word List by Syllable with AWK

Download the Word List Files

Recently I've been thinking about changing my domain name to something more technology oriented and less personal.  I never intended to use my name for this site; my first priority was just getting something online, and I could worry about choosing the perfect domain name later.  There's only one problem with selecting a new site name: when it comes to things like this, I've got the creativity of a piece of mouldy toast.  So to help me along, I thought I could use a bit of computer-assisted inspiration in the form of a word list.

Before coming up with a name I already had a bit of an idea of what I wanted.  Preferably I'd like it to be 2 words, with a total of 3 or fewer syllables.  The last letter of the first word needs to be different from the first letter of the second word.  This avoids double letters where the words join and makes it easier to communicate the site name verbally.  For example, if I said my site was "hot tech", would you type "hottech" or "hotech"?  So basically I want a simple name that is unambiguous to type.  Perfect examples are adafruit, sparkfun, and makerbot.  These names are simple, have impact, and are hard to type incorrectly.  So that's what I'm trying to achieve.  I'm also trying my hardest not to do some lame rip-off of something already out there, for example "arcjoy" instead of "sparkfun".
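To make those rules concrete, here is a rough sketch of how a candidate pair could be checked.  The check_pair function and its argument order are made up for illustration, and the syllable counts are supplied by hand rather than looked up anywhere.

```shell
# Hypothetical helper: check a two-word candidate against the naming rules.
# Arguments: word1 syllables1 word2 syllables2
check_pair() {
    awk -v w1="$1" -v s1="$2" -v w2="$3" -v s2="$4" 'BEGIN {
        last  = substr(w1, length(w1), 1)   # last letter of the first word
        first = substr(w2, 1, 1)            # first letter of the second word
        # Accept only if the total is 3 syllables or fewer and
        # joining the words does not produce a double letter.
        if (s1 + s2 <= 3 && last != first)
            print "ok: " w1 w2
        else
            print "rejected: " w1 w2
    }'
}

check_pair hot 1 tech 1      # rejected: hottech
check_pair spark 1 fun 1     # ok: sparkfun
check_pair maker 2 bot 1     # ok: makerbot
```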

With these requirements in mind I thought that a good place to start would be a list of words that also indicated how many syllables are in each word.  The closest I could find was a word list from Project Gutenberg.  It's a list of words from the Moby Project with each word split into syllables by a delimiter character.  It's not perfect but it's massive and reasonably easy to work with.  The file does need to be processed slightly to make it more usable.

Unsurprisingly, this task is most efficiently done using the Linux command line and awk, a tool that reads files one line at a time and lets you process them by breaking each line up into fields.  So let's get started by downloading the file.  I'm going to demonstrate the commands using a small 10 line snippet from it.  When processing the actual file, the output of each command was redirected to a file.  Firstly we'll run a command to display all of the non-printable characters.  This will show the line ending and delimiter characters.

cat -v mhyph.txt

[Figure: Linux cat command displaying non-printable characters]

The first thing you may notice is the ^M at the end of each line.  This is the control code for a carriage return and indicates that the file is formatted for Windows.  To process the file with awk it needs to be converted to UNIX format, which ends each line with only a newline character, whereas Windows files end each line with a carriage return followed by a newline.  If the carriage return isn't removed, awk will treat it as part of the last field.  Printing this field would then return the cursor to the start of the line, leading to unexpected results.  The sub command of awk can be used to find the carriage return character at the end of each line (which can also be represented by \r) and replace it with nothing.

awk '{ sub(/\r$/, ""); print }' mhyph.txt

[Figure: Linux awk command removing the carriage return character]
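As a small self-contained sketch of the same substitution, the following builds a made-up two-line sample with Windows line endings and strips them.  The sample words and the temporary file are assumptions for illustration only.  It is written so that the line is always printed, even when there is no carriage return to remove.

```shell
# Hypothetical two-line sample with Windows (CRLF) line endings.
sample=$(mktemp)
printf 'hot\r\ndog\r\n' > "$sample"

cat -v "$sample"                      # each line shows a trailing ^M

# Strip a trailing carriage return from every line, then print the line.
cleaned=$(awk '{ sub(/\r$/, ""); print }' "$sample")
printf '%s\n' "$cleaned" | cat -v     # no ^M remains

rm -f "$sample"
```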

We'll now run the cat command again to inspect the delimiters.

cat -v mhyph_unix.txt

[Figure: Linux cat command identifying delimiters]

The documentation states that the delimiter is represented by the ASCII character 165.  (Strictly speaking ASCII only covers 0 to 127, so this is a byte value from an extended character set.)  It can be identified in the cat -v output as the M code M-%.  The M- prefix means the high bit of the byte is set, so the value is found by adding 128 to the ASCII value of the character after the dash.  In this case the ASCII value of % is 37, and adding 128 to 37 gives 165, or 0xA5 in hex.
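The arithmetic can be checked from the shell.  This is only a sketch; the variable names are arbitrary, and it relies on printf treating an argument with a leading quote as "give me the numeric value of the next character".

```shell
# Numeric value of the character shown after "M-" by cat -v.
low=$(printf '%d' "'%")          # 37
code=$((low + 128))              # 165
hex=$(printf '%X' "$code")       # A5
echo "$low + 128 = $code (0x$hex)"

# Going the other way: emit byte 165 (octal 245) and let cat -v render it.
shown=$(printf '\245' | cat -v)
echo "$shown"                    # M-%
```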

I'd like the format of the final output file to use only readable characters, so I'm using different field separators.  Each line will contain three main fields separated by an underscore character.  The first field will be the number of syllables in each word, the second field will be the actual word, and the last field will be the original word split into syllables using an equals sign as a delimiter.


The next step is where we actually process each line, and it's reasonably complicated.  In the BEGIN block of the command we set the field separator (FS) to 0xA5, which tells awk how to split the input.  The output record separator (ORS) is also set to nothing instead of the default newline character.  If this wasn't done, a newline would be inserted after every print statement.  The main block first prints NF followed by an underscore.  NF is the number of fields in the input line, which is the number of syllables in the word.  A for loop then prints each field in turn.  This prints the word with the delimiting characters removed.  An underscore is then inserted.  Next, another for loop does basically the same thing, except it prints an equals sign after each field except the last.  The last field is then printed followed by a newline character to finish off the line.

awk 'BEGIN{FS="\xA5"; ORS="";}{print ((NF) "_");  for (i=1; i<=NF; i++) print $i; print "_"; for (i=1; i<NF; i++) print ($i "="); print(($NF) "\n"); }' mhyph_unix.txt

[Figure: Linux awk command processing the file]
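Here is a sketch of the same transformation on three made-up input lines, restructured to build each output line in variables instead of changing ORS.  Two things in it are my own assumptions rather than part of the command above: the delimiter is written as the octal escape \245 (the same byte as 0xA5), and awk is run with LC_ALL=C so the byte is not interpreted as part of a multibyte character.

```shell
# Hypothetical sample: syllables separated by byte 165 (octal 245).
counted=$(printf 'hot\245dog\nsyl\245la\245ble\ncat\n' |
    LC_ALL=C awk 'BEGIN { FS = "\245" }
    {
        word = ""; parts = ""
        for (i = 1; i <= NF; i++) {
            word  = word $i                          # word without delimiters
            parts = parts (i > 1 ? "=" : "") $i      # syllables joined by "="
        }
        print NF "_" word "_" parts                  # count_word_syl=la=bles
    }')
printf '%s\n' "$counted"
# 2_hotdog_hot=dog
# 3_syllable_syl=la=ble
# 1_cat_cat
```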

Now we have a file that contains a list of words with their syllable count, along with how each word is broken up.  Using the sort command we can sort the file by the number of syllables while keeping words with the same count in their original order.  The -t option sets the field separator, -k1,1 tells sort to use only the first field to determine rank, -s makes the sort stable so that lines with equal keys keep their order, and -n tells sort to treat the field as a number.

sort -t _ -k1,1 -s -n mhyph_count.txt

[Figure: Linux sort command, sorted file]
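The effect of -s is easiest to see on a tiny example.  The four lines below are made up for illustration; note that cat stays ahead of ant because both have a count of 1 and the sort is stable.

```shell
# Hypothetical unsorted sample in count_word_syllables format.
counted='3_syllable_syl=la=ble
1_cat_cat
2_hotdog_hot=dog
1_ant_ant'

sorted=$(printf '%s\n' "$counted" | sort -t _ -k1,1 -s -n)
printf '%s\n' "$sorted"
# 1_cat_cat
# 1_ant_ant
# 2_hotdog_hot=dog
# 3_syllable_syl=la=ble
```

Without -s, sort would fall back to comparing the whole line for equal keys, and ant would move ahead of cat.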

The following command will filter the previous results for words with 2 or fewer syllables, as this is all I'm after for my application.  Using awk, we set the field separator to an underscore character and print the second field if the first field is two or less.

awk -F _ '$1 <= 2 {print $2}' mhyph_count_s.txt

[Figure: Linux awk command, words with 2 or fewer syllables]
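On a small made-up sample the filter behaves like this.  Because the first field looks like a number, awk compares it numerically.

```shell
# Hypothetical sorted sample in count_word_syllables format.
sorted='1_cat_cat
1_ant_ant
2_hotdog_hot=dog
3_syllable_syl=la=ble'

# Keep only the word (field 2) where the syllable count (field 1) is 2 or less.
short=$(printf '%s\n' "$sorted" | awk -F _ '$1 <= 2 { print $2 }')
printf '%s\n' "$short"
# cat
# ant
# hotdog
```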

To increase the usefulness of this file and to aid creativity it would be better to randomise the file.  That way similar words will be split up and adjacent words will be dissimilar.  This can be done simply with the sort command and the -R option to randomise the sort order.

sort -R 1n2.txt

[Figure: Linux random sort command, randomised word list]
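The -R option isn't available in every version of sort, so as an alternative sketch (not what I used above) the same shuffle can be done by prefixing each line with a random number, sorting on it, and cutting it off again.  The word list here is made up for illustration.

```shell
# Hypothetical word list to shuffle.
words='cat
ant
hotdog
maker'

shuffled=$(printf '%s\n' "$words" |
    awk 'BEGIN { srand() } { print rand() "\t" $0 }' |   # random key, tab, word
    sort -n |                                            # order by the random key
    cut -f 2-)                                           # drop the key again
printf '%s\n' "$shuffled"
```

srand() with no argument seeds from the clock, so two runs within the same second will usually produce the same order.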

There you go, a randomised list of words with 1 or 2 syllables.  Easy as that.  I don't know if this will help me in my search for a domain name, or whether inspiration will just hit me and I'll come up with something.  I'm not even sure that I'll change it yet.  It's a bit of a hassle.  Even if this doesn't help, I can still see uses for this list in other places, maybe coming up with pass phrases or even generating a random haiku.
