Monday, October 19, 2015

Process an MBOX file with Python

Getting snowed under with email?  Before you go and clear it all out, have a read of this because it may prevent the same thing happening in the future.

I'm on holidays, and one of the first things on my to-do list was to clean out my inbox.  From what I've heard of other people's experiences, my 400-email inbox isn't too bad, but it was getting unmanageable.  It's been that way for a while, and I eventually got to a point where I didn't delete anything and just let it build up.  I did that for a reason: I wanted to analyse all of the email to see where it was coming from.  That way I could take appropriate action to change email settings on services, unsubscribe from things, or set up some filters.  By investing some time now, I can reduce my ongoing maintenance.

I wasn't sure how I was going to go about this.  I originally thought that I'd have to use something like a Google Apps Script to retrieve data from my email account, but as it turns out, Google finally got their act together and now have a way to download an archive of your email.  They even give you the option to select specific labels to include in the downloadable MBOX file.  This is handy, as I only wanted the emails in the inbox.

[Image: Email download process (download settings)]


From this point things were easy.  Python is able to parse the MBOX file and extract the required information from the emails.  The subject and sender email address of each message were extracted and processed with the following procedure.

1.  Split the strings at whitespace
2.  Remove everything except alphanumeric characters
3.  Convert the string to lowercase
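The three steps above can be sketched as a small helper.  This is only an illustration, not the code from the Gist: the function name normalise is made up here, and treating "alphanumeric" as ASCII letters and digits only is an assumption.

```python
import re

def normalise(text):
    """Return the cleaned, lowercased words found in a header string."""
    words = []
    # Step 1: split the string at whitespace.
    for token in text.split():
        # Step 2: remove everything except alphanumeric characters
        # (assumption: ASCII letters and digits only).
        cleaned = re.sub(r"[^A-Za-z0-9]", "", token)
        # Step 3: convert to lowercase.
        cleaned = cleaned.lower()
        # Tokens that were pure punctuation end up empty, so skip them.
        if cleaned:
            words.append(cleaned)
    return words
```

Note that a sender address collapses into a single word with this approach, because the "@" and "." are stripped rather than split on.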

All the results were combined, then sorted and counted, to give a result similar to the image below.
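Here is a rough sketch of how the whole process might look using the mailbox module and collections.Counter from the standard library.  It is not necessarily what the Gist does: the function name count_words and the file name inbox.mbox are placeholders, and the ASCII-only filter is an assumption.

```python
import mailbox
import re
from collections import Counter

def count_words(mbox_path):
    """Count cleaned words across the Subject and From headers of an MBOX file."""
    counts = Counter()
    # create=False avoids silently creating an empty file if the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            for header in ("Subject", "From"):
                # str() copes with headers that come back as Header objects.
                value = str(message.get(header, ""))
                for token in value.split():
                    # Assumption: keep ASCII letters and digits only.
                    word = re.sub(r"[^A-Za-z0-9]", "", token).lower()
                    if word:
                        counts[word] += 1
    finally:
        box.close()
    return counts

# Example usage (inbox.mbox is a hypothetical file name):
# for word, n in count_words("inbox.mbox").most_common(20):
#     print(n, word)
```

Counter.most_common does the sorting and counting in one go, which is why there is no separate sort step here.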

[Image: Email word frequency]

It's not perfect, but it gives you a quick way to identify problem areas.  Watch out for Unicode problems though.  I think I've taken care of them, but it's hard to tell.
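One likely source of Unicode trouble is that subjects containing non-ASCII text are often stored in an encoded form (for example "=?UTF-8?B?...?=").  As a hedged sketch, and not necessarily how the Gist handles it, the standard library's email.header module can turn these back into readable text.  The function name decode_subject is made up for this example.

```python
from email.header import decode_header, make_header

def decode_subject(raw):
    """Decode an encoded-word header value into a plain Unicode string."""
    # decode_header splits the value into (bytes or str, charset) pairs;
    # make_header reassembles them, and str() gives the decoded text.
    return str(make_header(decode_header(raw)))
```

Keep in mind that a filter which only keeps ASCII letters and digits would still drop accented characters after decoding, so the word counts for non-English mail may look odd either way.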

GitHub Gist
