Tuesday, May 21, 2013

How a Digital Hoarder Does Backups

After years of telling people to back up their important files, I've finally decided to take my own advice and do backups properly by keeping off-site copies.  For quite some time now I've kept duplicate backups of my file archives at home.  The files don't contain the most important data in the world, but they're important to me, and the ever-increasing density of storage means they take up practically no space.  They're mostly old assignments and projects I've worked on since about 1997, and I'd like to keep them just in case.  Yeah, I know, I'm a digital hoarder.

While I'm adding off-site backups to my storage process, I've also taken the opportunity to add an extra layer of protection against file corruption.  I've created a file that contains an MD5 checksum of every file on the drive.  Although MD5 isn't cryptographically secure, it's enough to detect a corrupted file while being considerably faster than SHA-1 to generate and check.  Generating the MD5 checksum file is easy: I just navigated to the root directory of the external drive in Linux and ran the following command.

find ./ -type f -exec md5sum {} + > Checksums.MD5

The generated file Checksums.MD5 contains the checksum of every file on the drive and can later be used to check the integrity of each file with the next command.

md5sum -c --quiet Checksums.MD5

The checksum file itself will fail validation on each drive, since its checksum is recorded before the file has finished being written, but every other file on the drive should quietly pass validation.
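If that small wart bothers you, one possible variation (not what I've done above, just an option) is to exclude the checksum file itself when generating the list:

find ./ -type f ! -name 'Checksums.MD5' -exec md5sum {} + > Checksums.MD5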

So how does this help maintain file integrity?  Every couple of months both drives need to be checked to make sure that they aren't corrupted.  If corruption is detected, a new backup needs to be made from the working backup to replace the failing drive.  It's unlikely (though not impossible) that both drives will fail at the same time.
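For example, assuming the two drives show up at /media/backup1 and /media/backup2 (those mount points are just placeholders for illustration), the periodic check could look something like this:

for drive in /media/backup1 /media/backup2; do
    echo "Checking $drive"
    # --quiet prints only files that fail, so no output means all is well
    (cd "$drive" && md5sum -c --quiet Checksums.MD5)
done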

I also intend to perform another check at this time.  As the backup drives are kept off-site, the important files are encrypted with 7zip.  I felt that it's a stable and secure program that my family would be able to use if the need arose.  However, it's important to guard against format rot.  If it turns out that 7zip is no longer maintained and falls into disrepair or obscurity, I'll have a chance to re-encrypt my files using a different program.
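For reference, creating a password-protected 7zip archive with encrypted file names looks roughly like this (the archive and folder names here are only placeholders):

7z a -p -mhe=on Important.7z Important/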

Now I know what most of you are saying: use the cloud.  Well, if you can tell me about an affordable online backup service with 600 GB of encrypted storage, go for it; not to mention how long that would take to upload.  I also know that my system isn't perfect, but it's a lot better than what I used to use.  Nothing is foolproof; it's all about minimizing risk.  By keeping a checksum of my files I've minimized the risk of corruption going unnoticed, and by keeping off-site backups I've protected myself against localized damage, e.g. a house fire or theft.  Both locations are in areas with a low risk of flooding and are about 5 km apart, so if a disaster destroys both copies, it's likely I won't be around either.

Friday, May 10, 2013

Cutting Through the Noise on Twitter

So you've been on Twitter for a while now.  Over time you've steadily followed more and more people, and you've reached a point where you're starting to get overwhelmed by the endless barrage of tweets.  What do you do?  It's time for a clean-up.

Slowly the people you follow change, and your own interests change, so once in a while it doesn't hurt to go through and re-evaluate whether or not you still want to follow someone.  Keeping your timeline manageable and full of relevant content will make your Twitter experience more enjoyable and valuable.  Clearing out the clutter also leaves room to follow new people whose interests are more aligned with your own.

One of the more subtle ways Twitter can become unmanageable is by following people who tweet a lot of things you're not interested in.  We're all guilty of it; sometimes our posts are irrelevant and not particularly interesting to others, but that's OK.  That's what I like about Twitter: you get to see the real person and get a better understanding of their personality.  It does, however, become a problem when people start tweeting 50 or so times a day and fill your timeline with things you don't care about.  I consider 10 tweets a day to be okay, but depending on the quality of the content I don't mind if people post more.

The following is a quick and easy way to find out if someone is cluttering up your timeline.  Using the Linux command line, or Cygwin if Windows is your thing, you can find out who the culprits are by counting how many tweets each person makes.

Open up Twitter and scroll all the way to the bottom until you can't load any more tweets.  Select all the text by pressing Control-A.  You should see something like the image below.  Then press Control-C to copy all the text.  Paste the text into a text editor and save it as a file.

[Image: Twitter timeline, copying Twitter information]

Running the following command on the file you just created will find all the instances of Twitter users in the timeline.  This basically means any time an at symbol appears followed by one or more alphanumeric characters or underscores.  The results are then sorted by name, counted, and sorted again by frequency.

grep -oE '@[A-Za-z0-9_]+' Tweets.txt | sort | uniq -c | sort -n

[Image: Linux command line, looking for excessive tweeters]

As the data I captured covers a period of about two days, you can see the majority of people tweet 10 times or fewer a day.  I'm not too worried about most of the ones over that limit, as they're accounts I find valuable.  You can, however, see that a lot of the tweets in my timeline come from four news services, and if I can get rid of any of them it will make a big difference to the number of tweets I get.
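Since my capture spans roughly two days, anyone appearing more than about 20 times is over my 10-a-day limit.  If you only want to see those accounts, a filter along these lines will do it (the threshold of 20 is just my own cut-off):

grep -oE '@[A-Za-z0-9_]+' Tweets.txt | sort | uniq -c | sort -n | awk '$1 > 20'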

That should find all the user handles in your timeline, but you still need to find out about retweets, and that can be done with the next command.  It looks for instances of the phrase "Retweeted by" with any text after it and does the same counting and sorting as the last command did.

grep -o 'Retweeted by.*' Tweets.txt | sort | uniq -c | sort -n

[Image: Linux command line, looking for excessive retweeters]

Because retweets are listed by display name and not the user handle, the results will look a bit different.  Once again you can see that the news services are retweeting a lot, and although I find their information valuable, there's quite a bit of duplication between the news stories they report, particularly "7News Brisbane" and "Nine News Brisbane".  So I'm starting there.  I prefer 7 News, so I'm dropping 9 News.  I plan to do this slowly over time and not rush into it, cutting only a couple of accounts at a time and seeing how it goes.  There may also be problems dealing with Unicode names, but I'll leave that as an exercise for the reader to sort out.

Ultimately, what I really want is something like this: Twitter volume controls.

[Image: Mockup of Twitter volume controls]

I'd like to be able to go into the list of people I follow and individually change how much of their feed I see.  I might only be interested in their tweets and not their retweets, or vice versa.  Alternatively, I'd also like to be able to reduce the number of tweets I see.  An easy way to do this would be to randomly let through a certain percentage of tweets, but a heuristic method based on my Twitter habits would be best.  I don't, however, want Twitter to decide for me; I want full control to tweak my timeline as I see fit.

This would also benefit Twitter.  They'd be able to get more fine-grained feedback on what users think about the quality of a particular user's tweets.  It would also improve the value of my timeline to me, and I'd be more inclined to use their service.