
Sunday, October 30, 2016

Create Compressed Encrypted Backups Only When Files Change

Most of the files that I back up aren't really that important, but some of them contain personal information that I'd like to keep private.  My usual strategy is to sync things to Google Drive, but I do so on the assumption that one day a data breach will make everything visible.  So I needed a way to encrypt some items before backing them up.  Writing a PowerShell script seemed the best way to accomplish this.

In my last post I described a method for generating hashes for files and directories.  My intent was to be able to tell when they have changed and need their backups replaced.  Using this as a starting point, I created a script that compresses and encrypts items ready to sync to Google Drive.  It's not too complicated, but it needs some explanation.

A naming strategy for the backups was needed, and the following format seemed to fit best.

YYYYMMDD_XXXXXXXX_<Original Item Name>.tar.gpg

YYYYMMDD - the date the backup was created
XXXXXXXX - the last 8 hex characters of the hash of the backed-up item (8 is enough; I didn't want to make the filenames too long)
<Original Item Name> - the original name of the file or directory
.tar.gpg - denotes that the file is an encrypted archive

For example, something like test.txt may become 20161029_A4F88BC1_test.txt.tar.gpg
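
In PowerShell, a name in that format can be assembled along these lines ($itemHash and $item here are hypothetical stand-ins for the hash string and the item being backed up):

    # Last 8 hex characters of the hash, prefixed with today's date
    $shortHash  = $itemHash.Substring($itemHash.Length - 8)
    $backupName = '{0}_{1}_{2}.tar.gpg' -f (Get-Date -Format 'yyyyMMdd'), $shortHash, $item.Name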

The backup process is as follows.  Each file or directory in the specified input directory is processed by first calculating its hash.  The output directory is then searched to see if a backup of the data already exists.  A valid match is found if the fingerprint and original name in the filename match the data being considered for backup.  If so, the process doesn't need to continue, as there is already a backup.  Replacing an encrypted backup for no reason would only mean it has to be re-uploaded.

If the data has changed since the last backup, or no backup exists, a new one is created.  The file or directory is added to a tar archive, compressed, and encrypted.  This is all done in a temporary folder created inside the system temp directory.  If the encryption process succeeds, the old backups are removed and replaced by the new one.
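
Sketched in PowerShell, the per-item decision might look something like this (a rough outline only; Get-ItemHashString stands in for the hashing function from the previous post, and the variable names are illustrative rather than the script's actual code):

    foreach ($item in Get-ChildItem -LiteralPath $InputDirectory) {
        $hash      = Get-ItemHashString $item            # fingerprint from the previous post
        $shortHash = $hash.Substring($hash.Length - 8)

        # A backup is still valid if one exists with the same short hash and
        # original item name, whatever its date prefix.
        $pattern  = '*_{0}_{1}.tar.gpg' -f $shortHash, $item.Name
        $existing = Get-ChildItem -Path $OutputDirectory -Filter $pattern
        if ($existing) { continue }                      # unchanged; nothing to re-upload

        # Otherwise: tar and compress the item in a temp folder under $env:TEMP,
        # encrypt it with gpg, and only replace the old backups if that succeeded.
    }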

Back up output
Command line output of script

Backed up Files
Encrypted files in Windows Explorer
I was determined to make sure that the script supports Unicode file names, but unfortunately gpg can't handle files with Unicode characters in the name.  To get around this, the file is redirected into and out of the command so that gpg only deals with the data.  This causes a problem though: if the encryption step fails, the output file is still created but 0 bytes are redirected into it.  To make sure this isn't an issue, the script checks whether the gpg command completed successfully before replacing the backup.

gpg encryption command
How to encrypt files with Unicode filenames
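
As a rough PowerShell sketch of that approach (the paths, key name, and exact gpg options here are placeholders, not the script's actual command), the redirection and the exit-code check might look something like this:

    # Illustrative only: gpg reads from stdin and writes to stdout, so it never
    # has to open the (possibly Unicode-named) files itself.
    $proc = Start-Process -FilePath 'gpg' `
        -ArgumentList '--batch', '--yes', '--encrypt', '--recipient', 'BackupKey' `
        -RedirectStandardInput $tarFile `
        -RedirectStandardOutput $encryptedFile `
        -NoNewWindow -Wait -PassThru

    if ($proc.ExitCode -eq 0) {
        # Encryption succeeded; safe to remove the old backups and move this one in.
    } else {
        # gpg failed, so the redirected output file exists but is empty; discard it.
        Remove-Item -LiteralPath $encryptedFile -ErrorAction SilentlyContinue
    }
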
I really like encrypting backups with public key cryptography.  There are no passwords to accidentally leave in scripts, which could lead to security problems.

Get The Code!

Saturday, October 15, 2016

Generating Directory Hashes

In my ongoing efforts to back up my files, I have a directory structure that I want to archive, but only if it has changed.  The reason is that the backups are compressed, and every time they change the whole archive needs to be re-uploaded.  Unfortunately, the internet in Australia makes this a daunting task.  So I need a way to know if a directory has changed.  There are a couple of ways I could do this: I could use time stamps to see if files or directories have been modified, or I could look at file contents.  I decided to look at the file contents and basically create a "hash" for files and directories, which allows me to compare values over time.

Most people reading this will be familiar with taking the MD5 hash of a file and what that means.  It gives you a fingerprint of the file contents, and if any of the contents change, the hash value changes dramatically.  That's great, but it only applies to files and it completely ignores the file name.  To me, a directory structure has changed if even a file name has changed.  I explored using an archiving format like tar to bundle up the directory structure with the file contents and then taking a hash, but there's no guarantee that one implementation of tar will give exactly the same results as another, i.e. the output isn't deterministic.  That would give different hash values and is useless.

To overcome these problems I came up with something that I think is reasonably simple that only takes into account changes in directory structure, file and directory names, and file contents when determining if something has changed.

  • A directory structure can have files and sub-directories.
  • A file hash is equal to MD5(MD5(file contents) XOR MD5(UTF-8 byte array of name))
  • A directory hash is equal to MD5((directory content hash) XOR MD5(UTF-8 byte array of name))
  • The content hash of a directory is equal to the XOR of the hashes of all the files and directories one level below it.

Let me just state now that I know MD5 isn't secure.  This isn't a security thing; I just need a fast way to get a file checksum.

With these basic rules I wrote a PowerShell script that takes the hash of files and directories to create a fingerprint that can be compared to future versions.  In the test below I created a directory structure with some test files to play around with.  Some of the directories are junction points and symlinks, and some of the files are hardlinks and symlinks.  The script can be configured to ignore junction points and symlinks, but not hardlinks, as these are indistinguishable from other files.  I also threw in some Unicode file and directory names just to make sure everything works as expected.

In the image below, each item in the directory structure has its own box with 4 different hexadecimal strings.  The red string is the content hash.  If it's a file, that's just the normal MD5 hash of the file; if it's a directory, it's the combined XOR of the hashes of all the files and directories one level below it.  The green string is the MD5 hash of the UTF-8 byte array of the name of the file or directory.  The blue string is the XOR of the content hash and the name hash.  The black string with green highlighting is the MD5 hash of the XOR result.  (I'll get back to why this is done later.)

directory structure
Calculating a directory hash
The implementation isn't too hard.  First, create a function that calculates our version of a file hash.  Then create a function that calculates a directory hash by hashing every item it contains and combining the results as described above.  Recursion then takes care of exploring the directory tree.
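
A compact sketch of those two pieces (hypothetical function names, no symlink or junction handling, and no error handling), just to show the shape of the recursion:

    # Illustrative only - not the actual script.
    function Get-Md5Bytes([byte[]]$bytes) {
        $md5 = [System.Security.Cryptography.MD5]::Create()
        return $md5.ComputeHash($bytes)
    }

    function Merge-XorBytes([byte[]]$a, [byte[]]$b) {
        $out = New-Object byte[] $a.Length
        for ($i = 0; $i -lt $a.Length; $i++) { $out[$i] = $a[$i] -bxor $b[$i] }
        return $out
    }

    function Get-ItemFingerprint($item) {
        # Name hash: MD5 of the UTF-8 bytes of the file or directory name
        $nameHash = Get-Md5Bytes ([System.Text.Encoding]::UTF8.GetBytes($item.Name))

        if ($item.PSIsContainer) {
            # Directory content hash: XOR of the fingerprints one level below
            $contentHash = New-Object byte[] 16
            foreach ($child in Get-ChildItem -LiteralPath $item.FullName) {
                $contentHash = Merge-XorBytes $contentHash (Get-ItemFingerprint $child)
            }
        } else {
            # File content hash: plain MD5 of the file contents
            $stream = [System.IO.File]::OpenRead($item.FullName)
            $contentHash = [System.Security.Cryptography.MD5]::Create().ComputeHash($stream)
            $stream.Close()
        }

        # Final MD5 "scrambles" content XOR name before it propagates upward
        return Get-Md5Bytes (Merge-XorBytes $contentHash $nameHash)
    }

    # e.g. -join ((Get-ItemFingerprint (Get-Item 'C:\Test')) | ForEach-Object { $_.ToString('X2') })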

It may seem excessive to hash the XOR result, but in the scenario below I'll show how you can get the same hash for two different directory contents if you don't.

File Hashes
No final hash can lead to different directories with equal hashes
You can see in the image above that if you swap the contents of two files and don't perform a final hash, you can end up with two directories whose content hashes are equal, and if those directories happen to have the same name, they will have the same hash value.  Hashing the XOR result prevents this, as can be seen below.

File hashes
Adding a final hash leads to directories with different hashes
You can now differentiate between the directories as they have different hashes.  The final MD5 operation basically "scrambles" the information of the content and name hash before it can propagate to the level above.  Without it, because of the associative and commutative properties of the XOR function, you can end up with equal XOR results.
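
To spell the algebra out, write c1 and c2 for the two file contents and n1 and n2 for the file names.  Without the final MD5, each file contributes MD5(contents) XOR MD5(name), so the directory content hash is

(MD5(c1) XOR MD5(n1)) XOR (MD5(c2) XOR MD5(n2))

Because XOR is associative and commutative, this is exactly the same value as

(MD5(c2) XOR MD5(n1)) XOR (MD5(c1) XOR MD5(n2))

which is what the directory produces after the two contents are swapped.  Hashing each file's XOR result before it propagates upward destroys this symmetry.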

Get the code!
As with a lot of my projects, this one is a little rough.  I'd love for someone to take the ball and run with it and create a more professional version.  I think I've provided enough information to get people started.

Tuesday, October 4, 2016

Back up Git repositories to Google Drive

I'm trying to come up with a decent backup strategy and I'm almost there.  Figuring out a way to back up git repositories was a little confusing though.  I use GitHub to host repositories that I'm working on locally, and that's an OKish backup, but I don't check every file into Git.  For example, if I'm working on an electronics design I don't really want the manual for the micro-controller to be tracked by version control, but I do want a backup of the manual just in case they change it for some reason.  So for files like this I keep them with all the others and add them to the .gitignore file.  This is great, but they're not backed up anywhere.

Normally I use Google Drive for my backups.  There are other services that are probably better and have desktop syncing apps that are more polished, but I can easily access files from any device and I trust Google not to go broke in 6 months.  So a simple solution to my problem might be to store local repositories in the Google Drive directory.  That may work, but I just don't trust Git and the Google Drive app to get along together.  So what I ended up doing was just copying a backup of the repository to Google Drive.

This works, but if you copy files to the Google Drive directory and overwrite the old versions, Drive wants to re-upload everything even if the files are unchanged.  You could do a copy where only newer files are overwritten, but then another problem arises: files that are deleted from the repository remain in the backup, taking up space.  That might be ideal in some situations, but I basically want this to be a mirror of the current state of the local repository folder.  In reality what I want is a one-way sync to the backup location.  Luckily the robocopy command can manage this.

cmd /k robocopy "Repositories To Backup" "Backup Location" /e /purge

By placing the above command in a batch file, anything new in the "Repositories To Backup" directory will be copied to the "Backup Location".  Don't worry about the cmd /k part, it just lets the command window stay open after it runs robocopy.  By default robocopy copies a file if it's changed in any way.  If unchanged, it will just skip the file.  This will prevent Drive from wanting to upload the file again.  The /e option means it will also copy empty subdirectories and the /purge option means that it will delete files from the backup location that don't appear in the source directory.  This keeps the backup location synced to the source location.

I keep all my Git repositories in a Projects folder, so I just set "Repositories To Backup" to this folder; when I run the batch file it backs up all the repositories at once.  I run the batch script manually, but you could schedule it to run automatically too.  I know it's not the best solution, but it works for me.

Friday, May 29, 2015

Simple Data Backup with Paper Based QR Codes

Let's say you have some important data you want to protect.  How do you do it?  The obvious answer is encryption, which leaves you with the smaller but more manageable problem of protecting the key.  This is really important though; if you lose the key, the data becomes useless.  So it's not uncommon to back it up.  How you safeguard the key and where you store it aren't the subject of this post.  What I want to talk about is a method that ensures the longevity of the data and the medium you store it on, and that's also dead easy to recover.  (It looks hard, but it really isn't.)


The first thing to consider when thinking about backups is the medium.  If you archived data on a 5.25 inch floppy 20 years ago, you might have a hard time recovering that today.  First you have to find a disk drive to read the information, then you have to hope that the information stored on the magnetic media hasn't degraded, then you need to be able to read the format of the recovered file.  This doesn't just apply to magnetic media.  To quote the National Archive of Australia about the preservation of physical media:

Recordable CDs and DVDs, USB keys and various forms of flash memory have doubtful long-term reliability and are subject to format and software obsolescence.

So what do you do?  The least worst solution is to store the data on paper.  Print it out and put it somewhere safe.  If you want a bit more safety, print out multiple copies and put them in different locations; it's up to you.  If you want to get all tin foil hat, you could use Shamir's Secret Sharing algorithm to split the data into n pieces that only require k parts to reassemble.  For example, split the key into 6 parts where any 4 are enough to reassemble it, then store each portion in a different location.  I'll leave that for another time.  The point is, paper has proven that it can stand the test of time if stored with even the slightest bit of care.  You also don't need specialised equipment to read it (although it helps).

Once you decide to store the data on paper, the question becomes how you plan to do it.  If the file is binary data you can't just print it, as there'll be non-printable characters, and unless you choose the right font it can be hard to tell the difference between similar characters, e.g. |, l, and 1.  You could print a hex dump of the file, but if you needed to recover the file, re-entering that data could be a very long process.  The easiest way is to use barcodes, QR codes to be exact.  The ubiquity of QR codes leads me to believe that a major catastrophe would have to befall humanity before we forget how to read them.  Even if it has to be done by hand, I think they're a stable format.

The process of going from a file to printed QR codes and back again is surprisingly simple when you use the right tools.  There are solutions like PaperBack that accomplish a similar goal, but it seems to use its own barcode format rather than a standard like QR codes, which brings long-term reliability into question.  The method I propose is listed below and uses software whose functions can be performed manually or easily reproduced with other software.

I decided to test this out using a live USB of the TAILS operating system.  The file I've backed up is an example Keepass database I created.  Start by installing the required tools.


    sudo apt-get update
    sudo apt-get install zbar-tools imagemagick qrencode



QRencode is used to create QR codes from terminal input, and that's what we'll be using it for.
Zbar-tools is a flexible, easy-to-use barcode reader that can decode barcodes from an image or webcam.  We're going to use it to scan the data back into the computer.
ImageMagick is like the Swiss army knife of Linux image editing.  This will be used to combine 6 barcodes onto one page ready for printing.

 

Create the Barcodes


Next the input file will be encoded in base 64 format.  This probably isn't needed as QR codes are capable of encoding 8-bit binary data.  I just do it to be safe.  What I did actually wastes space, so do what works best for you.


    base64 keyfile.kdb > keyfile.64


The file is then split into a series of smaller files that can be converted to QR codes.  You can only fit so much data into a QR code, a couple of thousand bytes depending on your encoding and level of error correction.  Once again, use your judgement.


    split -n 6 keyfile.64 Passwords_kdb_64


Encode each portion of the split file as a QR code.  The -l H option gives the maximum amount of error correction in case the barcode is damaged.  I've processed all the files using a command line loop, which is generally something to avoid.


    for file in ./Passwords_kdb_64*; do qrencode -l H -o $file.png < $file; done


We'll then combine 6 QR codes into one image containing 3 rows of 2 codes, with the filename under each code.  If you have more than 6 barcodes, don't worry; ImageMagick will create as many output images as it needs.


    montage -label '%f' *.png -geometry '1x1<' -tile 2x3 Passwords_kdb_64.png


QR code Backup
Resulting QR codes storing a password database

Recover the Original Data


Scan each of the QR codes in order using zbarcam and redirect the output to a file.  Each code is on a new line with a header identifying the type of code scanned.  The new lines and headers need to be removed.  This was done manually.


    zbarcam > keyfile.64


The last step is to convert the base 64 encoded file back to the original binary file.


    base64 -d keyfile.64 > keyfile.kdb


There you have it, file to QR code and back again.  What I like about this method is that even if all the software used to create the final output image disappears, the encoded data can still be recovered as long as you can decode a QR code and convert a file from base64 back to binary.  Both of these processes are widely known.

You can find all the associated files below.
https://gist.github.com/GrantTrebbin/0c6aadc7ecebe3107d08
https://drive.google.com/folderview?id=0B5Hb04O3hlQSfmwzVFdCTS1YZm8xSVVLZm95by0zLVpaTHR2WE1XcTVicWE5NUFJZjg4cGs&usp=sharing


Wednesday, April 9, 2014

A Platform Agnostic Way To Collect Photos of an Event

My sister's wedding was on the weekend and I wanted a way to collect photos from guests.  There are a lot of ways to do this, but I wanted a way for anyone to contribute photos, no matter what platform they're on or what social network they're a part of.

I had initially settled on using Flickr because people can log on using a Facebook or Google account, and pretty much everyone belongs to one of those services.  The less people had to do, the better; if they had to sign up for something they'd run for the hills.  This didn't pan out though.  Once you've logged in via Google or Facebook you still need to set up a Flickr account.  So I needed to find another idea.

Then I came across a service called dbinbox.com.  It allows people to upload files to your Dropbox account without signing up for a service.  Setup is easy: you select a user name, decide if you want an access passphrase, link it with your Dropbox account, and you're ready to go.  People can then upload files and they'll appear in the /Apps/dbinbox folder in your Dropbox account.

Sign up page
Setup dbinbox

The interface to upload files is easy to use.  You can either upload a file or send a message that will appear as a text file.  The page dynamically adjusts and changes depending on screen size.  If you're on a computer, files can be dragged and dropped onto the page, but it's just as easy to use on a mobile device by selecting the files manually.

Upload page
dbinbox interface

For example, in the image below I uploaded a file and sent a message.  As an improvement I'd like to see a way for people to enter their name and have it associated with the files they upload.  Maybe create a subdirectory with the person's name and put the files in that folder.  That way I could know who sent what.

Another suggestion concerns the interface.  Given the layout, how things work is pretty obvious to me, but some less technically oriented people think they need to press the send button at the bottom after they have selected their files.  I know this because I keep getting empty text files when people upload images.  Maybe this could be made clearer to users.

Edit: Christian has updated the site.  The "Send" button is now the "Send Text" button.  This should clear up any confusion.

Upload page
Using dbinbox

Because I have Dropbox running on my computer, the files are automatically downloaded.  Not entirely necessary, but I don't have a lot of storage on Dropbox, so I just wanted to get the files out of there quickly.  I know it's gilding the lily a bit, but I also installed Folder Monitor to play an audible alert each time a file arrives.

Downloaded Files
Files in my dropbox folder

It's a simple process and it allows me to collect files, supplementing the photos from the photographer.  So far I have about 400 images.  If I were better organised I would have used a URL shortener, encoded the link in a QR code, and put it on the back of the place cards.  People could have been uploading images while the event was fresh in their memory.  Overall, dbinbox is a simple and easy to use service that I'd highly recommend.

My sister and I

Tuesday, May 21, 2013

How a Digital Hoarder Does Backups

After years of telling people to back up their important files, I've finally decided to take my own advice and do backups properly by keeping off-site copies.  For quite some time now I've kept duplicate backups of my file archives at home.  The files don't contain the most important data in the world, but they're important to me, and the ever-increasing density of storage means that they take practically no space.  The files are mostly old assignments and projects I've worked on since about 1997, and I'd like to keep them just in case.  Yeah, I know, I'm a digital hoarder.

While I'm adding off-site back-ups to my storage process, I've also taken the opportunity to add an extra layer to protect against file corruption.  I've created a file that contains an MD5 checksum of every file on the drive.  Although MD5 isn't cryptographically secure, it's enough to detect a corrupted file while being considerably faster than SHA1 to generate and check.  Generating the MD5 checksum file is easy.  I just navigated to the root directory of the external drive in Linux and ran the following command.

find ./ -type f -exec md5sum {} + > Checksums.MD5

The generated file Checksums.MD5 contains the checksum of every file on the drive and can be later used to check the integrity of each file with the next command.

md5sum -c --quiet Checksums.MD5

The checksum files themselves will fail validation, as their checksums are recorded before the files are complete, but every other file on the drive should quietly pass validation.

So how does this help to maintain file integrity?  Every couple of months both drives need to be checked to make sure that they aren't corrupted.  If corruption is detected, a new backup needs to be made from the working backup to replace the failing drive.  It's unlikely that both drives will fail at the same time (though not impossible).

I also intend to perform another check at this time.  As the backup drives are kept off-site, the important files are encrypted with 7zip.  I felt that it's a stable and secure program that my family would be able to use if the need arose.  However it's important to guard against format rot.  If it turns out that 7zip is no longer maintained and it falls into disrepair or obscurity I have a chance to re-encrypt my files using a different program.

Now I know what most of you are saying: use the cloud.  Well, if you can tell me of an affordable online backup service with 600GB of encrypted storage, go for it.  Not to mention how long that would take to upload.  I also know that my system isn't perfect, but it's a lot better than what I used to use.  Nothing is foolproof; it's all about minimizing risk.  By keeping a checksum of my files I've minimized the risk of them being corrupted, and by keeping off-site backups I've protected myself against localized damage, i.e. house fire or theft.  Both locations are in areas with a low risk of flooding and are about 5 km apart, so if there's a disaster that destroys both copies, it's likely that I won't be around either.