Saturday, November 5, 2011

Finding Duplicate Files With Linux

Recently I needed to clean up an external USB drive and remove duplicate files. Since I run Windows, things like this used to be a pain in the arse. That changed about two years ago when I started using a virtual Ubuntu machine on my laptop. It gives you the best of both worlds: things that are annoying on one OS are a breeze on the other. This was one of those times I knew Linux would be perfect.

So after a bit of searching I found the following command, posted by syssyphus over at www.commandlinefu.com.

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Let's break it down and take a closer look at it. It's made up of a number of commands strung together with pipes, the first being a find command that lists the size of every non-empty file in the current directory and its subdirectories. The -printf "%s\n" prints the file size followed by a newline. Next, sort -rn orders that output numerically in reverse order, so if there are duplicate files their sizes now sit beside each other in the output. Running uniq -d then throws away every size that appears only once, leaving a list of sizes that occur more than once, because a file with a unique size can't possibly be a duplicate.
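
For instance, running just those first three stages on their own gives nothing more than a list of suspect sizes, one per line:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d

which might print something like this (the numbers are purely illustrative):

4194304
20480
1337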

That's the easy part: you're now left with a deduplicated list of file sizes that could belong to duplicates. The next command, xargs, goes through this list one entry at a time and uses find to locate all files of that size. The -print0 option separates the files it finds with a null character instead of a newline, which helps if there are unusual characters in a file name. xargs then runs the files through md5sum to produce a hash value followed by the file name; the -0 option tells xargs that the incoming list is null terminated. The list is then sorted and sent to uniq. As the hash takes up exactly the first 32 characters of each line, the -w32 option makes uniq compare lines on the hash alone, and --all-repeated=separate removes all unique hashes and groups the duplicates together, separated by blank lines.
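
The final output is one group per set of candidate duplicates: each line is the 32-character MD5 hash, two spaces, then the file name, and the groups are separated by blank lines. The hashes and names below are made up, but the layout is what to expect:

3f2a9c0d8e1b47a6c5d4e3f2a1b0c9d8  ./photos/holiday/img_001.jpg
3f2a9c0d8e1b47a6c5d4e3f2a1b0c9d8  ./backup/img_001.jpg

7b1e4d2c9a0f8e6b5c4d3a2f1e0d9c8b  ./music/song.mp3
7b1e4d2c9a0f8e6b5c4d3a2f1e0d9c8b  ./old/song.mp3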

A pretty elegant solution that only fails if two different files of the same size happen to produce the same MD5 hash. I think there may also be a slight problem in the final sort and uniq if there are newline characters in a file name, but all of these problems are pretty uncommon, and since the command doesn't delete anything it doesn't really matter.

One problem I ran into, though, was that about 2000 different file sizes were listed after the first find command. That means xargs has to run the second find 2000 times, and it was taking a while. So instead of re-finding the files, I altered the command to keep the file names from the first find.

find -not -empty -type f -printf "%s %p\n" | awk '{if (x[$1]) { xc[$1]++; print $0; if (xc[$1] == 1) {print x[$1]}} x[$1] = $0}' | sed 's/^[0-9 ]*//' | xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate

The first find command is basically the same as before, except it prints the file name after the size. awk is then used to remove entries with unique file sizes. The if (x[$1]) checks whether anything is in the associative array, using the first field $1 (the file size) as the index. If nothing is in the array at that index, the entire line is stored there with x[$1] = $0 and nothing is printed.

If something is already in the array at $1, we have a duplicate file size. A counter array is incremented and the current line is printed. If this is the first duplicate found for that size, the original line that was stored for comparison is also printed, since it too is a duplicate; that's the if (xc[$1] == 1) {print x[$1]} section of the command. Note that the incoming list of file sizes and names doesn't need to be ordered for this to work. What's left is a list of files that share their size with at least one other file, i.e. the unique file sizes have been removed.
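
Spread over a few lines with comments, the awk program is a little easier to follow (it's functionally identical to the one-liner above):

awk '
{
    if (x[$1]) {              # a file of this size has been seen before
        xc[$1]++              # count the duplicates for this size
        print $0              # print the current line (size and name)
        if (xc[$1] == 1) {    # first duplicate for this size, so also
            print x[$1]       # print the line that was stored earlier
        }
    }
    x[$1] = $0                # remember the latest line for this size
}'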

sed is now used to strip off the leading digits and spaces, i.e. the file size, leaving just the file name. Once again xargs sends the file names through md5sum, but this time a newline character is used as the delimiter. As before, the output is sorted and the unique hashes removed.
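
To see what the sed step does to a single line (the size and file name here are made up):

echo '20480 ./photos/img_001.jpg' | sed 's/^[0-9 ]*//'

prints

./photos/img_001.jpg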

Doing it this way means the find command is run only once, which gave me a fair bit of a speed up. However, because awk is a line-based tool, any file name containing a newline character will cause this version to fall over. It was more or less just an intellectual exercise to see if it could be done.

The outputs of these commands were then redirected to a file for checking. The easiest way I found was to go through the file and place the letter x at the start of any line containing a file I wanted deleted, then run the following commands.

grep '^x' infile.txt | cut -c 36- > outfile.txt
xargs -d '\n' rm < outfile.txt

The first command keeps only the lines that start with an x, then cuts each line so that only the characters from position 36 onwards remain. The first 35 characters are the x, the 32-character hash, and the two spaces md5sum puts before the file name, so all that's left is the file name. The second command redirects the file into xargs, which uses a newline character as the delimiter and passes each file name on to rm to be deleted.
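
Taking one of the made-up lines from earlier, a marked entry and the result of the cut look like this:

echo 'x3f2a9c0d8e1b47a6c5d4e3f2a1b0c9d8  ./backup/img_001.jpg' | cut -c 36-

gives

./backup/img_001.jpg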

There are programs like fdupes that will do this for you, but I was after a little more control. Doing it this way lets me handle a little bit of the processing at a time, because at pretty much any point the command can be broken up and its output redirected to a file for later processing, as shown below.
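
For example, the original one-liner could be split in two, saving the list of candidate sizes in between (sizes.txt and dupes.txt are just names picked for illustration):

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d > sizes.txt
xargs -I{} -n1 find -type f -size {}c -print0 < sizes.txt | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate > dupes.txt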

It must be said though that if you intend to do anything like this, TEST, TEST, and TEST. Oh, did I mention TEST? One of the commands on your system may behave slightly differently to mine and you could end up losing a lot of data.
