I’ve finally managed to polish and publish my pet project for removing duplicates from my home directory. It’s here: https://github.com/dopiera/dupa. It has actually proven useful to me multiple times, so I thought I’d share it more broadly.
I bet I am not the only person who has repeatedly downloaded photos from a camera or phone without removing them from the device, and so ended up downloading all of them again every time I only wanted the newest ones. I also bet I’m not the only person in the world who has copied data between computers and ended up with two mostly identical data sets. This tool helped me get out of those situations.
It works by computing hashes of file contents and then applying heuristics to find similar directories, or directories whose contents are mostly duplicates of files scattered elsewhere (think of a big dump of photos, most of which already live in other directories sorted by your trips).
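To give a feel for the core idea, here is a rough Python sketch, not dupa’s actual implementation (which is C++ and uses its own heuristics, described in the man page): hash every file, record which hashes each directory contains, and compare directories by the overlap of their hash sets. The Jaccard similarity below is just an illustrative stand-in for the real scoring.

```python
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks to avoid loading it whole."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root):
    """Map each directory to the set of content hashes of its files."""
    dir_hashes = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                dir_hashes[dirpath].add(file_hash(os.path.join(dirpath, name)))
            except OSError:
                pass  # unreadable file; skip it
    return dir_hashes

def similarity(a, b):
    """Jaccard similarity of two hash sets: 1.0 means identical contents."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With an index like this you can flag pairs of directories whose similarity is close to 1.0, or directories whose hashes are almost entirely contained in the union of everything else, which is exactly the "big dump of photos already sorted elsewhere" case.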
Help yourselves. It even has a man page, so you can read there about how to use it and how it works.