Deduplication tool

I’m in the process of setting up a proper backup solution. However, over the years I’ve copy-pasted my home directory from different systems a few times as a quick and dirty solution. Now I have to pay off my technical debt and remove the duplicates. I’m looking for a deduplication tool that meets these requirements:

- accept a destination directory
- source locations should be deleted after the operation
- if the files’ content is the same, delete the redundant copy
- if the files’ content is different, move the file and change its name to avoid a name collision

I tried doing it in Nautilus, but it does not look at the files’ content, only the file name. E.g. if two photos have the same content but different names, it will still create a redundant copy.
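The requirements above can be sketched in a few lines of Python. This is only an illustration, not a tested tool: the names `merge_into`, `unique_target` and `file_digest` are made up for this example, SHA-256 is one possible way to compare content, and the `name (1).ext` renaming scheme is an assumption.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def unique_target(directory: Path, name: str) -> Path:
    """Pick a free name: photo.jpg, photo (1).jpg, photo (2).jpg, ..."""
    candidate = directory / name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate


def merge_into(dest: Path, sources: list[Path], dry_run: bool = True) -> list[str]:
    """Merge sources into dest; content already present in dest is dropped."""
    dest.mkdir(parents=True, exist_ok=True)
    # Index the destination by content, so file names do not matter.
    seen = {file_digest(p) for p in dest.rglob("*")
            if p.is_file() and not p.is_symlink()}
    log = []
    for source in sources:
        # sorted() materializes the listing before anything is moved.
        for path in sorted(source.rglob("*")):
            if path.is_symlink() or not path.is_file():
                continue
            digest = file_digest(path)
            if digest in seen:
                log.append(f"delete duplicate {path}")
                if not dry_run:
                    path.unlink()
                continue
            target_dir = dest / path.relative_to(source).parent
            target = unique_target(target_dir, path.name)
            log.append(f"move {path} -> {target}")
            if not dry_run:
                target_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(target))
            seen.add(digest)
    return log
```

With `dry_run=True` (the default) nothing is touched and only the planned actions are returned, so the log can be reviewed first. The sketch leaves empty source directories behind; removing them would be a separate step.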

Edit: Some comments suggested using duperemove on btrfs. It replaces identical file content with references to the same location on disk. This is not what I want; I intend to remove the redundant files completely.

Edit 2: Another quite cool solution is to use hard links. It replaces all occurrences of the same data with hard links. Then the redundant directories can be traversed and whatever is a hard link (link count above one) can be deleted. The remaining files will be unique. I’m not going for this myself as I don’t trust myself to write a bug-free implementation.

deadbeef79000, 4 days ago

I have exactly the same problem.

I got as far as using fdupes to identify duplicates and delete the extras. It was slow.

Thinking about some of the other comments… If you use a tool to create hard links first, you could then traverse the entire tree and delete a file if it has more than one hard link. The two phases could be done piecemeal and are cancelable and restartable.
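A minimal sketch of the second phase, assuming the hard-linking phase has already been run over both trees. It is deliberately stricter than "delete anything with more than one link": a file is only removed from the redundant tree if the same inode is also present in the tree being kept. The function names are hypothetical.

```python
from __future__ import annotations

import os
import stat
from pathlib import Path


def regular_files(tree: Path):
    """Yield (path, lstat result) for every regular file under tree."""
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            path = Path(dirpath) / name
            st = path.lstat()  # lstat: symlinks are not followed
            if stat.S_ISREG(st.st_mode):
                yield path, st


def remove_linked_copies(keep: Path, redundant: Path,
                         dry_run: bool = True) -> list[Path]:
    """Remove files in redundant that are hard links to files in keep."""
    kept_inodes = {(st.st_dev, st.st_ino) for _path, st in regular_files(keep)}
    removed = []
    # list() finishes the directory walk before anything is deleted.
    for path, st in list(regular_files(redundant)):
        if st.st_nlink > 1 and (st.st_dev, st.st_ino) in kept_inodes:
            removed.append(path)
            if not dry_run:
                path.unlink()
    return sorted(removed)
```

Because each call only looks at the current state of the file system, it can be interrupted and run again without keeping any extra bookkeeping.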

Agility0971, 3 days ago

That sounds doable. I would, however, not trust myself to code something bug-free on the first go xD

deadbeef79000, 3 days ago

Backup backup backup! If you have btrfs, then just take a snapshot first: it’s instant.

One could do a non-destructive rename first. E.g. prepend deleteme. to the file name and sanity check the result. Then either ‘roll back’ by renaming the files back without the prefix, or ‘commit’ by deleting anything with the prefix.
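A rough sketch of that mark / rollback / commit idea. The `deleteme.` prefix comes from the comment above; the function names and the choice to skip a rollback when the original name is taken again are assumptions made for this example.

```python
from pathlib import Path

PREFIX = "deleteme."


def mark(path: Path) -> Path:
    """Rename a file so that it is flagged for deletion, without deleting it."""
    marked = path.with_name(PREFIX + path.name)
    if marked.exists():
        raise FileExistsError(marked)
    path.rename(marked)
    return marked


def rollback(tree: Path) -> int:
    """Undo mark() for every flagged file under tree; return how many."""
    count = 0
    for marked in sorted(tree.rglob(PREFIX + "*")):
        if not marked.is_file():
            continue
        original = marked.with_name(marked.name[len(PREFIX):])
        if original.exists():
            continue  # never overwrite; leave the flagged file for manual review
        marked.rename(original)
        count += 1
    return count


def commit(tree: Path) -> int:
    """Permanently delete every flagged file under tree; return how many."""
    count = 0
    for marked in sorted(tree.rglob(PREFIX + "*")):
        if marked.is_file():
            marked.unlink()
            count += 1
    return count
```

Between `mark` and `commit` the tree can be inspected with any file manager, since flagged files sort together by name.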

geoma, 4 days ago

What about folders? When you have duplicated folders (sometimes with a lot of nested subfolders), a file deduplicator will take forever. Do you know of any software that works with duplicate folders?

Agility0971, 3 days ago

What do you mean when you say file deduplication will take forever if there are duplicated directories? That the scan will take forever, or that the manual confirmation will take forever?

geoma, 2 days ago

That manual confirmation will take forever

lemmyvore, 4 days ago

Use Borg Backup. It has built-in deduplication — it works with chunks not files and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter, the borg backups will be optimized either way.
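To illustrate what chunk-level deduplication means, here is a toy sketch. It is not how Borg is implemented: it cuts data into fixed-size pieces for simplicity, whereas Borg uses content-defined chunking, and the `ChunkStore` class is invented for this example.

```python
import hashlib


class ChunkStore:
    """Toy chunk store: identical chunks are kept only once."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}       # digest -> chunk bytes (unique chunks)
        self.total_chunks = 0  # chunks seen, including repeats

    def add(self, data: bytes) -> list:
        """Store data and return the list of chunk digests that rebuilds it."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # stored only if new
            recipe.append(digest)
            self.total_chunks += 1
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild the original data from its chunk digests."""
        return b"".join(self.chunks[digest] for digest in recipe)
```

Two files that share most of their content end up sharing most of their chunks, which is why duplicated home directories cost very little extra space in such a repository.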

FryAndBender, 3 days ago

Here are the stats from a backup of one server with approx. 600 GB:

                Original size   Compressed size   Deduplicated size
This archive:       592.44 GB         553.58 GB            13.79 MB
All archives:        14.81 TB          13.94 TB           599.58 GB

                Unique chunks      Total chunks
Chunk index:          2760965          19590945

13 MB… nice


possiblylinux127, 4 days ago

I use rsync and ZFS snapshots

deadbeef79000, 4 days ago

For backup or for file-level deduplication?

If the latter, how?

slavanap, 4 days ago

1. rsync can sync hard links correctly (with -H).
2. ZFS has pretty fast block-level deduplication (zfs set dedup=edonr,verify), where the block size is 1 MB (zfs set recordsize=1M).
3. In reality, I tried to get my data into a proper structure, but it was so time-consuming that I couldn’t do any other work. So I settled on ZFS as a history I can roll back to if I accidentally delete something important, which also gives me all the benefits mentioned above.

Kualk, 4 days ago

hardlink

Most underrated tool that is frequently installed on your system. It recognizes BTRFS. Be aware that there are multiple versions of it in the wild.

It is unattended.

www.man7.org/linux/man-pages/…/hardlink.1.html

Tramort, 4 days ago

Is hardlink the same as ln without the -s switch?

I tried reading the page but it’s not clear

deadbeef79000, 4 days ago

ln creates a hard link, ln -s creates a symlink.

So, yes, the hardlink tool effectively replaces a file’s duplicates with hard links automatically, as if you’d used ln manually.
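The difference is easy to see in a small experiment. The sketch below uses Python’s `os.link` and `os.symlink`, which correspond to `ln` and `ln -s`; the file names and the `describe_links` helper are made up for the demonstration, and it should be pointed at an empty scratch directory.

```python
import os
from pathlib import Path


def describe_links(directory: Path) -> dict:
    """Create one hard link and one symlink, then report how they behave."""
    original = directory / "original.txt"
    hard = directory / "hard.txt"
    soft = directory / "soft.txt"
    original.write_text("hello")
    os.link(original, hard)            # like: ln original.txt hard.txt
    os.symlink("original.txt", soft)   # like: ln -s original.txt soft.txt
    info = {
        # A hard link is a second name for the same inode.
        "same_inode": original.stat().st_ino == hard.stat().st_ino,
        "link_count": original.stat().st_nlink,
        # A symlink is a separate small file that stores a path.
        "soft_is_symlink": soft.is_symlink(),
    }
    original.unlink()  # remove the first name
    info["hard_still_readable"] = hard.read_text() == "hello"
    info["soft_dangling"] = not soft.exists()
    return info
```

After the original name is removed, the data is still reachable through the hard link, while the symlink points at nothing. That is why deleting all but one hard-linked name loses no data.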

Tramort, 4 days ago

Ahh! Cool! Thanks for the explanation.

Agility0971, 4 days ago

This will indeed save space, but I don’t want links either. I want unique files.

TheAnonymouseJoker, 4 days ago

The file type with the largest footprint is video. Use the Video Duplicate Finder tool on GitHub. Then use Czkawka to deduplicate general types of files. Both are available on Linux.

This will solve at least 97% of your problems.

ninekeysdown, 4 days ago

Restic

biribiri11, 4 days ago

As said previously, Borg is a full deduplicating incremental archiver, complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like pika to browse the archives to ensure a complete history.

Agility0971, 4 days ago

I did not ask for a backup solution, but for a deduplication tool

biribiri11, 4 days ago (edited 4 days ago)

Tbf you did start your post with

“I’m in the process of starting a proper backup”

So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like borg). Your bullet points can probably be organized into a shell script with sync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming them in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using borg import-tar. It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.

rotopenguin, 4 days ago

Use rm with the redundant files option.

rm -rf /

BCsven, 4 days ago

FSlint will do some of these things once you configure its actions.

HumanPerson, 4 days ago

I believe ZFS has deduplication built in if you want a separate backup partition. Not sure about its reliability though. Personally I just have a script that keeps a backup and an oldbackup, and they are both fairly small. I keep a file in my home dir called excluded for things like Linux ISOs that don’t need to be backed up.

GenderNeutralBro, 4 days ago

BTRFS also supports deduplication, but not automatically. duperemove will do it and you can set it up on a cron task if you want.

kylian0087, 4 days ago

Take a look at Borg. It is a very well suited backup tool that has deduplication.

utopiah, 4 days ago

I don’t actually know, but I bet that’s relatively costly, so I would at least try to be mindful of efficiency, e.g.

- use find to start only with large files, e.g. > 1 GB (depends on your own threshold)
- look for a “cheap” way to find duplicates, e.g. exact same size (far from perfect, yet I bet it is sufficient in most cases)

then, after trying a couple of times,

- find a “better” way to detect duplicates, e.g. SHA1 (quite expensive)
- lower the threshold to include more files, e.g. > 0.1 GB

and possibly heuristics, e.g.

- directories where all filenames are identical, maybe based on locate/updatedb, which is most likely already indexing your entire filesystem

Why do I suggest all this rather than a tool? Because I bet a lot of decisions have to be made manually.
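The staged approach above (size threshold first, then the cheap same-size check, and the expensive hash only for the remaining candidates) could look roughly like this. It is a sketch under assumptions: the function names and the default threshold are placeholders, and it walks the tree itself instead of using find or locate.

```python
from __future__ import annotations

import hashlib
import os
from collections import defaultdict
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root: Path, min_size: int = 1_000_000_000) -> list[list[Path]]:
    """Return groups of files under root that have identical content."""
    # Stage 1 (cheap): keep only large files and group them by exact size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            size = path.stat().st_size
            if size >= min_size:
                by_size[size].append(path)
    # Stage 2 (expensive): hash only files that share their size with another.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha1_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Running it first with a high `min_size` and then lowering the value matches the "lower the threshold" step; the function only reports groups, so every deletion stays a manual decision.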

utopiah, 4 days ago

If you use rmlint as others suggested, here is how to list the paths of the dupes:

jq -c '.[] | select(.type == "duplicate_file").path' rmlint.json
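For anyone without jq, roughly the same filter can be written in Python. This assumes only what the jq expression above implies: that rmlint.json is a JSON array whose duplicate entries are objects with `type` and `path` fields. The function name is made up.

```python
import json
from pathlib import Path


def duplicate_paths(report: Path) -> list:
    """Return the path of every entry marked as a duplicate file."""
    entries = json.loads(report.read_text(encoding="utf-8"))
    return [
        entry["path"]
        for entry in entries
        # Skip anything that is not an object or has no matching type.
        if isinstance(entry, dict)
        and entry.get("type") == "duplicate_file"
        and "path" in entry
    ]
```

A typical call would be `duplicate_paths(Path("rmlint.json"))`, with the result reviewed by hand before anything is removed.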

utopiah, 4 days ago

FWIW I just did a quick test with rmlint and, as a user, I would definitely not trust an automated tool to remove files on my filesystem. If it’s for a proper data filesystem, basically a database, sure, but otherwise there is plenty of legitimate duplication, e.g. ./node_modules, so the risk of breaking things is relatively high. IMHO it’s better to learn why there are duplicates on a case-by-case basis, but again, I don’t know your specific use case, so maybe it’d fit.

PS: I imagine it’d be good for a content library, e.g. ebooks, ROMs, movies, etc.

utopiah, 4 days ago

fclones (github.com/pkolaczk/fclones) looks great, but I haven’t used it so I can’t vouch for it.

paris, 4 days ago

I was using Radarr/Sonarr to download files via qBittorrent and then hardlink them to an organized directory for Jellyfin, but I set up my container volume mappings incorrectly and it was only copying the files over, not hardlinking them. When I realized this, I fixed the volume mappings and ended up using fclones to deduplicate the existing files and it was amazing. It did exactly what I needed it to and it did it fast. Highly recommend fclones.

I’ve used it on Windows as well, but I’ve had much more trouble there since I like to write the output to a file first to double check it before catting the information back into fclones to actually deduplicate the files it found. I think running everything as admin works but I don’t remember.

boredsquirrel, 4 days ago

btrbk

lurch, 4 days ago

Make sure to make the first backup before you use deduplication, just in case it goes sideways.

MalReynolds, 4 days ago

Be aware that halfway decent backup solutions dedupe. Which is not to say you shouldn’t clean your shit up. I vote github.com/qarmin/czkawka.
