Deduplication tool

I’m in the process of setting up a proper backup solution; however, over the years I’ve accumulated a few copy-pasted home directories from different systems as a quick-and-dirty solution. Now I have to pay off my technical debt and remove the duplicates. I’m looking for a deduplication tool that will:

  • accept a destination directory
  • delete the source locations after the operation
  • if two files have the same content, delete the redundant copy
  • if two files have different content, move the file and rename it to avoid a name collision (a rough sketch of this is below)

I tried doing it in Nautilus, but it only looks at file names, not file contents. E.g. if two photos have the same content but different names, it will still create a redundant copy.
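To make it concrete, here’s an untested sketch of the behaviour I’m after (whole-file SHA-256 as the “same content” check is my assumption, and src/dst are placeholder arguments, not a real tool):

```python
#!/usr/bin/env python3
"""Sketch only: move src into dst, deleting exact duplicates and
renaming on content collisions. Untested; snapshot or back up first."""
import hashlib
import os
import shutil
import sys


def file_hash(path, bufsize=1 << 20):
    """Whole-file SHA-256; 'same hash' is treated as 'same content'."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


def merge(src, dst):
    for dirpath, _dirs, files in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        for name in files:
            s = os.path.join(dirpath, name)
            d = os.path.join(dst, rel, name)
            os.makedirs(os.path.dirname(d), exist_ok=True)
            if not os.path.exists(d):
                shutil.move(s, d)        # no collision: plain move
            elif file_hash(s) == file_hash(d):
                os.remove(s)             # identical content: drop the redundant copy
            else:
                stem, ext = os.path.splitext(name)
                n = 1                    # different content: rename to avoid collision
                while os.path.exists(d):
                    d = os.path.join(dst, rel, f"{stem} ({n}){ext}")
                    n += 1
                shutil.move(s, d)
    shutil.rmtree(src)                   # source location deleted after the operation


if __name__ == "__main__":
    merge(sys.argv[1], sys.argv[2])
```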

Edit: Some comments suggested duperemove, which uses btrfs’ block-level dedupe support. That replaces identical file content with pointers to the same location on disk, so the duplicate names remain. This is not what I intend; I want to remove the redundant files completely.

Edit 2: Another quite cool solution is to use hardlinks: replace all occurrences of the same data with hardlinks to a single copy, then traverse the redundant directories and delete every file whose link count is greater than one. The remaining files will be unique. I’m not going for this myself, as I don’t trust myself to write a bug-free implementation.
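For anyone who does want to try, the first pass could look something like this (untested sketch; hashes whole files with SHA-256 and assumes everything sits on one filesystem, since hardlinks cannot cross filesystems):

```python
#!/usr/bin/env python3
"""Sketch of the hardlink pass: relink every later occurrence of the
same content to the first copy. Untested; run on a snapshot first."""
import hashlib
import os
import sys


def file_hash(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


def link_duplicates(root):
    first_seen = {}  # content hash -> path of the first copy
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            digest = file_hash(path)
            if digest in first_seen:
                os.remove(path)                    # drop the duplicate bytes...
                os.link(first_seen[digest], path)  # ...and relink to the original
            else:
                first_seen[digest] = path


if __name__ == "__main__":
    link_duplicates(sys.argv[1])
```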

possiblylinux127,

I use rsync and ZFS snapshots

deadbeef79000,

For backup, or for file-level deduplication?

If the latter, how?

slavanap,

1. rsync can sync hardlinks correctly (rsync -H).
2. ZFS has pretty fast block-level deduplication (zfs set dedup=edonr,verify) with a block size of up to 1 MB (zfs set recordsize=1M).
3. In reality, I tried to achieve a proper data structure, but it was way too time-consuming; I couldn’t get any other work done. So instead I use ZFS as a history backtrack: if I accidentally delete something very important, I can roll back to it, and I still get all the aforementioned benefits of ZFS.

lemmyvore,

Use Borg Backup. It has built-in deduplication: it works with chunks, not files, and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter; the Borg backups will be optimized either way.
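To illustrate the chunk idea only (this is not Borg’s actual algorithm: Borg uses a content-defined, rolling-hash chunker, while this toy sketch uses fixed-size chunks and an in-memory dict as the store):

```python
import hashlib

store = {}  # chunk digest -> chunk bytes; each unique chunk kept once


def backup(path, chunk_size=4 * 1024 * 1024):
    """Record a file as a list of chunk digests; identical chunks,
    within one file or across files, are stored only once."""
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            d = hashlib.sha256(chunk).hexdigest()
            store.setdefault(d, chunk)  # no-op if this chunk is already stored
            digests.append(d)
    return digests
```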

FryAndBender,

Here are the stats from a backup of one server with approx. 600 GB:



```
                  Original size      Compressed size    Deduplicated size
This archive:         592.44 GB            553.58 GB             13.79 MB
All archives:          14.81 TB             13.94 TB            599.58 GB

                  Unique chunks         Total chunks
Chunk index:            2760965             19590945
```

13meg… nice

geoma,

What about folders? Because sometimes when you have duplicated folders (sometimes with a lot of nested subfolders), a file deduplicator will take forever. Do you know of any software that works with duplicate folders?

Agility0971,

What do you mean by a file deduplicator taking forever if there are duplicated directories? That the scan will take forever, or that the manual confirmation will?

geoma,

That manual confirmation will take forever
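A folder-level pre-pass could cut that down: fingerprint directories bottom-up from their children’s hashes, so duplicated trees match as a single unit. A rough, untested sketch:

```python
import hashlib
import os


def tree_fingerprint(root, found=None):
    """Hash a directory from the names and hashes of everything below
    it; two trees with identical contents get identical fingerprints."""
    found = {} if found is None else found
    h = hashlib.sha256()
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        h.update(name.encode())
        if os.path.isdir(path) and not os.path.islink(path):
            h.update(tree_fingerprint(path, found).encode())
        elif os.path.isfile(path):
            fh = hashlib.sha256()
            with open(path, "rb") as f:
                while chunk := f.read(1 << 20):
                    fh.update(chunk)
            h.update(fh.hexdigest().encode())
    digest = h.hexdigest()
    found.setdefault(digest, []).append(root)
    return digest

# After the pass, any fingerprint in `found` with two or more paths is a
# duplicated tree: one confirmation covers the whole subtree.
```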

deadbeef79000,

I have exactly the same problem.

I got as far as using fdupes to identify duplicates and delete the extras. It was slow.

Thinking about some of the other comments… If you use a tool to create hardlinks first, you could then traverse the entire tree and delete any file that has more than one hardlink. The two phases could be done piecemeal, and each is cancelable and restartable; a sketch of the second phase is below.
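Something like this, perhaps (untested; stat() is taken fresh at each visit, so once all the other links to an inode have been removed its last-visited path drops to a link count of one and survives; note it would also collapse hardlinks that existed before the dedup pass):

```python
#!/usr/bin/env python3
"""Sketch of phase two: after a hardlinking pass, walk the tree and
delete any path whose inode still has other links. Untested; run it
on a snapshot or backup first."""
import os
import sys


def prune_hardlinks(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlinks don't count
            if os.stat(path).st_nlink > 1:
                os.remove(path)  # another link to this inode still exists


if __name__ == "__main__":
    prune_hardlinks(sys.argv[1])
```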

Agility0971,

That sounds doable. I would, however, not trust myself to code something bug-free on the first go xD

deadbeef79000,

Backup, backup, backup! If you have btrfs, then just take a snapshot first: it’s instant.

One could do a non-destructive rename first: e.g. prepend ‘deleteme.’ to the file name and sanity-check the result, then either ‘roll back’ by renaming the files back without the prefix, or ‘commit’ by deleting anything with the prefix. A sketch is below.
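A sketch of that rename-first workflow (untested; the ‘deleteme.’ prefix is the one described above):

```python
import os

PREFIX = "deleteme."


def mark(path):
    """Non-destructive 'delete': prepend the prefix so the file is easy
    to spot, sanity-check, and later roll back or purge."""
    dirpath, name = os.path.split(path)
    os.rename(path, os.path.join(dirpath, PREFIX + name))


def rollback(root):
    """Undo: strip the prefix from every marked file."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith(PREFIX):
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, name[len(PREFIX):]))


def commit(root):
    """Commit: actually delete everything carrying the prefix."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith(PREFIX):
                os.remove(os.path.join(dirpath, name))
```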
