Comment on Where to start with backups?

fizzle@quokk.au 22 hours ago
Deduplication based on content-defined chunking is used to reduce the number of bytes stored: each file is split into a number of variable length chunks and only chunks that have never been seen before are added to the repository.

A chunk is considered duplicate if its id_hash value is identical. A cryptographically strong hash or MAC function is used as id_hash, e.g. (hmac-)sha256.

To deduplicate, all the chunks in the same repository are considered, no matter whether they come from different machines, from previous backups, from the same backup or even from the same single file.

Compared to other deduplication approaches, this method does NOT depend on:

- file/directory names staying the same: So you can move your stuff around without killing the deduplication, even between machines sharing a repo.
- complete files or time stamps staying the same: If a big file changes a little, only a few new chunks need to be stored - this is great for VMs or raw disks.
- the absolute position of a data chunk inside a file: Stuff may get shifted and will still be found by the deduplication algorithm.

This is what their docs say. Not sure what you mean about different file types, but this seems fairly agnostic?
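
To make the chunk-id part concrete, here's a toy sketch of the idea (made-up key and in-memory repo, not the tool's actual code):

```python
import hmac, hashlib

# Toy in-memory "repository": chunk id (a MAC of the chunk bytes) -> chunk data.
repo = {}
key = b"per-repository secret"  # made-up key, just for illustration

def chunk_id(chunk: bytes) -> bytes:
    # id_hash: a keyed, cryptographically strong MAC, e.g. HMAC-SHA256
    return hmac.new(key, chunk, hashlib.sha256).digest()

def store(chunk: bytes) -> bytes:
    cid = chunk_id(chunk)
    if cid not in repo:      # only never-seen-before chunks get written
        repo[cid] = chunk
    return cid               # archives just reference the id

# The same chunk from any file, backup, or machine dedups to a single copy.
ids = [store(c) for c in (b"hello", b"world", b"hello")]
print(len(repo), "chunks stored for", len(ids), "references")  # 2 for 3
```

Because the id is a MAC over the chunk's content, it doesn't matter which file, path or machine the chunk came from.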

I actually didn’t realise that first point: you can move folders around and the chunks will still be deduplicated.
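
The last bullet is the reason that works: chunk boundaries are picked from the content itself (via a rolling hash) rather than from fixed offsets, so moved or shifted data mostly re-chunks the same way. Here's a simplified, hypothetical chunker just to show the effect (not the actual algorithm they use):

```python
import hashlib, os, random

# Random per-byte values for a gear-style rolling hash (fixed seed so it's repeatable).
_rng = random.Random(42)
TABLE = [_rng.getrandbits(32) for _ in range(256)]
MASK = 0x3F  # cut when the low 6 bits are zero -> roughly 64-byte chunks on average

def chunks(data: bytes, min_size=16, max_size=256):
    """Split data where a rolling hash of recent bytes hits a pattern,
    so boundaries depend on local content, not absolute file offsets."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

original = os.urandom(4096)
shifted = b"a few new bytes stuck on the front" + original  # everything moves

ids_a = {hashlib.sha256(c).digest() for c in chunks(original)}
ids_b = {hashlib.sha256(c).digest() for c in chunks(shifted)}
print(f"chunks shared after the shift: {len(ids_a & ids_b)} of {len(ids_a)}")
```

With a fixed-size chunker the inserted bytes would shift every boundary and almost nothing would dedup.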

source