I am working on a game preservation project, where the data set holds 10.4 TiB. It contains 1044 earlier versions of a single game in a multitude of different languages, architectures and stages of development. As you can guess, that means extreme redundancy.
The goals are: - bring the size down - retain good read speed (for further processing/reversing) - easy sharable format - lower end machines can use it
My choice fell on the BTRFS filesystem, since it provides advanced features for deduplication, which is not as resource hungry as ZFS. Once the data is processed, it no longer requires a lot of system resources.
In the first round of deduplication, I used "jdupes -rQL" (yes, I know what -Q does) to replace exact copies of files in different directories via hardlinks to minimize data and metadata. This got it down to roughly 874 GiB already, out of which 866 GiB are MPQ files. That's 99,08%... everything besides is a drop in the bucket.
For those uninitiated: this is an archive format. Representing it as a pseudo-code struct it looks something like this { header, files[], hash_table[], block_table[] } Compression exists, but it is applied to each file individually. This means the same file is compressed the same way in different MPQ archives, no matter the offset it happens to be in.
What is throwing a wrench into my plans of further data deduplication are the following points: - the order of files seems not to be deterministic when MPQ files were created (at least I picked that up somewhere) - altered order of elements (files added or removed at the start) causes shifts in file offsets
I thought for quite some time about this, and I think the smartest way forward is, that I manually hack apart the file into multiple extents at specific offsets. Thus the file would contain of an extent for: - the header - each file individually - the hash table - the block table It will increase the size for each file of course, because of wasted space at the end of the last block in each extent. But it allows for sharing whole extents between different archives (and extracted files of it), as long as the file within is content-wise the same, no matter the exact offset. The second round of deduplication will then be whole extents via duperemove, which should cut down the size dramatically once more.
This is where I am hanging right now: I don't know how to pull it off on a technical level. I already was crawling through documentation, googling, asking ChatGPT and fighting it's hallucinations, but so far I wasn't very successful in finding leads (probably need to perform some ioctl calls).
From what I imagine, there are probably two ways to do this: - rewrite the file with a new name in the intended extent layout, delete the original and rename the new one to take it's place - rewrite the extent layout of an already existing file, without bending over backwards like described above
I need is a reliable way to, without chances of the filesystem optimizing away my intended layout, while I write it. The best case scenario for a solution would be a call, which takes a file/inode and a list of offsets, and then reorganizes it into that extents. If something like this does not exist, neither through btrfs-progs, nor other third party applications, I would be up for writing a generic utility like described above. It would enable me to solve my problem, and others to write their own custom dedicated deduplicaton software for their specific scenario.
If YOU - can guide me into the right direction - give me hints how to solve this - tell me about the right btrfs communities where I can talk about it - brainstorm ideas I would be eternally grateful :) This is not a call for YOU to solve my problem, but for some guidance, so I can do it on my own.
I think that BTRFS is superb for deduplicated archives, and it can really shine, if you can give it a helping hand.
"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.
Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,
Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.
Of course, I guess all of this would be much slower than being able to use inherent btrfs features as the author of the post wishes, but at least it would be independent from btrfs :)
Maybe you can use it to decompress some files and assess how much disk space you can save using deduplication at filesystem level.
If it's worth the effort (1 mean, going from 1TiB to 100-200 Gib) I would consider coding the reassembly part. It can be done by a "script" fist, then can be "promoted" to a FUSE filesystem if needed.
i was already brainstorming of writing a custom purpose-built archive format, which would allow me to have more fine grained control over how i can lay out data and reference it. the thing is that this archive is most likely not absolutely final (additional versions being added) - having a plain filesystem allows for easier adding of new entries. an archive file might have to be rewritten.
if i go the route of custom archive, i can in theory write a virtual filesystem for it to access it read only like it would be a real filesystem... and if i design it properly, maybe even write it.
still would prefer to use a btrfs filesystem tbh ^^ will brainstorm a bit more over the next days - thanks for your input!
- Deduplication? Check.
- Compact format? Check.
- Good read speed? Yep. (Proportional to backing store.)
- Custom purpose-built? Yeah, that's what backup programs are for.
- Custom data layout? Check. (Rabin/BuzHash content-defined chunking, SHA256 dedupe.)
- Adding additional versions? Yes. ("Incremental backups"— See above.)
- Virtual filesystem access? `restic mount`/`borg mount`.
Bees does things at a lower level than just reflinking full files like dupremove (it dedups at the level of extents). Be sure to read the links in the "Recommended Reading" section, of the above link, to hopefully avoid issues.
Random idea (I haven't tested it): If you need to figure out how to cluster together related MPQ files, you could compute merkle trees (or multiple merkle trees shifted by prime number offsets to identify common subsequences at arbitrary offsets), and use selected hashes for similarity search.
the MPQ files range from a few MiB to around 2.5 GiB. since the access should be rather fast, pairing them as an archive file is not an option for me.
thanks about the hint of the merkle trees, i will read up on what that is... always good to know about different approaches to a problem :)
- `zstd --train` a dictionary over all the files, then compress it all using the dictionary. Decompress on demand. Make sure you don't lose the dictionary, or you'll lose all of it.
- OR stick it into a Restic/Borg/Kopia/Deduplicacy repository. These use a rolling hash to deduplicate any subsections of files that are the same, before applying compression. You can try it with `--dry-run` first to see how much space you would save. Use their `mount` command to access the files as a directory.
I would not be surprised if it gets the total size down close-ish to the size of only a single copy of the game.
See the other comment about `precomp` if compression gives you issues.
Personally I would just go with the `restic` route, as that's what it's meant for.
Secondly, it seems like it is going to take contortions to utilize BTRFS deduplication with MPQ archives, hacking archives apart into separate extents, fighting with wasted space, etc. You're talking about low level filesystem manipulations to optimize it for this peculiar use case. And then you would need custom tooling anyway to be able to re-constitute the MPQ files from this filesystem of exploded fragments.
As a simpler, filesystem-agnostic approach, I would move all of the file data into a large content-addressable storage blob, which is by construction deduplicated. Then, strip the file data out of the MPQ archives and leave pointers into this blob. The process to reconstitute an archive is straightforward.
This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.
In more detail, to prepare your collection, you would write a script to do these steps:
1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.
2. If this hash has never been seen before, concatenate the file to your large storage blob. Record the hash => (offset, length) mapping to an index file.
3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as pointer you will use later to look up the (offset, length) in the index and recover the file.
4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.
To reverse the process, you just need to pull pieces out of the storage blob and reconstitute the original MPQ archive. Check the hash of the recreated file against the original.
The stripped archives should be of negligible size and no further attempt to compress or deduplicate them is necessary. That said, you should try to compress your storage blob with something like the Zstandard Seekable Format. This will work best if you decompress the file data before hashing/storing it, and then re-compress it later when reconstituting the MPQ archive.
the approach of having a custom storage blob with pointer references which is something i consider as well, will play around with that during the holidays and do some experimenting. thanks for your input.
The difference between my suggestion and how BTRFS dedupes is that the FS does it on a block basis, which might work better for some kinds of files, but for game assets I think doing it on a file basis is good enough if not better.
BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) is not that easy to come by. my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can export btrfs send as a file as well for storage etc.
if all my tries to use BTRFS fail, i might to write my own tooling and virtual filesytem as well, but optimized for my use case (MPQ files and such). thanks for your input so far.
Write Once is how to manage an archive.
Displaying a curated subset is a good interface. Good luck.
If the goal is algorithmic erasure of data rather than preservation of games, then “archive” might again create confusion like the type I probably have.
If there is a strong business case for deduplication, I recommend hiring a consultant with expertise and experience in the problem.
To be clear search is the only way to identify redundancy. If you have search, then redundancy is not a problem.
The question began, I am working on a game preservation project.
aka, is the is a "make it easier to search" question or "we need to take up fewer bytes" question (which, perhaps, used to be the same question, but aren't necessarily now?)