Seems to be NFSv3 [0] - curious to test it out - the only userspace NFSv4 implementation I’m aware of is in Buildbarn (Go) [1]. The example of their NFSv3 implementation disables locking. Still pretty cool to see all the ways the Rust ecosystem is empowering stuff like this.

I’m kinda surprised no one has integrated the Buildbarn NFSv4 stuff into Docker/Podman - the virtiofs stuff is pretty bad on macOS, and the Buildbarn NFSv4.0 stuff is a big improvement over NFSv3.

Anyhow I digress. Can’t wait to take it for a spin.

[0] https://github.com/Barre/zerofs_nfsserve

[1] https://github.com/buildbarn/bb-remote-execution/tree/master...

Eikon · 13 hours ago
Buildbarn's NFSv4 server looks great!

I'm not too sure about NFSv4 for ZeroFS specifically; the benefits don't seem _that_ obvious to me, especially compared to something like 9P, where server-side locking in the ".L" version is native anyway. That's especially true given that most ZeroFS users mount the fs on the same box the server runs on, so local locks already work.

macOS is a bit of a blind spot for 9P though, and a client there would be great.

Seems like a really interesting project! I don't understand what's going on with latency vs durability here. The benchmarks [1] report ~1ms latency for sequential writes, but that's just not possible with S3. So presumably writes are not being persisted to S3 before being acknowledged to the client.

What is the durability model? The docs don't talk about intermediate storage. SlateDB does confirm writes to S3 by default, but I assume that's not happening here?

[1] https://www.zerofs.net/zerofs-vs-juicefs

SlateDB offers different durability levels for writes. By default writes are buffered locally and flushed to S3 when the buffer is full or the client invokes flush().

https://slatedb.io/docs/design/writes/

Eikon · 17 hours ago
The durability profile before sync should be pretty close to a local filesystem's. There's in-memory buffering on writes; then, when fsync is issued, the in-memory threshold is exceeded, or a timeout elapses, the data is synced.
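
As a rough illustration of that buffer-then-flush policy, here's a minimal sketch; the struct, the thresholds, and the upload_to_object_store helper are assumptions for illustration, not ZeroFS or SlateDB internals:

    use std::time::{Duration, Instant};

    // Illustrative buffer-then-flush policy, not actual ZeroFS code.
    struct WriteBuffer {
        pending: Vec<u8>,   // writes accumulated in memory
        max_bytes: usize,   // flush when the buffer grows past this
        max_age: Duration,  // flush when data has been sitting too long
        last_flush: Instant,
    }

    impl WriteBuffer {
        fn write(&mut self, data: &[u8]) {
            self.pending.extend_from_slice(data); // acknowledged immediately
            if self.pending.len() >= self.max_bytes || self.last_flush.elapsed() >= self.max_age {
                self.flush(); // threshold or timeout exceeded
            }
        }

        // fsync maps to an explicit flush: only after this returns is the data durable.
        fn fsync(&mut self) {
            self.flush();
        }

        fn flush(&mut self) {
            if !self.pending.is_empty() {
                upload_to_object_store(&self.pending); // hypothetical object-store PUT
                self.pending.clear();
            }
            self.last_flush = Instant::now();
        }
    }

    fn upload_to_object_store(_bytes: &[u8]) {} // stand-in for the real S3 write

    fn main() {
        let mut buf = WriteBuffer {
            pending: Vec::new(),
            max_bytes: 8 * 1024 * 1024,          // e.g. flush every 8 MiB
            max_age: Duration::from_millis(100), // or every 100 ms
            last_flush: Instant::now(),
        };
        buf.write(b"hello");
        buf.fsync(); // data is only durable once this returns
    }
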
Thanks, makes sense. I checked the benchmark source and saw it's not fsyncing, so only some of the files will be durable by the time the benchmark is done. The benchmark docs might benefit from discussing this, or from benchmarking both cases? O_SYNC / fsync before file close is an important use case.

edit: A quirk of using NFSv3 here is that there's no specific close op. So, if I understand right, ZeroFS's "close-to-open consistency" doesn't imply durability on close (and can't, unless every NFS op is durable before returning), only on fsync. Whereas EFS and (I think?) Azure Files do have this property.

Eikon · 10 hours ago
There's an NFSv3 COMMIT operation, combined with a "durability" (stable_how) marker on writes. fsync could translate to COMMIT, but if writes are marked as durable, common clients don't call COMMIT, and if writes are marked as non-durable, COMMIT is called after every operation, which kind of defeats the point. When you use NFS with ZeroFS, you cannot really rely on fsync.
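
For reference, here's a minimal sketch of how the protocol frames this; the enum mirrors RFC 1813's stable_how values, while the client logic and the server_write / server_commit helpers are illustrative assumptions, not any real client's code:

    // NFSv3 WRITE carries a stable_how flag; COMMIT exists for the unstable case.
    enum StableHow {
        Unstable, // server may just buffer; durability requires a later COMMIT
        DataSync, // file data durable before the WRITE reply, metadata may lag
        FileSync, // data and metadata durable before the WRITE reply
    }

    fn client_write(data: &[u8], stable: StableHow) {
        server_write(data, &stable);
        if matches!(stable, StableHow::Unstable) {
            // A client that cares about durability must eventually COMMIT
            // (and re-send writes if the server's verifier changed, e.g. after a crash).
            server_commit();
        }
    }

    fn server_write(_data: &[u8], _stable: &StableHow) {} // placeholder
    fn server_commit() {}                                 // placeholder

    fn main() {
        // An "unstable" write followed by the COMMIT a careful client must issue.
        client_write(b"some bytes", StableHow::Unstable);
    }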

I'd recommend using 9P when that matters, which has proper semantics there. One property of ZeroFS is that any file you fsync actually syncs everything else too.

They have a bunch of claims comparing this to JuiceFS at https://www.zerofs.net/zerofs-vs-juicefs

I am in no way affiliated with JuiceFS, but I have done a lot of benchmarking and testing of it, and the numbers claimed here for JuiceFS are suspicious (basically 5 ops/second with everything mostly broken).

I have a JuiceFS setup and get operations in the 1000s/s; not sure how they got such low numbers.

JuiceFS also supports multiple concurrent clients, each making its own connection to the metadata and object storage, allowing near instant synchronization and better performance, whereas this seems to rely on a single service holding the connection and everyone connecting through it, with no support for clustering.

Eikon · 16 hours ago
I have no doubt that JuiceFS can perform “thousands of operations per second” across parallel clients. I don't think that's a useful benchmark because the use cases we are targeting are not embarrassingly parallel.

Using a bunch of clients on any system hides the latency profile. You could even get your "thousands of operations per second" on a system where any operation takes 10 seconds to complete.
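
(For illustration, with made-up numbers: 10,000 requests in flight at 10 seconds of latency each still averages 10,000 / 10 = 1,000 ops/s, while a strictly sequential client on the same system would see 0.1 ops/s.)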

I'm referring to one client being able to do 1000s of operations a second. I haven't experienced the bad performance with JuiceFS that the article describes.

This is a JuiceFS setup with 10TB of data, where JuiceFS was specifically chosen because of its minimal latency compared to raw object storage.

Eikon · 9 hours ago
> I'm referring to one client being able to do 1000s of operations a second

Again, I don't see any problems with "thousands of operations a second" if your single client is issuing parallel requests. Try any sequential workload instead.

> because of its minimal latency compared to raw object storage

That's not how it works. JuiceFS hits the object store for every file operation (except metadata, because that's backed by an external database). Latency cannot be lower than the object store's, especially as JuiceFS doesn't seem to implement any buffering.

Above you also claim "allowing near instant synchronization and better performance", but near instant synchronization of what exactly? Metadata? Data blocks? And under what consistency model?

If you wish to run the benchmarks in bench/ on your setup and show better numbers, I'd happily update the JuiceFS benchmarks.

16 hours ago
I've never used JuiceFS in prod (or any S3 product, for that matter), but I was involved in benchmarking JuiceFS + Garage for archiving based on torrents. Initial results were promising, but qBittorrent quickly produced a pathological case where reads/writes dropped to almost zero (and stayed that way long term).

The data and method can be found here: https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1021

Every software's perf is optimized for a usage pattern, and maybe TFA's benchmark or torrenting isn't it, but I was certainly disappointed in the performance I found on 12 cores / 144 GB / NVMe (for metadata) / 6x8 TB HDD (for data). In the end we didn't move to S3; we stayed with good old ZFS storage.

I'd be curious to reproduce my benchmarks with ZeroFS when I find the time.

Eikon · 17 hours ago
The bench suite is published at the root of the repo in bench/. Stating “a bunch of claims” seems a bit dismissive when it’s that easy to reproduce.

JuiceFS maps operations 1:1 to S3 chunk-wise, so the performance profile is entirely unsurprising to me; even untarring an archive is not going to be a pleasant experience.

I had to laugh out loud: "In practice, you'll encounter other constraints well before these theoretical limits, such as S3 provider limits, performance considerations with billions of objects, or simply running out of money."

Very cool to see NFS, NBD, or even 9P supported.

Looking at those benchmarks, I think you must be using a local disk to sync writes before uploading to S3?

Eikon · 16 hours ago
There’s no on-disk write cache.

Has anyone tried it as a cache dir in CI? I’m concerned that random reads look pretty slow in the benchmarks.

Incredibly cool! It shows running Ubuntu and Postgres, and also supports full POSIX operations.

Questions:

- I see there is a diagram showing multiple Postgres nodes backed by this store, very similar to a horizontally distributed web server. Doesn't Postgres use WAL replication? Or is it disabled, and are they running on the same "views" of the filesystem?

- What does this mean for services that handle geo-distribution on app layer? e.g. CockroachDB?

Sorry if this sounds dumb.

It looks bizarre to see that Azure Storage support was included, because there's no need to use fancy SlateDB + LSM-tree code to simulate a block device on Azure Storage: it already natively provides "page blobs" that can be mounted as standard disks and used for boot disks, databases, or whatever. These are the backing storage for all Azure virtual machine disks and can also be used directly via the HTTP API. For example, SQL Server can store its database files on page blobs without having to attach them as disks and format them as NTFS volumes.

See: https://learn.microsoft.com/en-us/azure/storage/blobs/storag...

Eikon · 13 hours ago
What's so bizarre about it?

Are we prevented from implementing features when the platform itself provides adjacent tech?

Page blobs are 2x+ more expensive ($0.045/GB vs $0.02/GB) and would require Azure-specific code. We don't dream of maintaining separate implementations for each cloud provider.

The approach works identically on S3, GCS, Azure, and local storage. If anything, ZeroFS helps you move away from vendor lock-in, especially if you start running "usually managed" components on top, say databases.

Speaking of Azure's "native" solutions - we benchmarked ZeroFS against Azure Files. ZeroFS is 35-41x faster for most operations and 38% cheaper. If Azure Files performs that poorly, I don't hold much hope for page blobs either: https://www.zerofs.net/zerofs-vs-azure-files

Fair points.

Note that Azure Files is a significantly different offering from block storage; it is essentially a Windows file server cluster and uses the SMB v3.1.1 protocol. It's slow, but not that slow.

It looks like you benchmarked against the HDD version of it, which few people use. Everyone who cares about its performance uses Azure Files Premium, which is the pure-SSD version of the service. They recently updated it to have flexible performance that can be cheaper than the older HDD storage: https://techcommunity.microsoft.com/blog/azurestorageblog/lo...

You're also mounting it from Linux, which is just... weird. Nobody (for some values of nobody) does this. The sole purpose for this service is to provide backwards-compatible file shares for Windows servers and desktops.

Benchmarking cloud services can be tricky because they have odd quirks such as IOPS-per-GB ratios, so nearly empty storage can be much slower than you'd expect!

Eikon · 12 hours ago
> You're also mounting it from Linux, which is just... weird. Nobody (for some values of nobody) does this. The sole purpose for this service is to provide backwards-compatible file shares for Windows servers and desktops.

That's not what Azure's marketing material says:

Serverless file shares

"Store your files on a distributed file system built from the ground up to be highly available and durable."

"Take advantage of fully managed file shares mounted concurrently by cloud or on-premises deployments of Windows, Linux, and macOS."

"Use integration with Azure Kubernetes Service (AKS) for easy cloud file storage and management of your data using NFS or SMB file shares."

Azure explicitly markets this for Linux and Kubernetes. If it performs terribly on Linux, that's Azure's problem, not mine.

https://azure.microsoft.com/en-us/products/storage/files

> Benchmarking cloud services can be tricky because they have odd quirks such as IOPS-per-GB ratios, so nearly empty storage can be much slower than you'd expect!

These managed services charge premium prices - they should deliver premium performance. At those prices, "quirks" and "you need to fill it up first" aren't acceptable excuses.

I don't believe anything Microsoft's marketing department says either.

I've had similar issues with Amazon's equivalent too, where I tried to use EFS for a tiny website and I got something like 0.1 IOPS!

I do like the design of ZeroFS, and it irks me that competing solutions make local caching and buffering weirdly difficult. For example, Azure Premium SSD v2 disks no longer support local SSD caches! It's also annoyingly hard to combine the local cache disks you do get with remote disks using Windows Server; it insists on trying to detect the "role" of each disk instead of just letting admins specify it.

Using something like ZeroFS could be a nice trick for getting decent price/performance on public cloud platforms. My only concern is integrity and durability: what testing have you done to validate that it doesn't lose data even in corner-cases?

Eikon · 12 hours ago
> Using something like ZeroFS could be a nice trick for getting decent price/performance on public cloud platforms. My only concern is integrity and durability: what testing have you done to validate that it doesn't lose data even in corner-cases?

There's always going to be uncertainty with software this young, but the CI is pretty expensive and we haven't received any corruption-related reports yet. https://github.com/Barre/ZeroFS/tree/main/.github/workflows

At a glance, ZeroFS appears to be an overlay filesystem above a VFS. It looks like it abstracts away the underlying filesystem configuration/complexity and creates a simplified, unified configuration that could allow you to be more dynamic underneath.

It also appears to support a great deal of IO and permission complexity, which makes sense as it refers to itself as a filesystem. It also purports to be extremely fast. In the end it looks to function like an actual filesystem with all that entails.

rclone (my first time hearing about it) appears to be a much simpler solution with different design goals. It looks more like rsync in most instances, or a user-space link to a remote filesystem in others (like FTP or smbclient). The mount capability, while similar to ZeroFS, looks like something much simpler, targeted at less sophisticated needs.

Not saying one's better than the other; merely that they really look to be serving different needs/goals while having some overlap.

Finally, a way to get more than 2 GB of local storage on DigitalOcean's App Platform!

2 GB? They have storage-optimized NVMe droplets that at the high end support 4.6 TB (though absurdly expensive at $2,000/mo).
Built atop the excellent SlateDB! Breaks files down into 256k chunks. Encrypted. Much, much better POSIX compatibility than most FUSE alternatives. SlateDB has snapshots & clones, so that could be another great superpower of ZeroFS.
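
A rough sketch of what that chunking looks like; the key layout, names, and the 256 KiB constant below are assumptions for illustration, not ZeroFS's actual schema:

    const CHUNK_SIZE: u64 = 256 * 1024; // 256 KiB, per the comment above

    // Map a byte range of a file to the (inode, chunk_index) keys that back it.
    fn chunks_for_range(inode: u64, offset: u64, len: u64) -> Vec<(u64, u64)> {
        let first = offset / CHUNK_SIZE;
        let last = (offset + len.max(1) - 1) / CHUNK_SIZE;
        (first..=last).map(|idx| (inode, idx)).collect()
    }

    fn main() {
        // A 1 MiB read starting at byte 300_000 touches chunks 1 through 5.
        let keys = chunks_for_range(42, 300_000, 1_048_576);
        println!("{keys:?}");
    }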

Incredible performance figures, rocketing to probably the best way to use object storage in an fs-like way. There's a whole series of comparisons, & they probably need a logarithmic scale given the size of the lead ZeroFS has! https://www.zerofs.net/zerofs-vs-juicefs

Speaks 9P, NFS, or NBD. Some great demos of ZFS with L2ARC caches giving near-local performance while having S3 persistence.

Totally what I was thinking of when someone in the Immich thread mentioned wanting a way to run it on cheap object storage. https://news.ycombinator.com/item?id=45169036

The concept of "auto"-tiered, transparent storage is _really_ compelling! ...and the use of AGPL is really clever as an "enterprise poison pill".

It still sucks that S3 is ~$20/mo/TB basically "in perpetuity", while random SSD drives are ~$80/TB and I'd feel comfortable effectively amortizing them out at ~$20/yr for local storage instead of $20/mo for S3. :-/

Eikon · 7 hours ago
> It still sucks that S3 is ~$20/mo/TB basically "in perpetuity", while random SSD drives are ~$80/TB and I'd feel comfortable effectively amortizing them out at ~$20/yr for local storage instead of $20/mo for S3. :-/

ZeroFS will work with any S3 implementation that supports conditional writes. Even something a bit premium-priced like Cloudflare R2 goes for around $15/TB and has no egress costs. If you want to go as cheap as possible, I'd throw MinIO or similar on a few cheap Hetzner storage servers: https://www.hetzner.com/dedicated-rootserver/matrix-sx/

How does this compare to rclone, as I also asked in a top level comment? rclone can mount S3 as a union mount so it seems to have the same feature set? Have you benchmarked ZeroFS against rclone?

https://news.ycombinator.com/item?id=45178548

Eikon · 7 hours ago
Unless I am missing something, rclone will probably compare similarly to https://www.zerofs.net/zerofs-vs-mountpoint-s3