An intro to DeepSeek's distributed file system

623
108
sebg
2 months ago
maknee.github.io

jamesblonde
·
2 months ago
·
[ - ]

Architecturally, it is a scale-out metadata filesystem [ref]. Other related distributed file systems are Collosus, Tectonic (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think PolarFS (Alibaba). They all use different distributed row-oriented DBs for storing metadata. S3FS uses FoundationDB, Collosus uses BigTable, Tectonic some KV store, ADLSv2 (not sure), HopsFS uses RonDB.

What's important here with S3FS is that it supports (1) a fuse client - it just makes life so much easiter - and (2) NVMe storage - so that training pipelines aren't Disk I/O bound (you can't always split files small enough and parallel reading/writing enough to a S3 object store).

Disclaimer: i worked on HopsFS. HopsFS adds tiered storage - NVMe for recent data and S3 for archival.

[ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...

MertsA
·
2 months ago
·
[ - ]

>Tectonic some KV store,

Tectonic is built on ZippyDB which is a distributed DB built on RocksDB.

>What's important here with S3FS is that it supports (1) a fuse client - it just makes life so much easier

Tectonic also has a FUSE client built for GenAI workloads on clusters backed by 100% NVMe storage.

https://engineering.fb.com/2024/03/12/data-center-engineerin...

Personally what stands out to me for 3FS isn't just that it has a FUSE client, but that they made it more of a hybrid of FUSE client and native IO path. You open the file just like normal but once you have a fd you use their native library to do the actual IO. You still need to adapt whatever AI training code to use 3FS natively if you want to avoid FUSE overhead, but now you use your FUSE client for all the metadata operations that the native client would have needed to implement.

https://github.com/deepseek-ai/3FS/blob/ee9a5cee0a85c64f4797...

Scaevolus
·
2 months ago
·
[ - ]

Being able to opt-in to the more complex and efficient user-mode IO path for critical use cases is a very good idea.

carlhjerpe
·
2 months ago
·
[ - ]

While not the same, Ceph storage is accessible as object storage, filesystem (both FUSE and kernel) and block storage.

nickfixit
·
2 months ago
·
[ - ]

I've been using JuiceFS since the start for my AI stacks. Similar and used postgresql for the meta.

jamesblonde
·
2 months ago
·
[ - ]

JuiceFS is very good. I didn't have it as a scaleout metadata FS - it supports lots of DBs (single host and distributed DBs).

threeseed
·
2 months ago
·
[ - ]

Tiered storage and FUSE has existed with Alluxio for years.

And NVMe optimisations e.g. NVMeoF in OpenEBS (Mayastor).

None of it is particularly ground breaking just a lot of pieces brought together.

jamesblonde
·
2 months ago
·
[ - ]

The difference is scale-out metadata in the filesystem. Alluxio uses Raft, i believe, for metadata - that has to fit on a single server.

rfoo
·
2 months ago
·
[ - ]

3FS isn't particularly fast in mdbench, though. Maybe our FDB tuning skill is what to blame, or FUSE, I don't know, but it doesn't really matter.

The truly amazing part for me is combining NVMe SSD + RDMA + supports reading a huge batch of random offsets from a few already opened huge files efficiently. This is how you get your training boxes consuming 20~30GiB/s (and roughly 4 million IOPS).

rjzzleep
·
2 months ago
·
[ - ]

FUSE has traditionally been famously slow. I remember there were some changes that supposedly made it faster, but maybe that was just a certain fuse implementation.

jamesblonde
·
2 months ago
·
[ - ]

The block size is 4KB by default, which is a killer. We set it to 1MB or so by default - makes a huge difference.

https://github.com/logicalclocks/hopsfs-go-mount

objectivefs
·
2 months ago
·
[ - ]

There is also ObjectiveFS that supports FUSE and uses S3 for both data and metadata storage, so there is no need to run any metadata nodes. Using S3 instead of a separate database also allows scaling both data and metadata with the performance of the S3 object store.

halifaxbeard
·
2 months ago
·
[ - ]

OFS was a drop in replacement for EFS and tbh it's insanely good value for the problem space.

joatmon-snoo
·
2 months ago
·
[ - ]

nit: Colossus* for Google.

huntaub
·
2 months ago
·
[ - ]

I think that the author is spot on, there are a couple of dimensions in which you should evaluate these systems: theoretical limits, efficiency, and practical limits.

From a theoretical point of view, like others have pointed out, parallel distributed file systems have existed for years -- most notably Lustre. These file systems should be capable of scaling out their storage and throughput to, effectively, infinity -- if you add enough nodes.

Then you start to ask, well how much storage and throughput can I get with a node that has X TiB of disk -- starting to evaluate efficiency. I ran some calculations (against FSx for Lustre, since I'm an AWS guy) -- and it appears that you can run 3FS in AWS for about 12-30% cheaper depending on the replication factors that you choose against FSxL (which is good, but not great considering that you're now managing the cluster yourself).

Then, the third thing you start to ask is anecdotally, are people able to actually configure these file systems into the size of deployment that I want (which is where you hear things like "oh it's hard to get Ceph to 1 TiB/s") -- and that remains to be seen from something like 3FS.

Ultimately, I obviously believe that storage and data are really important keys to how these AI companies operate -- so it makes sense that DeepSeek would build something like this in-house to get the properties that they're looking for. My hope is that we, at Archil, can find a better set of defaults that work for most people without needing to manage a giant cluster or even worry about how things are replicated.

jamesblonde
·
2 months ago
·
[ - ]

Maybe AWS could start by making fast NVMes available - without requiring multi TB disks just to get 1 GB/s. S3FS experiments were run on 14 GB/s NVMe disks - an order of magnitude higher throughput than anything available in AWS today.

SSDs Have Become Ridiculously Fast, Except in the Cloud: https://news.ycombinator.com/item?id=39443679

kridsdale1
·
2 months ago
·
[ - ]

On my home LAN connected with 10gbps fiber between MacBook Pro and server, 10 feet away, I get about 1.5gbps vs the non-network speed of the disks of ~50 gbps. (Bits, not bytes)

I worked this out to the macOS SMB implementation really sucking. I set up a NFS driver and it got about twice as fast but it’s annoying to mount and use, and still far from the disk’s capabilities.

I’ve mostly resorted to abandoning the network (after large expense) and using Thunderbolt and physical transport of the drives.

dundarious
·
2 months ago
·
[ - ]

SMB/CIFS is an incredibly chatty, synchronous protocol. There are/were massive products built around mitigating and working around this when trying to use it over high latency satellite links (US military did/does this).

greenavocado
·
2 months ago
·
[ - ]

Is NFS out of the question?

kridsdale1
·
2 months ago
·
[ - ]

I have set it up but it’s not easy to get drivers working on a Mac.

insaneirish
·
2 months ago
·
[ - ]

What particular drivers are you referring to? NFS is natively supported in MacOS...

olavgg
·
2 months ago
·
[ - ]

That is true, though the implementation is weird.

I mount my nfs shares like this: sudo mount -t nfs -o nolocks -o resvport 192.168.1.1:/tank/data /mnt/data

-o nolocks Disables file locking on the mounted share. Useful if the NFS server or client does not support locking, or if there are issues with lock daemons. On macOS, this is often necessary because lockd can be flaky.

-o resvport Tells the NFS client to use a reserved port (<1024) for the connection. Some NFS servers (like some Linux configurations or *BSDs with stricter security) only accept requests from clients using reserved ports (for authentication purposes).

__turbobrew__
·
2 months ago
·
[ - ]

There are i4i instances in AWS which can get you a lot of IOPS with a smaller disk.

jamesblonde
·
2 months ago
·
[ - ]

Had a look - Baseline disk throughput is 78.12 MB/s. Max throughput (30 mins/day) is 1250 MB/s.

NVMe i bought for 150 dollars with 4 TBs capacity gives me 6000 MB/s sustained

https://docs.aws.amazon.com/ec2/latest/instancetypes/so.html

sgarland
·
2 months ago
·
[ - ]

That’s on the smallest instance. I’m sure there’s a reason they offer it, but I can’t think of why. On the largest instance (which IME is what people use with these), it’s 5000 MBps. The newer i7ie max out at 7500 MBps.

__turbobrew__
·
2 months ago
·
[ - ]

You are incorrect, the numbers you are quoted is EBS volume performance. iX instances have directly attached NVME volumes which are separate from EBS.

> NVMe i bought for 150 dollars

Sure, now cost out the rest of the server, the racks, the colocation space for racks, power, multiple AZ redundancy, a clos network fabric, network peering, the spare hardware for failures, off site backups, supply chain management, a team of engineers to design the system, a team of staff to physically rack new hardware and unrack it, a team of engineers to manage the network, on call rotations for all those teams.

Sure the NVME is just $150 bro.

jamesblonde
·
2 months ago
·
[ - ]

You claim I am incorrect, but you don't provide a reference or numbers, which I couldn't find.

__turbobrew__
·
2 months ago
·
[ - ]

AWS doesn't provide throughput numbers for the NVME on iX instances. You have to look at benchmarks or test it out yourself. Similar to packets per second limits which are not published either and can only be inferred through benchmarks.

ashu1461
·
2 months ago
·
[ - ]

Are these attached directly to your server or hosted separately ?

huntaub
·
2 months ago
·
[ - ]

i-series instances have direct-attached drives

KaiserPro
·
2 months ago
·
[ - ]

the other important thing to note is what is that filesystem designed to be used for?

For example 3FS looks like its optimised for read throughput (which makes sense, like most training workloads its read heavy.) write operations look very heavy.

Can you scale the metadata server, what are the cost of metadata operations? Is there a throttling mechanism to stop a single client sucking all of the metadata server's IO? Does it support locking? Is it a COW filesystem?

robinhoodexe
·
2 months ago
·
[ - ]

I’m interested in how it is compared to seaweedfs[1], which we use for storing weather data (about 3 PB) for ML training.

[1] https://github.com/seaweedfs/seaweedfs

rfoo
·
2 months ago
·
[ - ]

IMO they serve similar at a glance, but actually very different use cases.

SeaweedFS is more about amazing small object read performance because you effectively have no metadata to query to read an object. You just distribute volume id, file id (+cookie) to clients.

3FS is less extreme in this, supports actual POSIX interface, and isn't particularly good at how fast you can open() files. On the other hand, it shards files into smaller (e.g. 512KiB) chunks, demands RDMA NICs and makes reading randomly from large files scary fast [0]. If your dataset is immutable you can emulate what SeaweedFS does, but if it isn't then SeaweedFS is better.

[0] By scary fast I mean being able to completely saturate 12 PCIe Gen 4 NVMe SSD at 4K random reads on a single storage server and you can horizontally scale that.

jszymborski
·
2 months ago
·
[ - ]

I wonder how close to something like 3FS you can get by mounting SeaweedFS with S3FS, which mounts using FUSE.

https://github.com/s3fs-fuse/s3fs-fuse

rfoo
·
2 months ago
·
[ - ]

I'd estimate that there would be two orders of magnitude of difference in 4K random IOPS. If not three.

jszymborski
·
2 months ago
·
[ - ]

I'm assuming in favour of the DeepSeek FS?

huntaub
·
2 months ago
·
[ - ]

My guess is going to be that performance is pretty comparable, but it looks like Seaweed contains a lot more management features (such as tiered storage) which you may or may not be using.

stapedium
·
2 months ago
·
[ - ]

I’m just a small business & homelab guy, so I’ll probably never use one of these big distributed file systems. But when people start talking petabytes, I always wonder if these things are actually backed up and what you use for backup and recovery?

ted_dunning
·
2 months ago
·
[ - ]

It is common for the backup of these systems to be a secondary data center.

Remember that there are two purposes for backup. One is hardware failures, the second is fat fingers. Hardware failures are dealt with by redundancy which always involves keeping redundant information across multiple failure domains. Those domains can be as small as a cache line or as big as a data center. These failures can be dealt with transparently and automagically in modern file systems.

With fat fingers, the failure domain has no natural boundaries other than time. As such, snapshots kept in the file system are the best choice, especially if you have a copy-on-write that can keep snapshots with very little overhead.

There is also the special case of adversarial fat fingering which appears in ransomware. The answer is snapshots, but the core problem is timely detection since otherwise you may not have a single point in time to recover from.

ghugccrghbvr
·
2 months ago
·
[ - ]

Disaster at all?

ted_dunning
·
2 months ago
·
[ - ]

That's a form of hardware failure where the failure domain is a data center or region. It is always good to store enough redundant data outside of whatever failure domains you want to consider in your planning. That might be a device, a server, a rack, a room, a data center, a region or a country/regulatory domain.

Non-local storage is definitely worthwhile there. Snapshots can have additive value there because they may be more coherent states.

shermantanktop
·
2 months ago
·
[ - ]

Backup and recovery is a process with a non-zero failure rate. The more you test it, the lower the rate, but there is always a failure mode.

With these systems, the runtime guarantees of data integrity are very high and the failure rate is very low. And best of all, failure is constantly happening as a normal activity in the system.

So once you have data integrity guarantees that are better in you runtime system than your backup process, why backup?

There are still reasons, but they become more specific to the data being stored and less important as a general datastore feature.

Eikon
·
2 months ago
·
[ - ]

> why backup?

Because of mistakes and malicious actors...

overfeed
·
2 months ago
·
[ - ]

...and the "Disaster" in "Disaster recovery" may have been localized and extensive (fire, flooding, major earthquake, brownouts due to a faulty transformer, building collapse, a solvent tanker driving through the wall into the server room, a massive sinkhole, etc)

shermantanktop
·
2 months ago
·
[ - ]

Yes, the dreaded fiber vs. backhoe. But if your distributed file system is geographically redundant, you're not exposed to that, at least from an integrity POV. It sucks that 1/3 or 1/5 or whatever of your serving fleet just disappeared, but backup won't help with that.

overfeed
·
2 months ago
·
[ - ]

> But if your distributed file system is geographically redundant

Redundancy and backups are not the same thing! There's some overlap, but treating them as interchangeable will occasionally result in terrible outcomes, like when a config change that results in all 5/5 datacenters fragmenting and failing to create a quorum, then finding out your services have circular dependencies when you are trying to bootstrap foundational services. Local backups would solve this, each DC would load last known good config, but rebuilding consensus necessary for redundancy requires coordination from now-unreachable hosts.

huntaub
·
2 months ago
·
[ - ]

Well, for active data, the idea is that the replication within the system is enough to keep the data alive from instance failure (assuming that you're doing the proper maintenance and repairing hosts pretty quickly after failure). Backup and recovery, in that case, is used more for saving yourself against fat-fingering an "rm -rf /" type command. Since it's just a file system, you should be able to use any backup and recovery solution that works with regular files.

KaiserPro
·
2 months ago
·
[ - ]

Depends on what the data is.

Because of the replication factor here, I assume that this filesystem is optimised for read throughput rather than capacity. Either way, there is a concept of "nearline" storage. Its a storage tier that is designed to be only really accesed by a backupagent. The general idea is that it stores a snapshot of the main file system every n hours.

After that you have as many snapshots as you can afford.

dilyevsky
·
2 months ago
·
[ - ]

> what you use for backup and recovery

Speaking from experience working at a hyperscaler - 1. cross-regional mirroring 2. Good old tape backups

londons_explore
·
2 months ago
·
[ - ]

This seems like a pretty complex setup with lots of features which aren't obviously important for a deep learning workload.

Presumably the key necessary features are PB's worth of storage, read/write parallelism (can be achieved by splitting a 1PB file into say 10,000 100GB shards, and then having each client only read the necessary shards), and redundancy

Consistency is hard to achieve and seems to have no use here - your programmers can manage to make sure different processes are writing to different filenames.

threeseed
·
2 months ago
·
[ - ]

> Consistency is hard to achieve and seems to have no use here

Famous last words.

It is very common when operating data platforms like this at this scale to lose a lot of nodes over time especially in the cloud. So having a robust consistency/replication mechanism is vital to making sure your training job doesn't need to be restarted just because the block it needs isn't on the particular node.

ted_dunning
·
2 months ago
·
[ - ]

Sadly, these are often Famous First words.

What follows is a long period of saying "see, distributed systems are easy for genius developers like me"

The last words are typically "oh shit", shortly followed oxymoronically by "bye! gotta go"

londons_explore
·
2 months ago
·
[ - ]

indeed redundancy is fairly important (although the largest part, the training data, actually doesn't matter if chunks are missing).

But the type of consistency they were talking about is strong ordering - the type of thing you might want on a database with lots of people reading and writing tiny bits of data, potentially the same bits of data, and you need to make sure a users writes are rejected if impossible to fulfil, and reads never return an impossible intermediate state. That isn't needed for machine learning.

sungam
·
2 months ago
·
[ - ]

I wonder whether it may have been originally developed for the quantitive hedge fund

huntaub
·
2 months ago
·
[ - ]

Yes, I think this is probably true. I've worked with a lot of different hedge funds who have a similar problem -- lots of shared data that they need in a file system so that they can do backtesting of strategies with things like kdb+. Generally, these folks are using NFS which is kind of a pain -- especially for scaleability -- so building your own for that specific use case (which happens to have a similar usage pattern for AI training) makes a lot of sense.

ammo1662
·
2 months ago
·
[ - ]

Yes, as I mentioned in other comments. The 3FS was designed in 2019. You can check [0] (Chinese)

[0] https://www.high-flyer.cn/blog/3fs/

randomtoast
·
2 months ago
·
[ - ]

Why not use CephFS instead? It has been thoroughly tested in real-world scenarios and has demonstrated reliability even at petabyte scale. As an open-source solution, it can run on the fastest NVMe storage, achieving very high IOPS with 10 Gigabit or faster interconnect.

I think their "Other distributed filesystem" section does not answer this question.

charleshn
·
2 months ago
·
[ - ]

Because it's actually fairly slow.

Among other things, the OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle modern NVMe IO throughput and IOPS.

For that you need zero-copy, RDMA etc.

Note that there is a next-generation OSD project called Crimson [0], however it's been a while, and I'm not sure how well it's going. It's based on the awesome Seastar framework [1], backing ScyllaDB.

Achieving such performance would also require many changes to the client (RDMA, etc).

Something like Weka [2] has a much better design for this kind of performance.

[0] https://ceph.io/en/news/crimson/

[1] https://seastar.io/

[2] https://www.weka.io/

__turbobrew__
·
2 months ago
·
[ - ]

With latest ceph releases I am able to saturate modern NVME devices with 2 OSD/NVME. It is kind of a hack to have multiple OSD per NVME, but it works.

I do agree that nvme-of is the next hurdle for ceph performance.

sgarland
·
2 months ago
·
[ - ]

I thought the current recommendation was to not have multiple OSDs per NVMe? Tbf I haven’t looked in a while.

I have 3x Samsung NVMe (something enterprise w/ PLP; I forget the model number) across 3 nodes, linked with an Infiniband mesh network. IIRC when I benchmarked it, I could get somewhere around 2000 MBps, bottlenecked by single-core CPU performance. Fast enough for homelab needs.

__turbobrew__
·
2 months ago
·
[ - ]

https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

I benchmarked 1 vs 2 OSD and found 2 OSD was better. I don’t think it is recommended to run more than 2 OSD per NVME.

rthnbgrredf
·
2 months ago
·
[ - ]

> Because it's actually fairly slow.

"We were reading data at 635 GiB/s. We broke 15 million 4k random read IOPS."

Source: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/

I don't know man, I think 15M random read IOPS is actually quite fast. I've built multi million IOPS clusters in enterprise settings all on nvme in the past.

rfoo
·
2 months ago
·
[ - ]

> I think 15M random read IOPS is actually quite fast

680x NVMe SSDs over 68 storage servers (so 68 CPUs) for just 15M (or 25M, tuned) random read IOPS is pretty underwhelming. The use cases where 3FS (or some other custom designs) shine are more like, 200M random read IOPS with 64 servers each with 8 PCIe gen 4 NVMe SSDs (512x SSDs in total).

skrtskrt
·
2 months ago
·
[ - ]

DigitalOcean uses Ceph underneath their S3 and block volume products. When I was there they had 2 teams just managing Ceph, not even any of the control plane stuff built on top.

It is a complete bear to manage and tune at scale. And DO never greenlit offering anything based on CephFS either because it was going to be a whole other host of things to manage.

Then of course you have to fight with the maintainers (Red Hat devs) to get any improvements contributed, assuming you even have team members with the requisite C++ expertise.

Andys
·
2 months ago
·
[ - ]

Ceph is massively over-complicated, if I had two teams I'd probably try and write one from scratch instead.

skrtskrt
·
2 months ago
·
[ - ]

Most of the legitimate datacenter-scale direct Ceph alternatives unfortunately are proprietary, in part because it takes so much money and human-expertise-hours to even be able to prove out that scale, they want to recoup costs and stay ahead.

Minio is absolutely not datacenter-scale and I would not expect anything in Go to really reach that point. Garbage collection is a rough thing at such enormous scale.

I bet we'll get one in Rust eventually. Maybe from Oxide computer company? Though despite doing so much OSS, they seem to be focused around their specific server rack OS, not general-purpose solutions

steveklabnik
·
2 months ago
·
[ - ]

> I bet we'll get one in Rust eventually. Maybe from Oxide computer company?

Crucible is our storage service: https://github.com/oxidecomputer/crucible

RFD 60, linked in the README, contains a bit of info about Ceph, which we did evaluate: https://rfd.shared.oxide.computer/rfd/0060

sgarland
·
2 months ago
·
[ - ]

Fascinating! I thought you were using RAIDZ3 with some kind of clever wrapper (or just DRBD), but it’s much more complex than that.

steveklabnik
·
2 months ago
·
[ - ]

It's not an area I personally work on, but yeah, there's a lot going on. And there will be more in the future, for example, I believe right now we ensure data integrity ourselves, but if you're running something (like Ceph) that does that on its own, you're paying for it twice. And so giving people options like that is important. It's a pretty interesting part of the space!

zackangelo
·
2 months ago
·
[ - ]

The 3FS chunk engine is written in Rust.

tempest_
·
2 months ago
·
[ - ]

We have a couple ceph clusters.

If my systems guys are telling me the truth is it a real time sink to run and can require an awful lot of babysitting at times.

huntaub
·
2 months ago
·
[ - ]

IMO this is the problem with all storage clusters that you run yourself, not just Ceph. Ultimately, keeping data alive through instance failures is just a lot of maintenance that needs to happen (even with automation).

_joel
·
2 months ago
·
[ - ]

I admin'd a cluster about 10 years back and it was 'ok' then, around bluestore. One issue was definitely my mistake but it wasn't all that bad.

elashri
·
2 months ago
·
[ - ]

CERN use CephFS with ~50PB for different applications and they are happy with it.

dfc
·
2 months ago
·
[ - ]

I thought they used ceph too. But I started looking around and it seems like they have switched to CernVM-FS and in house solution. I'm not sure what changed.

amadio
·
2 months ago
·
[ - ]

CERN is a heavy user of ceph, with about 100PB of data across cephfs, object stores (used as backend for S3), and block storage (mostly for storage for VMs). CVMFS (https://cernvm.cern.ch/fs/) is used to distribute the software stacks used by LHC experiments across the WLCG (Worldwide LHC Computing Grid), and is back by S3 with ceph for its storage needs. Physics data, however, is stored on EOS (https://eos.web.cern.ch) and CERN just recently crossed the 1EB mark of raw disk storage managed by EOS. EOS is also used as the storage solution for CERNBox (https://cernbox.web.cern.ch/), which holds user data. Data analyses use ROOT and read the data remotely from EOS using XRootD (https://github.com/xrootd/xrootd), as EOS is itself based on XRootD. XRootD is very efficient to read data across the network compared to other solutions. It is also used by other experiments beyond high energy physics, for example by LSST in its clustered database called Qserv (https://qserv.lsst.io).

elashri
·
2 months ago
·
[ - ]

They didn't switch, they use both for different needs. EOS (CVMFS) is used mainly for physics data storage and user data. Ceph is used for many other things like infrastructure, selfhosted apps..etc.

snthpy
·
2 months ago
·
[ - ]

Similar to the SeaweedFS question in sibling comment, how does this compare to JuiceFS?

In particular for my homelab setup I'm planning to run JuiceFS on top of S3 Garage. I know garage is only replication without any erasure coding or sharding so it's not really comparable but I don't need all that and it looked at lot simpler to set up to me.

huntaub
·
2 months ago
·
[ - ]

It's a very different architecture. 3FS is storing everything on SSDs, which makes it extremely expensive but also low latency (think ~100-300us for access). JuiceFS stores data in S3, which is extremely cheap but very high latency (~20-60ms for access). The performance scalability should be pretty similar, if you're able to tolerate the latency numbers. Of course, they both use databases for the metadata layer, so assuming you pick the same one -- the metadata performance should also be similar.

snthpy
·
2 months ago
·
[ - ]

Thanks, that's good to know.

For the most part I can live with the latency. For frequently or repeatedly accessed files I believe JuiceFS also has a caching layer to bring down the latency on those. My use cases are quite different from what DeepSeek requires. I was just curious whether I needed to investigate 3FS since it's the new kid on the block.

What appeals to me about JuiceFS is that I can easily mount it into my containers for a distributed FS on a small cluster for some resilience. It looked simpler than Ceph or GlusterFS for that and it's backed by S3 so that I could swap out my S3 Garage setup for a cloud based S3 backend if required.

seethishat
·
2 months ago
·
[ - ]

How easy is it to disable DeepSeek's distributed FS? Say for example a US college has been authorized to use DeepSeek for research, but must ensure no data leaves the local research cluster filesystem?

Edit: I am a DeepSeek newbie BTW, so if this question makes no sense at all, that's why ;)

ikeashark
·
2 months ago
·
[ - ]

I might need more clarification, but if one is paranoid or is dealing with this sensitive of information the DeepSeek model and 3FS are able to be deployed locally offline and not connected to the internet.

seethishat
·
2 months ago
·
[ - ]

Thank you, that answers my question.

ajcp
·
2 months ago
·
[ - ]

DeepSeek is a company. This article is about a distributed file system they have developed. It is a separate, unrelated piece software from their open-weight models(DeepSeek-R1, DeepSeek-V3, etc).

In your example it is likely the US college has been authorized to use a DeepSeek model for research, not the DeepSeek 3FS distributed file system.

dang
·
2 months ago
·
[ - ]

Related. Others?

Understanding Smallpond and 3FS - https://news.ycombinator.com/item?id=43232410 - March 2025 (47 comments)

Smallpond – A lightweight data processing framework built on DuckDB and 3FS - https://news.ycombinator.com/item?id=43200793 - Feb 2025 (73 comments)

Fire-Flyer File System (3FS) - https://news.ycombinator.com/item?id=43200572 - Feb 2025 (101 comments)

nodesocket
·
2 months ago
·
[ - ]

Does this get us closer to fully and performant distributed LLMs?

Instead of spending ludicrous amounts on Nvidia GPUs, can just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking which is vastly less expensive.

Goodbye Nvidia moat!

dheera
·
2 months ago
·
[ - ]

What if mgmtd has a hardware failure? Is there a distributed file system where you can take out any 2 arbitrary nodes and it keeps running intact, waiting for the 2 failed nodes to be replaced?

JSR_FDED
·
2 months ago
·
[ - ]

What happens when you want to add capacity to a distributed file system? Do you have to add the extra drives spread out across the nodes? Do you have to use certain size increments?

vFunct
·
2 months ago
·
[ - ]

Can we replicate this with ZFS drives distributed across multiple machines?

eatonphil
·
2 months ago
·
[ - ]

As far as I'm aware ZFS does not scale out.

https://unix.stackexchange.com/a/99218

db48x
·
2 months ago
·
[ - ]

Yea, it wasn’t designed to scale that way.

In principle you could use fiberchannel to connect a really large number (2²⁴ iirc) of disks to a single server and then create a single ZFS pool using all of them. This lets you scale the _storage_ as high as you want.

But that still limits you to however many requests per second that your single server can handle. You can scale that pretty high too, but probably not by a factor of 2²⁴.

urlshortuk1234
·
2 months ago
·
[ - ]

[dead]

dkkergoog
·
2 months ago
·
[ - ]

[flagged]

mertleee
·
2 months ago
·
[ - ]

What are the odds 3fs is backdoored?

huntaub
·
2 months ago
·
[ - ]

I think that's a pretty odd concern to have. What would you imagine that looks like? If you're running these kinds of things securely, you should be locking down the network access to the hosts (they don't need outbound internet access, and they shouldn't need inbound access from anything except your application).

xpe
·
2 months ago
·
[ - ]

> I think that's a pretty odd concern to have.

Thinking about security risks is an odd concern to have?

huntaub
·
2 months ago
·
[ - ]

I think that worrying that a self-hosted file system has a backdoor to exfiltrate data is an odd concern. Security concerns are (obviously) normal, but you should not be exposing these kinds of services to the public internet (or giving them access to the public internet), eliminating the concern that it's giving your data away.

antonvs
·
2 months ago
·
[ - ]

Backdoors in self-hosted systems are one of the major state-level attack vectors, and there are many known examples of this.

The idea that "you should not be exposing these kinds of services to the public internet (or giving them access to the public internet)" is naive. Aside from the fact that this requires every user to show that level of diligence - a completely unrealistic expectation - if a system is backdoored, it can also have means of exfiltrating data beyond just sending it out on the public internet.

And then there's the obvious point that if you suspect a device is compromised, it's completely irresponsible to use it anyway and assume that you're going to be able to prevent unauthorized access.

As some examples, there are documented concerns about Huawei/ZTE routers, which have been banned in Australia, New Zealand, Japan, Taiwan and the US for that reason. Unauthorized third-party code was found in Juniper Networks firewalls. Fortinet had hardcoded admin credentials in its firewalls and VPNs - probably a self-inflicted mistake, but still useful to attackers. Similarly, Western Digital NAS devices had a hardcoded backdoor account. D-Link routers had such a backdoor in their device web interface. There are many more examples like this.

Snowden revealed some of the US government activities in this area. The US, Russia, China, North Korea and other countries have all been involved in attacks involving BIOS/UEFI firmware, router firmware, NAS, and manufacturing supply chains. Covert exfiltration has been involved in many of these cases, using techniques other than transmitting over the internet.

And of course there was the recent (reported late 2024 Salt Typhoon attack by China on US and other Western telecom networks, which relied on these kinds of techniques, and gained access to large amounts of data, including audio and text of targeted people.

xpe
·
2 months ago
·
[ - ]

> I think that worrying that a self-hosted file system has a backdoor to exfiltrate data is an odd concern.

Great security teams get paid to "worry", by which I mean "make a plan" for a given attack tree.

Your use of "odd" looks like a rhetorical technique to downplay legitimate security considerations. From my point of view, you are casually waving away an essential area of security. See:

https://www.paloaltonetworks.com/cyberpedia/data-exfiltratio...

> but you should not be exposing these kinds of services to the public internet (or giving them access to the public internet), eliminating the concern that it's giving your data away.

Let's evaluate the following claim: "a "properly" isolated system would, in theory, have no risk of data exfiltration. Now we have to define what we mean. Isolated from the network? Running in a sandbox? What happens when that system can interact with a compromised system? What happens when people mess up?

From a security POV, any software could have some component that is malicious, compromised, or backdoored. It might be in the project itself or in the supply chain. And so on.

Defense in depth matters. None of the above concerns are "odd". A good security team is going to factor these in.

P.S. If you mean "low probability" then just say that.

xmprt
·
2 months ago
·
[ - ]

I agree with the parent commenter. I think it's fair to be concerned with backdoors but this is a distributed file system which can be completely isolated from the outside world. If you're so worried, you could run it in an airgapped Faraday cage and do all your training in that environment. Just don't run it on any centrifuges.

I think these kinds of concerns are more valid for storage systems which serve online traffic or have some kind of connection to the outside world.

antonvs
·
2 months ago
·
[ - ]

Attackers, whether state-level, corporate, or criminal, all rely on ignorance like this. It also probably involves denial: you want to believe your systems are safe, so you delude yourself into thinking that minimal protections are sufficient.

I wrote a longer comment about this here: https://news.ycombinator.com/item?id=43745423

xpe
·
2 months ago
·
[ - ]

> If you're so worried, you could run it in an airgapped Faraday cage and do all your training in that environment. Just don't run it on any centrifuges.

Really? More rhetoric that minimizes legitimate security concerns?

Again, if someone wants to make the claim that such concerns are "low probability" that would at least be defensible.

lossolo
·
2 months ago
·
[ - ]

In this case, I think the concern about security is overly paranoid. DeepSeek isn't just some unknown nickname of a random developer on GitHub, it's a legitimate company that has made headlines, has a known CEO, publishes research, and is actively trying to attract talent. They've open-sourced a lot of their work, including 3FS, which is fully available on GitHub. So while a backdoor is theoretically possible (just like an asteroid hitting Earth), I think the original poster's question is exaggerated and likely influenced by the fact that the company is Chinese.

xpe
·
2 months ago
·
[ - ]

> and likely influenced by the fact that the company is Chinese

Let's rephrase this. It is not simply that DeepSeek is a Chinese company. It is because of its links to the CCP [1] [2] and the CCP's cyber operations.

[1]: https://www.scmp.com/news/china/politics/article/3306943/chi...

[2]: https://www.journalofdemocracy.org/online-exclusive/why-deep...

·
2 months ago
·
[ - ]

xpe
·
2 months ago
·
[ - ]

DeepSeek has been caught making backdoors.

https://www.cisecurity.org/insights/blog/deepseek-a-new-play...

xpe
·
2 months ago
·
[ - ]

> DeepSeek isn't just some unknown nickname

> just like an asteroid hitting Earth

So much one-sided rhetoric.

·
2 months ago
·
[ - ]

MaxPock
·
2 months ago
·
[ - ]

By, NSA or Britain's GCHQ which wants all software backdoored?

xpe
·
2 months ago
·
[ - ]

Don't forget to add the CCP to the party. (DeepSeek has ties to the CCP.)

jack_pp
·
2 months ago
·
[ - ]

I don't have direct experience with distributed file systems but it so happens I did a tiny bit of research in the past month and.. there are quite a few open source ones available. Would've been nice for the authors to explain why the already existing solutions didn't work for them.

dboreham
·
2 months ago
·
[ - ]

They have an HFT background so probably it was developed long ago for that workload (which tends to be outside the design envelope for off the shelf solutions).