I want everything local – Building my offline AI workspace

1143
286
mkagenius
3 weeks ago
instavm.io

andylizf
·
3 weeks ago
·
[ - ]

This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.

A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.

(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.

Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."

Code: https://github.com/yichuan-w/LEANN Paper: https://arxiv.org/abs/2405.08051

doctoboggan
·
3 weeks ago
·
[ - ]

> A vector database for years of emails can easily exceed 50GB.

In 2025 I would consider this a relatively meager requirement.

andylizf
·
3 weeks ago
·
[ - ]

Yeah, that's a fair point at first glance. 50GB might not sound like a huge burden for a modern SSD.

However, the 50GB figure was just a starting point for emails. A true "local Jarvis," would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.

For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.

The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.

notsylver
·
3 weeks ago
·
[ - ]

You already need very high end hardware to run useful local LLMs, I don't know if a 200gb vector database will be the dealbreaker in that scenario. But I wonder how small you could get it with compression and quantization on top

wafflemaker
·
3 weeks ago
·
[ - ]

I'm no dev either and still set up remote ssh login to be able to use LaTeX at home PC from my laptop.
Also, with many games and dual boot on my gaming PC I still have some space left on my 2TB NVME SSD. And my not enthusiast MOBO could fit two more.

It took so much time to install LaTeX and packages, and also so much space, my 128GB drive couldn't handle it.

mwcz
·
3 weeks ago
·
[ - ]

I've worked in other domains my whole career, so I was astonished this week when we put a million 768-len embeddings into a vector db and it was only a few GB. Napkin math said ~25 GB and intuition said a long list of widely distributed floats would be fairly uncompressable. HNSW is pretty cool.

OneDeuxTriSeiGo
·
3 weeks ago
·
[ - ]

You can already do A LOT with an SLM running on commodity consumer hardware. Also it's important to consider that the bigger an embedding is, the more bandwidth you need to use it at any reasonable speed. And while storage may be "cheap", memory bandwidth absolutely is not.

varenc
·
3 weeks ago
·
[ - ]

> You already need very high end hardware to run useful local LLMs

A basic macbook can run gpt-oss-20b and it's quite useful for many tasks. And fast. Of course Macs have a huge advantage for local LLMs inference due to their shared memory architecture.

derefr
·
3 weeks ago
·
[ - ]

The mid-spec 2025 iPhone can run “useful local LLMs” yet has 256GB of total storage.

(Sure, this is a spec distortion due to Apple’s market-segmentation tactics, but due to the sheer install-base, it’s still a configuration you might want to take into consideration when talking about the potential deployment-targets for this sort of local-first tech.)

felarof
·
2 weeks ago
·
[ - ]

You should definitely checkout BrowserOS! -- https://github.com/browseros-ai/BrowserOS

derefr
·
3 weeks ago
·
[ - ]

Question: would it be possible to invert the problem? I.e., rather than decreasing the size of the RAG — use the RAG to compress everything other than the RAG index itself.

E.g., design a filesystem so that the RAG index is part of / managed internally within the metadata of the filesystem itself; and then, for each FS inode data-extent, give it two polymorphic on-disk representations:

1. extents hold raw data; rag-vectors are derivatives and updated after extent is updated (as today)

2. rag-vectors are canonical; extents hold residuals from a predictive-coding model that took the rag-vectors as input and tried to regenerate the raw data of the extent. When extent is read [or partially overwritten], use predictive-coding model to generate data from vectors and then repair it with residue (as in modern video-codec p-frame generation.)

———

Of course, even if this did work (in the sense of providing a meaningful decrease in storage use), this storage model would only really be practical for document files that are read entirely on open and atomically overwritten/updated (think Word and Excel docs, PDFs, PSDs, etc), not for files meant to be streamed.

But, luckily, the types of files this technique are amenable to are exactly the same types of files that a “user’s documents” RAG would have any hope of indexing in the first place!

PeterStuer
·
3 weeks ago
·
[ - ]

While your aims are undoutably sincere, in practice for the 'local ai' target people building their own rigs usually have. 4TB or more fast ssd storage.

The bottom tier (not meant disparagingly) are people running diffusion models as these do not have the high vram requirements. They generate tons of images or video, going form a one-click instally like Easydiffusion to very sophisticated workflows in comfyui.

For those going the LLM route, which would be your target audience, they quickly run into the problemm that to go beyond toying around, the hardware and software requirements and expertise grows exponential beyong just toying around with small, highly quantized model with small context windows.

Inlight of the typical enthusiast investments in this space, the few TB of fast storage will pale in comparison to the rest of the expenses.

Again, your work is absolutely valuable, it is just that the storage space requirement for the vector store in this particular scenario is not your strongest card to play.

imoverclocked
·
3 weeks ago
·
[ - ]

Everyone benefits from focusing on efficiency and finding better ways of doing things. Those people with 4TB+ of fast storage can now do more than they could before as can the "bottom tier."

It's a breath of fresh air anytime someone finds a way to do more with less rather than just wait for things to get faster and cheaper.

PeterStuer
·
3 weeks ago
·
[ - ]

Of course. And I am not arguing against that at all. Just like if someone makes an inference runtime that is 4% faster, I'll take that win. But would it be the decisive factor in my choice? Only if that was my bottleneck, my true constraint.

All I tried to convey was that for most of the people in the presented scenario (personal emails etc.) , a 50 or even 500GB storage requirement is not going to be that primary constraint. So the suggestion was the marketing for this usecase might be better spotlighting also something else.

ricardobeat
·
3 weeks ago
·
[ - ]

You are glossing over the fact that for RAG you need to search over those 500GB+ which will be painfully slow and CPU-intensive. The goal is fast retrieval to add data to the LLM context. Storage space is not the sole reason to minimize the DB size.

brookst
·
3 weeks ago
·
[ - ]

You’re not searching over 500GB, you’re searching an index of the vectors. That’s the magic of embeddings and vector databases.

Same way you might have a 50TB relational database but “select id, name from people where country=‘uk’ and name like ‘benj%’ might only touch a few MB of storage at most.

ricardobeat
·
3 weeks ago
·
[ - ]

That’s precisely the point I tried to clear up in the previous comment.

The LEANN author proposes to create a 9GB index for a 500GB archive, and the other poster argued that it is not helpful because “storage is cheap”.

·
3 weeks ago
·
[ - ]

brabel
·
3 weeks ago
·
[ - ]

Speak for yourself! If it took me 500GB to store my vectors , on top of all my existing data, it would be a huge barrier for me.

hdgvhicv
·
3 weeks ago
·
[ - ]

A 4tb external drive is £100. A 1TB sd card or usb stick a similar cost.

Maybe Im too old to appreciate what “fast” means, but storage doesnt seem an enormous cost once you stripe it.

mockingloris
·
3 weeks ago
·
[ - ]

This "...doesn't seem an enormous cost once you stripe it." gave me an idea. I KNOW that I will come back to link a blog post about it in the future.

xandrius
·
3 weeks ago
·
[ - ]

Maybe time to update your storage?

mattlutze
·
2 weeks ago
·
[ - ]

The DGX Spark being just $3-4,000 with 4TB of storage, 128GB unified memory, etc (or the Mac Studio tbh) is a great indicator that Local AI can soon be cheap and, along with the emerging routing and expert mixing strategies, incredibly performant for daily needs.

42lux
·
2 weeks ago
·
[ - ]

That's the size of just two or three triple A games nowadays.

snoman
·
3 weeks ago
·
[ - ]

Take whatever you're indexing and make it 16-20x and that’s a good approximation of what the vector db’s total size is going to be.

jononor
·
2 weeks ago
·
[ - ]

Why is it like that, currently? There is no information added by a vector index compared to the original text. And the text is highly redundant and compressible with even lossless functions. Furthermore a vector index is already lossy and approximate. So conceptually it is at least possible to have an index that would be a fraction of the size of what is indexed?

snoman
·
2 weeks ago
·
[ - ]

There is some information added, depending on the vector db and context (some systems will add permissions related metadata so that the LLM won’t pull chunks that the user didn’t have access to).

The vector itself is pretty large (512 dimensions).

The chunks have an overlap (iirc 30% but someone feel free to correct me).

I don’t _think_ the data is typically compressed (not sure why but I assume performance).

mccoyb
·
3 weeks ago
·
[ - ]

That can't be the correct paper...

I think you meant this: https://arxiv.org/abs/2506.08276

johnfn
·
3 weeks ago
·
[ - ]

No no, getting your entire workflow local requires solving P=NP.

antoniojtorres
·
3 weeks ago
·
[ - ]

Wait how?

janderson215
·
3 weeks ago
·
[ - ]

Just some sarcasm. You can safely disregard if you didn’t get a chuckle.

andylizf
·
3 weeks ago
·
[ - ]

Yeah that's it. My bad lol

oblio
·
3 weeks ago
·
[ - ]

It feels weird that the search index is bigger than the underlying data, weren't search indexes supposed to be efficient formats giving fast access to the underlying data?

andylizf
·
3 weeks ago
·
[ - ]

Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.

iezepov
·
3 weeks ago
·
[ - ]

Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.

catlifeonmars
·
3 weeks ago
·
[ - ]

Why would the embeddings be higher dimensionally than the data? I imagine the embeddings would contain relatively higher entropy (and thus lower redundancy) than many types of source data.

cm228
·
3 weeks ago
·
[ - ]

depends on the chunk-size used to create the embedding.

yichuan
·
3 weeks ago
·
[ - ]

I guess for semantic search(rather than keyword search), the index is larger than the text because we need to embed them into a huge semantic space, which make sense to me

brookst
·
3 weeks ago
·
[ - ]

Nonclustered indexes in RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering it’s still a win.

psychoslave
·
3 weeks ago
·
[ - ]

Why is that considred relevant to get a RAG of people digital traces burdening them in every single interactions they have with a computer?

Having locally distributed similar grounds is one thing. Push everyone to much in its own information bubble, is an other orthogonal topic.

When someone mind recall about that email from years before, having the option to find it again in a few instants can interesting. But when the device is starting to funnel you through past traces, then it doesn't matter much whether it the solution is in local or remote: the spontaneous thought flow is hijacked.

In mindset dystopia, the device prompts you.

solarkraft
·
3 weeks ago
·
[ - ]

Since it’ll be local, this behavior can be controlled. I for one find the option of it digging through my personal files to give me valuable personal information attractive.

wfn
·
3 weeks ago
·
[ - ]

Thank you for the pointer to LEANN! I've been experimenting with RAGs and missed this one.

I am particularly excited about using RAG as the knowledge layer for LLM agents/pipelines/execution engines to make it feasible for LLMs to work with large codebases. It seems like the current solution is already worth a try. It really makes it easier that your RAG solution already has Claude Code integration![1]

Has anyone tried the above challenge (RAG + some LLM for working with large codebases)? I'm very curious how it goes (thinking it may require some careful system-prompting to push agent to make heavy use of RAG index/graph/KB, but that is fine).

I think I'll give it a try later (using cloud frontier model for LLM though, for now...)

[1]: https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...

OldfieldFund
·
3 weeks ago
·
[ - ]

I'm gonna put it here for visibility: Use patchright instead of Playwright: https://github.com/Kaliiiiiiiiii-Vinyzu/patchright

bamboozled
·
3 weeks ago
·
[ - ]

What problem does patchright solve?

jdelsman
·
3 weeks ago
·
[ - ]

Not being detected by things like bot detection.

wy1346
·
3 weeks ago
·
[ - ]

This looks incredibly useful for making large-scale local AI truly practical.

jychang
·
3 weeks ago
·
[ - ]

This is annoyingly Apple-only though. Even though my main dev machine is a Macbook, this would be a LOT more useful if it was a Docker container.

I'd still take a Docker container over an Apple container, because even though docker is not VM-level-secure, it's good enough for running local AI generated code. You don't need DEFCON Las Vegas levels of security for that.

And also because Docker runs on my windows gaming machine with a fast GPU with WSL ubuntu, and my linux VPS in the cloud running my website, etc etc. And most people have already memorized all the basic Docker commands.

This would be a LOT better if it was just a single docker command we can copy paste, run it a few times, and then delete if necessary.

glhaynes
·
2 weeks ago
·
[ - ]

I’m no expert on these things, but since Apple Containerization uses OCI images, I’d think you’d be able to sub in Docker (or Podman, etc) as the runtime pretty trivially. Like Podman, it uses a very similar command line interface to Docker’s.

Edit: Oh, I see now that Coderunner is Apple Containerization-specific.

sebmellen
·
3 weeks ago
·
[ - ]

I know next to nothing about embeddings.

Are there projects that implement this same “pruned graph” approach for cloud embeddings?

NJL3000
·
3 weeks ago
·
[ - ]

It’s in the works… been meaning to do a show HN moment to see if it flies or I Fall on my face..

unixhero
·
3 weeks ago
·
[ - ]

I have 26tb hardrives, 50gb doesnt scare me. Or should I be?

technocratius
·
3 weeks ago
·
[ - ]

I think you'd want things in RAM for performance reasons but would love to be corrected by people with more knowledge/experience on the subject

unixhero
·
3 weeks ago
·
[ - ]

Oh the number was memory space? That changes the maths a little bit. But I do have 50gb available for a model no problem whatsoever. 384gb is the new 32gb.

com2kid
·
3 weeks ago
·
[ - ]

> Even with help from the "world's best" LLMs, things didn't go quite as smoothly as we had expected. They hallucinated steps, missed platform-specific quirks, and often left us worse off.

This shows how little native app training data is even available.

People rarely write blog posts about designing native apps, long winded medium tutorials don't exist, heck even the number of open source projects for native desktop apps is a small percentage compared to mobile and web apps.

Historically Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but now days that entire industry is almost dead.

These types of holes in training data are going to be a larger and larger problem.

Although this is just representative of software engineering in general - few people want to write native desktop apps because it is a career dead end. Back in the 90s knowing how to write Windows desktop apps was great, it was pretty much a promised middle class lifestyle with a pretty large barrier to entry (C/C++ programming was hard, the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs), but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc), very few jobs exist for writing desktop apps.

Aurornis
·
3 weeks ago
·
[ - ]

You left out the next lines, which add some important context:

> Then we tried wrapping a NextJS app inside Electron. It took us longer than we'd like to admit. As of this writing, it looks like there's just no (clean) way to do it.

> So, we gave up on the Mac app.

They weren't writing a fully native app. They started with a NextJS web app and then tried to put it inside Electron, a cross-platform toolkit.

All the training data in the world about native app development wouldn't have helped here. They were using a recent JS framework and trying to put it in a relatively recent cross-platform tool. The two parts weren't made to work together so training data likely doesn't exist, other than maybe some small amount of code or issues on GitHub discussing problems with the approach.

pbronez
·
3 weeks ago
·
[ - ]

I thought that was odd too. There are lots of ChatGPT clones implemented as native MacOS apps.

The main advancement in TFA is using the new Container Swift API for local tool use. That functionality would probably be a welcome contribution to any of these:

https://github.com/Renset/macai

https://github.com/huggingface/chat-macOS

https://github.com/SidhuK/WardenApp

https://github.com/psugihara/FreeChat

Aurornis
·
3 weeks ago
·
[ - ]

I think they started with what they knew (web app development) and then wanted to wrap it into a standalone app later.

WhyNotHugo
·
3 weeks ago
·
[ - ]

> This shows how little native app training data is even available.

FWIW, we have very few desktop native apps nowadays. Most apps are either mobile, cli or web-based. Heck, I’m sure there’s more material online on writing cli apps than gui apps.

thorncorona
·
3 weeks ago
·
[ - ]

I mean outside of HPC why would you when the browser is the world’s most ubiquitous VM?

esseph
·
3 weeks ago
·
[ - ]

Because the browser is gross and you can reclaim lot of performance and security when you don't need to use it.

moffkalast
·
3 weeks ago
·
[ - ]

Sure but you're also constrained to only one platform. It's like the C++ vs Python argument in ML, yes writing everything in low level high speed highly optimized native code would be perfect, but ain't (almost) nobody got fucking time or skill for that.

esseph
·
3 weeks ago
·
[ - ]

"Lack of skill" is a real problem I've seen grow over the past decade.

No matter the company I'm with or in conversations with others at other places, there just hasn't been a solid intake of junior programmers / sysadmins / network engineers / etc.

Which sucks, because now there's very few junior staff to teach, which makes backfills harder.

Any junior positions that do seem to happen are just a money funnel to offshoring and the results are /mostly/ less than stellar and ultimately aren't setup to solve the knowledge transfer problem in a meaningful, long-term way.

senko
·
3 weeks ago
·
[ - ]

Cross-platform toolkits are (still) a thing.

zelphirkalt
·
2 weeks ago
·
[ - ]

Recently I tried to make a GTK app, but the problem was, for none of the languages I tried the bindings were working well enough. So in the end I decided to make a local first static web app in Python and Django. Everything is rendered server side and state is stored in the database. If I ever finish it, it should be easy to bring it online. And then maybe registrations ...

moffkalast
·
3 weeks ago
·
[ - ]

Yeah they're called Electron now ;)

Qt is such a pain to work with it's almost like it's intentional that people should avoid it.

rubymamis
·
2 weeks ago
·
[ - ]

I can't disagree more. I've written extensively about the joy of programming using Qt in my blog post: https://rubymamistvalove.com/block-editor

esseph
·
3 weeks ago
·
[ - ]

I mean, why aren't the apps on your phone all just webapps, right? (Also, eww)

jakelazaroff
·
3 weeks ago
·
[ - ]

Mostly because native apps can track you far more invasively than web apps can, and companies are hungry for your private data.

esseph
·
3 weeks ago
·
[ - ]

Not sure I agree with that.

It's a lot better on battery life and superior experience, especially if you are traveling or around areas with bad cell service.

Cookies track me around on websites all the time + modern telemetry is pretty crazy.

sillyfluke
·
3 weeks ago
·
[ - ]

>Not sure I agree with that. It's a lot better on battery life...

The parent is talking about privacy and your first counter argument is privacy irrelevant battery life?

The tracking and telemetry abundance in native far exceeds the browser. Nevermind a lot of apps remain running in background because the user forgets or can't be bothered to close them.

Follow the money. Why are random companies begging me to download their mobile app and get ridiculous discounts in the process whenever I use their website? Why are weather apps known to be spyware vectors but weather websites don't have that stigma?

r_lee
·
3 weeks ago
·
[ - ]

The permissions that apps can get on Android even by default are pretty invasive, like querying other apps/processes and etc iirc...

esseph
·
3 weeks ago
·
[ - ]

Chrome being able to scan your network on desktop is still insane to me.

spauldo
·
3 weeks ago
·
[ - ]

A lot of us just don't want to be web developers. I mostly write IEC 61131 code, with sprinkles of BASIC (yuck), C, Perl, and Lisp. I've used JavaScript and quite frankly, you can keep it.

typpilol
·
3 weeks ago
·
[ - ]

Does anyone else think javascript bad? Wow brave!

spauldo
·
3 weeks ago
·
[ - ]

Having personal preferences is brave? I've got tons of those! Maybe I'll go start some bar fights.

wolvesechoes
·
3 weeks ago
·
[ - ]

If you want something better than UI designed for toddlers

anthk
·
3 weeks ago
·
[ - ]

Offices when the performance matters against shitty web apps.

t_mann
·
3 weeks ago
·
[ - ]

Great effort, a strong self-hosting community for LLMs is going to be similarly important as the FLOSS movement imho. But right now I feel the bigger bottleneck is on the hardware side rather than software. The amount of fast RAM that you need for decent models (80b+ params) is just not something that's commonly available for consumer hardware right now, not even gaming machines. I heard that Macs (minis) are great for the purpose, but you don't really get them with enough RAM or at prices that don't really qualify as consumer-grade anymore. I've seen people create home clusters (eg using Exo [0]), but I wouldn't really call it practical (single digit token/sec for large models, and the price isn't exactly accessible either). Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.

[0] https://github.com/exo-explore/exo

Aurornis
·
3 weeks ago
·
[ - ]

With smaller models becoming more efficient and harder continually improving I think the sweet spot for local LLM computing will arrive in a couple years.

So many comments like to highlight that you can buy a Mac Studio with 512GB of RAM for $10K, but that's a huge amount of money to spend on something that still can't compete with a $2/hour rented cloud GPU server in terms of output speed. Even that will be lower quality and slower than the $20/month plan from the LLM provider of your choice.

The only reasons to go local are if you need it (privacy, contractual obligations, regulations) or if you're a hardcore hobbiest who values running it yourself over quality and speed of output.

> Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.

Framework is getting a lot of headlines for their brand recognition but there are a growing number of options with the same AMD Strix Halo part. Here's a random example I found from a Google search - https://www.gmktec.com/products/amd-ryzen%E2%84%A2-ai-max-39...

All of these are somewhat overpriced right now due to supply and demand. If the supply situation is alleviated they should come down in price.

They're great for what they are, but their memory bandwidth is still relatively limited. If the 128GB versions came down to $1K I might pick one up, but at the $2-3K price range I'd rather put that money toward upgrading my laptop to an M4 MacBook Pro with 128GB of RAM.

api
·
3 weeks ago
·
[ - ]

Prices are still coming down. Assuming that keeps happening we will have laptops with enough RAM in the sub-2k range in 5 years.

Question is whether models will keep getting bigger. If useful model sizes plateau eventually a good model becomes something at least many people can easily run locally. If models keep usefully growing this doesn’t happen.

The largest ones I see are in the 405g range which quantized fits in 256g RAM.

Long term I expect custom hardware accelerators designed specifically for LLMs to show up, basically an ASIC. If those got affordable I could see little USB-C accelerator boxes being under $1k able to run huge LLMs fast and with less power.

GPUs are most efficient for batch inference which lends itself to hosting not local use. What I mean is a lighter chip made to run small or single batch inference very fast using less power. The bottleneck there is memory bandwidth so I suspect fast RAM would be most of the cost of such a device. Small or single batch inference is memory bandwidth bound.

m-s-y
·
3 weeks ago
·
[ - ]

GPUs are already effectively ASICs for the math that runs both 3D scenes and LLMs, no?

zozbot234
·
3 weeks ago
·
[ - ]

What's the deal with Exo anyway? I've seen it described as an abandoned, unmaintained project.

Anyway, you don't really need a lot of fast RAM unless you insist on getting a real-time usable response. If you're fine with running a "good" model overnight or thereabouts, there are things you can do to get better use of fairly low-end hardware.

pbronez
·
3 weeks ago
·
[ - ]

Jeff Geerling just did a video with a cluster of 4 Framework Desktop main boards. He put a decent amount of work into Exo and concluded it’s a VC Rugpull… abandoned as soon as it won some attention.

He also explored several other open source AI scale out libraries, and reported that they’re generally way less mature than tooling for traditional scientific cluster computing.

https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram...

flanger
·
3 weeks ago
·
[ - ]

The founders of Exo ghosted the dev community and went closed-source. Nobody has heard from them. I wish people would stop recommending Exo (a tribute to their marketing) and check out GPUStack instead. Overall another rug pull by the devs as soon as they got traction.

zozbot234
·
3 weeks ago
·
[ - ]

Why can't that dev community just fork the project under a new name and maintain it properly? Picking up a third-party project is absolutely par for the course in FLOSS development.

fouc
·
3 weeks ago
·
[ - ]

There's a couple of alternatives to exo it seems https://github.com/b4rtaz/distributed-llama and https://github.com/ray-project/ray

m-s-y
·
3 weeks ago
·
[ - ]

It’s functional if your goal is to run models that won’t fit into RAM on a single machine. Functional.

the slow interconnects (yes, even at 40Gbps thunderbolt) severely limit both TtFT and tokens/second.

I tried it extensively for a few days, and ended up getting a single M3 Ultra Mac Studio, and am loving life.

graemep
·
3 weeks ago
·
[ - ]

You still need a lot of RAM though right? so its not going to be that cheap?

What sort of specs do you need?

shaky
·
3 weeks ago
·
[ - ]

This is something that I think about quite a bit and am grateful for this write-up. The amount of friction to get privacy today is astounding.

Aurornis
·
3 weeks ago
·
[ - ]

> The amount of friction to get privacy today is astounding

I don't understand this.

It's easy to get a local LLM running with a couple commands in the terminal. There are multiple local LLM runners to choose from.

This blog post introduces some additional tools for sandboxed code execution and browser automation, but you don't need those to get started with local LLMs.

There are multiple local options. This one is easy to start with: https://ollama.com/

apitman
·
2 weeks ago
·
[ - ]

> It's easy to get a local LLM running

Easy for what percentage of people?

sneak
·
3 weeks ago
·
[ - ]

This writeup has nothing of the sort and is not helpful toward that goal.

frank_nitti
·
3 weeks ago
·
[ - ]

I'd assume they are referring to being able to run your own workloads in a home-built system, rather then surrendering that ownership to the tech giants alone

Imustaskforhelp
·
3 weeks ago
·
[ - ]

Also you get a sort of complete privacy that the data never leaves your home too whereas at best you would have to trust the AI cloud providers that they are not training or storing that data.

Its just more freedom and privacy in that matter.

doctorpangloss
·
3 weeks ago
·
[ - ]

The entire stack involved sends so much telemetry.

frank_nitti
·
3 weeks ago
·
[ - ]

This, in particular, is a big motivator and rewarding factor in getting local setup and working. Turning off the internet and seeing everything run end to end is a joy

doctorpangloss
·
2 weeks ago
·
[ - ]

NVIDIA drivers send detailed telemetry.

Windows and macOS send detailed telemetry.

You have to install the pip packages and the models, which all come from websites, which collect detailed telemetry.

You don’t think Microsoft gathers detailed telemetry on all your interactions with GitHub?

The local setup doesn’t really help with that.

frank_nitti
·
2 weeks ago
·
[ - ]

We might be talking about two different things. Yes, under normal circumstances the setup steps involve software that defaults to using telemetry -- though I'd be surprised if it's not possible anymore to achieve those in an air-gapped env using e.g. offline installers, zipped repos and wheel files, etc.

My comment was referring to runtime workloads having no telemetry (because I unplugged the internet)

wkat4242
·
3 weeks ago
·
[ - ]

> whereas at best you would have to trust the AI cloud providers that they are not training or storing that data.

Yeah, about that. They even illegally torrented entire databases, hide their crawlers. Crawl entire newspaper archives without permission. They didn't respect the rights of big media companies. But they're going to respect the little guy's of course because it says to in the T&Cs. Uh-huh.

Also, openai already admitted that they do store "deleted" content and temporary chats.

Imustaskforhelp
·
3 weeks ago
·
[ - ]

I agree but I was just (repeating?) some argument that I heard that if the companies would actually not follow on their premise that they are actually safe if they said so (think amazon bedrock tos policy which says such)

Then it will cause an insane backlash and nobody would use the product. So it is in their interest to not train/record.

But yes I also agree with you. They are already torrenting :/ So pretty sure if they can do illegal stuff scott free, they might do this too idk,

And yeah this was why I was actually saying that local matters more tbh. You just get rid of such headache.

wkat4242
·
3 weeks ago
·
[ - ]

> Then it will cause an insane backlash and nobody would use the product. So it is in their interest to not train/record.

I don't think there would be that much backlash. People are getting hooked on it and many don't actually care about privacy.

We know about Google, meta and people still use them. Not a big dent in openai usage either since their revelations.

But I understand your point!

noelwelsh
·
3 weeks ago
·
[ - ]

It's the hardware more than the software that is the limiting factor at the moment, no? Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395) I think a few Strix Halo iterations will make it considerably easier.

ramesh31
·
3 weeks ago
·
[ - ]

>Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395) I think a few Strix Halo iterations will make it considerably easier.

And "good" is still questionable. The thing that makes this stuff useful is when it works instantly like magic. Once you find yourself fiddling around with subpar results at slower speeds, essentially all of the value is gone. Local models have come a long way but there is still nothing even close to Claude levels when it comes to coding. I just tried taking the latest Qwen and GLM models for a spin through OpenRouter with Cline recently and they feel roughly on par with Claude 3.0. Benchmarks are one thing, but reality is a completely different story.

colecut
·
3 weeks ago
·
[ - ]

This is rapidly improving

https://simonwillison.net/2025/Jul/29/space-invaders/

Imustaskforhelp
·
3 weeks ago
·
[ - ]

I hope it improves at such a steady rate! Please lets just hope that there is still room for improvement to packing even more improvements in such LLMS which can help the home labbing community in general.

Imustaskforhelp
·
3 weeks ago
·
[ - ]

I think I still prefer local but I feel like that's because that most AI inference is kinda slow or comparable to local. But I recently tried out cerebras or (I have heard about groq too) and honestly when you try things at 1000 tk/s or similar, your mental model really shifts and becomes quite impatient. Cerebras does say that they don't log your data or anything in general and you would have to trust me to say that I am not sponsored by them (Wish I was tho) Its just that they are kinda nice.

But I still hope that we can someday actually have some meaningful improvements in speed too. Diffusion models seem to be really fast in architecture.

vgb2k18
·
3 weeks ago
·
[ - ]

> Cerebras does say that they don't log your data or anything in general

Unil a judge says they must log everything, indefinitely

jama211
·
2 weeks ago
·
[ - ]

Avoiding cloud dependency seems like a huge amount of effort to make life harder for yourself, just to end up with a situation where you have to rely on other parts of the cloud to do any other part of your work or business anyway. I mean, why stop there? Unplug yourself from the grid so you don’t have to depend on water or electricity just in case the companies that provide it stop working.

Yeah, you could do that, but honestly the only world where you’re able to live off grid safely without being attacked and looted is a world where the rest of society hasn’t broken down and still has a grid to connect to. Similarly, the only world where you could succeed as a software developer is one where the cloud generally still functions so you may as well use cloud services that are convenient and right there.

You’re not the military with military secrets, you don’t have anything to gain from being independent from the cloud.

Interesting as a thought experiment, though.

mkummer
·
3 weeks ago
·
[ - ]

Super cool and well thought out!

I'm working on something similar focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model – all data/config/settings/prompts are fully stored locally and provider API calls are routed directly (never pass through our servers). Currently using mlc-llm for models & inference fully local in the browser (Qwen3-1.7b has been working great)

[1] https://hypersonic.chat/

jumploops
·
3 weeks ago
·
[ - ]

I'm a little confused about your product branding vs. blog post?

From the product homepage, I imagine you're running VMs in the cloud (a la Firecracker).

From the blog post though, it looks like you're running Apple-specific VMs for local execution?

As someone who's built the former, I'd love the latter for use with the new gpt-oss releases :)

mkagenius
·
2 weeks ago
·
[ - ]

You are right, the product is almost the same as you described, intended for customers running LLM-generated code in their workflow.

This was something close to our hearts, so we thought of building it for our local use and releasing it for like-minded individuals.

willtemperley
·
3 weeks ago
·
[ - ]

How would this compare to using Apple Foundation Models which execute on device?

https://developer.apple.com/documentation/FoundationModels

navbaker
·
3 weeks ago
·
[ - ]

Open Web UI is a great alternative for a chat interface. You can point to an OpenAI API like vLLM or use the native Ollama integration and it has cool features like being able to say something like “generate code for an HTML and JavaScript pong game” and have it display the running code inline with the chat for testing

solarkraft
·
3 weeks ago
·
[ - ]

I’m all for this. This is the first effort I’ve seen attempting to solve the full stack - most local solutions I’ve seen look so DIY that I don’t have much hope I’ll be able to properly configure and operate them dependably.

I think there’s room for an integrated solution with all the features we’re used to from commercial solutions: Web search (most important to me), voice mode (very handy), image recognition (useful in some cases), the killer feature being RAG on personal files.

jychang
·
3 weeks ago
·
[ - ]

Any way to install this via just a container?

Similar to a `docker compose up -d` that a lot of projects offer. Just download the docker-compose.yml file into a folder, run the command, and you're running. If you want to delete everything, just `docker compose down` and delete the folder, and the container and everything is gone.

Anything similar to that? I don't want to run a random install.sh on my machine that does god knows what.

mkagenius
·
3 weeks ago
·
[ - ]

There are similar commands for coderunner (not the UI frontend):

  container image pull instavm/coderunner

  container run  --name coderunner --detach  instavm/coderunner

(for more comprehensive commands, see from line 51 https://github.com/instavm/coderunner/blob/main/install.sh#L...)

Frontend (coderunner-ui) is not inside a docker as of now.

patmorgan23
·
3 weeks ago
·
[ - ]

I believe a flatpack or appimage is what you're looking for.

cheschire
·
3 weeks ago
·
[ - ]

But you would pump your secrets into a docker AI?

__MatrixMan__
·
3 weeks ago
·
[ - ]

If it was sufficiently locked down, yeah. It's only going to live long enough to give me an answer and then everything it can write to goes away afterwards (besides the answer itself).

What harm can it do?

oblio
·
3 weeks ago
·
[ - ]

Does Docker do that or are you speculating?

Also - podman?

cheschire
·
3 weeks ago
·
[ - ]

I wasn't implying docker itself was the issue.

The previous commenter said that they didn't want to run a shell script that does "god knows what". The implication being that they would not trust the writer of the shell script.

They wanted a docker container that would setup this offline AI workspace for them, presumably so they could interact with the AI and feed "secrets" or otherwise private data into it. Obviously there are other use cases for an offline AI, but folks tend to let their guard down when they think something is offline-only, and they may not be as careful with .env values, or personal information, as they would with a SaaS frontier model.

So I was pointing out that the contents of the docker container would be also doing "god knows what" with their data. Sure they would get the offline user experience but then what happens? More shell scripts? Background data calls? etc. And of course it depends on how they configure their docker container, but if they aren't willing to review an install shell script, they probably aren't looking to do any level of effort for configuring Docker.

Hopefully that clarifies it.

jychang
·
3 weeks ago
·
[ - ]

... I mean, yes? The entire point of local AI is so you can feed your enterprise code into it, that you don't want offloaded to somewhere else.

That's the exact perfect use case for Docker, versus something heavier weight like a VM. What, you expect generic enterprise code to somehow be too dangerous for Docker but acceptable in a VM?

PeterStuer
·
3 weeks ago
·
[ - ]

The link to assistent ui in the article 404's. It should be https://github.com/assistant-ui/assistant-ui

mkagenius
·
3 weeks ago
·
[ - ]

My bad, I typed `-ai` instead of `-ui`. Its fixed now.

jarym
·
3 weeks ago
·
[ - ]

Playing with local LLMs is indeed fun. I use Kasm workspaces[0] to run a desktop session with ollama running on the host. Gives me the isolation and lets me experiment with all manner of crazy things (I tried to make a computer-use AI but it wasn't very good)

[0] https://kasmweb.com/

tcdent
·
3 weeks ago
·
[ - ]

I'm constantly tempted by the idealism of this experience, but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.

As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.

Coupled with the dramatically inferior performance of the weights you would be running in a local environment, it's just not worth it.

I expect this will change in the future, and am excited to invest in a local inference stack when the weights become available. Until then, you're idling a relatively expensive, rapidly depreciating asset.

Aurornis
·
3 weeks ago
·
[ - ]

I think the local LLM scene is very fun and I enjoy following what people do.

However every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between local hosted models and the frontier models that I can get for $20/month or nominal price per token from different providers. The difference in speed and quality is massive.

The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.

I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.

I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.

jauntywundrkind
·
3 weeks ago
·
[ - ]

I agree and disagree. Many of the best models are open source, just too big to run for most people.

And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512 GB unified memory though has huge capacity, and a decent chunk of bandwidth (800GB/s. Compare vs a 5090's ~1800GB/s). $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive. Performance is even less, but a single AMD Turin chip with it's 12-channels DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is gonna be $4000+ in ram costs, plus $4800 for for example a 48 core Turin to go with it. (But if you go to older generations, affordability goes way up! Special part, but the 48-core 7R13 is <$1000).

Still, those costs come to $5000 at the low end. And come with much less token/s. The "grid compute" "utility compute" "cloud compute" model of getting work done on a hot gpu with a model already on it by someone else is very very direct & clear. And are very big investments. It's just not likely any of us will have anything but burst demands for GPUs, so structurally it makes sense. But it really feels like there's only small things getting in the way of running big models at home!

Strix Halo is kind of close. 96GB usable memory isn't quite enough to really do the thing though (and only 256GB/s). Even if/when they put the new 64GB DDR5 onto the platform (for 256GB, lets say 224 usable), one still has to sacrifice quality some to fit 400B+ models. Next gen Medusa Halo is not coming for a while, but goes from 4->6 channels, so 384GB total: not bad.

(It sucks that PCIe is so slow. PCIe 5.0 is only 64GB/s one-direction. Compared to the need here, it's no-where near enough to have a big memory host and smaller memory gpu)

Aurornis
·
3 weeks ago
·
[ - ]

> Many of the best models are open source, just too big to run for most people.

You can find all of the open models hosted across different providers. You can pay per token to try them out.

I just don't see the open models as being at the same quality level as the best from Anthropic and OpenAI. They're good but in my experience they're not as good as the benchmarks would suggest.

> $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive.

This is why I only appreciate the local LLM scene from a distance.

It’s really cool that this can be done, but $10K to run lower quality models at slower speeds is a hard sell. I can rent a lot of hours on an on-demand cloud server for a lot less than that price or I can pay $20-$200/month and get great performance and good quality from Anthropic.

I think the local LLM scene is fun where it intersects with hardware I would buy anyway (MacBook Pro with a lot of RAM) but spending $10K to run open models locally is a very expensive hobby.

jstummbillig
·
3 weeks ago
·
[ - ]

> Many of the best models are open source, just too big to run for most people

I don't think that's a likely future, when you consider all the big players doing enormous infrastructure projects and the money that this increasingly demands. Powerful LLMs are simply not a great open source candidate. The models are not a by-product of the bigger thing you do. They are the bigger thing. Open sourcing a LLM means you are essentially investing money to just give it away. That simply does not make a lot of sense from a business perspective. You can do that in a limited fashion for a limited time, for example when you are scaling or it's not really your core business and you just write it off as expenses, while you try to figure yet another thing out (looking at you Meta).

But with the current paradigm, one thing seems to be very clear: Building and running ever bigger LLMs is a money burning machine the likes of which we have rarely or ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.

esseph
·
3 weeks ago
·
[ - ]

https://pcisig.com/pci-sig-announces-pcie-80-specification-t...

From 2003-2016, 13 years, we had PCIE 1,2,3.

2017 - PCIE 4.0

2019 - PCIE 5.0

2022 - PCIE 6.0

2025 - PCIE 7.0

2028 - PCIE 8.0

Manufacturing and vendors are having a hard time keeping up. And the PCIE 5.0 memory is.. not always the most stable.

dcrazy
·
3 weeks ago
·
[ - ]

Are you conflating GDDR5x with PCIe 5.0?

esseph
·
3 weeks ago
·
[ - ]

No.

I'm saying we're due for faster memory but seem to be having trouble scaling bus speeds as well (in production) and reliable memory. And the network is changing a lot, too.

It's a neverending cycle I guess.

dcrazy
·
3 weeks ago
·
[ - ]

One advantage of Apple Silicon is the unified memory architecture. You put memory on the fabric instead of on PCIe.

jauntywundrkind
·
3 weeks ago
·
[ - ]

Thanks for the numbers. Valuable contribution for sure!!

There's been a huge lag for PCIe adoption, and imo so so much has boiled down "do people need it"?

In the past 10 years I feel like my eyes have been opened that every high tech company's greatest highest most compelling desire is to slow walk the release out. To move as slow as the market will bear, to do as little as possible, to roll on and on with minor incremental changes.

There are canonball moments where the market is disrupted. Thank the fucking stars Intel got sick of all this shit and worked hard (with many others) to standardized NVMe, to make a post SATA world with higher speeds & better protocol. AMD64 architecture changed the game. Ryzen again. But so much of the industry is about retaining your cost advantage, is about retaining strong market segmentations, by never shipping too many PCIe lane platforms, by limiting consumer vs workstation vs server video card ram and vgpu (and mxgpu) and display out capabilities often entirely artificially.

But there is a fucking fire right now and everyone knows it. Nvlink is massively more bandwidth and massively more efficient and is essential to system performance. The need to get better fast is so on. Seems like for now SSD will keep slow walking their 2x's. But PCIe is facing a real crisis of being replaced, and everyone wants better. And hates hates hates the insane cost. PCIe 8.0 is going to be insane data to push over a differential, insane speed. But we have to.

Alas PCIe is also hampered by relatively generous broader system design. The trace distances are going to shrink, signal requirements increase a lot. But this needing a intercompatible compliance program for any peripheral to work is a significant disadvantage, versus, just make this point to point link work between these two cards.

There's so many energies happening right now in interconnect. I hope we see some actual uptake, some day. We've had so long for Gen-Z (Ethernet phy, gone now), CXL (3.x being switched, still un-arriced), now UltraEthernet and UltraLink. Man I hope we can see some step improvements. Everyone knows we are in deep shit if NV alone can connect systems. Ironically AMD's HyperTransport was open, was a path towards this, but now Infinity Fabric is an internal only thing and as branding & an idea vanishing from the world kind of, feels insufficient.

esseph
·
3 weeks ago
·
[ - ]

All of these extremely high end technologies are so far away from hitting the consumer market.

Is there any desire for most people? What's the TAM?

jauntywundrkind
·
3 weeks ago
·
[ - ]

Classic economics thinking: totally fucked "faster horses" thinking.

The addressable market depends on the advantage. Which right now: we don't know. It's all a guess that someone is going to find it valuable, and no one knows.

But if we find that we didn't actually need $700 NIC's to get shitty bandwidth, if we could have just been putting cables from PCIe shaped slot to PCIe slot (or oculink port!) and getting >>10x performance with >>10x less latency? Yeah bro uhh I think there might be a desire for using the same fucking chip we already use but getting 10x + 10x better out of it.

Faster lower latency cheaper storage? RAM expandability? Lower latency GPU access? There's so much that could make a huge difference for computing, broadly.

justincormack
·
3 weeks ago
·
[ - ]

Thunderbolt tunnels pcie and you can use it as a nic in effect with one cable between devices. Its slower than oculink but more convenient.

esseph
·
2 weeks ago
·
[ - ]

I am very ready for optical bus lfg

nemomarx
·
3 weeks ago
·
[ - ]

Probably small consumer market of enthusiasts (notice Nvidia barely caters to gaming hardware lately) but if you can get better memory throughput on servers isn't that a large industry market?

Rohansi
·
3 weeks ago
·
[ - ]

You'll want to look at benchmarks rather than the theoretical maximum bandwidth available to the system. Apple has been using bandwidth as a marketing point but you're not always able to use that bandwidth amount depending on your workload. For example, the M1 Max has 400GB/s advertised bandwidth but the CPU and GPU combined cannot utilize all of it [1]. This means Strix Halo could actually be better for LLM inference than Apple Silicon if it achieves better bandwidth utilization.

[1] https://web.archive.org/web/20250516041637/https://www.anand...

vFunct
·
3 weeks ago
·
[ - ]

The game changer technology that'll enable full 1TB+ LLM models for cheap is Sandisk's High Bandwidth Flash. Expect devices with that in about 3-4 years, maybe even on cellphones.

jauntywundrkind
·
3 weeks ago
·
[ - ]

I'm crazy excited for High Bandwidth Flash, really hope they pull it off. There is a huge caveat: only having a couple hundred or thousand r/w cycles before your multi $k accelerator stops working!! A pretty big constraint!

But as long as you are happy to keep running the same model, the wins here for large capacity & high bandwidth are sick ! And the affordability could be exceptional! (If you can afford to make flash with a hundred or so channels at a decent price!)

Uehreka
·
3 weeks ago
·
[ - ]

I was talking about this in another comment, and I think the big issue at the moment is that a lot of the local models seem to really struggle with tool calling. Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”

So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”

com2kid
·
3 weeks ago
·
[ - ]

> Like, just straight up can’t do it even though they’re advertised as being able to. Most of the models I’ve tried with Goose (models which say they can do tool calls) will respond to my questions about a codebase with “I don’t have any ability to read files, sorry!”

I'm working on solving this problem in two steps. The first is a library prefilled-json, that lets small models properly fill out JSON objects. The second is a unpublished library called Ultra Small Tool Call that presents tools in a way that small models can understand, and basically walks the model through filling out the tool call with the help of prefilled-json. It'll combine a number of techniques, including tool call RAG (pulls in tool definitions using RAG) and, honestly, just not throwing entire JSON schemas at the model but instead using context engineering to keep the model focused.

IMHO the better solution for local on device workflows would be if someone trained a custom small parameter model that just determined if a tool call was needed and if so which tool.

jauntywundrkind
·
3 weeks ago
·
[ - ]

Agreed that this is a huge limit. There's a lot of examples actually of "tool calling" but it's all bespoke code-it-yourself: very few of these systems have MCP integration.

I have a ton of respect for SGLang as a runtime. I'm hoping something can be done there. https://github.com/sgl-project/sglang/discussions/4461 . As noted in that thread, it is really great that Qwen3-Coder has a tool-parser built-in: hopefully can be some kind useful reference/start. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...

·
3 weeks ago
·
[ - ]

wizee
·
3 weeks ago
·
[ - ]

Qwen 3 Coder 30B-A3B has been pretty good for me with tool calling.

mxmlnkn
·
3 weeks ago
·
[ - ]

This resonates. I have finally started looking into local inference a bit more recently.

I have tried Cursor a bit, and whatever it used worked somewhat alright to generate a starting point for a feature and for a large refactor and break through writer's blocks. It was fun to see it behave similarly to my workflow by creating step-by-step plans before doing work, then searching for functions to look for locations and change stuff. I feel like one could learn structured thinking approaches from looking at these agentic AI logs. There were lots of issues with both of these tasks, though, e.g., many missed locations for the refactor and spuriously deleted or indented code, but it was a starting point and somewhat workable with git. The refactoring usage caused me to reach free token limits in only two days. Based on the usage, it used millions of tokens in minutes, only rarely less than 100K tokens per request, and therefore probably needs a similarly large context length for best performance.

I wanted to replicate this with VSCodium and Cline or Continue because I want to use it without exfiltrating all my data to megacorps as payment and use it to work on non-open-source projects, and maybe even use it offline. Having Cursor start indexing everything, including possibly private data, in the project folder as soon as it starts, left a bad taste, as useful as it is. But, I quickly ran into context length problems with Cline, and Continue does not seem to work very well. Some models did not work at all, DeepSeek was thinking for hours in loops (default temperature too high, should supposedly be <0.5). And even after getting tool use to work somewhat with qwen qwq 32B Q4, it feels like it does not have a full view of the codebase, even though it has been indexed. For one refactor request mentioning names from the project, it started by doing useless web searches. It might also be a context length issue. But larger contexts really eat up memory.

I am also contemplating a new system for local AI, but it is really hard to decide. You have the choice between fast GPU inference, e.g., RTX 5090 if you have money, or 1-2 used RTX 3090, or slow, but qualitatively better CPU / unified memory integrated GPU inference with systems such as the DGX Spark, the Framework Desktop AMD Ryzen AI Max, or the Mac Pro systems. Neither is ideal (and cheap). Although my problems with context length and low-performing agentic models seem to indicate that going for the slower but more helpful models on a large unified memory seems to be better for my use case. My use case would mostly be agentic coding. Code completion does not seem to fit me because I find it distracting, and I don't require much boilerplating.

It also feels like the GPU is wasted, and local inference might be a red herring altogether. Looking at how a batch size of 1 is one of the worst cases for GPU computation and how it would only be used in bursts, any cloud solution will be easily an order of magnitude or two more efficient because of these, if I understand this correctly. Maybe local inference will therefore never fully take off, barring even more specialized hardware or hard requirements on privacy, e.g., for companies. To solve that, it would take something like computing on encrypted data, which seems impossible.

Then again, if the batch size of 1 is indeed so bad as I think it to be, then maybe simply generate a batch of results in parallel and choose the best of the answers? Maybe this is not a thing because it would increase memory usage even more.

justincormack
·
3 weeks ago
·
[ - ]

You might end up using batching to run multiple queries or branches for yourself in parallel. But yes as you say it is very unclear right now.

wizee
·
3 weeks ago
·
[ - ]

While cloud models are of course faster and smarter, I've been pretty happy running Qwen 3 Coder 30B-A3B on my M4 Max MacBook Pro. It has been a pretty good coding assistant for me with Aider, and it's also great for throwing code at and asking questions. For coding specifically, it feels roughly on par with SOTA models from mid-late 2024.

At small contexts with llama.cpp on my M4 Max, I get 90+ tokens/sec generation and 800+ tokens/sec prompt processing. Even at large contexts like 50k tokens, I still get fairly usable speeds (22 tok/s generation).

1oooqooq
·
3 weeks ago
·
[ - ]

more interesting is the extent apple convinced people a laptop can replace a desktop or server. mind blowing reality distortion field (as will be proven by some twenty comments telling I'm wrong 3... 2... 1).

davidmurdoch
·
3 weeks ago
·
[ - ]

I dropped $4k on an (Intel) laptop a few years ago. I thought it would blow my old 2012 core i7 out of the water. Editing photos in Lightroom and Photoshop often requires heavy sustained CPU work. Thermals in laptops is just not a solved problem. People who say laptops are fine replacements for desktops probably don't realize how much and how quickly thermals limit heavy multi-core CPU workloads.

jki275
·
3 weeks ago
·
[ - ]

That was true until Apple released the M series laptops.

bionsystem
·
3 weeks ago
·
[ - ]

I'm a desktop guy, considering the switch to a laptop-only setup, what would I miss ?

kelipso
·
3 weeks ago
·
[ - ]

For $10k, you too can get the power of a $2k desktop, and enjoy burning your lap everyday, or something like that. If I were to do local compute and wanted to use my laptop, I would only consider a setup where I ssh in to my desktop. So I guess only difference from saas llm would be privacy and the cool factor. And rate limits, and paying more if you go over, etc.

com2kid
·
3 weeks ago
·
[ - ]

$2k laptops now days come with 16 cores. They are thermally limited, but they are going to get you 60-80% the perf of their desktop counterparts.

The real limit is on the Nvidia cards. They are cut down a fair bit, often with less VRAM until you really go up in price point.

They also come with NPUs but the docs are bad and none of the local LLM inference engines seem to use the NPU, even though they could in theory be happy running smaller models.

EagnaIonat
·
3 weeks ago
·
[ - ]

> For $10k, you too can get the power of a $2k desktop,

Even M1 MBP 32GB performance is pretty impressive for its age and you can get them for well <$1K second hand.

I have one.

I use these models: gpt-oss, llama3.2, deepseek, granite3.3

They all work fine and speed is not an issue. The recent Ollama app means I can have document/image processing with the LLM as well.

moron4hire
·
3 weeks ago
·
[ - ]

You'll end up with a portable desktop with bad thermals, impacting performance, battery life, and actually-on-the-lap comfort. Bleeding-edge performance laptops can really only manage an hour, max, on battery, making the form factor much more about moving between different pre-planned, desk-oriented work locations.

I take my laptop back and forth from home to work. At work, I ban them from in-person meetings because I want people to actually pay attention to the meeting. In both locations where I use the computer, I have a monitor, keyboard, and mouse I'm plugging in via a dock. That makes the built-in battery and I/O redundant. I think I would rather have a lower-powered, high-battery, ultra portable laptop remoting into the desktop for the few times I bring my computer to in-person meetings for demos.

I wish the memory bandwidth for eGPUs was better.

aldanor
·
3 weeks ago
·
[ - ]

Huh? Bleeding edge laptops can last a lot more on battery. M3 16'' mbp lasts definitely enough for a full office day of coding. Twice that if just browsing and not doing cpu intensive stuff.

moron4hire
·
3 weeks ago
·
[ - ]

Even the M4 Max is not "bleeding edge". Apple is doing impressive stuff with energy efficient compute, but you can't get top of the line raw compute for any amount of financial of energy budget from them.

aldanor
·
3 weeks ago
·
[ - ]

I'm genuinely interested in what kind of work are you doing if bringing m4 max is not enough? And what kind of bleeding edge laptops are we even talking about (link?) and for what purpose?

baobun
·
3 weeks ago
·
[ - ]

Upgradability, repairability, thermals (translating into widely different performance for the same specs), I/O, connectivity.

jazzypants
·
3 weeks ago
·
[ - ]

I think this would be more interesting if you were to try to prove yourself correct first.

There are extremely few things that I cannot do on my laptop, and I have very little interest in those things. Why should I get a computer that doesn't have a screen? You do realize that, at this point of technological progress, the computer being attached to a keyboard and a screen is the only true distinguishing factor of a laptop, right?

1oooqooq
·
2 weeks ago
·
[ - ]

cool. you can browse the web. that's cool. just stay out of conversation you're not an authority.

motorest
·
3 weeks ago
·
[ - ]

> As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still deprecate at that pace, making any real investment in hardware unjustifiable.

Can you explain your rationale? It seems that the worst case scenario is that your setup might not be the most performant ever, but it will still work and run models just as it always did.

This sounds like a classical and very basic opex vs capex tradeoff analysis, and these are renowned for showing that on financial terms cloud providers are a preferable option only in a very specific corner case: short-term investment to jump-start infrastructure when you do not know your scaling needs. This is not the case for LLMs.

OP seems to have invested around $600. This is around 3 months worth of an equivalent EC2 instance. Knowing this, can you support your rationale with numbers?

tcdent
·
3 weeks ago
·
[ - ]

When considering used hardware you have to take quantization into account; gpt-oss-120b for example is running a very new MXFP4 which will use far more than 80GB to fit into the available fp types on older hardware or Apple silicon.

Open models are trained on modern hardware and will continue to take advantage of cutting edge numeric types, and older hardware will continue to suffer worse performance and larger memory requirements.

motorest
·
3 weeks ago
·
[ - ]

You're using a lot of words to say "I believe yesterday's hardware might not run models as as fast as today's hardware."

That's fine. The point is that yesterday's hardware is quite capable of running yesterday's models, and obviously it will also run tomorrow's models.

So the question is cost. Capex vs opex. The fact is that buying your own hardware is proven to be far more cost-effective than paying cloud providers to rent some cycles.

I brought data to the discussion: for the price tag of OP's home lab, you only afford around 3 months worth of an equivalent EC2 instance. What's your counter argument?

kelnos
·
3 weeks ago
·
[ - ]

Not the GP, but my take on this:

You're right about the cost question, but I think the added dimension that people are worried about is the current pace of change.

To abuse the idiom a bit, yesterday's hardware should be able to run tomorrow's models, as you say, but it might not be able to run next month's models (acceptably or at all).

Fast-forward some number of years, as the pace slows. Then-yesterday's hardware might still be able to run next-next year's models acceptably, and someone might find that hardware to be a better, safer, longer-term investment.

I think of this similarly to how the pace of mobile phone development has changed over time. In 2010 it was somewhat reasonable to want to upgrade your smartphone every two years or so: every year the newer flagship models were actually significantly faster than the previous year, and you could tell that the new OS versions would run slower on your not-quite-new-anymore phone, and even some apps might not perform as well. But today in 2025? I expect to have my current phone for 6-7 years (as long as Google keeps releasing updates for it) before upgrading. LLM development over time may follow at least a superficially similar curve.

Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription. I can't justify the cost of a homelab (the capex, plus the opex of electricity, which is expensive where I live), when in a year that hardware might be showing its age, and in two years might not meet my (future) needs. And if I can't justify spending the homelab cost every two years, I certainly can't justify spending that same amount in 3 months for EC2.

motorest
·
3 weeks ago
·
[ - ]

> Fast-forward some number of years (...)

I repeat: OP's home server costs as much as a few months of a cloud provider's infrastructure.

To put it another way, OP can buy brand new hardware a few times per year and still save money compared with paying a cloud provider for equivalent hardware.

> Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription.

OP stated quite clearly their goal was to run models locally.

ac29
·
3 weeks ago
·
[ - ]

> OP stated quite clearly their goal was to run models locally.

Fair, but at the point you trust Amazon hosting your "local" LLM, its not a huge reach to just use Amazon Bedrock or something

motorest
·
3 weeks ago
·
[ - ]

> Fair, but at the point you trust Amazon hosting your "local" LLM, its not a huge reach to just use Amazon Bedrock or something

I don't think you even bothered to look at Amazon Bedrock's pricing before doing that suggestion. They charge users per input tokens + output tokens. In Amazon Bedrock, a single chat session involving 100k tokens can cost you $200. That alone is a third of OP's total infrastructure costs.

If you want to discuss options in terms of cost, the very least you should do is look at pricing.

tcdent
·
3 weeks ago
·
[ - ]

I incorporated the quantization aspect because it's not that simple.

Yes, old hardware will be slower, but you will also need a significant amount more of it to even operate.

RAM is the expensive part. You need lots of it. You need even more of it for older hardware which has less efficient float implementations.

https://developer.nvidia.com/blog/floating-point-8-an-introd...

fredmcawesome
·
3 weeks ago
·
[ - ]

But surely this is short term? Once you get older hardware with FP4 support this shouldn't be a concern.

kelnos
·
3 weeks ago
·
[ - ]

> I expect this will change in the future

I'm really hoping for that too. As I've started to adopt Claude Code more and more into my workflow, I don't want to depend on a company for day-to-day coding tasks. I don't want to have to worry about rate limits or API spend, or having to put up $100-$200/mo for this. I don't want everything I do to be potentially monitored or mined by the AI company I use.

To me, this is very similar to why all of the smart-home stuff I've purchased all must have local control, and why I run my own smart-home software, and self-host the bits that let me access it from outside my home. I don't want any of this or that tied to some company that could disappear tomorrow, jack up their pricing, or sell my data to third parties. Or even use my data for their own purposes.

But yeah, I can't see myself trying to set any LLMs up for my own use right now, either on hardware I own, or in a VPS I manage myself. The cost is very high (I'm only paying Anthropic $20/mo right now, and I'm very happy with what I get for that price), and it's just too fiddly and requires too much knowledge to set up and maintain, knowledge that I'm not all that interested in acquiring right now. Some people enjoy doing that, but that's not me. And the current open models and tooling around them just don't seem to be in the same class as what you can get from Anthropic et al.

But yes, I hope and expect this will change!

jeremyjh
·
3 weeks ago
·
[ - ]

I expect it will never change. In two years if there is a local option as good as GPT-5 there will be a much better cloud option and you'll have the same tradeoffs to make.

c-hendricks
·
3 weeks ago
·
[ - ]

Why would AI be one of the few areas where locally-hosted options can't reach "good enough"?

ac29
·
3 weeks ago
·
[ - ]

Maybe a better question is when will SOTA models be "good enough"?

At the moment there appears to be ~no demand for older models, even models that people praised just a few months ago. I suspect until AGI/ASI is reached or progress plateaus, that will continue be the case.

lexh
·
3 weeks ago
·
[ - ]

The current SOTA closed model providers are also all rolling out access to their latest models with better pricing (e.g. GPT-5 this week), which seems like a confounding factor unique to this moment in the cycle. An API consumer would need to have a very specific reason to choose GPT-4o over GPT-5, given the latter costs less, benchmarks better and is roughly the same speed.

jeremyjh
·
3 weeks ago
·
[ - ]

Yes, this is exactly my point. Thank you for stating it better.

hombre_fatal
·
3 weeks ago
·
[ - ]

For some use-cases, like making big complex changes to big complex important code or doing important research, you're pretty much always going to prefer the best model rather than leave intelligence on the table.

For other use-cases, like translations or basic queries, there's a "good enough".

kelnos
·
3 weeks ago
·
[ - ]

That depends on what you value, though. If local control is that important to you for whatever reason (owning your own destiny, privacy, whatever), you might find that trade off acceptable.

And I expect that over time the gap will narrow. Sure, it's likely that commercially-built LLMs will be a step ahead of the open models, but -- just to make up numbers -- say today the commercially-built ones are 50% better. I could see that narrowing to 5% or something like that, after some number of years have passed. Maybe 5% is a reasonable trade-off for some people to make, depending on what they care about.

Also consider that OpenAI, Anthropic, et al. are all burning through VC money like nobody's business. That money isn't going to last forever. Maybe at some point Anthropic's Pro plan becomes $100/mo, and Max becomes $500-$1000/mo. Building and maintaining your own hardware, and settling for the not-quite-the-best models might be very much worth it.

m11a
·
3 weeks ago
·
[ - ]

Agree, for now.

But the foundation models will eventually hit a limit, and the open-source ecosystem, which trails by around a year or two, will catch up.

bbarnett
·
3 weeks ago
·
[ - ]

I grew up in a time when listening to an mp3 was too computationally expensive and nigh impossible for the average desktop. Now tiny phones can decode high def video realtime due to CPU extensions.

And my phone uses a tiny, tiny amount of power, comparatively, to do so.

CPU extensions and other improvements will make AI a simple, tiny task. Many of the improvements will come from robotics.

oblio
·
3 weeks ago
·
[ - ]

At a certain point Moore's Law died and that point was about 20 years ago but fortunately for MP3s, it happened after MP3 became easily usable. There's no point in comparing anything before 2005 or so from that perspective.

We have long entered an era where computing is becoming more expensive and power hungry, we're just lucky regular computer usage has largely plateaued at a level where the already obtained performance is good enough.

But major leaps are a lot more costly these days.

victorbjorklund
·
3 weeks ago
·
[ - ]

Next two years probably. But at some point we will either hit scales where you really dont need anything better (lets say cloud is 10000 token/s and local is 5000 token/s. Makes no difference for most individual users) or we will hit som wall where ai doesnt get smarter but cost of hardware continues to fall

Aurornis
·
3 weeks ago
·
[ - ]

There will always be something better on big data center hardware.

However, small models are continuing to improve at the same time that large RAM capacity computing hardware is becoming cheaper. These two will eventually intersect at a point where local performance is good enough and fast enough.

kingo55
·
3 weeks ago
·
[ - ]

If you've tried gpt-oss:120b and Moonshot AIs Kimi Dev, it feels like this is getting closer to reality. Mac Studios, while expensive are now offering 512gb of usable RAM as well. The tooling available to running local models is also becoming more accessible than even just a year ago.

kasey_junk
·
3 weeks ago
·
[ - ]

I’d be surprised by that outcome. At one point databases were cutting edge tech with each engine leap frogging each other in capability. Still the proprietary db often have features that aren’t matched elsewhere.

But the open db got good enough that you need to justify not using them with specific reasons why.

That seems at least as likely an outcome for models as they continue to improve infinitely into the stars.

duxup
·
3 weeks ago
·
[ - ]

Maybe, but my phone has become is a "good enough" computer for most tasks compared to a desktop or my laptop.

Seems plausible the same goes for AI.

zwnow
·
3 weeks ago
·
[ - ]

You know there's a ceiling to all this with the current LLM approaches right? They won't become that much better, its even more likely they will degrade. There are cases of bad actors attacking LLMs by feeding it false information and propaganda. I dont see this changing in the future.

withinboredom
·
3 weeks ago
·
[ - ]

I seeded all over the internet that a friend of mine was an elephant with the intention of poisoning the well, so to speak. (with his permission, of course)

That was in 2021. Today if you ask who my friend is, it tells you that he is an elephant, without even doing a web search.

I wouldn’t be surprised if people are doing this with more serious things.

jokethrowaway
·
3 weeks ago
·
[ - ]

Looks like they patched it (tested on Claude, ChatGPT; I assume it's Rob) but your point is very valid.

kvakerok
·
3 weeks ago
·
[ - ]

What is even a point of having a self hosted gpt5 equivalent that's not into petabytes of knowledge?

pfannkuchen
·
3 weeks ago
·
[ - ]

It might change once the companies switch away from lighting VC money on fire mode and switch to profit maximizing mode.

I remember Uber and AirBnB used to seem like unbelievably good deals, for example. That stopped eventually.

oblio
·
3 weeks ago
·
[ - ]

AirBNB is so good that it's half the size of Booking.com these days.

And Uber is still big but about 30% of the time in places I go to, in Europe, it's just another website/app to call local taxis from (medallion and all). And I'm fairly sure locals generally just use the website/app of the local company, directly, and Uber is just a frontend for foreigners unfamiliar with that.

pfannkuchen
·
3 weeks ago
·
[ - ]

Right but if you wanted to start a competitor it would be a lot easier today vs back then. And running one for yourself doesn’t really apply to these but spend magnitude difference wise it’s the same idea.

jeremyjh
·
3 weeks ago
·
[ - ]

This I could see.

bee_rider
·
3 weeks ago
·
[ - ]

Hardware is slower to design and manufacture than we expect as software people.

What I think we’ll see is: people will realize some things that suck in the current first-generation of laptop NPUs. The next generation of that hardware will get better as a result. The software should generally get better and lighter. We’re currently at step -.5 here, because ~nobody has bought these laptops yet! This will happen in a couple years.

Meanwhile, eventually the cloud LLM hosts will run out of investors money to subsidize our use of their computers. They’ll have to actually start charging enough to make a profit. On top of what local LLM folks have to pay, the cloud folks will have to pay:

* Their investors

* Their security folks

* The disposal costs for all those obsolete NVIDIA cards

Plus the remote LLM companies will have the fundamental disadvantage that your helpful buddy that you use as a psychologist in a pinch is also reporting all your darkest fears to Microsoft or whoever. Or your dev tools might be recycling all the work you thought you were doing for your job, back into their training set. And might be turned off. It just seems wildly unappealing.

ActorNightly
·
3 weeks ago
·
[ - ]

>but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.

Its because people are thinking too linearly about this, equating model size with usability.

Without going into too much detail because this may be a viable business plan for me, but I have had very good success with Gemma QAT model that runs quite well on a 3090 wrapped up in a very custom agent format that goes beyond simple prompt->response use. It can do things that even the full size large language models fail to do.

·
3 weeks ago
·
[ - ]

bigyabai
·
3 weeks ago
·
[ - ]

> anything you pick up second-hand will still deprecate at that pace

Not really? The people who do local inference most (from what I've seen) are owners of Apple Silicon and Nvidia hardware. Apple Silicon has ~7 years of decent enough LLM support under it's belt, and Nvidia is only now starting to depreciate 11-year-old GPU hardware in drivers.

If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s. Maybe even faster inference because of MoE architectures or improvements in the backend.

Uehreka
·
3 weeks ago
·
[ - ]

People on HN do a lot of wishful thinking when it comes to the macOS LLM situation. I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

And that’s fine! But then people come into the conversation from Claude Code and think there’s a way to run a coding assistant on Mac, saying “sure it won’t be as good as Claude Sonnet, but if it’s even half as good that’ll be fine!”

And then they realize that the heavvvvily quantized models that you can run on a mac (that isn’t a $6000 beast) can’t invoke tools properly, and try to “bridge the gap” by hallucinating tool outputs, and it becomes clear that the models that are small enough to run locally aren’t “20-50% as good as Claude Sonnet”, they’re like toddlers by comparison.

People need to be more clear about what they mean when they say they’re running models locally. If you want to build an image-captioner, fine, go ahead, grab Gemma 7b or something. If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

EagnaIonat
·
3 weeks ago
·
[ - ]

> I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

I feel like you haven't actually used it. Your comment may have been true 5 years ago.

> If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

You can use a RAG approach (eg. Milvus) and also LoRA templates to dramatically improve the accuracy of the answer if needed.

Locally you can run multiple models, multiple times without having to worry about costs.

You also have the likes of Open WebUI which builds numerous features on top of an interface if you don't want to do coding.

I have a very old M1 MBP 32GB and I have numerous applications built to do custom work. It does the job the fine and speed is not an issue. Not good enough to do a LoRA build but I have a more recent laptop for that.

I doubt I am the only one.

bigyabai
·
3 weeks ago
·
[ - ]

I agree completely. My larger point is that Apple and Nvidia's hardware has depreciated less slowly, because they've been shipping highly dense chips for a while now. Apple's software situation is utterly derelict and it cannot be seriously compared to CUDA in the same sentence.

For inference purposes, though, compute shaders have worked fine for all 3 manufacturers. It's really only Nvidia users that benefit from the wealth of finetuning/training programs that are typically CUDA-native.

Aurornis
·
3 weeks ago
·
[ - ]

> If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s.

I think this is the difference between people who embrace hobby LLMs and people who don’t:

The token/s output speed on affordable local hardware for large models is not great for me. I already wish the cloud hosted solutions were several times faster. Any time I go to a local model it feels like I’m writing e-mails back and forth to an LLM, not working with it.

And also, the first Apple M1 chip was released less than 5 years ago, not 7.

bigyabai
·
3 weeks ago
·
[ - ]

> Any time I go to a local model it feels like I’m writing e-mails back and forth

Do you have a good accelerator? If you're offloading to a powerful GPU it shouldn't feel like that at all. I've gotten ChatGPT speeds from a 4060 running the OSS 20B and Qwen3 30B models, both of which are competitive with OpenAI's last-gen models.

> the first Apple M1 chip was released less than 5 years ago

Core ML has been running on Apple-designed silicon for 8 years now, if we really want to get pedantic. But sure, actual LLM/transformer use is a more recent phenomenon.

SteveJS
·
3 weeks ago
·
[ - ]

AFAICT, the RTX 4090 I bought in 2023 has actually appreciated rather than depreciated.

alliao
·
3 weeks ago
·
[ - ]

really depends on whether local model satisfies your own usage right? if it works locally well enough, just package it up and be content? as long as it's providing value now at least it's local...

isaacremuant
·
3 weeks ago
·
[ - ]

Everything you're saying is FUD. There's immense value in being able to do local or remote as you please and part of it is knowledge.

Also, at the end of the day is about value creates and AI may allow some people to generate more stuff but overall value still tends to align with who is better at the craft pre AI. Not who pays more.

cyanydeez
·
3 weeks ago
·
[ - ]

Anything you build in the LLM cloud will be. Must be. Rug pulled either via locking success or utter bankruptcy or just a model context prompt change.

Unless you're a billionaire with pull, you're building tools you cant control, cant own and are ephermap wisps.

That's even if you can even trust these large models in consistency.

ekianjo
·
3 weeks ago
·
[ - ]

once the models behind API start monetization of their results, their outputs will get much worse. Its just a matter of time.

washadjeffmad
·
3 weeks ago
·
[ - ]

It's not that bad. If you're an adult making a living wage, and you're literate in some IT principles and AGI operations know-how, it's not a major onetime investment. And you can always learn. I'm sure your argument deterred a lot of your parents' generation from buying computers, too. Where would most of us be if not for that? This is a second transistor moment, right in our lifetime.

Life is about balance. If you Boglehead everything and then die before retirement, did you really live?

braooo
·
3 weeks ago
·
[ - ]

Running LLMs at home is a repeat of the mess we make with "run a K8s cluster at home" thinking

You're not OpenAI or Google. Just use pytorch, opencv, etc to build the small models you need.

You don't need Docker even! You can share over a simple code based HTTP router app and pre-shared certs with friends.

You're recreating the patterns required to manage a massive data center in 2-3 computers in your closet. That's insane.

frank_nitti
·
3 weeks ago
·
[ - ]

For me, this is essential. On priciple, I won't pay money to be a software engineer.

I never paid for cloud infrastructure out of pocket, but still became the go-to person and achieved lead architecture roles for cloud systems, because learning the FOSS/local tooling "the hard way" put me in a better position to understand what exactly my corporate employers can leverage with the big cash they pay the CSPs.

The same is shaping up in this space. Learning the nuts and bolts of wiring systems together locally with whatever Gen AI workloads it can support, and tinkering with parts of the process, is the only thing that can actually keep me interested and able to excel on this front relative to my peers who just fork out their own money to the fat cats that own billions worth of compute.

I'll continue to support efforts to keep us on the track of engineers still understanding and able to 'own' their technology from the ground up, if only at local tinkering scale

jtbaker
·
3 weeks ago
·
[ - ]

Self hosting my own LLM setup in the homelab was what really helped me learn the fundamentals of K8s. If nothing else I'm grateful for that!

Imustaskforhelp
·
3 weeks ago
·
[ - ]

So I love linux and would wish to learn devops one day in its entirety to be an expert to actually comment on the whole post but

I feel like they actually used docker for just the isolation part or as a sandbox (technically they didn't use docker but something similar to it for mac (apple containers) ) I don't think that it has anything to do with k8s or scalability or pre shared cert or http router :/

meta_ai_x
·
3 weeks ago
·
[ - ]

This is especially true since AI is a large multiplicative factor to your productivity.

If Cloud LLMs have 10 IQ points > local LLM, within a month, you'll notice you'll be struggling behind the dude who just used Cloud LLM.

LocalLlama is for hobbies or your job depends on running locallama.

This is not one-time upfront setup cost vs payoff later tradeoff. It is a tradeoff you are making every query which compounds pretty quickly.

Edit : I expect nothing better than downvotes from this crowd. How HN has fallen on AI will be a case study for the ages

luke14free
·
3 weeks ago
·
[ - ]

you might want to check out what we built -> https://inference.sh supports most major open source/weight models from wan 2.2 video, qwen image, flux, most llms, hunyan 3d etc.. works in a containerized way locally by allowing you to bring your own gpu as an engine (fully free) or allows you to rent remote gpu/pool from a common cloud in case you want to run more complex models. for each model we tried to add quantized/ggufs versions to even wan2.2/qwen image/gemma become possible to execute with as little as 8gb vram gpus. mcp support coming soon in our chat interface so it can access other apps from the ecosystem.

pwn0
·
3 weeks ago
·
[ - ]

The website is very confusing. Where can I download the application? Is there a GitHub repository?

retrocog
·
3 weeks ago
·
[ - ]

Its all about context and purpose, isn't it? For certain lightweight uses cases, especially those concerning sensitive user data, a local implementation may make a lot of sense.

accrual
·
3 weeks ago
·
[ - ]

My thoughts exactly. The recent GPT-OSS 20B parameter model was a nice upgrade, it really feels like having a local mini ChatGPT.

kenny239
·
1 week ago
·
[ - ]

We need more projects like this maybe I'll help write some part in the near future.

woadwarrior01
·
3 weeks ago
·
[ - ]

> LLMs: Ollama for local models (also private models for now)

Incidentally, I decided to try to Ollama macOS app yesterday, and the first thing it tries to do upon launch is try to connect to some google domain. Not very private.

https://imgur.com/a/7wVHnBA

Aurornis
·
3 weeks ago
·
[ - ]

Automatic update checks https://github.com/ollama/ollama/blob/main/docs/faq.md

eric-burel
·
3 weeks ago
·
[ - ]

But can be audited which I'd buy everyday. It's probably not to hard to find network calls in a codebase if this task must be automated on update.

woadwarrior01
·
3 weeks ago
·
[ - ]

This is the macOS GUI, which IIUC is closed source.

abtinf
·
3 weeks ago
·
[ - ]

Yep, and I’ve noticed the same thing with in vscode with both the cline plugin and the copilot plugin.

I configure them both to use local ollama, block their outbound connections via little snitch, and they just flat out don’t work without the ability to phone home or posthog.

Super disappointing that Cline tries to do so much outbound comms, even after turning off telemetry in the settings.

sabareesh
·
3 weeks ago
·
[ - ]

Here is my rig, running GLM 4.5 Air. Very impressed by this model

https://sabareesh.com/posts/llm-rig/

https://huggingface.co/zai-org/GLM-4.5

nikolayasdf123
·
3 weeks ago
·
[ - ]

local/edge is the most under-valued space at the moment. incredible computing power that dwarfs datacenters, zero latency, zero cost, private, distributed and resilient

oblio
·
3 weeks ago
·
[ - ]

I guess you imagine a world like Skype supernodes (Skype gave that up more than a decade ago) or Tor nodes (Tor is used by a tiny fraction of internet users).

Not saying it can't be done, but the effort is humongous.

nikolayasdf123
·
3 weeks ago
·
[ - ]

no, I mean I saw multiple companies at this point with their entier K8S cluster... is smaller than single new macbook pro :/

now, if you have 100,000 users with latest iPhone, say you use 10GB RAM in each, using A16 chip with 1.9 TFLOPS, each with 5G connection

this is 1 Peta-Byte RAM + 0.25 Peta-FLOPs GPU + 4 TB / second bandwidth

at zero cost (no-upfront, no-maintenance, users pay for, upgrade, and maintain their phones working, pay for internet, charging with electricity, cooling? - thanks!)

... it goes even wilder if you use macbooks

... and if you consider say mid-size town in China with population of 15 million, you go Exa-scale

and consider that for now iPhones are just sitting idle. for now.

oblio
·
3 weeks ago
·
[ - ]

Things don't work like that.

First of all iPhones have more like 6-8GB of RAM, 1-2 of which are already taken up by the system and system apps. Add some resident apps and maybe 1-2GB are already taken. Then of course during peak times, which are predictable but not guaranteed, 5-10% is maybe available. So out of your 10GB estimated per device, you actually average maybe 3GB.

Similar story for the CPU and GPU.

Then, availability: dead battery, no cell reception, airplane mode, etc, etc.

And on top of that, in the context of battery charge and long term wear and tear, you're assuming people will just let you run Bitcoin mining nodes on them.

You need a really solid incentive for people to loan you end user computing power for legitimate reasons.

eric-burel
·
3 weeks ago
·
[ - ]

An llm on your computer is a fun hobby, an llm in your SME for 10 people is a business idea. There are not enough resources on this topic at all and the need is growing extremely fast. Local LLMs are needed for many use cases and business where cloud is not possible.

EagnaIonat
·
3 weeks ago
·
[ - ]

You can get good models that run fine on M1 32GB laptops just using Ollama App.

Or if you want numerous features on top of your local LLMS then Open WebUI would be my choice.

https://docs.openwebui.com

rshemet
·
3 weeks ago
·
[ - ]

if you ever end up trying to take this in the mobile direction, consider running on-device AI with Cactus –

https://cactuscompute.com/

Blazing-fast, cross-platform, and supports nearly all recent OS models.

b0ner_t0ner
·
3 weeks ago
·
[ - ]

Is this your site? It's missing a <title> tag.

ahmedbaracat
·
3 weeks ago
·
[ - ]

Thanks for sharing. Note that the GitHub at the end of the article is not working…

mkagenius
·
3 weeks ago
·
[ - ]

Thanks for the heads up. It's fixed now -

Coderunner-UI: https://github.com/instavm/coderunner-ui

Coderunner: https://github.com/instavm/coderunner

b0ner_t0ner
·
3 weeks ago
·
[ - ]

To OP, your link for https://github.com/assistant-ui/assistant-ui does not work.

xt00
·
3 weeks ago
·
[ - ]

Yea in an ideal world there would be a legal construct around AI agents in the cloud doing something on your behalf that could not be blocked by various stakeholders deciding they don't like the thing you are doing even if totally legal. Things that would be considered fair use, or maybe annoying to certain companies should not be easy for companies to just wholesale block by leveraging business relationships. Barring that, then yea, a local AI setup is the way to go.

kalasoo
·
2 weeks ago
·
[ - ]

Agree

I agree on this in every aspect

AI or any technology serve users locally will eventually empower users in a great manner because users can fully understand what they want.

Like a paper and pencil, which was not "cheap" in early history but eventually "local". AI or any technology will function the same way eventually.

why?

1. free to run and create (free == cheap, free == uncensored) 2. ambient everywhere

mark_l_watson
·
3 weeks ago
·
[ - ]

That is fairly cool. I was talking about this on X yesterday: another angle however, I use a local web scraper and search engine via meilisearch the main tech web sites I am interested in. For my personal research I use three web search APIs, but there is some latency. Having a big chuck of the web that I am interested in available locally with close to zero latency is nice when running local models, my own MCP services that might need web search, etc.

shekhargulati
·
3 weeks ago
·
[ - ]

I tried to port it to Docker and wrote a blog here https://shekhargulati.com/2025/08/09/making-coderunner-ui-wo.... I used Claude Code to do the port. We used Datalayer Jupyter MCP Server instead of coderunner which uses Apple containers.

mockingloris
·
3 weeks ago
·
[ - ]

At least, you are honest about augmenting the porting process. It's amazing what one can accomplish when they realize that with proper time, planning and a good grounding on building code/systems, that a lot more is possible.

The takeaway for me is that because these tools are fast doesn't mean the task also needs to move as fast. At least till AGI, a sound human reasoning before hitting enter goes a long way.

Thanks for sharing

ljosifov
·
3 weeks ago
·
[ - ]

In the same boat. I love running things localhost. It's been great fun, and I learned tons I didn't know before. I know remote models API-s are a must for any serious work where tons is to be done, produced fr. Still it warms my heart every time llama-server runs on, and serves my aging mbp. Recent MoEs run great on macs with loads of v/ram, and the power efficiency is scarcely believable.

hollowonepl
·
3 weeks ago
·
[ - ]

Yep, that is something I do also actively experiment with in home projects. Local NAS (Synology) with 28TB of RAIDed storage, local containers and VMs on it and local gitea and other devops and productivity tools. All that talks to my mac which runs editing, compiling, etc and lmstudio with local agent. Not best always with AI, I lack enough RAM but close to imagine how I will work in the future, end-to-end

cheesedoodle
·
3 weeks ago
·
[ - ]

I’m trying to do something similar but hyper fine tune a model of choice for my specific local data source. For example, use existing code models to answer dquestions with code examples based on my private source files and documentation.

I tried doing it with using Huggingface and Unsloth but keep getting OOM errors.

Have anyone done this that runs locally against your own data?

codazoda
·
3 weeks ago
·
[ - ]

I was just writing up a plan this morning. I use local models a fair amount, especially on trips.

My plan is to build a $150 AI bot, host it in my bedroom, give it access to all my writing, and let the world access it.

sneak
·
3 weeks ago
·
[ - ]

Halfway through he gives up and uses remote models. The basic premise here is false.

Also, the term “remote code execution” in the beginning is misused. Ironically, remote code execution refers to execution of code locally - by a remote attacker. Claude Code does in fact have that, but I’m not sure if that’s what they’re referring to.

thepoet
·
3 weeks ago
·
[ - ]

The blog says more about keeping the user data private. The remote models in the context are operating blind. I am not sure why you are nitpicking, almost nobody reading the blog would take remote code execution in that context.

vunderba
·
3 weeks ago
·
[ - ]

The MCP aspect (for code/tool execution) is completely orthogonal to the issue of data privacy.

If you put a remote LLM in the chain than it is 100% going to inadvertently send user data up to them at some point.

e.g. if I attach a PDF to my context that contains private data, it WILL be sent to the LLM. I have no idea what "operating blind" means in this context. Connecting to a remote LLM means your outgoing requests are tied to a specific authenticated API key.

testuseraugust
·
3 weeks ago
·
[ - ]

It would be nice to have something more modest like a local offline foreign language translator.

Basically I'd like to be able to have an emacs "M-x translate-french-to-english" function. This should be easier than a full chat app but doesn't exist as far as I know.

brbcompiling
·
3 weeks ago
·
[ - ]

Local AI is awesome, but without beefy hardware it’s like trying to run a marathon in flip-flops.

mathiaspoint
·
3 weeks ago
·
[ - ]

If you have good flipflops you can walk miles without having to take a break. The other year Walmart had some really good George brand flipflops I used to wear everywhere.

bling1
·
3 weeks ago
·
[ - ]

On a similar vibe, we developed app.czero.cc to run an LLM inside your chrome browser on your machine hardware without installation (you do have to download the models). Hard to run big models, but it doesnt get more local than that without having to install anything.

kaindume
·
3 weeks ago
·
[ - ]

Self hosted and offline AI systems would be great for privacy but the hardware and electricity cost are much too high for most users. I am hoping for a P2P decentralized solution that runs on distributed hardware not controlled by a single corporation.

user3939382
·
3 weeks ago
·
[ - ]

I’d settle for homomorphic encryption but that’s a long way off if ever

vunderba
·
3 weeks ago
·
[ - ]

Infra notwithstanding - I'd be interested in hearing how much success they actually had using a locally hosted MCP-capable LLM (and which ones in particular) because the E2E tests in the article seem to be against remote models like Claude.

adsharma
·
3 weeks ago
·
[ - ]

https://github.com/adsharma/ask-me-anything

Supports MLX on Apple silicon. Electron app.

There is a CI to build downloadable binaries. Looking to make a v0.1 release.

k__
·
3 weeks ago
·
[ - ]

Half-OT: Anything useful that runs reasonably fast on a regular Intel CPU/GPU?

oblio
·
3 weeks ago
·
[ - ]

I did a bunch of research and basically no. Unless you can work with sending a request in the evening and getting the result in the morning.

And you'd need a lot of regular RAM because otherwise you start swapping at which point I think response times end up in days.

This tech is in the Wild West days, for it to be usable by the average person on consumer hardware, I think we'll need to be in 2030+.

ethan_smith
·
3 weeks ago
·
[ - ]

For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit quantization. GGUF models with INT4 quantization typically need ~2GB RAM per billion parameters, so even older machines can handle smaller models.

akawry
·
2 weeks ago
·
[ - ]

Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp

CPU performance is much better than mainline llama, as well as having more quantization types available

dmezzetti
·
3 weeks ago
·
[ - ]

I built TxtAI with this philosophy in mind: https://github.com/neuml/txtai

dcreater
·
3 weeks ago
·
[ - ]

Then using ollama is not the right choice.

https://news.ycombinator.com/item?id=44814607

ruler88
·
3 weeks ago
·
[ - ]

At least you won't be needing a heater for the winter

synergy20
·
3 weeks ago
·
[ - ]

a PC with rtx3090 is able to run many models locally with decent speed. or rtx4090 though it's more expensive(and power hungry)

·
3 weeks ago
·
[ - ]

yichuan
·
3 weeks ago
·
[ - ]

That's my vision, hope it can help. I think that if we combine all our personal data and organize it effectively, we can be 10 times more efficient. Long-term AI memory, all you speak and see will secretly be loaded to your own personal AI, and that can solve many difficulties, I think. https://x.com/YichuanM/status/1953886817906045211

mikeyanderson1
·
3 weeks ago
·
[ - ]

We have this in closed alpha right now getting ready to roll out to our most active builders in the coming weeks at ThinkAgents.ai

kalasoo
·
2 weeks ago
·
[ - ]

Agree

I agree on this in every aspect

AI or any technology serve users locally will eventually empower users in a great manner because users can fully understand what they want.

Like a paper and pencil, which was not "cheap" in early history but eventually local. "AI" or any technology will function the same way eventually.

why?

1. free to run and create (free == cheap, free == uncensored) 2. ambient everywhere

LastTrain
·
3 weeks ago
·
[ - ]

I get it but I can’t get over the irony that you are using a tool that only works precisely because people don’t do this.

anupshinde
·
3 weeks ago
·
[ - ]

What is the Apple hardware being used here? I see Apple Silicon but not the configuration.. what did I miss

thebruce87m
·
2 weeks ago
·
[ - ]

Was looking for that too. Need to know whether I already own the hardware or can’t afford it.

eyespasm
·
3 weeks ago
·
[ - ]

To be honest, I just want to make porn. My own porn, the way I want it. That’s what I’m waiting for. Why the heck do I need to scroll through pages of boring, vanilla, pedestrian porn on Pornhub or RedGIfs or XNXX when I can create exactly what I want? That’ll be a huge killer app when I can do it locally and in the privacy of my own home.

btbuildem
·
3 weeks ago
·
[ - ]

I didn't see any mention of the hardware OP is planning to run this on -- any hints?

Woodi
·
3 weeks ago
·
[ - ]

So there are models to download but:

a) on what data that things was trained ?

b) any reproducible builds projects ? ;)

unboxingelf
·
3 weeks ago
·
[ - ]

In addition to self hosting, anonymous access to hosted inference is another interesting path.

There’s a “A Decentralised LLM Routing Marketplace” being built out on nostr that leverages ecash.

https://www.routstr.com/

gen2brain
·
3 weeks ago
·
[ - ]

People are talking about AI everywhere, but where can we find documentation, examples, and proof of how it works? It all ends with chat. Which chat is better and cheaper? This local story is just using some publicly available model, but downloaded? When is this going to stop?

zakki
·
3 weeks ago
·
[ - ]

Curious with the hardware used in this article.

felarof
·
2 weeks ago
·
[ - ]

You should definitely checkout BrowserOS! -- https://github.com/browseros-ai/BrowserOS

_the_inflator
·
3 weeks ago
·
[ - ]

The socialist EU allows only AI that serves the governance purpose. The EU has rightfully acknowledged that freedom of AI is essentially freedom of speech.

Hacking officially stopped being non-political in EU.

https://artificialintelligenceact.eu/

Enjoy understanding this here: https://artificialintelligenceact.eu/article/3/

Measures of Innovations rank at... Article 57! https://artificialintelligenceact.eu/ai-act-explorer/

I bet that soon, anyone involved with sophisticated AI systems will be system-checked and require a license.

God bless you all out there and have phun!

oblio
·
3 weeks ago
·
[ - ]

The EU isn't socialist, what are you going on about?

And AI - if true AI - can be "end of times" type tech, you think it won't be regulated? This is not hackers playing with breadboards in the 60s, it's Project Manhattan in the 40s.

nenadg
·
3 weeks ago
·
[ - ]

did this by running models in chroot

josephwegner
·
3 weeks ago
·
[ - ]

See

nikolayasdf123
·
3 weeks ago
·
[ - ]

local is so hot right now

hoppp
·
3 weeks ago
·
[ - ]

Local is important for compliance with GDPR and closed source software

I hate sending my code to openAI or my client's code.

I find local llms to be usable for short snippets but still too slow for a lot of things.

I just spent hours debugging code mistral ai gave me and had multiple errors, rtfm is still most of the times better than relying on an llm

evrennetwork
·
3 weeks ago
·
[ - ]

[dead]

jeffWrld
·
3 weeks ago
·
[ - ]

[dead]

123sereusername
·
3 weeks ago
·
[ - ]

[dead]

techlatest_net
·
3 weeks ago
·
[ - ]

[dead]

pyman
·
3 weeks ago
·
[ - ]

Mr Stallman? Richard, is that you?