Assuming you want to do this interactively (at least for the first time), you should only need to run:
ccmake .
And toggle the parameters your hardware supports or that you want (e.g. CUDA if you're using Nvidia, Metal if you're using Apple, etc.), press 'c' (configure) then 'g' (generate), then:
cmake --build . -j $(expr $(nproc) / 2)
Done. If you want to move the binaries into your PATH, you could then optionally run `cmake --install .`.
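Putting that together, the whole CMake path is roughly this (a sketch of an in-source build; toggle the options in ccmake to match your hardware):
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
ccmake .                                  # toggle CUDA/Metal/Vulkan etc., then 'c' and 'g'
cmake --build . -j $(expr $(nproc) / 2)   # build with half your cores
cmake --install .                         # optional; may need sudo depending on the install prefix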
In that case, the steps to build llama.cpp are:
1. Clone the repo.
2. Run `make`.
To start chatting with a model all you need is to:
1. Download the model you want in GGUF format, one that will fit in your hardware (probably the hardest step, but plenty are readily available on HuggingFace).
2. Run `./llama-server -m model.gguf`.
3. Visit localhost:8080
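If you'd rather script against it than use the web UI, llama-server also exposes an OpenAI-compatible endpoint; a minimal sketch (exact fields vary a bit between versions):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is the biggest planet in the solar system?"}]}'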
Meanwhile CMake is like two lines and somehow it's the fallback option? I don't get it. And on Linux it's literally just one line with make.
Building most stuff on Windows is ludicrous; that's not uncommon. I've chosen MSYS2 there as it's the easiest and least painful way of installing the dependencies for a Vulkan build.
Impressively, it worked. It was slow to spit out tokens, roughly one word every 1 to 5 seconds, and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated, talking about moons that it called "Jupterians", while I expected it to talk about the Galilean moons.
Nevertheless, LLMs really impressed me, and as soon as I get my hands on better hardware I'll try to run bigger models locally, in the hope that I'll finally have a personal "oracle" able to quickly answer most questions I throw at it and help me write code and other fun things. Of course, I'll have to check its answers before using them, but the current state seems impressive enough for me, especially QwQ.
Is anyone running smaller experiments and able to talk about their results? Is it already possible to have something like an open-source co-pilot running locally?
For reference, this is how I run it:
$ cat ~/.config/systemd/user/llamafile@.service
[Unit]
Description=llamafile with arbitrary model
After=network.target
[Service]
Type=simple
WorkingDirectory=%h/llms/
ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
[Install]
WantedBy=default.target
And then systemctl --user start llamafile@whatevermodel
but you can just run that ExecStart command directly and it works.

(That said, after I explained it to SecOps they did tell me I would need to “consult legal” if I wanted to use a local LLM, but I’ll give them the benefit of the doubt there…)
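For reference, with the systemd specifiers (%h, %i) expanded by hand, that ExecStart boils down to roughly this ("whatevermodel" standing in for whichever GGUF you downloaded):
~/.local/bin/llamafile -m ~/llamafile-models/whatevermodel.gguf \
  --server --host '::' --port 8081 --nobrowser --log-disable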
Small models like Llama 3.2, Qwen, and SmolLM are really good right now compared to a few years ago.
I think you still need to calibrate your expectations for what you can get from consumer grade hardware without a powerful GPU. I wouldn't look to a local LLM as a useful store of factual knowledge about the world. The amount of stuff that it knows is going to be hampered by the smaller size. That doesn't mean it can't be useful, it may be very helpful for specialized domains, like coding.
I hope and expect that over the next several years, hardware that's capable of running more powerful models will become cheaper and more widely available. But for now, the practical applications of local models that don't require a powerful GPU are fairly limited. If you really want to talk to an LLM that has a sophisticated understanding of the world, you're better off using Claude or Gemini or ChatGPT.
https://llmware-ai.github.io/llmware/
https://github.com/llmware-ai/llmware
There's also Simon Willison's llm CLI:
https://llm.datasette.io/en/stable/
Both might help with getting started on some LLM experiments.
I can imagine some use cases where you'd really want to use llama.cpp directly, and there are of course always people who will argue that all wrappers are bad wrappers, but for myself I like the combination of ease of use and flexibility offered by Ollama. I wrap it in Open WebUI for a GUI, but I also have some apps that reach out to Ollama directly.
* Ollama provides a very convenient way to download and manage models, which llama.cpp didn't at the time (maybe it does now?).
* Last I checked, with llama.cpp server you pick a model on server startup. Ollama instead allows specifying the model in the API request, and it handles switching out which one is loaded into vram automatically.
* The Modelfile abstraction is a more helpful way to keep track of different settings. When I used llama.cpp directly I had to figure out a way to track a bunch of model parameters as bash flags. Modelfiles + being able to specify the model in the request is a great synergy, allowing clients to not have to think about parameters at all, just which Modelfile to use.
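For anyone who hasn't used one, a minimal Modelfile is only a few lines; a sketch (the model name, parameter values, and prompt here are just illustrative):
cat > Modelfile <<'EOF'
FROM llama3.2
# sampling and context settings baked into the model definition
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
# system prompt shipped with the model
SYSTEM "You are a concise coding assistant."
EOF
ollama create my-coder -f Modelfile
ollama run my-coder
Clients then just ask for "my-coder" by name and never have to think about the parameters.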
I'm leaving off some other reasons why I switched that I know have since gotten better (like Ollama having great Docker support early on, which wasn't always true for llama.cpp), and some of the points above may also have improved with time. A glance over the docs suggests that you still can't specify a model at runtime in the request, though, which if true is probably the single biggest improvement that Ollama offers. I'm constantly switching between tiny models that fit on my GPU and large models that use RAM, and being able to download a new model with `ollama pull` and immediately start using it in Open WebUI is a huge plus.
However, I still find it useful to know how to do that manually.
#!/bin/sh
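# OLLAMA_MODELS points Ollama at a non-default model storage directory (this mount is specific to my setup)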
export OLLAMA_MODELS="/mnt/ai-models/ollama/"
printf 'Starting the server now.\n'
ollama serve >/dev/null 2>&1 &
serverPid="$!"
printf 'Starting the client (might take a moment (~3min) after a fresh boot).\n'
ollama run llama3.2 2>/dev/null
printf 'Stopping the server now.\n'
kill "$serverPid"
And it just works :-)

I had already burned an evening trying to debug and fix issues, getting nowhere fast, until I pulled Ollama and had it working with just two commands. It was a shock. (Granted, there is/was a crippling performance problem with Skylake/Kaby Lake chips, but it's mitigated if you have any kind of mid-tier GPU and tweak a couple of settings.)
Anyone who tries to contribute to the general knowledge base of deploying llama.cpp (like TFA) is doing heaven's work.
I find PyTorch easier to get up and running. For quantization, AWQ models work and they're just a "pip install" away.
sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential
# add yourself to the video and render groups
sudo usermod -aG video,render $USER
# reboot to apply the group changes
# download a model
wget --continue -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
"https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true"
# build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b3267
HIPCXX=clang++-17 cmake -S. -Bbuild \
-DGGML_HIPBLAS=ON \
-DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" \
-DCMAKE_BUILD_TYPE=Release
make -j8 -C build
# run llama.cpp
build/bin/llama-cli -ngl 32 --color -c 2048 \
--temp 0.7 --repeat_penalty 1.1 -n -1 \
-m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf \
--prompt "Once upon a time"
I think this will also work on Rembrandt, Renoir, and Cezanne integrated GPUs with Linux 6.10 or newer, so you might be able to install the HWE kernel to get it working on that hardware.

With that said, users with CDNA 2 or RDNA 3 GPUs should probably use the official AMD ROCm packages instead of the built-in Ubuntu packages, as there are performance improvements for those architectures in newer versions of rocBLAS.
Spoiler: Vulkan with MSYS2 was indeed the easiest to get up and running.
I actually tried w64devkit first and it worked properly for llama-server, but there were inexplicable plug-in problems with llama-bench.
Edit: I tried w64devkit before I read this write-up and I was left wondering what to try next, so the timing was perfect.
With llama.cpp running on a machine, how do you connect your LLM clients to it and request a model gets loaded with a given set of parameters and templates?
... you can't, because llama.cpp is the inference engine, and its bundled llama-server binary only provides relatively basic server functionality; it's really more of a demo/example or MVP.
llama.cpp is configured entirely at the time you run the binary, when you manually provide command-line args for the one specific model and configuration you start it with.
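To make that concrete, a typical llama-server launch looks something like this, with everything fixed up front (the model file name here is just a placeholder; check llama-server --help for your build's flags):
./llama-server -m ./models/some-7b-instruct-q4_k_m.gguf \
  -c 8192 -ngl 99 --host 127.0.0.1 --port 8080
Switching models means stopping the server and starting it again with different arguments.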
Ollama provides a server and client for interfacing and packaging models, such as:
- Hot loading models (e.g. when you request a model from your client, Ollama will load it on demand; see the request sketch after this list).
- Automatic model parallelisation.
- Automatic model concurrency.
- Automatic memory calculations for layer and GPU/CPU placement.
- Layered model configuration (basically docker images for models).
- Templating and distribution of model parameters, templates in a container image.
- A near feature-complete OpenAI-compatible API, as well as its own native API that supports more advanced features such as model hot loading, context management, etc.
- Native libraries for common languages.
- Official container images for hosting.
- Provides a client/server model for running remote or local inference servers with either Ollama or openai compatible clients.
- Support for both an official and self hosted model and template repositories.
- Support for multi-modal / Vision LLMs - something that llama.cpp is not focusing on providing currently.
- Support for serving safetensors models, as well as running and creating models directly from their Huggingface model ID.
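That hot-loading point is what clients actually see: the model is just a field in each request, e.g. against Ollama's native API (assuming you've already pulled llama3.2):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'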
In addition to the llama.cpp engine, Ollama is working on adding additional model backends (e.g. things like exl2, awq, etc.).

Ollama is not "better" or "worse" than llama.cpp, because it's an entirely different tool.
Given that Ollama currently has llama.cpp, mllama, and safetensors backends, there's far more Ollama code and functionality than code that calls llama.cpp.
the amount of code does not dictate whether or not it is a wrapper.
Unless you want to tinker.
ollama didn’t have the issue, but it’s less configurable.
One thing I’m unsure of is how to pick a model. I downloaded the 7B one from Huggingface, but how is anyone supposed to know what these models are for, or if they’re any good?
I've listed some good starting models at the end of the post. Usually, most LLMs like Qwen or Llama are general-purpose; some are fine-tuned for specific tasks, like CodeQwen for programming.
What do you use Llama.cpp for?
I get you can ask it a question in natural language and it will spit out sort of an answer, but what would you do with it, what do you ask it?
I am all for local models, but this is massively overselling what they are capable of on common consumer hardware (32GB RAM).
If you are interested in what your hardware can pull off, find the top-ranking ~30b models on lmarena.ai and initiate a direct chat with them on the same site. Pose your common questions and see if they are answered to your satisfaction.
2) You can run much larger models on Apple Silicon with surprisingly decent speed.
Not so sure about the cost-reduction argument anymore, though; you'd have to use LLMs a ton to make buying brand-new GPUs worth it (models are pretty reasonably priced these days).
I don't take everything Marc Andreessen said in his recent interview with Joe Rogan at face value, but I don't dismiss any of it either.
I have a snapshot of Wikipedia as well (well, not the whole of Wikipedia, but 90GB worth).
I guess that means English ... and maxi? As I say, it was something around 90GB or so.
Putting them into a local LLM is rather efficient, however.
I'm less concerned about privacy for whatever reason.
[1] https://engineersneedart.com/blog/offlineadvocate/offlineadv...
meaning you learn more
I wonder what was there, I like spicy comments...
Jikes