- It doesn't have a website
- It doesn't have a download page, you have to build it yourself
I'd wager that anyone capable enough to run a command line tool like Ollama should also be able to download prebuilt binaries from the llama.cpp releases page[1]. Also, prebuilt binaries are available on things like homebrew[2].
No this isn't. There are plenty of end user GUI apps that make it far easier than Ollama to download and run local LLMs (disclaimer: I build one of them). That's an entirely different market.
IMO, the intersection between the set of people who use a command line tool, and the set of people who are incapable of running `brew install llama.cpp` (or its Windows or Linux equivalents) is exceedingly small.
When I read the llama.cpp repo and see I have to build it, vs ollama where I just have to get it, the choice is already made.
I just want something I can quickly run and use with aider or mess around with. When I need to do real work I just use whatever OpenAI model we have running on Azure PTUs
Can you `brew install llama.cpp`?
The same reason I use apt install instead of compiling from source: I can definitely do that, but I don't, because it's just an easier way to get things installed.
Still, it's not immediately obvious from the README that there is an option to download it. There are instructions on how to build it, but not how to download it. Or maybe I'm blind, please correct me.
Discussions under https://news.ycombinator.com/item?id=40693391
(I recommend doing a search yourself first)
Basically, if you know how to use a computer, you can use Ollama (almost). You can't say the same thing about llama.cpp. Not everyone knows how to build from source, or even what "build" means.
Looking at the repo of llama.cpp it’s still not obvious to me how to use it without digging in - I need to download models from huggingface it seems and configure stuff etc - with ollama I type ollama get or something and it works.
Tbh I don’t use that stuff a lot or even seriously, maybe once per month to try out new local models.
I think having an easy-to-use quickstart would go a long way for llama.cpp - but maybe it’s not intended for casual (stupid?) users like me…
I owned an RTX 2070, and followed the llama.cpp instructions to make sure it was compiling with GPU support enabled. I then hand-tweaked settings (n_gpu_layers) to try to make it offload as much as possible to the GPU. I verified that it was using a good chunk of my GPU RAM (via nvidia-smi), and confirmed that with-GPU was faster than CPU-only. It was still pretty slow, and influenced my decision to upgrade to an RTX 3070. It was faster, but still pretty meh...
The first time I used ollama, everything just worked straight out of the box, with one command and zero configuration. It was lightning fast. Honestly if I'd had ollama earlier, I probably wouldn't have felt the need to upgrade GPU.
Maybe you were running a different model?
llama.cpp - Read the docs, with loads of information and unclear use cases. Question if it has API compatibility and secondary features that a bunch of tools expect. Decide it's not worth your effort when `ollama` is already running by the time you've read the docs
llama-server --model-url "https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-32B-IQ4_XS.gguf"
Will get you up and running in one single command.
For work, we are given Macs, and so the GPU can't be passed through to Docker.
I wanted a client/server where the server has the LLM and runs outside of Docker, but without me having to write the client/server part.
I run my model in ollama, then inside the code use litellm to speak to it during local development.
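For anyone curious what that looks like, here's a minimal sketch of the litellm side, assuming Ollama is serving on its default port (the model name and prompt are placeholders):

    # Minimal litellm client talking to a local Ollama instance (default port 11434).
    from litellm import completion

    response = completion(
        model="ollama/llama3.1",               # "ollama/<name>" routes through litellm's Ollama provider
        api_base="http://localhost:11434",     # Ollama's default endpoint
        messages=[{"role": "user", "content": "Say hello from local dev."}],
    )
    print(response.choices[0].message.content)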
Because apparently you can take unethical business practices, add AI, and suddenly it's a whole new thing that no one can judge!
You'll find that Ollama is also distributed under an MIT license[2]. It's fine to disagree with their priorities and lack of transparency. But arguing about how they use code from other repositories whose licenses permit exactly that is tilting at windmills, IMHO.
[0] https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
As someone who built apps for Windows, Linux, macOS, iOS and Android, it is not trivial to ensure your new features or updates work on all platforms, and you have to deal with deprecations.
Sorry, I'm not sure what the relationship is exactly between the two projects. This is a genuine question, not a troll question.
I have no idea why they have been ignoring it.
Ollama is just a friendly front end for llama.cpp. It doesn't have to do any of those things you mentioned. Llama.cpp does all that.
Shouldn't it just call Llama.cpp and let Llama.cpp handle the flags internally within Llama.cpp? I'm thinking from an abstraction layer perspective.
If I can ask one more question, why doesn't Ollama use pre-built llama.cpp binaries with Vulkan support directly?
However, such projects require a lot of time and effort, and it's not clear if this project can be forked and kept alive.
First I got the feeling because of how they store things on disk and try to get all models rehosted in their own closed library.
Second time I got the feeling is when it's not obvious at all about what their motives are, and that it's a for-profit venture.
Third time is trying to discuss things in their Discord: the moderators there constantly shut down a lot of conversation citing "misinformation" and rewrite your messages. You can ask an honest question, it gets deleted and you get blocked for a day.
Just today I asked why the R1 models they're shipping, which are the distilled ones, don't have "distilled" in the name, or even any way of knowing which tag is which model, and got the answer "if you don't like how things are done on Ollama, you can run your own object registry", which doesn't exactly inspire confidence.
Another thing I noticed after a while is that there are a bunch of people with zero knowledge of terminals who want to run Ollama, even though Ollama is a project for developers (since you do need to know how to use a terminal). Just making the messaging clearer would help a lot in this regard, but somehow the Ollama team thinks that's gatekeeping and that it's better to teach people basic terminal operations.
Granted, they could be a lot more helpful in providing information on how you do this. But this feature exists, at least.
Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.
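If you want to reproduce those counts yourself, a rough sketch with Hugging Face tokenizers looks like this (the repo IDs are my guesses for comparable tokenizers, the Llama 3 repo is gated, and counts can vary slightly between tokenizer revisions):

    # Compare token counts for the same string across a few tokenizers.
    from transformers import AutoTokenizer

    TEXT = "സ്മാർട്ട്"
    repos = {
        "Qwen 2.5": "Qwen/Qwen2.5-7B-Instruct",
        "Llama 3": "meta-llama/Meta-Llama-3-8B-Instruct",  # gated repo, requires HF access
        "DeepSeek V3": "deepseek-ai/DeepSeek-V3",
    }
    for name, repo in repos.items():
        tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
        n = len(tok.encode(TEXT, add_special_tokens=False))
        print(f"{name}: {n} tokens")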
Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):
{"content":"സ","type":"text"},"chunk_token_usage":1
{"content":"്മ","type":"text"},"chunk_token_usage":2
{"content":"ാ","type":"text"},"chunk_token_usage":1
{"content":"ർ","type":"text"},"chunk_token_usage":1
{"content":"ട","type":"text"},"chunk_token_usage":1
{"content":"്ട","type":"text"},"chunk_token_usage":1
{"content":"്","type":"text"},"chunk_token_usage":1
which totals to 8, as expected for DeepSeek V3's tokenizer.
- The distilled models are also provided by DeepSeek;
- There's also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.
I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.
--
Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.
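As a concrete sketch of that fire-and-forget workflow: any local OpenAI-compatible endpoint (llama-server and Ollama both expose one) can be scripted like below and left to grind overnight; the URL, port, and model name are assumptions about your setup:

    # Ask the local model a question, wait as long as it takes, and save the answer.
    import json, urllib.request

    payload = {
        "model": "local-model",  # placeholder; use the model name your server expects
        "messages": [{"role": "user", "content": "Work through this question carefully: ..."}],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",  # llama-server default; Ollama listens on 11434
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # no timeout set, so it can run for hours
        answer = json.load(resp)["choices"][0]["message"]["content"]
    with open("answer.txt", "w", encoding="utf-8") as f:
        f.write(answer)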
Those README changes only served to provide greater transparency to would-be users.
Ulterior motives, indeed.
I think they want their project to be smart enough to just 'figure out what to do' on behalf of the user.
That appeals to a lot of people, but I think stuffing all backends into one binary and auto-detecting at runtime which one to use is actually a step too far towards simplicity.
What they did to support both CUDA and ROCm using the same binary looked quite cursed last time I checked (because they needed to link or invoke two different builds of llama.cpp of course).
I have only glanced at that PR, but I'm guessing that this plays a role in how many backends they can reasonably try to support.
In nixpkgs it's a huge pain: we quite deliberately configure what we want Ollama to do at build time, and then Ollama runs off and does whatever it wants anyway, and users have to look at log output and performance regressions to know what it's actually doing, every time they update their heuristics for detecting ROCm. It's brittle as hell.
This PR is #1 on their repo based on multiple metrics (comments, iterations, what have you).
Every time I look at it, it seems like it's a worse llama.cpp that removes options to make things "easier".
With ollama I type brew install ollama and then ollama get something, and I have it already running. With llama.cpp it seems I have to build it first, then manually download models somewhere - this is an instant turnoff, I maybe have 5 minutes of my life to waste on this.
You'd have Llama, Mistral, Gemma, Phi, Yi.
You'd have Llama, Llama 2, Llama 3, Llama 3.2...
And those come with 8B, 13B or 70B parameters
And you can get it quantised to GGUF, AWQ, exl2...
And quantised to 2, 3, 4, 6 or 8 bits.
And that 4-bit quant is available as Q4_0, Q4_K_S, Q4_K_M...
And on top of that there are a load of fine-tunes that score better on some benchmarks.
Sometimes a model is split into 30 files and you need all 30, other times there's 15 different quants in the same release and you only need a single one. And you have to download from huggingface and put the files in the right place yourself.
ollama takes a lot of that complexity and hides it. You run "ollama run llama3.1" and the selection and download all gets taken care of.
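For contrast, the manual path that Ollama hides looks roughly like this with huggingface_hub; the repo and quant file names below are just one example, and picking them is exactly the hard part:

    # Manually fetching one specific GGUF quant from Hugging Face (example repo/file).
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",   # example repo; you chose it yourself
        filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",     # and you chose this quant yourself
    )
    print(path)  # now point llama-server / llama-cli at this file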
When I see people bring up the sketchiness most of the time the creator responds with the equivalent of shrugs, which imo increases the sketchiness.
Because you don't execute untrusted code in your machine without containerization/virtualization. Don't you?
There are a lot of open-source tools that we have to trust to get anything done on a daily basis.
Don’t you need at least 2 GPUs in that case, plus kernel-level passthrough?
But defaulting to a 671b model is also evil.
But yes, I have been "yelled" at on reddit for telling people you need vram in the hundreds of GB.
Letting people download any amount of bytes just to find out they got something else isn't optimal. So what to do? Highlight the differences when you reference them so people understand.
Tweets like these: https://x.com/ollama/status/1881427522002506009
> DeepSeek's first-generation reasoning models are achieving performance comparable to OpenAI's o1 across math, code, and reasoning tasks! Give it a try! 7B distilled: ollama run deepseek-r1:7b
Are really misleading. Reading the first part, you'd think the model in the second part is the one that gives "performance comparable to OpenAI's o1", but it's not; it's a distilled model with far worse performance. Yes, they do say it's the distilled model, but I hope I'm not alone in seeing how less careful people would confuse the two.
If they're doing this on purpose, it'd leave a very bad taste in my mouth. If they're doing this accidentally, it also gives me reason to pause and re-evaluate what they're doing.
[edit] Oh I see, here's an issue about it: https://github.com/ggerganov/llama.cpp/issues/8010
Why would I care about Vulkan?
I agree they should also support OpenVINO, but compared to Vulkan OpenVINO is a tiny market.
If you run your local LLM in the least performant way possible on your overly expensive GPU, then you are not getting value out of your purchase.
Vulkan is a fallback option is all.
I even see people running on their CPU because some apps don't support their hardware, and llama.cpp made that possible at all. It is still a really bad idea.
It just goes to show there’s still much to do.
Vulkan is the API right now in the graphics world. It's very well supported and actively being improved on. Everyone is pouring resources into making Vulkan better.
OpenVINO feels barely developed. Intel never made it a proper backend for PyTorch like AMD did with ROCm. It's hard to see where it is going, or if it is going anywhere at all. Between SYCL and oneAPI it's hard to see how much interest Intel has in developing it.
> Vulkan is the API right now in the graphics world.
YUP, Vulkan is all the rage in the graphics world, and for good reasons. But we aren't discussing graphics now, are we?
Vulkan is a general graphics API with some computing capabilities.
OpenVINO is a toolkit for running inference on neural networks, built by Intel to make use of their GPUs and NPUs for this specific task.
Using Vulkan, first you need to translate your workload into shaders, then those need to be compiled to SPIR-V, and even then they can only use a subset of the card's capabilities.
How could this even remotely match something written specifically for the task?
Also, it is dead easy to benchmark if you still think otherwise.
Or just read up on it..
Vulkan is used by millions and huge money goes into optimizing it.
My money is on Vulkan.
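On the "dead easy to benchmark" point: one crude way is to time the same request against each backend's local OpenAI-compatible server and compare tokens per second (the endpoint, model name, and usage field are assumptions about your setup; llama.cpp's own llama-bench is the more rigorous tool):

    # Crude throughput check: run once per backend build and compare the numbers.
    import json, time, urllib.request

    def tokens_per_second(url="http://localhost:8080/v1/chat/completions"):
        payload = {
            "model": "local-model",  # placeholder
            "messages": [{"role": "user", "content": "Write 300 words about GPUs."}],
            "max_tokens": 400,
        }
        req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        start = time.time()
        body = json.load(urllib.request.urlopen(req))
        elapsed = time.time() - start
        # Includes prompt processing time, so treat it as a rough comparison only.
        return body["usage"]["completion_tokens"] / elapsed

    print(f"{tokens_per_second():.1f} tok/s")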
As far as my intel goes, it's a Mozilla project shouldered mostly by one 10x programmer. I found Ollama through HN, and last time I didn't notice any lack of trust or suspected sketchiness... so what changed?
In contrast, the ollama dev team is doing useful work (creating an easy interface) but otherwise mostly piggybacking off the already existing infrastructure
That's completely off the mark.
So where's the non-sketchy, non-for-profit equivalent? Where's the nice frontend for llama.cpp that makes it trivial for anyone who wants to play around with local LLMs without having to know much about their internals? If Ollama isn't doing anything difficult, why isn't llama.cpp as easy to use?
Making local LLMs accessible to the masses is an essential job right now—it's important to normalize owning your data as much as it can be normalized. For all of its faults, Ollama does that, and it does it far better than any alternative. Maybe wait to trash it for being "just" a wrapper until someone actually creates a viable alternative.
Ollama solves the problem of how I run many models without having to deal with many instances of infrastructure.
There are many flaws in Ollama but it makes many things much easier, esp. if you don’t want to bother building and configuring. They do take a long time to merge any PRs though. One of my PRs has been waiting for 8 months, and there was another PR about KV cache quantization that took them 6 months to merge.
[1]: https://msty.app
I guess you have a point there, seeing as after many months of waiting we finally have a comment on this PR from someone with real involvement in Ollama - see https://github.com/ollama/ollama/pull/5059#issuecomment-2628002106 . Of course this is very welcome news.
And the actual patch is tiny..
I think it's about time for a bleeding-edge fork of ollama. These guys are too static and that is not what AI development is all about.
This is such low-hanging fruit that it's silly how they are acting.
And again, they're doing those contortions to make it easy for people. Making it easy involves trade-offs.
Yes, Ollama has flaws. They could communicate better about why they're ignoring PRs. All I'm saying is let's not pretend they're not doing anything complicated or difficult when no one has been able to recreate what they're doing.
Maybe I should clarify that I'm not saying that the effort to enable a new backend is substantial, I'm saying that my understanding of that comment (the one you acknowledged made a good argument) is that the maintenance burden of having a new backend is substantial.
But now suddenly what I said is not just an argument you disagree with but is also incorrect. I've been genuinely asking for several turns of conversation at this point why what I said is incorrect.
Why is it incorrect that the maintenance burden of maintaining a Vulkan backend would be a sufficient explanation for why they don't want to merge it without having to appeal to some sort of conspiracy with Nvidia?
(I mean, the missing work should not be much, but it still has to be done)
Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.
That, coupled with the increasing scale of the internet, makes it harder and harder for smaller groups to do these kinds of things. At least until we get some good content-addressed distributed storage system.
Cloudflare R2 has unlimited egress, and AFAIK, that's what ollama uses for hosting quantized model weights.
Not an equivalent yet, sorry.
llama.cpp, kobold.cpp, oobabooga, llmstudio, etc. There are dozens at this point.
And while many chalk the attachment to ollama up to a "skill issue", that's just venting frustration that all something has to do to win the popularity contest is to repackage and market it as an "app".
I prefer first-party tools, I'm comfortable managing a build environment and calling models using pytorch, and ollama doesn't really cover my use cases, so I'm not its audience. I still recommend it to people who might want the training wheels while they figure out how not-scary local inference actually is.
ICYMI, you might want to read their terms of use:
None of these three are remotely as easy to install or use. They could be, but none of them are even trying.
> lmstudio
This is a closed source app with a non-free license from a business not making money. Enshittification is just a matter of when.
Which part of the user experience did you have problems with when using it?
Like many in FOSS I care about making the experience better for everyone. Slightly weird question, why do you care that I care?
> what motivation do you have to tell them to deliberately restrict their audience?
I don't have any motivation to say any such thing, and I wouldn't either. Is that really your take away from reading that issue?
Stating something like "Ollama is a daemon/cli for running LLMs in your terminal" on your website isn't a restriction whatsoever, it's just being clear up front what the tool is. Currently, the website literally doesn't say what Ollama actually is.
Yes. You went to them with a definition of who they're trying to serve and they wrote back that they didn't agree with your relatively narrow scope. Now you're out in random threads about Ollama complaining that they didn't like your definition of their target audience.
Am I missing something?
Basically the only information on the website right now is "Get up and running with large language models." Do you think that's helping people? It could mean anything.
It's made a lot of progress in that the README [0] now at least has instructions for how to download pre-built releases or docker images, but that requires actually reading the section entitled "Building the Project" to realize that it provides more than just building instructions. That is not accessible to the masses, and it's hard for me to not see that placement and prioritization as an intentional choice to be inaccessible (which is a perfectly valid choice for them!)
And that's aside from the fact that Ollama provides a ton of convenience features that are simply missing, starting with the fact that it looks like with llama.cpp I still have to pick a model at startup time, which means switching models requires SSHing into my server and restarting it.
None of this is meant to disparage llama.cpp: what they're doing is great and they have chosen to not prioritize user convenience as their primary goal. That's a perfectly valid choice. And I'm also not defending Ollama's lack of acknowledgment. I'm responding to a very specific set of ideas that have been prevalent in this thread: that not only does Ollama not give credit, they're not even really doing very much "real work". To me that is patently nonsense—the last mile to package something in a way that is user friendly is often at least as much work, it's just not the kind of work that hackers who hang out on forums like this appreciate.
As for not merging the PR - why are you entitled to have a PR merged? This attitude of entitlement around contributions is very disheartening as an OSS maintainer - it’s usually more work to review/merge/maintain a feature than to open a PR. Also, no one is entitled to comments / discussion or literally one second of my time as an OSS maintainer. This is, imo, the cancer that is eating open source.
I didn’t get entitlement vibes from the comment; I think the author believes the PR could have wide benefit, and believes that others support his position, thus the post to HN.
I don’t mean to be preach-y; I’m learning to interpret others by using a kinder mental model of society. Wish me luck!
As someone who doesn’t follow this space, it’s hard to tell if there’s actually something sketchy going on with ollama or if it’s the usual reactionary negativity that happens when a tool comes along and makes someone’s niche hobby easier and more accessible to a broad audience.
We need to know a few things:
1) Show me the lines of code that log things and how it handles temp files and storage.
2) No remote calls at all.
3) No telemetry at all.
This is the feature list I would want to begin trusting. I use this stuff, but I also don’t trust it.
The question is: Why is ollama considered “sketchy” but llama.cpp is not, given that both are open source?
I’m not trying to debate it. I’m trying to understand why people are saying this.
Lately they seem to be contributing mostly confusion to the conversation.
The #1 model the entire world is talking about is literally mislabeled on their side. There is no such thing as R1-1.5b. Quantization without telling users also confuses noobs as to what is possible. Setting up an API different from the thing they're wrapping adds chaos. And claiming each feature added to llama.cpp as something "Ollama now supports" is exceedingly questionable, especially when combined with the very sparse acknowledgement that it's a wrapper at all.
Whole thing just doesn't have good vibes
https://ollama.com/library/deepseek-r1
That might have been OK if it was just the same model at different sizes, but they're completely different things here, and it's created confusion out of thin air for absolutely no reason other than Ollama being careless.
I tried llama-cpp with the Vulkan backend and doubled the amount of tokens per second. I was under the impression ROCm is superior to Vulkan, so I was confused about the result.
In any case, I've stuck with llama-cpp.
It took them very long to support KV cache quantisation too (which drastically reduces the amount of VRAM needed for context!), even though the underlying llama.cpp had offered it for ages. And they had it handed to them on a platter: someone had developed everything and submitted a patch.
The developer of that patch was even about to give up, as he had to constantly keep it up to date with upstream while being constantly ignored. So he had no idea if it would ever be merged.
They just seem to be really hesitant to offer new features.
Eventually it was merged and it made a huge difference to people with low VRAM cards.
I am the least qualified person to comment on that, but honestly their response made me raise an eyebrow.
Vulkan backends are existential for running LLMs on consumer hardware (iGPUs especially). It's sad to see Ollama miss this opportunity.
As usual, the real work seems to be appropriated by people who do the last little bit — put an acceptable user experience and some polish on it — and they take all the money and credit.
It’s shitty but it also happens because the vast majority of devs, especially in the FOSS world, do not understand or appreciate user experience. It is bar none the most important thing in the success of most things in computing.
My rule is: every step a user has to do to install or set up something halves adoption. So if 100 people enter and there are two steps, 25 complete the process.
For a long time Apple was the most valuable corporation on Earth on the basis of user experience alone. Apple doesn’t invent much. They polish it, and that’s where like 99% of the value is as far as the market is concerned.
The reason is that computers are very confusing and hard to use. Computer people, which most of us are, don’t see that because it’s second nature to us. But even for computer people you get to the point where you’re busy and don’t have time to nerd out on every single thing you use, so it even matters to computer people in the end.
I know there will be people who disagree with this, and that's ok. This is my personal experience with Python in general, and 10x worse when I need to figure out all the compatible packages with specific ROCm support for my GPU. This is madness; even C and C++ setup and build is easier than this Python hell.
I'd agree that Python packaging is generally bad, and that within an LLM context it's a disastrous mess (especially for ROCm), but that doesn't appear to be how RamaLama is using it at all.
There are really no major Python dependency problems; people have been running this on many Linux distros, macOS, etc.
We deliberately don't use python libraries because of the packaging problems.
Side note: `uv` is a new package manager for python that replaces the pips, the virtualenvs and more. It's quite good. https://github.com/astral-sh/uv
https://github.com/ollama/ollama/pulls/ericcurtin
They merged a one-line change of mine, but you can't get any significant PRs in.
Great to see this.
PS. Have you got feedback on whether this works on Windows? If not, I can try to create a build today.
For that matter, some people are still having issues building and running it, as seen from the latest comments on the linked GitHub page. It's not clear that it's even in a fully reviewable state just yet.
https://github.com/9cb14c1ec0/ollama-vulkan
I successfully ran Phi4 on my AMD Ryzen 7 PRO 5850U iGPU with it.