I'm also interested to see if that small minority of people are willing to pay for a service like this.
The solution is having proper governance for OSS projects that matter, with independent organizations made up of developers, companies, and users taking care of the governance. A lot of projects that have that have lasted for decades and will likely survive for decades more.
And part of that solution is to also steer clear of projects without that. I've been burned a couple of times now getting stuck with OSS components where the license was changed and the companies behind it had their little IPOs and started serving shareholders instead of users (Elastic, Redis, Mongo, etc.). I only briefly used Mongo, got a whiff of where things were going, and just cut loose from it. With Elastic, the license shenanigans started shortly after their IPO and have been very disruptive to the community (with half of it using OpenSearch now). With Redis, I planned the switch to Valkey the second it was announced. A clear-cut case of cutting loose. Valkey looks like it has proper governance. Redis never had that.
Ollama seems relatively OK by this benchmark. The software (the ollama server) is MIT licensed and there appears to be no contributor license agreement in place. But it's a small group of people that do most of the coding, and they all work for the same VC-funded company behind Ollama. That's not proper governance. They could fail. They could relicense. They could decide that they don't like open source after all. Etc. Worth considering before you bet your company on making this a foundational piece of your tech stack.
He (almost) single-handedly brought LLMs to the masses.
With the latest news of some AI engineers' compensation reaching up to a billion dollars, it feels a bit unfair that Georgi is not getting a much larger slice of the pie.
I think he's happy doing his own thing.
But then, if someone came in with a billion ... who wouldn't give it a thought?
$50M, now that's just perfect: you're retired, not burdened with a huge responsibility.
> ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.
Now I am going to go and write a wrapper around llama.cpp that is open source only and truly local.
How can I trust ollama not to sell my data?
You don't need to use Turbo mode; it's just there for people who don't have capable enough GPUs.
- Speed
- Cost
- Reliability
- Feature Parity (eg: context caching)
- Performance (What quant level is being used...really?)
- Host region/data privacy guarantees
- LTS
And that's not even including the decision of what model you want to use!
Realistically, if you want to use an OSS model instead of the big 3, you're faced with evaluating models/providers across all these axes, which can require a fair amount of expertise to discern. You may even have to write your own custom evaluations. Meanwhile Anthropic/OAI/Google "just work" and you get what it says on the tin, to the best of their ability. Even if they're more expensive (and they're not that much more expensive), you are basically paying for the privilege of "we'll handle everything for you".
I think until providers start standardizing OSS offerings, we're going to continue to exist in this in-between world where OSS models theoretically are at performance parity with closed source, but in practice aren't really even in the running for serious large scale deployments.
> The order, embedded under and issued on Might 13, 2025, by U.S. Justice of the Peace Decide Ona T. Wang
Is this some meme where “may” is being replaced with “might”, or some word substitution gone awry? I don’t get it.
It is maybe not coincidental that "may" and "might" mean nearly the same thing which bolsters the case for auto correct gone awry.
If I use local/OSS models it's specifically to avoid running in a country with no data protection laws. It's a near miss here.
All things considered though, Europe is getting confusing. They have GDPR, but are now pushing to backdoor encryption within the EU? [1]
At least there isn't a strong movement in the US trying to outlaw E2E encryption.
[1] https://www.eff.org/deeplinks/2025/06/eus-encryption-roadmap-makes-everyone-less-safe
Which brings up the question: are truly private LLMs possible? Where the input I provide is only meaningful to me, but the LLM can still transform it without gaining any contextual value out of it? Without sharing a key? If this can be done, can it be done performantly?
Yes, there is gonna be a new discussion on it on October 15, but I've already seen parts of governments come out against their own government's position on the bill (the Swedish Military, for example).
I guarantee that nobody cares about or will be surveilling your private AI use unless you're doing other things that warrant surveillance.
The reason big providers suck, as OpenAI is so nicely demonstrating for us, is that they retain everything, the user is the product, and court cases and other situations can unmask and expose everything you do on a platform to third parties. This country seriously needs a digital bill of rights.
The biggest game in town has been managing platforms that give owners an information advantage. But at least the world generally trusts the USA to abide by laws and user agreements, which is why, to my mind, the USA retains the near monopoly on information platforms.
I personally wouldn’t trust a UK platform for example, being a Brit native. The top echelon talent pool is so small and incestuous I don’t believe I would experience a fair playing field if a business of mine passed a certain size of national reach/importance.
EDIT: from ChatGPT, new-money entrepreneurs with no inheritance/political ties by economic region: USA ~63%, UK/Hong Kong/Singapore ~45%, Emerging Markets ~35%, EU ~22%, Russia ~10%
https://www.anthropic.com/pricing - $0 / $17 (if billed annually) / $20 (if billed monthly) / $100 / $25 (team) / custom enterprise pricing / on-demand API pricing
Sounds like tiers to me.
Thankfully, this may just leave more room for other open source local inference engines.
This isn't Anaconda, they didn't do a bait and switch to screw their core users. It isn't sinful for devs to try and earn a living.
If you earn a living using something someone else built, and expect them not to earn a living, your paycheck has a limited lifetime.
“Someone” in this context could be a person, a team, or a corporate entity. Free may be temporary.
It was always just a wrapper around the real, well-designed OSS, llama.cpp. Ollama even messes up the names of models by giving distilled models the name of the actual one, as with DeepSeek.
Ollama's engineers created Docker Desktop, and you can see how that turned out, so I don't have much faith in them to continue to stay open given what a rugpull Docker Desktop became.
There are areas where we will make money, and I wholly believe that if we follow our conscience we can create something amazing for the world while making sure we can keep it fueled for the long term.
Some of the ideas in Turbo mode (completely optional) are to serve the users who want a faster GPU, and to add capabilities like web search. We loved the experience so much that we decided to give web search to non-paid users too. (Again, it's fully optional.) Now, to prevent abuse and make sure our costs don't get out of hand, we require login.
Can't we all just work together and create a better world? Or does it have to be so zero sum?
For Turbo mode I understand the need for paying, but the main point of running a local model with web search is browsing from my computer without using any LLM provider. Also, I want to get rid of the latency to US servers from Europe.
If ollama can't do it, maybe a fork.
Wait until it makes significant amounts of money. Suddenly the priorities will be different.
I don’t begrudge them wanting to make some money off it though.
I'm not sure which package we use that is triggering this. My guess is llama.cpp based on what I see on social? Ollama has long shifted to using our own engine. We do use llama.cpp for legacy and backwards compatibility. I want to be clear it's not a knock on the llama.cpp project either.
There are certain features we want to build into Ollama, and we want to be opinionated on the experience we want to build.
Have you supported our past gigs before? Why not be more happy and optimistic in seeing everyone build their dreams (success or not).
If you go build a project of your dreams, I'd be supportive of it too.
Docker Desktop? One of the most memorable private equity rugpulls in developer tooling?
Fool me once, shame on you; fool me twice, shame on me.
llama.cpp isn't (just) a C++ library/codebase -- it's a CLI application, server application (llama-server), etc.
Developers continue to be blind to usability and UI/UX. Ollama lets you just install it, just install models, and go. The only other thing really like that is LM-Studio.
It's not surprising that the people behind it are Docker people. Yes you can do everything Docker does with Linux kernel and shell commands, but do you want to?
Making software usable is often many orders of magnitude more work than making software work.
So does the original llama.cpp. And you won't have to deal with mislabeled models and insane defaults out of the box.
Then the only thing that's missing seems to be a canonical way for clients to instantiate that, ideally in some OS-native way (systemd, launchd, etc.), and a canonical port that they can connect to.
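On Linux, a user-level systemd unit would already get close; a rough sketch, assuming llama-server is installed at /usr/local/bin and using its default port 8080 as the canonical one (the model path here is just a placeholder):

# Sketch: run llama-server as a user service on a fixed port (binary and model paths are assumptions).
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/llama-server.service <<'EOF'
[Unit]
Description=llama.cpp server (OpenAI-compatible API + web UI)

[Service]
# %h expands to the user's home directory
ExecStart=/usr/local/bin/llama-server -m %h/models/gpt-oss-20b.gguf --port 8080
Restart=on-failure

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now llama-server

Clients could then just assume localhost:8080, much like they assume Ollama's 11434 today; launchd on macOS could do the same with a LaunchAgent.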
No inference engine does all of:
- Model switching
- Unload after idle
- Dynamic layer offload to CPU to avoid OOM
All companies that raise outside investment follow this route.
No exceptions.
And yes this is how ollama will fall due to enshittification, for lack of a better word.
If I could have consistent and seamless local-cloud dev, that would be a nice win. Everyone has to write things 3x over these days depending on your garden of choice, even with langchain/llamaindex.
Always had a bad feeling when they didn't give ggerganov/llama.cpp their deserved credit for making Ollama possible in the first place; a true OSS project would have. Now it makes more sense through the lens of a VC-funded project looking to grab as much marketshare as possible while avoiding raising awareness of the alternative OSS projects they depend on.
Together with their new closed-source UI [1] it's time for me to switch back to llama.cpp's cli/server.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas_new_gui_is_closed_source/
The Ollama app using the signed-in-only web search tool is really pretty good.
If you want to see where the actual developers do the actual hard work, go use llama.cpp instead.
Sure, llama.cpp is the real thing and ollama is a wrapper... I would never want to use something like ollama in a production setting. But if I want to quickly get someone less technical up to speed to develop an LLM-enabled system and run qwen or w/e locally, then it's pretty nice that they have a GUI and a .dmg to install.
Since the new multimodal engine, Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library, and ask hardware partners to help optimize it.
Ollama might look like a toy, and like something trivial to build. I can say that, to keep its simplicity, we go through a great deal of struggle to make it work with the experience we want.
Simplicity is often overlooked, but we want to build the world we want to see.
I knew a startup that deployed ollama on a customer's premises, and when I asked them why, they had absolutely no good reason. Likely they did it because it was easy. That's not the "easy to use" case you want to solve for.
We can obviously disagree with their priorities, their roadmap, the fact that the client isn't FOSS (I wish it was!), etc., but no one can say that ollama doesn't work. It works. And like mchiang said above: it's dead simple, on purpose.
(any differences are small enough that they either shouldn't cause the human much work or can very easily be delegated to AI)
Then you want to swap models on the fly. llama-swap, you say? Now you get to learn a new custom YAML-based config syntax that does basically nothing the Ollama Modelfile doesn't already do, so that you can ultimately... have the same experience as Ollama, except you've lost hours just to get back to square one.
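To make the comparison concrete, the llama-swap config in question looks roughly like this; a sketch from memory, so treat the key names (models/cmd/proxy/ttl) and the paths as assumptions and check the project's README before copying:

# Hypothetical llama-swap config: one entry per model, each pointing at its own llama-server.
cat > llama-swap.yaml <<'EOF'
models:
  "qwen2.5-coder":
    cmd: llama-server --port 9001 -m /models/qwen2.5-coder-7b-q4_k_m.gguf
    proxy: http://127.0.0.1:9001
    ttl: 300   # seconds of idle time before the model is unloaded
EOF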
Then you need it to start and be ready on system reboot? Great, now you get to write some systemd services, move stuff into system-level folders, create some groups and users, and poof, there goes another hour of your time.
Why does this matter? For this specific release, we benchmarked against OpenAI's reference implementation to make sure Ollama is on par. We also spent a significant amount of time getting harmony implemented the way it was intended.
I know vLLM also worked hard to implement against the reference and have shared their benchmarks publicly.
The GGML library is llama.cpp. They are one and the same.
Ollama made sense when llama.cpp was hard to use. Ollama does not have a value proposition anymore.
The models are implemented by Ollama https://github.com/ollama/ollama/tree/main/model/models
I can say as a fact, for the gpt-oss model, we also implemented our own MXFP4 kernel. Benchmarked against the reference implementations to make sure Ollama is on par. We implemented harmony and tested it. This should significantly impact tool calling capability.
I'm not sure if I'm feeding here. We really love what we do, and I hope it shows in our product, in Ollama's design, and in our voice to our community.
You don’t have to like Ollama. That’s subjective to your taste. As a maintainer, I certainly hope to have you as a user one day. If we don’t meet your needs and you want to use an alternative project, that’s totally cool too. It’s the power of having a choice.
Is there a schedule for adding additional models to Turbo mode plan, in addition to gpt-oss 20/120b? I wanted to try your $20/month Turbo plan, but I would like to be able to experiment with a few other large models.
GGML is llama.cpp. It is developed by the same people as llama.cpp and powers everything llama.cpp does. You must know that. The fact that you are ignoring it is very dishonest.
Nope…
Where can I learn more about this? llama.cpp is an inference application built using the ggml library. Does this mean Ollama now has its own code for what llama.cpp does?
We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama comes out on top. We hope to be able to publish these results soon.
If you can't get access to "real" datacenter GPUs for any reason and essentially do desktop, clientside deploys, it's your best bet.
It's not a common scenario, but a desktop with a 4090 or two is all you can get in some organizations.
For Draw Things' provided "Cloud Compute", we don't retain any data either (everything is done in RAM per request). But that is still unsatisfactory to me personally. We will soon add "privacy pass" support, but that's still not satisfactory. A transparency log that can be attested on the hardware would be nice (since we run our open-source gRPCServerCLI too), but I just don't know where to start.
[full disclosure I am working on something with actual privacy guarantees for LLM calls that does use a transparency log, etc.]
Edit: emailed the address on the site in your profile, got an inbox does not exist error.
It is completely compromised, especially if it is an AI company.
How do you think ollama was able to provide the open source AI models to everyone for free?
I am pretty sure ollama was losing money on every pull of those images from their infrastructure.
Those that are now angry at ollama charging money or not focusing on privacy should have been angry when they raised money from investors.
https://github.com/ollama/ollama/issues/5245
If any of the major inference engines - vLLM, SGLang, llama.cpp - incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, it would avoid the need for ollama.
However, the approach to model swapping is not "ollama compatible", which means all the OSS tools supporting ollama (e.g. Open WebUI, OpenHands, Bolt.diy, n8n, Flowise, browser-use, etc.) aren't able to take advantage of this particularly useful capability, as best I can tell.
This allows you to try out some open models and better assess if you could buy a dgx box or Mac Studio with a lot of unified memory and build out what you want to do locally without actually investing in very expensive hardware.
Certain applications require good privacy control and on-prem and local are something certain financial/medical/law developers want. This allows you to build something and test it on non-private data and then drop in real local hardware later in the process.
I feel like they're competing against Hugging Face or even Colaboratory then if this is the case.
And for cases that require strict privacy control, I don't think I'd run it on emergent models or if I really have to, I would prefer doing so on an existing cloud setup already that has the necessary trust / compliance barriers addressed. (does Ollama Turbo even have their Trust center up?)
I can see its potential once it gets rolling, since there are a lot of ollama installations out there.
I pay $20 to Anthropic, so I don’t think I’d get enough use out of this for the $20 fee. But being able to spin up any of these models and use as needed (and compare) seems extremely useful to me.
I hope this works out well for the team.
Agreed, though there are already several providers of these new OpenAI models available, so I'm not sure what ollama's value add is there (there are plenty of good chat/code/etc interfaces available if you are bringing your own API keys).
Usage-based pricing would put them in competition with established services like deepinfra.com, novita.ai, and ultimately openrouter.ai. They would go in with more name-recognition, but the established competition is already very competitive on pricing
In a universe where everything you say can be taken out of context, things like OpenAI will be a data-leak nightmare.
Need this soon:
It's very unfortunate that the local inference community has aggregated around Ollama when it's clear that's not their long term priority or strategy.
It's imperative we move away ASAP.
Is it bad to fairly charge money for selling GPUs that cost us money too, and use that money to grow the core open-source project?
At some point, it just has to be reasonable. I'd like to believe that by having a conscience, we can create something great.
What I'm referring to is a broader pattern that I (and several others) have been seeing. Off the top of my head: not crediting llama.cpp previously; still not crediting llama.cpp now, and saying you are using your own inference engine when you are still using ggml and the core of what Georgi made (and more importantly, why even create your own version rather than contribute to llama.cpp for the good of the community?); making your own proprietary model storage platform that disallows using the weights with other local engines, requiring people to duplicate downloads; and more.
I don't know how to regard these other than as largely motivated by self-interest.
I think what Jeff and you have built has been enormously helpful to us. Ollama is how I got started running models locally and I have enjoyed using it for years now. For that, I think you guys should be paid millions. But what I fear is going to happen is that you will go the way of the current dogma of capturing users (at least in mindshare) and then continually squeezing more. I would love to be wrong, but I am not going to stick around to find out, as it's a risk I cannot take.
They have very detailed quick start docs on it: https://docs.openwebui.com/getting-started/quick-start/starting-with-llama-cpp
I do also need an API server though. The one built into OpenWebUI is no good because it always reloads the model if you use it first from the web console and then run an API call using the same model (like literally the same model from the workspace). Very weird but I avoid it for that reason.
I moved away from ollama in favor of llama-server a couple of months ago and never missed anything, since I'm still using the same UI.
Assuming you have llama-server installed, you can download + run a hugging face model with something like
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja
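That also gives you an OpenAI-compatible API on the same port, so most existing clients can point straight at it; a minimal sketch (the model name in the payload is just a placeholder):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-20b", "messages": [{"role": "user", "content": "Hello!"}]}'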
And you can open http://localhost:8080 in a browser for the built-in web UI.

Ollama does not use llama.cpp anymore; we do still keep it and occasionally update it to remain compatible with older models from when we used it. The team is great; we just have features we want to build and want to implement the models directly in Ollama. (We do use GGML and ask partners to help with it. This is a project that also powers llama.cpp and is maintained by that same team.)
That is interesting, did Ollama develop its own proprietary inference engine or did you move to something else?
Any specific reason why you moved away from llama.cpp?
> We do use GGML
Sorry, but this is kind of hiding the ball. You don't use llama.cpp, you just ... use their core library that implements all the difficult bits, and carry a patchset on top of it?
Why do you have to start with the first statement at all? "we use the core library from llama.cpp/ggml and implement what we think is a better interface and UX. we hope you like it and find it useful."
% diff -ru ggml/src llama.cpp/ggml/src | grep -E '^(\+|\-) .*' | wc -l
1445
i.e., as of time of writing, +/- 1445 lines between the two, out of about 175k total lines, a lot of which is the recent MXFP4 stuff.

Ollama is great software. It's integral to the broader diffusion of LLMs. You guys should be incredibly proud of it and the impact it's had. I understand the current environment rewards bold claims, but the sense I get from some of your communications is "what's the boldest, strongest claim we can make that's still mostly technically true". As a potential user, taking those claims as true until closer evaluation reveals the discrepancy feels pretty bad, and keeps me firmly in the 'potential' camp.
Have the confidence in your software and the respect for your users to advertise your system as it is.
But the takeaway is pretty clearly that `llama.cpp`, `GGML`/`GGUF`, and generally `ggerganov`'s single-handedly Carmacking it when everyone thought it was impossible is all the value. I think a lot of people made Docker containers with `ggml`/`gguf` in them and one was like "we can make this a business if we realllllly push it".
Ollama as a hobby project or even a serious OSS project? With a cordial upstream relationship and massive attribution labels everywhere? Sure. Maybe even as a commercial thing that has a massive "Wouldn't Be Possible Without" page for its OSS core upstream.
But like: a startup company for making money that's (to all appearances) completely out of reach for the principals to ever do without totally `cp -r && git commit` repeatedly? It's complicated; a lot of stuff starts as a fork and goes off in a very different direction, and I got kinda nauseous and stopped paying attention at some point, but near as I can tell they're still just copying all the stuff they can't figure out how to do themselves, on an ongoing basis, without resolving the upstream drama?
It's like, in bounds barely I guess. I can't point to it being "this is strictly against the rules or norms", but it's bending everything to the absolute limit. It's not a zone I'd want to spend a lot of time in.
> [ggml] is all the value
That’s what gets me about Ollama - they have real value too! Docker is just the kernel’s cgroups/chroots/iptables/… but it deserves a lot of credit for articulating and operating those on behalf of the user. Ollama deserves the same. But they’re consistently kinda weird about owning just that?
thanks for the feedback address :)
The models are on HuggingFace and downloading them is `uvx huggingface-cli`, the `GGUF` quants were `TheBloke` (with a grant from pmarca IIRC) for ages and now everyone does them (`unsloth` does a bunch of them).
Maybe I've got it twisted, but it seems to be that the people who actually do `ggml` aren't happy about it, and I've got their back on this.
Why? If the tool works then use it. They’re not forcing you to use the cloud.
It's a tale we've seen played out many times. Redis is the most recent example.
For the users who want GPUs, which cost us money, we will charge money for it. Completely optional.
For one of the top local open-model inference engines of choice, only supporting gpt-oss out of the gate feels like an angle to just ride the hype, knowing gpt-oss launched today: "oh, gpt-oss came out and you can use Ollama Turbo to run it".
The subscription based pricing is really interesting. Other players offer this but not for API type services. I always imagine that there will be a real pricing war with LLMs with time / as capabilities mature, and going monthly pricing on API services is possibly a symptom of that
What does this mean for the local inference engine? Does Ollama have enough resources to maintain both?
I guess their target audience values convenience and ease of use above all else, so that could play well there, maybe.
Doesn't look that much better than a ChatGPT Plus subscription.
> Turbo is a new way to run open models using datacenter-grade hardware.
What? Why not just say that it is a cloud-based service for running models? Why this language?
Is it because they developed a new Ollama which isn't open and which doesn't use llama.cpp?
There is obviously some connection to Llama (the original models giving rise to llama.cpp which Ollama was built on) but the companies have no affiliation.
PS: looking for the most economical one to play around with, as long as it's a decent enough experience (minimal learning curve). But happy to pay too.
Also, realistically, Vulkan Compute support mostly helps iGPUs and older/lower-end dGPUs, which can only bring a modest speedup in the compute-bound preprocessing phase (because modern CPU inference wins in the text-generation phase due to better memory bandwidth). There are exceptions, such as modern Intel dGPUs or perhaps Macs running Asahi, where Vulkan Compute can be more broadly useful, but these are also quite rare.
Ollama is in a very sad state. The project is dysfunctional.
> OpenAI and Ollama partner to launch gpt-oss