Why would I use those models on your cloud instead of using Google's or Anthropic's models? I'm glad there are open models available and that they get better and better, but if I'm paying money to use a cloud API I might as well use the best commercial models, I think they will remain much better than the open alternatives for quite some time.
Fast forward to now: open models are quickly catching up, usually at a significantly lower price point, and they can be customized for specific tasks instead of staying general purpose. For general-purpose use, the closed models are absolutely still dominating.
Or spend $20 a month for models even a 5090 couldn't run, and not have to pay for your own electricity, hardware, maintenance, updates, etc.
This is why everyone needs to grab every flavour and speedrun building all the tools they'll need for when the infinite money faucets are turned off.
At some point companies will start raising prices or moving towards per-token pricing (which is sustainable, but expensive).
And with that in mind, I definitely don't use more than a couple of bucks a month in API refills (not that I'm really a power user or anything).
So if you consider the 20 bucks to be balanced between power and non-power users, and with the existing rate limits, it's probably not that far off being profitable, at least on the pure inference side.
This is almost exactly how DuckDB/MotherDuck functions, and I think they're doing an excellent job.
I tried it a while back, I was very surprised to find that simply running `uvx ramalama run deepseek-r1:1.5b` just worked. I'm on Fedora Silverblue with nothing layered on the ostree. Before RamaLama, getting llama.cpp working with my GPU was a major PITA.
* Work with somebody like System76 or Framework to create great hardware systems that come with their ecosystem preinstalled.
* Build out a PaaS, perhaps in partnership with an existing provider, that makes it easy for anybody to do what Ollama search does. I'm more than half certain I could convince our cash-strapped organization to ditch Elasticsearch for that.
* Partner with Home Assistant, get into home automation and wipe the floor with Echo and its ilk (yeah basically resurrect Mycroft but add whole-house automation to it).
Each of those is half-baked, but it also took me 7 minutes to come up with them, and they seem more in line with what Ollama tries to represent than a pure cloud play using low-power models.
This is the play. It's only a matter of time till they do it. Investors will want their returns.
And are they VC funded? Are they funded by Y Combinator or anything else?
I just thought it was a project by someone to write something similar to Docker but for LLMs, and that was its pitch for a really, really long time, I think.
Gotta pay those VCs their juicy returns somehow.
Ollama is beloved by people who know how to write 5 lines of python and bash to do API calls, but can't possibly improve the actual app.
Qwen3 235b
Deepseek 3.1 671b (thinking and non thinking)
Llama 3.1 405b
GPT OSS 120b
Those are hardly "small inferior models".
What is really cool is that you can set Codex up to use Ollama's API and then have it run tools on different models.
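For context, a minimal sketch of what talking to a local Ollama server through an OpenAI-compatible client looks like; tools like Codex can be pointed at the same /v1 endpoint. The model name and prompt here are illustrative assumptions, not details from the thread:

# Minimal sketch (not from the thread): any OpenAI-compatible client can talk to
# a local Ollama server on its /v1 endpoint. Model name is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally
resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Explain what this repo does in two sentences."}],
)
print(resp.choices[0].message.content)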
I was thinking about trying ChatGPT Pro, but I seem to have completely missed that they bumped the price from $100 to $200. It was $100 just a while ago, right? Before GPT-5, I assume.
Like I had Codex + gpt-5-codex (20€ tier) build me a network connectivity monitor for my very specific use case.
It worked, but had some really weird choices. Gave it to Claude Code (20€ tier again) and it immediately found a few issues and simplifications.
Here's a good example. For summarization of a page of content. Content is maybe pulled down by an agentic crawler, so using a local model to summarize is great. It's fast, doesn't cost anything (or much) and I can run it without guardrails as it doesn't represent a cost risk if it ran out of control.
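As a rough sketch of that flow (the fetching library, model name, and truncation are assumptions for illustration, not details from the comment):

# Sketch: fetch a page, strip it to text, and summarize with a local model.
# The model name and the requests/BeautifulSoup choices are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

def summarize_page(url, model="qwen3:4b"):
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:8000]
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"Summarize this page:\n\n{text}", "stream": False},
    )
    return resp.json()["response"]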
1. Access to specific large open models (Qwen3 235b, Deepseek 3.1 671b, Llama 3.1 405b, GPT OSS 120b)
2. Having them available via the Ollama API LOCALLY
3. The ability to set up Codex to use Ollama's API for running tools on different models
I mean, really, nothing else is even close at this point and I would rather eat a bug than use Microsoft's cloud.
At some level it's also more of a principle that I could run something locally that matters rather than actually doing it. I don't want to become dependent on technology that someone could take away from me.
Not particularly. Indexes are sort of like railroads. They're costly to build and maintain. They have significant external costs. (For railroads, in land use. For indexes, in crawler pressure on hosting costs.)
If you build an index, you should be entitled to a return on your investment. But you should also be required to share that investment with others (at a cost to them, of course).
If many thousands of people care about having a free / private / distributed search engine, wouldn't it make sense for them to donate 1% of their CPU/storage/network to an indexer / db that they then all benefit from?
How do you make it trustless? How do you fetch/crawl the index when it's scattered across arbitrary devices? How do you index the decentralized index? What is actually stored on nodes? When you want to do something useful with the crawled info, what does that look like?
You'd figure out a replication strategy based on observed reliability (Lindy effect + uptime %).
It would be less "5 million flaky randoms" and more "5,000 very reliable volunteers".
Though for the crawling layer you can and should absolutely utilize 5 million flaky randoms. That's actually the holy grail of crawling. One request per random consumer device.
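A back-of-the-envelope sketch of that reliability tradeoff, assuming independent node failures and made-up uptime numbers:

import math

def replicas_needed(node_uptime, target_availability):
    # k replicas of a shard give availability 1 - (1 - p)^k for per-node uptime p
    return math.ceil(math.log(1 - target_availability) / math.log(1 - node_uptime))

print(replicas_needed(0.95, 0.999999))  # reliable volunteers: ~5 copies per shard
print(replicas_needed(0.30, 0.999999))  # flaky consumer devices: ~39 copies per shard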
I think the actual issue wouldn't be technical but selection: how do you decide what's worth keeping?
You could just do it on a volunteer basis. One volunteer really likes Lizard Facts and volunteers to host that. Or you could dynamically generate the "desired semantic subspace" based on the search traffic...
In 2015, I was working at a startup incubator hosted inside of an art academy.
I took a nap on the couch. I was the only person in the building, so my full attention was devoted to the strange sounds produced by the computers.
There were dozens of computers there. They were all on. They were all wasting hundreds of watts. They were all doing essentially nothing. Nothing useful.
I could feel the power there. I could feel, suddenly, all the computers in a thousand mile radius. All sitting there, all wasting time and energy.
> Dear API user, We’re excited to launch the Perplexity Search API — giving developers direct access to the same real-time, high-quality web index that powers Perplexity’s answers.
Crucially, I want to understand the license that applies to the search results. Can I store them, can I re-publish them? Different providers have different rules about this.
The search results are yours to own and use; you are free to do what you want with them. Of course, you are bound by the local laws of the legal jurisdiction you are in.
We will continue to monitor what works best to improve output quality and results; sometimes a combination of providers yields even better results. If I named one combination right now and later realized another combination is better and made changes, I'd either have to broadcast it each time or risk misrepresenting the feature, which is to have great search and research capabilities that can augment models for superior output.
On making the search functionality work locally -- we considered it and gave it a try, but ran into trouble with result quality and with websites blocking Ollama's crawler. Using a hosted API, we can get results for users much faster. I'd want us to revisit this at some point. I believe in having the power of local.
Thanks! Please do!
Brave: https://api-dashboard.search.brave.com/terms-of-service "Licensee shall not at any time, and shall not permit others to: store the results of the API or any derivative works from the results of the API"
Exa: https://exa.ai/assets/Exa_Labs_Terms_of_Service.pdf "You may not [...] download, modify, copy, distribute, transmit, display, perform, reproduce, duplicate, publish, license, create derivative works from, or offer for sale any information contained on, or obtained from or through, the Services, except for temporary files that are automatically cached by your web browser for display purposes"
Many of the things I want to do with a search API are blocked by these rules! So I need to know which rules I am subject to.
Especially when I'm building databases that I want other organizations to be able to use.
Fun fact: many geocoding APIs have restrictions on what you can do with the data you get back from that geocoder - including how long you can store it and whether you are allowed to re-syndicate to other people. That's one of the reasons I like OpenCage: https://opencagedata.com/guides/how-to-compare-and-test-geocoding-services#datausability
That's admittedly a pretty foolish behaviour on their part and doesn't instill trust in Ollama as a service provider, but you as the end-user should be in the clear.
It's OK to pirate a massive amount of books if you're not reading or sharing, but rather just training an AI.
And by the way I prefer Google's approach in this particular case
Zuckerberg strikes me as far too adaptive, too fair weather
Or is it about sharing the domains of mirrors?
It makes me wonder if they’ve partnered with another of their VC’s peers who’s recently had a cash injection, and they’re being used as a design partner/customer story.
Exa would be my bet. YC backed them early, and they've also just closed an $85M Series B. Bing would be too expensive to run freely without a Microsoft partnership.
Get on that privacy notice soon, Ollama. You’re HQ’d in CA, you’re definitely subject to CCPA. (You don’t need revenue to be subject to this, just being a data controller for 50,000 Californian residents is enough.)
https://oag.ca.gov/privacy/ccpa
I can imagine the reaction if it turns out the zero-retention provider backing them ended up being Alibaba.
Caching is a problem with many geocoding APIs (which I happen to be familiar with), and a good reason to prefer e.g. OpenCage over the Google or Here geocoders: unlike most geocoder terms and conditions, OpenCage actually encourages you to cache and store things, because it's all open data. The Here geocoder requires you to tell them how much data you store and will try to charge you extra for the privilege of storing and keeping data around, because it's their data and the conditions under which they license it to you limit what you can and cannot do. Search APIs are very similar. Technically, geocoding is a form of search (given a query, return a list of stuff).
Dead on arrival. Thanks for playing, Ollama, but you've already done the leg work in obsoleting yourself.
From where I'm standing, there's not enough money in B2C GPU hosting to make this sort of thing worthwhile. Features like paid search APIs really hammer home how difficult it is to provide value around that proposition.
I like using ollama locally and I also index and query locally.
I would love to know how to hook ollama up to a traditional full-text-search system rather than learning how to 'fine tune' or convert my documents into embeddings or whatnot.
https://github.com/mjochum64/mcp-solr-search
A slightly heavier lift, but only slightly, would be to also use Solr to store a vectorized version of your docs and simultaneously do vector similarity search; Solr has built-in kNN support for it. Pretty good combo to get good quality with both semantic and full-text search.
Though I’m not sure if it would be relatively similar work to do Solr with ChromaDB for the vector portion, and marry the result stewards via LLM pixie dust (“you are the helpful officiator of a semantic full-text matrimonial ceremony” etc). Also not sure of the relative strengths of ChromaDB vs Solr there; maybe it scales better for larger vector stores?
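For the plain full-text half of that, a minimal sketch of wiring a local Solr core to a local Ollama model (the core name, field name, and model are assumptions; the kNN/vector piece would sit alongside this):

# Sketch: query a local Solr core, then hand the hits to a local Ollama model.
# Core name "docs", field "content", and the model name are illustrative assumptions.
import requests

def solr_search(query, rows=5):
    resp = requests.get(
        "http://localhost:8983/solr/docs/select",
        params={"q": f"content:({query})", "rows": rows, "fl": "content", "wt": "json"},
    )
    docs = resp.json()["response"]["docs"]
    return [" ".join(d["content"]) if isinstance(d.get("content"), list) else d.get("content", "")
            for d in docs]

def answer(question):
    context = "\n\n".join(solr_search(question))
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": f"{context}\n\nQuestion: {question}", "stream": False},
    )
    return resp.json()["response"]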
Is https://ollama.com/blog/tool-support not it?
Looking forward to trying it with a few shell scripts (via the llm-ollama extension for the amazing Python ‘llm’) or Raycast (the lack of web search support for Ollama has been one of my biggest reasons for preferring cloud-hosted models).
I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.
My rough idea for architecture:
- Crawler: A lightweight scraper that visits each site periodically.
- Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.
- Storage: Store raw HTML and text locally, maybe compress older snapshots.
- Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings.
I would do periodic updates and build a small web UI to browse.
Anyone tried it or are there similar projects?
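A rough sketch of the indexer and search layer using Whoosh, as mentioned above (the fetch step, schema, and index directory are placeholder assumptions; crawler scheduling and snapshot storage are omitted):

# Sketch of the indexer + search layer with Whoosh; details are illustrative.
import os
import requests
from bs4 import BeautifulSoup
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(url=ID(stored=True, unique=True), body=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

def add_page(url):
    # Fetch a page, strip it to text, and (re)index it under its URL.
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    with ix.writer() as writer:
        writer.update_document(url=url, body=text)

def search(query, limit=10):
    # Keyword search over the inverted index; Whoosh scores with BM25F by default.
    with ix.searcher() as searcher:
        q = QueryParser("body", ix.schema).parse(query)
        return [(hit["url"], hit.score) for hit in searcher.search(q, limit=limit)]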
Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop.
Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests, e.g. the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)
Crawling was tricky. Something like stackoverflow will stop returning pages when it detects that you're crawling, much sooner than you'd expect.
For starters, this is completely optional. You can stay completely local, or publish your own models to ollama.com to share with others.
Even with heavy AI usage I'm only at like 400/1000 for the month.
Like a full search engine that can visit pages on your behalf. Is anyone building this?
It takes lots of servers to build a search engine index, and there’s nothing to indicate that this will change in the near future.
Many sites have hidden sitemaps that cannot be found unless submitted to Google directly (most of the time they're not even listed in robots.txt). There is no way a local LLM can keep an up-to-date index of the internet.
I wonder how they plan to monetize their users. Doesn't sound promising.
I personally found Ollama to be an easy way to try out local LLMs and appreciate them for that (and I still use it to download small models on my laptop and phone (via termux)), but I've long switched to llama.cpp + llama-swap[2] on my dev desktop. I download whatever ggufs I want from hugging face and just do `git pull` and `cmake --build build --config Release` from my llama.cpp directory whenever I want to update.
1: https://www.ycombinator.com/companies/ollama
2: https://github.com/mostlygeek/llama-swap
This is a new umbrella project for llama.cpp and whisper.cpp. The author, Georgi Gerganov, also announced he's forming a company for the project, having raised money from Nat Friedman (former GitHub CEO) and Daniel Gross (ex-YC AI, ex-Apple ML).
Not sure if this is just good-faith support.
OpenAI, xAI, and Gemini all suffer from not being allowed on their respective competitors' sites.
This search works well for me in some quick tests on YT videos, which OpenAI web search can't access. It kind of failed on X but sometimes returned OK, relevant results. Definitely hit and miss, but on average good.
However I found that Google gives better results, so I switched to that. (I forget exactly but I had to set up something in a Google dev console for that.)
I think the DDG one is unofficial, and the Google one has limits (so it probably wouldn't work well for deep research type stuff).
I mostly just pipe it into LLM APIs. I found that "shove the first few Google results into GPT, followed by my question" gave me very good results most of the time.
It of course also works with Ollama, but I don't have a very good GPU, so it gets really slow for me on long contexts.
https://programmablesearchengine.google.com/controlpanel/create
And then it's just a GET:
import os
import requests

url = "https://customsearch.googleapis.com/customsearch/v1"

def search(query):
    # "cx" is the Programmable Search Engine ID, "key" is the API key
    params = {
        "q": query,
        "cx": os.getenv("GOOGLE_SEARCH_API_ID"),
        "key": os.getenv("GOOGLE_SEARCH_API_KEY"),
    }
    response = requests.get(url, params=params)
    results = response.json()["items"]
    return results
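And the "first few results into the model, then the question" pattern from above is only a few more lines; this sketch points at a local Ollama server for illustration, and the model name is an assumption:

# Sketch: feed the top search snippets plus the question to a local model.
# Uses the search() helper above; the model name is an illustrative assumption.
import requests

def ask(question, num_results=3):
    snippets = [r.get("snippet", "") for r in search(question)[:num_results]]
    prompt = "\n\n".join(snippets) + f"\n\nQuestion: {question}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen3:8b", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]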
pip install transformers
transformers chat Qwen/Qwen2.5-0.5B-Instruct
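Roughly the same thing from Python instead of the CLI, as a sketch (generation settings are arbitrary):

# Sketch: Python equivalent of the CLI chat above; settings are arbitrary.
from transformers import pipeline

chat = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user", "content": "Give me a one-line summary of what you are."}]
out = chat(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])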
During the preview period we want to start by offering a $20/month plan tailored for individuals, and we are monitoring usage and making changes as people hit rate limits so we can satisfy most use cases and be generous.
Or is this just someone trying to monetize Meta open source models?
What's a good Ollama alternative (for keeping 1-5x RTX 3090 busy) if you want to run things like open-webui (via an OpenAI-compatible API) where your users can choose between a few LLMs?
For smaller models, it can augment them with the latest data fetched from the web, solving the problem of smaller models lacking specific knowledge.
For larger models, it can start functioning as deep research.