Not sure if that's related to it ignoring quotes and operators, though. I'd imagine that to be a cost-saving measure (and a very rarely used feature, considering it keeps accusing me of being a robot when I do use them...)
From what I understand, the good old Google of the 2000s was built entirely without any kind of machine learning, just a keyword index and PageRank. Everything they've added since then seems to have made it worse (though it did also degrade "organically" from the SEO spam).
Seriously, how? I'm pretty sure you'd have to take a very different approach than Google did in its best times. The web is a very different place now.
I don't see AI in any form becoming the next awesome.
What are your thoughts on things like Bluesky/Nostr, and Matrix too?
Bluesky does seem centralized at its current stage, but its idea of PDSes(?) makes it fundamentally hack-proof, in the sense that if the server you're on gets hacked, your account is still safe. At least that's the plan; I'm not sure about its current implementation.
I also agree with AI not being the next awesome. Maybe for coding, sure, but not in general. But even in coding, man, I feel like it's good enough already, it's hard to make much more progress from here, and it's just not worth it. But honestly, that's just me.
The weakness of Mastodon (and the Fediverse, IMO) is that you can join one of many instances, and it becomes easier to form an echo chamber. Your feed will be either the Fediverse firehose (lots of irrelevant content), your local instance (an echo chamber), or your subscriptions (curating them takes effort). Nevertheless, that might as well be a strength I'm not truly appreciating.
I doubt that will happen, because of its decentralized-enough nature.
I also agree with the subscription-curation part, as of the last time I checked, but I didn't use Mastodon as often as I used Lemmy, and it was less of an issue on Lemmy.
Still, I feel like Bluesky as a technology is goated and doesn't feel like it can be enshittified.
Nostr, on the other hand, seems to me like an echo chamber of crypto bros, but honestly, that's the most decentralization you can ever get. Shame that we're going to get mostly nothing meaningful out of it, IMO. In which case Bluesky seems good enough to me, though things like search etc. / the current Bluesky are definitely centralized. But honestly, the same problems kept coming up on the Fediverse too: lemmy.world was getting too bloated with too many members, and even Mastodon had only one really famous home server AFAIK, IIRC mastodon.social, right?
Also, I may be wrong (I usually am), but IIRC Mastodon only allows you to comment on/interact with posts on your own server. I wanted to comment on mastodon.social from some other server, but I don't remember being able to do so. Maybe that's a skill issue on my side.
To reiterate: Google search results are shit because shit ad-laden results make them more money in the short term.
That's it. And it's sad that so many people continue to give them the benefit of the doubt when there is no doubt.
I'm new to networking.
I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.
For example, I searched "lemmy" hoping to find the Fediverse project, but it gave me their Liberapay page instead.
Please actually follow up on that Common Crawl promise, and maybe even archive.org or other websites too. People are spending billions in this AI industry; I just hope that, whether through funding or community crowdwork, you can actually succeed in creating such an alternative. People are honestly fed up with the current near-monopoly in search engines.
Wasn't Ecosia trying to roll out their own search engine? They should definitely take your help or have you on their team.
I just want a decentralized search engine, man. I understand that you want to make it sustainable and that's why you haven't open sourced it, but please: there is honestly so much money going into potholes that do nothing but make our society worse, while this project already works well enough and has insane potential...
Please open source it, and let's hope the community can figure out some form of monetization/crowdfunding to actually make it sustainable.
That said, I haven't read the blog post in its entirety, since I was so excited that I just started using the search engine. But the article feels super in-depth, and I feel like this idea can definitely help others create their own proofs of concept, or finally build a decent open source search engine once and for all.
Not going to lie, this feels like a little bit of magic and I am all for it. The more I think about it, the more excited I get; I haven't been this excited about a project in actual months!
I know open source is tough, and I come from a third-world country, but this is actually so cool that I will donate as much as I can / have right now. Not much, around $50, but this is coming from a guy who has never spent a single penny online and still wants to donate to you. Please, I beg you, open source it and use that Common Crawl data. Either way, I wish you all the best in your life and career, man.
Really great idea about the federated search index too! YaCy has it but it's really heavy and never really gave good results for me.
I wish more people showed their whole exploded stack like that, and in such an elegant way.
Really well done writeup!
Two months in, Bing still hasn't crawled the favicon. Google finally did after a month. I'm still getting outranked by tangentially related services, garbage national lead-collection sites, Yelp top-10 blog spam, and even exact service providers from 300 miles away that definitely don't serve the area.
Something is definitely wrong with PageRank and crawling in general.
Do you have any backlinks? If not, it’s working as intended?
Feels like it's more and more about consuming data & outputting the desired result.
What are you thinking in terms of improving [and using] the knowledge graph beyond the knowledge panel on the side? If I'm reading this correctly, it seems like you only have knowledge panel results for those top results that exist in Wikipedia, is that correct?
Just out of interest, I sent a query I've had difficulties getting good results for with major engines: "what are some good options for high-resolution ultrawide monitors?".
The response in this engine for this query, at this point, seems to suffer from the same fallacy I've seen in other engines. Meta-pages "specialising" in broad rankings are preferred over specialist data about the specific sought-after item. It seems that the desire for a ranking weighs the most.
If I were to manually try to answer this query, I would start by looking at hardware forums and geeky blogs, pick N candidates, then try to find the specifications and quirks for all products.
Of course, it is difficult to generically answer if a given website has performed this analysis. It can be favourable to rank sites citing specific data higher in these circumstances.
As a user, I would prefer to be presented with the initial sources used for assembling this analysis. Of course, this doesn't happen because engines don't perform this kind of bottom-to-top evaluation.
The whole premise of what makes a good search engine has been based on the idea of surfacing those results that most likely contain good information. If that was not the case Google would not have risen to such dominance in the first place.
I still have questions:
* How long do you plan to keep the live demo up?
* Are you planning to make the source code public?
* How many hours in total did you invest into this "hobby project" in the two months you mentioned in your write-up?
If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?
If someone like Common Crawl, or even a paid service, solves crawling the web in real time, then the moat Google has had for the last 25 years is dead and search is commoditized.
But you are pretty much the only people who can save the web from AI bots right now.
The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds, and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI-first API.
Would Common Crawl do a "for all purposes and no restrictions" license if it is for AI training, computer analysis, etc.? Especially given that the bad actors are ignoring copyrights and terms, while such restrictions only affect moral, law-abiding people?
Also, even simpler, would Common Crawl release, under a permissive license, a list of URLs that others could scrape themselves? Maybe with metadata per URL from your crawls, such as which sites use Cloudflare or other limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent search operators benefit, too.
We carefully preserve robots permissions expressed in robots.txt, in HTTP headers, and in HTML meta tags.
We do publish two different URL indexes, if you wanted to recrawl for some reason.
https://commoncrawl.org/terms-of-use
In it, (a), (d), and (g) have had overly political interpretations in many places. As for (h): on Reddit, just offering the Gospel of Jesus Christ got me hit with "harassment" once. The problem is whether what our model can be or is used for incurs liability under such a license. Also, it hardly seems "open" if we have to give up our autonomy and take on liability just to use it.
Publishing a crawl, or the URL's, under CC-0, CC-by, BSD, or Apache would make them usable without restrictions or any further legal analyses. Does CC have permissively-licensed crawls somewhere?
Btw, I brought up URLs because transferring crawled content may be a copyright violation in the U.S., but sharing URLs isn't. Are the URLs released under a permissive license that overrides the Terms of Use?
Alternatively, would Common Crawl simply change their Terms so that it doesn't apply to the Crawled Content and URL databases? And simply release them under a permissive license?
If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.
Models make it cheap to replicate and perform what tech companies do. Their once-insurmountable moats are shrinking as we speak.
Nerdsnipe?
I wonder if OpenAI uses this as a honeypot to get domain-specific source data into its training corpus that it might otherwise not have access to.
> Your data is your data. As of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models (unless you explicitly opt in to share data with us).
To your point, pretty sure it's off by default, though
Edit: From https://platform.openai.com/settings/organization/data-controls/sharing
Share inputs and outputs with OpenAI
"Turn on sharing with OpenAI for inputs and outputs from your organization to help us develop and improve our services, including for improving and training our models. Only traffic sent after turning this setting on will be shared. You can change your settings at any time to disable sharing inputs and outputs."
And I am 'enrolled for complimentary daily tokens.'
Is this the drug dealer scheme? Get you hooked, then later jack up the prices? After all, the alternative would be regenerating all your embeddings, no?
One effective old technique for ranking is to capture the search-to-click relationship from real users. It's basically human-generated training data mapping the search terms they entered to the links they clicked. With just a few clicks, the ranking relevance goes way up.
Maybe feeding the data into a neural net would help ranking. It becomes a classification problem: given these terms, which links have higher probabilities of being clicked? More people clicking on a link for a term would strengthen its weights.
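For what it's worth, a toy sketch of that framing in Python; the hashed query-document cross features and the logistic-regression model are just illustrative assumptions, not a claim about how any real engine does it:

```python
# Toy sketch: learn P(click | query, doc) from a click log and use it as a
# ranking signal. Feature scheme and model choice are illustrative only.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def cross_features(query: str, doc_id: str) -> dict:
    # Cross each query term with the doc id so the model can learn
    # "people who searched for X tend to click Y".
    terms = query.lower().split()
    feats = {f"term={t}": 1.0 for t in terms}
    feats.update({f"term_doc={t}|{doc_id}": 1.0 for t in terms})
    return feats

# Click log entries: (query, doc_id, clicked)
log = [
    ("ultrawide monitor", "doc_a", 1),
    ("ultrawide monitor", "doc_b", 0),
    ("mastodon server", "doc_c", 1),
    ("mastodon server", "doc_a", 0),
]

hasher = FeatureHasher(n_features=2**18)
X = hasher.transform(cross_features(q, d) for q, d, _ in log)
y = [clicked for _, _, clicked in log]
model = LogisticRegression().fit(X, y)

# At query time, rank candidate docs by predicted click probability.
candidates = ["doc_a", "doc_b", "doc_c"]
scores = model.predict_proba(
    hasher.transform(cross_features("ultrawide monitor", d) for d in candidates)
)[:, 1]
print(sorted(zip(candidates, scores), key=lambda pair: -pair[1]))
```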
That's not very effective. Ever heard of clickbait?
Like I've said countless times before, the only effective technique to clean the garbage out of search results is to use a point system that penalises each third-party advertisement placed on the page.
The more adverts, the lower the rank.
And the reason that will work is because you are directly addressing the incentive for producing garbage - money!
The result should be "when two sites have the same basic content, in the search results promote the one without ads over the ones with ads".
Until this is done, search engines will continue serving garbage, because they are rewarding those actors who are producing garbage.
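A minimal sketch of that point system, assuming a made-up penalty weight and an ad count produced by some separate detector:

```python
# Minimal sketch of the ad-penalty idea: subtract a fixed penalty per detected
# third-party ad slot from the base relevance score. The penalty weight and the
# ad counts are made up; detecting the ads is a separate (hard) problem.
AD_PENALTY = 0.05

def adjusted_score(base_relevance: float, third_party_ad_count: int) -> float:
    # With equal content relevance, the page with fewer ads ranks higher.
    return base_relevance - AD_PENALTY * third_party_ad_count

results = [
    {"url": "https://example.org/guide", "relevance": 0.82, "ads": 0},
    {"url": "https://example.com/guide", "relevance": 0.82, "ads": 6},
]
for r in sorted(results, key=lambda r: adjusted_score(r["relevance"], r["ads"]), reverse=True):
    print(r["url"], round(adjusted_score(r["relevance"], r["ads"]), 2))
```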
That's fine; those who want to search for news articles can use any number of existing search engines that don't penalise ads.
[1] https://www.clearview.ai/post/how-we-store-and-search-30-billion-faces
Kudos wilsonzlin. I'd love to chat sometime if you see this. It's a small space of people that can build stuff like this e2e.
I know the post primarily focuses on neural search, but I'm wondering if you tried integrating hybrid BM25 + embeddings search and whether this led to any improvements. Also, what reranking models did you find most useful and cost-efficient?
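(For context, one common way to fuse a keyword ranking with an embedding ranking is reciprocal rank fusion; this is a generic sketch, not something the post describes.)

```python
# Reciprocal rank fusion: combine several rankings by summing 1/(k + rank).
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # rankings: each inner list is doc ids ordered best-first by one retriever.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

bm25_top = ["doc_b", "doc_a", "doc_d"]       # keyword retriever
embedding_top = ["doc_a", "doc_c", "doc_b"]  # dense retriever
print(rrf([bm25_top, embedding_top]))
```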
It doesn't seem that far in distance from a commercial search engine? Maybe even Google?
$50K to run is a comically small number. I'm tempted to just give you that money as seed funding.
But seriously, what an amazing write-up, plus animations, analysis, etc. etc. Bravo.
It was also ironic to see AWS failing quite a few use cases here. Stuff to think about.
> SQS had very low concurrent rate limits that could not keep up with the throughput of thousands of workers across the pipeline.
I could not find this; perhaps the author meant Lambda limits?
> services like S3 have quite low rate limits — there are hard limits, but also dynamic per-account/bucket quotas
You have virtually unlimited throughput with prefix partitions.
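Roughly, the trick is to spread object keys across many distinct prefixes, e.g. with a short hash; the key layout below is hypothetical:

```python
# Sketch: S3 scales request throughput per key prefix, so a short hash up front
# gives the bucket many distinct prefixes to partition on and avoids hot spots.
import hashlib

def partitioned_key(url: str) -> str:
    digest = hashlib.sha256(url.encode()).hexdigest()
    return f"pages/{digest[:4]}/{digest}.html"

print(partitioned_key("https://example.com/some/page"))
```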
I definitely wanted these to "just work" out of the box (and maybe I could've worked more with AWS/OCI given more time), as I wanted to focus on the actual search.
The talent displayed here is immense. I challenge you to do better.
This type of attitude is not constructive, however. If we followed this logic, we would not have coaches for athletes, as the coaches likely cannot do better than the athletes, but that does not mean they are useless.
Of course a spammer could try to include one sentence with a very close embedding for each query they want to rank for, but this would require combinatorially more effort than keyword stuffing, where including two keywords also covers queries containing both together.
The latter point you pick up on was indeed my point: you can tweak your SEO spam to give you the embeddings you want to rank for. This actually isn't that difficult, given that you can run embedding models like SBERT in reverse, adversarially, to generate text that yields the embedding you want to target (similar to adversarial attacks on image models, where you can make a picture of the most zebra-like zebra; see the work of Ilia Shumailov, formerly Oxford, now Google DeepMind). This is rather cheap and, more importantly, far, far easier to game than ranking high on Google, where the cost function is unknown. If an off-the-shelf embedding model like SBERT is used, the attacker knows the cost function and can optimise for it.
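To make that concrete, here is a rough black-box sketch of the attack; the cited work uses gradient-based methods, so this greedy word-swap version is a simplification, and the model, seed text, and vocabulary are arbitrary:

```python
# Rough sketch: iteratively mutate spam text so its embedding moves toward a
# target query's embedding. Greedy black-box variant for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for an SBERT-style model

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target_query = "best high resolution ultrawide monitor"
target_vec = model.encode(target_query)

text = "buy cheap widgets online today"          # the spam page's seed sentence
vocab = ["monitor", "ultrawide", "resolution", "display", "5k", "screen"]

for _ in range(20):
    words = text.split()
    best_text, best_score = text, cos(model.encode(text), target_vec)
    # Try replacing each position with each candidate word; keep the best swap.
    for i in range(len(words)):
        for w in vocab:
            candidate = " ".join(words[:i] + [w] + words[i + 1:])
            score = cos(model.encode(candidate), target_vec)
            if score > best_score:
                best_text, best_score = candidate, score
    if best_text == text:
        break
    text = best_text

print(text, best_score)
```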
https://openwebsearch.eu/
It would be fantastic if someone could provide a not-for-profit, decent-quality web search engine.

Was your experience with IVF-PQ not good? I've seen big recall drops compared to HNSW, but wow, it takes some hardware to scale.
Also, did you try sparse embeddings like SPLADE? I have no idea how they scale at this size, but they seem like a good balance between keyword and semantic search.
Google still gave me a better result: https://towardsdatascience.com/sbert-deb3d4aef8a4/
Nevertheless this project looks great and I'd love to see it continue to improve.
Didn't you run into Cloudflare blocks? Many sites are using things like browser fingerprinting. I'd imagine this would be an issue with news sites particularly, as many of them will show the full content only to Google Bot, but not anyone else. Which I have long thought of as an underappreciated moat that Google has in the search market. I was surprised that this topic wasn't mentioned at all in your article. Was it not an issue, or did you just prefer to leave it out?
And you also mentioned nothing about URL de-duplication. Things like "trailing slash or no trailing slash", "query params or no query params", "www or no www". Did you have your crawlers just follow all URLs as they encountered them, and handle duplication only at the content level (e.g. using trigrams)? It sounds like that would be wasteful, as you might end up making requests to potentially 2x or more the number of URLs that you'd need to.
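For reference, a minimal canonicalization pass covering exactly those cases might look like the following; the specific rules (which params to strip, lowercasing, sorting) are illustrative, not what the article's crawler does:

```python
# Minimal URL canonicalization: trailing slash, tracking query params, www.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def canonicalize(url: str) -> str:
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower().removeprefix("www.")          # www or no www
    path = path.rstrip("/") or "/"                        # trailing slash or not
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme.lower(), netloc, path, urlencode(sorted(kept)), ""))

print(canonicalize("https://WWW.Example.com/page/?utm_source=x&b=2&a=1"))
# -> https://example.com/page?a=1&b=2
```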
Thanks.