Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic choice in sentence-transformers back when vector stores were in their infancy. But it's old: it doesn't implement the newest advances in architectures and training-data pipelines, and its context length is only 512 tokens when current embedding models can do 2k+ with even more efficient tokenizers.
For open weights, I would instead recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m), which has incredible benchmarks and a 2k context window; although it's larger and slower to encode, the payoff is worth it. As a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
Open weights, multilingual, 32k context.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
Back in June and August I wrote some LLM-assisted blog posts about a few of the experiments.
They are here: sjsteiner.substack.com
(Just took a look and it has the problem that it forbids certain "restricted uses" listed in another document that it says "is hereby incorporated by reference into this Agreement". In other words, Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
I have trouble navigating this space as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
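One practical way to benchmark for your own use case: collect a small set of queries from your domain, label which documents each query should retrieve, and measure recall@k for every candidate model. A minimal sketch; `embed` here is a hypothetical stand-in (a hashed bag-of-words) just to make it runnable, and in practice you'd plug in each model's encode function:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_k(queries, corpus, relevant, embed, k=3):
    """queries: list[str]; corpus: list[str];
    relevant: dict query -> set of corpus indices that should be retrieved;
    embed: stand-in for the model under test (str -> list[float])."""
    corpus_vecs = [embed(doc) for doc in corpus]
    hits = 0
    for q in queries:
        qv = embed(q)
        # Rank corpus documents by similarity to the query.
        ranked = sorted(range(len(corpus)), key=lambda i: -cosine(qv, corpus_vecs[i]))
        if relevant[q] & set(ranked[:k]):
            hits += 1
    return hits / len(queries)

# Toy stand-in embedder: hashed bag-of-words, only to make the sketch runnable.
def embed(text, dim=64):
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

corpus = ["the cat sat on the mat", "stock prices fell sharply", "a kitten on a rug"]
queries = ["cat on mat"]
relevant = {"cat on mat": {0}}
print(recall_at_k(queries, corpus, relevant, embed, k=1))
```

Even a few dozen hand-labeled query/document pairs from your own data tend to be more informative than leaderboard numbers.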
Are there any solid models that can be downloaded client-side in less than 100MB?
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
Source is at https://github.com/afiodorov/hn-search
I tried "Who's Gary Marcus" - HN / your thing was considerably more negative about him than Google.
However, it's out there, and you have no idea where, so there's not really a moral or feasible way to get rid of it everywhere. (Please don't nuke the world just to clean your rep.)
Maybe commenters face threats to safety. Maybe commenters didn’t think AI companies profiting off of their non-commercial conversations would ever exist and wouldn’t have put data out there if that was disclosed ahead of time.
Corporations have an unlimited right to bully and threaten to take down embarrassing content and hide their mistakes, and they have greatly enhanced leverage over copyright enforcement compared to individuals. But if individuals do the much less egregious thing of trying to take down their own content, content they don't even get paid for, suddenly it's immoral.
This community financially benefits YCombinator and its portfolio companies. Without our contributions, readership, and comments, their ability to hire and recruit founders is diminished. They don’t provide a delete button for profit-motivated reasons, and privacy laws like GDPR guard against that.
(As you might guess, I am personally quite against HN’s policy forbidding most forms of content deletion. Their policy and solution involving manual modifications via the moderation team makes no sense - every other social media platform lets you delete your content)
[1] https://www.ycombinator.com/legal/#tou
> Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site.
> The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
From [1] Terms of Use | Intellectual Property Rights: > Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined below) that you legally upload to the Site.
> In connection with your use of the Site you will not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods.

Sure, and some people would want a "gun prosthesis" as an aid to quickly throw small metallic objects, and it wouldn't be allowed either.
These modern brain prosthetics are darn good.
For a forum of users that's supposed to be smarter than Reddit users, we sure do make ourselves out to be just as unsmart as those Reddit users are purported to be. Not being able to understand the intent/meaning of "for commercial use" is mind-boggling to the point that it has to be intentional. The purpose is what I'm still unclear on, though.
> I hired a company called OpenAI to do it for me.
>>> If it's OK to encode it in your natural neural net, why is it not OK to put it in your artificial one?
Well I guess that lines up. With that line of reasoning I have zero issue believing you outsourced your reading to them. You clearly aren't getting your money's worth.

I am not sure if it is that clear cut.
Embeddings are encodings of shared abstract concepts statistically inferred from many works or expressions of thoughts possessed by all humans.
With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some structure about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's, by virtue of having learned to speak in childhood: every speaker creates a derivative work of all prior speakers.
Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:
- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding;
- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;
- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
Please do note that this is a philosophical take, and it glosses over the legally relevant differences between human and machine learning, as the legal question ultimately depends on statutes, case law and policy choices that are still evolving.
Where it gets more complicated:
If the embedding model has been trained on a large number of languages, cross-lingual search becomes possible: you can search with an abstract concept expressed in any language the model was trained on. The quality of such search results across languages X, Y and Z will be directly proportional to the scale and quality of the training corpus in those languages.
Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of search results written in different languages by different people at different times, and the question becomes: «what exactly has it been statistically[1] derived from?».
[0] This cross-lingual search is what I did with my engineers last year, to our surprise and delight at how well it actually worked.
[1] In the legal sense, if one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument in the legal sense becomes strained.
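Mechanically, cross-lingual retrieval is just nearest-neighbour search in one shared vector space. A toy sketch of the idea; the vectors below are hand-made illustrations of how a multilingual model places translations of the same concept near each other, not real model output:

```python
# Hand-made toy vectors standing in for real multilingual model output.
# A real multilingual embedder maps translations of the same concept close
# together, which is exactly what makes cross-lingual search work.
docs = {
    "en: the meaning of life": [0.90, 0.10, 0.00],
    "de: der Sinn des Lebens": [0.85, 0.15, 0.05],
    "fr: recette de tarte aux pommes": [0.00, 0.20, 0.95],
}
query = [0.88, 0.12, 0.02]  # embedding of «the meaning of life», in any language

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

# Rank all documents, regardless of language, by similarity to the query.
ranked = sorted(docs, key=lambda k: -cos(query, docs[k]))
print(ranked)  # the two «meaning of life» docs rank above the unrelated recipe
```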
Data of non-european users who just click the "delete" button in their user profile? Completely different beast.
I've never been convinced that my data will be deleted from any long-term backups. There's nothing preventing them from periodically restoring data from a previous backup without any due diligence to ensure that hard-deleted data stays deleted.
Who in the EU is actually going in and auditing hard deletes? If you log in and can no longer see the data because a soft-delete flag prevents it from being displayed, and any "give me a report of the data you have on me" request comes back empty because of the same flag, how does anyone prove their data was only soft-deleted?
If anyone owns this comment it's me IMO. So I don't see any reason why HN should be able to sue anyone for using this freely available information.
(Violation of HN Terms & Conditions || Violation of copyright) = Potential penalty
So the equation still balances for them to not give a damn
I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see, before you respond to something, what the responses to what you are about to say have been previously.
Semantic threads or something would be the general idea... Pretty cool concept actually...
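A sketch of what that lookup could look like; the replies, vectors, and the 0.9 similarity threshold are all made up for illustration, and real vectors would come from any sentence-embedding model:

```python
# Sketch: before posting, look up near-duplicate prior replies by embedding
# similarity. The vectors here are tiny illustrative stand-ins; a real system
# would embed the full archive of comments with a sentence-embedding model.
past_replies = [
    ("you should just use SQLite", [0.90, 0.10]),
    ("have you considered SQLite?", [0.88, 0.14]),
    ("rewrite it in Rust", [0.10, 0.95]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def similar_prior(draft_vec, threshold=0.9):
    """Return prior replies whose embeddings are close to the draft's."""
    return [text for text, vec in past_replies
            if cosine(draft_vec, vec) >= threshold]

draft = [0.92, 0.08]  # embedding of the reply you're about to post
print(similar_prior(draft))  # surfaces the two near-duplicate SQLite replies
```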
This is funny to me in a number of ways. I doubt anyone would be interested in post-2023 data dumps for fear they would be too contaminated with content produced by LLMs. It's also funny that the archive was hosted by huggingface, which removes any sliver of doubt that they scarped (sic) the site.
No it didn't. As the top comment in that thread points out, there were a large number of false positives.
Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.
Around 4 million web pages as markdown is like 1-2GB.
I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.
This is different, and everyone pretending it is not is being either intentionally ignorant or genuinely ignorant; neither is good. I did not give so much to the public internet for the benefit of commercial AI models, simple as that. This breaks the relationship I had with the public internet, and like many others I will change my behaviour online to suit.
Maybe my tune will change once there's a commercial collapse and the only remaining models are open source, free for all to use. But even then it would be begrudgingly, my thoughts parading as some model's abilities doesn't sit right.
To make another example, even though my reddit history is public (or was until recently because I didn't have a choice) I would still feel uneasy if I realized someone deliberately snooped through all of it. And I would be SUUUUPER uncomfortable if someone did that with my Discord history.
It's not against the rules or anything, I just think it's rude.
Making the content queryable by a database engine is merely a technical optimisation of the efficiency with which that content may be consumed. The same outcome could have been accomplished by capturing a screenshot of every web page on the internet, or by laboriously copying and pasting said content via an imaginary army of Mechanical Turks.
A private network may, of course, operate under an entirely different access model and social contract.
It's two clicks to get to that page from this page. Say the wrong thing here and some troll will go through it and find something you said years ago that contradicts something you're saying today. If the mere thought of that bothers you, I don't know what to tell you other than to warn you of the possibility.
It used to be about helping strangers in some small way. Now it's helping people I don't like more than people I do like.
I've definitely heard that one before... Explain link rot to me then, or why the internet archive even exists?
As an archive that supplements my personal archive, and the archives of many others. Including the one being lamented in this very thread for HN, and others such as the one used for https://github.com/afiodorov/hn-search
The way to eliminate your comments would be to take over world government, use your copy of the archives of the entire internet in order to track down the people who most likely have created their own copies, and to utilize worldwide swat teams with trained searchers, forensics experts and memory-sniffing dogs. When in doubt, just fire missiles at the entire area. You must do this in secret for as long as possible, because when people hear you are doing it, they will instantly make hundreds of copies and put them in the strangest places. You will have to shut down the internet. When you are sure you have everything, delete your copy. You still may have missed one.
tldr: both of these things can be true.
"Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data. You can find it here: https://huggingface.co/datasets/OpenPipe/hacker-news and, honestly, I’m not really sure how this was obtained, if using scarping or if HN makes this data public in some way."
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to them (e.g. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
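The core trick, block-wise scalar quantization, is easy to sketch. Below is a toy pure-Python illustration in the spirit of llama.cpp's Q8_0 (blocks of 32 values, one scale per block, int8 codes); the real GGUF format packs this into a specific binary layout, which this sketch does not attempt:

```python
import random

BLOCK = 32  # llama.cpp's Q8_0 also quantizes in blocks of 32 values

def quantize_q8(vec):
    """Quantize a float vector to (scale, int8 codes) per block of 32."""
    blocks = []
    for i in range(0, len(vec), BLOCK):
        chunk = vec[i:i + BLOCK]
        # One scale per block so each code fits in [-127, 127].
        scale = max(abs(x) for x in chunk) / 127 or 1.0
        codes = [round(x / scale) for x in chunk]
        blocks.append((scale, codes))
    return blocks

def dequantize(blocks):
    return [scale * q for scale, codes in blocks for q in codes]

random.seed(0)
v = [random.uniform(-1, 1) for _ in range(256)]  # a fake embedding vector
v2 = dequantize(quantize_q8(v))
max_err = max(abs(a - b) for a, b in zip(v, v2))
print(max_err < 0.01)  # per-element error is bounded by scale / 2
```

The appeal is that embeddings tolerate this kind of precision loss well: at 8 bits plus one scale per block you keep nearest-neighbour rankings nearly intact while shrinking storage roughly 4x versus float32.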
>With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein.
>By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose
ah, if only I knew about this small little legal detail when I made my account...
The law. And the license agreed when I made the account.
The law? I don't know, copyright law I guess?
Copyright gives you a bundle of rights over your expressive works, but when you give them to someone else for republication, as you are here, you’re licensing them. By licensing according to the terms of service, which is a binding contract, you are relinquishing those rights. As long as there is a term in the terms of service that allows the publisher to convey your expression to a third party, you don’t get any say into what happens next. You gave your consent by submitting your content, and there’s no backsies. (Subject to GDPR and other applicable laws, of course.)
And these days, no web service that accepts user generated content and has a competent lawyer is going to forget to have that sort of term in their ToS.
I don't know why they continue to stand by this massive breach of privacy.
Citizens of any country should have the right to sue to remove personal information from any website at any time, regardless of how it got there.
Right to be forgotten should be universal.
It's worse than that, it's an obvious GDPR violation. But it hasn't been tested in a (european) court yet. One day, it will be, and much rejoicing would be had then.
It's also a shitty provision in that it's not made clear when signing up for HN, as it is a pretty uncommon one.
You write like a Bible.