Vector search had strange quirks: searching for "cat" would return mostly paragraphs unrelated to the word. I was using 3072-dimensional embeddings from OpenAI's text-embedding-3-large, and each entry was roughly 1-2 paragraphs. For my recent project, I found PGroonga more reliable for full-text document lookup (with some fuzzy matching support).
This could of course be a bug, but it's worth noting that vector search is in general a semantic ("meaning-based") search technique. It doesn't operate on the actual text; it operates on the resulting embedding. So it may well make sense to include a (possibly fuzzy) full-text search as well, to catch literal instances of the text if that's what you actually need.

That said, the particular example you give may be either a configuration problem or just that "cat" is a particularly bad choice of word to use for a vector search in your corpus.

HNSW[1] is an approximate nearest neighbour search technique. If you visualise the embedding as distilling each document down to a set of numbers, that vector forms the coordinates of a point in a high-dimensional vector space. HNSW is going to take the word "cat" in your example, embed that, and find what are probably the closest other vectors (representing documents) in that space by your distance metric.[2]

Now, it's easy to imagine the problem. If your document corpus is about something unrelated to cats (say, a bunch of programming books), the results of this search are going to be basically random: the embedded documents form a big blob of points that are relatively close together, your search term embeds way off in empty space, and so the distance from the search term to any given document is extremely large and roughly the same for all of them. If you were to search for a topic that is reflected in your corpus, your search term could embed right in the middle of that blob and the distances would be much more meaningful, so the results are likely to be more intuitive and useful.
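To make the "blob versus empty space" picture concrete, here is a toy sketch. It assumes made-up 2-D points and plain Euclidean distance, which is far simpler than real 3072-dimensional embeddings with whatever metric your index uses; `relative_contrast` is a name invented for this illustration, not something from HNSW or pgvector.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_contrast(query, docs):
    """(farthest - nearest) / nearest: how strongly the distances separate documents.

    Assumes the query does not coincide exactly with a document (nearest > 0).
    """
    dists = [euclidean(query, d) for d in docs]
    return (max(dists) - min(dists)) / min(dists)

# Toy 2-D "embeddings": a tight blob of documents around (1, 1).
docs = [(0.9, 1.0), (1.0, 1.1), (1.1, 0.9), (1.0, 0.9), (1.2, 1.0)]

on_topic = (1.0, 1.0)     # query that lands inside the blob
off_topic = (10.0, -8.0)  # query that lands far away from every document

print("on-topic contrast: ", relative_contrast(on_topic, docs))
print("off-topic contrast:", relative_contrast(off_topic, docs))
```

For the far-away query, every document is almost equally distant, so the ordering of the "nearest" results carries very little signal.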

[1] https://arxiv.org/abs/1603.09320 is, I think, the original paper that introduced the technique.

[2] The "probably"/"approximately" get-out is what allows the computational complexity of HNSW to stay reasonable even if the corpus is large.

Yeah, you usually get much better results by taking a list of ~10 fuzzy keyword search results and ~10 semantic/embedding results, then using something like Cohere Rerank (or just a cheap GPT model) to choose the best 5-10 results from the pile.
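A rough sketch of that merge-then-rerank flow, assuming the two candidate lists already exist. The function names are made up for illustration, and `toy_overlap_score` is only a placeholder for a real reranker (a Cohere Rerank call or an LLM prompt would slot in as the `score` argument); it is not how either of those works internally.

```python
def merge_and_rerank(query, keyword_hits, semantic_hits, score, top_k=5):
    """Pool both candidate lists, drop duplicates, keep the top_k by score."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    candidates = list(dict.fromkeys(keyword_hits + semantic_hits))
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]

def toy_overlap_score(query, doc):
    """Placeholder reranker: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

keyword_hits = ["cats eat meat", "the cat sat"]
semantic_hits = ["felines are carnivores", "cats eat meat"]

best = merge_and_rerank("what do cats eat", keyword_hits, semantic_hits,
                        toy_overlap_score, top_k=2)
print(best)
```

Note that the word-overlap placeholder cannot see that "felines are carnivores" is relevant, which is exactly the gap a proper reranking model is meant to close.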
General text embeddings aren't that great for keyword search. The best embedding match for a paragraph is going to be another paragraph with the same meaning. "Cats are natural carnivores." should be very similar to "Felines eat meat.", and somewhat similar to "Tigers eat meat."

"cat" is not a sentence - the embedding should be close to other contextless single words, and maybe just slightly closer to paragraphs about cats than paragraphs about non-cats.

A simple way to use embeddings for search is to generate text of the same kind you expect your users to input - e.g. if you expect users to ask questions in natural language, have a cheap LLM turn paragraphs like "Cats are natural carnivores." into questions like "What do cats eat?" "What is natural behaviour for cats?" and then index the embeddings for those.
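Roughly, the idea looks like the sketch below. Everything named here is an assumption for illustration: `generate_questions` stands in for the cheap LLM call, `toy_embed` is a tiny word-count vector standing in for a real embedding model, and the canned questions are hard-coded so the example runs on its own.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_question_index(paragraphs, generate_questions, embed):
    """Embed generated questions, each pointing back to its source paragraph."""
    index = []
    for para_id, paragraph in enumerate(paragraphs):
        for question in generate_questions(paragraph):
            index.append((embed(question), para_id))
    return index

def search(index, paragraphs, query, embed):
    """Return the paragraph whose generated question is closest to the query."""
    q = embed(query)
    _, best_para = max(index, key=lambda entry: cosine(q, entry[0]))
    return paragraphs[best_para]

# --- Toy stand-ins (hypothetical, only so the sketch is runnable) ---
VOCAB = ["cats", "eat", "python", "loops"]

def toy_embed(text):
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]

paragraphs = ["Cats are natural carnivores.", "Python has for loops."]
canned_questions = {
    paragraphs[0]: ["What do cats eat?"],
    paragraphs[1]: ["How do Python loops work?"],
}

index = build_question_index(paragraphs, canned_questions.get, toy_embed)
print(search(index, paragraphs, "what should cats eat", toy_embed))
```

The point of the design is that the query is compared against text of the same shape (question against question), and the match is then mapped back to the paragraph you actually want to return.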

I wonder what kind of sentence should be the best match when using embeddings to search for things similar to the following grammatically correct sentence:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...

Well… I guess the most relevant results would simply be text from that Wikipedia article about this sentence, or from comments on HN and Reddit about this specific sentence, and then various texts about grammar that talk about similar concepts to this.

Augmenting your search with fuzzy matching is a good idea. You might also try embedding smaller chunks (5-8 sentences) at a time. The paragraph breaks will usually not be a problem. The bigger the chunk of text, the more likely it is that the attention in LLM embeddings will downplay the significance of any single word. You can also embed individual sentences with something like FastText, which gives very rapid embeddings with a smaller vector length and great quality (imho), with higher precision. It's also much easier to run in production without paying for a GPU server or API tokens.
BM25 + vector search works better than vector search alone.
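A minimal sketch of chunking by sentence count, assuming a naive regex splitter (it will mis-split abbreviations like "e.g."; a proper sentence tokenizer would do better). `chunk_sentences` is a name made up for this example.

```python
import re

def chunk_sentences(text, max_sentences=6):
    """Split text into chunks of at most max_sentences sentences each."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

sample = "A one. B two! C three? D four. E five."
for chunk in chunk_sentences(sample, max_sentences=2):
    print(chunk)
```

Each chunk would then be embedded on its own, so a single mention of a word has fewer surrounding sentences to compete with.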
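One common way to combine the two result lists is reciprocal rank fusion (RRF). This is a sketch of that technique specifically, not necessarily what the commenter had in mind; it assumes you already have a BM25 ranking and a vector ranking as lists of document ids, and the constant `k=60` is just the conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["a", "b", "c"]    # hypothetical ids from a keyword search
vector_ranking = ["b", "d", "a"]  # hypothetical ids from a vector search

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

RRF only looks at rank positions, so it sidesteps the problem that BM25 scores and vector distances are on incomparable scales.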
Very interesting breakdown. OP, have you done a deep dive into pgvectorscale as well?
Thank you! Haven’t done it yet, but afaik pgvectorscale uses StreamingDiskANN, which should have a different layout than HNSW.
I wanted to read this article, but gave up because of the complete lack of contrast. Please, if you publish something, use black (#000) for text and an almost-white background, not darker grey on a lighter grey background.
Thanks for the feedback. We slightly increased the contrast, if that helps.