My ideal is that turbopuffer ultimately works like a Polars dataframe, where all my ranking is expressed in my search API. I could lazily express some lexical or embedding similarity, then boost it by various attributes (recency, popularity, etc.) to get a first pass, again all with dataframe math. Then I'd compute features for a reranking model I run on my side, also with dataframe math, and it "just works": it runs all this as some kind of query execution DAG and stays out of my way.
You mean a fluent API like `data.transform().filter()...`, that sort of thing?
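For illustration, here is one shape such a lazy, fluent ranking pipeline could take, sketched in plain Python. This is not turbopuffer's or Polars' API: `LazyRanking`, `score`, `boost`, `top_k` and `collect` are hypothetical names, and the recency formula and sample documents are made up. The only point is that the ranking is recorded as a plan and executed in one go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyRanking:
    """Hypothetical lazy pipeline: records steps, runs nothing until collect()."""
    steps: tuple = ()

    def _then(self, step):
        # Return a new plan with one more step; the existing plan is unchanged.
        return LazyRanking(self.steps + (step,))

    def score(self, fn):
        # Set the base score, e.g. a lexical or embedding similarity.
        return self._then(lambda rows: [{**r, "score": fn(r)} for r in rows])

    def boost(self, fn):
        # Multiply the current score by an attribute-derived factor.
        return self._then(
            lambda rows: [{**r, "score": r["score"] * fn(r)} for r in rows]
        )

    def top_k(self, k):
        # Keep the k highest-scoring rows.
        return self._then(
            lambda rows: sorted(rows, key=lambda r: r["score"], reverse=True)[:k]
        )

    def collect(self, rows):
        # Execute the recorded steps in order.
        for step in self.steps:
            rows = step(rows)
        return rows


# Made-up documents; "similarity" stands in for a precomputed first-stage score.
docs = [
    {"id": "a", "similarity": 0.9, "age_days": 300, "popularity": 1.0},
    {"id": "b", "similarity": 0.8, "age_days": 2, "popularity": 1.5},
    {"id": "c", "similarity": 0.2, "age_days": 1, "popularity": 1.0},
]

plan = (
    LazyRanking()
    .score(lambda d: d["similarity"])
    .boost(lambda d: 1.0 / (1.0 + d["age_days"] / 30.0))  # assumed recency decay
    .boost(lambda d: d["popularity"])
    .top_k(2)
)
first_pass = plan.collect(docs)  # nothing ran before this line
```

A real engine would presumably push these steps down to the server as a query plan instead of running them client-side; the sketch only shows the calling style.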
The documentation is great, I really appreciate them putting the roadmap front and centre.
It doesn't have to be that way.
At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.
Sometimes you can reach the goal faster with less complexity by removing the part with the 20x markup.
It's not particularly useful to compare the cost of a raw, unorganized information medium on a single node to that of a highly organized information platform. It's like saying "this CPU chip is expensive, just look at the price of this sand".
Except that it does prompt you to ask what you could do to use that cheap compute and RAM. In the case of Hetzner that might be large caches that allow you to apply those resources on remote data whilst minimizing transfer and API costs.
> $3600.00/TB/month (incumbents)
> $70.00/TB/month (turbopuffer)
That's still 3x cheaper than your number and it's a SaaS API, not just a piece of rented hardware.
No, that's not what I'm saying. Their "Storage Costs" table shows the cost of renting storage from some provider (AWS?). It's clear that those are costs the user has to pay for the infrastructure needed by certain types of software (e.g. Turbopuffer is designed to run on "S3 + SSD Cache", while other software may be designed to run on "RAM + 3x SSD").
I'm comparing RAM costs from that table with RAM costs in the real world.
The idea backed by that table is "RAM is so expensive, so we need to build software to run it on cheaper storage instead".
My statement is "RAM is that expensive only on that provider, there are others where it is not; on those, you may just run it in RAM and save on software complexity".
You will still need some software for your SaaS API to serve queries from RAM, but it won't need the complexity of trying to make it fast when serving from a higher-latency storage backend (S3).
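The ratios thrown around in this sub-thread can be checked with a few lines of arithmetic. The figures below are only the ones quoted above ($3600 and $70 from the article's table, $200 for Hetzner RAM); the variable names are mine.

```python
# $/TB/month, as quoted in the thread (not independently verified).
incumbents_ram = 3600.0   # "incumbents" row of the article's table
hetzner_ram = 200.0       # RAM price the commenter reports paying at Hetzner
turbopuffer = 70.0        # turbopuffer row of the article's table

ram_vs_hetzner = incumbents_ram / hetzner_ram   # the "18x cheaper" claim
hetzner_vs_tpuf = hetzner_ram / turbopuffer     # the "still 3x cheaper" reply

print(f"table RAM vs Hetzner RAM: {ram_vs_hetzner:.1f}x")
print(f"Hetzner RAM vs turbopuffer: {hetzner_vs_tpuf:.2f}x")
```

So both sides' numbers hold up: the table's RAM price is 18x the Hetzner figure, and the Hetzner figure is a bit under 3x the turbopuffer price.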
This is irking me. pgvector has existed since before that, doesn't require in-memory storage, and can definitely handle vector search over 100M+ documents in a decently performant manner. Did they have a particular requirement somewhere?
You only need enough memory to load the index, definitely not the whole collection. A typical index would most likely fit within a few GBs. And even if you need dozens of GBs of RAM it won’t cost nearly as much as $20k/month as the article surmises.
You have multiple parameters to tweak that affect retrieval performance as well as the memory footprint of your indexes. Here's a rundown on that: https://tembo.io/blog/vector-indexes-in-pgvector
The biggest difference at a low level is that turbopuffer records have unique primary keys, and can be updated, like in a normal database. Old records that were overwritten won't be returned in searches. The LSM tree storage engine is used to achieve this. The LSM tree also enables maintenance of global indexes that can be used for efficient retrieval without any time-based filter.
Quickwit records are immutable. You can't overwrite a record (well, you can, but overwritten records will also be returned in searches). The data files it produces are organized into a time series, and if you don't pass a time-based filter it has to look at every file.
- it does not do vector search. It can rank docs using BM25, but usually people just want to sort by timestamp.
- it does not use an SSD cache. Quickwit reads directly from object storage.
- it is append-only (you can't modify documents)
- it scales really well and typically shines on the 1TB .. 100PB range
- it has an Elasticsearch-compatible API.
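To make the mutable-vs-append-only distinction concrete, here is a toy model in Python. It is not how turbopuffer or Quickwit are implemented: `AppendOnlyLog` and `KeyedStore` are hypothetical classes, and the "segments" only loosely imitate LSM runs. It shows just the visible behaviour described above: with primary keys and newest-wins reads, an overwritten record stops matching searches; with immutable records, both versions match.

```python
class AppendOnlyLog:
    """Immutable records: rewriting a key adds a second copy, both stay searchable."""

    def __init__(self):
        self.records = []

    def write(self, key, text):
        self.records.append((key, text))

    def search(self, term):
        return [(k, t) for k, t in self.records if term in t]


class KeyedStore:
    """Newest-write-wins by primary key (the visible effect of an LSM-style engine)."""

    def __init__(self):
        self.segments = []  # oldest first, newest last

    def write_batch(self, batch):
        self.segments.append(dict(batch))

    def search(self, term):
        seen, hits = set(), []
        # Walk newest to oldest; the first version seen for a key shadows older ones.
        for segment in reversed(self.segments):
            for key, text in segment.items():
                if key in seen:
                    continue
                seen.add(key)  # mark before matching so stale versions never match
                if term in text:
                    hits.append((key, text))
        return hits


log = AppendOnlyLog()
log.write("doc1", "red shoes")
log.write("doc1", "blue shoes")   # "overwrite" just appends

store = KeyedStore()
store.write_batch({"doc1": "red shoes"})
store.write_batch({"doc1": "blue shoes"})  # real overwrite by primary key

stale_hits = log.search("red")     # the old version is still returned
fresh_hits = store.search("red")   # the old version is shadowed
```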
DuckDB can open Parquet files over HTTP and query them, but I found that it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.
I mostly need key/value lookups and could potentially store each key in a separate object in S3, but for a couple hundred million objects... It would be a lot more manageable to have a single file and maybe a cacheable index.
That’s… the whole point. That’s how Parquet files are supposed to be used. They’re an improvement over CSV or JSON because clients can read small subsets of them efficiently!
For comparison, I’ve tried a few other client products that don’t use Parquet files properly and just read the whole file every time, no matter how trivial the query is.
DuckDB can query a remote DuckDB database too; in that case it looks like there is caching, which might be better.
I wonder if anyone has actually worked on a specific file format for this use case (relatively high-latency random access) to minimize reads to as few blocks as possible.
Simon Willison wrote about it: https://simonwillison.net/2022/Aug/10/sqlite-http/
The whole idea makes sense, but I feel like the file format should be specifically tuned for this use case. Otherwise you end up with a lot of range requests because it was designed for disk access. I wondered if anything was actually designed for that.
A lot of requests in themselves shouldn't be that horrible with CloudFront nowadays, as you have both low latency and, with HTTP/2, a low-overhead RPC channel.
There are some potential remedies, but each comes with significant architectural impact:
- Bigger range queries: for smallish tables, instead of trying to do point-based access for individual rows, retrieve bigger chunks at once and scan through them locally -> Fewer requests, but likely also more wasted bandwidth
- Compute the specific view live with a remote DuckDB -> Has the downside of having to introduce a DuckDB instance that you have to manage between the browser and S3
- Precompute the data you are interested in into new Parquet files -> Only works if you can anticipate the query patterns enough
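The first remedy (bigger range requests) is easy to sketch. The following is a minimal Python illustration, not what DuckDB or any Parquet reader actually does: `coalesce`, `wasted_bytes` and `range_header` are hypothetical helpers, the byte offsets are made up, and `max_gap` is an assumed tuning knob. It merges nearby byte ranges into fewer HTTP range requests and reports how many unneeded bytes that costs.

```python
def coalesce(ranges, max_gap):
    """Merge (start, end) byte ranges (end exclusive) separated by <= max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of a new request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def wasted_bytes(ranges, merged):
    """Bytes fetched by the merged requests that no original range asked for."""
    wanted = sum(e - s for s, e in coalesce(ranges, 0))  # union of requested bytes
    fetched = sum(e - s for s, e in merged)
    return fetched - wanted


def range_header(start, end):
    """HTTP Range header value; HTTP byte ranges are inclusive on both ends."""
    return f"bytes={start}-{end - 1}"


# Made-up offsets standing in for the pages a point lookup needs to read.
needed = [(0, 100), (150, 200), (5000, 5100), (5120, 5200)]

merged = coalesce(needed, max_gap=64)
headers = [range_header(s, e) for s, e in merged]
waste = wasted_bytes(needed, merged)
```

Here four requests collapse into two at the cost of 70 extra bytes; raising `max_gap` trades more bandwidth for fewer round trips, which is exactly the tension described in the list above.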
I read in the sibling comment that your main issue seems to be the re-reading of metadata. DuckDB is AFAIK able to cache the metadata, but won't across instances. I've seen someone have the same issue, and the problem was that they only created short-lived DuckDB in-memory instances (every time they wanted to run a query), so every time the fresh DB had to retrieve the metadata again.
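A toy model of that failure mode, in Python. It does not use or describe DuckDB's real internals: `FakeRemoteFile` and `Engine` are hypothetical stand-ins, and the per-instance cache is an assumption taken from the comment above. It only shows why a fresh instance per query refetches metadata every time while a long-lived one fetches it once.

```python
class FakeRemoteFile:
    """Stand-in for a remote Parquet file; counts how often its metadata is fetched."""

    def __init__(self, url):
        self.url = url
        self.metadata_fetches = 0

    def fetch_metadata(self):
        self.metadata_fetches += 1
        return {"row_groups": 3}  # made-up footer contents


class Engine:
    """Hypothetical query engine whose metadata cache lives and dies with the instance."""

    def __init__(self):
        self._metadata_cache = {}

    def query(self, remote):
        if remote.url not in self._metadata_cache:
            self._metadata_cache[remote.url] = remote.fetch_metadata()
        return self._metadata_cache[remote.url]["row_groups"]


# Pattern 1: a new in-memory engine for every query.
file_a = FakeRemoteFile("https://example.invalid/a.parquet")
for _ in range(5):
    Engine().query(file_a)
short_lived_fetches = file_a.metadata_fetches

# Pattern 2: one long-lived engine reused across queries.
file_b = FakeRemoteFile("https://example.invalid/b.parquet")
engine = Engine()
for _ in range(5):
    engine.query(file_b)
long_lived_fetches = file_b.metadata_fetches
```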
I did some tests, querying "where col = 'x'". If the database was a remote DuckDB native DB, it would issue a bunch of HTTP range requests, and a second identical call would not trigger any new requests. Also, querying for col = foo and then col = foob would yield fewer and fewer requests, as I assume it has the necessary data on hand.
Doing it on Parquet, with a single long-running DuckDB CLI instance, I get the same requests over and over again. One difference, though: I'd need to "attach" the DuckDB database under a schema name, but I'd query the Parquet file using the "select from 'http://.../x.parquet'" syntax. Maybe this causes it to be ephemeral for each query. Will see if the attach syntax also works for Parquet.
I think this is pretty much what AWS Athena is.
Right now one of the main performance problems is that ClickHouse does not cache index metadata yet, so you still have to scan files rather than keeping the metadata in memory. ClickHouse does this for native MergeTree tables. There are a couple of steps to get there, but I have no doubt that metadata caching will be properly handled soon.
Disclaimer: I work for Altinity, an enterprise provider for ClickHouse software.
Having witnessed some very large Elasticsearch production deployments, being able to throw everything into S3 would be incredible. The applicability here isn't only for vector search.
> Warehouse: BigQuery, Snowflake, Clickhouse; read latency ≥1s, write latency in minutes
For ClickHouse, it should be: read latency <= 100ms, write latency <= 1s. Logging, real-time analytics, and RAG are also suitable for ClickHouse.
I left that category out for simplicity (plenty of others that didn't make it into the taxonomy, e.g. queues, nosql, time-series, graph, embedded, ..)
Seems like a topic I need to delve into a bit more.
Am I alone in this?
In any case this seems like a pretty interesting approach. Reminds me of Warpstream which does something similar with S3 to replace Kafka.