Why a hybrid? Vector databases are useful for similarity queries, while graph databases are useful for relationship queries. Each stores data in a way that’s best for its main type of query (e.g. key-value stores vs. node-and-edge tables). However, many AI-driven applications need both similarity and relationship queries. For example, you might use vector-based semantic search to retrieve relevant legal documents, and then use graph traversal to identify relationships between cases.
Developers of such apps face a quandary: they have to build on top of two different databases (a vector one and a graph one), link them together, and keep the data in sync. Even then, the two databases aren't designed to work together. For example, there’s no native way to perform joins or queries that span both systems, so you’ll need to handle that logic at the application level.
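To make the "application-level logic" concrete, here is a minimal sketch of that glue code, assuming two in-memory `HashMap`s stand in for the separate vector and graph databases. The names `VectorStore`, `GraphStore`, `top_k` and `hybrid_lookup` are made up for illustration and are not part of Helix or any other product.

```rust
use std::collections::HashMap;

/// Stand-in for the vector database: document id -> embedding (hypothetical).
type VectorStore = HashMap<u32, Vec<f32>>;
/// Stand-in for the graph database: document id -> ids of related cases (hypothetical).
type GraphStore = HashMap<u32, Vec<u32>>;

/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Step 1 (vector side): ids of the `k` documents most similar to `query`.
fn top_k(store: &VectorStore, query: &[f32], k: usize) -> Vec<u32> {
    let mut scored: Vec<(u32, f32)> = store
        .iter()
        .map(|(id, v)| (*id, cosine(query, v)))
        .collect();
    // Highest similarity first; ties broken by id so the result is deterministic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}

/// Step 2 (graph side): one-hop neighbours of every hit.
/// This "join" across the two stores lives entirely in application code.
fn hybrid_lookup(
    vectors: &VectorStore,
    graph: &GraphStore,
    query: &[f32],
    k: usize,
) -> Vec<(u32, Vec<u32>)> {
    top_k(vectors, query, k)
        .into_iter()
        .map(|id| (id, graph.get(&id).cloned().unwrap_or_default()))
        .collect()
}
```

Calling `hybrid_lookup(&vectors, &graph, &query, 2)` returns the two nearest documents, each paired with its linked cases. With real systems, every step in that function would be a separate network round trip to a different database.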
Helix started when we realized that there are ways to integrate vector and graph data that are both fast and suitable for AI applications, especially RAG-based ones. See this cool research paper: https://arxiv.org/html/2408.04948v1. After reading that and some other papers on graph and hybrid RAG, we decided to build a hybrid DB. Our aim was to make something better to use from a developer standpoint, while also making it fast as hell.
After a few months of working on this as a side project, our benchmarking shows that we are on par with Pinecone and Qdrant for vectors, and our graph is up to three orders of magnitude faster than Neo4j.
Problems where a hybrid approach works particularly well include:
- Indexing codebases: you can vectorize code-snippets within a function (connected by edges) based on context and then create an AST (in a graph) from function calls, imports, dependencies, etc. Agents can look up code by similarity or keyword and then traverse the AST to get only the relevant code, which reduces hallucinations and prevents the LLM from guessing object shapes or variable/function names.
- Molecule discovery: Model biological interactions (e.g., proteins → genes → diseases) using graph types and then embed molecule structures to find similar compounds or case studies.
- Enterprise knowledge management: you can represent organisational structure, projects, and people (e.g., employee → team → project) in graph form, then index internal documents, emails, or notes as vectors for semantic search and link them directly to employees/teams/projects in the graph.
I naively assumed when learning about databases for the first time that queries would be compiled and executed like functions in traditional programming. Turns out I was wrong, and the usual approach creates unnecessary latency by sending extra data (the whole written query), compiling it at run time, and then executing it. With Helix, you write the queries in our query language (HelixQL), which is then transpiled into Rust code and built directly into the database server, where you can call a generated API endpoint.
Many people have a thing against “yet another query language” (doubtless for good reason!) but we went ahead and did it anyway, because we think it makes working with our database so much easier that it’s worth a bit of a learning curve. HelixQL takes from other query languages such as Gremlin, Cypher and SQL with some extra ideas added in. It is declarative while the traversals themselves are functional. This allows complete control over the traversal flow while also having a cleaner syntax. HelixQL returns JSON to make things easy for clients. Also, it uses a schema, so the queries are type-checked.
We took a crude approach to building the original graph engine as a way to get an MVP out, so we are now working on improving the graph engine by making traversals massively parallel and pipelined. This means data is only ever decoded from disk when it is needed, and parts of reads are all processed in parallel.
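As a rough illustration of the "only decode when needed" part, here is a sketch using Rust's lazy iterators. It covers pipelining only, not parallelism, and is not Helix's engine: the 8-byte record format, the even-id filter, and the names `decode` and `first_even_ids` are assumptions made up for the example.

```rust
use std::cell::Cell;

/// Decode one stored record (here just a big-endian u64 id) and count the call,
/// so we can observe how many records were actually decoded.
fn decode(raw: &[u8; 8], decoded: &Cell<usize>) -> u64 {
    decoded.set(decoded.get() + 1);
    u64::from_be_bytes(*raw)
}

/// Pipelined read: decode -> filter -> limit.
/// Iterator adapters are lazy, so decoding stops as soon as `limit` matches are found.
fn first_even_ids(raw: &[[u8; 8]], limit: usize, decoded: &Cell<usize>) -> Vec<u64> {
    raw.iter()
        .map(|r| decode(r, decoded))
        .filter(|id| id % 2 == 0)
        .take(limit)
        .collect()
}
```

Asking for the first three even ids out of a thousand stored records decodes only the records needed to find them. The rest stay as raw bytes.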
If you’d like to try it out in a simple RAG demo, you can follow this guide and run our Jupyter notebook: https://github.com/HelixDB/helix-db/tree/main/examples/rag_d...
Many thanks! Comments and feedback welcome!
Would love to talk to you about it and make sure we capture all of the pain points if you're open to it? :)
I notice that in your core vector type (`HVector`), you choose to store the vector data as a `Vec<f64>`. Given what I have seen from most embedding endpoints, they return `f32`s. Is there a particular reason for picking `f64` vs `f32` here? Is the additional precision a way to avoid headaches down the line or is it something I am missing context for?
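For a sense of what is at stake in that question, here is a small sketch of the storage cost of each choice. It assumes a 1536-dimension embedding purely as an example size, and the helper names are hypothetical rather than taken from the Helix codebase.

```rust
use std::mem::size_of;

/// Bytes needed for the raw components of one embedding with `dims` dimensions.
fn embedding_bytes<T>(dims: usize) -> usize {
    dims * size_of::<T>()
}

/// Widen an `f32` embedding to `f64`.
/// Every f32 is exactly representable as an f64, so this is lossless,
/// but it also adds no information that the f32 did not already carry.
fn widen(v: &[f32]) -> Vec<f64> {
    v.iter().map(|&x| f64::from(x)).collect()
}

/// Narrow an `f64` embedding back to `f32` (rounds to the nearest f32).
fn narrow(v: &[f64]) -> Vec<f32> {
    v.iter().map(|&x| x as f32).collect()
}
```

Under these assumptions, `embedding_bytes::<f32>(1536)` is 6144 bytes and `embedding_bytes::<f64>(1536)` is 12288, so `f64` doubles the footprint. `narrow(&widen(&v))` gives back exactly `v` for any `f32` input.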
Really cool project, gonna keep reading the code.
Feel free to point me to docs / code if these are lazy questions :)
For keys we are using UUIDs, but using the v6 timestamped uuids so that they are easily lexicographically ordered at creation time. This means keys inserted into LMDB are inserted using the APPEND flag, meaning LMDB shortcuts to the rightmost leaf in its B-Tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time meaning each new record is guaranteed to be larger (in terms of big-endian byte order) than the previous record.
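The ordering argument above can be checked with a toy key. This sketch assumes a simplified layout (timestamp in the high 64 bits, arbitrary bits below), not the real UUID v6 bit layout, and `make_key` / `is_append_only` are made-up helper names.

```rust
/// Toy 128-bit key: creation timestamp in the high 64 bits, arbitrary low bits.
/// This is a simplification, not the actual UUID v6 field layout.
fn make_key(timestamp: u64, low: u64) -> u128 {
    (u128::from(timestamp) << 64) | u128::from(low)
}

/// Big-endian bytes, which is what a byte-wise (lexicographic) key comparison sees.
fn key_bytes(key: u128) -> [u8; 16] {
    key.to_be_bytes()
}

/// True when every key's bytes compare strictly greater than the previous key's bytes,
/// i.e. when each insert lands after the current last record.
fn is_append_only(keys: &[u128]) -> bool {
    keys.windows(2).all(|w| key_bytes(w[0]) < key_bytes(w[1]))
}
```

Keys built with increasing timestamps pass `is_append_only`. The same keys serialised with `to_le_bytes()` would not sort by creation time (timestamp 1 would compare greater than timestamp 256), which is why the big-endian byte order matters here.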
We also store the UUIDs as u128 values for two reasons. The first is that a u128 takes up 16 bytes whereas a string UUID takes up 36 bytes. This means we store 56% less data and LMDB has to decode 56% fewer bytes when doing code accesses.
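The 56% figure is just (36 − 16) / 36 ≈ 55.6%. A quick sketch of the two representations, using hypothetical helpers `parse_uuid` and `saving_percent` and an illustrative UUID string:

```rust
/// Parse a hyphenated UUID string (36 characters) into a u128 (16 bytes).
/// Returns None if the input does not contain exactly 32 hex digits.
fn parse_uuid(s: &str) -> Option<u128> {
    let hex: String = s.chars().filter(|c| *c != '-').collect();
    if hex.len() != 32 {
        return None;
    }
    u128::from_str_radix(&hex, 16).ok()
}

/// Percentage of bytes saved by storing `binary_len` bytes instead of `text_len`.
fn saving_percent(text_len: usize, binary_len: usize) -> f64 {
    100.0 * (text_len - binary_len) as f64 / text_len as f64
}
```

For a string such as `"1ec9414c-232a-6b00-b3c8-9e6bdeced846"`, the text form is 36 bytes, `parse_uuid(..)` yields a value whose `to_be_bytes()` is 16 bytes long, and `saving_percent(36, 16)` rounds to 56.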
For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.
In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.
You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...
With regard to the graph db, we mostly use our laptops to test it and haven't run into an issue with performance yet on any size dataset.
If you wanna chat DM me on X :)
Furthermore, the vectors are capped at 4k dimensions, which may be enough most of the time but is a problem for some of the users we've spoken to. Also, they don't allow pre-filtering, which is a problem for a few people we've spoken to, including Zep AI. They are on the right track, but there are a lot of holes that we are hoping to fill :)
Edit: AND, it is super memory intensive. People have had problems using extremely small datasets and have had memory overflows.
Currently the roadblock for that is the LMDB storage engine. We have our own storage engine on our roadmap, and we want it to include WASM support. If you wanna talk about it, reach out to me on Twitter: https://x.com/georgecurtiss
Not sure if it's possible. But why not use fjall, if it is? [0]
I wonder if you'd like to share your thoughts on GQL becoming an ISO standard? Also, have you looked into how Neptune Analytics handles vector embeddings?
I mentioned in another comment that you can provide a grammar with constrained decoding to force the LLM to generate tokens that comply with the grammar. This ensures that only valid syntactic constructs are produced.
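Roughly, the idea looks like the sketch below: at each step the grammar says which tokens are legal, everything else is masked out, and the best-scoring legal token wins. The four-token grammar (`MATCH name RETURN name`) is a toy invented for this example. It is not HelixQL, and the scores stand in for model logits.

```rust
/// Parser states for a toy grammar: QUERY := "MATCH" NAME "RETURN" NAME
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Start,
    AfterMatch,
    AfterPattern,
    AfterReturn,
    Done,
}

/// A NAME is one or more lowercase ASCII letters.
fn is_name(tok: &str) -> bool {
    !tok.is_empty() && tok.chars().all(|c| c.is_ascii_lowercase())
}

/// Next state if `tok` is legal in `state`, otherwise None.
fn step(state: State, tok: &str) -> Option<State> {
    match (state, tok) {
        (State::Start, "MATCH") => Some(State::AfterMatch),
        (State::AfterMatch, t) if is_name(t) => Some(State::AfterPattern),
        (State::AfterPattern, "RETURN") => Some(State::AfterReturn),
        (State::AfterReturn, t) if is_name(t) => Some(State::Done),
        _ => None,
    }
}

/// Constrained decoding step: drop candidates the grammar rejects,
/// then pick the highest-scoring token among those that remain.
fn constrained_pick<'a>(
    state: State,
    candidates: &[(&'a str, f32)],
) -> Option<(&'a str, State)> {
    candidates
        .iter()
        .filter_map(|&(tok, score)| step(state, tok).map(|next| (tok, score, next)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(tok, _, next)| (tok, next))
}
```

From `State::Start`, candidates `[("SELECT", 0.9), ("MATCH", 0.1)]` yield `MATCH` even though the model scored `SELECT` higher, because `SELECT` is not legal there. If no candidate is legal, the function returns `None`.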
Can I sidestep the DSL? I want my LLMs to generate queries and using a new language is going to make that hard or expensive.
We're working on putting our grammar into llama.cpp so that it only outputs grammatically correct HQL. But even without that, it shouldn't be hard or expensive to do. I wrote a Claude wrapper that had our docs in its context window, and it did a good job of writing queries most of the time.
Does Helix support much of the graph algorithm world? For things like GraphRAG.
Either way, I'd be all over it if there was a Python SDK which worked with the generated types!
It’s built in Rust with native vector support. The open-source version is in-memory, but the commercial version supports disk-based scaling (we tested it with a 3TB graph on an M1 MacBook + insert all 100x faster than existing GraphDBs).
We have a python SDK already! What do you mean by generated types though?
I.e., you have to re-index all of the vectors when you make an update to them.
What other papers did you get inspiration from?
Graph DBs have been plagued with exploding complexity of queries as doing things like allowing recursion or counting paths isn't as trivial as it may sound. Do you have benchmarks and comparisons against other engines and query languages?
Could you share any information on the pricing model?
We chose AGPL to make sure someone can't make a cloud hosted version of our product, think MongoDB on AWS a few years back.
I'm surprised no one on the team searched crates.io once before picking the name. Good luck!
https://github.com/helix-editor/helix/discussions/7038
That being said, when I saw `helix-db` I was thrown too. "What's a text editor doing writing a vector-graph database, I thought they were working on plugins?"
We didn't think of getting people to use it until we found it was solving a real pain point for people, so we weren't worried about trademarks or names. There was no other helix db so that was good enough for us at the time.
Does that answer your question properly?
> Built for performance we're currently 1000x faster than Neo4j, 100x faster than TigerGraph
That is just hearsay though; I'm interested myself now and will run some proper benchmarks.
We've built SQL and PGVector ones already, just waiting for someone who could make use of other ones before we build them.
Let us know! Twitter in my bio
My friend who I worked on this with is putting together a technical blog on those graph optimisations so I'll link it here when he's done
Congratulations on the launch! This is a very exciting space, and it's great to see your take on it.
Running fair benchmarks, not benchmarketing, is a significant effort and we recently put in this effort to make things as fair and transparent as possible across a range of databases.
You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...
We'd be very interested in seeing the benchmarks you'd run and how we compare :)
You can sacrifice many things for faster performance, such as security, consistency levels or referential integrity.
I'm genuinely curious to learn what design decisions you will make as you continue building the database. There are so many options, each with its pros and cons.
If you would like to have a chat where we can exchange ideas, happy to do that :)
One of the problems I know people experience with them is that they're super slow at bulk reading.
Oh also, they aren't built in Rust haha
I think you misspelled "vendor lock in"