What I like about these kinds of solutions is that they address the practical challenges of using multiple LLMs. Rate limits, cost per token, and even just choosing the right model for the job can be a real headache.
KNN-router, for example, lets you define your own logic for routing queries, so you can factor in things like model accuracy, response time, and cost. You can even set up fallback models for when your primary model is unavailable.
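For anyone curious what that looks like mechanically, here is a toy sketch of the k-nearest-neighbours idea (not knn-router's actual API; the class name, arguments, and scoring are made up): embed the incoming query, find the most similar queries in a labelled set, and send it to the model that scored best on those neighbours.

    import numpy as np

    class SimpleKnnRouter:
        """Toy router: send a query to whichever model did best on similar past queries."""

        def __init__(self, embeddings, scores_per_model, k=5):
            # embeddings: (n, d) vectors for labelled past queries
            # scores_per_model: {model_name: (n,) array of quality scores on those queries}
            self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
            self.scores = scores_per_model
            self.k = k

        def route(self, query_embedding):
            q = query_embedding / np.linalg.norm(query_embedding)
            sims = self.embeddings @ q                    # cosine similarity to past queries
            nearest = np.argsort(sims)[-self.k:]          # indices of the k most similar
            # pick the model with the best mean score on those neighbours
            return max(self.scores, key=lambda m: float(self.scores[m][nearest].mean()))

Cost weights, latency budgets, and fallbacks would all layer on top of that same route() decision.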
It's cool to see these kinds of tools emerging because it shows that people are starting to think seriously about how to build robust, cost-effective LLM pipelines. This is going to be crucial as more and more companies start incorporating LLMs into their products and services.
https://github.com/pulzeai-oss/knn-router
// HN doesn't render [text](url) as Markdown.
I don't find much success just reusing a prompt against some other model without having some way to evaluate it, and I usually end up updating the prompt for that model.
The answer is right here: this is a cost-saving tool.
Every company and their mom wants to be in the GenAI game but has a strict budget. Tools like this help keep GenAI projects within budget.
With the router thingy, it keeps a record of every query, so you know where you stand, and it can switch to another model automatically instead of interrupting the workflow?
I may be explaining this very badly, but I think that's one use-case for how these LLM Routers help.
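Something like the sketch below is what I'm picturing. It uses the openai client and made-up model names; it isn't any particular router's implementation, just the fallback-plus-logging pattern:

    import time
    from openai import OpenAI, RateLimitError

    client = OpenAI()   # or a client pointed at whatever gateway/router you use
    query_log = []      # per-query record, so you know where each request ended up

    def complete_with_fallback(prompt, models=("big-model", "small-model")):
        """Try models in order; log which one actually answered instead of failing the workflow."""
        for model in models:
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                query_log.append({"ts": time.time(), "model": model, "prompt": prompt})
                return resp.choices[0].message.content
            except RateLimitError:
                continue   # primary is throttled -- fall through to the next model
        raise RuntimeError("all models failed or were rate limited")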
I don’t think using different models is the right approach, though. They behave differently. Better to use a big and a small one from the same family. Or, alternatively, use this to decide whether to give the AI more “thinking time” via chain of thought or agents.
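A toy version of that second idea, where a difficulty score (however the router produces it; this is a made-up stand-in) decides how much "thinking" to buy rather than which model family to use:

    def build_prompt(question, difficulty):
        """Spend more tokens on harder queries instead of switching model families."""
        if difficulty < 0.3:
            return question                                               # cheap: answer directly
        if difficulty < 0.7:
            return question + "\n\nThink step by step before answering."  # chain of thought
        # hardest queries: break into sub-tasks (the agent-ish end of the spectrum)
        return question + "\n\nBreak the problem into sub-tasks, solve each, then combine them."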
Using this, they were able to produce quite good results applying the similarity measurement to unseen queries on a standard benchmark. The leap of faith here is assuming that the same query-similarity method will continue to bear fruit when extended to queries that aren't benchmarkable.
I believe OpenRouter also provides an API that does the same thing as RouteLLM. Again, you only have to pay OpenRouter, not every model's service you use.
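For what it's worth, the call shape looks something like this. OpenRouter's endpoint is OpenAI-compatible; I believe the auto-routing model ID is "openrouter/auto", but check their docs for the current one:

    from openai import OpenAI

    # One API key, one bill, many underlying models behind the same endpoint.
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

    resp = client.chat.completions.create(
        model="openrouter/auto",   # let OpenRouter pick the model; or name one explicitly
        messages=[{"role": "user", "content": "What's the capital of Australia?"}],
    )
    print(resp.choices[0].message.content)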
It’d be really good to allow more than two models and to switch dynamically based on multiple constraints like latency, reasoning complexity, cost, etc.
But the underlying "how to choose a model that's smart enough but not too smart" seems difficult to understand.
Or am I reading this wrong? :)
The OpenAI-compatible API allows you to talk to the router like a regular GPT model.
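Concretely, you just point the standard client at the router's endpoint. This is a sketch with a placeholder base_url and model name, not any specific router's defaults:

    from openai import OpenAI

    # Same client you'd use against api.openai.com, just aimed at the router.
    client = OpenAI(base_url="http://localhost:6060/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="router-default",   # the router decides which underlying model actually answers
        messages=[{"role": "user", "content": "Summarize this ticket in one sentence: ..."}],
    )
    print(resp.choices[0].message.content)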
Problem solved, next.
I'm open to differing opinions, but after dealing with LangChain, premature optimization for non-critical problems is rampant in this space right now.