Three key problems:
- Choices: picking the best LLM for a single prompt out of hundreds is hard, and you're probably not picking the most optimized model for the prompt you wrote.
- Scaling/Upgrading: similar to choices, but you want to keep your output consistent even when models are deprecated or configurations change.
- Prompt management is scary: if something works, you never want to touch it, but you should be able to without fear of everything breaking.
So we launched Prompt Engine, which automatically runs your prompts on the best LLM every single time, with tools like internet access built in. You can also store prompts for reusability and caching, which increases performance on every run.
How it works:
tl;dr: we built a really small model, trained on datasets comparing hundreds of LLMs, that can automatically pick a model based on your prompt.
Here's an article explaining the details: https://jigsawstack.com/blog/jigsawstack-mixture-of-agents-m...
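To make the flow concrete, here is a rough sketch of how calling a prompt engine like this could look from an app. It's illustrative only: the endpoint paths, field names, and response shape are assumptions made for the example, not the actual API.

    // Hypothetical client for a prompt-engine style API (TypeScript, Node 18+).
    // Endpoint paths, field names, and the response shape are assumptions for
    // illustration; check the real docs for the actual interface.
    const BASE_URL = "https://api.example.com/v1/prompt_engine";
    const API_KEY = process.env.PROMPT_ENGINE_KEY ?? "";

    // Store a reusable prompt once; the engine picks the model on each run.
    async function createPrompt(): Promise<string> {
      const res = await fetch(BASE_URL, {
        method: "POST",
        headers: { "x-api-key": API_KEY, "Content-Type": "application/json" },
        body: JSON.stringify({
          prompt: "Summarize the following support ticket: {ticket}",
          inputs: [{ key: "ticket" }],      // dynamic variable filled in per run
          return_prompt: "A two-sentence summary",
        }),
      });
      const { id } = await res.json();
      return id;                            // store this id and reuse it
    }

    // Run the stored prompt; model selection happens on the server side.
    async function runPrompt(id: string, ticket: string) {
      const res = await fetch(`${BASE_URL}/${id}`, {
        method: "POST",
        headers: { "x-api-key": API_KEY, "Content-Type": "application/json" },
        body: JSON.stringify({ input_values: { ticket } }),
      });
      return res.json();
    }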
1) in practice, people tune prompts/behavior/output to models and don't dynamically switch all the time. to quote a notable founder i interviewed: "people who switch models all the time aren't building serious AI apps"
2) the prompt router, to properly do its job at the extreme, will have to be as smart as its smartest model, because dumb models may not recognize a tricky/tough prompt that requires an upgrade, at which point you're basically just reduced to running the smartest model anyway. smart people disagree with me here (you guys, and https://latent.space/p/lmarena). the other side of this argument is that there are only like 3-4 usage modes for models to really spike on (coding, roleplay, function calling, what else?) where you'll just pick that model and hardcode it or let the user pick - the scenario where you want a black box to pick for you is rare, and diminishes over time as all the labs are hellbent on bitter-lessoning your switching advantage away. bad idea to bet against the bitter lesson
3) both OAI and Anthropic will offer "good enough" routing for house models soon https://x.com/swyx/status/1861229884405883210 . people don't need theoretically globally perfect routing, they just need good enough.
it seems Prompt Engine is a little fancier than model routing, but it still reads effectively like routing to me. curious about your responses to the criticism.
<handwaving> the prompt will just adapt to model :)
- Prompt style is starting to standardize across models - something we see happening more and more.
- There are keyword triggers and prompt styles that perform better in certain models than in others. Prompt Engine first runs your prompt across 5-6 different models, then ranks the output of each run. The pool of models gets smaller, we pick the one that fits best, and the prompt gets more optimized towards that model.
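As a rough illustration of that narrowing process (not the actual implementation - the scoring function, pool size, and halving policy here are made up):

    // Toy sketch of "run on several models, rank the outputs, shrink the pool".
    // runOnModel() stands in for an ordinary completion call and scoreOutput()
    // for the small ranking model described above; both are placeholders.
    async function pickModel(
      prompt: string,
      candidates: string[],                 // e.g. 5-6 model names to start with
      runOnModel: (model: string, prompt: string) => Promise<string>,
      scoreOutput: (prompt: string, output: string) => Promise<number>,
    ): Promise<string> {
      let pool = [...candidates];
      while (pool.length > 1) {
        // Run the prompt on every model still in the pool and score each output.
        const scored = await Promise.all(
          pool.map(async (model) => {
            const output = await runOnModel(model, prompt);
            return { model, score: await scoreOutput(prompt, output) };
          }),
        );
        // Keep the top half and repeat until a single model remains.
        scored.sort((a, b) => b.score - a.score);
        pool = scored.slice(0, Math.ceil(pool.length / 2)).map((s) => s.model);
      }
      return pool[0];                       // the model the prompt gets pinned to
    }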
1) The goal often isn't model switching but rather model-to-prompt optimization, balancing performance, quality, safeguards & cost at scale - especially when an AI app requires multiple models, which, from the users we talk to, it often does. I have seen applications that use Llama 3.2 1B on Groq for quick cleaning of simple unstructured data, GPT-4 with streaming for front-end user chat, and Gemini separately for processing large-context PDF docs, where it tends to be a lot more accurate than the GPT suite. On average, we see an application using 3 to 4 different models in significantly different ways.
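To make the "one app, several models" point concrete, a per-task mapping of the kind described above might look like this (the model names are just examples, not recommendations):

    // Hypothetical example: one app pinning different models to different jobs,
    // mirroring the use cases above. Names are illustrative only.
    const modelForTask = {
      data_cleaning: "llama-3.2-1b",        // small + fast, e.g. on Groq
      user_chat: "gpt-4",                   // streamed to the front end
      long_pdf_qa: "gemini-1.5-pro",        // large context window for big docs
    } as const;

    type Task = keyof typeof modelForTask;

    function modelFor(task: Task): string {
      return modelForTask[task];
    }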
2) There are hundreds of LLM routers in the market, and the goal isn't to add another one. Here are a few pain points we know exist that LLM routers in the market don't solve:
- On average, an AI app uses 3-4 models, so the job is finding the best 3-4 for that application and sticking with them. In the article, we explain how our model scores the outputs and ranks the models over a few runs.
- Performance-cost-quality ratio: the best model on the market could execute almost every prompt, but "best" can also mean slow performance and high cost at scale. That may not matter for small apps, where something like Claude 3.5 Sonnet should be good for all situations.
- Upgrading is not the biggest point, but AI is moving fast. Just yesterday, Groq x Meta released Llama 3.3 70B, which benchmarks close to the older Llama 3.1 405B at a significantly lower cost, with a big performance increase and better language support. That's a great "free" upgrade, but most apps using Llama 3.1 405B/70B won't be able to make this quick change without expecting breaking changes in the prompt. We do this under the hood without users having to change anything.
- Safeguards: with the AI world moving so quickly, safeguards are still far behind, and we're catching up. It's challenging for many companies to handle this at scale across multiple models while keeping up with all the new "hacks", and to let users choose the degree of guarding they'd like in their application. This is something we can keep consistent across all models. It's still pretty new for us too, but it's one of our main focus points - especially when part of the prompt execution is dynamic variables that handle user inputs.
- The last and my favorite point is great DX: prompt caching at the API layer, prompt management and storage for repeated execution, dynamic variable management with smart caching, internet access for LLMs with built-in safeguards, etc.
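As a hedged sketch of the dynamic-variable-plus-caching idea in that last point (the function names and caching policy are invented for illustration):

    // Illustrative only: a stored prompt template with dynamic variables, where
    // the template is fixed and only the user-supplied values change per run.
    const template = "Translate the following text to {language}: {text}";

    // Naive substitution of {variables}; a real system would also sanitize the
    // user-supplied values before they reach the model (the safeguards point above).
    function fillTemplate(tpl: string, vars: Record<string, string>): string {
      return tpl.replace(/\{(\w+)\}/g, (_, key) => vars[key] ?? `{${key}}`);
    }

    // Simple in-memory cache keyed by the fully-resolved prompt, so repeated
    // executions with identical inputs skip the model call entirely.
    const cache = new Map<string, string>();

    async function runCached(
      vars: Record<string, string>,
      callModel: (prompt: string) => Promise<string>,
    ): Promise<string> {
      const prompt = fillTemplate(template, vars);
      const hit = cache.get(prompt);
      if (hit !== undefined) return hit;
      const result = await callModel(prompt);
      cache.set(prompt, result);
      return result;
    }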
Prompt Engine focuses on prompt management with smart routing rather than on the easy/constant switching of models.
3) This is a huge question mark for me, and I'm not sure how great this would be to develop as a product in-house. We see in the SaaS/cloud world that multi-cloud is the norm now: it started with AWS being dominant, then providers like GCP & Azure took market share, and now we have layer-2 products like Supabase & Vercel. I bet the same thing will happen in the LLM space, and we'll see apps building different pieces of their AI stack with ~2-3 different LLM providers that are stronger at different things.
Thanks for the super valid points! I love this kind of criticism, as it helps us be less idealistic and build better products :) I hope the points above answer some of the concerns, always happy to dive deeper over a chat!
The list is a little outdated but we're adding the new llama 3.3 models as well
Application error: a client-side exception has occurred (see the browser console for more information).
Make a simple HTML page which uses the VideoEncoder API to create a video that the user can download.
Since the VideoEncoder API is made for this exact use case and is publicly available, it should be possible for an LLM to figure it out. But I have yet to see an LLM answer with a working solution.

Detractors will claim that it didn't complete the assignment because it didn't use the prescribed VideoEncoder API, but the end result, a simple HTML page that generates a 3-second-long webm file, speaks for itself.
Also, how is this relevant to the submission?
Make an HTML page which uses the VideoEncoder API to create a video that the user can download. Make sure to incorporate roll-your-own container muxing. Do not limit yourself on the header or data.
https://chatgpt.com/share/67531f7c-56cc-800b-ac7c-d3860d1cf9...
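For reference, the encoding half of that task with the VideoEncoder API looks roughly like the sketch below. It is deliberately incomplete: WebCodecs only hands you raw VP8 chunks, and turning those into a downloadable .webm still requires rolling your own container muxing, which seems to be where the LLM answers fall short.

    // Minimal sketch: encode ~3 seconds of canvas frames with the VideoEncoder API.
    // This only produces raw VP8 chunks; writing a WebM container around them
    // (the "roll your own muxing" part of the prompt) is left out here.
    async function encodeThreeSeconds(): Promise<EncodedVideoChunk[]> {
      const chunks: EncodedVideoChunk[] = [];
      const encoder = new VideoEncoder({
        output: (chunk) => chunks.push(chunk),
        error: (e) => console.error(e),
      });
      encoder.configure({ codec: "vp8", width: 320, height: 240, bitrate: 1_000_000, framerate: 30 });

      const canvas = document.createElement("canvas");
      canvas.width = 320;
      canvas.height = 240;
      const ctx = canvas.getContext("2d")!;

      for (let i = 0; i < 90; i++) {                     // 90 frames at 30 fps
        ctx.fillStyle = `hsl(${i * 4}, 80%, 50%)`;       // draw something per frame
        ctx.fillRect(0, 0, 320, 240);
        const frame = new VideoFrame(canvas, { timestamp: (i * 1_000_000) / 30 }); // microseconds
        encoder.encode(frame, { keyFrame: i % 30 === 0 });
        frame.close();
      }
      await encoder.flush();
      encoder.close();
      return chunks; // still needs a muxer to become a downloadable .webm
    }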
The title of the submission states "Auto pick LLMs based on your prompt".
The GP provided a prompt where auto picking an LLM would possibly help. Seems relevant to me. Even if the answer from the best LLM is, "This isn't directly possible, here are alternatives".