Show HN: Prompt Engine – Auto pick LLMs based on your prompts
Nowadays, a common AI tech stack has hundreds of different prompts running across different LLMs.

Three key problems:

- Choices: picking the best LLM for a single prompt out of hundreds is challenging, and you're probably not running the most optimized model for a prompt you wrote.

- Scaling/upgrading: similar to choices, but you want to keep your output consistent even when models get deprecated or configurations change.

- Prompt management is scary: if something works, you never want to touch it, but you should be able to without fear of everything breaking.

So we launched Prompt Engine, which automatically runs each of your prompts on the best LLM every single time, with tools like internet access built in. You can also store prompts for reuse, and caching improves performance on every run.

How it works:

tldr: we built a really small model, trained on datasets comparing hundreds of LLMs, that automatically picks a model based on your prompt.

Here's an article explaining the details: https://jigsawstack.com/blog/jigsawstack-mixture-of-agents-m...
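
Here's a minimal sketch of the routing idea, assuming the trained classifier boils down to a per-model quality estimate traded off against cost. The model names, prices, heuristics, and weights below are illustrative, not the actual Prompt Engine internals:

    // Hypothetical router: a small classifier estimates how well each
    // candidate model fits a prompt, and the top-scoring model runs it.
    type Candidate = { model: string; costPer1kTokens: number };

    const CANDIDATES: Candidate[] = [
      { model: "gpt-4o", costPer1kTokens: 0.005 },
      { model: "claude-3-5-sonnet", costPer1kTokens: 0.003 },
      { model: "llama-3.3-70b", costPer1kTokens: 0.0006 },
    ];

    // Stand-in for the small trained model: the real thing would be a
    // classifier trained on cross-LLM comparison datasets, not regexes.
    function estimateQuality(prompt: string, model: string): number {
      const looksLikeCode = /```|function |class |def /.test(prompt);
      if (looksLikeCode && model.startsWith("claude")) return 0.9;
      if (prompt.length > 4000 && model.startsWith("gpt")) return 0.8;
      return 0.5;
    }

    function pickModel(prompt: string): string {
      let best = CANDIDATES[0];
      let bestScore = -Infinity;
      for (const c of CANDIDATES) {
        // Trade predicted quality off against cost; the weight is made up.
        const score = estimateQuality(prompt, c.model) - 10 * c.costPer1kTokens;
        if (score > bestScore) {
          bestScore = score;
          best = c;
        }
      }
      return best.model;
    }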

swyx · 2 weeks ago
congrats yoeven! i have been a skeptic of model routing like Martian, because while it sounds good in theory,

1) in practice, people tune prompts/behavior/output to models and don't dynamically switch all the time. to quote a notable founder i interviewed - "people who switch models all the time arent building serious AI apps"

2) the prompt router, to properly do its job at the extreme, will have to be as smart as its smartest model, because dumb models may not recognize a tricky/tough prompt that requires an upgrade, at which point you're basically just reduced to running the smartest model anyway. smart people disagree with me here (you guys, and https://latent.space/p/lmarena). the other side of this argument is that there are only like 3-4 usage modes for models to really spike on (coding, roleplay, function calling, what else) where you'll just pick that model and hardcode it or let the user pick - the scenario where you want a black box to pick for u is rare, and diminishes over time as all the labs are hellbent on bitter lessonning your switching advantage away. bad idea to bet against bitter lesson

3) both OAI and Anthropic will offer "good enough" routing for house models soon https://x.com/swyx/status/1861229884405883210 . people dont need theoretically globally perfect routing, they just need good enough.

it seems Prompt Engine is a little fancier than model routing, but it still reads effectively like routing to me. curious your responses to criticism.

htrp · 2 weeks ago
I'd also ask how you're dealing with the idiosyncrasies across model families (it's very different to prompt Gemini vs Claude vs GPT-4o) when you are routing these LLM inputs
swyx · 2 weeks ago
> The engine automatically enhances the initial prompt to improve accuracy, reduce token usage, and prevent output structure breakage.

<handwaving> the prompt will just adapt to model :)

Basically ^

- Prompt style is starting to standardize across models. Something we see happening more and more.

- There are keyword triggers and prompt styles that perform better in certain models than in others. Prompt Engine first runs your prompt across 5-6 different models, then ranks the output of each run. The pool of models gets smaller with each pass until we pick the one that fits best, and the prompt gets optimized towards that model (rough sketch of the idea below).
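
A simplified sketch of that narrowing loop, with the actual execution and ranking functions left abstract; this illustrates the idea, not our implementation:

    // Run the prompt on a pool of models, rank the outputs,
    // keep the top half, and repeat until one model remains.
    async function narrowPool(
      prompt: string,
      pool: string[],
      runOnModel: (model: string, prompt: string) => Promise<string>,
      rankOutput: (output: string) => number, // higher = better
    ): Promise<string> {
      while (pool.length > 1) {
        const scored = await Promise.all(
          pool.map(async (model) => ({
            model,
            score: rankOutput(await runOnModel(model, prompt)),
          })),
        );
        scored.sort((a, b) => b.score - a.score);
        pool = scored.slice(0, Math.ceil(pool.length / 2)).map((s) => s.model);
      }
      return pool[0];
    }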

Hey swyx! I have a similar perspective when it comes to model routing, and even to LLM frameworks that centralize the schema when almost every provider already supports OpenAI's standard.

1) The goal isn't frequent model switching but rather model-to-prompt optimization, balancing performance, quality, safeguards & cost at scale, especially when an AI app needs multiple models, which is common among the users we talk to. I've seen applications use Llama 3.2 1B on Groq for quick cleaning of simple unstructured data, GPT-4 with streaming for front-end user chat, and Gemini for querying large context-length PDF docs, where it tends to be a lot more accurate than the GPT suite. On average, we see an application using 3 to 4 different models in significantly different ways.

2) There are hundreds of LLM routers in the market, and the goal isn't to add another one. Here are a few pain points we know exist that those routers don't solve:

- On average, an AI app uses 3-4 models, so the job is finding the best 3-4 for that application and sticking to them. In the article, we explain how our model scores the output and ranks the models over a few runs.

- Performance-cost-quality ratio: while the best model in the market could execute almost every prompt, "best" can also mean slow performance and high cost at scale. That might not matter for small apps, where Claude 3.5 Sonnet should be good for all situations.

- Upgrading isn't one of the biggest points, but AI is moving fast: just yesterday, Groq x Meta released Llama 3.3 70B, which benchmarks close to the older 3.1 405B at a significantly lower cost, with a huge performance increase and better language support. That's a great "free" upgrade, but most apps on 3.1 405B/3.1 70B won't be able to make this quick change without expecting breaking changes on the prompt. We do this under the hood without users having to change anything.

- Safeguards, with the AI world moving so quickly, safeguards are still far behind, and we're catching up. It's challenging for many companies to handle this at scale across multiple models while keeping up with all the new "hacks" and allowing users to choose the degree of guarding they would like in their application. This is something we can keep consistent across all models. Still pretty new for us too, but it's one of our main focus points. This is especially important when part of the prompt execution is dynamic variables that handle user inputs.

- The last and my favorite point is great DX: prompt caching at the API layer, prompt management and storage for repeat execution, dynamic variable management with smart caching, internet access for LLMs with built-in safeguards, etc.

Prompt Engine focuses on prompt management with smart routing rather than on the easy/constant switching of models.
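
To make the DX point concrete, here's roughly what that workflow could look like from the caller's side. The interface, method names, and parameters are hypothetical stand-ins, not JigsawStack's actual API:

    // Purely illustrative shape of the DX described above.
    interface PromptStore {
      create(opts: { prompt: string; inputs: string[] }): Promise<{ id: string }>;
      run(id: string, values: Record<string, string>): Promise<{ output: string }>;
    }

    async function demo(store: PromptStore): Promise<void> {
      // Store a prompt once, with a dynamic variable placeholder...
      const { id } = await store.create({
        prompt: "Summarize the following support ticket: {ticket}",
        inputs: ["ticket"],
      });
      // ...then reuse it; routing, caching, and safeguards happen under the hood.
      const { output } = await store.run(id, { ticket: "Export job hangs at 99%." });
      console.log(output);
    }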

3) This is a huge question mark for me, and I'm not sure how great this would be to develop as a product internally. We see in the SaaS-cloud world that multi-cloud is the norm now: it started with AWS being dominant, then providers like GCP & Azure took market share, and now we have layer-2 products like Supabase & Vercel. I bet the same thing will happen in the LLM space, and we'll see apps building different pieces of their AI stack with ~2-3 different LLM providers that are stronger at different things.

Thanks for the super valid points! I love this kind of criticism, as it helps us be less idealistic and build better products :) I hope the points above answer some of the concerns, always happy to dive deeper over a chat!

The small model you trained, how did you annotate the dataset? Because most of the time, output from big models will be subjective, with not a drastic difference in quality. If there is a drastic difference, you don't need a model/classifier. For smaller differences in quality + cost, what would the annotated dataset look like?
We run tons of evals with public datasets from LLM Arena on common categories such as law, finance, code, maths, etc., pair them up with public benchmarks such as Natural2Code, GPQA, etc., then tag the benchmarks with relevant cost and speed numbers at the provider level. One of the weights of the Prompt Engine model is the similarity of the outputs when we execute a user prompt on 5-6 different models: higher quality outputs tend to be similar across at least 3 of the models.
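
As a toy illustration of that similarity weight (the metric and scoring here are stand-ins, not the actual eval pipeline):

    // Outputs that agree with most of the other models' outputs score
    // higher. Jaccard word overlap stands in for whatever similarity
    // measure the real system uses.
    function jaccard(a: string, b: string): number {
      const A = new Set(a.toLowerCase().split(/\s+/));
      const B = new Set(b.toLowerCase().split(/\s+/));
      const inter = Array.from(A).filter((w) => B.has(w)).length;
      return inter / (A.size + B.size - inter);
    }

    function consensusScores(outputs: string[]): number[] {
      // Each output's score is its mean similarity to every other output.
      return outputs.map((o, i) => {
        const others = outputs.filter((_, j) => j !== i);
        const total = others.reduce((sum, other) => sum + jaccard(o, other), 0);
        return total / others.length;
      });
    }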
soco · 2 weeks ago
I can't find that list of hundreds of models to save my life... is it published somewhere?
We haven't shared this yet; I quickly exported the list we used for training into a gist: https://gist.github.com/yoeven/7a0179f75622167a33fd8040d9e72...

The list is a little outdated but we're adding the new llama 3.3 models as well

It's a pity your homepage crashes and replaces all content with the text

    Application error: a client-side exception has occurred (see the browser console for more information).
Yo! I can't seem to replicate this. If you don't mind, could you send me a screenshot and a few more details, like the browser you're using, whether you're on Windows/Mac, and maybe the console output? You can DM me on https://x.com/yoeven or email me at yoeven@jigsawstack.com :)
mg · 2 weeks ago
Is there an LLM that can solve the following simple coding task?

    Make a simple HTML page which
    uses the VideoEncoder API to
    create a video that the user
    can download.
Since the VideoEncoder API is made for this exact use case and is publicly available, an LLM should be able to figure it out. But I have yet to see an LLM answer with a working solution.
Took 4 prompts, and ChatGPT-4o decided to use a different API, but I made it produce a page that generates a 3 second webm to download. https://chatgpt.com/share/67531a38-4bfc-8009-bc58-9c823230bf...

Detractors will claim that it didn't complete the assignment because it didn't use the prescribed VideoEncoder API, but the end result, a simple HTML page that generates a 3 second long webm file, speaks for itself.

mg · 2 weeks ago
The problem with the MediaRecorder API is that it saves the current timestamp with the frames, so the video plays back at the speed it was created. That means you can't use the MediaRecorder API for video processing, which is why I referenced the VideoEncoder API in the prompt.
what are you really trying to do?
mg · 2 weeks ago
Create videos in the browser which the user then can download and play on their device.
Kiro · 2 weeks ago
No, you can't do that with just the VideoEncoder API, which only produces raw encoded frames. You need container muxing to create something playable, which is far from a "simple coding task".

Also, how is this relevant to the submission?

I got it to work with this prompt using GPT-o1:

Make a HTML page which uses the VideoEncoder API to create a video that the user can download. Make sure to incorporate a roll your own container muxing. Do not limit yourself on the header or data.

https://chatgpt.com/share/67531f7c-56cc-800b-ac7c-d3860d1cf9...
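
For anyone wondering what the muxing gap looks like in practice, here's a minimal sketch: VideoEncoder hands back raw EncodedVideoChunk objects, and the container step (muxIntoWebM below is a hypothetical placeholder) is exactly the part you have to roll yourself or take from a muxing library:

    // Hypothetical muxer: turning raw chunks into a playable WebM file
    // is the part VideoEncoder does not do for you.
    declare function muxIntoWebM(chunks: EncodedVideoChunk[]): Blob;

    async function encodeFrames(frames: VideoFrame[]): Promise<Blob> {
      const chunks: EncodedVideoChunk[] = [];
      const encoder = new VideoEncoder({
        output: (chunk) => chunks.push(chunk), // raw encoded frames, no container
        error: (e) => console.error(e),
      });
      encoder.configure({ codec: "vp8", width: 640, height: 480 });
      for (const frame of frames) {
        encoder.encode(frame);
        frame.close();
      }
      await encoder.flush();
      return muxIntoWebM(chunks); // the "missing" muxing step
    }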

mg · 2 weeks ago
Yay, I just tried it on my iPad and it works!

When you say "GPT-o1", do you mean the model "o1-preview"? Because I think that is the only o1 I can access via the API.

I believe they may have just changed GPT-o1-preview to GPT-o1 today.
> Also, how is this relevant to the submission?

The title of the submission states "Auto pick LLMs based on your prompt".

The GP provided a prompt where auto picking an LLM would possibly help. Seems relevant to me. Even if the answer from the best LLM is, "This isn't directly possible, here are alternatives".