- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N
Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
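A minimal sketch of how that rubric could be scored, assuming each Y/N check has already been turned into a 0/1 value (the names and weights below just mirror the example above):

    # Hypothetical weighted rubric scorer mirroring the example above.
    # Each argument is a 0/1 answer to one Y/N rubric question.
    def score_response(accuracy: int, tone: int, next_steps: int) -> float:
        """Combine binary rubric checks into a single weighted score."""
        weights = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}
        return (weights["accuracy"] * accuracy
                + weights["tone"] * tone
                + weights["next_steps"] * next_steps)

    # e.g. cited the policy and offered next steps, but tone was off:
    # score_response(1, 0, 1) -> 0.7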
Why: Reduces volatility of responses while still maintaining the creativity (temperature) needed for good intuition
Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
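A rough sketch of that regenerate-on-failure loop, assuming you already have a generate() call and an evaluate() step that returns the failed checks (both are placeholders here, not any particular API):

    # Hypothetical retry loop: failed checks are appended to the prompt
    # so the model can regenerate with that feedback in context.
    def generate_with_feedback(prompt, generate, evaluate, max_attempts=3):
        feedback = ""
        for _ in range(max_attempts):
            response = generate(prompt + feedback)
            failures = evaluate(response)        # list of failed checks
            if not failures:
                return response                  # passed all checks
            feedback = ("\n\nYour previous answer failed these checks: "
                        + "; ".join(failures)
                        + ". Please fix them and answer again.")
        return response                          # best effort after retries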
I know that consuming something and not thumbing it up/down sort-of does that, but it's a vague enough signal (it could also mean "not close enough to the keyboard/remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.
In practice, people generally didn't even vote with two options, they voted with one!
IIRC YouTube even got rid of downvotes for a while, as they were mostly used for brigading.
No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".
Stuff disliked into oblivion was either just straight-up bad or wrong (in the case of bad tutorials/info), and brigading was a very tiny percentage of it.
“You’re absolutely right! Nice catch how I absolutely fooled you”
By having independent tests, seeing if the output passes them (yes or no), and then weighting the evaluation so that some (more complicated) tasks count for more than others.
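One way to sketch that: run each independent test as a pass/fail check and give the harder tasks more weight when aggregating (the test definitions here are purely illustrative):

    # Hypothetical weighted pass/fail suite: harder tasks carry more weight.
    tests = [
        {"name": "simple_lookup",   "weight": 1.0, "check": lambda out: "refund" in out},
        {"name": "multi_step_task", "weight": 3.0, "check": lambda out: out.endswith("DONE")},
    ]

    def weighted_pass_rate(outputs: dict) -> float:
        """outputs maps test name -> model output; returns weighted pass rate."""
        total = sum(t["weight"] for t in tests)
        passed = sum(t["weight"] for t in tests if t["check"](outputs[t["name"]]))
        return passed / total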
IME deep thinking has moved from upfront architecture to post-prototype analysis.
Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging
With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate
When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.
The shift: from "design away problems" to "evaluate into solutions."
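A sketch of what "use an LLM to help debug the LLM" can look like in practice: collect the failure cases and ask a model why each one failed and how to tighten the prompt (call_llm is a placeholder for whatever client you use, and the case fields are assumptions):

    # Hypothetical failure-analysis pass: ask a model to diagnose each
    # failed case and suggest a prompt clarification.
    def analyze_failures(system_prompt, failures, call_llm):
        suggestions = []
        for case in failures:  # each case: {"input": ..., "output": ..., "reason": ...}
            question = (
                "Prompt:\n" + system_prompt +
                "\n\nInput:\n" + case["input"] +
                "\n\nOutput:\n" + case["output"] +
                "\n\nThis output failed because: " + case["reason"] +
                "\n\nWhy did the prompt allow this, and what one clarification "
                "would prevent it?"
            )
            suggestions.append(call_llm(question))
        return suggestions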
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful, is all.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. Pleasantly surprised when the LLM-as-judge scored gpt5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, as I assumed higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
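For context, a comparison like that can be as simple as running the same cases through each candidate and letting a judge model score the answers (the model names and the generate/judge calls below are placeholders, not the setup described above):

    # Hypothetical model shoot-out: same cases, same judge, compare averages.
    candidates = ["model-a", "model-b", "model-c"]  # placeholder names

    def compare_models(cases, generate, judge):
        """generate(model, case) -> answer; judge(case, answer) -> score in [0, 1]."""
        results = {}
        for model in candidates:
            scores = [judge(case, generate(model, case)) for case in cases]
            results[model] = sum(scores) / len(scores)
        return results  # e.g. {"model-a": 0.82, "model-b": 0.74, ...}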
It's the hard part of using LLMs and a mistake I think many people make. The only way to really understand or know is to have repeatable and consistent frameworks to validate your hypothesis (or in my case, have my hypothesis be proved wrong).
You can't get to 100% confidence with LLMs.
I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.
I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.
Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of Pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I tried.
Cut all my subs, spend less money, don't get rate limited
I've been using the smaller models ever since. Nano/mini, flash, etc.
I recently found out that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4.1-nano), and ~4% hallucination, the lowest in blind tests in LLM Arena.
I have yet to go back to small models; I'm waiting on an upstream feature, and the GPU provider has been seeing capacity issues, so I am sticking with the gemini family for now.
gemini-3-flash stands well above gemini-2.5-pro
What I did splurge on was brief OpenAI access for a subtitle translator program, plus some use of the DeepSeek API. Actually I think that $13 includes some as-yet-unused credits. :D
I'd be happy to provide details if CLIs are an option and you don't mind some sweatshop agent. :)
(I am just now noticing I meant to type 2 years not 3 above. Sorry about that.)
Presumably that'll be some sort of funnel for a paid upload of prompts.
Here's a bug report: switching the model group makes the API hang in private mode.
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.
Cost and model commoditization are part of it, like you point out. There's also the potential for degraded performance because off-the-shelf benchmarks aren't generalizing how you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.
But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.
Evaluation has been a critical practice in machine learning for years. IMO it is no less imperative when building with LLMs.
It sounds like he's building some kind of ai support chat bot.
I despise these things.
On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide here, both input and output would be massive.
Any resources you can recommend to properly tackle this going forward?