Show HN: Opik, an open source LLM evaluation framework

Hey HN! I'm Caleb, one of the contributors to Opik, a new open source framework for LLM evaluations.

Over the last few months, my colleagues and I have been working on a project to solve what we see as the most painful parts of writing evals for an LLM application. For this initial release, we've focused on a few core features that we think are the most essential:

- Simplifying the implementation of more complex LLM-based evaluation metrics, like Hallucination and Moderation.

- Enabling step-by-step tracking, such that you can test and debug each individual component of your LLM application, even in more complex multi-agent architectures.

- Exposing an API for "model unit tests" (built on Pytest), to allow you to run evals as part of your CI/CD pipelines

- Providing an easy UI for scoring, annotating, and versioning your logged LLM data, for further evaluation or training.

It's often hard to feel like you can trust an LLM application in production, not just because of the stochastic nature of the model, but because of the opaqueness of the application itself. Our belief is that with better tooling for evaluations, we can meaningfully improve this situation, and unlock a new wave of LLM applications.

You can run Opik locally, or with a free API key via our cloud platform. You can use it with any model server or hosted model, but we currently have a built-in integration with the OpenAI Python library, which means it automatically works not just with OpenAI models, but with any model served via a compatible model server (ollama, vLLM, etc). Opik also currently has out-of-the-box integrations with LangChain, LlamaIndex, Ragas, and a few other popular tools.

This is our initial release of Opik, so if you have any feedback or questions, I'd love to hear them!

86
15
calebkaiser
1 year ago
github.com

itssteadyfreddy
·
11 months ago
·
[ - ]

Ran through some colabs. Signed up for a key and tested the ollama colab. Got a little error on cell 2, "ConnectError: [Errno 99] Cannot assign requested address" but the traces went through which was fine. Just a little heads up

I am using Arize Phoenix and trying to see the difference. Can you highlight?

tcsizmadia
·
1 year ago
·
[ - ]

It looks very promising! Congratulations, great tool! I can't wait to start experimenting with it. I plan to use it locally, with Ollama.

calebkaiser
·
1 year ago
·
[ - ]

Awesome, thanks! If you run into any issues, you can open a ticket on the repo or ping me directly at caleb[at]comet.com

trolan
·
1 year ago
·
[ - ]

I'm in a University course related to AI testing and quality assurance. This is something I'll definitely bring up and see how it can be used.

With OpenAI comparability, hoping it supports OpenRouter out of the box, which means it supports Anthropic and Google too, along with a host of open models hosted elsewhere.

calebkaiser
·
1 year ago
·
[ - ]

Fantastic to hear! Opik should work with OpenRouter out of the box, particularly if you are using the OpenAI Python client to interface with OpenRouter. Opik's integration with OpenAI is implemented via their Python library, and so it is agnostic with respect to actual backend serving the model.

Your course sounds interesting! If you're doing any research around testing and evaluation, particularly regarding applied LLM applications, we have several researchers and engineers on our team who I'm sure would be happy to connect (myself included).

pablomendes
·
11 months ago
·
[ - ]

Is the course focused on LLMs used to generate text or does it also talk about other kinds of testing like search, images, etc?

smcleod
·
1 year ago
·
[ - ]

Looks interesting, great to see it specifically calls out supporting LLM servers as first class citizens!

I see some of the code is Java, that strikes me as an interesting choice - is there a reason behind that or simply the language that the devs were already familiar with?

calebkaiser
·
1 year ago
·
[ - ]

The decision to go with Java for the backend was because we feel Java is a bit more battle tested than Python for production (dependency management, concurrency, compilation etc). Go was another strong contender, but we felt like it's a bit easier to contribute to a Java codebase for OSS contributors. At the end of the day, I'm sure we could have gone with several different options. Our cloud codebase has services in multiple languages (Python, Java, Go, TS), and Opik uses TypeScript for FE, Python for SDKs, and Java for backend.

hrpnk
·
1 year ago
·
[ - ]

Is there a reason you didn't just implement OpenTelemetry (OT) straight away? Curious about the trade offs to opt for a home-grown telemetry inspired by OT instead.

calebkaiser
·
1 year ago
·
[ - ]

Good question! It mostly came down to implementation speed, as well as some uncertainty about performance/overhead. We will be releasing OpenTelemetry compatible ingestion endpoints in the near future, but since Opik has so many features that aren't related to OT, we decided to move forward without it for the initial release. It is a great project though and something we will be implementing soon—it will especially be useful for building out integrations with frameworks that are OpenTelemetry compatible.

baggiponte
·
1 year ago
·
[ - ]

Have you seen two “prototypes” of standard for LLM telemetry? One is openllmetry, maintained by the folks at TraceLoop. Seems the more popular. The other one is openinference IIRC, by Arize AI.

calebkaiser
·
1 year ago
·
[ - ]

Of the two, the only one I've ever personally explored is OpenLLMetry. Extremely cool project. In general, this is one of those areas where the field still needs to "shake out" a bit.

kakaly0403
·
1 year ago
·
[ - ]

There is a GenAI standard spec from OpenTelemetry for tracing LLM based applications. Currently there are 3 library implementations of this spec - Langtrace, OpenLLMetry and OpenLit. Microsoft has an implementation for .NET aswell. OpenInference, though opentelemetry compatible does not adhere to the standard spec.

yu3zhou4
·
1 year ago
·
[ - ]

Hello! How does it compare to DeepEval (open source)?

calebkaiser
·
1 year ago
·
[ - ]

Great question. First, we have a ton of respect for the work the DeepEval team is doing. That said, we took a fundamentally different approach in building Opik as an open source project. With DeepEval, if you want to log your data or use the UI, you need to use Confident AI's cloud platform (which as far as I'm aware, has no free plan). So, if you want to visualize traces, do production monitoring, labeling, etc, you can't just use the DeepEval open source library.

All of Opik's functionality, including the UI and logging, is available in the open source version. The only "features" that are inaccessible from the open source version of Opik are things that are actually features of the Comet platform. For example, Comet Artifacts allow you to store your datasets as versioned assets, preserved as an immutable series of snapshots, which automatically track any experiments they've been a part of in order to preserve their full data lineage. You can use Opik with Artifacts, but that will require a free Comet account. Any Opik-specific feature, however, is fully available in the open source version.