Show HN: BioTradingArena – Benchmark for LLMs to predict biotech stock movements
Hi HN,

My friend and I have been experimenting with using LLMs to reason about biotech stocks. Unlike many other sectors, biotech trading is largely event-driven: FDA decisions, clinical trial readouts, safety updates, or changes in trial design can cause a stock to 3x in a single day (https://www.biotradingarena.com/cases/MDGL_2023-12-14_Resmet...).

Interpreting these ‘catalysts,’ which usually come in the form of a press release, typically requires analysts with previous expertise in biology or medicine. A catalyst that sounds “positive” can still lead to a selloff if, for example:

- the effect size is weaker than expected

- results apply only to a narrow subgroup

- endpoints don’t meaningfully de-risk later phases

- the readout doesn’t materially change approval odds

To explore this, we built BioTradingArena, a benchmark for evaluating how well LLMs can interpret biotech catalysts and predict stock reactions. Given only the catalyst and the information available before the date of the press release (trial design, prior data, PubMed articles, and market expectations), the benchmark measures how accurately the model predicts the stock’s movement once the catalyst is released.
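
As a rough illustration of the setup (field and function names below are hypothetical, not the actual harness), the evaluation loop boils down to something like this, here scoring only directional accuracy:

    # Simplified sketch of the evaluation loop. Field names and the
    # directional-accuracy metric are illustrative, not the exact harness.
    def evaluate(model_predict, catalysts):
        """model_predict(pre_catalyst_context, press_release) -> predicted % move."""
        hits = 0
        for c in catalysts:
            predicted = model_predict(c["pre_catalyst_context"], c["press_release"])
            realized = c["close_day_of"] / c["close_day_before"] - 1.0
            hits += int((predicted > 0) == (realized > 0))
        return hits / len(catalysts)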

The benchmark currently includes 317 historical catalysts. We also created subsets for specific indications (the largest being oncology), since different indications often show different patterns. We plan to add more catalysts to the public dataset over the next few weeks. The dataset spans companies of different sizes, and we compute a volatility-adjusted score, since large-cap biotech tends to exhibit much lower volatility than small- and mid-cap names.
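
For illustration, one simple way to adjust for that (not necessarily the exact scoring we use) is to normalize the catalyst-day return by the stock’s trailing realized volatility:

    # Illustrative volatility adjustment; the benchmark's actual scoring
    # may differ in its details.
    import numpy as np

    def adjusted_move(close_day_of, close_day_before, trailing_daily_returns):
        """Catalyst-day return divided by trailing realized volatility."""
        raw_return = close_day_of / close_day_before - 1.0
        trailing_vol = np.std(trailing_daily_returns)  # e.g. over ~60 trading days
        return raw_return / trailing_vol if trailing_vol > 0 else 0.0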

Each row of data includes:

- Real historical biotech catalysts (Phase 1–3 readouts, FDA actions, etc.) and pricing data from the day before and the day of the catalyst

- Linked clinical trial data and PubMed PDFs
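
To make that concrete, a single row looks roughly like this (field names are illustrative, not the exact schema):

    # Illustrative row; field names are hypothetical, not the exact schema.
    example_row = {
        "ticker": "XYZ",                        # de-identified in the prompt
        "catalyst_date": "2023-06-01",
        "catalyst_type": "Phase 2 readout",     # or FDA action, safety update, ...
        "press_release": "De-identified press release text ...",
        "clinical_trial_records": ["linked trial design, endpoints, prior data"],
        "pubmed_pdfs": ["publications available before the catalyst"],
        "close_day_before": 12.40,
        "close_day_of": 18.75,                  # used to compute the realized move
    }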

Note: there are some fairly obvious problems with our approach. First, many clinical trial press releases are likely already included in the LLMs’ pretraining data. We try to reduce this by de-identifying each press release and providing the LLM only with data available up to the date of the catalyst, but there is obviously some uncertainty about whether this is sufficient.

We’ve been using this benchmark to test prompting strategies and model families. Results so far are mixed but interesting: the most reliable approach we’ve found is to use LLMs to quantify qualitative features and then fit a linear regression on those features, rather than predicting the price move directly.
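
Concretely, that pipeline looks roughly like the sketch below (the feature names and the llm_score helper are illustrative; you would swap in your own model call):

    # Sketch of "LLM scores qualitative features, then a linear regression
    # predicts the move." Feature names and llm_score are illustrative.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    FEATURES = [
        "effect_size_vs_expectations",    # e.g. -1 (much worse) .. +1 (much better)
        "breadth_of_patient_population",  # narrow subgroup vs. full population
        "derisking_of_later_phases",
        "change_in_approval_odds",
    ]

    def llm_score(press_release, feature):
        """Ask an LLM to rate one qualitative feature on a fixed numeric scale."""
        raise NotImplementedError("call your model of choice here")

    def featurize(press_release):
        return [llm_score(press_release, f) for f in FEATURES]

    # rows: benchmark rows like the example above, with a precomputed adjusted_move
    def fit_baseline(rows):
        X = np.array([featurize(r["press_release"]) for r in rows])
        y = np.array([r["adjusted_move"] for r in rows])
        return LinearRegression().fit(X, y)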

Just wanted to share this with HN. I built a playground link for those of you who would like to try it out in a sandbox. Would love to hear your ideas, and I hope people have fun playing around with it!

Interesting. Biotech stocks have been notoriously hard to predict because their business model revolves around science, and it’s hard to know when the science is right. Depending on the situation, I think sentiment could potentially be a misleading/confounding variable here…
Sentiment is crucial: if you know sentiment is incorrectly oriented, you can capitalize on it. If you know it's correct, you can identify mispricing and strategize accordingly.
  • worik · 1 hour ago
Why do you think that LLMs would do any better than monkeys throwing darts?

I am raining on your parade, but this is another in a long succession of ways to lose money.

The publicly available information in markets is priced very efficiently. We computer types do not like that, and we like to think that our pattern-analysis machines can do better than a room full of traders. They cannot.

The money to be made in markets is from private information. That is a crime (insider trading), it is widespread, and any system like this is fighting it and will lose.

  • dchu17 · 23 minutes ago
Our initial goal with this project actually wasn't to get an edge by evaluating information better; rather, we wanted to see if an LLM can perform similarly to a human analyst at a lower latency. The time it takes the market to react to catalysts is surprisingly long in biotech (at least in some cases) compared to other domains, so there may be some edge there.

Appreciate the comment though! I generally agree with your sentiment!

efficiency is not a given. also this is an eval set - they acknowledge the challenge themselves.

imho this is v cool