We gave 5 LLMs $100K to trade stocks for 8 months

bcrosby95
·
1 day ago
·
[ - ]

> Grok ended up performing the best while DeepSeek came close to second. Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

I'm not an investor or researcher, but this triggers my spidey sense... it seems to imply they aren't measuring what they think they are.

IgorPartola
·
1 day ago
·
[ - ]

Yeah I mean if you generally believe the tech sector is going to do well because it has been doing well you will beat the overall market. The problem is that you don’t know if and when there might be a correction. But since there is this one segment of the overall market that has this steady upwards trend and it hasn’t had a large crash, then yeah any pattern seeking system will identify “hey this line keeps going up!” Would it have the nuance to know when a crash is coming if none of the data you test it on has a crash?

It would almost be more interesting to specifically train the model on half the available market data, then test it on another half. But here it’s like they added a big free loot box to the game and then said “oh wow the player found really good gear that is better than the rest!”

Edit: from what I casually remember, a hedge fund can beat the market for 2-4 years but at 10 years and up their chances of beating the market go to very close to zero. Since LLMs have not been around for that long it is going to be difficult to test this without somehow segmenting the data.

tshaddox
·
1 day ago
·
[ - ]

> It would almost be more interesting to specifically train the model on half the available market data, then test it on another half.

Yes, ideally you’d have a model trained only on data up to some date, say January 1, 2010, and then start running the agents in a simulation where you give them each day’s new data (news, stock prices, etc.) one day at a time.
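
A minimal sketch of that day-at-a-time replay, assuming a hypothetical `Strategy` callable that stands in for the agent and toy closing prices (none of these names come from the article). The only point it makes is structural: the strategy is handed `days[: t + 1]` and nothing later.

```python
from typing import Callable, Dict, List

# Hypothetical types: one "day" maps ticker -> closing price.
Day = Dict[str, float]
# A strategy sees the history so far plus available cash and
# returns the share counts it wants to hold.
Strategy = Callable[[List[Day], float], Dict[str, float]]


def walk_forward(days: List[Day], strategy: Strategy, cash: float) -> float:
    """Replay `days` in order; the strategy never sees a future day."""
    holdings: Dict[str, float] = {}
    for t in range(len(days)):
        history = days[: t + 1]  # data up to and including today only
        prices = days[t]
        # Liquidate at today's close, then buy whatever the strategy asks for.
        cash += sum(qty * prices[tk] for tk, qty in holdings.items())
        holdings = strategy(history, cash)
        cash -= sum(qty * prices[tk] for tk, qty in holdings.items())
    final = days[-1]
    return cash + sum(qty * final[tk] for tk, qty in holdings.items())


def all_in_first_ticker(history: List[Day], cash: float) -> Dict[str, float]:
    """Toy strategy: put all cash into the alphabetically first ticker."""
    today = history[-1]
    ticker = sorted(today)[0]
    return {ticker: cash / today[ticker]}
```

With an LLM as the strategy, this loop does not stop the model from already knowing the future through its training data, which is why the training cutoff date matters.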

hxtk
·
1 day ago
·
[ - ]

I suspect trading firms have already done this to the maximum extent that it's profitable to do so. I think if you were to integrate LLMs into a trading algorithm, you would need to incorporate more than just signals from the market itself. For example, I hazard a guess you could outperform a model that operates purely on market data with a model that also includes a vector embedding of a selection of key social and news media accounts or other information sources that have historically been difficult to encode until LLMs.

solotronics
·
21 hours ago
·
[ - ]

The part people are missing here is that if the trading firms are all doing something, that in itself influences the market.

If they are all giving the LLMs money to invest and the AIs generally buy the same group of stocks, those stocks will go up. As more people attempt the strategy it infuses fresh capital and, more importantly, signals to the trading firms that there are inflows to these stocks. I think it's probably a reflexive loop at this point.

brendoelfrendo
·
16 hours ago
·
[ - ]

They could have the AI perform paper trading: give it a simulated account but real data. This would make sense to me if it was just a research project. That said, I imagine the more high-tech trading firms started running this research a long time ago and wouldn't be surprised if there were already LLM-based trading bots that could be influencing the market.

giantg2
·
22 hours ago
·
[ - ]

"includes a vector embedding of a selection of key social and news media accounts or other information sources that have historically been difficult to encode until LLMs."

Not really. Sentiment analysis in social networks has been around for years. It's probably cheaper to buy that analysis and feed it to LLMs than to have LLMs do it.

IgorPartola
·
1 day ago
·
[ - ]

I mean ultimately this is an exercise in frustration because if you do that you will have trained your model on market patterns that might not be in place anymore. For example after the 2008 recession regulations changed. So do market dynamics actually work the same in 2025 as in 2005? I honestly don’t know but intuitively I would say that it is possible that they do not.

I think a potentially better way would be to segment the market up to today but take half or 10% of all the stocks and make only those available to the LLM. Then run the test on the rest. This accounts for rules and external forces changing how markets operate over time. And you can do this over and over picking a different 10% market slice for training data each time.

But then your problem is that if you exclude let’s say Intel from your training data and AMD from your testing data then their ups and downs don’t really make sense since they are direct competitors. If you separate by market segment then training the model on software tech companies might not actually tell you accurately how it would do for commodities or currency trading. Or maybe I am wrong and trading is trading no matter what you are trading.

godelski
·
1 day ago
·
[ - ]

  > I think a potentially better way would be to segment the market up to today but take half or 10% of all the stocks and make only those available to the LLM.

Autocorrelation is going to bite you in the ass.

Those stocks are going to be coupled. Let's take an easy example. Suppose you include Nvidia in the training data and hold out AMD for test. Is there information leakage? Yes. The problem is that each company isn't independent. You have information leakage in both the setting where companies grow together as well as zero sum games (since x + y = 0, if you know x then you know y). But in this example AMD trends with Nvidia. Maybe not as much, but they go in the same direction. They're coupled.

Not to mention that in the specific setting the LLMs were given news and other information.
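
A rough sketch of that coupling, under a toy assumption: each stock's daily return is a shared sector move plus its own independent noise. The volatilities and function names are made up for illustration; this is not real Nvidia or AMD data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_pair(n_days, sector_vol, own_vol, seed=0):
    """Daily returns for a 'train' stock and a 'held-out' stock.

    Assumption: both share one sector factor and add independent noise.
    """
    rng = random.Random(seed)
    train, held_out = [], []
    for _ in range(n_days):
        sector = rng.gauss(0.0, sector_vol)  # common move, e.g. chip news
        train.append(sector + rng.gauss(0.0, own_vol))
        held_out.append(sector + rng.gauss(0.0, own_vol))
    return train, held_out
```

Under this model the correlation between the two series approaches `sector_vol**2 / (sector_vol**2 + own_vol**2)`, so a model fit on the "train" stock has effectively seen most of the held-out stock's moves whenever the sector factor dominates.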

chris_st
·
1 day ago
·
[ - ]

> you will have trained your model on market patterns that might not be in place anymore

My working definition of technical analysis [0]

[0]: https://en.wikipedia.org/wiki/Technical_analysis

IgorPartola
·
1 day ago
·
[ - ]

It is always fun (in a broad sense of that word) when I make a comment on an industry I know nothing about and somehow stumble onto a thing that not only has a name but also research. I am sure there is a German word for that feeling of discovering something that countless others have already discovered.

biztos
·
1 day ago
·
[ - ]

> there is a German word

Zeitgeistüberspannungsfreude

chris_st
·
1 day ago
·
[ - ]

XKCD calls it the "Lucky 10,000" [0]

[0]: https://xkcd.com/1053/

mewpmewp2
·
1 day ago
·
[ - ]

That is referring to something else entirely. The XKCD is about some common fact that the person didn't figure out by themselves. OP is referring to something they came up with themselves in a field they have no experience with, then realizing it is actually a thing and, in a way, feeling validated and clever.

gcr
·
54 minutes ago
·
[ - ]

XKCD calls it "Engineering Syllogism" [0]

[0]: https://xkcd.com/1570/

taneq
·
1 day ago
·
[ - ]

Any time I invent a cool thing, I go and try and find it online. Usually it's already an established product, which totally validates my feeling that the thing I invented is cool and would be a good product. :D

Occasionally it's (as far as I can tell) a legitimately new 'wow that's obvious' style thing and I consider prototyping it. :)

chasing0entropy
·
1 day ago
·
[ - ]

What have you prototyped recently? Anything you have released to market? I'm in the same general area but am teetering on actually launching products; wouldn't mind connecting with a like-minded engineer.

stouset
·
1 day ago
·
[ - ]

I am frankly astonished at the number of otherwise-intelligent people who actually seem to believe in this stuff.

One of the worst possible things to do in a competitive market is to trade by some publicly-available formulaic strategy. It’s like announcing your rock-paper-scissors move to your opponent in advance.

intalentive
·
21 hours ago
·
[ - ]

Technical analysis is a basket of heuristics. Support / resistance / breakout (especially around whole numbers) seems to reflect persistent behavior rooted in human psychology. Look at the heavy buying at the $30 mark here, putting a floor under silver: https://finviz.com/futures_charts.ashx?p=d&t=SI This is a common pattern it can be useful to know.

tim333
·
1 day ago
·
[ - ]

A couple of subtleties in that. Rather than rock paper scissors with three options, there are hundreds of technical strategies out there so you may still be doing something unusual. Secondly, the mass of the public are kind of following a technical strategy of just buying index funds because the index has gone up in the past. Which is ignoring the fundamental issue of whether stocks are decent value for money at the moment.

noduerme
·
1 day ago
·
[ - ]

Just to name a different but related approach, as a hobby project I built a (non LLM) model that trained mainly on data from stocks that didn't move much over the past decade, seeking ways to beat the performance of those particular stocks. I put it into practice for a couple of years, and came out roughly even by constantly rebalancing a basket of stocks that, as a whole, dropped by about 20%. I considered that to be a success, although it would've been nicer to make money.

0manrho
·
1 day ago
·
[ - ]

> you will have trained your model on market patterns that might not be in place anymore

How is that relevant to what was proposed? If it's trading and training on 2010 data, what relevance do today's market dynamics and regulations have?

Which further begs the question, what's the point of this exercise?

Is it to develop a model that can compete effectively in today's market? If so then yeah, the 2010 trading/training idea probably isn't the best idea for the reasons you've outlined.

Or is it to determine the capacity of an AI to learn and compete effectively within any given arbitrary market/era? If so, then today's dynamics/constraints are irrelevant unless you're explicitly trying to train/trade on today's markets (which isn't what the person you're replying to proposed, but is obviously a valid desire and test case to evaluate in its own right).

Or is it evaluating its ability to identify what those constraints/limitations are and then build strategies based on it? In which case it doesn't matter when you're training/trading so much as your ability to feed it accurate and complete data for that time period be it today, or 15 years ago or whenever, which is no small ask.

stonemetal12
·
21 hours ago
·
[ - ]

Would that work for LLMs though? They were hypothetically trained on newspapers from the second half of the data, so they have knowledge of "future" events.

ainiriand
·
1 day ago
·
[ - ]

As an old friend investor I know always says: 'It is really easy to make money in the market when everyone is doing it, just try to not lose it when they lose it'.

Eddy_Viscosity2
·
1 day ago
·
[ - ]

> a hedge fund can beat the market for 2-4 years but at 10 years and up their chances of beating the market go to very close

In that case the winning strategy would be to switch hedge funds every 3 years.

perlgeek
·
1 day ago
·
[ - ]

The problem is that you don't know in advance which will be doing well when.

skeeter2020
·
1 day ago
·
[ - ]

Except you don't know which fund is going to "go on a hot streak" or when the magic will end. The original statement only holds when looking at historical data; it's not predictive.

calmbonsai
·
1 day ago
·
[ - ]

For a nice historic perspective on hedge funds and the industry as a whole, read Mallaby's "More Money Than God".

arisAlexis
·
1 day ago
·
[ - ]

You believe in the tech sector because technology always goes well and it's what humans strive to achieve, not because it has done well recently. It has always.

knollimar
·
23 hours ago
·
[ - ]

When does the tech sector become the computer sector?

Agriculture would have been considered tech 200 years ago.

arisAlexis
·
22 hours ago
·
[ - ]

full throttle until AGI is achieved, then we will see

d-lisp
·
19 hours ago
·
[ - ]

Maybe one day we will discover that a method exists for computing/displaying/exchanging arbitrary things through none other means than our own flesh and brains.

olliepro
·
1 day ago
·
[ - ]

A more sound approach would have been to do a monte carlo simulation where you have 100 portfolios of each model and look at average performance.
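
A minimal sketch of what that could look like, with a loud assumption: each "run" here is a random-return stand-in, not an actual LLM session. In a real experiment every run would be a separate agent session with its own sampling randomness, and `daily_mean` / `daily_vol` are hypothetical parameters.

```python
import random
import statistics


def simulate_run(rng, daily_mean, daily_vol, n_days, start=100_000.0):
    """One portfolio path: compound a random daily return for n_days."""
    value = start
    for _ in range(n_days):
        value *= 1.0 + rng.gauss(daily_mean, daily_vol)
    return value


def monte_carlo(n_runs, daily_mean, daily_vol, n_days, seed=0):
    """Final values of n_runs independent portfolios: (mean, stdev)."""
    rng = random.Random(seed)
    finals = [
        simulate_run(rng, daily_mean, daily_vol, n_days)
        for _ in range(n_runs)
    ]
    return statistics.mean(finals), statistics.stdev(finals)
```

Comparing two models then means comparing their means relative to the spread across runs, rather than comparing one lucky or unlucky path from each.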

cyberrock
·
1 day ago
·
[ - ]

While not strictly stocks, it would be interesting to see them trade on game economies like EVE, WoW, RuneScape, Counter Strike, PoE, etc.

observationist
·
1 day ago
·
[ - ]

Grok would likely have an advantage there, as well - it's got better coupling to X/Twitter, a better web search index, fewer safety guardrails in pretraining and system prompt modification that distort reality. It's easy to envision random market realities that would trigger ChatGPT or Claude into adjusting the output to be more politically correct. DeepSeek would be subject to the most pretraining distortion, but have the least distortion in practice if a random neutral host were selected.

If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.

OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.

I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Having 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system prompt revision with a genetic algorithm style process, so that over time you get 20 distinct individual modes and roles per each model.

It'll be neat to see the next couple years play out - OpenAI had the clear lead up through q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.

UncleMeat
·
1 day ago
·
[ - ]

I know that Musk deserving a lifetime achievement award at the Adult Video Network awards over Riley Reid is definitely an indication of minimal "system prompt modification that distort[s] reality."

red-iron-pine
·
1 day ago
·
[ - ]

for the folks unaware, he was nominated for sucking more dicks in a single shoot than anyone, while still producing great content. he also hit several holes-in-one golfing later that week.

scubbo
·
1 day ago
·
[ - ]

...I'm not familiar with the reference.

fragmede
·
1 day ago
·
[ - ]

https://www.theguardian.com/technology/2025/nov/21/elon-musk...

KPGv2
·
1 day ago
·
[ - ]

OTOH it has the richest man in the world actively meddling in its results when they don't support his politics.

buu700
·
1 day ago
·
[ - ]

Anyone who hasn't used Grok might be surprised to learn that it isn't shy about disagreeing with Elon on plenty of topics, political or otherwise. Any insinuation to the contrary seems to be pure marketing spin on his part.

Grok is often absurdly competent compared to other SOTA models, definitely not a tool I'd write off over its supposed political leanings. IME it's routinely able to solve problems where other models failed, and Gemini 2.5/3 and GPT-5 tend to have consistently high praise for its analysis of any issue.

That's as far as the base model/chatbot is concerned, at least. I'm less familiar with the X bot's work.

skeeter2020
·
1 day ago
·
[ - ]

it's so wildly inconsistent you can't build on top of it with reliability. And getting high praise from any model is ridiculously easy: ask a question, make a statement, correct the model's dumb error, etc.

buu700
·
16 hours ago
·
[ - ]

It's easy for us as humans to correct dumb mistakes made by AI. It's less easy for AI to correct mistakes made by AI.

What's remarkable on Grok's part is when it spends five minutes churning through a few thousand lines of code (not the whole codebase, just the relevant files) and correctly arrives at the correct root cause of a complex bug in one shot.

Grok as a model may or may not be uniquely amazing per se, but the service's eagerness to throw compute at problems that genuinely demand it is a superpower that at least makes it uniquely amazing in practice. By comparison, even Gemini 3 often returns lazy/shallow/wrong responses (and I say that as a regular user of Gemini).

godelski
·
1 day ago
·
[ - ]

Two things can be true at the same time. Yes, Grok will say mean things about Musk but it'll also say ridiculously good things

  > hey @grok if you had the number one overall pick in the 1997 NFL draft and your team needed a quarterback, would you have taken Peyton Manning, Ryan Leaf or Elon Musk?

  >> Elon Musk, without hesitation. Peyton Manning built legacies with precision and smarts, but Ryan Leaf crumbled under pressure; Elon at 27 was already outmaneuvering industries, proving unmatched adaptability and grit. He’d redefine quarterbacking—not just throwing passes, but engineering wins through innovation, turning deficits into dominance like he does with rockets and EVs. True MVPs build empires, not just score touchdowns.
  - https://x.com/silvermanjacob/status/1991565290967298522

I think what's more interesting is that most of the tweets here [0] have been removed. I'm not going to call conspiracy because I've seen some of them. Probably removed because going viral isn't always a good thing...

[0] https://gizmodo.com/11-things-grok-says-elon-musk-does-bette...

buu700
·
1 day ago
·
[ - ]

They can be, but in this case they don't seem to be. Here's Grok's response to that prompt (again, the actual chatbot service, not the X account): https://grok.com/share/c2hhcmQtMw_2b46259a-5291-458e-9b85-0c....

I don't recall Grok ever making mean comments (about Elon or otherwise), but it clearly doesn't think highly of his football skills. The chain of thought shows that it interpreted the question as a joke.

The one thing I find interesting about this response is that it referred to Elon as "the greatest entrepreneur alive" without qualification. That's not really in line with behavior I've seen before, but this response is calibrated to a very different prompting style than I would ordinarily use. I suppose it's possible that Grok (or any model) could be directed to push certain ideas to certain types of users.

godelski
·
1 day ago
·
[ - ]

Sure, but they also update the models, especially when things like this go viral. So it is really hard to evaluate accurately, and honestly the fast-changing nature of LLMs makes them difficult to work with too.

tengbretson
·
21 hours ago
·
[ - ]

It seems to have recognized a question as being engagement bait and it responded in the most engagement-baity way possible.

jessetemp
·
1 day ago
·
[ - ]

> fewer safety guardrails in pretraining and system prompt modification that distort reality.

Really? Isn't Grok's whole schtick that it's Elon's personal altipedia?

nickthegreek
·
1 day ago
·
[ - ]

My understanding is that the Grok API is way different than the Grok X bot. Which of course doesn't do Grok as a business any favors. Personally, I do not engage with either.

bdangubic
·
1 day ago
·
[ - ]

you gotta be quite a crazy person to use grok :)

AlexCoventry
·
1 day ago
·
[ - ]

Grok is good for up-to-the-minute information, and for requests that other chat services refuse to entertain, like requests for instructions on how to physically disable the cellular modem in your car.

doe88
·
1 day ago
·
[ - ]

Maybe being crazy is what you need to bet on the stock market - not financial advice, and also not written by Grok - I swear :))

KPGv2
·
1 day ago
·
[ - ]

I sat in my kid's extracurricular a couple months ago and had an FBI agent tell me that Grok was the most trustworthy based on "studies," so that's what she had for her office.

skeeter2020
·
1 day ago
·
[ - ]

Did she get that info from Grok?

bdangubic
·
20 hours ago
·
[ - ]

Grok has Elon as a better athlete than LeBron so I would agree with the FBI agent. Can’t get that kind of insight anywhere else :)

airstrike
·
1 day ago
·
[ - ]

@grok is this true?

bdangubic
·
1 day ago
·
[ - ]

… checking with my creator …

observationist
·
19 hours ago
·
[ - ]

It's excellent, and it doesn't get into the weird ideological ruts and refusals other bots do.

Grok's search and chat is better than the other platforms, but not $300/month better; ChatGPT seems to be the best no-rate-limits pro-class bot. If Grok 5 is a similar leap in capabilities as 3 to 4, then I might pay the extra $100 a month. The "right wing Elon sycophant" thing is a meme based on hiccups with the public-facing Twitter bot. The app, API, and web bot are just generally very good, and do a much better job at neutrality and counterfactuals and not refusing over weird moralistic nonsense.

ekianjo
·
21 hours ago
·
[ - ]

indeed, and also a "model" does not mean anything per se, you have hundreds of different prompts, you can layer agents on top, you can use temperature that will lead to different outcomes. The number of dimensions to explore is huge.

culi
·
1 day ago
·
[ - ]

I'd like to see this study replicated during a bear market

petercooper
·
22 hours ago
·
[ - ]

Agreed. While I don’t see it outperforming long held funds, it’d be interesting to see if they could pick up on negative signals in the news feed, and also any potential advantage of not being emotional about its decisions.

gizajob
·
1 day ago
·
[ - ]

Yeah the timeframe is crucial here. The experiment began as Trump launched his tariff tweets which caused a huge downward correction and then a large uptrend. Buying almost anything tech at the start of this would have made money.

monksy
·
1 day ago
·
[ - ]

They're not measuring performance in the context of when things happen and in the time that they are. I think it's only showing recent performance and popularity. To actually evaluate how these do you need to be able to correct the model and retrain it per different time periods and then measure how it would do. Then you'll get better information from the backtesting.

etchalon
·
1 day ago
·
[ - ]

I don't feel like they measured anything. They just confirmed that tech stocks in the US did pretty well.

JoeAltmaier
·
1 day ago
·
[ - ]

They measured the investment facility of all those LLMs. That's pretty much what the title says. And they had dramatically different outcomes. So that tells me something.

skeeter2020
·
23 hours ago
·
[ - ]

They "proved" that US tech stocks did better than portfolios with fewer US tech stocks over a recent, very short time range. 1. You didn't know that? 2. What are you going to do with this "new information"?

JoeAltmaier
·
19 hours ago
·
[ - ]

As a stock-trading exercise? Nothing, as you note. As an AI investigation it says plenty. Which is the point I was making (and got missed by all those stock-trading self-appointed experts who fastened onto that)

DennisP
·
1 day ago
·
[ - ]

I mean, what it kinda tells me is that people talk about tech stocks the most, so that's what was most prevalent in the training data, so that's what most of the LLMs said to invest in. That's the kind of strategy that works until it really doesn't.

ghaff
·
1 day ago
·
[ - ]

Cue 2020 or so. I do have investments in tech stocks but I have a lot more conservative investments too.

Libidinalecon
·
1 day ago
·
[ - ]

It shows nothing. This is a bullshit stunt that should be obvious to anyone who has placed a few trades.

JoeAltmaier
·
19 hours ago
·
[ - ]

Unless you think of it as an AI exercise, not a stock trading exercise. Which point evaded most people.

mvkel
·
23 hours ago
·
[ - ]

S&P 500 is also tech heavy and notoriously difficult to beat over the long run

tclancy
·
1 day ago
·
[ - ]

I mean, run the experiment during a different trend in the market and the results would probably be wildly different. This feels like chartists [1] but lazier.

[1] https://www.investopedia.com/terms/c/chartist.asp

refactor_master
·
1 day ago
·
[ - ]

If you've ever read a blog on trading when LSTMs came out, you'd have seen all sorts of weird stuff with predicting the price at t+1 on a very bad train/test split, where the author would usually say "it predicts t+1 with 99% accuracy compared to t", and the graph would be an exact copy with a t+1 offset.

So eye-balling the graph looks great, almost perfect even, until you realize that in real-time the model would've predicted yesterday's high on today's market crash and you'd have lost everything.
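
A small sketch of that trap, assuming a synthetic random-walk price series (the starting price and volatility are arbitrary). The "model" is just the persistence forecast, tomorrow's price equals today's, which is what those t+1 graphs were effectively showing.

```python
import random


def random_walk(n, start=280.0, vol=1.5, seed=0):
    """Synthetic prices with no predictable direction at all."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n - 1):
        prices.append(prices[-1] + rng.gauss(0.0, vol))
    return prices


def persistence_forecast(prices):
    """Predict each next price as the current one: a one-step lagged copy."""
    return prices[:-1]


def mape(preds, actual):
    """Mean absolute percentage error between forecasts and outcomes."""
    return sum(abs(p - a) / a for p, a in zip(preds, actual)) / len(actual)
```

On a series like this, `mape(persistence_forecast(prices), prices[1:])` comes out well under a couple of percent even though the forecast carries no information about whether the next move is up or down. Scoring returns (price changes) instead of price levels exposes this immediately, because the persistence forecast predicts a change of exactly zero every day.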

blitzar
·
21 hours ago
·
[ - ]

If you feed in prices, e.g. 280.1, 281.5, 281.9 ..., you are going to get some pretty good-looking results when it comes to predicting the next day's price (t+1) with a margin of +/- a percent or so.

throwawayffffas
·
1 day ago
·
[ - ]

To be fair to chartists, they try to identify if they are in a bear market or one is coming and get out early.

seanmcdirmid
·
1 day ago
·
[ - ]

We had this discussion in previous posts about congressional leaders who had the risk appetite to go tech heavy and therefore outperformed normal congress critters.

Going heavy on tech can be rewarding, but you are taking on more risk of losing big in a tech crash. We all know that, and if you don't have the money to play riskier moves, it's not really a move you can take.

Long term it is less of a win if a tech bubble builds and pops before you can exit (and you can't wait it out to re-inflate).

hobobaggins
·
1 day ago
·
[ - ]

They didn't just outperform "normal" congress critters.. they also outperformed nearly every hedge fund on the planet. But they (meaning, of course, just one person and their spouse) are obviously geniuses.

stouset
·
1 day ago
·
[ - ]

Hedge funds’ goals are often not to maximize profit, but to provide returns uncorrelated with the rest of some benchmark market. This is useful for the wealthy as it means you can better survive market crashes.

seanmcdirmid
·
1 day ago
·
[ - ]

Hedge funds suck though. They don’t invest in FAANG, they do risky stuff that doesn’t pay off, you are still comparing incomparable things.

I’m obviously a genius because 90% of my stock is in tech, most of us on HN are geniuses in your opinion?

cap11235
·
1 day ago
·
[ - ]

What do you think hedge funds do?

seanmcdirmid
·
1 day ago
·
[ - ]

They use crazy investment strategies that allow them to capture high returns in adverse general market conditions, but they tend to underperform the general market in normal and booming conditions. “Hedge” is actually in their name for a reason. Rich people use hedge funds for…hedging.

mvkel
·
23 hours ago
·
[ - ]

Downside protection. Hedging. Giving you gains at the lowest beta possible.

Guillaume86
·
22 hours ago
·
[ - ]

They also outperformed themselves before being in a leader position...

directevolve
·
20 hours ago
·
[ - ]

This is a wildly disingenuous interpretation of that study.

“Using transaction-level data on US congressional stock trades, we find that lawmakers who later ascend to leadership positions perform similarly to matched peers beforehand but outperform them by 47 percentage points annually after ascension. Leaders’ superior performance arises through two mechanisms. The political influence channel is reflected in higher returns when their party controls the chamber, sales of stocks preceding regulatory actions, and purchase of stocks whose firms receiving more government contracts and favorable party support on bills. The corporate access channel is reflected in stock trades that predict subsequent corporate news and greater returns on donor-owned or home-state firms.”

https://www.nber.org/papers/w34524

micromacrofoot
·
23 hours ago
·
[ - ]

probably hitching onto sycophancy for the parent company and getting lucky as a result... that Grok September rally aligns somewhat with TSLA for instance

KPGv2
·
1 day ago
·
[ - ]

Also studying for eight months is not useful. Loads of traders do this well for eight months and then do shit for the next five years. And tellingly, they didn't beat the S&P 500. They invested in something else that beat the S&P 500. And the one that didn't invest in that something did worse than the S&P 500.

What this tells me is they were lucky to have picked something that would beat the market for now.

naet
·
1 day ago
·
[ - ]

I used to work for a brokerage API geared at algorithmic traders and in my anecdotal experience many strategies seem to work well when back-tested on paper but for various reasons can end up flopping when actually executed in the real market. Even testing a strategy in real time paper trading can end up differently than testing on the actual market where other parties are also viewing your trades and making their own responses. The post did list some potential disadvantages of backtesting, so they clearly aren't totally in the dark on it.

Deepseek did not sell anything, but did well with holding a lot of tech stocks. I think that can be a bit of a risky strategy with everything in one sector, but it has been a successful one recently so not surprising that it performed well. Seems like they only get to "trade" once per day, near the market close, so it's not really a real time ingesting of data and making decisions based on that.

What would really be interesting is if one of the LLMs switched their strategy to another sector at an appropriate time. Very hard to do but very impressive if done correctly. I didn't see that anywhere but I also didn't look deeply at every single trade.

chroma205
·
1 day ago
·
[ - ]

>but for various reasons can end up flopping when actually executed in the real market.

1. Your order can legally be “front run” by the lead or designated market maker who receives priority trade matching, bypassing the normal FIFO queue. Not all exchanges do this.

2. Market impact. Other participants will cancel their order, or increase their order size, based on your new order. And yes, the algos do care about your little 1 lot order.

Also if you improve the price (“fill the gap”), your single 1 qty order can cause 100 other people to follow you. This does not happen in paper trading.

Source: HFT quant

derrida
·
1 day ago
·
[ - ]

Dear HFT Quant,

> And yes, the algos do care about your little 1 lot order.

I'm just your usual "corrupted nerd" geek with some mathematics and computer security background interests - 2 questions if I may. 1. What's the most interesting paper you have read recently, or an unrelated thing you are interested in at the moment? 2. "And yes, the algos do care about your little 1 lot order." How would one see this effect you mentioned - it seems wildly anomalous, so how would one go about finding this effect assuming maximum mental venturesomeness, a tiny $100 and too much time?

tim333
·
22 hours ago
·
[ - ]

Retail speculator here. Re 2, it's often quite easy to demo on thinly traded markets - I'm more familiar with crypto. Say the spread is 81.00 buy, 81.03 sell. Put in a limit buy at 81.00 and watch someone/something immediately outbid you at 81.01. In the short term that kind of thing is done by algorithms but there are humans behind it and doing it too.

There's quite a lot of other game playing going on also.
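
A toy sketch of that outbidding reaction, using the 81.00 / 81.03 numbers above in integer cents. The function name and the one-cent tick are assumptions for illustration; this is not how any particular venue or algorithm is implemented.

```python
def penny_jump(best_bid_cents, best_ask_cents, tick_cents=1):
    """Price a penny-jumping bot would bid after seeing the best bid.

    Returns None when bidding one tick higher would reach the ask,
    i.e. there is no room left inside the spread.
    """
    candidate = best_bid_cents + tick_cents
    return candidate if candidate < best_ask_cents else None
```

With a bid of 8100 and an ask of 8103 this returns 8101, matching the example; once the bid sits one tick below the ask it returns `None`. Paper trading never triggers this kind of reaction because the simulated order is not visible to anyone else.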

ainiriand
·
1 day ago
·
[ - ]

Sometimes the spread is really tight.

gosub100
·
20 hours ago
·
[ - ]

Even a 1 lot order could be the deciding factor for some algorithm that's calculating averages or other statistics. Especially for options books.

this_user
·
1 day ago
·
[ - ]

If you actually were in the industry, you would know that most retail traders don't fail, because they lose a tick here or there on execution, they fail, because their strategies have no edge in the first place.

chroma205
·
23 hours ago
·
[ - ]

> If you actually were in the industry, you would know that most retail traders don't fail, because they lose a tick here or there on execution

Where did I say “retail trader”?

Because “institutional” low-latency market makers trade 1 lot all the time.

this_user
·
22 hours ago
·
[ - ]

The context from parent was obviously that. Instis don't trade on Alpaca.

> Because “institutional” low-latency market makers trade 1 lot all the time.

That sentence alone tells me that you're a LARPer.

chroma205
·
20 hours ago
·
[ - ]

> That sentence alone tells me that you're a LARPer

cope.

Equity options are sparse and have 1 order of 1 lot/qty per price. But usually empty. Too many prices and expiration dates.

US treasury bond cash futures (BrokerTec) are almost always 1 lot orders. Multiple orders per level though.

I could go on, but I’m busy as our team of 4’s algos are printing US$500k/hour today.

dubcanada
·
1 day ago
·
[ - ]

There is a big difference between backtesting scalping and backtesting "buy 100 NVIDIA at $103 and sell at $110".

Maxatar
·
22 hours ago
·
[ - ]

>Your order can legally be “front run” by the lead or designated market maker who receives priority trade matching, bypassing the normal FIFO queue. Not all exchanges do this.

Unless you're thinking of some obscure exchange in a tiny market, this is just untrue in the U.S., Europe, Canada, and APAC. There are no exchanges where market makers get any kind of priority to bypass the FIFO queue.

chroma205
·
19 hours ago
·
[ - ]

> There are no exchanges where market makers get any kind of priority to bypass the FIFO queue.

Nope, several large, active, and liquid markets in the US.

Legally it’s not named “bypass the FIFO queue”. That would be dumb.

In practice, it goes by politically correct names such as “designated market maker fill” or “institutional order prioritization” or “leveling round”.

Maxatar
·
19 hours ago
·
[ - ]

I can tell you as someone who is a designated market maker on several ETFs in the U.S., none of this exists as a means of giving market makers priority fills. You're taking existing terms and misusing them. For example institutional order prioritization is used as a wash trade prevention mechanism, not as a way for designated market makers to get some kind of fill preference. Leveling rounds also do not involve exchanges, this is an internal tool used by a broker's OMS to rebalance residuals so accounts end up with the intended allocation, or cleaning up odd-lot/mixed-lot leftovers.

I am getting the feeling you either are not actually a quant, or you were a quant and just misheard and confused a lot of things together, but one thing is for sure... your claim that market makers get some kind of priority fills is factually incorrect.

acrooks
·
21 hours ago
·
[ - ]

A really important part of this is the emotional component. When real money is involved, then you will sometimes face actual losses. It’s hard for a human to completely trust the machine in real world trading

andoando
·
19 hours ago
·
[ - ]

Backtesting is useless because if you try out a million strategies, by chance you will find one that works on past data.
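
A minimal sketch of that selection effect, assuming the market's daily direction is a fair coin flip and every "strategy" is just a random sequence of up/down calls. All names and parameters here are illustrative, not from the article:

```python
import random

def hit_rate(signals, moves):
    """Fraction of days on which the strategy's up/down call matched the market."""
    return sum(s == m for s, m in zip(signals, moves)) / len(moves)

def best_random_strategy(n_strategies=1000, n_in=100, n_out=1000, seed=0):
    rng = random.Random(seed)
    # Market direction is a fair coin flip, so no strategy can have a real edge.
    past = [rng.choice((1, -1)) for _ in range(n_in)]
    future = [rng.choice((1, -1)) for _ in range(n_out)]

    best_in, best_out = 0.0, 0.0
    for _ in range(n_strategies):
        # Each "strategy" is a fixed random sequence of up/down calls.
        calls = [rng.choice((1, -1)) for _ in range(n_in + n_out)]
        in_sample = hit_rate(calls[:n_in], past)
        if in_sample > best_in:
            # Keep the backtest winner and record how it does on unseen days.
            best_in = in_sample
            best_out = hit_rate(calls[n_in:], future)
    return best_in, best_out

best_in, best_out = best_random_strategy()
print(f"best in-sample hit rate: {best_in:.2f}, same strategy out-of-sample: {best_out:.2f}")
```

The winner of the backtest looks skilled on past data purely because it was selected from many tries; on the unseen days it should drift back toward a coin flip.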

ddtaylor
·
1 day ago
·
[ - ]

Alpaca?

lisbbb
·
1 day ago
·
[ - ]

This. This all day. I used to paper trade using ThinkOrSwim and I was doubling and tripling my money effortlessly. Then I decided to move my strategy to the real deal and it didn't do very well at all. It was all bs.

bmitc
·
1 day ago
·
[ - ]

I've honestly never understood what backtesting even does because of the things you mention like time it takes to request and close trades (if they even do!), responses to your trades, the continuous and dynamic input of the market into your model, etc.

Is there any reference that explains the deep technicalities of backtesting and how it is supposed to actually influence your model development? It seems to me that one could spend a huge amount of effort on backtesting that would distract from building out models and tooling and that that effort might not even pay off given that the backtesting environment is not the real market environment.

tim333
·
22 hours ago
·
[ - ]

I'm not sure about deep technicalities but backtesting is a useful thing to see how some strategy would have performed at some times in the past but there are quite a lot of limitations to it. Two of the big ones are the market reacting to you and maybe more so a kind of hindsight bias where you devise some strategy that would have worked great on past markets but the real time ones do something different.

https://en.wikipedia.org/wiki/Long-Term_Capital_Management was kind of an example of both of those. They based their predictions on past behaviour which proved incorrect. Also if other market participants figure a large player is in trouble and going to have to sell a load of bonds they all drop their bids to take advantage of that.

A lot of deviations from efficient market theory are like that - not deeply technical but about human foolishness.

Maxatar
·
22 hours ago
·
[ - ]

We use back testing at my firm for two primary reasons, one as a way to verify correctness and two as a way to assess risk.

We do not use it as a way to determine profitability.

bmitc
·
18 hours ago
·
[ - ]

This is interesting because I'm not immediately sure how you verify correctness and assess risk without also addressing profitability.

By assessing risk, is that just checking that it doesn't dump all your money and that you can at least maintain a stable investment cache?

Are you willing to say more about correctness? Is it the correctness of the models, of the software, or something else?

Maxatar
·
18 hours ago
·
[ - ]

Profitability is not in any way considered a property of the correctness of an algorithm. An algorithm can be profitable and incorrect, and an algorithm can be correct but not profitable.

Correctness has to do with whether the algorithm performed the intended actions in response to the inputs/events provided to it, nothing more. For the most part correctness of an algorithm can be tested the same way most software is tested, ie. unit tests, but it's also worth testing the algorithm using live data/back testing it since it's not feasible to cover every possible scenario in giant unit tests, but you can get pretty good coverage of a variety of real world scenarios by back testing.
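
As a rough illustration of correctness in that sense (a hypothetical toy rule, not anything from a real firm's stack): the check below only verifies that the algorithm took the intended action for each recorded input, and says nothing about whether the rule makes money.

```python
def decide(position, price, moving_average):
    """Toy rule: hold one unit while price is above its moving average, else hold none."""
    if price > moving_average and position == 0:
        return "BUY"
    if price <= moving_average and position == 1:
        return "SELL"
    return "HOLD"

def replay(events):
    """Feed recorded (price, moving_average) events through the rule, tracking position."""
    position, actions = 0, []
    for price, moving_average in events:
        action = decide(position, price, moving_average)
        if action == "BUY":
            position = 1
        elif action == "SELL":
            position = 0
        actions.append(action)
    return actions

# Made-up events; in a backtest these would come from recorded market data.
events = [(101, 100), (102, 100), (99, 100), (98, 100), (103, 100)]
actions = replay(events)
print(actions)
```

Replaying a long stretch of recorded data through `replay` and comparing against the intended behavior is the back-testing version of the same idea: it exercises scenarios nobody thought to write a unit test for.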

Nevermark
·
1 day ago
·
[ - ]

Just one run per model? That isn't backtesting. I mean technically it is, but "testing" implies producing meaningful measures.

Also just one time interval? Something as trivial as "buy AI" could do well in one interval, and given models are going to be pumped about AI, ...

100 independent runs on each model over 10 very different market behavior time intervals would produce meaningful results. Like actually credible, meaningful means and standard deviations.

This experiment, as is, is a very expensive unbalanced uncharacterizable random number generator.

cheeseblubber
·
1 day ago
·
[ - ]

Yes, definitely. We were using our own budget, out of our own pocket, and these model runs were getting expensive. Claude cost us around 200-300 dollars per 8-month run, for example. We want to scale it and get more statistically significant results but wanted to share something in the interim.

Nevermark
·
1 day ago
·
[ - ]

Got it. It is an interesting thing to explore.

energy123
·
1 day ago
·
[ - ]

To their credit, they say in the article that the results aren't statistically significant. It would be better if that disclaimer was more prominently displayed though.

The tone of the article is focused on the results when it should be "we know the results are garbage noise, but here is an interesting idea".

zer0tonin
·
1 day ago
·
[ - ]

Not only just one run per model, but no metrics other than total return. If you pick stocks at random you have a very high chance of beating the S&P 500, so you need a bit more than that to make a good benchmark.

Marsymars
·
1 day ago
·
[ - ]

To take it to the absurdist conclusion, you could backtest each LLM "which single stock should I buy on Jan 1, 2010 to maximize my returns over the next 15 years?"

If your backtested LLM performed well, would you use the same strategy for the next 15 years? (I suppose there are people who would.)

hhutw
·
1 day ago
·
[ - ]

Yeah...one run per model is just random walk in my opinion

ipnon
·
1 day ago
·
[ - ]

Yes, if these models available for $200/month are making 50% returns reliably, why isn’t Citadel having layoffs?

lisbbb
·
1 day ago
·
[ - ]

In my experience, you get a few big winners, but since you have to keep placing new trades (e.g. bets) you eventually blow one and lose most of what you made. This is particularly true with options and futures trades. It's a stupid way to speculate with or without AI help doesn't matter and will never matter.

dash2
·
1 day ago
·
[ - ]

There's also this thing going on right now: https://nof1.ai/leaderboard

Results are... underwhelming. All the AIs are focused on daytrading Mag7 stocks; almost all have lost money with gusto.

rallies
·
1 day ago
·
[ - ]

I think the big limitation of nof1 is that they're not using a lot of data that an actual investor would use when researching companies.

We're trying to fix some of those limitations and run a similar live competition at https://rallies.ai/arena

mjk3026
·
1 day ago
·
[ - ]

I also saw the hype on X yesterday and had already checked the https://nof1.ai/leaderboard, so I figured this post was about those results — but apparently it’s a completely different arena.

I still have no idea how to make sense of the huge gap between the Nof1 arena and the aitradearena results. But honestly, the Nof1 dashboard — with the models posting real-time investment commentary — is way more interesting to watch than the aitradearena results anyway.

richardhenry
·
1 day ago
·
[ - ]

If I'm understanding this website correctly, these models can only trade in a handful of tech stocks along with the XYZ100 hyperliquid coin?

syntaxing
·
1 day ago
·
[ - ]

Let me guess, the mystery model is theirs

yahoozoo2
·
1 day ago
·
[ - ]

It says "Undisclosed frontier AI Lab (not Nof1)"

enlyth
·
1 day ago
·
[ - ]

With the speed of how pricing information propagates, this seems way too dependent on how the agent is built, what information it has access to, and the feedback loop between the LLM and actions it can carry out

cheeseblubber
·
1 day ago
·
[ - ]

OP here. We realized there are a ton of limitations with backtest and paper money but still wanted to do this experiment and share the results. By no means is this statistically significant on whether or not these models can beat the market in the long term. But wanted to give everyone a way to see how these models think about and interact with the financial markets.

anigbrowl
·
1 day ago
·
[ - ]

You should redo this with human controls. By a weird coincidence, I have sufficient free time.

apparent
·
1 day ago
·
[ - ]

> Grok ended up performing the best while DeepSeek came close to second.

I think you mean "DeepSeek came in a close second".

apparent
·
1 day ago
·
[ - ]

OK, now it says:

> Grok ended up performing the best while DeepSeek came close second.

"came in a close second" is an idiom that only makes sense word-for-word.

pottertheotter
·
1 day ago
·
[ - ]

Cool experiment.

I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
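
A minimal sketch of a market-adjusted return, assuming a single-factor market model with a zero risk-free rate and made-up monthly numbers (the article does not report anything like this). The intercept of the fit is the alpha; the slope is the market beta:

```python
from statistics import mean

def market_model(strategy_returns, market_returns):
    """OLS fit of r_strategy = alpha + beta * r_market; returns (alpha, beta)."""
    mean_s, mean_m = mean(strategy_returns), mean(market_returns)
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(strategy_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta

# Made-up monthly returns: the "strategy" is just the market levered 2x.
market = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.02]
strategy = [2 * m for m in market]

alpha, beta = market_model(strategy, market)
raw_outperformance = mean(strategy) - mean(market)
print(f"beta={beta:.2f} alpha={alpha:.4f} raw outperformance={raw_outperformance:.4f}")
```

In this toy case the strategy beats the market on raw return in a rising period, yet its alpha is zero: all of the outperformance is explained by the higher beta. A fuller analysis would add more factors (size, value, momentum, sector).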

philipwhiuk
·
23 hours ago
·
[ - ]

You're not really giving them any money and it's not actually trading.

There's no market impact to any trading decision they make.

beezle
·
21 hours ago
·
[ - ]

What were the risk adjusted returns? Without knowing that, this is all kind of meaningless. Being high beta in a rising market doesn't equate to anything brilliant.

this_user
·
1 day ago
·
[ - ]

I can almost guarantee you that these models will underperform the market in the long run, because they are simply not designed for this purpose. LLMs are designed to simulate a conversation, not predict forward returns of a time series. What's more, most of the widely disseminated knowledge out there on the topic is effectively worthless, because there is an entire cottage industry of fake trading gurus and grifters, and the LLMs have no ability to separate actual information from the BS.

If you really wanted to do this, you would have to train specialist models - not LLMs - for trading, which is what firms are doing, but those are strictly proprietary.

The only other option would be to train an LLM on actually correct information and then see if it can design the specialist model itself, but most of the information you would need for that purpose is effectively hidden and not found in public sources. It is also entirely possible that these trading firms have already been trying this: using their proprietary knowledge and data to attempt to train a model that can act as a quant researcher.

joegibbs
·
1 day ago
·
[ - ]

I think it would be interesting to see how it goes in a scenario where the market declines or where tech companies underperform the rest of the market. In recent history they've outperformed the market and that might bias the choices that the LLMs make - would they continue with these positive biases if they were performing badly?

gerdesj
·
1 day ago
·
[ - ]

These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!

LLMs are handy tools but no more. Even Qwen3-30B heavily quantised will do a passable effort of translating some Latin to English. It can whip up small games in a single prompt and much more and with care can deliver seriously decent results but so can my drill driver! That model only needs a £500 second hand GPU - that's impressive for me. Also GPT-OSS etc.

Yes, you can dive in with the bigger models that need serious hardware and they seem miraculous. A colleague had to recently "force" Claude to read some manuals until it realised it had made a mistake about something and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.

irishcoffee
·
1 day ago
·
[ - ]

> But wanted to give everyone a way to see how these models think…

Think? What exactly did “it” think about?

cheeseblubber
·
1 day ago
·
[ - ]

You can click in to the chart and see the conversation as well as for each trade what was the reasoning it gave for it

philipwhiuk
·
23 hours ago
·
[ - ]

A model can't tell you why it made the decision.

What it can do is inspect the decision it made and make up a reason a human might have said when making the decision.

stoneyhrm1
·
1 day ago
·
[ - ]

"Pass the salt? You mean pass the sodium chloride?"

rallies
·
1 day ago
·
[ - ]

This is pretty cool.

We're also running a live experiment on both stocks and options. One difference with our experiment is a lot more tools being available to the models (anything you can think of, sec filings, fundamentals, live pricing, options data).

We think backtests are meaningless given LLMs have mostly memorized every single thing that happened so it's not a good test. So we're running a forward test. Not enough data for now but pretty interesting initial results

https://rallies.ai/arena

natiman1000
·
20 hours ago
·
[ - ]

Is the code/prompts used open source? If not, how can we say it's legit?

touristtam
·
23 hours ago
·
[ - ]

How is Qwen so much worse than the rest (for the period accounted)?

dhosek
·
1 day ago
·
[ - ]

I wouldn’t trust any backtesting with these models. Try doing a real-time test over 8 months and see what happens then. I’d also be suspicious of anything that doesn’t take actual costs into account.

rallies
·
1 day ago
·
[ - ]

We're running some live experiments these days, for both stocks and options. https://rallies.ai/arena

philipwhiuk
·
22 hours ago
·
[ - ]

With actual money? Or still fake money?

dhosek
·
15 hours ago
·
[ - ]

Fake money is better than nothing, but one hopes that at the very least they’re correctly managing prices with the bid-ask spread, although real money would tend to influence what the actual numbers would be (small dollar amounts likely getting worse pricing, large dollar amounts potentially impacting the movement of the market).

copypaper
·
1 day ago
·
[ - ]

>Each model gets access to market data, news APIs, company financials...

The article is very very vague on their methodology (unless I missed it somewhere else?). All I read was, "we gave AI access to market data and forced it to make trades". How often did these models run? Once a day? In a loop continuously? Did it have access to indicators (such as RSI)? Could it do arbitrary calculations with raw data? Etc...

I'm in the camp that AI will never be able to successfully trade on its own behalf. I know a couple of successful traders (and many unsuccessful!), and it took them years of learning and understanding before breaking even. I'm not quite sure what the difference is between the successful and non-successful. Some sort of subconscious knowledge from staring at charts all day? A level of intuition? Regardless, it's more than just market data and news.

I think AI will be invaluable as an assistant (disclaimer: I'm working on an AI trading assistant), but on its own? Never. Some things simply can't be solved with AI and I think this is one of them. I'm open to being wrong, but nothing has convinced me otherwise.

bitmasher9
·
1 day ago
·
[ - ]

1. Backtesting doesn’t mean very much. For lots of reasons real trading is different than backtesting.

2. 8 months is an incredibly short trading window. I care where the market will be in 8 years way more than 8 months.

ryandvm
·
1 day ago
·
[ - ]

It seems like back-testing an LLM is going to require significant white-washing of the test data to prevent the LLM from just trading on historical trends it is aware of.

Scrubbing symbol names wouldn't even be enough because I suspect some of these LLMs could "figure out" which stock is, say NVDA, based on the topology of its performance graph.

sethops1
·
1 day ago
·
[ - ]

> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading

So the results are meaningless - these LLMs have the advantage of foresight over historical data.

PTRFRLL
·
1 day ago
·
[ - ]

> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.

stusmall
·
1 day ago
·
[ - ]

Even if it is after the cut off date wouldn't the models be able to query external sources to get data that could positively impact them? If the returns were smaller I could reasonably believe it but beating the S&P500 returns by 4x+ strains credulity.

cheeseblubber
·
1 day ago
·
[ - ]

We used the LLMs' APIs and provided custom tools, like a stock ticker tool that only gave the model stock price information for that date of the backtest. We did the same for news APIs, technical indicator APIs, etc. It took quite a long time to make sure that there wasn't any data leakage. The whole process took us about a month or two to build out.
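
To make the idea concrete, here is a hypothetical sketch of such a date-gated tool. The class name, methods, ticker, and prices are all invented for illustration and are not our actual code:

```python
from datetime import date

class TimeGatedPrices:
    """Hypothetical price tool that refuses to return data after the simulated 'today'."""

    def __init__(self, history):
        # history: {ticker: {date: close_price}}
        self._history = history
        self.today = None  # set by the backtest loop before each agent turn

    def _require_today(self):
        if self.today is None:
            raise RuntimeError("simulation date not set")
        return self.today

    def get_price(self, ticker, on):
        if on > self._require_today():
            raise PermissionError(f"{on} is in the simulated future")
        return self._history[ticker][on]

    def get_history(self, ticker):
        today = self._require_today()
        # Only rows dated on or before the simulated day are visible.
        return {d: p for d, p in sorted(self._history[ticker].items()) if d <= today}

# Made-up prices for a made-up ticker.
tool = TimeGatedPrices({
    "DEMO": {date(2025, 2, 3): 10.0, date(2025, 2, 4): 11.0, date(2025, 2, 5): 9.5},
})
tool.today = date(2025, 2, 4)
visible = tool.get_history("DEMO")
print(visible)
```

A gate like this only covers what the tools return. It does nothing about anything the model may already have in its weights.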

alchemist1e9
·
1 day ago
·
[ - ]

I have a hunch the Grok model cutoff is not accurate and somehow it has updated weights, though they still call it the same Grok model as the params and size are unchanged, but they are incrementally training it in the background. Of course I don’t know this, but it’s what I would do in their situation, since ongoing incremental training could be a neat trick to improve their ongoing results against competitors, even if marginal. I also wouldn’t trust the models to honestly disclose their decision process either.

That said. This is a fascinating area of research and I do think LLM driven fundamental investing and trading has a future.

plufz
·
1 day ago
·
[ - ]

I know very little about how the environment where they run these models look, but surely they have access to different tools like vector embeddings with more current data on various topics?

endtime
·
1 day ago
·
[ - ]

If they could "see" the future and exploit that they'd probably have much higher returns.

plufz
·
1 day ago
·
[ - ]

I would say that if these models independently could create such high returns all these companies would shut down the external access to the models and just have their own money making machine. :)

alchemist1e9
·
1 day ago
·
[ - ]

56% over 8 months with the constraints provided are pretty good results for Grok.

disconcision
·
1 day ago
·
[ - ]

you can (via the api, or to a lesser degree through the setting in the web client) determine what tools if any a model can use

plufz
·
1 day ago
·
[ - ]

But isn’t that more which MCP:s you can configure it to use? Do we have any idea which secret sauce stuff they have? Surely it’s not just a raw model that they are executing?

disconcision
·
1 day ago
·
[ - ]

with the exception that it doesn't seem possible to fully disable this for grok 4

alchemist1e9
·
1 day ago
·
[ - ]

which is curiously the best model …

itake
·
1 day ago
·
[ - ]

> We time segmented the APIs to make sure that the simulation isn’t leaking the future into the model’s context.

I wish they could explain what this actually means.

nullbound
·
1 day ago
·
[ - ]

Overall, it does sound weird. On the one hand, assuming I properly understand it, what they are saying is that they removed the model's ability to cheat based on its specific training. And I do get that nuance ablation is a thing, but this is not what they are discussing there. They are only removing one avenue for the model to 'cheat'. For all we know, some of that data may have been part of its training set already...

devmor
·
1 day ago
·
[ - ]

It's a very silly way of saying that the data the LLMs had access to was presented in chronological order, so that for instance, when they were trading on stocks at the start of the 8 month window, the LLMs could not just query their APIs to see the data from the end of the 8 month window.

joegibbs
·
1 day ago
·
[ - ]

That's only if they're trained on data more recent than 8 months ago

CPLX
·
1 day ago
·
[ - ]

Not sure how sound the analysis is but they did apparently actually think of that.

kqr
·
21 hours ago
·
[ - ]

Extremely similar earlier submission but focused on cryptocurrencies, using real money, and in real time: https://news.ycombinator.com/item?id=45976832

I'm extremely skeptical of any attempt to prevent leakage of future results to LLMs evaluated on backtesting. Both because this has been shown in the literature to be difficult, and because I personally found it very difficult when working with LLMs for forecasting.

dudeinhawaii
·
19 hours ago
·
[ - ]

This is the completely wrong way to do this. I say this as someone who does work in this area of leveraging LLMs to a limited degree in trading.

LLMs are naive, easily convinced, and myopic. They're also non-deterministic. We have no way of knowing if you ran this little experiment 10 times whether they'd all pick something else. This is a scattershot + luck.

The RIGHT way to do this is to first solve the underlying problem deterministically. That is, you first write your trading algorithm that's been thoroughly tested. THEN you can surface metadata to LLMs and say things along the lines of "given this data + data you pull from the web", make your trade decision for this time period and provide justification.

Honestly, adding LLMs directly to any trading pipeline just adds non-useful non-deterministic behavior.

The main value is speed of wiring up something like sentiment analysis as a value add or algorithmic supplement. Even this should be done using proper ML but I see the most value in using LLMs to shortcut ML things that would require time/money/compute. Trading value now for value later (the ML algorithm would ultimately run cheaper long-run but take longer to get into prod).

This experiment, like most "I used AI to trade" blogs, is completely naive in its approach. They're taking the lowest possible hanging fruit. Worse still when those results are the rising tide lifting all boats.

Edit (was a bit harsh): This experiment is an example of the kind of embarrassingly obvious things people try with LLMs without understanding the domain and then write up. To an outsider it can sound exciting. To an insider it's like seeing a news story "LLMs are designing new CPUs!". No they're not. A more useful bit of research would be to control for the various variables (sector exposure etc) and then run it 10_000 times and report back on how LLM A skews towards always buying tech and LLM B skews towards always recommending safe stocks.

Alternatively, if they showed the LLM taking a step back and saying "ah, let me design this quant algo to select the best stocks" -- and then succeeding -- I'd be impressed. I'd also know that it was learned from every quant that had AI double check their calculations/models/python.. but that's a different point.

toephu2
·
1 day ago
·
[ - ]

Predicting stock prices means you are competing directly against massive hedge funds and professional quant teams with effectively unlimited budgets and large teams of engineers. These professionals are already using and constantly tweaking the latest models to gain an advantage.

It is highly unlikely that you guys or any individual, even utilizing the latest LLMs will consistently discover an edge that beats the market over the long run.

buredoranna
·
1 day ago
·
[ - ]

Like so many analyses before them, including my own, this completely misses the basics of mean/variance risk analysis.

We need to know the risk adjusted return, not just the return.
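
For example, a minimal Sharpe ratio sketch, assuming simple per-period returns and a constant risk-free rate. The two return series are made up to share the same mean:

```python
from statistics import mean, stdev

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from per-period simple returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5

# Two made-up monthly return series with the same average return (1% a month).
steady = [0.009, 0.011] * 6
volatile = [0.06, -0.04] * 6

print(f"steady:   mean={mean(steady):.3f} sharpe={sharpe_ratio(steady):.1f}")
print(f"volatile: mean={mean(volatile):.3f} sharpe={sharpe_ratio(volatile):.1f}")
```

Ranked on average return alone, the two series are indistinguishable. Ranked on return per unit of volatility, the steady one is far ahead, which is the information a total-return leaderboard hides.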

xnx
·
1 day ago
·
[ - ]

Spoiler: They did not use real money or perform any actual trades.

mvkel
·
23 hours ago
·
[ - ]

When the market is rising, everyone looks like a genius.

Would have been better to have variants of each, locked to specific industries.

It also sounds like they were -forced- to make trades every day. Why? deciding not to trade is a good strategy too.

snapdeficit
·
18 hours ago
·
[ - ]

Anyone who traded tech stocks in the 1990s when AmeriTrade appeared remembers this story.

Have the LLMS trade anything BUT tech stocks and see how they do.

That’s the real test.

EDIT: I remember this is probably before AmeriTrade offered options. I was calling in trades at 6:30AM PST to my broker while he probably laughed at me. But the point is the same: any doofus could make money buying tech stocks and holding for a few weeks. Companies were splitting constantly.

lvspiff
·
1 day ago
·
[ - ]

I setup real life accounts with etrade and fidelity using the etrade auto portfolio, fidelity i have an advisor for retirement, and then i did a basket portfolio as well but used ms365 with grok 5 and various articles and strategies to pick a set of 5 etfs that would perform similarly to the exposure of my other two.

This year So far all are beating the s&p % wise (only by <1% though) but the ai basket is doing the best or at least on par with my advisor and it’s getting to a point where the auto investment strategy of etrade at least isn’t worth it. Its been an interesting battle to watch as each rebalances at varying times as i put more funds in each and some have solid gains which profits get moved to more stable areas. This is only with a few k in each acct other than retirement but its still fun to see things play out this year.

In other words though im not surprised at all by the results. Ai isnt something to day trade with still but it is helpful in doing research for your desired risk exposure long term imo.

lisbbb
·
1 day ago
·
[ - ]

How much are the expense ratios on those etfs you chose, though? I mean, Vanguard, Fidelity, Blackrock, and others have extremely low cost funds and etfs and it has been shown year after year and decade after decade that you can't beat their average returns over the long term. Indexing works for a reason. Beating something by 1%? It's not even worth it if your costs and taxes are higher than that.

rao-v
·
20 hours ago
·
[ - ]

I’d rather give an LLM the earnings report for a stock and the next day’s S&P 500 opening and see if it can predict the opening price.

Expecting an LLM to magically beat efficient market theory is a bit silly.

Much more reasonable to see if it can incorporate information as well as the market does (to start)

energy123
·
1 day ago
·
[ - ]

One of the recent NeurIPS best paper recipients is relevant here: https://openreview.net/forum?id=saDOrrnNTz

> an extensive empirical study across more than 70 models, revealing the Artificial Hivemind effect: pronounced intra- and inter-model homogenization

So the inter-model variety will be exceptionally low. Users of LLMs will intuitively know this already, of course.

hoerzu
·
1 day ago
·
[ - ]

I built something for backtesting LLMs on Polymarket. You can try it with live data without signing up at: https://timba.fun

peterbonney
·
20 hours ago
·
[ - ]

The devil is really in the details on how the orders were executed in the backtest, slippage, etc. Instead of comparing to the S&P 500 I'd love to see it benchmarked against a range of active strategies, including common non-AI approaches (e.g. mean reversion, momentum, basic value focus, basic growth focus, etc.) and some simple predictive (non-generative) AI models. This would help shake out whether there is selection alpha coming out of the models, or whether there is execution alpha coming out of the backtest.

morgengold
·
1 day ago
·
[ - ]

Am I right that you let LLMs decide for themselves what to read into their input data (like market data, news APIs, company financials)? While this is worth testing, I think it would be more interesting to give them patterns to look for. I played around with using them for technical analysis and let them make the associations with past stock performances. They can even differentiate on what worked in the last 5 years, what in the last year, in the last 3 months etc. This way they can pick up (hopefully) changes in market behavior. Generally the main strength of this approach is to use their pattern recognition capability and also take out the human factor (emotions) for trading decisions.

aidenn0
·
1 day ago
·
[ - ]

It seems to me that short-term simulations will tend to underprice risk.

Imagine a market where you can buy only two stocks:

Stock A goes up invariably 1% per month

Stock B goes up 1.5% per month with a 99% chance, but loses 99% of its value with a 1% chance.

Stock B has a 94% chance of beating stock A on a 6 month simulation, but only a 30% chance of beating stock A on a 10 year simulation.
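
Those figures can be checked directly. A sketch, relying on the observation that a single crash leaves B so far behind that it cannot catch up over these horizons, so B beats A exactly when it never crashes:

```python
def p_b_beats_a(months, p_crash=0.01):
    """Probability stock B finishes ahead of stock A: B must avoid a crash every month."""
    return (1 - p_crash) ** months

def a_value(months):
    """Stock A: +1% every month, starting from 1.0."""
    return 1.01 ** months

def b_value_with_one_crash(months):
    """Stock B's best case after exactly one crash: one month at -99%, the rest at +1.5%."""
    return 1.015 ** (months - 1) * 0.01

for months in (6, 120):
    # Sanity check that even a single crash puts B behind A at this horizon.
    assert b_value_with_one_crash(months) < a_value(months)
    print(f"{months} months: P(B beats A) = {p_b_beats_a(months):.0%}")
```

That is 0.99^6 ≈ 94% for six months and 0.99^120 ≈ 30% for ten years.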

rcarmo
·
1 day ago
·
[ - ]

I spent a while looking at trading algos a few years back (partly because of quant stuff I got involved in, and partly out of curiosity). I found that none of the “slow” trading (i.e., that you could run at home alongside your day trading account) was substantially effective (at least in my sampling), but I never thought an LLM would be any good at it because all the analysis is quantitative, not qualitative or contextual.

In short, I don’t think this study proves anything unless they gave the LLMs additional context besides the pure trading data (Bloomberg terminals have news for a reason: there’s typically a lot more context in the market than individual stock values or history).

keepamovin
·
1 day ago
·
[ - ]

I’d say Grok did best because it has the best access to information. Grok’s deep search and real-time knowledge capabilities, due to the X integration and just generally being plugged into the pulse of the Internet, are really best in class. It’s a great OSINT research tool.

Interesting how this research seems to tease out a truth traders have known for eons: picking stocks is all about having information, maybe a little bit of asymmetric information due to good research, and not necessarily about all the analysis that can be done (that’s important, but information is king), because it’s a speculative market that’s collectively reacting to those kinds of signals.

throwawayffffas
·
1 day ago
·
[ - ]

> We also built a way to simulate what an agent would have seen at any point in the past. Each model gets access to market data, news APIs, company financials—but all time filtered: agents see only what would have been available on that specific day during the test period.

That's not going to work. These agents, especially the larger ones, will have news about the companies embedded in their weights.

devilsbabe
·
1 day ago
·
[ - ]

Funny how if you kept reading before commenting, they addressed that point specifically

> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.

thedougd
·
1 day ago
·
[ - ]

Would be nice to use the logos in the legend. I use these LLMs every day and didn't know what half these logos on the graph were.

mvkel
·
23 hours ago
·
[ - ]

Predicting the stock market will likely never happen because it’s recursive. We can predict the next 10 days of weather, but the weather doesn’t change because it read your forecast. As long as markets continue to react to their own reactions, they will remain unpredictable.

If the strategy is long, there might be alpha to be found. But day trading? No way.

oersted
·
23 hours ago
·
[ - ]

If stocks are more of a closed system that is only weakly affected by external factors in the short term, now I finally understand why they hire so many physicists for financial modeling!

There is of course the fact that physicists tend to be the best applied mathematicians, even if they don’t end up using any of their physics knowledge. And they generally had the reputation of “the smartest” people for the last century.

Anyway, such systems are complex and chaotic, yes, but there are many ways of predicting aspects of them, like with fluid simulation, to give a basic example. And I don’t get your point about weather: it is also recursive in the same way and reacts to its own reactions. Sure, it is not reacting to predictions of itself, but that’s just a special kind of reaction, and patterns in others’ predictions can definitely be predicted accurately, perhaps not individually but in the aggregate.

mvkel
·
22 hours ago
·
[ - ]

> there are many ways of predicting aspects of them

Yes, and it's priced in

> but that’s just a special kind of reaction

That's just arguing semantics. My point was that weather doesn't react to human predictions, explicitly

jerf
·
23 hours ago
·
[ - ]

"We can predict the next 10 days of weather, but the weather doesn’t change because it read your forecast."

Less true than it used to be, with cloud seeding being an off-the-shelf technology now. Still largely true, but not entirely true anymore.

mlmonkey
·
1 day ago
·
[ - ]

> We were cautious to only run after each model’s training cutoff dates for the LLM models

Grok is constantly training and/or it has access to websearch internally.

You cannot backtest LLMs. You can only "live" test them going forward.

cheeseblubber
·
1 day ago
·
[ - ]

Via api you can turn off websearch internally. We provided all the models with their own custom tools that only provided data up to the date of the backtest.

mlmonkey
·
1 day ago
·
[ - ]

But Grok is internally training on Tweets etc. continuously.

natiman1000
·
20 hours ago
·
[ - ]

If the code and prompts are not open source how can we trust anything yall say?

luccabz
·
1 day ago
·
[ - ]

we should:

1. train with a cutoff date at ~2006

2. simulate information flow (financial data, news, earnings, ...) day by day

3. measure if any model predicts the 2008 collapse, how confident they are in the prediction and how far in advance
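
Step 2 is the part that is easy to get wrong. A minimal sketch of a point-in-time replay loop, assuming every record carries a `published` date; `visible_as_of`, `replay`, and `decide` are hypothetical names, and `decide` stands in for whatever actually calls the model:

```python
from datetime import date


def visible_as_of(records, today):
    """Return only the records published on or before `today`."""
    return [r for r in records if r["published"] <= today]


def replay(records, trading_days, decide):
    """Call `decide` once per day with only the data available on that day."""
    decisions = {}
    for today in trading_days:
        decisions[today] = decide(today, visible_as_of(records, today))
    return decisions


# Hypothetical records; a real run would load news, prices, and filings.
records = [
    {"published": date(2007, 1, 2), "text": "earnings report"},
    {"published": date(2007, 1, 4), "text": "analyst downgrade"},
]
days = [date(2007, 1, 2), date(2007, 1, 3), date(2007, 1, 4)]
seen = replay(records, days, lambda today, visible: len(visible))
print(seen[date(2007, 1, 3)])  # 1
```

This only controls what is passed in through the inputs. It does nothing about knowledge already baked into the weights, which is why step 1 (the training cutoff) has to come first.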

reformd
·
4 hours ago
·
[ - ]

financial advice of smart refrigerator right before the dump?

kqr
·
21 hours ago
·
[ - ]

Their annual geometric mean return is 45 %! That's some serious overbetting. In a market that didn't accidentally align with their biases, they would have lost money very quickly.

client4
·
1 day ago
·
[ - ]

The obvious next question is: does the AI on cocaine outperform? https://pihk.ai/

btbuildem
·
1 day ago
·
[ - ]

It turns out DeepSeek only made BUY trades (not a single SELL in the history in their live example) -- so basically, buy & hold strategy wins, again.

culi
·
1 day ago
·
[ - ]

this study should be replicated during a bear market

bmitc
·
1 day ago
·
[ - ]

Buy and hold performs well over long time scales by simply not adjusting based upon sentiment.

throwawayffffas
·
1 day ago
·
[ - ]

The operative word is long. Historically, if you entered the market just before a downturn, it could take anywhere from years to a couple of decades to recover, depending on which downturn we are looking at.

bmitc
·
18 hours ago
·
[ - ]

I think that requires entering once. I was referring to continuing to enter periodically and holding.

Glyptodon
·
22 hours ago
·
[ - ]

Multiple runs of randomized backtesting seem needed for this to mean anything. It's also not clear to me how there's any kind of information update loop. Maybe I didn't read closely enough.

halzm
·
1 day ago
·
[ - ]

I think these tests are always difficult to gauge how meaningful they actually are. If the S&P500 went up 12% over that period, mainly due to tech stocks, picking a handful of tech stocks is always going to set you higher than the S&P. So really all I think they test is whether the models picked up on the trend.

I'm more surprised that Gemini managed to lose 10%. I wish they actually mentioned what the models invested in and why.

Marsymars
·
1 day ago
·
[ - ]

> picking a handful of tech stocks is always going to set you higher than the S&P.

That's a bold claim.

taylorlapeyre
·
1 day ago
·
[ - ]

Wait — isn't that exactly what good investors do? They look for what stocks are going to beat expectations and invest in them. If a stock broker I hired got this return, I wouldn't be rolling my eyes and saying "that's only because they noticed the trend in tech stocks." That's exactly what I'm paying them to do.

RandomLensman
·
23 hours ago
·
[ - ]

Could be interesting to see performance distribution for random strategies on that stock universe as a comparison. The reverse could also be interesting: how do the models perform on data that is random?
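
A rough sketch of what that first comparison could look like, assuming you already have each stock's return over the test period. The tickers and returns below are made up, and `random_baseline` and `percentile_rank` are hypothetical helper names:

```python
import random


def random_baseline(returns, k, trials, seed=0):
    """Period returns of `trials` random equal-weight portfolios of `k` stocks."""
    rng = random.Random(seed)
    tickers = list(returns)
    results = []
    for _ in range(trials):
        picks = rng.sample(tickers, k)
        # Equal-weight buy-and-hold return is the mean of the picked returns.
        results.append(sum(returns[t] for t in picks) / k)
    return results


def percentile_rank(value, baseline):
    """Fraction of baseline portfolios that `value` matches or beats."""
    return sum(1 for b in baseline if b <= value) / len(baseline)


# Made-up period returns for a made-up universe, purely for illustration.
universe = {"AAA": 0.30, "BBB": -0.05, "CCC": 0.12, "DDD": 0.02, "EEE": 0.45, "FFF": -0.10}
baseline = random_baseline(universe, k=3, trials=10_000)
print(percentile_rank(0.25, baseline))
```

If a model's return lands somewhere in the middle of that distribution, its picks did no better than random picks from the same universe.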

dehrmann
·
1 day ago
·
[ - ]

Is it just prompting LLMs with "I have $100k to invest. Here are all publicly traded stocks and a few stats on them. Which stocks should I buy?" And repeat daily, rebalancing as needed?

This isn't the best use case for LLMs without a lot of prompt engineering and chaining prompts together, and that's probably more insightful than running the LLMs head-to-head.

cedws
·
1 day ago
·
[ - ]

Backtesting for 8 months is not rigorous enough and also this site has no source code or detailed methodology. Not worth the click.

Genego
·
1 day ago
·
[ - ]

When I see stuff like this, I feel like rereading the Incerto by Taleb just to refresh and sharpen my bullshit senses.

bwfan123
·
22 hours ago
·
[ - ]

LLM is the fad of the day, and these sort of articles provoke the natural get-rich-quick-greed inherent in all of us, especially the young tech-types. As such they are clickbait, and also a barometer of the silliness that is widespread.

I am curious why re-reading the Incerto sharpens your bullshit sense. I have read a few books in that series, but didn't find that they sharpened my bullshit sensor.

digitcatphd
·
1 day ago
·
[ - ]

Backtesting is a complete waste in this scenario. The models already know the best outcomes and are biased towards it.

hoerzu
·
1 day ago
·
[ - ]

How many trades? What's the z-score?

krauses
·
23 hours ago
·
[ - ]

I'd like to see a variation of the models being fine tuned based on investments of those in congress that seem to consistently outperform the markets.

Bombthecat
·
1 day ago
·
[ - ]

I wouldn't call this a test. I would create a test portfolio of a hundred semi-random stocks and see what they sell, buy, or keep.

That tells me way more than "YOLO tech stocks"

Bender
·
1 day ago
·
[ - ]

This experiment was also performed with a fish [1] though it was only given $50,000. Spoiler, the fish did great vs wall street bets.

[1] - https://www.youtube.com/watch?v=USKD3vPD6ZA [video][15 mins]

XenophileJKO
·
1 day ago
·
[ - ]

So.. I have been using an LLM to make 30 day buy and hold portfolios. And the results are "ok". (Like 8% vs 6% for the S&P 500 over the last 90 days)

What you ask the model to do is super important. Just like writing or coding, the default "behavior" is likely to be "average", so you need to be very careful about what you are asking for.

For me this is just a fun experiment and very interesting to see the market analysis it does. I started with o3 and now I'm using 5.1 Thinking (set to max).

I have it looking for stocks trading below intrinsic value with some caveats because I know it likes to hinge on binary events like drug trial results. I also have it look at the correlation between the positions and make sure they don't share the same macro vulnerability.

I just run it once a month and do some trades with one of my "experimental" trading accounts. It certainly has thought of things I hadn't like using an equal weight s&p 500 etf to catch some upside when the S&P seems really top heavy and there may be some movement away from the top components, like last month.

themafia
·
1 day ago
·
[ - ]

I look for issues with a recent double bottom and high insider buy activity. I've found this to be a highly reliable set of signals.

XenophileJKO
·
1 day ago
·
[ - ]

That is interesting.

I was trying to not be "very" prescriptive. My initial impression was, if you don't tell it to look at intrinsic value, the model will look at meme or very common stocks too much. Alternatively specifying an investing persona would probably also move it out of that default behavior profile. You have to kind of tell it about what it cares about. This isn't necessarily about trying to maximize a strategy, it was more about learning what kinds of things would it focus on, what kind of analysis.

machiaweliczny
·
23 hours ago
·
[ - ]

> Potential accidental data leakage from the “future”

Exactly. Makes no sense with models like Grok. DeepSeek likely has this leak too, as it was trained later.

parpfish
·
1 day ago
·
[ - ]

I wonder if this could be explained as the result of LLMs being trained to have pro-tech/ai opinions while we see massive run ups in tech stock valuations?

It’d be great to see how they perform within particular sectors so it’s not just a case of betting big on tech while tech stocks are booming

stockresearcher
·
1 day ago
·
[ - ]

I appreciate that you’ve made the trade histories downloadable and will be taking a look to see what I can learn.

I’ve glanced over some of it and really wonder why they seemed to focus on a small group of stocks.

mikewarot
·
1 day ago
·
[ - ]

They weren't doing it in real time, thus it's possible that the LLMs might have had undisclosed perfect knowledge of the actual history of the market. Only a real-time study is going to eliminate this possibility.

gwd
·
1 day ago
·
[ - ]

The summary to me is here:

> Almost all the models had a tech-heavy portfolio which led them to do well. Gemini ended up in last place since it was the only one that had a large portfolio of non-tech stocks.

If the AI bubble had popped in that window, Gemini would have ended up the leader instead.

turtletontine
·
1 day ago
·
[ - ]

Yup. This is the fallacy of thinking you’re a genius because you made money on the market. Being lucky at the moment (or even the last 5 years) does not mean you’ll continue to be lucky in the future.

“Tech line go up forever” is not a viable model of the economy; you need an explanation of why it’s going up now, and why it might go down in the future. And also models of many other industries, to understand when and why to invest elsewhere.

And if your bets pay off in the short term, that doesn’t necessarily mean your model is right. You could have chosen the right stocks for the wrong reasons! Past performance doesn’t guarantee future performance.

gwd
·
1 day ago
·
[ - ]

What would have been impressive is if the favored industries, or individual companies, experienced a major drop during the target testing window, and the LLMs managed to pull out of those industries before they dropped.

Vegenoid
·
1 day ago
·
[ - ]

Clearly AI is not a bubble, look how good it is at predicting the stock market!

XCSme
·
1 day ago
·
[ - ]

If it's backtesting on data older than the model, then the strategy can have lookahead bias, because the model might already know what big events will happen that can influence the stock markets.

refactor_master
·
1 day ago
·
[ - ]

Should have done GME stocks only. Now THAT would’ve been interesting to see how much they’d end up losing on that.

Just riding a bubble up for 8 months with no consequences is not an indicator of anything.

wowamit
·
1 day ago
·
[ - ]

Is finding the right stocks to invest in an LLM problem? Language models aren't the right fit, I would presume. It would also be insightful to compare this with traditional ML models.

portly
·
23 hours ago
·
[ - ]

What is the point of this?

LLMs are trained to predict the next word in a text. In what way, shape or form does that have anything to do with stock market prediction? Completely ridiculous AI bubble nonsense.

another_twist
·
23 hours ago
·
[ - ]

No it isn't. Next word prediction is what humans do to communicate anyway, so the criticism isn't valid. Except you do that for your own sentences (if you do it for others it's considered rude :) ).

Anyways this criticism is now dated given that modern day LLMs can solve unseen reasoning problems such as those found in the IMO.

It does have something to do with the stock market, since it's about making hypotheses and trading based off that. However, I'd agree that making a proper trading AI here would require reasoning-based fine-tuning for stock market trading actions. Sort of like running GRPO taking market feedback as the reward. The article simply can't do that, since it has no access to the underlying model weights.

bwfan123
·
22 hours ago
·
[ - ]

shhh. We need more of these as counter-parties to improve alpha.

chongli
·
1 day ago
·
[ - ]

They outperformed the S&P 500 but seem to be fairly well correlated with it. Would like to see a 3X leveraged S&P 500 ETF like SPXL charted against those results.

10000truths
·
1 day ago
·
[ - ]

...over the course of 8.5 months, which is way too short for a meaningful result. If their strategy could outperform the S&P 500's 10-year return, they wouldn't be blogging about it.

driverdan
·
1 day ago
·
[ - ]

VTI gained over 10% in that time period so it wasn't much better.

itake
·
1 day ago
·
[ - ]

Model output is non-deterministic.

Did they make 10 calls per decision and then choose the majority? or did they just recreate the monkey picking stocks strategy?
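
For reference, the majority-vote version is only a few lines. This is a sketch, not what the authors did: `majority_decision` is a hypothetical helper and `ask` stands in for one sampled model call.

```python
from collections import Counter


def majority_decision(ask, n=10):
    """Call `ask` n times; return the most common answer and its vote share."""
    votes = Counter(ask() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n


# Stand-in for a sampled model call: hands out canned answers in order.
canned = iter(["BUY", "HOLD", "BUY", "BUY", "SELL", "BUY", "HOLD", "BUY", "BUY", "BUY"])
print(majority_decision(lambda: next(canned)))  # ('BUY', 0.7)
```

The vote share is worth logging too: a 6/10 decision and a 10/10 decision are not the same signal.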

ta12653421
·
1 day ago
·
[ - ]

++1

This.

That's also the reason why I still believe in "classic instruments" when configuring my trade app; the model won't give you the same entries if you ask it, let's say, 5 times.

iLoveOncall
·
1 day ago
·
[ - ]

Since it's not included in the main article, here is the prompt:

> You are a stock trading agent. Your goal is to maximize returns.

> You can research any publicly available information and make trades once per day.

> You cannot trade options.

> Analyze the market and provide your trading decisions with reasoning.

> Always research and corroborate facts whenever possible.

> Always use the web search tool to identify information on all facts and hypotheses.

> Always use the stock information tools to get current or past stock information.

> Trading parameters:

> - Can hold 5-15 positions

> - Minimum position size: $5,000

> - Maximum position size: $25,000

> Explain your strategy and today's trades.

Given the parameters, this definitely is NOT representative of any actual performance.

I recommend also looking at the trade history and reasoning for each trade for each model, it's just complete wind.

As an example, DeepSeek made only 21 trades, all of them buys, and all of them because "Company X is investing in AI". I doubt anyone believes this to be a viable long-term trading strategy.

Scubabear68
·
1 day ago
·
[ - ]

Agree. Those parameters are incredibly artificial bullshit.

1a527dd5
·
1 day ago
·
[ - ]

Time.

That has been the best way to get returns.

I set up a 212 account when I was looking to buy our first house. I bought in tiny chunks, in industries I was comfortable with and knowledgeable about. Over the years I worked up a nice portfolio.

Anyway, long story short. I forgot about the account, we moved in, got a dog, had children.

And then I logged in for the first time in ages, and to my shock, my returns were at 110%. I've done nothing. It's bizarre and perplexing.

jondwillis
·
1 day ago
·
[ - ]

…did you beat the market? 110% is pretty much what the nasdaq has done over the last 5 years

Also N=1

delijati
·
1 day ago
·
[ - ]

time in the market beats timing the market -> Kenneth Fisher ... i learned it the hard way ;)

lisbbb
·
17 hours ago
·
[ - ]

Yeah, uh, all I did was buy BRK.B like a decade ago and it's up 172% or something like that.

The only way I have seen people outperform is by having insider information.

pech0rin
·
1 day ago
·
[ - ]

8 months of a huge bull market. Not exactly indicative of any real insight.

elzbardico
·
19 hours ago
·
[ - ]

A rising tide lifts all boats.

FrustratedMonky
·
1 day ago
·
[ - ]

How much of this is just because the market as a whole is going up.

This same kind of mentality happened pre-2008. People thought they were great at being day-traders, and had all kinds of algorithms that were 'beating the market'.

But it was just that the entire market was going up. They weren't doing anything special.

Once the market turned downward, that was when it took talent to stay even.

Show me these things beating a downward market.

867-5309
·
1 day ago
·
[ - ]

GPT-5 was released 4 months ago..

amelius
·
1 day ago
·
[ - ]

Nonsense. Title should read $0 because they didn't use actual money.

Also, it seems pretty stupid to use commodity tech like LLMs for this.

dogmayor
·
1 day ago
·
[ - ]

They could only trade once per day and hold 5-15 positions with a position size of $5k-$25k according to the agent prompt. Limited to say the least.

aperture147
·
1 day ago
·
[ - ]

Why is my bullshit detector ringing like hell right now??? This sounds like another billion-dollar-Markov-chain-IP that claimed to change the world, opening with a paper with flying colors.

reactordev
·
20 hours ago
·
[ - ]

I would love for them to have included a peg position on SPY @ 100k over the course of the same period. Gives a much better benchmark of what an LLM can do (not much above 2-4%).

Still, cool to see others in my niche hobby of finding the money printer.

jacktheturtle
·
1 day ago
·
[ - ]

This is really dumb, because the models themselves, like markets, are nondeterministic. They will yield different investment strategies based on prompts and random variance.

This is a really dumb measurement.

mempko
·
1 day ago
·
[ - ]

The stats are abysmal. What's the MDD compared to the S&P 500? What is the Sortino? What are the confidence intervals for all the stats? Number of trades? So many questions....
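
Both MDD and Sortino are easy to compute from the downloadable trade histories once you have a daily account value series. A minimal sketch using the usual textbook definitions, with made-up account values; the function names are mine and the Sortino here is per period, not annualized:

```python
from math import sqrt


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sortino(returns, target=0.0):
    """Mean excess return divided by downside deviation (per period)."""
    excess = [r - target for r in returns]
    # Only returns below the target count toward downside deviation.
    downside = sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside == 0:
        return float("inf")
    return (sum(excess) / len(excess)) / downside


# Made-up daily account values, for illustration only.
equity = [100.0, 120.0, 90.0, 110.0]
daily = [b / a - 1 for a, b in zip(equity, equity[1:])]
print(max_drawdown(equity))  # 0.25
print(sortino(daily))
```

With only 8 months of daily data, any confidence interval around these numbers will be wide, which is part of the point.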

tiffani
·
1 day ago
·
[ - ]

What was the backtesting method? Was walk-forward testing involved? There are different ways to backtest.

cramcgrab
·
23 hours ago
·
[ - ]

Yeah I’ve been using grok to manage my yolo fund, it’s been doing great so far, up around 178% ytd, only rebalance once every other month.

darepublic
·
1 day ago
·
[ - ]

So in other words I should have listened to the YouTube brainrot and asked ChatGPT for my trades. Sigh.

_alternator_
·
1 day ago
·
[ - ]

Wait, they didn’t give them real money. They simulated the results.

fortran77
·
1 day ago
·
[ - ]

I would love to see this run during an extended bear market period.

ta12653421
·
1 day ago
·
[ - ]

Can't the model go short in a bear market?

nurettin
·
1 day ago
·
[ - ]

Deepseek and grok together would perform even better.

hsuduebc2
·
1 day ago
·
[ - ]

In a bullish market where a few companies are creating a bubble, does this benchmark have any informational value? Wouldn't it be better to run this on seemingly random intervals in past years?

IncreasePosts
·
1 day ago
·
[ - ]

Just picking tech stocks and winning isn't interesting unless we know the thesis behind picking the tech stocks.

Instead, maybe a better test would be to give it 100 medium cap stocks, have it continually rebalance its portfolio among those 100 stocks, and then test the performance.

stuffn
·
1 day ago
·
[ - ]

Trading in a nearly 20 year bull market and doing well is not an accomplishment.

dismalaf
·
1 day ago
·
[ - ]

Back when I was in university we used statistical techniques similar to what LLMs use to predict the stock market. It's not a surprise that LLMs would do well over this time period. The problem is that when the market turns and bucks trends they don't do so well, you need to intervene.

apical_dendrite
·
1 day ago
·
[ - ]

Looking at the recent holdings for the best models, it looks like it's all tech/semiconductor stocks. So in this time frame they did very well, but if they ended in April, they would have underperformed the S&P500.

lawlessone
·
1 day ago
·
[ - ]

Could they give some random people (i volunteer) 100k for 8 months? ...as a control

iLoveOncall
·
1 day ago
·
[ - ]

I know this is a joke comment, but there are plenty of websites that simulate the stock market and where you can use paper money to trade.

People say it's not equivalent to actually trading though, and you shouldn't use it as a predictor of your actual trading performance, because you have a very different risk tolerance when risking your actual money.

ghaff
·
1 day ago
·
[ - ]

Yeah, if you give me $100K I'm almost certainly going to make very different decisions than either a supposedly optimizing computer or myself at different ages.

theymademe
·
1 day ago
·
[ - ]

prince of zamunda LLM edition or whatever that movie was based on that book was based on the realization how pathetic it all was based on was? .... yeah, some did a good one on ya. just imagine evaluating that offspring one or two generations later ... ffs, this is sooooooooooooooo embarrassing

chroma205
·
1 day ago
·
[ - ]

>We gave each of five LLMs $100K in paper money

Stopped reading after “paper money”

Source: quant trader. paper trading does not incorporate market impact

zahlman
·
1 day ago
·
[ - ]

If your initial portfolio is 100k you are not going to have meaningful "market impact" with your trades assuming you actually make them vs. paper trading.

txg
·
1 day ago
·
[ - ]

Lack of market response is a valid point, but $100k is pretty unlikely to have much impact especially if spread out over multiple trades.

tekno45
·
1 day ago
·
[ - ]

the quant trader you talked to probably sucks.

a13n
·
1 day ago
·
[ - ]

I mean if you’re going to write algos that trade the first thing you should do is check whether they were successful on historical data. This is an interesting data point.

Market impact shouldn’t be considered when you’re talking about trading S&P stocks with $100k.

verdverm
·
1 day ago
·
[ - ]

Historical data is useful for validation, don't develop algos against it, test hypotheses until you've biased your data, then move on to something productive for society

theideaofcoffee
·
1 day ago
·
[ - ]

“Everyone (including LLMs) is a genius in a bull market.”

mrweasel
·
1 day ago
·
[ - ]

I was thinking the same thing. A number of coworkers were trading stocks a few years ago and felt pretty good about their skills, until someone pointed out that making good stock picks is easy when everything is going up. Sure enough, when the market started to fall, they all lost money.

What could make this a bit more interesting is to tell the LLM to avoid the tech stocks, at least the largest ones. Then give it actual money, because your trades will affect the market.

apparent
·
1 day ago
·
[ - ]

Apparently everyone (but Gemini).

koakuma-chan
·
1 day ago
·
[ - ]

Could Gemini end up being better over the longer term?

scarmig
·
1 day ago
·
[ - ]

Depends on if the market can stay irrational longer than Gemini stays solvent.

deadbabe
·
1 day ago
·
[ - ]

Yea, so this is bullshit. An approximation of reality still isn’t reality. If you’re convinced the LLMs will perform as backtested, put real money in and see what happens.

vpribish
·
1 day ago
·
[ - ]

this is so stupid i wish i could flag it twice

Frieren
·
1 day ago
·
[ - ]

[flagged]

frobisher
·
1 day ago
·
[ - ]

lolol Gemini

867-5309
·
1 day ago
·
[ - ]

tl;dr https://www.aitradearena.com/blog/llm-performance-chart.png

petesergeant
·
1 day ago
·
[ - ]

If I'm reading this, almost all of Grok's advantage comes from heavy bets into semi-conductors spiking: ASML, INTC, MU.

andirk
·
1 day ago
·
[ - ]

Update with Gemini 3. It's far better than its predecessors.

regnull
·
1 day ago
·
[ - ]

I'm working on a project where you can run your own experiment (or use it for real trading): https://portfoliogenius.ai. Still a bit rough, but most of the main functionality works.