Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.
Source: I'm part of the training team
Good luck though, very needed project!
Can you comment on how the filtering impacted language coverage? E.g. FineWeb2 has 1800+ languages, but some with very little actual representation, while FineWeb2-HQ has just 20, but each with a substantial data set.
(I'm personally most interested in covering the 24 official EU languages)
I agree with everything you say about getting the experience; the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, so that the model ends up being useful.
But it's good to have more and more players in this space.
Great to read that!
I understand the web is a dynamic thing, but it would still seem to be useful on some level.
How are you going to serve users if website owners decide to wall off their content? You can't ignore one side of the market.
It is a fair point, but how strong a point it is remains to be seen. Some architectures are better than others even with the same training data, so it's not impossible that we could at some point see innovative architectures beating the current proprietary ones. It would probably be short-lived, though, as the proprietary models would obviously improve in their next release after that.
Don't focus too much on a single variable, especially when all the variables have diminishing returns.
[0] the ultimate, of course, being profit.
I'm not sure we're thinking of the same field of AI development. I think I'm talking about super-autocomplete with an integrated copy of all digitized human knowledge, while you're talking about trying to do (proto-)AGI. Is that it?
You just listed the possible options in order of their relative probability. A human would attempt to use them in exactly that order.
They missed an opportunity though. They should have called their machine the AIps (AI Petaflops Supercomputer).
OLMo is fully open
Ai2 believes in the power of openness to build a future where AI is accessible to all. Open weights alone aren’t enough – true openness requires models to be trained in the open with fully open access to data, models, and code.
This leads me to believe that the training data won’t be made publicly available in full, but merely be “reproducible”. This might mean that they’ll provide references like a list of URLs of the pages they trained on, but not their contents.
C.f. https://medium.com/@biswanai92/understanding-token-fertility...
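Since "token fertility" comes up here: it's usually defined as the average number of subword tokens a tokenizer emits per word, so a higher fertility means the model sees fewer words per context window in that language. A minimal sketch of the metric, using a purely hypothetical toy tokenizer (a real measurement would use an actual BPE or SentencePiece vocabulary):

```python
def fertility(words: list[str], tokenize) -> float:
    """Token fertility: total tokens emitted divided by words consumed."""
    token_count = sum(len(tokenize(w)) for w in words)
    return token_count / len(words)

# Toy tokenizer (illustrative only): pretend the vocabulary
# contains every 4-character chunk and nothing longer.
def toy_tokenize(word: str) -> list[str]:
    return [word[i:i + 4] for i in range(0, len(word), 4)]

english = ["the", "cat", "sat"]        # short words -> 1 token each
german = ["Donaudampfschifffahrt"]     # long compound -> many tokens

print(fertility(english, toy_tokenize))  # 1.0
print(fertility(german, toy_tokenize))   # 6.0
```

The point the linked article makes is visible even in this toy: languages with long compounds or scripts under-represented in the tokenizer's training data get systematically higher fertility, which translates into shorter effective context and higher inference cost.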
We'll find out in September if it's true?
“Open LLMs are increasingly viewed as credible alternatives to commercial systems, most of which are developed behind closed doors in the United States or China”
It is obvious that the companies producing big LLMs today have an incentive to enshittify them: pushing subscriptions while also doing product placement, ads, etc. Worse, some already promote political biases.
It would be wonderful if a partnership between academia and government in Europe could deliver a public-good search engine and AI that endeavours to serve the user over the company.
LLMs do seem to favor general relativity, but at the time they probably would've favored classical mechanics, given the training corpora.
Not-yet unified: Quantum gravity, QFT, "A unified model must: " https://news.ycombinator.com/item?id=44289148
Will be interested to see how this model responds to currently unresolvable issues in physics. Is it an open or a closed world mentality and/or a conditioned disclaimer which encourages progress?
What are the current benchmarks?
From https://news.ycombinator.com/item?id=42899805 re: "Large Language Models for Mathematicians" (2023) :
> Benchmarks for math and physics LLMs: FrontierMath, TheoremQA, Multi SWE-bench: https://news.ycombinator.com/item?id=42097683
Multi-SWE-bench: A Multi-Lingual and Multi-Modal GitHub Issue Resolving Benchmark: https://multi-swe-bench.github.io/
Add'l LLM benchmarks and awesome lists: https://news.ycombinator.com/item?id=44485226
Microsoft has a new datacenter that you don't have to keep adding water to, which spares the aquifers.
How could this LLM be used to solve the energy and sustainability problems that all LLMs exacerbate? Solutions for the Global Goals, hopefully.
In that regard it's absolutely not a waste of public infra just like this car was not a waste.
What does anyone get out of this when we already have open-weight models?
Are they going to do very innovative AI research that companies wouldn't dare try or fund? Seems unlikely...
Is it a moonshot project so huge that no single company could fund it? Not that either.
If it's just a bit of fun to train the next generation of LLM researchers, then you might as well build a small-scale toy instead of using up a supercomputer center.
Including how it was trained, what data was used, how training data was synthesized, how other models were used, and so on: all the stuff that is kept secret in the case of Llama, DeepSeek, etc.