(Disclaimer: I'm the founder of OpenPipe, one of the fine-tuning services OP tried and ultimately the one that produced the highest performing model, it appears.)

Data extraction is a use case that fine-tuned models are fantastic at, so I'm not surprised that OP got good results. That said, I've also found it's pretty easy to beat GPT-4 across many task types if you have a way of getting strong training data. We published some research[1] a week ago where we found that across 4 example tasks spanning creative summarization, question answering, data extraction, and classification, a fine-tuned Llama 3 8B was able to outperform GPT-4 on 3 of them. The key was to create a repeatable way of generating high-quality training data, which is also addressed in the post.

[1]: https://openpipe.ai/blog/mixture-of-agents

Is this something that, as a tech enthusiast who's no expert, I can easily fine-tune and run?

My use case would be fine-tuning on technical docs: specific news, 2 years of blog posts, primary source material, and Twitter explainer threads. I want to gather all the niche information on a topic from the last two years, dump it into this, and have an LLM that is a subject-matter expert.

Fine-tuning doesn't quite work that way. You have to format the training dataset as request/response pairs. The idea of fine-tuning is to get the model to output things in a specific format, style, or structure.
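For illustration only, here's roughly what a single training example looks like in OpenAI-style chat format (the contents are made up; each example gets serialized as one JSON object per line in a JSONL file):

    # One hypothetical request/response training example, OpenAI chat format.
    example = {"messages": [
        {"role": "user", "content": "Extract the event date as JSON from: KABUL, Afghanistan (Feb. 12) ..."},
        {"role": "assistant", "content": '{"start_date": "2011-02-08"}'},
    ]}

Hundreds or thousands of pairs like this teach the model an output shape and style, not new knowledge.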

Your use case is better suited to RAG. This is where you retrieve data from a large dataset and inject it into the user's request so the AI model has the context it needs to answer accurately.

But that's not a silver bullet: you would need to spend significant time on chunking strategy and result ranking to get decent response accuracy.
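For a rough idea of the mechanics, here's a minimal RAG sketch. embed() and llm() are hypothetical stand-ins for whatever embedding and completion backends you use; a real system would replace the naive cosine ranking with proper chunking and reranking:

    import numpy as np

    def retrieve(query, chunks, chunk_vecs, embed, k=3):
        # Rank pre-embedded document chunks by cosine similarity to the query.
        q = embed(query)
        sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(-sims)[:k]]

    def answer(query, chunks, chunk_vecs, embed, llm):
        # Inject the retrieved chunks into the request as context.
        context = "\n\n".join(retrieve(query, chunks, chunk_vecs, embed))
        return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")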

Here is an example of the Predibase platform, referenced in the article for the Solar model, but which can also train Llama-3, Phi-3 and Mistral: https://www.youtube.com/watch?v=R2JQhzfaOFw&themeRefresh=1 I think you can assess for yourself whether it's easy enough for you. (Predibase founder here)
Why isn't someone providing a "meta model" that uses an LLM to choose between various fine tuned models depending on the question to get overall better results than gpt4?
Founding AI Engineer at OpenPipe here. Using a fine-tuned "router LLM" to route between various specialized models (often, but not necessarily, fine-tuned themselves) depending on the input is becoming a common pattern in more modern "graph-like" LLM applications.

See LangGraph's "conditional edges" concept here: https://langchain-ai.github.io/langgraph/concepts/low_level/...

You can see how that "routing function" could include a call to a "Router LLM." And yes, fine tuning is a great method to better improve the routing intelligence of said Router LLM.

Great question btw!
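To make the pattern concrete, here's a minimal sketch; call_model() is a hypothetical stand-in for your inference client, and the route labels and model names are invented:

    # Map intent labels to specialized (possibly fine-tuned) models.
    ROUTES = {"extraction": "ft-extractor-v1", "summarization": "ft-summarizer-v1"}

    def route_request(user_input, call_model):
        # A small (ideally fine-tuned) router LLM classifies the request...
        label = call_model(
            "router-llm",
            f"Classify this request as one of {sorted(ROUTES)}. Reply with the label only.\n\n{user_input}",
        ).strip()
        # ...then the request is dispatched to the matching specialized model.
        return call_model(ROUTES.get(label, "general-model"), user_input)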

Worth mentioning that you don’t even need separate models to implement this. Dynamically loading LoRA adapters is much more efficient, and is the approach Apple took.
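As a sketch of what that looks like with Hugging Face's peft library (the adapter paths and names here are hypothetical):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Load the shared base model once...
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    # ...then attach lightweight task-specific LoRA adapters on top of it.
    model = PeftModel.from_pretrained(base, "adapters/extraction", adapter_name="extraction")
    model.load_adapter("adapters/classification", adapter_name="classification")
    model.set_adapter("classification")  # switch tasks without reloading the base weights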
Already a big thing. See the constellation architecture used here:

https://arxiv.org/html/2403.13313v1

Very loosely, isn’t this what is happening inside most LLMs that have a “multi-head” mechanism?
Check out https://unify.ai/chat if you're interested in a router optimised for cost/ttft/performance for commercial language models.
Is using model responses to train a new model against the ToS for the major LLM providers (OpenAI, Anthropic, etc)?
There doesn't seem to be any restriction like that in OpenAI's terms.
There is: "you may not... Use Output to develop models that compete with OpenAI"

(from https://openai.com/policies/terms-of-use/)

Thanks, I'd missed that.

I suppose the Output could be washed by publishing it on the web and having another entity crawl it.

OpenAI doesn't treat anyone else's content any differently, acting like it's all fair game, so why should we care?

Data laundering. What a time to be alive.
It seems like you do not work for OpenPipe (OP), so it probably doesn't matter for you, but it could (should) matter a whole lot for OpenPipe and/or their customers.
This is entirely unsurprising and in line with the finding that even small specialized models do better at information extraction and text classification. So it's no wonder finetuned large LMs do well too.

Personally, my PhD did fine-grained ACE-like event and sentiment extraction, and "small" specialized finetuned transformers like BERT and RoBERTa-large outperformed prompted LLMs. Would love to see small-model scores included alongside some SOTA pipelines.

This is great work anyway even if it replicates known results!

The caveat here is that if you don't know how to create good specialized models, you are just wasting everyone's time and money:

https://www.threads.net/@ethan_mollick/post/C46AfItO8RS?hl=e...

Exactly, BloombergGPT performed worse on financial sentiment analysis than much smaller fine-tuned BERT-based models.

For many extractive tasks BloombergGPT was quite disappointing. A 5-10% performance hit with much larger inference cost compared to smaller models is not desirable.

But for Bloomberg, the research investment makes sense as a risk worth taking: a do-it-all generative model can mean a significant reduction in maintenance complexity and deployment overhead.

It didn't directly pay off for many extractive tasks, but I bet they're iterating. Bloomberg has the data moat and the business needs in their core products to make it worthwhile.

Your thesis sounds interesting! Do you have a link to it by any chance?
rovr beat me to it below. Here are more links: https://jacobsgill.es/phdobtained (fun fact: because my thesis contains published papers, I am in breach of a few journals' copyrights by uploading my own thesis pdf, but fuck 'em).

LLM approaches were evaluated on my own time but not published (I left research after obtaining my PhD).

> because my thesis contains published papers, ..., but f 'em

Excluding the part in the middle because I don't wanna repost potential issues for you. I just wanted to comment that that is terrible. People often talk about the siloed nature of research in industry, without considering that academia supports the draconian publishing system. I understand IP protection, but IP protection doesn't have to mean no access. This is such a huge issue in the bio- world (biostats, genetics, etc).

I don't know your circumstances but often you retain the right to distribute a "post print", ie the final text as published but absent journal formatting. A dissertation should fit that definition.
This is indeed often the case; however, my university reviews each thesis, and deemed mine could only change to open access in 2026 (+5 years from defense).

I think this is the default policy here for theses based on publication agreements.

In any case, I am not too worried.

Thank you for the link! And congratulations on obtaining your PhD

I have skimmed through it and it's truly amazing how good annotation of the dataset can lead to impressive results.

I apologise in advance if the question seems ignorant: The blog post talked about fine-tuning models online. Given that BERT models can run comfortably on even iPhone hardware, were you able to finetune your models locally or did you have to do it online too? If so, are there any products that you recommend?

Thanks! The fine-tunes were done in 2019-21 on a 4xV100 server with hyperparameter search, so thousands of individual fine-tuned models were trained in the end. I used Weights & Biases for dashboarding the hyperparameter search, but the hardware was our own GPU server (no cloud service used).

I doubt you can fine-tune BERT-large on a phone. A quantized, inference-optimised pipeline can be leaps and bounds more efficient and is not comparable with the Hugging Face training pipelines on full models I ran at the time. For non-adapter-based training, you're ideally going to need GPUs.

This is really cool -- thanks for posting it! I'll have to skim through it at some point, since a lot of my work is in classification models and mirrors the results you've seen.
Seconded! Any URI to your PhD?
Thanks for putting in all this work and sharing it in such detail! Data extraction/structuring data is the only serious application of LLMs I have actually engaged in for real work and found useful. I had to extract data from experience sampling reports which I could not share online, thus ChatGPT etc. was out of the question. There were sentences describing onsets and offsets of events and descriptions of what went on. I ran models through llama.cpp to turn these into CSV format with 4 columns (onset, offset, description, plus one for whether a specific condition was met in that event or not, which had to be interpreted from the description). Giving some examples in the prompt of how I want it all structured was enough for many different models to do it right. Mixtral 8x7b was my favourite because it ran the fastest at that quality level on my laptop.
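Something like this sketch of a few-shot prompt, with an invented report and the columns mirroring the description above; it should work with llama.cpp or any other completion backend:

    # Hypothetical few-shot prompt: one worked example, then the report to structure.
    PROMPT_TEMPLATE = """\
    Convert each report into one CSV row: onset,offset,description,condition_met

    Report: Felt anxious from 9:15 until about 9:40 while commuting; the condition was not met.
    CSV: 09:15,09:40,anxious while commuting,no

    Report: {report}
    CSV:"""

    prompt = PROMPT_TEMPLATE.format(report="...")  # fill in the next report's text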

I am pretty sure that a finetuned smaller model would be better and faster for this task. It would be great to start finetuning and sharing such smaller models: they do not really have to be better than commercial LLMs that run online, as long as they are at least not worse. They are already much faster and cheaper, which is a big advantage for this purpose. There is already a need for these tasks to run offline when one cannot share the data with OpenAI and the like. Higher speed and lower cost also allow for more experimentation with more specific finetuning and prompts, with less worry about prompt token lengths and cost. This is an application where smaller, locally run, finetunable models can shine.

> Data extraction/structuring data is the only serious application of LLMs

I fully agree. I realized this early on when experimenting with GPT-3 for web data extraction. After posting the first prototype on Reddit and HN, we started seeing a lot of demand for automating rule-based web scraping stacks (lots of maintenance, hard to scale). This eventually led to the creation of our startup (https://kadoa.com) focused on automating this "boring and hard" problem.

It's in such relatively unexciting use cases that AI adds the most value.

AI won't eliminate our jobs, but it will automate tedious, repetitive work such as web scraping, form filling, and data entry.

The way you cut that quote turns it into an assertion that doesn't exist in the parent post.

They didn't make the (incorrect) statement that no other serious, useful application exists.

But that's how it reads when you cut off before "I have actually engaged in for real work and found useful"

To be fair the original sentence could still be implying the same thing. The second half of the sentence just sounds like a hedge.
Well, I was talking precisely about things I have engaged with professionally. Obviously this cannot cover everything one may do, e.g. I do not build chatbots for customer service or anything like that, so I obviously cannot speak for all possible applications of LLMs and how useful they may be. I am pretty sure there will be useful applications in fields I am not and will not be engaged in, as nobody engages with everything. However, some other things that I have tried (e.g. copilots, summarising scientific articles) imo create much more hype than real value. They can be a bit useful if you know what to actually use them for and what their limits are, but nowhere close to the hype they generate, and I just find myself googling again tbh. They are absolutely horrible especially with more niche subjects and areas. On the other hand, data extraction and structuring has quite universal application, has already demonstrated usefulness and potential, and seems a quite realistic, down-to-earth application that I am happy to see other people and startups working on. Not as fancy, and harder to build hype upon, but very useful regardless.
Thanks! Yes, one 'next step' that I'd like to do (probably around the work on deployment / inference that I'm turning to now) will be to see just how small I can get the model. spaCy has been pushing this kind of workflow (models on the order of tens of MB) for years and it's nice that there's a bit more attention to it. As you say, ideally I'd want lots of these tiny models that were super specialists at what they do, small in size and speedy at inference time. As I hinted towards the end of the post, however, keeping all that updated starts to get unwieldy at a certain point if you don't set it all up in the right way.
And that’s the point of fine tuning models.

Still good to see someone walk through their fine tuning process, with a mix of hosted and local options.

On that note: is there a good service for “here’s my dataset”, please fine tune these 9 models and give me evaluation stats?
OpenPipe - https://openpipe.ai/ - is probably the service that most closely resembles what you're asking for, but I found the evals weren't really what I wanted (i.e. following my custom evaluation criteria), so you will probably end up having to do that yourself anyway. But for the finetuning, they're all somewhat the same. Predibase and OpenPipe are two good options for that. Predibase has more base models for you to finetune, but it's a bit more unwieldy to work with. I wrote about that in a previous post here -- https://mlops.systems/posts/2024-06-17-one-click-finetuning.....
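For what it's worth, a custom eval loop can be quite small. A sketch, assuming a predict() callable that returns a JSON string and rows with an expected start_date field (all names invented for illustration):

    import json

    def accuracy(rows, predict):
        # Exact-match on one JSON field; swap in whatever criteria your task needs.
        hits = sum(
            json.loads(predict(row["input"])).get("start_date") == row["start_date"]
            for row in rows
        )
        return hits / len(rows)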
(Disclaimer: founder of OpenPipe). Thanks for the shout-out. Note that we're actively working on improved evaluations that will let you add more specific criteria as well as more evaluation types, like comparing field values to that of a golden dataset. This is definitely something that customers are asking for!
Wild to see them advertising collecting GPT-4 responses for training other models. That's definitely not allowed by the ToS. I suspect many do, but front-page advertising is another thing entirely.
Predibase ( http://predibase.com ), also referenced in the article, is a platform specifically designed for exactly that. It also has "repos" for finetuning multiple models, comparing their performance, and keeping things organized. It also allows you to query any of the finetuned models on the fly from a single GPU with multi-LoRA serving. (Predibase founder here)
Together.AI is a good starting point. Even though I'm not sure what fine-tuning method they're using, the results are REALLY good.
As I understood it, the point was not that they fine-tuned a model and it got better.

They took a much simpler model, fine-tuned it, and managed to beat a far more advanced model.

When jumping from 7B parameters to 70B to 400B (or whatever GPT-4 uses), most of the additional neurons seem to go towards a better world model and better reasoning (or whatever you want to call the inference of new information from known information). There don't seem to be any major improvements in basic language skills past 7B, and even 1B and 3B models do pretty well on that front.

In that sense it's not that surprising that on a pure text extraction task with little "thinking" required a 7B model does well and outperforms other models after fine tuning. In the "noshotsfired" label GPT-4 is even accused of overthinking it.

It is interesting how finetuned mistral-7b and llama3-8b outperform finetuned gpt-3.5-turbo. I would tend to attribute that to those models being newer and "more advanced" despite their low parameter count, but maybe that's reading too much into a small score difference.

Re: 7b models vs gpt-3.5, I’m guessing different fine tuning parameters can account for the difference. The OpenAI fine tuning is a black box.
That’s still the point. That model now does exactly one thing, and because of that can do better than a model 50x the size that tries to do everything. It will crush it in instruction following and consistency.

A fine tuned 500b parameter model would probably beat the fine tuned 7b model, but only by a bit (depending on task obviously). A lot of that capacity is being used for knowledge, and isn’t needed for extraction/classification tasks. Fine tuning isn’t touching most of those weights. The smaller models need to focus on more general language skills, not answering “describe the evolution of France’s economy in the 1800s”.

Thanks for sharing this; it's well written and informative. I noticed you used 'temperature=1' in the GPT test for the example in the post. Is this best practice for a task requiring structured output? Have you tested other temperature settings? My casual understanding was that a temperature of 0 is best for these types of workloads, while higher temperatures are more effective for more 'creative' workloads.
I followed whatever the guidance was for a specific model. Some of the LLM finetuning providers did indeed set the temperature to 0 and I followed that, but others suggested 1. I could probably iterate a bit to see what is best for each model, and I might well do that for the one that I choose as the one I’ll be doubling down on in subsequent iterations / finetunes. Thanks for the suggestion!
GPT models shouldn't be used at temp 1 unless you only care about creative writing. They get much worse at factual stuff and code than with lower temperatures. And yes, 3.5 Turbo is less affected by this, which might be why the models' relative performance came out reversed for you.
For GPT, I would really urge you to try again with 0. A temperature of 1 kind of starts to force it to fail.

I would say this actually invalidates the whole thing.

You never use 1 for stuff like this. 1 is for poetry and creative writing. You need to redo this with temp=0 imo.
1. It would be nice to see examples where GPT-4o was inaccurate but the best-performing models were accurate.

2. It would be nice to try again with 0 temperature, as I do a lot of structured data extraction. In my experience 0 temperature should always be used, and it can make a huge difference. A temperature of 1 essentially means it will start to pick lower-probability tokens, which are less likely to be accurate...
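For reference, pinning the temperature with the OpenAI Python client looks something like this (the model name and prompt are placeholders, not what the post used):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # favour the highest-probability tokens for structured output
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": "Extract start_date as JSON from: ..."}],
    )
    print(resp.choices[0].message.content)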

Agree, temp 0 would be interesting to compare for this use case where there's a clear right and wrong answer based on historical data. We experimented with temperature for our AI SQL editor and found 0.3 to be ideal so it can still self-heal when errors appear (which happens more often closer to 0, because you're optimizing for correctness).
For anyone interested, we wrote a paper on a similar topic: https://www.nature.com/articles/s41467-024-45563-x
Really interesting. Could the potentially controversial content of the target news article have an effect on ChatGPT's ability to summarize it?
I use LLM information extraction for financial news articles with Azure OpenAI, and it is a huge problem for me.

404 Content moderation response in 4% of articles. This is just financial news text.

It is a prime reason we are considering open models.

I think not. Normally if you get those kinds of errors you wouldn’t get any output at all. In the blog I show that all 724 of the test cases got proper JSON output etc for the queries so I don’t think this was an issue. I think these kinds of topics would have been well covered in the training data, and probably the OSS models would have used similar data so I don’t even think there’s a disparity to be found between proprietary vs OSS models here.
>Normally if you get those kinds of errors you wouldn’t get any output at all

I am not sure; I disagree. If there is a pro-ChatGPT user, I'm probably it.

I've often seen it put significantly less effort into answering the question.

Interesting. I can maybe try finetuning one or two of the so-called 'uncensored' open models and see if that makes a difference. A bit harder to switch out the dataset completely, as that's really what I'm interested in :) I think the general point that finetuning a model for some custom task works is fairly uncontroversial, but if OpenAI's poor performance was on account of these kinds of guardrails it'd be yet another reason someone might want to finetune their own models I guess.
At the risk of sounding like an old head:

Seems to me then, priority one should be "free and open source all the models as hard as possible, so that EVERYONE can fine-tune."

(This being a subset of the idea of, free / open source is generally preferable for both freedom and quality)

It seems to me this means whoever has hoarded and declared ownership of the most personal data will make the best products. Kinda like how some people liked their targeted ads because they’re more “relevant”, only now it’s not just ads but useful products. Another winner is of course platform owners like Apple and Microsoft who can scrape your data off their apps and products, even locally. This is a much bigger edge than being 3-6 months ahead in model quality.

I despise the centralization of this tech as well, and while it's heartening that smaller fine-tuned models are better, they won't win (or barely stand a chance) on the virtue of openness and privacy alone. The best we can hope for is proliferation in the small-to-medium-sized business service space: that OpenAI tokens are not worth the extra expense if open models are commoditized and effective. This was probably Zuck's plan all along: to prevent centralized gatekeepers in tech that mainly benefit his rivals. But the enemy of my enemy is my friend, so his actions may be the best thing he's ever done for the public good.

Your end point I think is exactly right.

I think your first one is getting downvoted hard because your first sentence is not at all how any of this works.

Sucking down personal data isn't JUST a bad idea for privacy, it's actually also bad for "making the best products," I think you're overstating the extent to which all that data that is stolen and sold to the highest bidder actually helps the company buying it?

Ah, thanks for pointing that out. I don't care much for LLMs at all, but my point was simply that whoever has data, and especially personalized data, has an upper hand in making LLMs into a better end-user product, for those who like them. This may be underestimated right now, when most dick measuring compares model to model, not integration into a product.

> data that is stolen and sold to the highest bidder

Didn't necessarily mean the data brokers (although that's an interesting angle), but if, say, Apple now has a bunch of info about your calendar, email, and contacts, then clearly they have an upper hand in providing better products than an anonymous API call. Not all products need personalization, but LLMs? I can think of tons of use cases.

At Predibase, we recently conducted 700+ fine-tuning experiments to benchmark the performance of popular open-source LLMs across 30 tasks and compared their results to GPT-4.

85% of the time they beat GPT-4.

You can see the results here: https://predibase.com/fine-tuning-index.

The site has a series of interactive charts and a link to our Arxiv paper.

I took a look at a random row to try to find why mistakes were happening.

Why is this one labelled with start_date: 2011-02-07?

> Afghan, Coalition Forces Clear Northern Kandahar ISAF Joint Command - Afghanistan 2011-02-D-081 For Immediate Release KABUL, Afghanistan (Feb. 12) – Afghan and coalition forces set out to provide security and assist the local population during a clearing operation in a remote village in Shah Wali Kot district, Kandahar province, Feb. 8. District Chief of Police Bacha Khan, and his policemen; Afghan commandos from 2nd Company, 3rd Commando Kandak, along with U.S. service members from Special Operations Task Force – South, searched the village throughout the day and detained 20 suspected insurgents. Also found were 80 pounds (36 kilograms) of homemade explosives and various improvised explosive device-making materials. Leading a squad during the operation was Afghan commando Sgt. Hafiz Rahman, who said this operation has shown him progress. “The people are respecting us,” Rahman said. “They ask us if we want tea, or ‘do we want bread?’ They are thankful for the security.” Children during the operation brought commandos blankets in the evening and offered them food throughout the day.

Trying to find the source, I'm also not seeing any indication of Feb 7.

https://www.dvidshub.net/news/65238/afghan-police-commandos-...

---------------

And why is this one labelled as Mar 6? GPT-4o and I personally find Mar 7 to be the logical answer.

ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces targeted a Taliban district chief, killed one insurgent and detained several others during an operation in Burkah district, Baghlan province, yesterday. The Taliban district chief maintains ties to Taliban senior leadership throughout Kunduz, Baghlan, and Takhar provinces. He is involved in purchasing weapons and IEDs. Intelligence reports led the security force to the targeted compound in the city, where Afghan forces called for all occupants to exit the buildings peacefully before conducting a search. During that time, an armed individual threatened the security force and the force returned fire, killing him. Several suspected insurgents were detained after initial questioning at the scene.

But despite that the "finetuned" model also gets Mar 6. How does the finetuned model get Mar 6?

I'm most excited about getting a faster model. A model like GPT-4 can be overkill because it's too slow. What are the smallest fine-tuned models that could beat a GPT-4 model? Is it 7B, or could a 3B model like Phi-3 do well for tasks like classification and summarization?
Eventually people will realize any underdetermined system of equations has infinitely many solutions. Give me any open source AI model and I will beat any SOTA benchmark. Why am I so confident? Because curve fitting can be applied to any data set to get as good of a result as needed. Combine this approach with mixtures of "experts" and any predetermined set of benchmarks will fall to a curve fit to the benchmark.

The hype is really getting tiresome. There is no way to get from here to any intelligent system with the current techniques. New breakthroughs will require insights into discrete spaces which are not amenable to curve fitting with gradient descent.

I'd be interested to see how well these fine-tuned models compare to Claude 3 Haiku (or one of the more expensive Claude models) with a larger set of examples.

The Claude models all have a 200,000 token limit and respond _really_ well to examples - you can feed them in as chat JSON message pairs of user input / ideal assistant output.

Haiku is dirt cheap for this kind of thing and with 200,000 tokens you can probably provide a dozen or so examples.
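A sketch of that pattern with the Anthropic Python client; the example pair and final prompt are invented, and you should check the current model IDs:

    import anthropic

    client = anthropic.Anthropic()
    # Few-shot examples go in as ordinary user/assistant message pairs.
    few_shot = [
        {"role": "user", "content": "KABUL, Afghanistan (Jan. 8, 2013) ..."},
        {"role": "assistant", "content": '{"start_date": "2013-01-07"}'},
    ]
    resp = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=few_shot + [{"role": "user", "content": "KABUL, Afghanistan (Jan. 25, 2013) ..."}],
    )
    print(resp.content[0].text)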

Did you release the dataset and the code for testing? It would be interesting to check how 3.5 Sonnet performs on this task.
The dataset is here:

https://huggingface.co/datasets/strickvl/isafpressreleases_t...

but when looking for rows where GPT-4o was deemed inaccurate, it seems to me the label was wrong, or at least it wasn't possible to infer that label from the input text. Yet the finetuned model was able to predict it.

Which makes me wonder whether the finetuned models are poisoned with eval data...

See this one:

> ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces targeted a Taliban district chief, killed one insurgent and detained several others during an operation in Burkah district, Baghlan province, yesterday. The Taliban district chief maintains ties to Taliban senior leadership throughout Kunduz, Baghlan, and Takhar provinces. He is involved in purchasing weapons and IEDs. Intelligence reports led the security force to the targeted compound in the city, where Afghan forces called for all occupants to exit the buildings peacefully before conducting a search. During that time, an armed individual threatened the security force and the force returned fire, killing him. Several suspected insurgents were detained after initial questioning at the scene.

It says "yesterday" on March 8, so you would assume March 7 is the correct start_date, but it's labelled Mar 6, and the finetuned models get it "right", while GPT says Mar 7.

I was wondering if there was some info in the bizarrely formatted date, but I think 022 is just the issue number: https://www.dvidshub.net/news/66703/correction-isaf-joint-co...
Also, a lot of the wrong dates seem to be due to only having those formats, which does make me wonder again how the fine-tuned models get them right unless they have been fine-tuned using eval data...
Props to the author for releasing the data. My instinct is also to immediately suspect data leakage. It's super easy for this to happen. For example, the original dataset could contain multiple articles about the same event.
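One quick way to sanity-check that suspicion is to look for near-duplicate train/test pairs. A rough sketch using TF-IDF cosine similarity (the 0.8 threshold is a guess, not a standard):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def flag_possible_leaks(train_texts, test_texts, threshold=0.8):
        vec = TfidfVectorizer().fit(train_texts + test_texts)
        sims = cosine_similarity(vec.transform(test_texts), vec.transform(train_texts))
        # Report each test row whose closest train row exceeds the threshold.
        return [(i, int(sims[i].argmax()), float(sims[i].max()))
                for i in range(len(test_texts)) if sims[i].max() > threshold]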
We got very similar findings: we published a paper showing that smaller LLMs (3-7B), when finetuned with LoRA, can match or outperform GPT-4 on a variety of tasks (29 out of 31) including classification, summarization, info extraction, and "reasoning". https://arxiv.org/abs/2405.00732 (Predibase cofounder and coauthor of the paper)
Why would you set temperature=1 for this task?
What is a good fine-tuning script for Mistral and LLaMA3 on an A100?
Depends a bit on where you're running it, etc. This works on Modal, e.g., but they're just using axolotl under the hood, so you can connect to whatever cloud provider you're using and run axolotl directly. I did my finetunes across local GPUs, but it would have been just as easy to do in a cloud environment using the same axolotl config.
Unsloth is a great tool, super fast.
But it's still single-GPU only for now. I also heard great things about it, but wanted to make maximum use of my multi-GPU local setup.
Last time I emailed them, I think they said they're trying to release Pro in July... so it should be coming soon.
Anything beats GPT-4 nowadays, to be honest.
Remember folks there is no free lunch :)
Here are some test data samples and corresponding closest train data rows to give you an idea of the task complexity.

---

Test 1: KABUL, Afghanistan (Jan. 25, 2013) During a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.

Train: KABUL, Afghanistan (Jan. 8, 2013) – During a security operation in Washer district, Helmand province, yesterday, an Afghan and coalition force killed the Taliban leader, Mohammad Sayed, and one other insurgent. Mohammad Sayed distributed weapons and ammunition to Taliban fighters. Prior to his death, Sayed was attempting to acquire rockets for attacks targeting Afghan government officials in the province.

---

Test 2: For Immediate Release

KABUL, Afghanistan (Aug. 6, 2012) Afghan and coalition forces conducted a security operation in search of a Haqqani leader in Tsamkani district, Paktiya province, yesterday. During the operation the security force engaged a group of insurgents with a precision airstrike. After the strike, the Afghan and coalition security force conducted a follow-on assessment and confirmed several insurgents had been killed in the strike. They also confirmed the strike had not injured any civilians or damaged any civilian property.

Train: For Immediate Release

KABUL, Afghanistan (July 22, 2012) — Afghan and coalition forces conducted a security operation in Muhammad Aghah district, Logar province, Saturday.

During the operation, a group of armed insurgents were engaged with a precision airstrike. After the strike, the Afghan and coalition force conducted a follow-on assessment and confirmed multiple insurgents had been killed.

The security force also confirmed the airstrike had not injured any civilians or damaged civilian property.

---

Test 3: ISAF Joint Command Morning Operational Update March 24, 2011 ISAF Joint Command - Afghanistan 2011-03-S-081 For Immediate Release KABUL, Afghanistan (March 24, 2011) A separate Afghan and coalition security force targeted a Taliban IED cell leader in Kandahar today. The leader is responsible for planning, preparing and executing explosive-device attacks on Afghan civilians, Afghan and coalition security forces. The joint security force targeted the leader’s suspected compound in Kandahar City based on tips from citizens. The security team contained the area and detained several suspected insurgents. There were no shots fired and no damage done to the targeted compound.

Train: ISAF Joint Command Operational Update Dec. 22 ISAF Joint Command - Afghanistan 2010-12-S-267 2699, 2935, 3022, 3078 For Immediate Release Download PDF KABUL, Afghanistan (Dec. 22) – Several insurgents were killed by Afghan National Security and International Security Assistance Forces in separate clearing operations in southern Afghanistan over the last 24 hours. An Afghan Army and ISAF patrol spotted some insurgents emplacing an improvised explosive device in Sangin district, Helmand province today. After gaining positive identification, combined forces engaged the enemy position, killing two insurgents.

Clickbait headline
1) beat at what? 2) do they beat Claude 3.5 Sonnet?
Have you tried clicking on the link and finding out?
Just at the task of structured data extraction.

So the title is very misleading.

> So very misleading title

Eh, I can see that, but to me "finetuned model" pretty strongly implies some specific task

Did you read the article or just the title? It is all explained there.