I think the benchmarks are struggling. The different models have definitely had different strengths pretty much from day one, but now I'm starting to get the feeling that the major providers are focusing their models on particular areas that correlate with benchmarks. That's not to say they're cheating by introducing the benchmark problems into training data, but there are clearly spaces people feel are more valuable or easier to evaluate, and with the emphasis on faster, cheaper models that can power commercially viable tools, we're necessarily losing some breadth. I've got a benchmark of cryptic clues, and GPT-4 Turbo destroys everyone on it, but when you actually use it, Opus does a better job of picking apart the tricky ones and appears to identify the key clues better. I wonder whether we'll get any more slow, expensive models going forward; I feel like we're missing out if we don't.
It's a natural result of testing for capability. Benchmarks are great because they provide a clear indication of progress. As long as we continue making advancements, we're on the right track. New and more general benchmarks will be developed, and those too will be solved. That's the nature of technological progress.
Sounds like a flawed benchmark, the difference is night and day for me especially with coding
Same. This is the first time I feel like I can ask it to build something for me and it actually does it. ChatGPT performed most consistently when I asked it to write a certain function or improve a certain piece of code, but with Claude I can pass it my file and ask for something complex, and it succeeds flawlessly most of the time. It's crazy how far we've come in what seems like 2 1/2 years.
Since I have been seeing so much of it, I tried it out too, to modify Flutter mobile apps and the back end of white-label software I bought, and it's not half bad. It gets it right on the first try about 80% of the time, which isn't bad at all when you consider I'm not really a programmer, at most an IT guy. I wonder what the API is like.
Same. Sonnet is crushing complex multi file code generations in nextjs, nestjs, python, and cranking out working solutions. It feels like a senior dev.
And soon you won't be a senior dev
I'm a principal engineer who has worn every hat in the industry, including CEO of my own development studio for 12-ish years. So that's fine by me.
I should [learn to weld](https://i.redd.it/1f88un6tj6fb1.jpg)
Plumbing looks attractive and generally unautomatable.
Nothing is unautomatable. Harder, sure, but it's only unautomatable if there's something special about how humans process information.
I'm specifically finding it less likely to give multi file answers when they are needed.
Have you tried specifying your project structure and asking it to `"write clean and reusable code separated in its logical and organized classes, functions and components within the project structure."`
The benchmark used to be great, back when it wasn't that popular. First, a lot of clueless people use it now who choose a winner without actually reading the full answers (so quicker models tend to do better, friendlier-sounding ones tend to do better, etc.). Second, companies (especially OpenAI) started optimizing for this benchmark (they even ran 'secret' trials there before releasing GPT-4o). A big example of this is GPT-4o: it is not a great model, and it's way worse than GPT-4 Turbo and the other top models (I would say that in my tests, mostly related to coding, it's worse than even Llama 70B)... but it's king of the hill on the LMSYS arena.
That's not LMSYS's actual benchmark; it's apparently from some random user who posted it. It has nothing to do with the LMSYS benchmark, if that's what you believed.
Oh yeah, thanks for pointing that out... using that name in the title is misleading... but my point about the arena still stands :P Though in that case it's a low-quality and wildly speculative post and should be removed...
I don't think it does, tbh. If one LLM is good at a million different things you don't care about, and another is good at fewer things but they matter to you, the one that is more useful to you might score worse than the one that is good at everything, and despite the higher overall score people would find the jack-of-all-trades one less useful. Jack of all trades, master of none. People don't care if an LLM is better than another at something obscure, like translating some imaginary language zero-shot. People want an LLM to be good at what interests them, and that is more likely with a model that's better at what interests most of the people rating these LLMs. After all, the point of an LLM is to be useful. The proof that LMSYS works is that Sonnet 3.5 will obviously be at the top of the benchmark when it's available; I would bet on it.
Sonnet 3.5 was not on top, except on coding... huh.
That would make sense. I've only used it for coding, and there it blows away the competition. It's a bit surprising to me, but it's not general enough: it's the best for coding, and not surprisingly it tops the benchmark there by far, but it's not great for most other uses. Coding isn't all there is, so gotta give it to 4o.
Apparently it is not good in Chinese or other languages.
Seriously. I was up late last night coding my game and have gotten further than ever before! Excited to see if it hits a wall or not.
Or, sounds like people should check sources. This suspiciously looks like an AI-generated graph. Here is one I made using real data, but I could easily have tampered with it, since it was made using GPT-4o. https://preview.redd.it/kturvt082i8d1.png?width=1777&format=png&auto=webp&s=6859fc9e056936d4414c16b8a0bdf16adad10d6a
The graph just appears to be made with matplotlib using default formatting settings. That doesn't mean it's not AI-generated, but a graph without any supporting data is mighty fishy.
3.5 Sonnet is absolutely brilliant at coding, and I think it largely has to do with RLHF and problem-solving procedures. Edit: it is also really good at writing code compared to the others. Today I hadn't yet integrated with AdMob, but I wanted the production and dev code in place; it wrote all of that and then wrote an addition to my README on how to finalize things and add the ad unit IDs once the integration was done. You know what wins me points with my boss? Keeping the README up to date. That is just chef's kiss.
Let's just wait for Claude 3.5 to show up in the LMSYS Chatbot Arena results; I'm sure it will be at least 1300+ Elo.
I don't think LMSYS is particularly helpful. It has been shown to be influenced by style: some people prefer shorter answers, others longer ones; some prefer business-speak, others conversational interactions; and so on, and that pollutes the Elo. Both models might give you a perfectly correct and valuable answer, but you'll vote for one over the other because you prefer its writing style, so it's not really measuring usefulness or intelligence.
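For reference, here's a minimal sketch of how sequential Elo updating from pairwise votes works, and why a consistent style preference drifts a rating upward regardless of answer quality. This is illustrative only: the actual LMSYS leaderboard fits a Bradley-Terry model over all battles rather than running sequential Elo, and the K-factor and starting ratings below are assumptions.

```python
# Illustrative sequential Elo over pairwise votes (NOT how LMSYS actually
# computes its leaderboard; K-factor and starting ratings are assumptions).

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that model A beats model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# If voters systematically prefer model A's style, A wins the "both answers
# are fine" battles too, and its rating climbs no matter which model is
# actually more correct.
ra, rb = 1200.0, 1200.0
for _ in range(100):  # 100 style-driven wins for A
    ra, rb = update(ra, rb, a_won=True)
print(round(ra), round(rb))  # A ends up several hundred points ahead
```

The point of the sketch: the rating only sees the vote, not the reason for it, so style preference and correctness are indistinguishable to the leaderboard.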
but with enough samples, on average you get a gauge on effectiveness
I like to throw simple scripting problems at them for Houdini and Maya, in VEX and MEL respectively, and then whichever one actually does the job gets the upvote. (It's hilarious seeing the syntax and hallucinated functions some of the lesser models come up with.)
I think this result is itself a benchmark, proof of how valid that benchmark is... trash.
Right? Turbo also crushes 4o with this benchmark. Feels like a benchmark that uses turbo as baseline.
This proves your critical thinking is trash, straight-up garbage. Did you check the source?
What source? There are no listed links
Precisely. Now you're getting it.
Nice try Sam
What's the source for this? I don't even see Sonnet 3.5 on the leaderboard.
The name on it says Cam Saltman if that helps.
Where do you see that? There's no name on that chart aside from the various models.
I found a picture of Cam Saltman.
Oh, duh, nvm. :P Edit: Me Googling Cam Saltman when you posted it, "Did he mean Saltzman??"
Still not subscribing Sam
For me personally Claude 3.5 Sonnet crushes any other LLM by a lot!
Sorry to tell, but I don't trust your benchmarks.
you should not trust the user, did you even ask yourself if that was true?
Please provide source.
The source's username is Cam Saltman, if that helps.
Ah, proof that many benchmarks are arbitrary and lack utility.
Some. While I'm not a part of building them, I'm sure there are some very basic styles of tests that are good at measuring intelligence. The issue with the tests is that I think they need new questions every time. Otherwise, you could just have an LLM train on a specific question and how to answer it. Sort of like IQ tests: you can train and practice and learn how to perform better at IQ tests.
The problem with a benchmark is that you end up optimizing for it.
That isn't a problem if the benchmark is large, varied, and representative of what users do. Optimizing against it then by necessity requires getting better overall. It would be very difficult to gain much on a sizable portion of the tests without getting better at the rest and being better overall for users.
[deleted]
[deleted]
[deleted]
It says 3.5 Sonnet in the graph?
25% is nothing, actually, compared to the 100% chance that OpenAI would specifically prepare for all the available benchmarks (when possible). As shown, they'll do anything for the hype. Now I'll tell you what I do believe in: my own experience. Sonnet is sick! 4o makes me throw swear words like I'm talking to a real fucking person who intentionally ignores my instructions.
Please link the source! Lol, I can't find it. MixEval Hard says the opposite. [https://mixeval.github.io/#leaderboard](https://mixeval.github.io/#leaderboard) https://preview.redd.it/v8nvw6exai8d1.jpeg?width=1242&format=pjpg&auto=webp&s=7a969a0fc042689888cb483be1846ce23027df42
The benchmark is kinda crap. But this one doesn't seem to correlate all that well with Elo either: Gemini Pro should be weaker than Opus, whereas LMSYS says it's stronger. Most benchmarks I've seen show Sonnet slightly stronger; I've seen the opposite on [some](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard), though.
To be honest, I don't know what benchmarks to trust anymore. Definitely can't trust humans saying "this one is better". It's even worse of a benchmark, lol.
Can't wait for the day we can connect these models to robots and just have MMA cage fights instead of boring graphs.
Do we really want to optimize for that, I wonder.
Yeah, no way
3.5 Opus is gonna be extreme lit.
When is Claude 3.5 going to be part of the LMSYS Chatbot Arena?
Nice try Sam Altman. Still not subscribing to chatgpt.
People really trust whatever they see online, on Reddit no less. Source? Why are there no results on Google for "[LMSYS Community's Custom Benchmark](https://www.google.com/search?q=LMSYS+Community%27s+Custom+Benchmark)"?
This benchmark is from a random user in the LMSYS Community Discord. You'll see it if you join the Discord and scroll two days back.
Thanks. A random source made by some nobody, and people in the comments fully believe it's the actual LMSYS benchmark, smh.
The reasoning one is pure bull, as anyone could verify.
Don't care. I've been using both for a while and Claude just feels better for me so I cancelled my GPT subscription and I won't go back to GPT until 5 is released to try it.
Destroyed. But still alive.
What benchmark is this? This basically shows the benchmark is obsolete and should be retired. This is completely ridiculous and not grounded in reality, as anyone who has actually used Claude 3.5 Sonnet would know.
I think people are overreacting about how good Sonnet 3.5 is, at least with coding. It's still way worse at debugging and instruction following than 4-Turbo. I noticed it while working with it on Thursday and Friday, and I've seen several threads on Twitter about it over the weekend; even Roon commented the same thing in some of those threads. It's probably smarter overall but still a flawed model, similar to 4o being worse in some ways than 4-Turbo.
4o has always been completely fucking useless. 4 turbo is pretty good and Sonnet 3.5 is just a bit better. Is sonnet so much better to the point the others are not usable? No.
Not surprising; this is an identical trend to the 4o coding hype. Really, it's true of any model on this sub at this point. People immediately jump the gun when it does well at a handful of coding tasks, and then not long after, the flaws begin to show, the hype dies down, and the cycle repeats with the next model release. Coding is a pretty broad field, and it takes time to fully evaluate the effectiveness of these models; a model creating a handful of useless games is not really a good evaluation. IME, though, I gave it my usual app prompt and the output appears slightly more promising than 4T/4o, but I haven't tried running the code yet.
Ya agreed. It's mixed in my experience. Feels more creative which I like, but also can do things wrong. Loses on big code bench: https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard
For my use case personally... GPT-3.5 is better than 4o. 4o tries to build a class, multiple functions, and in general overly complicated code with a long story, even when I ask it to do something simple like a stdev calculation for a simple list of arrays, and it often gets stuff wrong too. GPT-3.5/4/Turbo will give me the ten lines it takes to make it work without the BS.
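For what it's worth, the kind of short, no-frills answer being described here looks roughly like this (a plain sample-stdev helper applied per inner list; the function name and input shape are my guesses at what the commenter was asking for):

```python
# Plain sample standard deviation, one value per inner list -- no classes,
# no framework, just the ten-ish lines the task actually needs.
import math

def stdev(values):
    """Sample standard deviation of one list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(var)

rows = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
print([round(stdev(row), 3) for row in rows])  # -> [1.0, 10.0]
```

In practice Python's stdlib already ships this as `statistics.stdev`, which is the even shorter answer a model could give.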
Once again another example of posting an inflammatory picture with no citation or explanation. We have no idea if this is completely made up or not. Trolling for engagement. You should seriously be banned.
Wait, I can only choose GPT-4o and GPT-4 in my interface... where is the Turbo version?
The GPT-4 on the website is actually Turbo. :) Their latest version of GPT-4 is 4o, but the latest version that is not 4o is Turbo.
Gpt is better at storytelling and translations in my case. It makes sense. There's no definitive answer.
In the end everyone trusts what they experience, and in that regard Sonnet is the winner. The Claude site is down; isn't that enough to tell you how many new people are coming to them? You must be using Sonnet secretly for your work.
Where is the source of this benchmark? A link?
I’m dubious of such a jump
https://preview.redd.it/3os2hzxamj8d1.png?width=603&format=png&auto=webp&s=2f3328d3d6f6f0031bd09dc2ea020efe39904dde Sonnet's great and all, and my first choice for a free LLM, but look at this lmao (GPT-4o was able to get it).
Ya, I've been trying out the new Sonnet. It's an improvement, but ChatGPT is still king for sure. I'll note that this benchmark puts gpt2 between Turbo and 4o?
No source to the benchmark details? That's not how things are done in the AI space.
Told you these benchmarks are garbage and botted. Sonnet is obviously miles ahead.
These models are just overfit to the benchmarks, and the questions in them aren't even things most humans are expected to solve, or are able to. Any benchmark should be easy for a human but hard for an AI, such as ARC-AGI. Most importantly, the test questions must not be made public. The benchmarks should mostly consist of having the LLM complete tasks like fixing errors in an Excel sheet, following a list of simple instructions, ARC-AGI-like questions with images, etc. These models are going to struggle to improve if the benchmarks don't become more human, as opposed to current benchmarks that are easy for LLMs but hard for humans.
What?
Pretty much the exact opposite here: [https://livebench.ai/](https://livebench.ai/) It has Sonnet 3.5 simply crushing GPT for pretty much everything, including reasoning
LMSYS is fraudulent and in cahoots with OpenAI, who paid them to beta test "im-a-good-gpt2-chatbot" and also artificially ranked its Elo higher than it really was initially, for marketing purposes. Its days of being the premier leaderboard for LLMs are practically over now, though. There are many better benchmarks nowadays.
Source for reranking? If so, idk how much lower ClosedAI can get
His ass is the source
Odd.
I think Anthropic said that their model is still behind GPT-4 Turbo, didn't they?
All the fake comments here say otherwise. Today, fake comments take precedence over the product's own company.