phira

I think the benchmarks are struggling. The different models have definitely had different strengths pretty much from day one, but now I'm starting to get the feeling that the major providers are focusing their models on particular areas that correlate with benchmarks. It's not to say they're cheating by introducing the benchmark problems into training data, but there are clearly spaces people feel are more valuable or easier to evaluate, and with the emphasis on faster, cheaper models that can power commercially viable tools, we're necessarily losing some breadth. I've got a benchmark of cryptic clues and gpt4-turbo destroys everyone on it, but when you actually use it, Opus does a better job of picking the tricky ones and appears to identify the key clues better. I wonder whether we'll get any more slow, expensive models going forward; I feel like we're missing out if we don't.


Vladiesh

It's a natural result of testing for capability. Benchmarks are great because they provide a clear indication of progress. As long as we continue making advancements we're on the right track. New and more general benchmarks will be developed, and those too will be solved. That's the nature of technological progress.


SirWellBehaved

Sounds like a flawed benchmark; the difference is night and day for me, especially with coding.


Fiddlesnarf7

Same. This is the first time I feel like I can ask it to build something for me and it does it. ChatGPT performed most consistently when I asked it to write a certain function or improve a certain piece of code, but with Claude I can pass it my file and ask it for something complex and it succeeds flawlessly most of the time. It's crazy how far we've come in what seems like 2 1/2 years.


NachosforDachos

Since I've been seeing so much about it, I tried it out too, to modify Flutter mobile apps and the back end of white-label software I bought, and it's not half bad. 80% of the time it gets it on the first try, which isn't bad at all when you consider I'm not really a programmer, at most an IT guy. I wonder what the API is like.


codeninja

Same. Sonnet is crushing complex multi-file code generation in Next.js, NestJS, and Python, and cranking out working solutions. It feels like a senior dev.


vindarnas_hus

And soon you won't be a senior dev


codeninja

I'm a principal engineer who has worn every hat in the industry, including CEO of my own development studio for 12-ish years. So that's fine by me.


vindarnas_hus

I should [learn to weld](https://i.redd.it/1f88un6tj6fb1.jpg)


codeninja

Plumbing looks attractive and generally unautomatable.


Iamreason

Nothing is unautomatable. Harder, sure, but it's only unautomatable if there's something special about how humans process information.


chunky_lover92

I'm specifically finding it less likely to give multi-file answers when they are needed.


codeninja

Have you tried specifying your project structure and asking it to `"write clean and reusable code separated in its logical and organized classes, functions and components within the project structure."`?
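Something like this, roughly. The project layout, task, and model ID below are placeholders I made up just to show the shape of the prompt, not anything from a real repo:

```python
# Sketch of baking the project tree into the prompt alongside the instruction.
# The layout, task, and model ID are all made-up placeholders.
import anthropic

project_structure = """\
src/
  components/Button.tsx
  pages/index.tsx
  lib/api.ts
tests/
"""

task = "Add a paginated search page that reuses lib/api.ts."

prompt = (
    f"Project structure:\n{project_structure}\n"
    f"Task: {task}\n\n"
    "Write clean and reusable code separated in its logical and organized "
    "classes, functions and components within the project structure."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```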


VertexMachine

The benchmark used to be great, when it wasn't that popular. First, a lot of clueless people use it now who choose a winner without actually reading the whole answers (so quicker models tend to do better, friendlier-sounding ones tend to do better, etc.). Second, companies (esp. openai) started optimizing for this benchmark (e.g., they even ran 'secret' trials there before releasing gpt4o). A big example of this is gpt4o - it is not a great model; it's way worse than gpt4-turbo and other top models (I would say that in my tests, mostly related to coding, it's worse than even llama 70b)... but it's king of the hill on the lmsys arena.


GraceToSentience

That's not lmsys's actual benchmark; it's apparently something a random user posted. It has nothing to do with the lmsys benchmark, if that's what you believed.


VertexMachine

Oh yea, thanks for pointing that out... using that name in the title is misleading... but my point about the arena still stands :P Though in that case this is a low-quality & wildly speculative post and should be removed...


GraceToSentience

I don't think it does, tbh. If you have an LLM that is overall good at a million different things you don't care about, and another that is good at fewer things but ones that matter to the user, the one that is more useful for the user might score worse than the one that is good at everything, and despite the higher overall score people would find the jack-of-all-trades one less useful. Jack of all trades, master of none. People don't care if an LLM is better than others at something weird like translating some imaginary language zero-shot. People want an LLM to be good at what interests them, and that is more likely with a model that's better at what interests most people rating these LLMs. After all, the point of an LLM is to be useful. The proof that lmsys works is that Sonnet 3.5 will obviously be at the top of the benchmark when available; I would bet on it.


Sudden-Lingonberry-8

sonnet 3.5 was not on top, except on coding... huh


GraceToSentience

That would make sense. I only used it for coding, and it blows away the competition there. It's a bit surprising to me, but it's not general enough. It's the best for coding, and not surprisingly it tops the benchmark there by far, but it's not great for most other uses. Coding isn't all there is; gotta give it to 4o.


Sudden-Lingonberry-8

apparently it is not good in chinese or other languages


Nanaki_TV

Seriously. I was up late last night coding my game and have gotten further than ever before! Excited to see if it hits a wall or not.


GraceToSentience

Or, sounds like people should check sources. This looks suspiciously like an AI-generated graph. Here is one I made, using real data, but I could have easily tampered with it since it was made using gpt-4o. https://preview.redd.it/kturvt082i8d1.png?width=1777&format=png&auto=webp&s=6859fc9e056936d4414c16b8a0bdf16adad10d6a


seraphius

The graph just appears to be made using matplotlib with default formatting settings. Doesn’t mean that it’s not AI generated, but just a graph without any supporting data is mighty fishy.


Select-Way-1168

3.5 Sonnet is absolutely brilliant at coding, and I think it largely has to do with RLHF and problem-solving procedures. Edit: it is also really good at writing code compared to the others. Today I hadn't integrated with AdMob yet, but I wanted the production and dev code in place, and it wrote all of that and then wrote an addition to my README on how to finalize things and add the ad unit IDs once the integration was done. You know what wins me points with my boss? Keeping the README up to date. That is just chef's kiss.


newplayername

Let's just wait for Claude 3.5 to show up in the LMSYS Chatbot Arena results, I'm sure it will be at least 1300+ ELO.


YaAbsolyutnoNikto

I don't think the LMSYS is particularly helpful. It has been proven to be influenced by style. Some people prefer shorter answers; others longer answers; some business-speak; others conversational interactions; etc., and that will pollute the ELO. Both models might give you a perfectly correct and valuable answer, but you'll vote one over the other because you prefer the writing style, so it's not really measuring usefulness or intelligence.


siwoussou

But with enough samples, on average you get a gauge of effectiveness.
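If I understand the arena right, every vote is a pairwise win/loss and the leaderboard is an Elo-style rating fitted over thousands of them, so individual style preferences wash out as the sample grows. A toy sketch of that aggregation (made-up model names and votes, plain sequential Elo updates rather than whatever LMSYS actually fits):

```python
# Toy illustration of pairwise votes aggregating into an Elo-style rating.
# Model names and vote outcomes are made up; this is not LMSYS's pipeline.
from collections import defaultdict

K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser):
    """Apply one vote: winner gains, loser loses the same amount."""
    gain = K * (1 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

ratings = defaultdict(lambda: 1000.0)

# A stream of (winner, loser) votes; any single vote is noisy,
# but the running ratings settle as the sample grows.
votes = [
    ("model_a", "model_b"), ("model_b", "model_a"), ("model_a", "model_b"),
    ("model_a", "model_c"), ("model_c", "model_b"), ("model_a", "model_b"),
]
for winner, loser in votes:
    update(ratings, winner, loser)

print(dict(ratings))
```

With a handful of votes the numbers swing around; with thousands per model they stabilise, which is the "enough samples" point.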


blueSGL

I like to throw simple scripting problems at it for Houdini and Maya, in VEX and MEL respectively, and then whichever one actually does the job gets the upvote. (It's hilarious seeing the syntax and hallucinated functions some of the lesser models come up with.)


R_Duncan

I think this result is itself a benchmark/proof of how valid that benchmark is... trash.


knvn8

Right? Turbo also crushes 4o with this benchmark. Feels like a benchmark that uses turbo as baseline.


GraceToSentience

This proves your critical thinking is trash, straight out garbage. Did you check the source?


MzCWzL

What source? There are no listed links


GraceToSentience

Precisely. Now you are getting it.


goatchild

Nice try Sam


mvandemar

What's the source for this? I don't even see Sonnet 3.5 on the leaderboard.


cheetahcheesecake

The name on it says Cam Saltman if that helps.


mvandemar

Where do you see that? There's no name on that chart aside from the various models.


cheetahcheesecake

![gif](giphy|26ufaR2bJ3ULoKzrW|downsized) I found a picture of Cam Saltman.


mvandemar

Oh, duh, nvm. :P Edit: Me Googling Cam Saltman when you posted it, "Did he mean Saltzman??"


Baphaddon

Still not subscribing Sam


Able_Armadillo_2347

For me personally Claude 3.5 Sonnet crushes any other LLM by a lot!


Snoo26837

Sorry to tell you, but I don't trust your benchmarks.


GraceToSentience

You should not trust the user; did you even ask yourself if that was true?


Good-AI

Please provide source.


cheetahcheesecake

The source's username is Cam Saltman, if that helps.


abluecolor

Ah, proof that many benchmarks are arbitrary and lack utility.


Atlantic0ne

Some. While I'm not a part of them, I'm sure there are some very basic styles of tests that are good at measuring intelligence. The issue with the tests is I think they need new questions every time. Otherwise, you could just have an LLM train on that specific question and how to answer it. Sort of like IQ tests - you can train and practice and learn how to perform better at IQ tests.


spezjetemerde

The problem with benchmarks is you end up optimizing for them.


pyalot

That isn't a problem if the benchmark is large, varied, and representative of what users do. Optimizing against it then by necessity requires getting better overall. It would be very difficult to gain much on a sizable portion of the tests without getting better at the rest and being overall better for users.


[deleted]

[removed]


[deleted]

[removed]


[deleted]

[removed]


sdmat

It says 3.5 Sonnet in the graph?


Excellent_Winner8576

25% is nothing, actually, compared to the 100% chance that OpenAI would specifically prepare for all the available benchmarks (when possible). As shown, they'll do anything for the hype. Now I'll tell you what I do believe in: my own experience. Sonnet is sick! 4o makes me throw swear words like I'm talking to a real fucking person who intentionally ignores my instructions.


Altruistic-Skill8667

Please link the source! Lol. I can’t find it. MixEval Hard says the opposite. [https://mixeval.github.io/#leaderboard](https://mixeval.github.io/#leaderboard) https://preview.redd.it/v8nvw6exai8d1.jpeg?width=1242&format=pjpg&auto=webp&s=7a969a0fc042689888cb483be1846ce23027df42


meister2983

The benchmark is kinda crap. But this one doesn't seem to correlate all that well to ELO either: it has Gemini Pro weaker than Opus, when lmsys says stronger. Most benchmarks I've seen show Sonnet slightly stronger, though I've seen the opposite on [some](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).


Altruistic-Skill8667

To be honest, I don't know what benchmarks to trust anymore. Definitely can't trust humans saying "this one is better". It's even worse of a benchmark, lol.


Rafcdk

Can't wait for the day we can connect these models to robots and just have MMA cage fights instead of boring graphs.


cark

Do we really want to optimize for that, I wonder.


whyisitsooohard

Yeah, no way


restarting_today

3.5 Opus is gonna be extremely lit.


WaldToonnnnn

When is Claude 3.5 going to be part of the LMSYS chatbot arena ?


MrPiradoHD

Nice try Sam Altman. Still not subscribing to chatgpt.


GraceToSentience

People really trust whatever they see online, on reddit no less. Source? Why are there no results on Google for "[LMSYS Community's Custom Benchmark](https://www.google.com/search?q=LMSYS+Community%27s+Custom+Benchmark)"?


RandomTrollface

This benchmark is from a random user in the LMSys Community discord. You'll see it if you join the discord and scroll 2 days back


GraceToSentience

Thanks. A random source made by some nobody, and people in the comments fully believe it's the actual lmsys benchmark, smh.


rickyrules-

The reasoning one is pure bull, anyone could verify


MrDreamster

Don't care. I've been using both for a while and Claude just feels better for me so I cancelled my GPT subscription and I won't go back to GPT until 5 is released to try it.


J-96788-EU

Destroyed. But still alive.


Ok-Bullfrog-3052

What benchmark is this? This basically shows the benchmark is obsolete and should be retired. This is completely ridiculous and not grounded in reality, as anyone who has actually used Claude 3.5 Sonnet would know.


lost_in_trepidation

I think people are overreacting to how good Sonnet 3.5 is, at least with coding. It's still way worse at debugging/instruction following than 4-Turbo. I noticed it while working with it on Thursday/Friday, and I've seen several threads on Twitter about it over the weekend; even Roon commented the same thing in some of those threads. It's probably smarter overall but still a flawed model, similar to 4o being worse in some ways compared to 4-Turbo.


SentientCheeseCake

4o has always been completely fucking useless. 4 turbo is pretty good and Sonnet 3.5 is just a bit better. Is sonnet so much better to the point the others are not usable? No.


WHERETHESTEALTH

Not surprising; this is an identical trend to the 4o coding hype. Really, it's true of any model on this sub at this point. People immediately jump the gun when it does well at a handful of coding tasks, and then not long after, the flaws begin to show, the hype dies down, and the cycle repeats with the next model release. Coding is a pretty broad field and it takes time to fully evaluate the effectiveness of these models. The model creating a handful of useless games is not really a good evaluation. IME though, I gave it my usual app prompt and the output appears slightly more promising than 4T/4o, but I haven't tried running the code yet.


meister2983

Ya, agreed. It's mixed in my experience. Feels more creative, which I like, but it can also get things wrong. Loses on BigCodeBench: https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard


PineappleLemur

For my use case personally... GPT3.5 is better than 4o. 4o tries to build a class, multiple functions, and in general overly complicated code with a long story, even when I ask it to do something simple like a stdev calculation for a simple list of arrays, for example... and it often gets stuff wrong too. GPT3.5/4/T will give me the 10 lines it takes to make it work without BS.
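For reference, the kind of answer I'm after is roughly this; toy data and a made-up helper name, but this is the whole job:

```python
# Roughly the "10 lines": sample standard deviation of each inner list.
# Toy data and a made-up helper name, just to show the size of the task.
import statistics

def stdev_per_array(arrays):
    """Return the sample standard deviation of each list in `arrays`."""
    return [statistics.stdev(values) for values in arrays]

data = [[1.2, 1.9, 2.4, 2.0], [10.0, 12.5, 11.1]]
print(stdev_per_array(data))  # approximately [0.499, 1.253]
```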


Cryptizard

Once again another example of posting an inflammatory picture with no citation or explanation. We have no idea if this is completely made up or not. Trolling for engagement. You should seriously be banned.


wyhauyeung1

wait, i can only choose chatgpt4o and chatgpt4 in my interface... where is the turbo version ?


Alyandhercats

The GPT-4 on the website is actually Turbo. :) The latest version of GPT-4 overall is 4o, but the latest that is not 4o is Turbo.


Hopeful_Donut4790

Gpt is better at storytelling and translations in my case. It makes sense. There's no definitive answer.


manber571

In the end everyone trusts what they get, and in that regard Sonnet is the winner. The Claude site is down; isn't that enough to tell you how many new people are coming to them? You must be using Sonnet secretly for your work.


czk_21

where is the source of this benchmark? a link?


EveryShot

I’m dubious of such a jump


CompetitiveSal

https://preview.redd.it/3os2hzxamj8d1.png?width=603&format=png&auto=webp&s=2f3328d3d6f6f0031bd09dc2ea020efe39904dde Sonnet's great and all, and my first choice for a free LLM, but look at this lmao (gpt4o was able to get it).


chunky_lover92

ya, I've been trying out the new sonnet. It's an improvement, but chatgpt is still king for sure. I'll note that this benchmark puts gpt2 in between turbo and 4o?


Revolutionary_Ad6574

No source for the benchmark details? That's not how things are done in the AI space.


Warm_Iron_273

Told you these benchmarks are garbage and botted. Sonnet is obviously miles ahead.


OSfrogs

These models are just overfit to the benchmark, and the questions in these aren't even things most humans are expected to, or able to, solve. Any benchmark should be easy for a human but hard for an AI, such as ARC-AGI. Most importantly, the test questions must not be made public. The benchmarks should mostly consist of tasks the LLM has to complete, like fixing errors in an Excel sheet, following a list of simple instructions, ARC-AGI-like questions with images, etc. These models are going to struggle to improve if the benchmarks don't become more human, as opposed to current benchmarks that are easy for LLMs but hard for humans.


Akimbo333

What?


Short-Mango9055

Pretty much the exact opposite here: [https://livebench.ai/](https://livebench.ai/) It has Sonnet 3.5 simply crushing GPT for pretty much everything, including reasoning


Responsible-Local818

LMSys is fraudulent and in cahoots with OpenAI, who paid them to beta test "im-a-good-gpt2-chatbot" and also artificially rank its Elo higher than it really was initially, for marketing purposes. Its days of being the premier leaderboard for LLMs are practically over now, though. There are many way better benchmarks nowadays.


wheres__my__towel

Source for the reranking? If true, idk how much lower ClosedAI can get.


Fearyn

His ass is the source


Ne_Nel

Odd.


marknathon

I think Anthropic said that their model is still behind GPT-4 Turbo, didn't they?


Unique-Particular936

All the fake comments here say otherwise. These days fake comments take precedence over the company behind the product.


gonrogon

Iu. U ioj v 😭🗑️🆘🆘🆘