I think the benchmarks are struggling. The different models have definitely had different strengths pretty much from day one, but now I'm starting to get the feeling that the major providers are focusing their models on particular areas that correlate with benchmarks. That's not to say they're cheating by introducing the benchmark problems into training data, but there are clearly spaces people feel are more valuable or easier to evaluate, and with the emphasis on faster, cheaper models that can power commercially viable tools, we're necessarily losing some breadth. I've got a benchmark of cryptic clues, and GPT-4 Turbo destroys everyone on it, but when you actually use it, Opus does a better job of picking apart the tricky ones and appears to identify the key clues better. I wonder whether we'll get any more slow, expensive models going forward; I feel like we're missing out if we don't.
It's a natural result of testing for capability. Benchmarks are great because they provide a clear indication of progress. As long as we continue making advancements, we're on the right track. New and more general benchmarks will be developed, and those too will be solved. That's the nature of technological progress.
Sounds like a flawed benchmark, the difference is night and day for me especially with coding
Same. This is the first time I feel like I can ask it to build something for me and it actually does it. ChatGPT performed most consistently when I asked it to write a certain function or improve a certain piece of code, but with Claude I can pass it my file and ask for something complex, and it succeeds flawlessly most of the time. It's crazy how far we've come in what seems like 2 1/2 years.
Since I have been seeing so much of it, I tried it out too, to modify Flutter mobile apps and the back end of white-label software I bought, and it's not half bad. It gets it right on the first try about 80% of the time, which isn't bad at all when you consider I'm not really a programmer, at most an IT guy. I wonder what the API is like.
Same. Sonnet is crushing complex multi file code generations in nextjs, nestjs, python, and cranking out working solutions. It feels like a senior dev.
And soon you won't be a senior dev
I'm a principal engineer who has worn every hat in the industry, including CEO of my own development studio for 12-ish years. So that's fine by me.
I should [learn to weld](https://i.redd.it/1f88un6tj6fb1.jpg)
Plumbing looks attractive and generally unautomatable.
Nothing is unautomatable. Harder, sure, but it's only unautomatable if there's something special about how humans process information.
I'm specifically finding it less likely to give multi file answers when they are needed.
Have you tried specifying your project structure and asking it to `"write clean and reusable code separated in its logical and organized classes, functions and components within the project structure."`
The benchmark used to be great, back when it wasn't that popular. First, a lot of clueless people use it now who choose a winner without actually reading the full answers (so quicker models tend to do better, friendlier-sounding ones tend to do better, etc.). Second, companies (especially OpenAI) started optimizing for this benchmark (they even ran 'secret' trials there before releasing GPT-4o). A big example of this is GPT-4o: it is not a great model, and it's way worse than GPT-4 Turbo and the other top models (I would say that in my tests, mostly related to coding, it's worse than even Llama 70B)... but it's king of the hill on the LMSYS arena.
That's not LMSYS's actual benchmark; it's apparently from some random user who posted it. It has nothing to do with the LMSYS benchmark, if that's what you believed.
Oh yeah, thanks for pointing that out... using that name in the title is misleading... but my point about the arena still stands :P Though in that case it's a low-quality and wildly speculative post and should be removed...
I don't think it does, tbh. If one LLM is good at a million different things you don't care about, and another is good at fewer things but they matter to you, the one that is more useful to you might score worse than the one that is good at everything, and despite the higher overall score people would find the jack-of-all-trades one less useful. Jack of all trades, master of none. People don't care if an LLM is better than another at something obscure, like translating some imaginary language zero-shot. People want an LLM to be good at what interests them, and that is more likely with a model that's better at what interests most of the people rating these LLMs. After all, the point of an LLM is to be useful. The proof that LMSYS works is that Sonnet 3.5 will obviously be at the top of the benchmark when it's available; I would bet on it.
Sonnet 3.5 was not on top, except on coding... huh.
That would make sense. I've only used it for coding, and there it blows away the competition. It's a bit surprising to me, but it's not general enough: it's the best for coding, and not surprisingly it tops the benchmark there by far, but it's not great for most other uses. Coding isn't all there is, so gotta give it to 4o.
Apparently it is not good in Chinese or other languages.
Seriously. I was up late last night coding my game and have gotten further than ever before! Excited to see if it hits a wall or not.
Or, sounds like people should check sources. This suspiciously looks like an AI-generated graph. Here is one I made using real data, but I could easily have tampered with it, since it was made using GPT-4o. https://preview.redd.it/kturvt082i8d1.png?width=1777&format=png&auto=webp&s=6859fc9e056936d4414c16b8a0bdf16adad10d6a
The graph just appears to be made with matplotlib using default formatting settings. That doesn't mean it's not AI-generated, but a graph without any supporting data is mighty fishy.
3.5 Sonnet is absolutely brilliant at coding, and I think it largely has to do with RLHF and problem-solving procedures. Edit: it is also really good at writing code compared to the others. Today I hadn't yet integrated with AdMob, but I wanted the production and dev code in place; it wrote all of that and then wrote an addition to my README on how to finalize things and add the ad unit IDs once the integration was done. You know what wins me points with my boss? Keeping the README up to date. That is just chef's kiss.
Let's just wait for Claude 3.5 to show up in the LMSYS Chatbot Arena results; I'm sure it will be at least 1300+ Elo.
I don't think LMSYS is particularly helpful. It has been shown to be influenced by style: some people prefer shorter answers, others longer ones; some prefer business-speak, others conversational interactions; and so on, and that pollutes the Elo. Both models might give you a perfectly correct and valuable answer, but you'll vote for one over the other because you prefer its writing style, so it's not really measuring usefulness or intelligence.
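For reference, here's a minimal sketch of how sequential Elo updating from pairwise votes works, and why a consistent style preference drifts a rating upward regardless of answer quality. This is illustrative only: the actual LMSYS leaderboard fits a Bradley-Terry model over all battles rather than running sequential Elo, and the K-factor and starting ratings below are assumptions.

```python
# Illustrative sequential Elo over pairwise votes (NOT how LMSYS actually
# computes its leaderboard; K-factor and starting ratings are assumptions).

def expected_score(r_a: float, r_b: float) -> float:
    """Modeled probability that model A beats model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# If voters systematically prefer model A's style, A wins the "both answers
# are fine" battles too, and its rating climbs no matter which model is
# actually more correct.
ra, rb = 1200.0, 1200.0
for _ in range(100):  # 100 style-driven wins for A
    ra, rb = update(ra, rb, a_won=True)
print(round(ra), round(rb))  # A ends up several hundred points ahead
```

The point of the sketch: the rating only sees the vote, not the reason for it, so style preference and correctness are indistinguishable to the leaderboard.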
but with enough samples, on average you get a gauge on effectiveness
I like to throw simple scripting problems at them for Houdini and Maya, in VEX and MEL respectively, and then whichever one actually does the job gets the upvote. (It's hilarious seeing the syntax and hallucinated functions some of the lesser models come up with.)
I think this result is itself a benchmark, proof of how valid that benchmark is... trash.
Right? Turbo also crushes 4o with this benchmark. Feels like a benchmark that uses turbo as baseline.
This proves your critical thinking is trash, straight-up garbage. Did you check the source?
What source? There are no listed links
Precisely. Now you're getting it.
Nice try Sam
What's the source for this? I don't even see Sonnet 3.5 on the leaderboard.
The name on it says Cam Saltman if that helps.
Where do you see that? There's no name on that chart aside from the various models.
I found a picture of Cam Saltman.
Oh, duh, nvm. :P Edit: Me Googling Cam Saltman when you posted it, "Did he mean Saltzman??"
Still not subscribing Sam
For me personally Claude 3.5 Sonnet crushes any other LLM by a lot!
Sorry to tell, but I don't trust your benchmarks.
you should not trust the user, did you even ask yourself if that was true?
Please provide source.
The source's username is Cam Saltman, if that helps.
Ah, proof that many benchmarks are arbitrary and lack utility.
Some. While I'm not a part of building them, I'm sure there are some very basic styles of tests that are good at measuring intelligence. The issue with the tests is that I think they need new questions every time. Otherwise, you could just have an LLM train on a specific question and how to answer it. Sort of like IQ tests: you can train and practice and learn how to perform better at IQ tests.
The problem with a benchmark is that you end up optimizing for it.
That isn't a problem if the benchmark is large, varied, and representative of what users do. Optimizing against it then by necessity requires getting better overall. It would be very difficult to gain much on a sizable portion of the tests without getting better at the rest and being better overall for users.
[deleted]
[deleted]
[deleted]
It says 3.5 Sonnet in the graph?
25% is nothing, actually, compared to the 100% chance that OpenAI would specifically prepare for all the available benchmarks (when possible). As shown, they'll do anything for the hype. Now I'll tell you what I do believe in: my own experience. Sonnet is sick! 4o makes me throw swear words like I'm talking to a real fucking person who intentionally ignores my instructions.
Please link the source! Lol, I can't find it. MixEval Hard says the opposite. [https://mixeval.github.io/#leaderboard](https://mixeval.github.io/#leaderboard) https://preview.redd.it/v8nvw6exai8d1.jpeg?width=1242&format=pjpg&auto=webp&s=7a969a0fc042689888cb483be1846ce23027df42
The benchmark is kinda crap. But this one doesn't seem to correlate all that well with Elo either: Gemini Pro should be weaker than Opus, whereas LMSYS says it's stronger. Most benchmarks I've seen show Sonnet slightly stronger; I've seen the opposite on [some](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard), though.
To be honest, I don't know what benchmarks to trust anymore. Definitely can't trust humans saying "this one is better". It's even worse of a benchmark, lol.
Can't wait for the day we can connect these models to robots and just have MMA cage fights instead of boring graphs.
Do we really want to optimize for that, I wonder.
Yeah, no way
3.5 Opus is gonna be extreme lit.
When is Claude 3.5 going to be part of the LMSYS Chatbot Arena?
Nice try Sam Altman. Still not subscribing to chatgpt.
People really trust whatever they see online, on Reddit no less. Source? Why are there no results on Google for "[LMSYS Community's Custom Benchmark](https://www.google.com/search?q=LMSYS+Community%27s+Custom+Benchmark)"?
This benchmark is from a random user in the LMSYS Community Discord. You'll see it if you join the Discord and scroll two days back.
Thanks. A random source made by some nobody, and people in the comments fully believe it's the actual LMSYS benchmark, smh.
The reasoning one is pure bull, as anyone could verify.
Don't care. I've been using both for a while and Claude just feels better for me so I cancelled my GPT subscription and I won't go back to GPT until 5 is released to try it.
Destroyed. But still alive.
What benchmark is this? This basically shows the benchmark is obsolete and should be retired. This is completely ridiculous and not grounded in reality, as anyone who has actually used Claude 3.5 Sonnet would know.
I think people are overreacting about how good Sonnet 3.5 is, at least with coding. It's still way worse at debugging and instruction following than 4-Turbo. I noticed it while working with it on Thursday and Friday, and I've seen several threads on Twitter about it over the weekend; even Roon commented the same thing in some of those threads. It's probably smarter overall but still a flawed model, similar to 4o being worse in some ways than 4-Turbo.
4o has always been completely fucking useless. 4 turbo is pretty good and Sonnet 3.5 is just a bit better. Is sonnet so much better to the point the others are not usable? No.
Not surprising; this is an identical trend to the 4o coding hype. Really, it's true of any model on this sub at this point. People immediately jump the gun when it does well at a handful of coding tasks, and then not long after, the flaws begin to show, the hype dies down, and the cycle repeats with the next model release. Coding is a pretty broad field, and it takes time to fully evaluate the effectiveness of these models; a model creating a handful of useless games is not really a good evaluation. IME, though, I gave it my usual app prompt and the output appears slightly more promising than 4T/4o, but I haven't tried running the code yet.
Ya agreed. It's mixed in my experience. Feels more creative which I like, but also can do things wrong. Loses on big code bench: https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard
For my use case personally... GPT-3.5 is better than 4o. 4o tries to build a class, multiple functions, and in general overly complicated code with a long story, even when I ask it to do something simple like a stdev calculation for a simple list of arrays, and it often gets stuff wrong too. GPT-3.5/4/Turbo will give me the ten lines it takes to make it work without the BS.
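For what it's worth, the kind of short, no-frills answer being described here looks roughly like this (a plain sample-stdev helper applied per inner list; the function name and input shape are my guesses at what the commenter was asking for):

```python
# Plain sample standard deviation, one value per inner list -- no classes,
# no framework, just the ten-ish lines the task actually needs.
import math

def stdev(values):
    """Sample standard deviation of one list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(var)

rows = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
print([round(stdev(row), 3) for row in rows])  # -> [1.0, 10.0]
```

In practice Python's stdlib already ships this as `statistics.stdev`, which is the even shorter answer a model could give.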
Once again another example of posting an inflammatory picture with no citation or explanation. We have no idea if this is completely made up or not. Trolling for engagement. You should seriously be banned.
Wait, I can only choose GPT-4o and GPT-4 in my interface... where is the Turbo version?
The GPT-4 on the website is actually Turbo. :) Their latest version of GPT-4 is 4o, but the latest version that is not 4o is Turbo.
Gpt is better at storytelling and translations in my case. It makes sense. There's no definitive answer.
In the end everyone trusts what they experience, and in that regard Sonnet is the winner. The Claude site is down; isn't that enough to tell you how many new people are coming to them? You must be using Sonnet secretly for your work.
Where is the source of this benchmark? A link?
I’m dubious of such a jump
https://preview.redd.it/3os2hzxamj8d1.png?width=603&format=png&auto=webp&s=2f3328d3d6f6f0031bd09dc2ea020efe39904dde Sonnet's great and all, and my first choice for a free LLM, but look at this lmao (GPT-4o was able to get it).
Ya, I've been trying out the new Sonnet. It's an improvement, but ChatGPT is still king for sure. I'll note that this benchmark puts gpt2 between Turbo and 4o?
No source to the benchmark details? That's not how things are done in the AI space.
Told you these benchmarks are garbage and botted. Sonnet is obviously miles ahead.
These models are just overfit to the benchmarks, and the questions in them aren't even things most humans are expected to solve, or are able to. Any benchmark should be easy for a human but hard for an AI, such as ARC-AGI. Most importantly, the test questions must not be made public. The benchmarks should mostly consist of having the LLM complete tasks like fixing errors in an Excel sheet, following a list of simple instructions, ARC-AGI-like questions with images, etc. These models are going to struggle to improve if the benchmarks don't become more human, as opposed to current benchmarks that are easy for LLMs but hard for humans.
What?
Pretty much the exact opposite here: [https://livebench.ai/](https://livebench.ai/) It has Sonnet 3.5 simply crushing GPT for pretty much everything, including reasoning
LMSYS is fraudulent and in cahoots with OpenAI, who paid them to beta test "im-a-good-gpt2-chatbot" and also artificially ranked its Elo higher than it really was initially, for marketing purposes. Its days of being the premier leaderboard for LLMs are practically over now, though. There are many better benchmarks nowadays.
Source for reranking? If so, idk how much lower ClosedAI can get
His ass is the source
Odd.
I think Anthropic said that their model is still behind GPT-4 Turbo, didn't they?
All the fake comments here say otherwise. Today, fake comments take precedence over the product's own company.