Claude 3.5 Sonnet vs GPT-4o: Side-by-Side Tests

Published 2024-06-28
The ultimate showdown between two of the most advanced large language models on the market: OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet. In this video, I put these models to the test in a series of head-to-head challenges to determine which one truly reigns supreme. I evaluate their responses to various prompts, awarding points to the model that delivers the best performance in each category. Will Claude 3.5 Sonnet live up to its reputation as the best LLM available, or will GPT-4o take the crown? Join me for an in-depth comparison and find out which model comes out on top!

I hope you learn something from this video. Comment with any questions, and I'll make sure to respond!

***

Link to text responses from the video: gist.github.com/patrickstorm/346e17f193ae42036f890…

***

0:00 - Intro
0:27 - Highlights and Benchmarks of Claude 3.5 Sonnet
3:12 - Showdown rules
3:58 - Round 1: Creative Writing
6:55 - Round 2: Image Descriptions
9:09 - Round 3: Coding
15:31 - Round 4: Sentiment Analysis
17:05 - Round 5: Question Answering
20:45 - Round 6: Image Generation
21:07 - Round 7: Conversational Skills
22:26 - Round 8: Summarization
23:53 - Final results & What model am I going to use?

All Comments (21)
  • I think giving no points when both are correct may bias the final result. Let's say you've done 20 tests: in 15 the results are the same, in 1 GPT-4o is better, and in 4 Claude Sonnet is better. The score is then 4-1 for Claude Sonnet, but actually it's more like 19-16.
  • Summary: 1. both are great. 2. don't use either for fact finding. 3. Since they are both free, use both simultaneously.
  • @Ivan7Kovnovic
    The GDP 2018 question was actually answered correctly. According to every source I found on the internet, Germany was 4th and the UK was 5th.
  • @drlordbasil
    Claude Sonnet is wayyy better for complex tasks and assistance in debugging.
  • This was the best comparison video on YouTube. Great job man, subscribed.
  • @briankgarland
    I pay for both, primarily for coding, and haven't used 4o since Sonnet came out.
  • @vm_jayfus9332
    Your channel deserves sooooo Much more attention😮
  • @MrAmad3us
    Claude's premium plan gives fewer messages per dollar. It's significantly more consistent in long and complex convos, but you reach the 5-hour message limit quickly.
  • @NithinJune
    For the coding tests you should do a plagiarism check to see if it's straight ripping someone's code.
  • @RanLM1
    Great video. Thank you. Subscribed
  • @Repz98
    This video was really well made, and I enjoyed it through the entire video! I thought I was watching someone with 200k plus subs, based on the quality of this content. Keep it up, I’m subscribing now!
  • Great job on the video, dude! I also agree with your conclusion at the end about how you plan to use them. I like Claude, but without those extra things, ChatGPT is my daily driver.
  • @zejdzglebiej
    The question is, what do you mean by better writing? I'm afraid you rate texts too positively when they rely on understatement and lack a logical structure with an opening and closing; you perceive that as an aura of mystery. That's how Claude fooled you: what it couldn't do, you interpreted as good writing.
  • 00:03 Claude 3.5 Sonnet outperforms GPT-4o in benchmarks
    02:27 Claude 3.5 Sonnet outperforms GPT-4o in speed and live code demonstrations
    04:46 Claude 3.5 Sonnet outperforms GPT-4o in creative writing tests
    07:13 Comparison of performance between Claude 3.5 Sonnet and GPT-4o models
    09:52 Comparison of Claude 3.5 Sonnet and GPT-4o
    12:27 Difference in code review between Claude 3.5 Sonnet and GPT-4o
    14:57 Comparing Claude 3.5 Sonnet and GPT-4o
    17:28 Comparison of GPT-4o and Claude 3.5 Sonnet performance on trivia questions
    19:49 GPT-4o performed better in factual accuracy
    22:04 Claude 3.5 Sonnet outperformed GPT-4o in understanding and summarizing human emotions
    24:18 Claude 3.5 Sonnet offers better performance and cost savings
  • @ktms1188
    Claude 3.5 and GPT-4o both have their strengths, and it's fascinating to see how they differ. Claude feels more human, like it's really trying to understand what I'm asking, but I've noticed that with GPT's memory function the model now knows a lot more about what I'm asking, so its answers have gotten much better, closer to Claude 3.5. My issue is that Claude sometimes hits those frustrating blocks and says it's unable to answer my question, which drives me nuts even when it's nothing controversial and it clearly would know the answer. I noticed in one of their talking points that they know it's overly restrictive and are working to improve that. GPT-4o, on the other hand, is super analytical but occasionally needs me to rephrase my questions to get the best answers. I've been using both for a while now, and here's what I've found: Claude's artifact mode is mind-blowing, and it's nice if you're on an iPhone or iPad since there's no Android app. GPT's memory function is a game-changer, making it more accurate over time as it learns from our interactions. Wouldn't it be amazing if they combined the best of both worlds? I'd love to see a deep-dive comparison between custom GPTs like "Scholar" and the standard GPT-4o, especially for fact-based questions. Does the customization really boost accuracy?