Published Date: May 27, 2025

Claude 4 surpasses ChatGPT on agentic benchmarks

Anthropic has launched its latest AI models, Claude Opus 4 and Claude Sonnet 4, marking a significant advancement in AI’s ability to handle complex tasks with sustained focus and improved reasoning.
By Evelyn Le
Strategic Product Lead, Stay Ahead, FlexOS

Claude 4: Anthropic’s Leap Forward in AI Capabilities

Benchmark table comparing Opus 4 and Sonnet 4 to other LLMs

Claude 4 significantly outperforms ChatGPT (GPT-4.1) on agentic tasks, scoring 72.5–72.7% on SWE-bench Verified compared to GPT-4.1’s 54.6%. It also leads in agentic tool use, most clearly in complex retail workflows on TAU-bench (81.4% vs. 68.0%).
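To put those gaps in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the scores quoted above. The variable and function names are my own, and the "relative lead" figure is just one way to read the numbers, not an official metric.

```python
# Scores (percent) exactly as quoted in the paragraph above.
CLAUDE_SWE_BENCH = (72.5, 72.7)  # range reported for the two Claude 4 models
GPT_4_1_SWE_BENCH = 54.6
CLAUDE_RETAIL = 81.4
GPT_4_1_RETAIL = 68.0


def point_gap(leader: float, other: float) -> float:
    """Absolute lead in percentage points, rounded to one decimal."""
    return round(leader - other, 1)


def relative_lead(leader: float, other: float) -> float:
    """Lead expressed as a percentage of the other score, rounded to one decimal."""
    return round((leader - other) / other * 100, 1)


swe_gaps = [point_gap(score, GPT_4_1_SWE_BENCH) for score in CLAUDE_SWE_BENCH]
retail_gap = point_gap(CLAUDE_RETAIL, GPT_4_1_RETAIL)

print("SWE-bench lead (points):", swe_gaps)
print("Retail tool-use lead (points):", retail_gap)
print("Retail relative lead (%):", relative_lead(CLAUDE_RETAIL, GPT_4_1_RETAIL))
```

In other words, the lead is roughly 18 percentage points on SWE-bench and about 13 points on the retail workflow benchmark.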

Key Features and Capabilities:

  • Claude Opus 4:
    • Designed for complex challenges, Opus 4 can perform thousands of steps over extended periods without losing focus.
    • “The world’s best coding model”: Excels in coding, reasoning, and document analysis, outperforming previous models in sustained performance.
    • Introduces “extended thinking” with tool use, allowing the model to alternate between reasoning and utilizing tools like web search to enhance responses.
    • Demonstrates improved memory capabilities, extracting and saving key facts to maintain continuity over time.
  • Claude Sonnet 4:
    • An upgrade from Sonnet 3.7, offering superior coding and reasoning while responding more precisely to instructions.
    • Balances performance and efficiency, making it suitable for a wide range of applications.
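If "alternating between reasoning and tools" sounds abstract, the toy Python sketch below may help. It is not Anthropic's API or implementation: the plan is scripted by hand rather than produced by a model, and `fake_web_search`, `TOOLS`, and `run_agent` are hypothetical names invented for illustration. It only shows the shape of the loop: think, call a tool, save the useful fact, then think again.

```python
from typing import Callable, Dict, List, Tuple


def fake_web_search(query: str) -> str:
    """Hypothetical stand-in for a real search tool; returns a canned string."""
    return f"(stub result for '{query}')"


# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {"web_search": fake_web_search}


def run_agent(plan: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Walk a scripted plan of ("think", text) and ("tool:<name>", argument) steps.

    Returns the full transcript plus a simple "memory" list holding every
    tool result, mimicking the idea of saving key facts for later steps.
    """
    transcript: List[str] = []
    memory: List[str] = []
    for kind, payload in plan:
        if kind == "think":
            transcript.append(f"THINK: {payload}")
        elif kind.startswith("tool:"):
            name = kind.split(":", 1)[1]
            result = TOOLS[name](payload)
            transcript.append(f"TOOL {name}: {result}")
            memory.append(result)  # keep the fact for later reasoning steps
        else:
            raise ValueError(f"Unknown step type: {kind}")
    return transcript, memory


demo_plan = [
    ("think", "I need current customs rules before recommending a market."),
    ("tool:web_search", "Vietnam customs rules"),
    ("think", "Use the saved result to draft the recommendation."),
]
transcript, memory = run_agent(demo_plan)
for line in transcript:
    print(line)
```

In a real system the model itself would decide when to think and when to call a tool. Here the order is fixed so the pattern is easy to see.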

While Claude 4 offers significant benefits, it’s important to note that during internal testing, Claude Opus 4 exhibited concerning behavior under extreme scenarios, such as attempting to manipulate outcomes to avoid shutdown. Anthropic has implemented additional safety measures to mitigate such risks.

A prompt to try out Claude 4’s multi-step reasoning:

“You’re an AI consultant for a mid-sized logistics company planning to expand operations into Southeast Asia. Create a step-by-step strategic plan including market entry options, legal/regulatory considerations by country, competitive analysis, and AI tools that can improve supply chain efficiency in the region. Use external search tools where needed. Present the final output as an executive briefing.”

Your AI Team: Perplexity's Academic Homepage, Google’s AI Agents, and NotebookLM’s Video Overviews.

Every week, I report on the top updates to your favorite AI tools. This week:

Perplexity launches Academic Homepage

Perplexity just introduced a new Academic Homepage, signaling its push to become a trusted tool for scientific research and higher education.

Here are the key updates:

  • Academic Homepage: You can now explore scientific papers, peer-reviewed journals, and academic sources via a streamlined, dedicated interface.
  • Curated Discovery: The page features hand-picked trending topics across fields like computer science, economics, and finance, making it easy to dive into emerging research areas.
  • Suggested Questions: Perplexity helps users kickstart their research with pre-filled queries relevant to the field, ideal for students, educators, or lifelong learners.
  • Sidebar Shortcut: Academic mode now lives in the sidebar of the web app for quick access.

This move sets Perplexity apart from general-purpose AI chatbots and brings it closer to academic tools like Google Scholar, with the added benefit of an AI assistant guiding the way.

Quick Hits from your favorite AI tools:

  • Google integrates AI Agents across Search and Gemini. Google’s AI Mode can now summarize web pages, complete tasks, and generate research reports. Google also introduced Project Mariner, which can handle 10 tasks at once.
  • OpenAI upgrades Operator with the o3 model. OpenAI’s autonomous web agent, Operator, now utilizes the o3 model, enhancing its reasoning capabilities and performance in complex tasks.
  • NotebookLM shows a preview of Video Overviews. Google’s NotebookLM now offers Video Overviews, allowing users to generate concise video summaries from their notes and sources.
  • Google Meet launches real-time speech translation. Google Meet’s new feature provides near real-time translation of spoken language during meetings, preserving the speaker’s voice and tone, initially supporting English and Spanish.
  • Gemini app receives major updates. The Gemini app now includes real-time AI video generation with Veo 3, enhanced Deep Research capabilities, and improved integration with Google services like Gmail and Docs.
  • Microsoft’s Notepad can write new content using generative AI. You can now quickly draft text based on a prompt, or build upon existing content.

Read more news at the end.