Claude 3.7 Sonnet is Anthropic’s latest large language model (released February 2025), described as the first “hybrid reasoning model”: a single model that can operate in two modes. In fast mode it gives near-instant answers; in extended “thinking” mode it allocates extra computation tokens to reason step by step before answering lesswrong.com.
Users can set a “thinking budget” to control how many tokens the model spends on internal reasoning anthropic.com; a minimal API sketch follows the capability list below. Sonnet 3.7 also supports extremely long contexts and outputs (up to ~200K input tokens and 128K output tokens datapro.news), making it suitable for processing whole documents or large codebases in one query. Anthropic highlights several key capabilities:
- Extended reasoning mode: Sonnet 3.7 can explicitly break down problems into multi-step chains of thought. In extended mode it “produces a series of tokens which it can use to reason about a problem at length,” improving accuracy on hard tasks. Users can choose between quick answers or step-by-step solutions.
- Long context and large outputs: The model handles very long inputs and outputs. Claude 3.7 supports up to 128K output tokens (over 15× more than previous versions) anthropic.com, so it can generate entire reports or code files in one response.
- Advanced coding and tool use: It is marketed as state-of-the-art for coding. Anthropic claims it can manage an entire software development lifecycle (planning, writing, refactoring, debugging) end-to-end anthropic.com. It also powers “Claude Code”, an agentic command-line tool that lets Claude read, edit, test, and commit code in a developer’s own repository (still in limited research preview anthropic.com).
- Multimodal data extraction: Sonnet 3.7 can parse images and visuals. It “can extract information from visuals like charts, graphs, and complex diagrams with ease” anthropic.com, useful for data analysis and scientific applications.
- Improved response behavior: Compared to Claude 3.5, Sonnet 3.7 distinguishes benign from harmful requests more finely, making about 45% fewer unnecessary refusals anthropic.com. It also maintains a large context window and reportedly low rates of hallucination on knowledge tasks.
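To make the two modes concrete, here is a minimal sketch of selecting them through Anthropic’s Messages API (Python SDK). The model ID and parameter names match Anthropic’s documentation at the time of writing; the prompts are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Fast mode: an ordinary request with no thinking parameter.
quick = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this changelog in one line: ..."}],
)

# Extended mode: budget_tokens caps the internal reasoning; it must be at
# least 1024 and smaller than max_tokens. Thinking tokens are billed as
# output tokens.
deliberate = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Find the bug in this parser: ..."}],
)
```

At launch, the full 128K output window additionally required a beta header (`anthropic-beta: output-128k-2025-02-19` per Anthropic’s docs), and Anthropic recommends streaming for very long generations.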
Technical Benchmarks
Independent analyses and Anthropic’s tests show Claude 3.7 Sonnet generally exceeds its predecessors (and many rivals) on a variety of benchmarks. In coding and reasoning tasks the gains are especially large:
Software Engineering (Coding)
Figure: Software-engineering benchmark (SWE-Bench Verified) accuracy for Claude 3.7 Sonnet versus other models. Claude 3.7 (orange) scores ~62.3% (70.3% with scaffolding) vs 49.0% for Claude 3.5.
In code-centric benchmarks, Sonnet 3.7 far outperforms earlier models. On the SWE-Bench Verified coding task (which measures the ability to resolve real-world GitHub issues), Sonnet scored 62.3% accuracy out of the box – rising to 70.3% when given structured scaffolding prompts – versus only 49.0% for Claude 3.5 datacamp.com.
This was well above contemporaries like OpenAI’s o1 (48.9%) and o3-mini (49.3%) and DeepSeek R1 (~49%) datacamp.com. In practice, reviewers note Sonnet often arrives at working code in fewer iterations. For example, when challenged to write a two-player reaction game, Sonnet reached a correct solution with fewer attempts than rivals like Grok-3 and OpenAI’s models decrypt.co. One analyst summed it up: Sonnet “crushes existing models on SWE-Bench” lesswrong.com.
The trade-off is higher cost: review articles observe that Sonnet produces very large outputs and “burns through output tokens like nobody’s business” decrypt.co. In short, it is slower and more expensive to run, but yields more correct code per query.
Reasoning and Mathematics
Claude 3.7’s hybrid mode shows big improvements on structured reasoning tests. On a graduate-level science reasoning benchmark (GPQA Diamond), enabling extended thinking lifted accuracy from 68.0% to 84.8% pageon.ai. On competition math (AIME 2024), the lift was even starker: Sonnet scored only 23.3% in standard mode, 61.3% with extended thinking, and 80.0% with additional parallel test-time compute.
These results, while far better than Claude 3.5’s ~16% on the same AIME set, still lag behind models like Grok-3, which reports ~84–93% decrypt.co. Anthropic acknowledges that complex math remains an “Achilles’ heel” for Claude.
On broad knowledge benchmarks, Sonnet performs at or near state-of-the-art. For example, on MMLU (multitask language understanding) it achieves scores in the high 70s (out of 100) pageon.ai, comparable to top GPT-4 models. A recent analysis characterizes Sonnet as “on par with the best of GPT-4” for general tasks pageon.ai.
Extended mode gives it an edge on tasks requiring multi-step logic, since no contemporary GPT-series model combines instant answers and a user-controllable thinking budget in a single model pageon.ai. In sum, Sonnet 3.7 usually outpaces Claude 3.5 in reasoning and is competitive with other state-of-the-art LLMs – especially when allowed extra reasoning time.
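One practical consequence of the hybrid design is that the thinking budget is a per-request knob, so callers can reserve it for inputs that fail a cheap first pass. A minimal sketch of that escalation pattern (the `solve` helper and its failure check are our own illustration, not an Anthropic-prescribed workflow):

```python
from typing import Callable

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-20250219"

def solve(problem: str, looks_wrong: Callable[[str], bool]) -> str:
    """Try fast mode first; escalate to extended thinking only if the
    quick answer fails a caller-supplied sanity check."""
    quick = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": problem}],
    )
    answer = "".join(b.text for b in quick.content if b.type == "text")
    if not looks_wrong(answer):
        return answer  # fast mode was good enough
    slow = client.messages.create(
        model=MODEL, max_tokens=20000,
        thinking={"type": "enabled", "budget_tokens": 16000},
        messages=[{"role": "user", "content": problem}],
    )
    return "".join(b.text for b in slow.content if b.type == "text")
```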
Knowledge Q&A and Summarization
Thanks to its large context window and claimed factuality, Claude 3.7 excels at Q&A over long documents. Anthropic notes it is ideal for answering questions over large knowledge bases, with “low rates of hallucination” anthropic.com. In one test, Sonnet accurately summarized a 47-page report without inventing any quotes (an improvement over Claude 3.5) decrypt.co.
However, its summary was extremely concise – essentially bullet points – missing many details decrypt.co. This reflects a pattern: Sonnet provides quick high-level overviews, which is useful for skimming but not for in-depth detail.
Overall measured hallucination (fabricated facts) is very low. For example, an independent “hallucination leaderboard” found roughly 4.4% hallucination when summarizing short documents github.com.
Datapro and Pageon analyses also report only ~2–3% hallucination rates on more complex reasoning prompts datapro.news pageon.ai. In practical terms, Sonnet’s answers are usually factually accurate, but any claim should still be verified.
Hallucination Rates and Notable Failures
Despite generally low error rates, Claude 3.7 can still hallucinate in difficult scenarios. Anthropic itself claims only “low rates of hallucination” on Q&A tasks anthropic.com. Independent tests are consistent: one report cites ~2.3% hallucination on math proofs pageon.ai, another measured ~4% on summaries github.com. In short, typical hallucination rates are on the order of a few percent, which is lower than many earlier models.
However, specific high-stakes cases reveal serious mistakes. In a public benchmarking of financial document summarization, Claude 3.7 produced fabricated citations, stock prices, and tables – blending real and made-up references. A human evaluation gave it only a C– grade, noting that it “frequently fabricates source documents” and generates false data linkedin.com.
Similarly, a user testing Sonnet on a very large multi-file code repository reported that when the input exceeded ~500 lines, Sonnet began outputting nonsense (random characters, irrelevant code) unless thinking mode was used github.com. These incidents underscore that even Sonnet’s state-of-the-art reasoning is not infallible: hallucinations can emerge when content is extremely large or specialized. In practice, developers using Sonnet should validate critical outputs (e.g. cite checks, regression tests) especially in high-risk domains.
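A cite check can be as simple as confirming that every quoted string in a generated summary appears verbatim in the source document. A minimal sketch (illustrative only; a production pipeline would also verify numbers, tickers, and table values):

```python
import re

def unsupported_quotes(summary: str, source: str) -> list[str]:
    """Return quoted passages from `summary` that never appear in `source`
    -- a cheap guard against fabricated citations."""
    normalize = lambda s: " ".join(s.split()).lower()
    src = normalize(source)
    quotes = re.findall(r'[“"]([^”"]{10,})[”"]', summary)
    return [q for q in quotes if normalize(q) not in src]

# Usage: route the summary to a human reviewer if anything is flagged.
# if unsupported_quotes(model_summary, report_text): escalate_to_reviewer()
```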
Comparison to Other LLMs
Across the board, Claude 3.7 Sonnet competes strongly with other top models. In coding and reasoning it is often ahead of contemporaries. For example, independent reviewers note that Sonnet “outperforms every competitor” on coding tasks decrypt.co, and it regained the “creative writing crown” from others in story generation tests decrypt.co.
Against OpenAI’s ChatGPT/GPT-4 family, Sonnet holds its own or exceeds them on complex tasks. One analysis summarizes: Claude 3.7’s performance is “on par with the best of GPT-4”, and its coding and reasoning “beat or match” leading models pageon.ai. The key differentiator is Sonnet’s extended mode: it can explicitly perform chain-of-thought reasoning on demand, which ChatGPT lacks pageon.ai.
On everyday general-purpose tasks (dialogue, simple Q&A), ChatGPT and Claude are comparable, but Sonnet shines when deep logic or multi-step workflows are needed.
That said, Sonnet is relatively costly. Its token pricing is higher than many alternatives ($3 per million input, $15 per million output, compared to cheaper rates for models like GPT-4o) anthropic.com.
Users report that the slower speed (due to lengthy reasoning) and large outputs make per-query costs much higher. In return, however, they get more reliable solutions for complex problems. In short, Sonnet 3.7 tends to outperform peers on rich reasoning and coding tasks, while paying a premium in latency and pricing.
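The pricing arithmetic is easy to sanity-check. Using the list prices above (and noting that thinking tokens are billed as output tokens), a hypothetical long-context coding query works out as follows:

```python
# List prices for Claude 3.7 Sonnet: $3 / million input tokens,
# $15 / million output tokens (thinking tokens count as output).
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# Example: 50K tokens of code in, 20K tokens out (answer + thinking):
print(f"${query_cost(50_000, 20_000):.2f}")  # -> $0.45
```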
Known Weaknesses and Regressions
Despite its advances, Claude 3.7 Sonnet has clear weaknesses. Complex math is still its weakest area: without extended mode it scored poorly on some standardized math tests (e.g. ~23.3% on AIME) decrypt.co. Even with extended reasoning it did not match the ~90% scores of the best models.
Relatedly, Sonnet’s extended thinking can backfire on creative or open-ended tasks: one review found that turning on extended mode produced worse story-writing output (short, repetitive, and nonsensical) than the normal mode decrypt.co. In other words, Sonnet can over-reason itself into poor creative choices.
User reports also highlight regressions. Several AI developers noted that Claude 3.7 often “over-complicates things” compared to 3.5, adding unrequested features or verbose explanations. (One commenter quipped, “TIL Sonnet 3.7 is worse than 3.5… It absolutely over-complicates things when not even needed” lemmy.world.) Summarization outputs, while factual, can be too terse to be useful.
And on very long multi-file inputs, Sonnet 3.7 has exhibited stability issues (as noted above) that Claude 3.5 did not. Finally, like any LLM, Sonnet inherits general limitations: it has a fixed knowledge cutoff (its training data ends in late 2024, despite the February 2025 release) and can suffer from subtle biases or sensitivity to phrasing.
Real-World Feedback
Feedback from early adopters is overwhelmingly positive: most describe Claude 3.7 Sonnet as a major upgrade, though not a perfect one. In developer communities and on social media, many praise its logical coherence and coding ability: posts on forums and X/Twitter celebrate Sonnet’s knack for writing correct code and solving complex reasoning queries that stumped other models.
For example, several developers reported that Sonnet effortlessly built user interfaces from natural-language descriptions, something they found unreliable with ChatGPT. On the other hand, some users explicitly prefer the style of Claude 3.5 for conversational tasks; they find 3.7’s personality slightly more stilted and its answers more encyclopedic.
Professional reviewers echo this mixed picture. The LinkedIn benchmarking study mentioned above concluded that while Sonnet 3.7 produces deep, PhD-level insight, it also “frequently fabricates” details and must be double-checked linkedin.com. Safety evaluations have generally improved, but scrutinizing Sonnet’s reasoning (which is now made visible) can itself be tricky – the model’s stated chain-of-thought may not always reflect its true internal process.
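Because the chain-of-thought is returned as structured content blocks, it can at least be captured for audit. A minimal sketch, assuming a Messages API response created with thinking enabled (as in the earlier sketches):

```python
def log_reasoning(response) -> None:
    """Print the visible reasoning and final answer from a Messages API
    response created with thinking enabled. The API can also return
    `redacted_thinking` blocks, which carry no readable text -- and, as
    noted above, visible reasoning may not reflect the model's true
    internal process."""
    for block in response.content:
        if block.type == "thinking":
            print("[visible reasoning]\n", block.thinking)
        elif block.type == "text":
            print("[final answer]\n", block.text)
```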
Overall, AI researchers and practitioners view Claude 3.7 Sonnet as a powerful research assistant: its reasoning and tool-use skills are state-of-the-art, but users still monitor outputs carefully to guard against residual errors.
Sources: Anthropic’s announcements and system card anthropic.com; independent benchmarks and reviews linkedin.com; user reports and technical blogs github.com. Each data point above is supported by the referenced publications or firsthand analyses.