AI Catchup

GPT-5.5 Is Here: State-of-the-Art Agentic Coding, 1M Context, and a New Pro Tier

17 min read

OpenAI launched GPT-5.5 on April 23, 2026 -- its smartest model yet, with state-of-the-art scores on Terminal-Bench 2.0 (82.7%), GDPval (84.9%), and OSWorld-Verified (78.7%), per-token latency matching GPT-5.4, and a new GPT-5.5 Pro tier for harder work. Available in ChatGPT and Codex today, with API access at $5/M input and $30/M output coming soon.

OpenAI released GPT-5.5 on April 23, 2026, calling it their smartest and most intuitive model yet. The headline shift is not raw intelligence in isolation -- it is sustained, agentic intelligence: the model excels at messy, multi-part tasks that require planning, tool use, self-checking, and persistence across a long arc of work. Pricing is set, a new Pro tier is live in ChatGPT, and the rollout covers Plus through Enterprise users in ChatGPT and Codex today, with the API following soon.

Key Takeaways

  • Launch date: April 23, 2026. Rolling out today to Plus, Pro, Business, and Enterprise in ChatGPT and Codex; GPT-5.5 Pro in ChatGPT for Pro, Business, and Enterprise. API availability follows shortly.
  • Agentic coding lead: state-of-the-art 82.7% on Terminal-Bench 2.0, 73.1% on OpenAI's internal Expert-SWE eval (a long-horizon frontier coding test with a median 20-hour human completion time), and 58.6% on SWE-Bench Pro.
  • Knowledge work lead: 84.9% on GDPval wins-or-ties, 78.7% on OSWorld-Verified, and 98.0% on Tau2-bench Telecom without prompt tuning.
  • Scientific research lead: 80.5% on BixBench and 25.0% on GeneBench, up from GPT-5.4's 74.0% and 19.0%. An internal version of GPT-5.5 with a custom harness helped find a new proof about off-diagonal Ramsey numbers, later verified in Lean.
  • Latency preserved: GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while performing at a higher level of intelligence, and uses significantly fewer tokens to complete the same Codex tasks.
  • API pricing: $5 per million input tokens, $30 per million output tokens, 1M token context window. GPT-5.5 Pro: $30 per million input, $180 per million output.
  • Pro tier in ChatGPT: GPT-5.5 Pro is a separate ChatGPT product for harder questions and higher-accuracy work, with leading scores on GPQA and FrontierMath.
  • Cyber safeguards scaled up: GPT-5.5 ships stricter cyber classifiers, trusted-access routes for verified defenders, and is treated as High on both biological/chemical and cybersecurity capabilities under OpenAI's Preparedness Framework.

What OpenAI Actually Shipped

GPT-5.5 is a frontier model tuned for real work on a computer, not just for winning benchmarks. OpenAI frames it as the next step toward a new way of getting work done: you hand GPT-5.5 a messy, multi-part task and it plans, uses tools, checks its work, navigates ambiguity, and keeps going without careful step-by-step management.

The gains are especially strong in four areas:

  • Agentic coding -- writing and debugging, but also holding context across large systems, reasoning through ambiguous failures, and carrying changes through the surrounding codebase.
  • Computer use -- seeing what is on screen, clicking, typing, navigating interfaces, and moving across tools with precision.
  • Knowledge work -- generating documents, spreadsheets, slide presentations, research reports, and operational plans.
  • Early scientific research -- multi-stage genomics analysis, bioinformatics, and even contributing mathematical arguments in core research areas.

The second-order claim is the one to sit with: OpenAI says GPT-5.5 delivers this intelligence step-up without compromising on speed. Larger and more capable models are typically slower to serve, but GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while operating at a much higher level of intelligence. In Codex, the model is also more efficient -- it uses significantly fewer tokens to complete the same tasks.

Agentic Coding: Where the Benchmarks Land

GPT-5.5 is OpenAI's strongest agentic coding model to date, and the coding benchmarks are where the scoring gap opens most clearly.

Terminal-Bench 2.0 tests complex command-line workflows that require planning, iteration, and tool coordination. GPT-5.5 scores 82.7%, versus 75.1% for GPT-5.4, 69.4% for Claude Opus 4.7, and 68.5% for Gemini 3.1 Pro. That is a state-of-the-art result with a meaningful gap to the next frontier model.

SWE-Bench Pro evaluates real-world GitHub issue resolution. GPT-5.5 reaches 58.6%, up from GPT-5.4's 57.7%. Claude Opus 4.7 still leads this one at 64.3%, with Gemini 3.1 Pro at 54.2%. OpenAI notes that labs have observed evidence of memorization on this eval, so the score is worth treating as one signal among several rather than a single source of truth.

Expert-SWE is OpenAI's internal frontier eval for long-horizon coding tasks with a median estimated human completion time of 20 hours. GPT-5.5 scores 73.1% versus GPT-5.4's 68.5%. OpenAI specifically highlights that GPT-5.5 improves on GPT-5.4 across all three coding evals while using fewer tokens -- so the model is both smarter and more token-efficient on hard coding work.

What Testers Said

OpenAI's launch materials include specific testimony from partners that is useful for reading what "smarter" actually means in day-to-day use:

  • Dan Shipper, founder and CEO of Every, called GPT-5.5 "the first coding model I've used that has serious conceptual clarity." He tested it by replaying a real bug his team had spent days debugging after launch -- work that eventually required one of his best engineers to rewrite part of the system. GPT-5.4 could not reproduce that rewrite. GPT-5.5 could.
  • Pietro Schirano, CEO of MagicPath, reported GPT-5.5 merging a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially, resolving the conflict in one shot in about 20 minutes. His takeaway: "It genuinely feels like I'm working with a higher intelligence, and there's almost a sense of respect."
  • Senior engineers reported GPT-5.5 was noticeably stronger than GPT-5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and predicting testing and review needs without explicit prompting.
  • One engineer at NVIDIA with early access said: "Losing access to GPT-5.5 feels like I've had a limb amputated."
  • Michael Truell, co-founder and CEO of Cursor, described GPT-5.5 as "noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor."

The pattern in the qualitative feedback is persistence. GPT-5.5 does not just answer harder questions; it sticks with harder tasks without stopping early or needing hand-holding.

Knowledge Work and Computer Use

The same intent-understanding that makes GPT-5.5 strong at coding also makes it powerful for general computer work: finding information, using tools, checking output, and turning raw material into finished documents, spreadsheets, and decks.

In Codex, GPT-5.5 is better than GPT-5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers highlighted gains on operational research, spreadsheet modeling, and turning messy business inputs into plans. Paired with Codex's computer-use skills, GPT-5.5 comes closer to the feeling that it is actually using the computer with you -- seeing the screen, clicking, typing, navigating interfaces, moving across tools.

OpenAI shared concrete internal examples of teams already running these workflows today, with more than 85% of the company using Codex every week across functions including software engineering, finance, communications, marketing, data science, and product management:

  • Communications used GPT-5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent that now handles low-risk requests while routing higher-risk requests to human review.
  • Finance used Codex to review 24,771 K-1 tax forms totaling 71,637 pages using a workflow that excluded personal information, accelerating the task by two weeks compared to the prior year.
  • Go-to-Market automated the weekly business report, saving one employee 5-10 hours a week.

The benchmark scores that back up this kind of work:

  • GDPval wins-or-ties: 84.9% for GPT-5.5, 83.0% for GPT-5.4, 82.3% for GPT-5.5 Pro, 80.3% for Claude Opus 4.7, 67.3% for Gemini 3.1 Pro. GDPval tests agents' ability to produce well-specified knowledge work across 44 occupations.
  • OSWorld-Verified: 78.7% for GPT-5.5, 75.0% for GPT-5.4, 78.0% for Claude Opus 4.7. This eval measures whether a model can operate real computer environments on its own.
  • Tau2-bench Telecom: 98.0% for GPT-5.5, 92.8% for GPT-5.4 -- both run without prompt tuning and with GPT-4.1 as the user model. This eval tests complex customer-service workflows.
  • FinanceAgent v1.1: 60.0% for GPT-5.5, 56.0% for GPT-5.4, 64.4% for Claude Opus 4.7, 59.7% for Gemini 3.1 Pro.
  • Investment Banking Modeling Tasks (internal): 88.5% for GPT-5.5, 87.3% for GPT-5.4, 88.6% for GPT-5.5 Pro.
  • OfficeQA Pro: 54.1% for GPT-5.5, 53.2% for GPT-5.4, 43.6% for Claude Opus 4.7, 18.1% for Gemini 3.1 Pro.

GPT-5.5 Pro in ChatGPT is aimed at the even-harder end of this work. Early testers reported responses were significantly more comprehensive, well-structured, accurate, relevant, and useful than GPT-5.4 Pro, with the clearest gains in business, legal, education, and data science.

Scientific Research Gains

Scientific and technical research benchmarks show another sharp step-up. The pattern is persistence across the full research loop -- explore an idea, gather evidence, test assumptions, interpret results, decide what to try next -- not just answering a single hard question.

  • GeneBench: 25.0% for GPT-5.5 versus 19.0% for GPT-5.4, and 33.2% for GPT-5.5 Pro versus 25.6% for GPT-5.4 Pro. GeneBench focuses on multi-stage scientific data analysis in genetics and quantitative biology, where tasks often correspond to multi-day projects for scientific experts.
  • BixBench: 80.5% for GPT-5.5 versus 74.0% for GPT-5.4. BixBench is a bioinformatics and data analysis benchmark, and GPT-5.5 achieves leading performance among models with published scores.
  • FrontierMath Tier 1-3: 51.7% for GPT-5.5, 47.6% for GPT-5.4, 52.4% for GPT-5.5 Pro, 43.8% for Claude Opus 4.7, 36.9% for Gemini 3.1 Pro.
  • FrontierMath Tier 4: 35.4% for GPT-5.5, 27.1% for GPT-5.4, 39.6% for GPT-5.5 Pro, 22.9% for Claude Opus 4.7, 16.7% for Gemini 3.1 Pro.
  • GPQA Diamond: 93.6% for GPT-5.5, 92.8% for GPT-5.4, 94.2% for Claude Opus 4.7, 94.3% for Gemini 3.1 Pro.
  • Humanity's Last Exam (no tools): 41.4% for GPT-5.5, 39.8% for GPT-5.4, 43.1% for GPT-5.5 Pro, 46.9% for Claude Opus 4.7, 44.4% for Gemini 3.1 Pro.
  • Humanity's Last Exam (with tools): 52.2% for GPT-5.5, 52.1% for GPT-5.4, 57.2% for GPT-5.5 Pro, 54.7% for Claude Opus 4.7, 51.4% for Gemini 3.1 Pro.

Beyond benchmarks, OpenAI shared two concrete research stories worth reading in full in the launch post:

  • Derya Unutmaz, an immunology professor and researcher at the Jackson Laboratory for Genomic Medicine, used GPT-5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes, producing a detailed research report that he said would have taken his team months.
  • Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, used GPT-5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model.

The most striking result: an internal version of GPT-5.5 with a custom harness helped discover a new proof about off-diagonal Ramsey numbers, a core object in combinatorics. The result was later verified in Lean. OpenAI frames this as a concrete example of GPT-5.5 contributing a surprising and useful mathematical argument, not just code or explanation.
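
For context, "off-diagonal" refers to Ramsey numbers R(s, t) with s ≠ t. The launch post does not disclose the proof itself, so the definition below is textbook background rather than the new result:

```latex
% Standard definition only -- the new proof itself is not public.
\[
  R(s,t) \;=\; \min\bigl\{\, n \in \mathbb{N} \;:\;
    \text{every red/blue coloring of the edges of } K_n
    \text{ contains a red } K_s \text{ or a blue } K_t \,\bigr\}
\]
```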

Inference Efficiency: Why Latency Held

The decision that stands out in the launch post is how OpenAI held latency constant while shipping a meaningfully smarter model. GPT-5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Inference was rethought as an integrated system rather than a set of isolated optimizations.

OpenAI highlights that Codex and GPT-5.5 were instrumental in hitting the performance targets. One specific example: load balancing and partitioning heuristics. Before GPT-5.5, OpenAI split requests on an accelerator into a fixed number of chunks to balance work across computing cores. A pre-determined number of static chunks is not optimal for all traffic shapes. Codex analyzed weeks of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work, increasing token generation speeds by over 20%. GPT-5.5 helped find and implement further improvements in the serving stack itself.
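
OpenAI has not published the heuristics themselves, so the sketch below is only an illustration of the underlying idea -- replacing a fixed chunk count with traffic-aware assignment, here approximated with classic longest-processing-time scheduling. All names and numbers are hypothetical:

```python
# Illustrative sketch only: OpenAI's actual partitioning heuristics are
# unpublished. This contrasts a fixed round-robin split with a
# traffic-aware greedy assignment (longest-processing-time scheduling).
import heapq

def fixed_chunks(request_lens: list[int], n_chunks: int) -> list[int]:
    """Baseline: round-robin requests into a fixed number of chunks."""
    loads = [0] * n_chunks
    for i, length in enumerate(request_lens):
        loads[i % n_chunks] += length
    return loads

def traffic_aware(request_lens: list[int], n_cores: int) -> list[int]:
    """Greedy LPT: place each request (largest first) on the currently
    least-loaded core, so skewed traffic shapes balance better."""
    heap = [(0, core) for core in range(n_cores)]  # (load, core id)
    heapq.heapify(heap)
    for length in sorted(request_lens, reverse=True):
        load, core = heapq.heappop(heap)
        heapq.heappush(heap, (load + length, core))
    return [load for load, _ in sorted(heap, key=lambda x: x[1])]

if __name__ == "__main__":
    # Hypothetical skewed traffic: a few long prompts, many short ones.
    traffic = [4096, 3500, 512, 256, 128, 128, 64, 64, 32, 32]
    for name, loads in [("fixed", fixed_chunks(traffic, 4)),
                        ("traffic-aware", traffic_aware(traffic, 4))]:
        print(name, loads, "imbalance:", max(loads) - min(loads))
```

The point of the sketch is the shape of the change, not the specific algorithm: static chunk counts leave cores idle under skewed traffic, and a policy informed by observed request-length distributions closes that gap.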

The practical takeaway for anyone serving these models: the bottleneck at the frontier is increasingly systems engineering, not model architecture alone, and the frontier labs are using their own models to improve their own infrastructure.

Cybersecurity: Stricter Safeguards, Broader Defender Access

GPT-5.5 is an incremental but important step in AI cybersecurity capability. It did not reach the Critical level under OpenAI's Preparedness Framework, but evaluations showed a clear step up from GPT-5.4. OpenAI is treating both biological/chemical and cybersecurity capabilities as High under the framework for this release.

Three moves define how GPT-5.5 is being deployed:

  • Industry-leading safeguards for higher-risk cyber activity. GPT-5.5 introduces stricter classifiers for potential cyber risk, added protections for repeated misuse, and tighter controls around sensitive cyber requests. OpenAI acknowledges some users may find these classifiers annoying initially and plans to tune them over time.
  • Trusted Access for Cyber expansion. OpenAI is expanding access to cyber-permissive models through its Trusted Access for Cyber program, starting with Codex. Verified users who meet certain trust signals get expanded access to GPT-5.5's advanced cybersecurity capabilities with fewer restrictions. Organizations responsible for defending critical infrastructure can apply to access cyber-permissive models like GPT-5.4-Cyber under strict security requirements. Users can apply at chatgpt.com/cyber to reduce unnecessary refusals while using GPT-5.5 for verified defensive work.
  • Public-sector partnerships. OpenAI is working with government partners to explore how advanced AI can support the defensive work of trusted officials responsible for systems like taxpayer data, the power grid, and water supplies.

On benchmarks:

  • CyberGym: 81.8% for GPT-5.5, 79.0% for GPT-5.4, 73.1% for Claude Opus 4.7.
  • Capture-the-Flags challenge tasks (internal, expansion of the hardest CTFs from system cards with additional hard challenges): 88.1% for GPT-5.5, 83.7% for GPT-5.4.

GPT-5.5 went through OpenAI's full safety and governance process, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and external red-teaming. OpenAI also collected feedback on real use cases from nearly 200 trusted early-access partners before release. More details are in the GPT-5.5 system card.

Long Context and Abstract Reasoning

Two more benchmark categories where the numbers matter:

Long context. GPT-5.5's biggest lifts over GPT-5.4 come at the extreme end of the context range.

  • Graphwalks BFS 1 million: 45.4% for GPT-5.5 versus 9.4% for GPT-5.4. Claude Opus 4.7 sits at 41.2%.
  • OpenAI MRCR v2 8-needle 512K-1M: 74.0% for GPT-5.5 versus 36.6% for GPT-5.4. Claude Opus 4.7 sits at 32.2%.
  • OpenAI MRCR v2 8-needle 256K-512K: 81.5% for GPT-5.5 versus 57.5% for GPT-5.4.
  • OpenAI MRCR v2 8-needle 128K-256K: 87.5% for GPT-5.5 versus 79.3% for GPT-5.4, Claude Opus 4.7 at 59.2%.

Claude Opus 4.7 still leads some shorter-range Graphwalks tests (76.9% on BFS 256K f1 vs GPT-5.5's 73.7%), but past the 128K boundary, GPT-5.5 is visibly ahead.

Abstract reasoning. ARC-AGI is split:

  • ARC-AGI-1 Verified: 95.0% for GPT-5.5, 93.5% for Claude Opus 4.7, 98.0% for Gemini 3.1 Pro.
  • ARC-AGI-2 Verified: 85.0% for GPT-5.5, 75.8% for Claude Opus 4.7, 77.1% for Gemini 3.1 Pro.

The harder ARC-AGI-2 shows a wider gap -- another data point for GPT-5.5's claim to frontier-level reasoning.

Availability, Pricing, and Pro Tier

In ChatGPT. GPT-5.5 Thinking is available today to Plus, Pro, Business, and Enterprise users. GPT-5.5 Pro -- aimed at harder questions and higher-accuracy work -- is available to Pro, Business, and Enterprise users.

In Codex. GPT-5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. Fast mode generates tokens 1.5x faster for 2.5x the cost, for cases where latency is the constraint.

In the API (coming soon). GPT-5.5 will be in the Responses and Chat Completions APIs with a 1M token context window at:

| Tier | Input ($ per 1M tokens) | Output ($ per 1M tokens) |
|---|---|---|
| GPT-5.5 standard | $5 | $30 |
| GPT-5.5 Batch / Flex | $2.50 | $15 |
| GPT-5.5 Priority | $12.50 | $75 |
| GPT-5.5 Pro | $30 | $180 |

Batch and Flex run at half the standard rate. Priority processing runs at 2.5x the standard rate. OpenAI notes that API deployments require different safeguards than ChatGPT and is working with partners on the safety and security requirements for serving at scale.
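
Since the API is not live yet, the model identifier below is an assumption rather than a confirmed ID; the call shape follows the existing OpenAI Python SDK Responses API:

```python
# Hedged sketch: "gpt-5.5" is an assumed model identifier -- the API is
# "coming soon" per the launch post, so check the published model list
# once access opens. The call shape is the current Responses API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.5",  # assumption, not a confirmed identifier
    input="Review this migration plan and flag the riskiest steps: ...",
)
print(response.output_text)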

Cost framing. GPT-5.5 is priced higher than GPT-5.4 at $5 input and $30 output per million tokens, but OpenAI argues it is both more intelligent and more token-efficient. In Codex specifically, the experience is tuned so GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users, while maintaining generous usage across subscription levels. On the Artificial Analysis Intelligence Index, GPT-5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models.
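
A back-of-envelope sketch of how token efficiency interacts with the higher rate -- the token counts here are hypothetical placeholders, not measured figures:

```python
# Tier rates from the launch post ($ per 1M tokens: input, output).
PRICES = {
    "standard": (5.00, 30.00),
    "batch_flex": (2.50, 15.00),   # half the standard rate
    "priority": (12.50, 75.00),    # 2.5x the standard rate
    "pro": (30.00, 180.00),
}

def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at a given tier."""
    in_rate, out_rate = PRICES[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical Codex task: 200K tokens in. If GPT-5.5 finishes in 70K
# output tokens where a less efficient model needs 100K at the same
# rate, the smaller output bill offsets part of the premium pricing.
print(f"${task_cost('standard', 200_000, 70_000):.2f}")   # $3.10
print(f"${task_cost('standard', 200_000, 100_000):.2f}")  # $4.00
```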

How GPT-5.5 Compares to Claude Opus 4.7 and Gemini 3.1 Pro

The published head-to-head numbers, read one eval at a time:

| Eval | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | 68.5% |
| SWE-Bench Pro (Public) | 58.6% | 64.3% | 54.2% |
| GDPval (wins or ties) | 84.9% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 78.0% | -- |
| Toolathlon | 55.6% | -- | 48.8% |
| BrowseComp | 84.4% | 79.3% | 85.9% |
| FrontierMath Tier 1-3 | 51.7% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 22.9% | 16.7% |
| CyberGym | 81.8% | 73.1% | -- |
| GPQA Diamond | 93.6% | 94.2% | 94.3% |
| Humanity's Last Exam (no tools) | 41.4% | 46.9% | 44.4% |
| Humanity's Last Exam (with tools) | 52.2% | 54.7% | 51.4% |
| ARC-AGI-1 (Verified) | 95.0% | 93.5% | 98.0% |
| ARC-AGI-2 (Verified) | 85.0% | 75.8% | 77.1% |
| MCP Atlas | 75.3% | 79.1% | 78.2% |
| MMMU Pro (no tools) | 81.2% | -- | 80.5% |

Where GPT-5.5 wins decisively: agentic CLI coding (Terminal-Bench), long-horizon knowledge work (GDPval), the hardest math (FrontierMath Tier 4), computer use (OSWorld-Verified), tool-rich workflows (Toolathlon), cyber defense capability (CyberGym), and the harder abstract reasoning (ARC-AGI-2).

Where Claude Opus 4.7 still wins: real-world GitHub issue resolution on SWE-Bench Pro (with the caveat that memorization has been observed on this eval), MCP tool use at scale on MCP Atlas, and Humanity's Last Exam with and without tools.

Where Gemini 3.1 Pro still leads: ARC-AGI-1 Verified (98.0%), BrowseComp (85.9%), and GPQA Diamond (94.3%, narrowly).

For most agentic coding and knowledge-work tasks, GPT-5.5 is now the top choice. Claude Opus 4.7 remains the stronger pick where SWE-Bench Pro-style GitHub issue work or MCP-heavy tool flows dominate. Read our Claude Opus 4.7 launch coverage for the other side of that comparison, and our top AI model pick for April 2026 for the current overall recommendation.

What to Try First

Three concrete first sessions that pressure-test what GPT-5.5 is actually good at, based on the strengths OpenAI highlighted:

  1. Re-run your hardest open coding task. Pick a PR or refactor that GPT-5.4 or Claude Opus 4.7 could not finish end-to-end and hand it to GPT-5.5 in Codex. The specific claim to test is the "conceptual clarity" signal: does it understand where a fix belongs and what else in the system will be affected?
  2. Give it a multi-tool knowledge-work arc. Analyze a real dataset, build a scoring framework, and produce a document or spreadsheet that a colleague can review. OpenAI's own Comms and Finance examples are the template -- multi-step work that used to take a human days.
  3. Try GPT-5.5 Pro on a long-form research task. If you have ChatGPT Pro, Business, or Enterprise, load a technical manuscript, a dense PDF, or a research notebook into GPT-5.5 Pro and ask it to critique, propose analyses, or stress-test the argument. Early testers described using it as a research partner rather than an answer engine.

What This Means for the Frontier

The step-up on the hardest coding, knowledge-work, and scientific research benchmarks is large enough to reset the competitive frame for April 2026. Terminal-Bench 2.0 is a 7.6-point jump over GPT-5.4 and a 13.3-point lead over Claude Opus 4.7. GDPval takes a new state-of-the-art at 84.9%. FrontierMath Tier 4 -- the hardest math bucket -- puts GPT-5.5 at 35.4% against Claude Opus 4.7's 22.9%, a 12.5-point gap. The latency preservation and token efficiency make the lift cheap in practice as well as on paper.

Claude Opus 4.7 remains competitive where it already was -- SWE-Bench Pro real-world issue resolution, some long-context evals, and specific knowledge-work niches -- but the frontier-model leaderboard has shifted. GPT-5.5 is the first release since GPT-5 that meaningfully extends OpenAI's lead on agentic coding, computer use, and knowledge work at the same time. If you are choosing one model for most tasks on April 23, 2026, that is the call.

Frequently Asked Questions

What is GPT-5.5?

GPT-5.5 is OpenAI's new flagship model, launched April 23, 2026. It is the smartest model OpenAI has released yet, with strong gains in agentic coding, computer use, knowledge work, and early scientific research. It matches GPT-5.4 per-token latency while using significantly fewer tokens to complete the same Codex tasks.

How is GPT-5.5 different from GPT-5.4?

GPT-5.5 beats GPT-5.4 on every published benchmark in the launch post, including Terminal-Bench 2.0 (82.7% vs 75.1%), Expert-SWE internal (73.1% vs 68.5%), OSWorld-Verified (78.7% vs 75.0%), GDPval wins-or-ties (84.9% vs 83.0%), and FrontierMath Tier 4 (35.4% vs 27.1%). It also performs these tasks with fewer output tokens on Terminal-Bench 2.0 and Expert-SWE, and the same per-token latency as GPT-5.4.

How much does GPT-5.5 cost in the API?

GPT-5.5 will be priced at $5 per million input tokens and $30 per million output tokens in the Responses and Chat Completions APIs, with a 1M token context window. Batch and Flex run at half the standard rate, Priority at 2.5x. GPT-5.5 Pro is $30 input and $180 output per million tokens.

Who can use GPT-5.5 today?

GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex on April 23, 2026. In Codex it is also on Edu and Go plans with a 400K context window plus a Fast mode (1.5x faster, 2.5x cost). GPT-5.5 Pro rolls out to Pro, Business, and Enterprise in ChatGPT. API access follows soon.

How does GPT-5.5 compare to Claude Opus 4.7 and Gemini 3.1 Pro?

GPT-5.5 leads Claude Opus 4.7 on Terminal-Bench 2.0 (82.7% vs 69.4%), GDPval (84.9% vs 80.3%), FrontierMath Tier 4 (35.4% vs 22.9%), and CyberGym (81.8% vs 73.1%). Claude Opus 4.7 still leads on SWE-Bench Pro (64.3% vs 58.6%), MCP Atlas, and Humanity's Last Exam. Against Gemini 3.1 Pro, GPT-5.5 leads most evals; Gemini still wins ARC-AGI-1, BrowseComp, and GPQA Diamond (narrowly).

Get the weekly AI Catchup

Tools, practices, and what matters -- in your inbox every Monday.