Anthropic released Claude Haiku 4.5 on Wednesday, a smaller and significantly cheaper artificial intelligence model that matches the coding capabilities of systems that were considered cutting-edge just months ago, marking the latest salvo in an intensifying competition to dominate enterprise AI.
The model costs $1 per million input tokens and $5 per million output tokens — roughly one-third the price of Anthropic's mid-sized Sonnet 4 model released in May, while operating more than twice as fast. In certain tasks, particularly operating computers autonomously, Haiku 4.5 actually surpasses its more expensive predecessor.
"Haiku 4.5 is a clear leap in performance and is now largely as smart as Sonnet 4 while being significantly faster and one-third of the cost," an Anthropic spokesperson told VentureBeat, underscoring how rapidly AI capabilities are becoming commoditized as the technology matures.
The launch comes just two weeks after Anthropic released Claude Sonnet 4.5, which the company bills as the world's best coding model, and two months after introducing Opus 4.1. The breakneck pace of releases reflects mounting pressure from OpenAI, whose $500 billion valuation dwarfs Anthropic's $183 billion, and which has inked a series of multibillion-dollar infrastructure deals while expanding its product lineup.
In an unusual move that could reshape competitive dynamics in the AI market, Anthropic is making Haiku 4.5 available for all free users of its Claude.ai platform. The decision effectively democratizes access to what the company characterizes as "near-frontier-level intelligence" — capabilities that would have been available only in expensive, premium models months ago.
"The launch of Claude Haiku 4.5 means that near-frontier-level intelligence is available for free to all users through Claude.ai," the Anthropic spokesperson told VentureBeat. "It also offers significant advantages to our enterprise customers: Sonnet 4.5 can handle frontier planning while Haiku 4.5 powers sub-agents, enabling multi-agent systems that tackle complex refactors, migrations, and large features builds with speed and quality."
This multi-agent architecture signals a significant shift in how AI systems are deployed. Rather than relying on a single, monolithic model, enterprises can now orchestrate teams of specialized AI agents: a more sophisticated Sonnet 4.5 model breaking down complex problems and delegating subtasks to multiple Haiku 4.5 agents working in parallel. For software development teams, this could mean Sonnet 4.5 plans a major code refactoring while Haiku 4.5 agents simultaneously execute changes across dozens of files.
The approach mirrors how human organizations distribute work, and could prove particularly valuable for enterprises seeking to balance performance with cost efficiency — a critical consideration as AI deployment scales.
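In practice, this planner-worker pattern maps onto a few lines of orchestration code. The sketch below is illustrative only: it assumes Anthropic's official Python SDK and plausible model aliases, and a production system would parse the plan into structured subtasks and run the Haiku calls concurrently.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Planner: one call to the larger model decomposes the job.
# Model aliases are assumptions based on Anthropic's naming scheme.
plan = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "List, one per line, the independent subtasks for "
                   "renaming module `utils` to `core` across this repo.",
    }],
)

# Workers: one fast, cheap Haiku call per subtask.
subtasks = [t for t in plan.content[0].text.splitlines() if t.strip()]
results = [
    client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    for task in subtasks
]
```

The economics favor exactly this division of labor: the expensive model is called once per job, while the volume of work lands on the model priced at a third of the cost.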
The model launch coincides with revelations that Anthropic's business is experiencing explosive growth. The company's annual revenue run rate is approaching $7 billion this month, Anthropic told Reuters, up from more than $5 billion reported in August. Internal projections obtained by Reuters suggest the company is targeting between $20 billion and $26 billion in annualized revenue for 2026, representing growth of more than 200% to nearly 300%.
The company now serves more than 300,000 business customers, with enterprise products accounting for approximately 80% of revenue. Among Anthropic's most successful offerings is Claude Code, a code-generation tool that has reached nearly $1 billion in annualized revenue since launching earlier this year.
Those numbers come as artificial intelligence enters what many in the industry characterize as a critical inflection point. After two years of what Anthropic Chief Product Officer Mike Krieger recently described as "AI FOMO" — where companies adopted AI tools without clear success metrics — enterprises are now demanding measurable returns on investment.
"The best products can be grounded in some kind of success metric or evaluation," Krieger said on the "Superhuman AI" podcast. "I've seen that a lot in talking to companies that are deploying AI."
For enterprises evaluating AI tools, the calculus increasingly centers on concrete productivity gains. Google CEO Sundar Pichai claimed in June that AI had generated a 10% boost in engineering velocity at his company — though measuring such improvements across different roles and use cases remains challenging, as Krieger acknowledged.
Anthropic's launch comes amid heightened scrutiny of the company's approach to AI safety and regulation. On Tuesday, David Sacks, the White House's AI "czar" and a venture capitalist, accused Anthropic of "running a sophisticated regulatory capture strategy based on fear-mongering" that is "damaging the startup ecosystem."
The attack targeted remarks by Jack Clark, Anthropic's British co-founder and head of policy, who had described being "deeply afraid" of AI's trajectory. Clark told Bloomberg he found Sacks' criticism "perplexing."
Anthropic addressed such concerns head-on in its release materials, emphasizing that Haiku 4.5 underwent extensive safety testing. The company classified the model as ASL-2 — its AI Safety Level 2 standard — compared to the more restrictive ASL-3 designation for the more powerful Sonnet 4.5 and Opus 4.1 models.
"Our teams have red-teamed and tested our agentic capabilities to the limits in order to assess whether it can be used to engage in harmful activity like generating misinformation or promoting fraudulent behavior like scams," the spokesperson told VentureBeat. "In our automated alignment assessment, it showed a statistically significantly lower overall rate of misaligned behaviors than both Claude Sonnet 4.5 and Claude Opus 4.1 — making it, by this metric, our safest model yet."
The company said its safety testing showed Haiku 4.5 poses only limited risks regarding the production of chemical, biological, radiological and nuclear weapons. Anthropic has also implemented classifiers designed to detect and filter prompt injection attacks, a common method for attempting to manipulate AI systems into producing harmful content.
The emphasis on safety reflects Anthropic's founding mission. The company was established in 2021 by former OpenAI executives, including siblings Dario and Daniela Amodei, who left amid concerns about OpenAI's direction following its partnership with Microsoft. Anthropic has positioned itself as taking a more cautious, research-oriented approach to AI development.
According to Anthropic's benchmarks, Haiku 4.5 performs competitively with or exceeds several larger models across multiple evaluation criteria. On SWE-bench Verified, a widely used test measuring AI systems' ability to solve real-world software engineering problems, Haiku 4.5 scored 73.3% — slightly ahead of Sonnet 4's 72.7% and close to GPT-5 Codex's 74.5%.
The model demonstrated particular strength in computer use tasks, achieving 50.7% on the OSWorld benchmark compared to Sonnet 4's 42.2%. This capability allows the AI to interact directly with computer interfaces — clicking buttons, filling forms, navigating applications — which could prove transformative for automating routine digital tasks.
In coding-specific benchmarks like Terminal-Bench, which tests AI agents' ability to complete complex software tasks using command-line tools, Haiku 4.5 scored 41.0%, trailing only Sonnet 4.5's 50.0% among Claude models.
The model maintains a 200,000-token context window for standard users, with developers accessing the Claude Developer Platform able to use a 1-million-token context window. That expanded capacity means the model can process extremely large codebases or documents in a single request — roughly equivalent to a 1,500-page book.
When asked about the rapid succession of model releases, the Anthropic spokesperson emphasized the company's focus on execution rather than competitive positioning.
"We're focused on shipping the best possible products for our customers — and our shipping velocity speaks for itself," the spokesperson said. "What was state-of-the-art just five months ago is now faster, cheaper, and more accessible."
That velocity stands in contrast to the company's earlier, more measured release schedule. Anthropic appeared to have paused development of its Haiku line after releasing version 3.5 at the end of last year, leading some observers to speculate the company had deprioritized smaller models.
That rapid price-performance improvement validates a core promise of artificial intelligence: that capabilities will become dramatically cheaper over time as the technology matures and companies optimize their models. For enterprises, it suggests that today's budget constraints around AI deployment may ease considerably in coming years.
The practical applications of Haiku 4.5 span a wide range of enterprise functions, from customer service to financial analysis to software development. The model's combination of speed and intelligence makes it particularly suited for real-time, low-latency tasks like chatbot conversations and customer support interactions, where delays of even a few seconds can degrade user experience.
In financial services, the multi-agent architecture enabled by pairing Sonnet 4.5 with Haiku 4.5 could transform how firms monitor markets and manage risk. Anthropic envisions Haiku 4.5 monitoring thousands of data streams simultaneously — tracking regulatory changes, market signals and portfolio risks — while Sonnet 4.5 handles complex predictive modeling and strategic analysis.
For research organizations, the division of labor could compress timelines dramatically. Sonnet 4.5 might orchestrate a comprehensive analysis while multiple Haiku 4.5 agents parallelize literature reviews, data gathering and document synthesis across dozens of sources, potentially "compressing weeks of research into hours," according to Anthropic's use case descriptions.
Several companies have already integrated Haiku 4.5 and reported positive results. Guy Gur-Ari, co-founder of coding startup Augment, said the model "hit a sweet spot we didn't think was possible: near-frontier coding quality with blazing speed and cost efficiency." In Augment's internal testing, Haiku 4.5 achieved 90% of Sonnet 4.5's performance while matching much larger models.
Jeff Wang, CEO of Windsurf, another coding-focused startup, said Haiku 4.5 "is blurring the lines" on traditional trade-offs between speed, cost and quality. "It's a fast frontier model that keeps costs efficient and signals where this class of models is headed."
Jon Noronha, co-founder of presentation software company Gamma, reported that Haiku 4.5 "outperformed our current models on instruction-following for slide text generation, achieving 65% accuracy versus 44% from our premium tier model — that's a game-changer for our unit economics."
For enterprises evaluating AI strategies, Haiku 4.5 presents both opportunity and challenge. The opportunity lies in accessing sophisticated AI capabilities at dramatically lower costs, potentially making viable entire categories of applications that were previously too expensive to deploy at scale.
The challenge is keeping pace with a technology landscape that is evolving faster than most organizations can absorb. As Krieger noted in his recent podcast appearance, companies are moving beyond "AI FOMO" to demand concrete metrics and demonstrated value. But establishing those metrics and evaluation frameworks takes time — time that may be in short supply as competitors race ahead.
The shift from single-model deployments to multi-agent architectures also requires new ways of thinking about AI systems. Rather than viewing AI as a monolithic assistant, enterprises must learn to orchestrate multiple specialized agents, each optimized for particular tasks — more akin to managing a team than operating a tool.
The fundamental economics of AI are shifting with remarkable speed. Five months ago, Sonnet 4's capabilities commanded premium pricing and represented the cutting edge. Today, Haiku 4.5 delivers similar performance at a third of the cost. If that trajectory continues — and both Anthropic's release schedule and competitive pressure from OpenAI and Google suggest it will — the AI capabilities that seem remarkable today may be routine and inexpensive within a year.
For Anthropic, the challenge will be translating technical achievements into sustainable business growth while maintaining the safety-focused approach that differentiates it from competitors. The company's projected revenue growth to as much as $26 billion by 2026 suggests strong market traction, but achieving those targets will require continued innovation and successful execution across an increasingly complex product portfolio.
Whether enterprises will choose Claude over increasingly capable alternatives from OpenAI, Google and a growing field of competitors remains an open question. But Anthropic is making a clear bet: that the future of AI belongs not to whoever builds the single most powerful model, but to whoever can deliver the right intelligence, at the right speed, at the right price — and make it accessible to everyone.
In an industry where the promise of artificial intelligence has long outpaced reality, Anthropic is betting that delivering on that promise, faster and cheaper than anyone expected, will be enough to win. And with pricing dropping by two-thirds in just five months while performance holds steady, that promise is starting to look like reality.
As expected after days of leaks and rumors online, Google has unveiled Veo 3.1, its latest AI video generation model, bringing a suite of creative and technical upgrades aimed at improving narrative control, audio integration, and realism in AI-generated video.
While the updates expand possibilities for hobbyists and content creators using Google’s online AI creation app, Flow, the release also signals a growing opportunity for enterprises, developers, and creative teams seeking scalable, customizable video tools.
The quality is higher, the physics better, the pricing the same as before, and the control and editing features more robust and varied.
My initial tests showed it to be a powerful and performant model that immediately delights with each generation. However, the look is more cinematic, polished and a little more "artificial" by default than rivals such as OpenAI's new Sora 2, released late last month, which may or may not be what a particular user is going after (Sora excels at handheld and "candid" style videos).
Veo 3.1 builds on its predecessor, Veo 3 (released back in May 2025), with enhanced support for dialogue, ambient sound, and other audio effects.
Native audio generation is now available across several key features in Flow, including “Frames to Video,” “Ingredients to Video,” and “Extend,” which give users the ability to, respectively: turn still images into video; use items, characters and objects from multiple images in a single video; and extend clips beyond the initial 8 seconds, to more than 30 seconds, or even past a minute when continuing from a prior clip's final frame.
Before, you had to add audio manually after using these features.
This addition gives users greater command over tone, emotion, and storytelling — capabilities that have previously required post-production work.
In enterprise contexts, this level of control may reduce the need for separate audio pipelines, offering an integrated way to create training content, marketing videos, or digital experiences with synchronized sound and visuals.
Google noted in a blog post that the updates reflect user feedback calling for deeper artistic control and improved audio support. Gallegos emphasized the importance of making edits and refinements possible directly in Flow, without reworking scenes from scratch.
With Veo 3.1, Google introduces support for multiple input types and more granular control over generated outputs. The model accepts text prompts, images, and video clips as input, and also supports:
Reference images (up to three) to guide appearance and style in the final output
First and last frame interpolation to generate seamless scenes between fixed endpoints
Scene extension that continues a video’s action or motion beyond its current duration
These tools aim to give enterprise users a way to fine-tune the look and feel of their content—useful for brand consistency or adherence to creative briefs.
Additional capabilities like “Insert” (add objects to scenes) and “Remove” (delete elements or characters) are also being introduced, though not all are immediately available through the Gemini API.
Veo 3.1 is accessible through several of Google’s existing AI services:
Flow, Google’s own interface for AI-assisted filmmaking
Gemini API, targeted at developers building video capabilities into applications
Vertex AI, where enterprise integration will soon support Veo’s “Scene Extension” and other key features
Availability through these platforms allows enterprise customers to choose the right environment—GUI-based or programmatic—based on their teams and workflows.
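For teams taking the programmatic route, Veo generation through the Gemini API follows the SDK's long-running-operation pattern: submit a request, poll until it finishes, then download the file. A minimal sketch with the google-genai Python SDK follows; the Veo 3.1 model ID is an assumption (preview names change), and config fields should be checked against the current API reference.

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model ID
    prompt="A barista pours latte art in a sunlit cafe, ambient chatter",
    config=types.GenerateVideosConfig(number_of_videos=1),
)

# Video generation is asynchronous; poll the operation until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("clip.mp4")
```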
The Veo 3.1 model is currently in preview and available only on the paid tier of the Gemini API. The cost structure is the same as Veo 3, the preceding generation of AI video models from Google.
Standard model: $0.40 per second of video
Fast model: $0.15 per second
There is no free tier, and users are charged only if a video is successfully generated. This pricing model is consistent with previous Veo versions and provides predictable costs for budget-conscious enterprise teams.
Veo 3.1 outputs video at 720p or 1080p resolution, with a 24 fps frame rate.
Duration options include 4, 6, or 8 seconds from a text prompt or uploaded images, with the ability to extend videos up to 148 seconds (nearly two and a half minutes) when using the “Extend” feature.
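At those per-second rates, clip costs are easy to estimate from the published prices alone:

```python
# Back-of-envelope Veo 3.1 costs at the published per-second rates.
STANDARD, FAST = 0.40, 0.15  # dollars per second of generated video

for seconds in (8, 30, 148):  # base clip, short extension, maximum extension
    print(f"{seconds:>3}s: standard ${seconds * STANDARD:.2f}, fast ${seconds * FAST:.2f}")
# 8s: $3.20 / $1.20   30s: $12.00 / $4.50   148s: $59.20 / $22.20
```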
New functionality also includes tighter control over subjects and environments. For example, enterprises can upload a product image or visual reference, and Veo 3.1 will generate scenes that preserve its appearance and stylistic cues across the video. This could streamline creative production pipelines for retail, advertising, and virtual content production teams.
The broader creator and developer community has responded to Veo 3.1’s launch with a mix of optimism and tempered critique—particularly when comparing it to rival models like OpenAI’s Sora 2.
Matt Shumer, founder of OthersideAI (maker of HyperWrite) and an early adopter, described his initial reaction as “disappointment,” noting that Veo 3.1 is “noticeably worse than Sora 2” and also “quite a bit more expensive.”
However, he acknowledged that Google’s tooling—such as support for references and scene extension—is a bright spot in the release.
Travis Davids, a 3D digital artist and AI content creator, echoed some of that sentiment. While he noted improvements in audio quality, particularly in sound effects and dialogue, he raised concerns about limitations that remain in the system.
These include the lack of custom voice support, an inability to select generated voices directly, and the continued cap at 8-second generations—despite some public claims about longer outputs.
Davids also pointed out that character consistency across changing camera angles still requires careful prompting, whereas other models like Sora 2 handle this more automatically. He questioned the absence of 1080p resolution for users on paid tiers like Flow Pro and expressed skepticism over feature parity.
On the more positive end, @kimmonismus, an AI newsletter writer, stated that “Veo 3.1 is amazing,” though still concluded that OpenAI’s latest model remains preferable overall.
Collectively, these early impressions suggest that while Veo 3.1 delivers meaningful tooling enhancements and new creative control features, expectations have shifted as competitors raise the bar on both quality and usability.
Since launching Flow five months ago, Google says over 275 million videos have been generated across various Veo models.
The pace of adoption suggests significant interest not only from individuals but also from developers and businesses experimenting with automated content creation.
Thomas Iljic, Director of Product Management at Google Labs, highlights that Veo 3.1’s release brings capabilities closer to how human filmmakers plan and shoot. These include scene composition, continuity across shots, and coordinated audio—all areas that enterprises increasingly look to automate or streamline.
Videos generated with Veo 3.1 are watermarked using Google’s SynthID technology, which embeds an imperceptible identifier to signal that the content is AI-generated.
Google applies safety filters and moderation across its APIs to help minimize privacy and copyright risks. Generated content is stored temporarily and deleted after two days unless downloaded.
For developers and enterprises, these features provide reassurance around provenance and compliance—critical in regulated or brand-sensitive industries.
Veo 3.1 is not just an iteration on prior models—it represents a deeper integration of multimodal inputs, storytelling control, and enterprise-level tooling. While creative professionals may see immediate benefits in editing workflows and fidelity, businesses exploring automation in training, advertising, or virtual experiences may find even greater value in the model’s composability and API support.
The early user feedback highlights that while Veo 3.1 offers valuable tooling, expectations around realism, voice control, and generation length are evolving rapidly. As Google expands access through Vertex AI and continues refining Veo, its competitive positioning in enterprise video generation will hinge on how quickly these user pain points are addressed.
The Dfinity Foundation on Wednesday released Caffeine, an artificial intelligence platform that allows users to build and deploy web applications through natural language conversation alone, bypassing traditional coding entirely. The system, now publicly available, represents a fundamental departure from existing AI coding assistants by building applications on a specialized decentralized infrastructure designed specifically for autonomous AI development.
Unlike GitHub Copilot, Cursor, or other "vibe coding" tools that help human developers write code faster, Caffeine positions itself as a complete replacement for technical teams. Users describe what they want in plain language, and an ensemble of AI models writes, deploys, and continually updates production-grade applications — with no human intervention in the codebase itself.
"In the future, you as a prospective app owner or service owner… will talk to AI. AI will give you what you want on a URL," said Dominic Williams, founder and chief scientist at the Dfinity Foundation, in an exclusive interview with VentureBeat. "You will use that, completely interact productively, and you'll just keep talking to AI to evolve what that does. The AI, or an ensemble of AIs, will be your tech team."
The platform has attracted significant early interest: more than 15,000 alpha users tested Caffeine before its public release, with daily active users representing 26% of those who received access codes — "early Facebook kind of levels," according to Williams. The foundation reports some users spending entire days building applications on the platform, forcing Dfinity to consider usage limits due to underlying AI infrastructure costs.
Caffeine's most significant technical claim addresses a problem that has plagued AI-generated code: data loss during application updates. The platform builds applications using Motoko, a programming language developed by Dfinity specifically for AI use, which provides mathematical guarantees that upgrades cannot accidentally delete user data.
"When AI is updating apps and services in production, a mistake cannot lose data. That's a guarantee," Williams said. "It's not like there are some safeguards to try and stop it losing data. This language framework gives it rails that guarantee if an upgrade, an update to its app's underlying logic, would cause data loss, the upgrade fails and the AI just tries again."
This addresses what Williams characterizes as critical failures in competing platforms. User forums for tools like Lovable and Replit, he notes, frequently report three major problems: applications that become irreparably broken as complexity increases, security vulnerabilities that allow unauthorized access, and mysterious data loss during updates.
Traditional tech stacks evolved to meet human developer needs — familiarity with SQL databases, preference for known programming languages, existing skill investments. "That's how the traditional tech stacks evolved. It's really evolved to meet human needs," Williams explained. "But in the future, it's going to be different. You're not going to care how the AI did it. Instead, for you, AI is the tech stack."
Caffeine's architecture reflects this philosophy. Applications run entirely on the Internet Computer Protocol (ICP), a blockchain-based network that Dfinity launched in May 2021 after raising over $100 million from investors including Andreessen Horowitz and Polychain Capital. The ICP uses what Dfinity calls "chain-key cryptography" to create what Williams describes as "tamper-proof" code — applications that are mathematically guaranteed to execute their written logic without interference from traditional cyberattacks.
"The code can't be affected by ransomware, so you don't have to worry about malware in the same way you do," Williams said. "Configuration errors don't result in traditional cyber attacks. That passive traditional cyber attacks isn't something you need to worry about."
At the heart of Caffeine's technical approach is a concept called "orthogonal persistence," which fundamentally reimagines how applications store and manage data. In traditional development, programmers must write extensive code to move data between application logic and separate database systems — marshaling data in and out of SQL servers, managing connections, handling synchronization.
Motoko eliminates this entirely. Williams demonstrated with a simple example: defining a blog post data type and declaring a variable to store an array of posts requires just two lines of code. "This declaration is all that's necessary to have the blog maintain its list of posts," he explained during a presentation on the technology. "Compare that to traditional IT where in order to persist the blog posts, you'd have to marshal them in and out of a database server. This is quite literally orders of magnitude more simple."
This abstraction allows AI to work at a higher conceptual level, focusing on application logic rather than infrastructure plumbing. "Logic and data are kind of the same," Williams said. "This is one of the things that enables AI to build far more complicated functionality than it could otherwise do."
The system also employs what Dfinity calls "loss-safe data migration." When AI needs to modify an application's data structure — adding a "likes" field to blog posts, for example — it must write migration logic in two passes. The framework automatically verifies that the transformation won't result in data loss, refusing to compile or deploy code that could delete information unless explicitly instructed.
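Motoko enforces that guarantee at the compiler level, but the idea can be illustrated with a conceptual sketch (an analogy in Python, not Dfinity's actual implementation): a migration is accepted only if every existing record and field is carried forward or explicitly marked as dropped.

```python
# Conceptual analogy of loss-safe migration, not Dfinity's actual code.

def add_likes(old_posts: list[dict]) -> list[dict]:
    # Add a "likes" field while preserving every existing field.
    return [{**post, "likes": 0} for post in old_posts]

def checked_migration(old, migrate, explicitly_dropped=frozenset()):
    new = migrate(old)
    if len(new) < len(old):
        raise RuntimeError("upgrade refused: records would be lost")
    for before, after in zip(old, new):
        lost = set(before) - set(after) - explicitly_dropped
        if lost:  # refuse the deploy rather than silently lose data
            raise RuntimeError(f"upgrade refused: fields would be lost: {lost}")
    return new

posts = [{"title": "Hello", "body": "First post"}]
posts = checked_migration(posts, add_likes)  # passes: nothing is lost
```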
Williams positions Caffeine as particularly transformative for enterprise IT, where he claims costs could fall to "1% of what they were before" while time-to-market shrinks to similar fractions. The platform targets a spectrum from individual creators to large corporations, all of whom currently face either expensive development teams or constraining low-code templates.
"A corporation or government department might want to create a corporate portal or CRM, ERP functionality," Williams said, referring to customer relationship management and enterprise resource planning systems. "They will otherwise have to obtain this by signing up for some incredibly expensive SaaS service where they become locked in, their data gets stuck, and they still have to spend a lot of money on consultants customizing the functionality."
Applications built through Caffeine are owned entirely by their creators and cannot be shut down by centralized parties — a consequence of running on the decentralized Internet Computer network rather than traditional cloud providers like Amazon Web Services. "When someone says built on the internet computer, it actually means built on the internet computer," Williams emphasized, contrasting this with blockchain projects that merely host tokens while running actual applications on centralized infrastructure.
The platform demonstrated this versatility during a July 2025 hackathon in San Francisco, where participants created applications ranging from a "Will Maker" tool for generating legal documents, to "Blue Lens," a voice-AI water quality monitoring system, to "Road Patrol," a gamified community reporting app for infrastructure problems. Critically, many of these came from non-technical participants with no coding background.
"I'm from a non-technical background, I'm actually a quality assurance professional," said the creator of Blue Lens in a video testimonial. "Through Caffeine I can build something really intuitive and next-gen to the public." The application integrated multiple external services — Eleven Labs for voice AI, real-time government water data through retrieval-augmented generation, and Midjourney-generated visual assets — all coordinated through conversational prompts.
Caffeine enters a crowded market of AI-assisted development tools, but Williams argues the competition isn't truly comparable. GitHub Copilot, Cursor, and similar tools serve human developers working with traditional technology stacks. Platforms like Replit and Lovable occupy a middle ground, offering "vibe coding" that mixes AI generation with human editing.
"If you're a Node.js developer, you know you're working with the traditional stack, and you might want to do your coding with Copilot or using Claude or using Cursor," Williams said. "That's a very different thing to what Caffeine is offering. There'll always be cases where you probably wouldn't want to hand over the logic of the control system for a new nuclear missile silo to AI. But there's going to be these holdout areas, right? And there's all the legacy stuff that has to be maintained."
The key distinction, according to Williams, lies in production readiness. Existing AI coding tools excel at rapid prototyping but stumble when applications grow complex or require guaranteed reliability. Reddit forums for these platforms document users hitting insurmountable walls where applications break irreparably, or where AI-generated code introduces security vulnerabilities.
"As the demands and the requirements become more complicated, eventually you can hit a limit, and when you hit that limit, not only can you not go any further, but sometimes your app will get broken and there's no way of going back to where you were before," Williams said. "That can't happen with productive apps, and it also can't be the case that you're getting hacked and losing data, because once you go hands-free, if you like, and there's no tech team, there's no technical people involved, who's going to run the backups and restore your app?"
The Internet Computer's architecture addresses this through Byzantine fault tolerance — even if attackers gain physical control over some network hardware, they cannot corrupt applications or their data. "This is the beginning of a compute revolution and it's also the perfect platform for AI to build on," Williams said.
Dfinity frames Caffeine within a broader vision it calls the "self-writing internet," where the web literally programs itself through natural language interaction. This represents what Williams describes as a "seismic shift coming to tech" — from human developers selecting technology stacks based on their existing skills, to AI selecting optimal implementations invisible to users.
"You don't care about whether some human being has learned all of the different platforms and Amazon Web Services or something like that. You don't care about that. You just care: Is it secure? Do you get security guarantees? Is it resilient? What's the level of resilience?" Williams said. "Those are the new parameters."
Williams has demonstrated the capability live, including at the World Computer Summit 2025 in Zurich, where he created a talent recruitment application from scratch in under two minutes, then modified it in real-time while the application ran with users already interacting with it. "You will continue talking to the AI and just keep on refreshing the URL to see the changes," he explained.
This capability extends to complex scenarios. During demonstrations, Williams showed building a tennis lesson booking system, an e-commerce platform, and an event registration system — all simultaneously, working on multiple applications in parallel. "We predict that as people get very proficient with Caffeine, they could be working on even 10 apps in parallel," he said.
The system writes substantial code: a simple personal blog generated 700 lines of code in a couple of minutes. More complex applications can involve thousands of lines across frontend and backend components, all abstracted away from the user who only describes desired functionality.
Caffeine's economic model differs fundamentally from traditional software-as-a-service platforms. Applications run on the Internet Computer Protocol, which uses a "reverse gas model" where developers pay for computation rather than users paying transaction fees. The platform includes an integrated App Market where creators can publish applications for others to clone and adapt — creating what Dfinity envisions as a new economic ecosystem.
"App stores today obviously operate on gatekeeping," said Pierre Samaties, chief business officer at Dfinity, during the World Computer Summit. "That's going to erode." Rather than purchasing applications, users can clone them and modify them for their own purposes — fundamentally different from Apple's App Store or Google Play models.
Williams acknowledges that Caffeine itself currently runs on centralized infrastructure, despite building applications on the decentralized Internet Computer. "Caffeine itself actually is centralized. It uses aspects of the Internet Computer. We want Caffeine itself to run on the Internet Computer in the future, but it's not there now," he said. The platform leverages commercially available foundation models from companies like Anthropic, whose Claude Sonnet model powers much of Caffeine's backend logic.
This pragmatic approach reflects Dfinity's strategy of using best-in-class AI models while focusing its own development on the specialized infrastructure and programming language designed for AI use. "These content models have been developed by companies with enormous budgets, absolutely enormous budgets," Williams said. "I don't think in the near future we'll run AI on the Internet Computer for that reason, unless there's a special case."
The Dfinity Foundation has pursued this vision since Williams began researching decentralized networks in late 2013. After involvement with Ethereum before its 2015 launch, Williams became fascinated with the concept of a "world computer"—a public blockchain network that could host not just tokens but entire applications and services.
"By 2015 I was talking about network-focused drivers, Dfinity back then, and that could really operate as an alternative tech stack, and eventually host even things like social networks and massive enterprise systems," Williams said. The foundation launched the Internet Computer Protocol in May 2021, initially focusing on Web3 developers. Despite not being among the highest-valued blockchain projects, ICP consistently ranks in the top 10 for developer numbers.
The pivot to AI-driven development came from recognizing that "in the future, the tech stack will be AI," according to Williams. This realization led to Caffeine's development, announced on Dfinity's public roadmap in March 2025 and demonstrated at the World Computer Summit in June 2025.
One successful example of the Dfinity vision running in production is OpenChat, a messaging application that runs entirely on the Internet Computer and is governed by a decentralized autonomous organization (DAO) with tens of thousands of participants voting on source code updates through algorithmic governance. "The community is actually controlling the source code updates," Williams explained. "Developers propose updates, community reads the updates, and if the community is happy, OpenChat updates itself."
The platform faces several challenges. Dfinity's crypto industry roots may create perception problems in enterprise markets, Williams acknowledges. "The Web3 industry's reputation is a bit tarnished and probably rightfully so," he said during the World Computer Summit. "Now people can, for themselves, experience what a decentralized network is. We're going to see self-writing take over the enterprise space because the speed and efficiency are just incredible."
The foundation's history includes controversy: ICP's token launched in 2021 at over $100 per token with an all-time high around $700, then crashed below $3 in 2023 before recovering. The project has faced legal challenges, including class action lawsuits alleging misleading investors, and Dfinity filed defamation claims against industry critics.
Technical limitations also remain. Caffeine cannot yet compile React front-ends on the Internet Computer itself, requiring some off-chain processing. Complex integrations with traditional systems — payment processing through Stripe, for example — still require centralized components. "Your app is running end-to-end on the Internet Computer, then when it needs to actually accept payment, it's going to hand over to your Stripe account," Williams explained.
The platform's claims about data loss prevention and security guarantees, while technically grounded in the Motoko language design and Internet Computer architecture, remain to be tested at scale with diverse real-world applications. The 26% daily active user rate from alpha testing is impressive but comes from a self-selected group of early adopters.
Williams rejects concerns that AI-driven development will eliminate software engineering jobs, arguing instead for market expansion. "The self-writing internet empowers eight billion non-technical people," he said. "Some of these people will enter roles in tech, becoming prompt engineers, tech entrepreneurs, or helping run online communities. Humanity will create millions of new custom apps and services, and a subset of those will require professional human assistance."
During his World Computer Summit demonstration, Williams was explicit about the scale of transformation Dfinity envisions. "Today there are about 35,000 Web3 engineers in the world. Worldwide there are about 15 million full-stack engineers," he said. "But tomorrow with the self-writing internet, everyone will be a builder. Today there are already about five billion people with internet-connected smartphones and they'll all be able to use Caffeine."
The hackathon results suggest this isn't pure hyperbole. A dentist built "Dental Tracks" to help patients manage their dental records. A transportation industry professional created "Road Patrol" for gamified infrastructure reporting. A frustrated knitting student built "Skill Sprout," a garden-themed app for learning new hobbies, complete with material checklists and step-by-step skill breakdowns—all without writing a single line of code.
"I was learning to knit. I got irritated because I had the wrong materials," the creator explained in a video interview. "I don't know how to do the stitches, so I have to individually search, and it's really intimidating when you're trying to learn something you don't—you don't even know what you don't know."
Whether Caffeine succeeds depends on factors still unknown: how production applications perform under real-world stress, whether the Internet Computer scales to millions of applications, whether enterprises can overcome their skepticism of blockchain-adjacent technology. But if Williams is right about the fundamental shift — that AI will be the tech stack, not just a tool for human developers — then someone will build what Caffeine promises.
The question isn't whether the future looks like this. It's who gets there first, and whether they can do it without losing everyone's data along the way.
2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. And it has been, in many ways, with numerous leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba releasing fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.
But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over many steps. Third-party benchmark tests show that even the most powerful AI models experience higher failure rates the more steps they take to complete a task and the longer they spend on it (stretching into hours).
A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents — without the need for manual data labeling or retraining.
Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.
EAGLET is a fine-tuned language model that interprets task instructions — typically provided as prompts by the user or the agent's operating environment — and generates a high-level plan for the agent (powered by its own LLM). It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.
Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.
EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.
Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.
The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.
These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.
In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.
One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).
This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.
It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
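Read together, the two stages can be summarized in a short sketch. The interfaces below (a `rollout` function returning a success score and a step count) are hypothetical stand-ins, since the authors have not released reference code; the structure follows the paper's description.

```python
def homologous_consensus_filter(candidates, expert, novice, rollout):
    """Stage 1 (sketch): keep only synthetic plans that improve task
    performance for BOTH a strong and a weak executor, relative to
    each executor's no-plan baseline."""
    kept = []
    for task, plan in candidates:
        helps_all = True
        for agent in (expert, novice):
            base_score, _ = rollout(agent, task, plan=None)
            plan_score, _ = rollout(agent, task, plan=plan)
            if plan_score <= base_score:
                helps_all = False
                break
        if helps_all:
            kept.append((task, plan))
    return kept

def ecgr(task, plan, executors, rollout, gamma=0.9):
    """Stage 2 (sketch): Executor Capability Gain Reward. A plan scores
    higher when executors of mixed ability succeed with it, and the
    gamma decay favors shorter trajectories."""
    total = 0.0
    for agent in executors:
        score, steps = rollout(agent, task, plan=plan)
        total += score * (gamma ** steps)
    return total / len(executors)
```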
The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.
In evaluations, the planner boosted performance across a variety of foundational models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.
It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.
EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.
Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.
In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.
On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.
In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.
Even stronger gains were seen with more capable models.
For instance, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already being strong performers.
In some benchmarks, performance gains were as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.
Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.
Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.
Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.
This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.
As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.
VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.
While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.
Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.
EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.
Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.
For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.
For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.
Visa is introducing a new security framework designed to solve one of the thorniest problems emerging in artificial intelligence-powered commerce: how retailers can tell the difference between legitimate AI shopping assistants and the malicious bots that plague their websites.
The payments giant unveiled its Trusted Agent Protocol on Tuesday, establishing what it describes as foundational infrastructure for "agentic commerce" — a term for the rapidly growing practice of consumers delegating shopping tasks to AI agents that can search products, compare prices, and complete purchases autonomously.
The protocol enables merchants to cryptographically verify that an AI agent browsing their site is authorized and trustworthy, rather than a bot designed to scrape pricing data, test stolen credit cards, or carry out other fraudulent activities.
The launch comes as AI-driven traffic to U.S. retail websites has exploded by more than 4,700% over the past year, according to data from Adobe cited by Visa. That dramatic surge has created an acute challenge for merchants whose existing bot detection systems — designed to block automated traffic — now risk accidentally blocking legitimate AI shoppers along with bad actors.
"Merchants need additional tools that provide them with greater insight and transparency into agentic commerce activities to ensure they can participate safely," said Rubail Birwadker, Visa's Global Head of Growth, in an exclusive interview with VentureBeat. "Without common standards, potential risks include ecosystem fragmentation and the proliferation of closed loop models."
The stakes are substantial. While 85% of shoppers who have used AI to shop report improved experiences, merchants face the prospect of either turning away legitimate AI-powered customers or exposing themselves to sophisticated bot attacks. Visa's own data shows the company prevented $40 billion in fraudulent activity between October 2022 and September 2023, nearly double the previous year, much of it involving AI-powered enumeration attacks where bots systematically test combinations of card numbers until finding valid credentials.
Visa's Trusted Agent Protocol operates through what Birwadker describes as a "cryptographic trust handshake" between merchants and approved AI agents. The system works in three steps:
First, AI agents must be approved and onboarded through Visa's Intelligent Commerce program, where they undergo vetting to meet trust and reliability standards. Each approved agent receives a unique digital signature key — essentially a cryptographic credential that proves its identity.
When an approved agent visits a merchant's website, it creates a digital signature using its key and transmits three categories of information: Agent Intent (indicating the agent is trusted and intends to retrieve product details or make a purchase), Consumer Recognition (data showing whether the underlying consumer has an existing account with the merchant), and Payment Information (optional payment data to support checkout).
Merchants or their infrastructure providers, such as content delivery networks, then validate these digital signatures against Visa's registry of approved agents. "Upon proper validation of these fields, the merchant can confirm the signature is a trusted agent," Birwadker explained.
Crucially, Visa designed the protocol to require minimal changes to existing merchant infrastructure. Built on the HTTP Message Signature standard and aligned with Web Bot Auth, the protocol works with existing web infrastructure without requiring merchants to overhaul their checkout pages. "This is no-code functionality," Birwadker emphasized, though merchants may need to integrate with Visa's Developer Center to access the verification system.
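The HTTP Message Signature standard (RFC 9421) that underpins this works by signing a canonical "signature base" built from selected request fields, which the receiver reconstructs and verifies. The sketch below shows the general shape in Python with an Ed25519 key; the covered fields, the `agent-intent` header, and the key ID are illustrative assumptions, not Visa's published schema.

```python
import base64
import time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Stand-in for the signing credential an agent would receive at onboarding.
agent_key = Ed25519PrivateKey.generate()

# RFC 9421-style signature base over the covered request components.
created = int(time.time())
params = f'("@method" "@authority" "agent-intent");created={created};keyid="agent-0042"'
signature_base = "\n".join([
    '"@method": GET',
    '"@authority": shop.example.com',
    '"agent-intent": retrieve-product-details',
    f'"@signature-params": {params}',
])

sig = base64.b64encode(agent_key.sign(signature_base.encode())).decode()
headers = {
    "Signature-Input": f"visa={params}",
    "Signature": f"visa=:{sig}:",
}
# The merchant (or its CDN) rebuilds the same signature base, looks up
# "agent-0042" in the registry of approved agents, and verifies the signature.
```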
Visa developed the protocol in collaboration with Cloudflare, the web infrastructure and security company that already provides bot management services to millions of websites. The partnership reflects Visa's recognition that solving bot verification requires cooperation across the entire web stack, not just the payments layer.
"Trusted Agent Protocol supplements traditional bot management by providing merchants insights that enable agentic commerce," Birwadker said. "Agents are providing additional context they otherwise would not, including what it intends to do, who the underlying consumer is, and payment information."
The protocol arrives as multiple technology giants race to establish competing standards for AI commerce. Google recently introduced its Agent Protocol for Payments (AP2), while OpenAI and Stripe have discussed their own approaches to enabling AI agents to make purchases. Microsoft, Shopify, Adyen, Ant International, Checkout.com, Cybersource, Elavon, Fiserv, Nuvei, and Worldpay provided feedback during Trusted Agent Protocol's development, according to Visa.
When asked how Visa's protocol relates to these competing efforts, Birwadker struck a collaborative tone. "Both Google's AP2 and Visa's Trusted Agent Protocol are working toward the same goal of building trust in agent-initiated payments," he said. "We are engaged with Google, OpenAI, and Stripe and are looking to create compatibility across the ecosystem."
Visa says it is working with global standards bodies including the Internet Engineering Task Force (IETF), OpenID Foundation, and EMVCo to ensure the protocol can eventually become interoperable with other emerging standards. "While these specifications apply to the Visa network in this initial phase, enabling agents to safely and securely act on a consumer's behalf requires an open, ecosystem-wide approach," Birwadker noted.
The protocol raises important questions about authorization and liability when AI agents make purchases on behalf of consumers. If an agent completes an unauthorized transaction — perhaps misunderstanding a user's intent or exceeding its delegated authority — who bears responsibility?
Birwadker emphasized that the protocol helps merchants "leverage this information to enable experiences tied to existing consumer relationships and more secure checkout," but he did not provide specific details about how disputes would be handled when agents make unauthorized purchases. Visa's existing fraud protection and chargeback systems would presumably apply, though the company has not yet published detailed guidance on agent-initiated transaction disputes.
The protocol also places Visa in the position of gatekeeper for the emerging agentic commerce ecosystem. Because Visa determines which AI agents get approved for the Intelligent Commerce program and receive cryptographic credentials, the company effectively controls which agents merchants can easily trust. "Agents are approved and onboarded through the Visa Intelligent Commerce program, ensuring they meet our standards for trust and reliability," Birwadker said, though he did not detail the specific criteria agents must meet or whether Visa charges fees for approval.
This gatekeeping role could prove contentious, particularly if Visa's approval process favors large technology companies over startups, or if the company faces pressure to block agents from competitors or politically controversial entities. Visa declined to provide details about how many agents it has approved so far or how long the vetting process typically takes.
The protocol launch comes at a complex moment for Visa, which continues to navigate significant legal and regulatory challenges even as its core business remains robust. The company's latest earnings report for the third quarter of fiscal year 2025 showed a 10% increase in net revenues to $9.2 billion, driven by resilient consumer spending and strong growth in cross-border transaction volume. For the full fiscal year ending September 30, 2024, Visa processed 289 billion transactions, with a total payments volume of $15.2 trillion.
However, the company's legal headwinds have intensified. In July 2025, a federal judge rejected a landmark $30 billion settlement that Visa and Mastercard had reached with merchants over long-disputed credit card swipe fees, sending the parties back to the negotiating table and extending the long-running legal battle.
Simultaneously, Visa remains under investigation by the Department of Justice over its rules for routing debit card transactions, with regulators scrutinizing whether the company's practices unlawfully limit merchant choice and stifle competition. These domestic challenges are mirrored abroad, where European regulators have continued their own antitrust investigations into the fee structures of both Visa and its primary competitor, Mastercard.
Against this backdrop of regulatory pressure, Birwadker acknowledged that adoption of the Trusted Agent Protocol will take time. "As agentic commerce continues to rise, we recognize that consumer trust is still in its early stages," he said. "That's why our focus through 2025 is on building foundational credibility and demonstrating real-world value."
The protocol is available immediately in Visa's Developer Center and on GitHub, with agent onboarding already active and merchant integration resources available. But Birwadker declined to provide specific targets for how many merchants might adopt the protocol by the end of 2026. "Adoption is aligned with the momentum we're already seeing," he said. "The launch of our protocol marks another big step — it's not just a technical milestone, but a signal that the industry is beginning to unify."
Industry analysts say merchant adoption will likely depend on how quickly agentic commerce grows as a percentage of overall e-commerce. While AI-driven traffic has surged dramatically, much of that consists of agents browsing and researching rather than completing purchases. If AI agents begin accounting for a significant share of completed transactions, merchants will face stronger incentives to adopt verification systems like Visa's protocol.
Visa's move reflects broader strategic bets on AI across the financial services industry. The company has invested $10 billion in technology over the past five years to reduce fraud and increase network security, with AI and machine learning central to those efforts. Visa's fraud detection system analyzes over 500 different attributes for each transaction, using AI models to assign real-time risk scores to the 300 billion annual transactions flowing through its network.
"Every single one of those transactions has been processed by AI," James Mirfin, Visa's global head of risk and identity solutions, said in a July 2024 CNBC interview discussing the company's fraud prevention efforts. "If you see a new type of fraud happening, our model will see that, it will catch it, it will score those transactions as high risk and then our customers can decide not to approve those transactions."
The company has also moved aggressively into new payment territories beyond its core card business. In January 2025, Visa partnered with Elon Musk's X (formerly Twitter) to provide the infrastructure for a digital wallet and peer-to-peer payment service called the X Money Account, competing with services like Venmo and Zelle. That deal marked Visa's first major partnership in the social media payments space and reflected the company's recognition that payment flows are increasingly happening outside traditional e-commerce channels.
The agentic commerce protocol represents an extension of this strategy — an attempt to ensure Visa remains central to payment flows even as the mechanics of shopping shift from direct human interaction to AI intermediation. Jack Forestell, Visa's Chief Product & Strategy Officer, framed the protocol in expansive terms: "We believe the entire payments ecosystem has a responsibility to ensure sellers trust AI agents with the same confidence they place in their most valued customers and networks."
The real test for Visa's protocol won't be technical — it will be political. As AI agents become a larger force in retail, whoever controls the verification infrastructure controls access to hundreds of billions of dollars in commerce. Visa's position as gatekeeper gives it enormous leverage, but also makes it a target.
Merchants chafing under Visa's existing fee structure and facing multiple antitrust investigations may resist ceding even more power to the payments giant. Competitors like Google and OpenAI, each with their own ambitions in commerce, have little incentive to let Visa dictate standards. Regulators already scrutinizing Visa's market dominance will surely examine whether its agent approval process unfairly advantages certain players.
And there's a deeper question lurking beneath the technical specifications and corporate partnerships: In an economy increasingly mediated by AI, who decides which algorithms get to spend our money? Visa is making an aggressive bid to be that arbiter, wrapping its answer in the language of security and interoperability. Whether merchants, consumers, and regulators accept that proposition will determine not just the fate of the Trusted Agent Protocol, but the structure of AI-powered commerce itself.
For now, Visa is moving forward with the confidence of a company that has weathered disruption before. But in the emerging world of agentic commerce, being too trusted might prove just as dangerous as not being trusted enough.