Inside Code Orange: Fail Small – How Cloudflare Strengthened Its Network

Cloudflare recently completed a major internal engineering initiative called "Code Orange: Fail Small," aimed at making its infrastructure more resilient, secure, and reliable. Over the past two quarters, the team focused on safer configuration changes, reducing failure impact, revising emergency procedures, improving customer communication, and preventing long-term drift. Below, we answer key questions about what was done, why it mattered, and what it means for you.

What was Code Orange: Fail Small?

Code Orange: Fail Small was an intensive engineering effort spanning more than two quarters, designed to prevent outages like those on November 18 and December 5, 2025. The project centered on making configuration changes safer, limiting the blast radius of failures, and improving operational procedures. By adopting health-mediated deployment for configurations, similar to how software releases are rolled out, Cloudflare can now catch problems early and automatically roll back changes before they impact customers. The initiative also introduced new tooling such as Snapstone, revised "break glass" protocols, and strengthened incident management and customer communication.

Source: blog.cloudflare.com

Why was Code Orange: Fail Small necessary?

The effort was prompted by two global outages in late 2025 that affected many customers. These outages highlighted weaknesses in how configuration changes were deployed across Cloudflare’s network. Previously, high-risk changes could reach the entire network instantly, with limited ability to detect and revert issues before they caused widespread disruption. Code Orange: Fail Small was created to systematically address these vulnerabilities by implementing progressive rollouts, real-time health monitoring, and automated rollbacks—ensuring that a failure in one part of the network does not cascade into a global incident.

How did Cloudflare make configuration changes safer?

Cloudflare now uses health-mediated deployment for all configuration changes that affect customer traffic. Instead of an instant global update, changes are grouped into packages and gradually released across the network. Real-time observability tools monitor health signals, and if a problem is detected, the system automatically reverts the change. This approach was previously difficult to apply to configurations because it required custom work per team. The new Snapstone component provides a unified platform that brings progressive rollout, health monitoring, and auto‑rollback to configuration deployments by default. Teams can dynamically define any unit of configuration—such as data files or control flags—that needs health mediation.
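
The rollout loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Cloudflare's implementation: the function name `progressive_rollout`, the stage names, the `apply` and `is_healthy` callbacks, and the simulated failure on `eu-2` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def progressive_rollout(
    stages: List[List[str]],
    apply: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    new_version: str,
    old_version: str,
) -> bool:
    """Apply new_version stage by stage; revert every touched location on a bad signal."""
    touched: List[str] = []
    for stage in stages:
        for location in stage:
            apply(location, new_version)
            touched.append(location)
        # Check every location changed so far before widening the rollout.
        if not all(is_healthy(location) for location in touched):
            for location in reversed(touched):
                apply(location, old_version)
            return False
    return True


# Simulated network (hypothetical names): location -> deployed config version.
network: Dict[str, str] = {
    name: "v1" for name in ["canary", "eu-1", "eu-2", "us-1", "us-2"]
}
history: List[Tuple[str, str]] = []


def apply(location: str, version: str) -> None:
    network[location] = version
    history.append((location, version))


def is_healthy(location: str) -> bool:
    # Assumption for the demo: version "v2" misbehaves only on "eu-2".
    return not (network[location] == "v2" and location == "eu-2")


stages = [["canary"], ["eu-1", "eu-2"], ["us-1", "us-2"]]
ok = progressive_rollout(stages, apply, is_healthy, "v2", "v1")
```

In this simulation the change passes the canary stage, fails its health check in the second stage, and is reverted before the third stage is ever touched. That is the "fail small" property: the fault is contained to the locations already updated.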

What is Snapstone and how does it work?

Snapstone is an internal system built to apply health-mediated deployment to configuration changes. It bundles configuration updates into packages and releases them gradually while monitoring health metrics in real time. If the package causes degradation, Snapstone automatically rolls back the change. Its flexibility allows teams to define exactly which configuration units require mediation—whether it’s a data file (like the one in the November outage) or a control flag (like in the December outage). Before Snapstone, teams had to build their own solutions, leading to inconsistent adoption. Snapstone makes safer deployments the default across Cloudflare’s entire network.
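
Snapstone's interface is not public, so the following is only a hypothetical sketch of the idea that teams declare configuration units (a data file or a control flag) and that a package remembers what it replaced so it can be rolled back. The names `ConfigUnit`, `Package`, `build_package`, `rules.dat`, and `strict_parsing` are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class UnitKind(Enum):
    DATA_FILE = "data_file"
    CONTROL_FLAG = "control_flag"


@dataclass(frozen=True)
class ConfigUnit:
    """One unit of configuration a team has opted into health mediation."""
    name: str
    kind: UnitKind
    value: str


@dataclass
class Package:
    """A bundle of units plus the values they replace, kept for rollback."""
    units: List[ConfigUnit]
    previous: Dict[str, str] = field(default_factory=dict)


def build_package(live: Dict[str, str], changes: List[ConfigUnit]) -> Package:
    # Snapshot the current value of every unit the package will overwrite.
    previous = {u.name: live[u.name] for u in changes if u.name in live}
    return Package(units=list(changes), previous=previous)


def apply_package(live: Dict[str, str], package: Package) -> None:
    for unit in package.units:
        live[unit.name] = unit.value


def rollback_package(live: Dict[str, str], package: Package) -> None:
    for unit in package.units:
        if unit.name in package.previous:
            live[unit.name] = package.previous[unit.name]
        else:
            # The unit did not exist before the package, so remove it.
            live.pop(unit.name, None)


live_config = {"rules.dat": "rev-41", "strict_parsing": "off"}
before = dict(live_config)

package = build_package(
    live_config,
    [
        ConfigUnit("rules.dat", UnitKind.DATA_FILE, "rev-42"),
        ConfigUnit("strict_parsing", UnitKind.CONTROL_FLAG, "on"),
        ConfigUnit("new_limit", UnitKind.CONTROL_FLAG, "250"),
    ],
)
apply_package(live_config, package)
after_apply = dict(live_config)
rollback_package(live_config, package)
```

Treating a data file and a flag as the same kind of object is the design point here. Once both are "units in a package", a single rollout and rollback path can serve every team, so nobody has to build their own.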

What changes were made to break glass procedures?

Cloudflare revised its "break glass" protocols—emergency procedures that allow operators to bypass normal safety controls during critical incidents. The goal was to ensure that even in urgent situations, changes are still subject to some level of oversight and can be easily reversed. The updated procedures include clear criteria for when break glass can be used, mandatory logging of actions, and predefined rollback steps. Additionally, the incident management process was strengthened to improve coordination and decision‑making during outages. These changes help prevent emergency actions from introducing new risks or making outages worse.
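
The three elements named above (clear criteria, mandatory logging, predefined rollback steps) can be illustrated with a small guard function. This is a sketch under stated assumptions, not Cloudflare's tooling: the allowed reasons, the `break_glass` function, and the `AuditEntry` record are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical criteria under which bypassing normal controls is permitted.
ALLOWED_REASONS = {"active_incident", "safety_system_down"}


class BreakGlassDenied(Exception):
    """Raised when the stated reason does not meet the criteria."""


@dataclass
class AuditEntry:
    operator: str
    reason: str
    action: str
    timestamp: float


def break_glass(
    operator: str,
    reason: str,
    action_name: str,
    action: Callable[[], None],
    rollback: Callable[[], None],
    audit_log: List[AuditEntry],
) -> None:
    """Run an emergency action only with a valid reason, a log entry, and a rollback."""
    if reason not in ALLOWED_REASONS:
        raise BreakGlassDenied(f"reason {reason!r} does not permit break glass")
    # Log before acting, so the record exists even if the action crashes.
    audit_log.append(AuditEntry(operator, reason, action_name, time.time()))
    try:
        action()
    except Exception:
        # The rollback step is supplied up front, so it is always available here.
        rollback()
        audit_log.append(
            AuditEntry(operator, reason, f"rollback:{action_name}", time.time())
        )
        raise


audit_log: List[AuditEntry] = []
state = {"rate_limit": 100}

break_glass(
    operator="alice",
    reason="active_incident",
    action_name="raise_rate_limit",
    action=lambda: state.update(rate_limit=500),
    rollback=lambda: state.update(rate_limit=100),
    audit_log=audit_log,
)
```

Requiring the rollback callable as an argument means an emergency change cannot be submitted without a way to undo it, which mirrors the intent of predefined rollback steps.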

How did Cloudflare improve outage communication?

Cloudflare recognized that clear, timely communication during incidents is crucial for customer trust. As part of Code Orange: Fail Small, the company improved how it notifies customers during outages. This includes faster status page updates, more detailed root-cause explanations, and regular progress reports. It also refined internal escalation processes so that customer-facing teams receive accurate information as quickly as possible. These changes do not prevent outages by themselves, but when incidents do occur, they keep customers informed in a transparent and helpful manner.

What does Code Orange: Fail Small mean for customers?

For most customers, the main visible change should be increased reliability. Configuration changes that previously might have caused global issues are now rolled out progressively with automated safeguards. This is intended to mean fewer unexpected outages and faster recovery when problems do occur. Improved communication also means customers receive clearer, more frequent updates during incidents. Improving resiliency is an ongoing process, but the completion of Code Orange: Fail Small represents a significant step forward in making Cloudflare’s infrastructure stronger for everyone.
