Taming CUBIC in QUIC: A Deep Dive into a Congestion Control Bug and Its Fix


Overview

The CUBIC congestion control algorithm, standardized in RFC 9438, is the default in Linux and therefore governs how most TCP and QUIC connections on the public internet probe for bandwidth, react to loss, and recover. Cloudflare's open-source QUIC implementation, quiche, uses CUBIC as its default congestion controller, putting it in the critical path for a significant share of served traffic. This guide tells the story of a bug where CUBIC's congestion window (cwnd) gets permanently pinned at its minimum after a congestion collapse, never recovering. The root cause was a Linux kernel change intended to align CUBIC with the app-limited exclusion described in RFC 9438 §4.2-12 — a valid TCP fix that, when ported to QUIC, exposed unexpected behaviors. The happy ending: an elegant near-one-line fix that broke the cycle.

Source: blog.cloudflare.com

Prerequisites

To follow this tutorial, you should have:

Step-by-Step Instructions

1. Understand CUBIC's Core Logic

Before diving into the bug, it helps to understand how CUBIC works. The central knob is the congestion window (cwnd), a sender-side limit on bytes in flight. CUBIC, like all loss-based algorithms, grows cwnd when the network appears healthy and shrinks it on loss. Its key premise: no loss means the sending rate can increase; loss means capacity was exceeded, so back off. RFC 9438 also describes an app-limited exclusion: if the sender is not fully using the window (for example, because the application has no more data queued), CUBIC should not grow cwnd, since a window the sender never fills has not been tested against the network.
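To make the growth curve concrete, here is a minimal Python sketch of the cubic window function from RFC 9438, in units of segments and seconds. The constants C = 0.4 and beta = 0.7 come from the RFC; the function names and the is_cwnd_limited helper are illustrative assumptions, not quiche's API.

```python
# Sketch of the CUBIC window curve from RFC 9438 (segments, seconds).
# Function names are illustrative; they are not taken from quiche.
C = 0.4     # RFC 9438 scaling constant
BETA = 0.7  # RFC 9438 multiplicative decrease factor


def cubic_k(w_max: float, cwnd_epoch: float) -> float:
    """Seconds needed for the curve to climb from cwnd_epoch back to w_max."""
    return (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)


def w_cubic(t: float, w_max: float, cwnd_epoch: float) -> float:
    """Target window t seconds after the current epoch started."""
    return C * (t - cubic_k(w_max, cwnd_epoch)) ** 3 + w_max


def is_cwnd_limited(bytes_in_flight: int, cwnd_bytes: int) -> bool:
    """Hypothetical helper: True when the window, not the app, is the limit."""
    return bytes_in_flight >= cwnd_bytes


if __name__ == "__main__":
    w_max = 100.0
    cwnd_epoch = w_max * BETA  # window right after a loss event
    for t in (0.0, 2.0, cubic_k(w_max, cwnd_epoch), 6.0):
        print(f"t={t:.2f}s target={w_cubic(t, w_max, cwnd_epoch):.1f}")
```

The curve starts at the post-loss window, flattens as it approaches the previous maximum, then probes beyond it. Everything depends on the elapsed time since the epoch began, which is why the epoch start time matters so much in the steps that follow.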

2. Identify the Symptom: Intermittent Test Failures

Our investigation began with reports of erratic failures in the ingress proxy integration test pipeline. Tests involving heavy early loss in the connection showed that CUBIC's cwnd would never recover from congestion collapse. The test failed 61% of the time — a clear sign of a state machine bug. Most congestion control tests exercise steady-state growth; this one probed the rare but critical minimum-cwnd regime after heavy loss.

3. Trace the Root Cause: App-Limited Exclusion in TCP

A prior Linux kernel change aimed to fix CUBIC's compliance with RFC 9438 by adding app-limited exclusion logic. In TCP, this fix worked fine. However, when quiche ported that same logic to QUIC, it introduced a subtle bug. The bug surfaced because QUIC's loss recovery and acknowledgment semantics differ from TCP's. Specifically, the app-limited exclusion condition reset an internal state variable (epoch_start) at the wrong time, preventing cwnd from ever growing after a collapse.

4. Analyze the Bug: How cwnd Gets Pinned

Let's walk through the bug mechanism step by step:

  1. During heavy loss, CUBIC reduces cwnd to its minimum (typically 2 packets).
  2. The connection enters recovery; new data may be limited because the application hasn't queued more (app-limited).
  3. The app-limited exclusion logic, when triggered, sets epoch_start to the current time, effectively restarting the growth phase.
  4. Because epoch_start keeps getting reset (each time the sender is app-limited during recovery), CUBIC's window growth is constantly restarted — it never accumulates enough time to increase cwnd.
  5. The cwnd remains stuck at the minimum, even after the network recovers.
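The steps above can be reproduced in a toy model. The sketch below is not quiche's code: it assumes every ACK arrives while the sender is app-limited, uses a hypothetical previous maximum of 10 packets, and resets epoch_start on each such ACK as described in step 3.

```python
# Toy model of the pinning mechanism; all names and values are assumptions.
C = 0.4  # RFC 9438 scaling constant


def w_cubic(elapsed, w_max, cwnd_epoch):
    """CUBIC target window `elapsed` seconds into the epoch."""
    k = (max(w_max - cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)
    return C * (elapsed - k) ** 3 + w_max


def simulate_pinned(acks, rtt=0.05, min_cwnd=2.0, w_max=10.0):
    """Replay `acks` ACKs, restarting the epoch on every app-limited ACK."""
    cwnd = min_cwnd
    epoch_start = 0.0
    trace = []
    for i in range(1, acks + 1):
        now = i * rtt
        app_limited = True  # assumption: the app never fills the tiny window
        if app_limited:
            epoch_start = now  # the problematic reset: elapsed time is now 0
        target = w_cubic(now - epoch_start, w_max, min_cwnd)
        cwnd = max(cwnd, target)  # the target never rises above min_cwnd
        trace.append(cwnd)
    return trace


if __name__ == "__main__":
    trace = simulate_pinned(200)
    print(f"cwnd after {len(trace)} ACKs: {trace[-1]:.3f} packets")
```

Because the elapsed time is always zero when the target is computed, the curve is evaluated at its starting point on every ACK, so the window never moves.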

5. Implement the Near-One-Line Fix

The fix was elegant: only apply the app-limited exclusion when the connection is not in a loss recovery state. Adding a single check — if (!in_recovery) before resetting epoch_start — broke the cycle. In code, this might look like:

if (app_limited && !in_recovery) {
    // Apply app-limited exclusion
    epoch_start = now;
}

This ensures that during the critical recovery phase, the CUBIC state machine is not interrupted. Once recovery ends, normal app-limited logic can safely apply.
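To see the effect of the guard, here is a small Python sketch that runs the same toy scenario with and without the recovery check. The CubicState class, the on_ack function, and the guard flag are hypothetical names for illustration; the model assumes the sender stays app-limited and in recovery for the whole run, and it simplifies real recovery behavior considerably.

```python
from dataclasses import dataclass

C = 0.4  # RFC 9438 scaling constant


@dataclass
class CubicState:
    """Hypothetical, heavily simplified controller state (units: packets)."""
    cwnd: float
    w_max: float
    cwnd_epoch: float
    epoch_start: float = 0.0


def on_ack(state, now, app_limited, in_recovery, guard=True):
    """Process one ACK; with guard=True the epoch reset is skipped in recovery."""
    if app_limited and (not guard or not in_recovery):
        state.epoch_start = now
    elapsed = now - state.epoch_start
    k = (max(state.w_max - state.cwnd_epoch, 0.0) / C) ** (1.0 / 3.0)
    target = C * (elapsed - k) ** 3 + state.w_max
    state.cwnd = max(state.cwnd, target)
    return state.cwnd


def run(guard, acks=100, rtt=0.05):
    """Replay ACKs that are all app-limited and in recovery (toy assumption)."""
    state = CubicState(cwnd=2.0, w_max=10.0, cwnd_epoch=2.0)
    for i in range(1, acks + 1):
        on_ack(state, i * rtt, app_limited=True, in_recovery=True, guard=guard)
    return state.cwnd


if __name__ == "__main__":
    print(f"without guard: {run(False):.2f} packets")
    print(f"with guard:    {run(True):.2f} packets")
```

Without the guard the window stays at its minimum; with it, the epoch is allowed to age and the cubic curve carries the window back past the previous maximum.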

6. Verify the Fix

After applying the fix, re-run the integration test with heavy early loss; in our runs the failure rate dropped to zero. Also inspect throughput and cwnd traces to confirm that cwnd recovers after congestion events. Note that ss only reports cwnd for kernel TCP sockets, so for a QUIC connection rely on the implementation's own logging (for example, qlog output) to observe how cwnd evolves.
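A trace check of this kind can be automated. The sketch below assumes you have already extracted (time, cwnd) samples from your logs into a plain list; the sample format, the function name, and the settle_time parameter are assumptions for illustration, not part of any logging tool.

```python
def stuck_at_minimum(samples, min_cwnd, settle_time):
    """Return True if cwnd never rises above min_cwnd after settle_time.

    samples: iterable of (time_seconds, cwnd) pairs (hypothetical format).
    settle_time: time after which the network is assumed to be healthy again.
    """
    late = [cwnd for t, cwnd in samples if t >= settle_time]
    return bool(late) and max(late) <= min_cwnd


if __name__ == "__main__":
    pinned = [(0.0, 10), (1.0, 2), (2.0, 2), (3.0, 2)]
    recovered = [(0.0, 10), (1.0, 2), (2.0, 4), (3.0, 8)]
    print(stuck_at_minimum(pinned, min_cwnd=2, settle_time=1.0))
    print(stuck_at_minimum(recovered, min_cwnd=2, settle_time=1.0))
```

An empty trace is reported as not stuck, so a test built on this helper should separately assert that samples were actually collected.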

Common Mistakes

Mistake 1: Blindly Porting Kernel Code to QUIC

The original Linux kernel fix was correct for TCP, but QUIC's different loss recovery model (for example, monotonically increasing packet numbers instead of byte sequence numbers, and its own acknowledgment semantics) made the app-limited exclusion logic behave differently. Always test edge cases when porting congestion control code.

Mistake 2: Ignoring the Recovery State

It is easy to apply app-limited logic uniformly, without considering whether the connection is in recovery. As this bug shows, that can leave cwnd pinned at its minimum. Decide explicitly which state updates, such as resetting epoch_start, are allowed in each phase.

Mistake 3: Insufficient Testing of Minimum cwnd Regimes

Most tests focus on steady-state throughput. As this bug shows, the minimum cwnd regime is fragile. Incorporate soak tests that simulate severe loss and then clear conditions to verify recovery.
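One way to drive such a test is with a deterministic loss schedule: heavy loss for the first packets, then a clean path. The sketch below is only a schedule generator under assumed names and parameters; wiring it into a network emulator or test harness is left to your own setup.

```python
import random


def loss_schedule(total_packets, lossy_packets, loss_rate, seed=0):
    """Return a list of booleans, True meaning "drop this packet".

    The first `lossy_packets` packets are dropped with probability
    `loss_rate`; every later packet is delivered. A fixed seed keeps the
    schedule reproducible across test runs.
    """
    rng = random.Random(seed)
    return [
        i < lossy_packets and rng.random() < loss_rate
        for i in range(total_packets)
    ]


if __name__ == "__main__":
    schedule = loss_schedule(total_packets=1000, lossy_packets=50, loss_rate=0.9)
    print(f"dropped in lossy phase: {sum(schedule[:50])} of 50")
    print(f"dropped afterwards:     {sum(schedule[50:])}")
```

After the lossy phase ends, the test should assert that cwnd climbs above its minimum within a bounded time rather than only checking final throughput.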

Summary

This tutorial covered the discovery and fix of a CUBIC bug in QUIC where the congestion window got stuck at its minimum after heavy loss. The culprit was app-limited exclusion logic ported from TCP that reset a critical state variable, epoch_start, during recovery. The fix was a single conditional check that skips the exclusion during recovery. Key takeaways: understand protocol differences when porting congestion control code, test recovery scenarios thoroughly, and keep fixes simple. The fix has been merged into quiche and improves resilience for all QUIC traffic.
