Roblox outage

Roblox: Postmortem of the October 28th outage

Luis David Escobedo Velasquez
Feb 23, 2022


Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage.

Issue Summary

The outage was unique in both duration and complexity. The team had to address a number of challenges in sequence to understand the root cause and bring the service back up.

Fifty million players use Roblox every day, and creating the experience our players expect takes hundreds of internal online services. As with any large-scale service, we have service interruptions from time to time, but the 73-hour length of this outage makes it particularly noteworthy.

The outage had two root causes. First, enabling a relatively new streaming feature on Consul under unusually high read and write load led to excessive contention and poor performance. Second, our particular load conditions triggered a pathological performance issue in BoltDB, the open source system that Consul uses to manage write-ahead logs for leader election and data replication.

Timeline

  • Initial Detection (10/28 13:37)
  • Early Triage (10/28 13:37–10/29 02:00)
  • Return to Service Attempt #1 (10/29 02:00–04:00)
  • Return to Service Attempt #2 (10/29 04:00–10/30 02:00)
  • Research Into Contention (10/30 02:00–10/30 12:00)
  • Root Causes Found (10/30 12:00–10/30 20:00)
  • Restoring Caching Service (10/30 20:00–10/31 05:00)
  • The Return of Players (10/31 05:00–10/31 16:00)
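To make the relative length of each phase easier to see, the timestamps above can be turned into durations. The following is a minimal sketch, not part of the original postmortem: the year 2021 is an assumption (the timeline gives only month and day), and `parse` and `phase_hours` are illustrative helper names.

```python
from datetime import datetime

# Assumption: the timeline lists only month/day; the outage took place in 2021.
YEAR = 2021

# (phase, start, end) copied from the timeline above, formatted "MM/DD HH:MM".
PHASES = [
    ("Early Triage", "10/28 13:37", "10/29 02:00"),
    ("Return to Service Attempt #1", "10/29 02:00", "10/29 04:00"),
    ("Return to Service Attempt #2", "10/29 04:00", "10/30 02:00"),
    ("Research Into Contention", "10/30 02:00", "10/30 12:00"),
    ("Root Causes Found", "10/30 12:00", "10/30 20:00"),
    ("Restoring Caching Service", "10/30 20:00", "10/31 05:00"),
    ("The Return of Players", "10/31 05:00", "10/31 16:00"),
]


def parse(stamp):
    """Parse a 'MM/DD HH:MM' timestamp into a naive datetime in YEAR."""
    return datetime.strptime(f"{YEAR}/{stamp}", "%Y/%m/%d %H:%M")


def phase_hours(phases):
    """Return a list of (phase name, duration in hours) pairs."""
    return [
        (name, (parse(end) - parse(start)).total_seconds() / 3600)
        for name, start, end in phases
    ]


if __name__ == "__main__":
    durations = phase_hours(PHASES)
    for name, hours in durations:
        print(f"{name:<30} {hours:5.1f} h")
    print(f"{'Total of listed phases':<30} {sum(h for _, h in durations):5.1f} h")
```

By this calculation the second return-to-service attempt is the longest phase at 22 hours. The listed phases sum to a little over 74 hours, so the phase boundaries as written do not reproduce the 73-hour headline figure exactly.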

Root cause and resolution

Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%.

The system had worked well with streaming at this level for a day before the incident started, so it wasn’t initially clear why its performance had changed. However, through analysis of perf reports and flame graphs from Consul servers, we saw evidence that streaming code paths were responsible for the contention causing high CPU usage. We disabled the streaming feature for all Consul systems, including the traffic routing nodes. The config change finished propagating at 15:51 on October 30, at which time the 50th percentile for Consul KV writes dropped to 300 ms. We finally had a breakthrough.
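For readers unfamiliar with the metric, the 50th percentile (p50) is the median write latency: half of the writes complete faster and half slower. The sketch below shows one way to compute it. The sample values are synthetic and purely illustrative, not measurements from the incident, and `p50_ms` is a hypothetical helper name; only the 300 ms figure comes from the text above.

```python
from statistics import median


def p50_ms(latencies_ms):
    """Return the 50th percentile (median) of latencies given in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return median(latencies_ms)


# Synthetic samples for illustration only; NOT data from the outage.
degraded_writes_ms = [1800, 2100, 2500, 1900, 3000]
recovered_writes_ms = [280, 310, 300, 260, 340]

if __name__ == "__main__":
    print("p50 while degraded:", p50_ms(degraded_writes_ms), "ms")
    print("p50 after change:  ", p50_ms(recovered_writes_ms), "ms")
```

The median is a common choice for this kind of check because a handful of very slow writes does not move it much, so a sustained drop in p50 indicates that typical writes, not just outliers, have become faster.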

The team worked through the night to identify and address these issues, ensure cache systems were properly deployed, and verify correctness. At 05:00 on October 31, 61 hours after the start of the outage, we had a healthy Consul cluster and a healthy caching system. We were ready to bring up the rest of Roblox.

Corrective and preventative measures

It has been 2.5 months since the outage. What have we been up to? We used this time to learn as much as we could from the outage, to adjust engineering priorities based on what we learned, and to aggressively harden our systems. One of our Roblox values is Respect The Community, and while we could have issued a post sooner to explain what happened, we felt we owed it to you, our community, to make significant progress on improving the reliability of our systems before publishing.

The full list of completed and in-flight reliability improvements is too long and too detailed for this write-up, but here are the key items:

  • Telemetry Improvements
  • Expansion Into Multiple Availability Zones and Data Centers
  • Consul Upgrades and Sharding
  • Improvements To Bootstrapping Procedures and Config Management
  • Reintroduction of Streaming

We have learned a tremendous amount from this experience, and we are more committed than ever to making Roblox a stronger and more reliable platform going forward.

Thank you again.

At 16:35, the number of online players dropped to 50% of normal.
