Hello, dear reader. Everyone working in large-scale infrastructure (e.g. as an SRE) has experienced this situation: the system gets overloaded due to one (or, in the worst case, several in a row) of the following conditions:
- Traffic Spike (e.g. Christmas Traffic)
- Datacenter failure (OVH, I am looking at you)
- Bug in the code, taking down the application (e.g. Cloudflare's regex issue in their web application firewall)
In all of these cases, the system gets overloaded and either degrades in performance or goes down entirely. Managing the infrastructure for some of the biggest sites on the web, Facebook/Meta runs into these issues more than almost anyone, and they came up with a clever solution: on-demand off switches for features, both on the server side and in the clients.
This means that when the systems are overloaded, they have easy knobs to turn that degrade the user experience but keep the system up (e.g. turning off comments on posts, which is bad for users but a lot better than the entire system going down).
Exactly this system (called “Defcon”) is explained in this week's paper. Definitely an inspiration for all SREs and infra people out there.
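To make the idea of a feature knob a bit more concrete, here is a minimal sketch in Python. It is not Defcon's actual interface: the `KnobRegistry` class, the numeric overload levels, and the `render_post` helper are all assumptions made up for illustration. The only thing taken from the description above is the idea that operators can switch off a less-critical feature (comments) while the core feature (the post itself) keeps working.

```python
class KnobRegistry:
    """Hypothetical in-memory registry of feature knobs (not Defcon's real API)."""

    def __init__(self):
        # Maps feature name -> overload level at which the feature is switched off.
        self._disable_at = {}
        # Assumed convention: 0 = normal operation, higher = more severe overload.
        self.level = 0

    def register(self, name, disable_at_level):
        """Declare that `name` may be turned off once the level reaches the threshold."""
        self._disable_at[name] = disable_at_level

    def set_level(self, level):
        """Called by an operator (or automation) when the system is overloaded."""
        self.level = level

    def is_enabled(self, name):
        """Features without a registered knob are never degraded."""
        threshold = self._disable_at.get(name)
        return threshold is None or self.level < threshold


def render_post(knobs, post):
    """Always serve the post body; serve comments only while their knob is on."""
    page = {"body": post["body"]}
    if knobs.is_enabled("comments"):
        page["comments"] = post["comments"]
    return page


knobs = KnobRegistry()
knobs.register("comments", disable_at_level=2)
post = {"body": "Hello world", "comments": ["First!"]}

normal_page = render_post(knobs, post)    # includes the comments
knobs.set_level(2)                        # simulated overload
degraded_page = render_post(knobs, post)  # body only, comments dropped
```

In a real deployment the current level would come from a shared configuration service rather than a local variable, so that a single change reaches all servers and clients.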
Abstract:
Every day, billions of people depend on Internet services for communication, commerce, and entertainment. Yet planetary-scale data center infrastructures consisting of millions of servers experience unplanned capacity outages and unexpected demand for resources; how can such infrastructures remain reliable in the face of capacity and workload flux? In this paper, we introduce Defcon, a system for improving the availability of large-scale, globally-distributed Internet services using graceful feature degradation. In response to overload conditions, Defcon enables site operators to gradually disable less-critical features in order to reduce resource demand. Defcon presents a common interface to product developers to define feature knobs that represent degradation capabilities. Defcon automatically tests knobs to understand each knob’s product- and infrastructure-level trade-offs. At Meta, we have used Defcon to improve global product availability in the face of worldwide demand-surges in addition to large-scale infrastructure failures.
Download Link:
https://www.usenix.org/system/files/osdi23-meza.pdf
Additional Links:
- YouTube Video Showcasing and Explaining the Paper