How the big boys run their infrastructure

Hello great person,

Sometime during the week I heard the name Borg mentioned as the system Google uses to run its clusters. As it shares its name with my favorite backup tool, I was keen to learn more about it. This week's paper showcases the design and the thinking that went into building Borg. I kinda wondered whether it is still in use (the paper is from 2015, which is the stone age in computer time), but according to an article by The Register (Q2 2020), it still is.

This paper also showed me that I should learn more about cgroups and chroot; some links on these two Linux features are included in the links section below.

Abstract:

Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

Download Link:

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43438.pdf

Additional Links:

chroot on Wikipedia
cgroups on Wikipedia
cgroups blog post by grant.pizza (what a great name for a blog :D)