OSDI '20 – Twine: A unified cluster management system for shared infrastructure

Twine: A unified cluster management system for shared infrastructure

Chunqiang Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan and Peter Zhang, Facebook Inc.

We introduce Twine, Facebook’s cluster management system, which has been in production for a decade. Twine has helped transform our infrastructure from a collection of silos of custom machines dedicated to individual workloads to a large-scale shared infrastructure with fungible hardware.

Our goal of ubiquitous shared infrastructure leads us to make some decisions that conflict with common practices. For example, instead of implementing an isolated control plane per cluster, Twine scales a single control plane to manage a million machines across all data centers in a geographic region and transparently moves workloads between clusters.

Twine provides workload-specific customizations on shared infrastructure, and this approach deviates further from common practices. The TaskControl API enables an application to work with Twine to handle container lifecycle events, for example by restarting the followers of a ZooKeeper deployment first and the leader last during a rolling upgrade. Host profiles capture hardware and OS settings that workloads can tune to improve performance and reliability; Twine dynamically assigns machines to workloads and switches host profiles accordingly.
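The ZooKeeper example can be made concrete with a small sketch of the ordering decision an application-side controller might make. This is an illustration only: the abstract does not describe the actual TaskControl interface, so `Task`, `plan_restart_batches`, and `max_concurrent` are hypothetical names, and the code simply assumes the application already knows which task is the leader.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """One container in a deployment (hypothetical model, not Twine's)."""
    task_id: str
    role: str  # "leader" or "follower"


def plan_restart_batches(pending: List[Task], max_concurrent: int = 1) -> List[List[Task]]:
    """Group pending restarts into batches: followers first, leaders last.

    Follower batches hold at most `max_concurrent` tasks. Each leader is
    restarted alone, after every follower batch.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    followers = [t for t in pending if t.role != "leader"]
    leaders = [t for t in pending if t.role == "leader"]

    # Chunk the followers so that only a few are down at any one time.
    batches = [
        followers[i:i + max_concurrent]
        for i in range(0, len(followers), max_concurrent)
    ]
    # Leaders go last, one per batch.
    batches.extend([leader] for leader in leaders)
    return batches


if __name__ == "__main__":
    ensemble = [
        Task("zk-0", "follower"),
        Task("zk-1", "leader"),
        Task("zk-2", "follower"),
    ]
    for batch in plan_restart_batches(ensemble):
        print([t.task_id for t in batch])
```

In this sketch the cluster manager would ask the controller which pending restarts may proceed and the controller would answer one batch at a time; how Twine actually exchanges these decisions with applications is not specified in this abstract.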

Finally, we challenge the conventional wisdom that prioritizes stacking workloads on large machines to increase utilization. Instead, we deploy small, low-power machines, each equipped with a single CPU and 64 GB of RAM, throughout our infrastructure to achieve higher performance per watt. We also use autoscaling to improve machine utilization.

We describe the design of Twine and share our experiences migrating Facebook workloads to shared infrastructure.

View the full OSDI '20 program at https://www.usenix.org/conference/osdi20/technical-sessions