Challenges to Achieving High Availability at Scale
Facebook is a social network that connects more than 1.8 billion people. To serve these many users requires infrastructure which is composed of thousands of interdependent systems that span geographically distributed data centers. But what is the guiding principle for building and operating these systems?
For Facebook’s infrastructure teams the answer is: Systems must always be available and never lose data. This talk will explore this quest. We will focus on three aspects.
Availability and consistency. What form of consistency do Facebook’s systems guarantee? Strong consistency makes understanding easy but has latency penalties, weak consistency is fast but difficult to reason for developers and users. We describe our usage of eventual consistency and delve into how Facebook constructs its caching and replicated storage systems to minimize the duration for achieving consistency. We share empirical data that measures the effectiveness of our design.
Availability and correctness. With network partitions, relaxed forms of consistency, and software bugs, how do we guarantee a consistent state? We present two systems to find and repair structural errors in Facebook’s social graph, one batch and one real-time.
Availability and scale. Sharding is one of the standard answers to operate at scale. But how can we develop one system that can shard storage as well as compute? We will introduce a new Sharding-as-a-Service component. We will show and evaluate how its design and service policies control for latency, failure tolerance and operationally efficiency.