Lunch Seminar Talk by Marko Vukolic, IBM Research: XFT: Practical System Reliability Beyond Crashes

27.02.2015 12:15

Marko Vukolic, IBM Research Rüschlikon

Title: XFT: Practical System Reliability Beyond Crashes

CAB E 72


The number of nines of reliability is possibly the key characteristic of a modern distributed system. As an illustration, a 5-nines reliability of a system means that the system is up and running 99.999% of the time. Increasing the number of nines of reliability is a difficult and fundamental problem in distributed computing in the face of different machine and network faults.
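To make the "nines" arithmetic concrete, here is a minimal sketch that converts a number of nines into the downtime it permits per year. The function names and the choice of a 365.25-day year are assumptions made for this illustration; they do not come from the talk.

```python
# Illustrative sketch: translate "N nines" of reliability into allowed downtime.
# Assumption: one year is taken as 365.25 days (an average calendar year).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def availability(nines: int) -> float:
    """Fraction of time the system is up, e.g. 5 nines -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum downtime per year, in seconds, for the given number of nines."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 7):
        minutes = downtime_seconds_per_year(n) / 60.0
        print(f"{n} nines: up {availability(n) * 100:.4f}% of the time, "
              f"at most {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leave roughly 5.26 minutes of downtime per year, and each additional nine cuts that budget by a factor of ten.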

State-of-the-art distributed systems that power modern clouds guarantee several nines of reliability by employing sophisticated distributed protocols that tolerate network faults (partitions) and machine crash faults. However, in modern large-scale cloud systems, tolerating only machines and components that crash is simply not sufficient. Machines tend to fail in complex ways, and non-crash faults (e.g., bugs, data corruption, hardware errors) are becoming a real problem. Byzantine fault tolerance (BFT), which promises to handle such faults, remains practically useless even after 30 years of research because of its resource and operational costs. These costs stem from BFT assuming a practically irrelevant adversary that can control both the non-crash-faulty machines and the entire network at will.

In this talk we propose XFT (as in "cross fault tolerance"), a novel and disruptive approach to building reliable distributed systems that decouples faults *across* the machine-fault and network-fault dimensions, treating the two kinds of faults independently. As the showcase for XFT, we present Paxos++, the first state-machine replication protocol in the XFT model. Paxos++ efficiently tolerates faults beyond crashes and offers many more nines of reliability than the celebrated Paxos protocol, without degrading performance or increasing resource and operational costs.