NetPilot: Automating Datacenter Network Failure Mitigation


http://research.microsoft.com/en-us/people/mzh/netpilot.pdf SIGCOMM’12, cited 30+

resolving failures still requires significant human interventions … NetPilot aims to quickly mitigate rather than resolve failures.

NetPilot mitigates failures in much the same way operators do — by deactivating or restarting suspected offending components… circumvents the need for knowing the exact root cause of a failure by taking an intelligent trial-and-error approach …

The core of NetPilot is comprised of an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials.
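The trial-and-error idea described above can be illustrated with a minimal sketch. This is not NetPilot's implementation: the `Action` type, the `mitigate` function, the callback names, and the `max_impact` threshold are all hypothetical, and ordering the trials by estimated impact is an assumption standing in for the paper's failure-specific planner. The sketch shows only the general shape: discard candidate actions whose estimated impact is too high, try the rest one at a time, and roll back any action that does not clear the symptoms.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Action:
    """A candidate mitigation (hypothetical type, not from the paper)."""
    kind: str        # e.g. "deactivate" or "restart"
    component: str   # hypothetical device or link identifier


def mitigate(
    candidates: Iterable[Action],
    estimate_impact: Callable[[Action], float],
    apply: Callable[[Action], None],
    rollback: Callable[[Action], None],
    failure_persists: Callable[[], bool],
    max_impact: float,
) -> Optional[Action]:
    """Try acceptable actions in order of increasing estimated impact.

    Returns the first action that clears the symptoms, or None if no
    acceptable action works. The ordering is an assumption of this sketch.
    """
    scored = [(estimate_impact(a), a) for a in candidates]
    # Guard: never try an action estimated to be too disruptive.
    safe = [(impact, a) for impact, a in scored if impact <= max_impact]
    for _, action in sorted(safe, key=lambda pair: pair[0]):
        apply(action)
        if not failure_persists():
            return action      # symptoms gone: keep this mitigation
        rollback(action)       # no effect: undo before the next trial
    return None
```

A caller would supply its own impact estimate and health check. Because every unsuccessful trial is rolled back, at most one mitigation action is in effect when the function returns.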

For instance, in April 2011, a failure in Amazon’s AWS service impaired the operations of many cloud services for hours [29].

we advocate a four-step process to react to failures: 1) detection; 2) mitigation; 3) diagnosis; and 4) repair. We argue that it is more important to mitigate failures than to fix them in real-time. Here “mitigate” means taking action(s) that alleviate the symptoms of a failure, possibly at the cost of temporarily reducing spare bandwidth or redundancy.

2. Redundancy In Datacenter Networks

In this section, we motivate NetPilot’s design with the observation that today’s DCNs have plenty of redundancy at the device level, protocol level, and application level.

NetPilot takes advantage of these redundancies to automatically mitigate failures.

2.3 Application-Level Redundancy

Modern DCNs also deploy application-level redundancy for fault tolerance.

3. Redundancy Warrants Automated Failure Mitigation

Figure 2: This figure shows the CDFs of how long it takes for DCNsp’s operators to mitigate and to repair critical failures.

3.3 Spare Capacity for Mitigation Actions

From the analysis above, we find that simple actions are highly effective in mitigating failures and also lend themselves to automation.

References

[29] A. A. Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. http://aws.amazon.com/message/65648/
[31] Y. Wang, H. Wang, A. Mahimkar, R. Alimi, Y. Zhang, L. Qiu, and Y. R. Yang. R3: resilient routing reconfiguration. In SIGCOMM ’10.