Rollback Recovery for Middleboxes


http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p227.pdf

Network middleboxes must offer high availability, with automatic failover when a device fails … challenging because failover must correctly restore lost state (e.g., activity logs, port mappings) but must do so quickly (e.g., in less than typical transport timeout values to minimize disruption to applications) and with little overhead to failure-free operation (e.g., additional per-packet latencies of 10-100s of μs).

Because middleboxes typically involve proprietary monolithic software running on dedicated hardware, they can be expensive to deploy and manage ... To rectify this situation, network operators are moving towards Network Function Virtualization (NFV), in which middlebox functionality is moved out of dedicated physical boxes into virtual appliances that can be run on commodity processors [32].

While the NFV vision solves the dedicated hardware problem, it presents some technical challenges of its own.

challenges: performance [38, 45, 52, 55, 58]; management [33, 35, 49]

We argue that an equally important challenge — one that has received far less attention — is that of fault-tolerance.

common approach to fault tolerance: a combination of careful engineering to avoid faults and deploying a backup appliance to rapidly restart when faults occur; the migration to NFV will only exacerbate the problematic aspects of both

traditional middleboxes ... limiting the introduction of faults ... will not apply to NFV: vendor diversity in hardware and applications will explode the test space

greater openness and agility in middlebox infrastructure

With current middleboxes, operators often maintain a dedicated per-appliance backup. This is inefficient and offers only a weak form of recovery for the many middlebox applications that are stateful — e.g., NAT …

dynamic state about flows, users, and network conditions

correct recovery from failures …

low-latency
general
passive

FTMB tailors the classic approach of rollback recovery to the middlebox domain and achieves correct recovery in a general and passive manner
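
To make "rollback recovery" concrete, here is a minimal sketch of the classic idea in its generic form: log inputs during normal operation, keep an occasional checkpoint, and after a failure rebuild state by replaying the log on top of the checkpoint. This is not the paper's mechanism. The `Nat` and `RollbackRecovery` classes and all their methods are hypothetical names invented for illustration, and the toy assumes a single-threaded, deterministic middlebox so that replay reproduces the lost state exactly.

```python
import copy


class Nat:
    """Toy stateful middlebox (hypothetical): maps each flow to an external port."""

    def __init__(self, first_port=10000):
        self.next_port = first_port
        self.mappings = {}

    def process(self, flow):
        # Allocate a port the first time a flow is seen, then reuse it.
        if flow not in self.mappings:
            self.mappings[flow] = self.next_port
            self.next_port += 1
        return self.mappings[flow]


class RollbackRecovery:
    """Generic checkpoint-plus-log wrapper (illustrative, not the paper's design)."""

    def __init__(self, middlebox):
        self.live = middlebox
        self.checkpoint = copy.deepcopy(middlebox)  # last saved snapshot
        self.log = []                               # inputs since that snapshot

    def handle(self, flow):
        # Record the input before processing so it survives a crash of `live`.
        self.log.append(flow)
        return self.live.process(flow)

    def take_checkpoint(self):
        # Snapshot current state; earlier log entries are no longer needed.
        self.checkpoint = copy.deepcopy(self.live)
        self.log.clear()

    def recover(self):
        # Roll back to the snapshot, then replay logged inputs in order.
        replica = copy.deepcopy(self.checkpoint)
        for flow in self.log:
            replica.process(flow)
        self.live = replica
        return replica
```

In use, one would call `handle()` for every packet, call `take_checkpoint()` periodically, and call `recover()` on a replica after a failure. In this toy the recovered port mappings match those held before the failure.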

FTMB achieves rapid recovery (reconstructing lost state in between 40-275ms for practical system configurations) and low additional latency on failure-free operation (adding only 30μs to median per-packet latencies)

reference

[32] European Telecommunications Standards Institute. NFV White Paper. https://portal.etsi.org/NFV/NFV_White_Paper.pdf