08 Oct 2015

A Network-state Management Service

http://research.microsoft.com/en-US/people/mzh/statesman.pdf

Statesman lets multiple network management apps operate independently while maintaining network-wide safety and performance invariants

e.g., TE application + firmware upgrade: depending on which action happens first, either the TE application fails to create the tunnel (because the [machine] is already down), or the already-established tunnel ultimately drops traffic during the firmware upgrade.

thus, DCNs (datacenter networks) … need a way to keep the applications separate. … a monolithic application would be highly complex, and worse, it would need to be extended as new needs arise … coupling greatly increases application complexity

typically the “control loop”: each management application measures the state of the network, performs a computation, and then reconfigures the network. …

these applications can conflict with each other, even if they interact with the network at different levels, such as establishing network paths, assigning IP addresses to interfaces, or installing firmware on switches … running multiple … also raises the risk of network-wide failures … while each application alone is fine, their joint actions would disconnect the ToR (top-of-rack) …

explicit coordination among applications (Corybantic) … as a general solution to the problem of co-existence, it would require each application to understand the intended network changes of all others. … worse, each time an app is changed or a new one is developed, DCN operators would need to test again, and potentially retrofit some existing applications … advocate … loosely coupled … conflict resolution and invariant enforcement should be handled by a separate management system.

3 system

Statesman uses three views of the network state.
In the observed state (OS), it maintains an up-to-date view of the actual network state. Applications read this state and propose state changes (proposed states, PS) based on their individual goals. Using a model of dependencies among state variables, Statesman merges these proposed states into a target state (TS) that is guaranteed to maintain the safety and performance invariants. It then updates the network to the target state.
storage service
checker
monitor
periodically collects the current network state from the switches and links, transforms it into OS variables …
updater
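
The flow between the three views can be sketched roughly as follows. This is only an illustration, not Statesman's actual interface: the function `merge_proposals`, the dict-based state, the variable names such as `sw1.admin`, and the toy invariant are all assumptions made for the example.

```python
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical representation: variable name -> value


def merge_proposals(
    observed: State,
    proposals: List[State],
    invariant_ok: Callable[[State], bool],
) -> State:
    """Fold proposed states into a target state, one proposal at a time.

    A proposal is accepted only if the resulting state still satisfies
    the invariant; otherwise it is skipped and the target is unchanged.
    """
    target = dict(observed)
    for proposal in proposals:
        candidate = {**target, **proposal}
        if invariant_ok(candidate):
            target = candidate
    return target


# Toy invariant (an assumption): at least one of the two switches stays up.
def at_least_one_up(state: State) -> bool:
    return "up" in (state["sw1.admin"], state["sw2.admin"])


observed = {"sw1.admin": "up", "sw2.admin": "up"}
proposals = [{"sw1.admin": "down"}, {"sw2.admin": "down"}]
target = merge_proposals(observed, proposals, at_least_one_up)
print(target)  # {'sw1.admin': 'down', 'sw2.admin': 'up'}
```

Each proposal is safe on its own, but the second is rejected because accepting both would violate the invariant.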

4.1 state dependency model

exposing “controllability” to applications … denoting whether the parent state variable is currently controllable; its value is computed by Statesman based on lower-level dependencies.

e.g., DeviceFirmwareVersion is controllable only if … switches’ power and admin states are appropriate. The firmware-upgrade application can work with DeviceFirmwareVersion only if it is controllable.
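
A minimal sketch of the controllability idea, under stated assumptions: the `SwitchState` fields and the rule that "appropriate" means powered on and administratively up are illustrative guesses, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SwitchState:
    # Hypothetical lower-level variables; names and values are illustrative.
    power: str     # "on" or "off"
    admin: str     # "up" or "down"
    firmware: str  # current firmware version


def firmware_controllable(sw: SwitchState) -> bool:
    """Assumed rule: the firmware version is controllable only when the
    switch is powered on and administratively up."""
    return sw.power == "on" and sw.admin == "up"


def propose_firmware(sw: SwitchState, version: str) -> Optional[Dict[str, str]]:
    """Return a proposed change, or None if the variable is not controllable."""
    if not firmware_controllable(sw):
        return None
    return {"DeviceFirmwareVersion": version}


print(propose_firmware(SwitchState("on", "up", "1.0"), "2.0"))
print(propose_firmware(SwitchState("off", "up", "1.0"), "2.0"))
```

The point of the flag is that the application checks a single boolean instead of reasoning about the lower-level variables itself.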

5 checking network state

5.1 resolving conflicts

two mechanisms for PS-TS conflicts
(basic one) last-writer wins
(advanced) priority-based locking … applications can acquire a low-priority or high-priority lock before proposing PS … high-priority overrides low-priority …
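
The two mechanisms might look like this in miniature. The `LockTable` class, the `LOW`/`HIGH` constants, and the choice that an equal-priority request does not displace the current holder are assumptions for illustration; the notes above only say that high priority overrides low priority.

```python
from typing import Dict, Optional, Tuple

LOW, HIGH = 0, 1  # hypothetical lock priorities


def last_writer_wins(target: Dict[str, str], proposal: Dict[str, str]) -> Dict[str, str]:
    """Basic mechanism: the newer proposal simply overwrites the target."""
    return {**target, **proposal}


class LockTable:
    """Per-variable locks where a higher-priority request overrides the holder."""

    def __init__(self) -> None:
        self._locks: Dict[str, Tuple[str, int]] = {}  # variable -> (app, priority)

    def acquire(self, var: str, app: str, priority: int) -> bool:
        holder = self._locks.get(var)
        # Refuse if another app holds the lock at equal or higher priority.
        if holder is not None and holder[0] != app and priority <= holder[1]:
            return False
        self._locks[var] = (app, priority)
        return True

    def holder(self, var: str) -> Optional[str]:
        entry = self._locks.get(var)
        return entry[0] if entry else None


locks = LockTable()
print(locks.acquire("sw1.firmware", "te", LOW))        # True
print(locks.acquire("sw1.firmware", "upgrade", LOW))   # False
print(locks.acquire("sw1.firmware", "upgrade", HIGH))  # True
print(locks.holder("sw1.firmware"))                    # upgrade
```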

5.2 choosing and checking invariants

… minimum safety and performance requirements, independent of what applications are currently running.

checking invariants
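
As a sketch of what an application-independent check could look like, the example below tests a connectivity invariant: every ToR must stay up and reachable from every other ToR once the devices proposed to go down are removed. The tiny topology, the device names, and the function names are assumptions for illustration, not the paper's checker.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(adj: Dict[str, Set[str]], start: str, down: Set[str]) -> Set[str]:
    """Breadth-first search that never enters a device listed in `down`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in down and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen


def tors_connected(adj: Dict[str, Set[str]], tors: List[str], down: Set[str]) -> bool:
    """Invariant: all ToRs are up and mutually reachable."""
    if not tors:
        return True
    if any(t in down for t in tors):
        return False
    return set(tors) <= reachable(adj, tors[0], down)


# Hypothetical topology: two ToRs, each wired to two aggregation switches.
adj = {
    "tor1": {"agg1", "agg2"},
    "tor2": {"agg1", "agg2"},
    "agg1": {"tor1", "tor2"},
    "agg2": {"tor1", "tor2"},
}
print(tors_connected(adj, ["tor1", "tor2"], {"agg1"}))          # True
print(tors_connected(adj, ["tor1", "tor2"], {"agg1", "agg2"}))  # False
```

Taking down either aggregation switch alone passes the check, while taking down both fails it, which mirrors the case where each application alone is fine but their joint actions disconnect a ToR.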

6.3 network monitors

split the monitoring responsibility across many monitor instances, so each instance covers roughly 1,000 switches … currently the monitors run periodically to collect all switches’ power states, firmware versions, device configurations, and various counters (and forwarding states for a subset of switches) …
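
One simple way to split the work is to cut the switch list into fixed-size chunks. The helper name `shard_switches` and the consecutive-chunk strategy are assumptions; the notes do not say how Statesman actually assigns switches to monitor instances.

```python
from typing import List


def shard_switches(switches: List[str], per_instance: int = 1000) -> List[List[str]]:
    """Split the switch list into consecutive chunks, one per monitor instance."""
    return [switches[i:i + per_instance] for i in range(0, len(switches), per_instance)]


switches = [f"sw{i}" for i in range(2500)]
shards = shard_switches(switches)
print([len(s) for s in shards])  # [1000, 1000, 500]
```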

builds on a long line of prior work on SDN [1, 2, 3, 6, 8, 9, 22]. … in contrast, Statesman supports a wider range of network management functions (e.g., switch upgrade, link failure mitigation, elastic scaling, etc.) …

Mesos [11] schedules competing applications using the cluster-resource abstraction, which is quite different from our network-state abstraction (e.g., no cross-variable dependency).

Onix [18, 19] and Hercules [16] provide a shared network-state platform for all applications. But these systems neither resolve application conflicts, in particular those caused by state variable dependency, nor enforce network-wide invariants.

Corybantic [23] proposes a different way of resolving conflicts …

references

[23] J. Mogul, A. AuYoung, S. Banerjee, J. Lee, J. Mudigonda, L. Popa, P. Sharma, and Y. Turner. Corybantic: Towards Modular Composition of SDN Control Programs. In ACM HotNets Workshop, November 2013.

[1] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe. Design and Implementation of a Routing Control Platform. In USENIX NSDI, May 2005.

[2] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker. Ethane: Taking Control of the Enterprise. ACM SIGCOMM CCR, 37(4):1–12, August 2007.

[3] M. Casado, T. Garfinkel, A. Akella, M. J. Freedman, D. Boneh, N. McKeown, and S. Shenker. SANE: A Protection Architecture for Enterprise Networks. In USENIX Security Symposium, July 2006.

[6] N. Feamster, J. Rexford, and E. Zegura. The Road to SDN. ACM Queue, 11(12):20:20–20:40, December 2013.

[8] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers, J. Rexford, G. Xie, H. Yan, J. Zhan, and H. Zhang. A Clean Slate 4D Approach to Network Control and Management. ACM SIGCOMM CCR, 35(5):41–54, October 2005.

[9] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker. NOX: Towards an Operating System for Networks. ACM SIGCOMM CCR, 38(3):105–110, July 2008.

[22] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM CCR, 38(2):69–74, March 2008.