Automating network application dependency discovery: experiences, limitations, and new solutions

dependency 2

http://dl.acm.org/citation.cfm?id=1855750 OSDI'08

Abstract — Large enterprise networks consist of thousands of services and applications. The performance and reliability of any particular application may depend on multiple services, spanning many hosts and network components.

automated discovery of dependencies from network traffic [8, 18]

We introduce a new system, Orion, that discovers dependencies using packet headers and timing information in network traffic based on a novel insight of delay spike based analysis.

Modern enterprise IT infrastructures comprise large numbers of network services and user applications. Typical applications, such as web, email, instant messaging, file sharing, and audio/video conferencing, operate on a distributed set of clients and servers. They also rely on many supporting services, such as Active Directory (AD), Domain Name System (DNS), Kerberos, and Windows Internet Name Service (WINS). The complexity quickly adds up as different applications and services must interact with each other in order to function properly.

We say one service depends on the other if the former requires the latter to operate properly. Knowledge of service dependencies provides a basis for serving critical network management tasks, including fault localization, reconfiguration planning, and anomaly detection. For instance, Sherlock encapsulates the services and network components that applications depend on in an inference graph [8]. This graph is combined with end-user observations of application performance to localize faults in an enterprise network.

When IT managers need to upgrade, reorganize, or consolidate their existing applications, they can leverage the knowledge of dependencies of their applications to identify the services and hosts that may potentially be affected, and to prevent unexpected consequences [9].

… not have proven tools that help to discover the web of dependencies among different services and applications… commonly rely on the knowledge from application designers and owners to specify these dependencies… requires significant human effort to keep up with the evolution of the applications and their deployment environment.

a few attempts to automate dependency discovery by observing network traffic patterns [8, 9, 18].

we ... introduce a new dependency discovery technique based on traffic delay distributions.

There is a large body of prior work on tracing execution paths among different components in distributed applications.

require too much manual effort and are often restricted to a particular set of applications from the same vendor.
Pinpoint instruments the J2EE middleware on every host to track requests as they flow through the system [15]. It focuses on mining the collections of these paths to locate faults and understand system changes. X-Trace is a cross-layer, cross-application framework for tracing the network operations resulting from a particular task [16].
Magpie is a toolchain that correlates events generated by operating system, middleware, and application to extract individual requests and their resource usage [10].
Brown et al. propose to use active perturbation to infer dependencies between system components in distributed applications [14].

3 Goal & Approach

3.1 Services and dependencies

We define service A to depend on service B, denoted as A → B, if A requires B to satisfy certain requests from its clients. For instance, a web service depends on DNS service because web clients need to look up the IP address of the web server to access a webpage. Similarly, a web service may also depend on database services to retrieve contents requested by its clients. A → B does not mean A must depend on B to answer all client requests: in the web example, clients may bypass the DNS service if they have cached the web server IP address.

3.2 Discovering dependencies from traffic

We consider three options in designing Orion to discover dependencies of enterprise applications:
- i) instrumenting applications or middleware; … We bypass the first option because we want Orion to be easily deployable. Requiring changes to existing applications or middleware will deter adoption.
- ii) mining application configuration files; … However, the configuration files of different applications may be stored in different locations and have different formats.
- iii) analyzing application traffic.

we take the third approach of discovering dependencies by using packet headers … and timing information in network traffic.

developing parsers for every application requires extensive human effort and domain knowledge. For this reason, we refrain from using any packet content information besides IP, UDP, and TCP headers.

Orion discovers dependencies based on the observation that the traffic delay distribution between dependent services often exhibits “typical” spikes that reflect the underlying delay for using or providing these services.

5 Service Dependency Discovery

5.1 Overview

Orion discovers service dependencies by observing the time correlation of messages between different services.

Our key assumption is that if service A depends on service B, the delay distribution between their messages should not be random.

Orion focuses on discovering service dependencies from an individual host’s perspective. Given a host, it aims to identify dependencies only between services that are either used or provided by that host… By combining the dependencies extracted from multiple hosts, Orion can construct the dependency graphs of multi-tier applications.
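The combination step can be illustrated with a minimal Python sketch. It assumes each host reports its discovered dependencies as (A, B) pairs meaning A → B; the function names, the host names, and the service names are hypothetical and chosen only for illustration. This is not Orion's implementation.

```python
from collections import defaultdict


def merge_host_dependencies(per_host):
    """Merge per-host dependency edges into one graph.

    per_host maps a host name to an iterable of (a, b) pairs, where
    (a, b) means "service a depends on service b" as seen from that host.
    Returns a dict mapping each service to the set of services it depends on.
    """
    graph = defaultdict(set)
    for edges in per_host.values():
        for a, b in edges:
            graph[a].add(b)
    return dict(graph)


def transitive_dependencies(graph, service):
    """Return every service reachable from `service` by following edges."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical observations from three hosts of a multi-tier application.
per_host = {
    "client": [("web", "dns")],
    "web-server": [("web", "sql")],
    "sql-server": [("sql", "kerberos")],
}
graph = merge_host_dependencies(per_host)
all_deps = transitive_dependencies(graph, "web")
```

No single host sees the full chain here; only the merged graph shows that the web service indirectly relies on Kerberos.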

5.2 Delay distribution calculation

Suppose a host uses m remote services and provides n local services. In the worst case, Orion needs to maintain delay distributions for (m × m) remote-remote (RR) service pairs and (m × n) local-remote (LR) service pairs for that host.
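A rough sketch of how such per-pair delay distributions might be accumulated is given below. The excerpt does not state the bin width, the maximum delay considered, or the input format, so `bin_width_ms`, `max_delay_ms`, and the (timestamp, service) event tuples are assumptions made for illustration. Timestamps are integer milliseconds to keep the binning exact.

```python
from collections import Counter, defaultdict


def pair_counts(m, n):
    """Worst-case number of (RR, LR) service pairs for m remote, n local services."""
    return m * m, m * n


def delay_histograms(events, bin_width_ms=10, max_delay_ms=3000):
    """Build one delay histogram per ordered service pair.

    events: iterable of (timestamp_ms, service) tuples seen at one host.
    For each message of service a, every later message of a different
    service b within max_delay_ms contributes one sample to the (a, b)
    histogram. Returns {(a, b): Counter({bin_index: sample_count})}.
    """
    ordered = sorted(events)
    hists = defaultdict(Counter)
    for i, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[i + 1:]:
            delay = t_b - t_a
            if delay > max_delay_ms:
                break  # events are sorted, so later ones are even further away
            if a != b:
                hists[(a, b)][delay // bin_width_ms] += 1
    return dict(hists)


# Hypothetical trace: each DNS message is followed by a web message ~20 ms later.
events = [(0, "dns"), (20, "web"), (1000, "dns"),
          (1021, "web"), (2000, "dns"), (2025, "web")]
hists = delay_histograms(events, bin_width_ms=10, max_delay_ms=300)
```

In this trace all three (dns, web) samples fall into the same 20-29 ms bin, which is the kind of concentration that the spike analysis looks for.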

5.3 Service dependency extraction

Intuitively, the number of typical spikes corresponds to the number of commonly-executed paths in the services, which is at most a few for all the services we study.
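To make the notion of a typical spike concrete, here is a minimal sketch that flags histogram bins standing well above the rest. The excerpt does not give the detection rule, so the threshold used here (mean plus k standard deviations of the bin heights) and the parameter `k` are assumptions, not necessarily the procedure Orion uses.

```python
from statistics import mean, pstdev


def typical_spikes(bin_heights, k=3.0):
    """Return indices of bins whose height exceeds mean + k * stddev.

    bin_heights: list of sample counts, one per delay bin.
    A flat histogram (zero standard deviation) has no spikes.
    """
    mu = mean(bin_heights)
    sigma = pstdev(bin_heights)
    if sigma == 0:
        return []
    threshold = mu + k * sigma
    return [i for i, h in enumerate(bin_heights) if h > threshold]


# Hypothetical histogram: uniform background with one pronounced bin.
heights = [10] * 50
heights[7] = 100
spikes = typical_spikes(heights)
flat = typical_spikes([10] * 50)
```

Under this assumed rule, a service pair would be reported as dependent when `typical_spikes` returns at least one bin, and as independent when the histogram looks flat.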

5.4 Discussion

various types of noise (e.g., different hardware, software, configuration, and workload on the hosts and load variation in the network) or unexpected service pair interaction (e.g., although service A → B, the messages of A and B could be triggered by other services). While the impact of random noise can be mitigated by taking a large number of statistical samples, unexpected service pair interaction is more problematic. … We emphasize that the issues above are not just specific to Orion but to the class of dependency discovery techniques based on traffic patterns.

Orion requires a large number of statistical samples to reliably extract service dependencies. This makes it less applicable to services that are newly deployed or infrequently used. It may also miss dependencies that rarely occur, such as DHCP. One possible solution is to proactively inject workloads into these services to help accumulate a sufficient number of samples.

7 Experimental Results

References

[8] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies. In Proc. ACM SIGCOMM, 2007.
[18] S. Kandula, R. Chandra, and D. Katabi. What’s Going On? Extracting Communication Rules In Edge Networks. In Proc. ACM SIGCOMM, 2008.
[15] M. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. Path-based Failure and Evolution Management. In NSDI, 2004. to check
[9] P. V. Bahl, P. Barham, R. Black, R. Chandra, M. Goldszmidt, R. Isaacs, S. Kandula, L. Li, J. MacCormick, D. Maltz, R. Mortier, M. Wawrzoniak, and M. Zhang. Discovering Dependencies for Network Management. In HotNets, 2006.