https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
ABSTRACT. Resilient Distributed Datasets (RDDs), a distributed memory abstraction … perform in-memory computations (on large clusters) in a fault-tolerant manner. … handle inefficiently: iterative algorithms and interactive data mining tools. (In both cases), keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, … (implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.)
The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage (on clusters, such as distributed shared memory, key-value stores, databases, and Piccolo,) offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). …
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) …
(lineage)
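A minimal sketch of the idea, under stated assumptions: the ToyRDD class, its fields, and its method names below are hypothetical and single-process, not Spark's API. Each dataset is immutable and records only its parent and the coarse-grained transformation that produced it, so a lost partition can be rebuilt by replaying that lineage rather than by restoring replicated data. Only map and filter are shown; join is left out because it would need dependencies across partitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical, single-process stand-in for an RDD (not Spark's API)."""
    name: str = "source"                          # label of the step that built this dataset
    source: Optional[List[list]] = None           # base partitions (root dataset only)
    parent: Optional["ToyRDD"] = None             # lineage pointer to the parent dataset
    op: Optional[Callable[[list], list]] = None   # transformation applied to a whole partition

    def map(self, f):
        # Coarse-grained: the same function is applied to every element.
        return ToyRDD(name="map", parent=self,
                      op=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(name="filter", parent=self,
                      op=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Rebuild partition i from lineage alone; nothing is cached or replicated."""
        if self.parent is None:
            return list(self.source[i])
        return self.op(self.parent.compute(i))

    def lineage(self):
        """Return the chain of steps from the root dataset to this one."""
        steps = [] if self.parent is None else self.parent.lineage()
        return steps + [self.name]


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# "Losing" partition 1 costs nothing extra: it is recomputed from its lineage.
print(even_squares.lineage())   # ['source', 'map', 'filter']
print(even_squares.compute(1))  # [16, 36]
```

Because no element is ever updated in place, the log needed for recovery is just the short list of transformations, independent of the size of the data.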
[5] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37:1–28, 2005.
[9] J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.