Resilient Distributed Datasets: A Fault Tolerant. Abstraction for In Memory Cluster Computing

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

ABSTRACT. Resilient Distributed Datasets (RDDs), a distributed memory abstraction … perform in-memory computations (on large clusters) in a fault-tolerant manner. … handle inefficiency: iterative algorithms and interactive data mining tools. (In both cases), keeping data in memory can improve performance by an order of magnitude. To achieve faul tolerance efficiently, RDDs provide a restricted form of shared memory, … (implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.)

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage (on clusters, such as distributed shared memory, key-value store, databases, and Picolo,) offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). …

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) …

2 Resilient Distributed Datasets (RDDs)

lineage: for applications such as explaining results … [5], [9]
relational databases: RDDs are conceptually similar to views in a database and persistent RDDs resemble materialized views … database typically allow fine-grained read-write access to all records, requiring logging of operations and data for fault-tolerance and additional overhead to maintain consistency.

reference

(lineage)

[5] R.Bose and J.Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37:1–28, 2005.
[9] J.Cheney, L.Chiticariu, and W.-C.Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

Resilient Distributed Datasets: A Fault Tolerant. Abstraction for In Memory Cluster Computing

2 Resilient Distributed Datasets (RDDs)

8 related work

reference