22 Oct 2015

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

ABSTRACT. Resilient Distributed Datasets (RDDs), a distributed memory abstraction … perform in-memory computations (on large clusters) in a fault-tolerant manner. … handle inefficiently: iterative algorithms and interactive data mining tools. (In both cases), keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, … (implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.)

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently. Existing abstractions for in-memory storage (on clusters, such as distributed shared memory, key-value stores, databases, and Piccolo) offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). …

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) …
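To make the contrast concrete for myself, here is a minimal sketch of why coarse-grained transformations help with fault tolerance. The `ToyRDD` class is hypothetical and single-process; it is not the Spark API, and it only models per-partition transformations (`map`, `filter`), not `join`. Each dataset records the function that derived it from its parent (its lineage) instead of logging individual updates, so a lost partition can be rebuilt by replaying that function on the parent partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source=None, parent=None, fn=None, op="source"):
        self.source = source    # base partitions; stands in for stable storage
        self.parent = parent    # lineage: dataset this one was derived from
        self.fn = fn            # lineage: function applied to a parent partition
        self.op = op            # label of the transformation
        self.num_partitions = (
            len(source) if parent is None else parent.num_partitions
        )
        self.cache = {}         # partition index -> in-memory copy

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part], op="map")

    def filter(self, pred):
        return ToyRDD(
            parent=self, fn=lambda part: [x for x in part if pred(x)], op="filter"
        )

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not in memory."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = self.fn(self.parent.compute(i))
        self.cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lineage(self):
        """List the transformations from the base dataset down to this one."""
        chain, node = [], self
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return list(reversed(chain))


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()

# Simulate a failure: drop the in-memory copies of partition 1.
del evens_squared.cache[1]
evens_squared.parent.cache.clear()

# Only the lost partition is recomputed, by replaying map and filter on it.
after = evens_squared.collect()
```

In this sketch the only state needed for recovery is the short chain returned by `lineage()`, which is the point of restricting the interface to transformations that apply to a whole dataset.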

2 Resilient Distributed Datasets (RDDs)

lineage
for applications such as explaining results … [5], [9]
relational databases
RDDs are conceptually similar to views in a database, and persistent RDDs resemble materialized views … databases typically allow fine-grained read-write access to all records, requiring logging of operations and data for fault tolerance and additional overhead to maintain consistency.
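A rough way to picture the view versus materialized view analogy is the sketch below. `LazyDataset` and its `persist`/`collect` methods are hypothetical names chosen for illustration, and the class only counts evaluations in a single process; it assumes nothing about how Spark or any database actually implements this.

```python
class LazyDataset:
    """Toy dataset defined by a computation (hypothetical; for illustration only)."""

    def __init__(self, compute):
        self._compute = compute   # how to derive the data, like a view definition
        self._persisted = False
        self._cached = None
        self.evaluations = 0      # how many times the definition was evaluated

    def persist(self):
        """Ask for the result to be kept after the first evaluation."""
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._cached is not None:
            return self._cached   # behaves like a materialized view
        self.evaluations += 1
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result


records = list(range(10))


def definition():
    return [r * 2 for r in records if r % 3 == 0]


view = LazyDataset(definition)                    # recomputed on every read
materialized = LazyDataset(definition).persist()  # computed once, then reused

for _ in range(3):
    view.collect()
    materialized.collect()

print(view.evaluations, materialized.evaluations)
```

The plain dataset re-runs its definition on each read, while the persisted one evaluates it once and serves later reads from memory.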

reference

(lineage)