Presenter: Matei Zaharia
https://sparkhub.databricks.com
http://learning.acm.org/
As data volumes grow, we need programming tools for parallel applications that are as easy to use and versatile as those for single machines. The Spark project started at UC Berkeley to meet these goals. Spark is based on two main ideas. First, it has a language-integrated API in Python, Java, Scala and R, based on functional programming, that makes it easy to build applications out of functions to run on a cluster. Second, it offers a general engine that can support streaming, batch, and interactive computations, as well as advanced analytics such as machine learning, and lets users combine them in one program. Since its release in 2010, Spark has become a highly active open source project, with over 900 contributors and a broad set of built-in libraries. This talk will cover the main ideas behind the Spark programming model, and recent additions to the project.
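For a feel of the language-integrated API, here is a minimal PySpark sketch: a word count built out of plain functions that Spark ships to the cluster (the path `data.txt` is a made-up example).

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# Build the computation out of ordinary functions run on the cluster:
counts = (sc.textFile("data.txt")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # pair each word with a count
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))  # triggers execution, shows a sample of results
sc.stop()
```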
generality: one general engine for many kinds of workloads, instead of a specialized system for each
complex (multi-pass algorithms), interactive (ad-hoc queries), real-time stream processing
previously each of these needed its own specialized system; Spark lets them be combined in one program (sketched below)
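A rough sketch of that generality, mixing a batch step and an ad-hoc SQL query in one PySpark program. This uses the newer `SparkSession` entry point, and the `events` data is invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Generality").getOrCreate()

# Batch step: clean raw records with ordinary functions on an RDD.
raw = spark.sparkContext.parallelize(["alice,3", "bob,5", "alice,2"])
rows = raw.map(lambda s: s.split(",")).map(lambda p: (p[0], int(p[1])))

# Interactive step: expose the same data as a table and query it with SQL.
df = rows.toDF(["user", "clicks"])
df.createOrReplaceTempView("events")
spark.sql("SELECT user, SUM(clicks) AS total FROM events GROUP BY user").show()

spark.stop()
```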
in MapReduce-style engines, a dominant cost is sharing data between steps through stable storage (disk I/O, replication)
common problem: lack of efficient data sharing across the multiple steps of a job
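A minimal sketch of in-memory sharing as the fix: `cache()` keeps a dataset in memory so later steps reuse it instead of re-reading from disk. Assumes an existing SparkContext `sc` and a hypothetical `logs.txt`.

```python
# Filter once, keep the result in memory for reuse across steps.
errors = (sc.textFile("logs.txt")
            .filter(lambda line: "ERROR" in line)
            .cache())

print(errors.count())                                   # first pass builds the cache
print(errors.filter(lambda l: "timeout" in l).count())  # later passes hit memory
```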
RDDs track lineage info to rebuild lost data, instead of replicating it
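A minimal sketch of what that lineage looks like, assuming an existing SparkContext `sc`: each RDD records the chain of transformations that produced it, and `toDebugString()` prints the graph Spark would replay to recompute a lost partition.

```python
nums = sc.parallelize(range(1000))
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The recorded lineage graph used for fault recovery:
print(evens.toDebugString().decode())
```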
Spark tries to be a very general engine
two online courses on Spark are available