17 Dec 2015

Making Big Data Processing Simple with Spark

Presenter: Matei Zaharia

https://sparkhub.databricks.com
http://learning.acm.org/

As data volumes grow, we need programming tools for parallel applications that are as easy to use and versatile as those for single machines. The Spark project started at UC Berkeley to meet these goals. Spark is based on two main ideas. First, it has a language-integrated API in Python, Java, Scala and R, based on functional programming, that makes it easy to build applications out of functions to run on a cluster. Second, it offers a general engine that can support streaming, batch, and interactive computations, as well as advanced analytics such as machine learning, and lets users combine them in one program. Since its release in 2010, Spark has become a highly active open source project, with over 900 contributors and a broad set of built-in libraries. This talk will cover the main ideas behind the Spark programming model, and recent additions to the project.

unified engine

why
2004: MapReduce, Google

generality: one general engine that can run many different kinds of workloads, not just batch

beyond MapReduce (generalize MapReduce)

complex (multi-pass algorithms), interactive (ad-hoc queries), real-time stream processing

specialized systems were built for these workloads

Problems with specialized systems
  1. More systems to manage, tune, deploy
  2. can’t easily combine processing types

moving data between these specialized systems is often a dominant cost

common problem: lack of efficient data sharing across multiple steps

Data sharing in MapReduce
each step writes to and reads from stable storage (HDFS), so sharing data between steps is slow
What we'd like
data shared between steps in memory, available immediately
while staying fault-tolerant
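
A minimal PySpark sketch of this idea (my own illustration, not from the slides; the input path and the filter condition are made-up placeholders): cache() keeps the parsed data in memory, so repeated passes over it do not re-read the file from disk.

from pyspark import SparkContext

sc = SparkContext("local[*]", "DataSharingSketch")

# Load and parse once; the HDFS path is a placeholder.
points = sc.textFile("hdfs://.../points.txt") \
           .map(lambda line: [float(x) for x in line.split()])

# Mark the RDD for in-memory caching; without this, every pass below
# would re-read and re-parse the file from storage.
points.cache()

# Multiple passes over the same in-memory data (e.g. an iterative algorithm).
for i in range(10):
    n = points.filter(lambda p: p[0] > 0).count()
    print(i, n)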

programming model

Spark programming model
RDDs (Resilient Distributed Datasets)
interesting properties: built through parallel transformations, automatically rebuilt on failure
Example: Log Mining
full-text search of Wikipedia
	
# base RDD backed by a file in HDFS
lines = spark.textFile("hdfs://...")
# transformed RDDs
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
# keep the messages in memory
messages.cache()

# action: count the messages mentioning MySQL (runs on the cluster)
messages.filter(lambda s: "MySQL" in s).count()
fault tolerance
(the cool thing underneath RDDs)
lineage: recompute the missing data when something fails

RDDs track lineage info to rebuild lost data
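
Continuing the Log Mining example above (an illustrative aside, not from the slides), the lineage Spark tracks can be inspected directly:

# Print the chain of transformations (the lineage) behind "messages";
# Spark uses this dependency graph to recompute lost partitions after a failure.
print(messages.toDebugString())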

On-disk performance
Time to sort 100TB

Built-in libraries

Combining processing types (see the sketch after this list)
  1. load data with SQL
  2. train a machine learning model
  3. apply it to a stream
nice compared to separate systems:
  • usability
  • performance
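
A hedged sketch of such a combined pipeline using Spark 1.x-era APIs (SQLContext, MLlib, Spark Streaming); the file path, table and column names, and the socket source are made-up placeholders for illustration, not from the talk.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="CombinedPipelineSketch")
sqlContext = SQLContext(sc)

# 1. load data with SQL (hypothetical Parquet file with label, f1, f2 columns)
events = sqlContext.read.parquet("hdfs://.../events.parquet")
events.registerTempTable("events")
training = sqlContext.sql("SELECT label, f1, f2 FROM events")

# 2. train a machine learning model on the query result
labeled = training.rdd.map(lambda r: LabeledPoint(r.label, [r.f1, r.f2]))
model = LogisticRegressionWithSGD.train(labeled)

# 3. apply it to a stream (hypothetical socket source emitting "f1 f2" lines)
ssc = StreamingContext(sc, batchDuration=1)
lines = ssc.socketTextStream("localhost", 9999)
predictions = lines.map(lambda line: [float(x) for x in line.split()]) \
                   .map(lambda features: model.predict(features))
predictions.pprint()

ssc.start()
ssc.awaitTermination()

All three steps run in one program against one engine, which is the usability and performance point made above.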
Performance vs Specialized Systems
libraries built on Spark are the most active part of Spark

Application

Spark tries to be a very general engine

Spark community
1000+ deployments
clusters of up to 8000 nodes
top applications
Spark components used: Spark SQL 69% (the most used)

two courses