17 Dec 2015

Making Big Data Processing Simple with Spark

Presenter: Matei Zaharia

https://sparkhub.databricks.com
http://learning.acm.org/

As data volumes grow, we need programming tools for parallel applications that are as easy to use and versatile as those for single machines. The Spark project started at UC Berkeley to meet these goals. Spark is based on two main ideas. First, it has a language-integrated API in Python, Java, Scala and R, based on functional programming, that makes it easy to build applications out of functions to run on a cluster. Second, it offers a general engine that can support streaming, batch, and interactive computations, as well as advanced analytics such as machine learning, and lets users combine them in one program. Since its release in 2010, Spark has become a highly active open source project, with over 900 contributors and a broad set of built-in libraries. This talk will cover the main ideas behind the Spark programming model, and recent additions to the project.

unified engine

why
2004: MapReduce, Google

generality: one general engine that can run many different kinds of workloads, not just batch

beyond MapReduce (generalize MapReduce)

complex (multi-pass algorithms), interactive (ad-hoc queries), real-time stream processing

specialized systems were built for these workloads

Problems with specialized systems
  1. More systems to manage, tune, deploy
  2. can’t easily combine processing types

moving data between these specialized systems is often a dominant cost

common problem: lack of efficient data sharing across multiple steps

Data sharing in MapReduce
each step writes to and reads from stable storage (HDFS), so sharing data between steps is slow
What we'd like
data shared between steps in memory, available immediately
while staying fault-tolerant
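
A minimal PySpark sketch of this idea (my own illustration, not from the slides; the input path and the filter condition are made-up placeholders): cache() keeps the parsed data in memory, so repeated passes over it do not re-read the file from disk.

from pyspark import SparkContext

sc = SparkContext("local[*]", "DataSharingSketch")

# Load and parse once; the HDFS path is a placeholder.
points = sc.textFile("hdfs://.../points.txt") \
           .map(lambda line: [float(x) for x in line.split()])

# Mark the RDD for in-memory caching; without this, every pass below
# would re-read and re-parse the file from storage.
points.cache()

# Multiple passes over the same in-memory data (e.g. an iterative algorithm).
for i in range(10):
    n = points.filter(lambda p: p[0] > 0).count()
    print(i, n)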

programming model

Spark programming model
RDDs (Resilient Distributed Datasets)
interesting properties: built through parallel transformations, automatically rebuilt on failure
Example: Log Mining
full-text search of Wikipedia
	
# base RDD backed by a file in HDFS
lines = spark.textFile("hdfs://...")
# transformed RDDs
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
# keep the messages in memory
messages.cache()

# action: count the messages mentioning MySQL (runs on the cluster)
messages.filter(lambda s: "MySQL" in s).count()
fault tolerance
(the cool thing underneath RDDs)
lineage: recompute the missing data when something fails

RDDs track lineage info to rebuild lost data
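
Continuing the Log Mining example above (an illustrative aside, not from the slides), the lineage Spark tracks can be inspected directly:

# Print the chain of transformations (the lineage) behind "messages";
# Spark uses this dependency graph to recompute lost partitions after a failure.
print(messages.toDebugString())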

On-disk performance
Time to sort 100TB

Built-in libraries

Combining processing types (see the sketch after this list)
  1. load data with SQL
  2. train a machine learning model
  3. apply it to a stream
nice compared to separate systems:
  • usability
  • performance
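
A hedged sketch of such a combined pipeline using Spark 1.x-era APIs (SQLContext, MLlib, Spark Streaming); the file path, table and column names, and the socket source are made-up placeholders for illustration, not from the talk.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="CombinedPipelineSketch")
sqlContext = SQLContext(sc)

# 1. load data with SQL (hypothetical Parquet file with label, f1, f2 columns)
events = sqlContext.read.parquet("hdfs://.../events.parquet")
events.registerTempTable("events")
training = sqlContext.sql("SELECT label, f1, f2 FROM events")

# 2. train a machine learning model on the query result
labeled = training.rdd.map(lambda r: LabeledPoint(r.label, [r.f1, r.f2]))
model = LogisticRegressionWithSGD.train(labeled)

# 3. apply it to a stream (hypothetical socket source emitting "f1 f2" lines)
ssc = StreamingContext(sc, batchDuration=1)
lines = ssc.socketTextStream("localhost", 9999)
predictions = lines.map(lambda line: [float(x) for x in line.split()]) \
                   .map(lambda features: model.predict(features))
predictions.pprint()

ssc.start()
ssc.awaitTermination()

All three steps run in one program against one engine, which is the usability and performance point made above.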
Performance vs Specialized Systems
libraries built on Spark are the most active part of Spark

Application

Spark tries to be a very general engine

Spark community
1000+ deployments
clusters of up to 8000 nodes
top applications
Spark components used: Spark SQL 69% (the most used)

two courses