Resilient Distributed Datasets
A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Motivation
• RDDs are motivated by two types of applications that current computing frameworks handle inefficiently:
1. Iterative algorithms:
– iterative machine learning
– graph algorithms
2. Interactive data mining
– ad-hoc queries
• In MapReduce, the only way to share data across jobs is stable storage: slow!
Examples
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing
Solution: Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
• Efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions on failure
– No cost if nothing fails
Solution: Resilient Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory for efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications
• Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
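The distinction matters because transformations are lazy: they only define a new RDD, and no cluster work happens until an action asks for a result. A minimal sketch, assuming an existing SparkContext named sc (the variable names are illustrative):

val nums    = sc.parallelize(1 to 1000)     // base RDD from a local collection
val squares = nums.map(x => x.toLong * x)   // transformation: lazily defines a new RDD
val evens   = squares.filter(_ % 2 == 0)    // another transformation, still nothing has run
val total   = evens.reduce(_ + _)           // action: runs the job and returns a value to the driver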
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")            // Base RDD
errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count      // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to three workers; each worker reads its input block (Block 1-3) from HDFS, caches its partition of messages in memory (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
• RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. For example, if a partition of cachedMsgs above is lost, Spark recomputes it by re-running the filter and map on the corresponding block of the HDFS file.
Example: PageRank
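The slide's figure is not reproduced here; below is a minimal sketch of iterative PageRank on RDDs in the spirit of the original paper, assuming a SparkContext sc. The edge-file format, variable names, iteration count, and the damping factor 0.85 are illustrative, not taken from the slide.

val ITERATIONS = 10
val rawEdges = sc.textFile("hdfs://...").map { line =>
  val parts = line.split("\\s+"); (parts(0), parts(1))    // (url, one outlink) per line
}
val links = rawEdges.groupByKey().cache()                 // (url, all outlinks), reused every iteration
var ranks = links.mapValues(_ => 1.0)                     // start every page at rank 1.0

for (_ <- 1 to ITERATIONS) {
  // each page divides its current rank among the pages it links to
  val contribs = links.join(ranks).flatMap {
    case (_, (outlinks, rank)) => outlinks.map(dest => (dest, rank / outlinks.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}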
Optimizing Placement
links & ranks are repeatedly joined
Can co-partition them (e.g., hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
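The slide's URLPartitioner stands in for any Partitioner keyed on URL; a minimal sketch of the same co-partitioning idea with Spark's built-in HashPartitioner, reusing rawEdges from the PageRank sketch above (the partition count is illustrative):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(128)
val links = rawEdges.groupByKey(partitioner).cache()   // hash-partition links by URL once
var ranks = links.mapValues(_ => 1.0)                  // mapValues preserves the parent's partitioner
// Every links.join(ranks) in the loop now sees both sides partitioned identically,
// so the join needs no shuffle.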
PageRank Performance
Representing RDDs
• a set of partitions, which are atomic pieces of the
dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its
parents
• metadata about its partitioning scheme
• data placement
Representing RDDs
Interface used to represent RDDs in Spark:

partitions(): Return a list of Partition objects
preferredLocations(p): List nodes where partition p can be accessed faster due to data locality
dependencies(): Return a list of dependencies
iterator(p, parentIters): Compute the elements of partition p given iterators for its parent partitions
partitioner(): Return metadata specifying whether the RDD is hash/range partitioned
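A compact Scala rendering of that interface. The method names follow the table, but the signatures and the placeholder Partition, Dependency, and Partitioner types are simplified stand-ins, not Spark's real classes:

trait Partition; trait Dependency; trait Partitioner   // simplified placeholder types

trait SimpleRDD[T] {
  def partitions(): Seq[Partition]                     // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]    // nodes where p can be accessed faster
  def dependencies(): Seq[Dependency]                  // parent RDDs this RDD was derived from
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from its parents
  def partitioner(): Option[Partitioner]               // hash/range partitioning metadata, if any
}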
Dependencies
• narrow dependencies
– where each partition of the parent RDD is used by at most one partition of the child RDD
• wide dependencies
– where multiple child partitions may depend on a single parent partition
• For example:
– map leads to a narrow dependency,
– while join leads to wide dependencies (unless the parents are hash-partitioned)
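A sketch of how those dependency kinds arise in ordinary Spark code, assuming two pair RDDs named pages and events already exist (the names are illustrative):

val upper   = pages.mapValues(v => v.toString.toUpperCase)  // narrow: each parent partition feeds exactly one child partition
val grouped = events.groupByKey()                           // wide: a child partition may need data from every parent partition
val joined  = pages.join(events)                            // wide in general, narrow if both sides already share a partitioner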
Dependencies
Examples of narrow and wide dependencies. Each box is an RDD, with
partitions shown as shaded rectangles
Narrow vs. Wide Dependencies
• Narrow dependencies
– allow pipelined execution on one cluster node, which can compute all the parent partitions
– make recovery after a node failure more efficient, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
– require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation
– mean that, in a lineage graph, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution
Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow dependencies as possible
• Stage boundaries are:
– the shuffle operations required for wide dependencies
– any already computed partitions (which short-circuit the computation of a parent RDD)
• The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD
Job Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-aware to avoid shuffles
Task Assignment
• The scheduler assigns tasks to machines based on data locality, using delay scheduling:
– if a task needs to process a partition that is available in memory on a node, it is sent to that node
– otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), it is sent to one of those
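A toy model of that placement rule, not Spark's actual scheduler code; every type and helper below is invented for illustration:

case class Partition(id: Int)
case class Node(host: String)

def chooseNode(p: Partition,
               idleNodes: Seq[Node],
               inMemoryOn: Partition => Set[Node],          // nodes currently caching p
               preferredLocations: Partition => Seq[Node]   // e.g., hosts of p's HDFS block
              ): Option[Node] =
  idleNodes.find(n => inMemoryOn(p).contains(n))                    // 1. a node already holding p in memory
    .orElse(idleNodes.find(n => preferredLocations(p).contains(n))) // 2. otherwise a preferred location
// With delay scheduling, the scheduler waits briefly for such a node to become idle
// before settling for a non-local assignment.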
Memory Management
• In-memory storage as deserialized Java objects
– The first option provides the fastest performance, because the Java VM can access each RDD element natively
• In-memory storage as serialized data
– The second option lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance
• On-disk storage
– The third option is useful for RDDs that are too large to keep in RAM but costly to recompute on each use
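In Spark's public API these three options are selected with storage levels passed to persist(); a minimal sketch, reusing the messages RDD from the log-mining example and assuming a reasonably recent Spark release:

import org.apache.spark.storage.StorageLevel

// Choose exactly one level per RDD; re-persisting the same RDD at a different level is an error.
messages.persist(StorageLevel.MEMORY_ONLY)         // option 1: deserialized Java objects in RAM
// messages.persist(StorageLevel.MEMORY_ONLY_SER)  // option 2: serialized bytes in RAM, more compact
// messages.persist(StorageLevel.DISK_ONLY)        // option 3: keep partitions on disk only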
Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same
operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous
fine-grained updates to shared state, such as a storage system for a web
application or an incremental web crawler
Programming Models Implemented on Spark
RDDs can express many existing parallel models
Open Source Community
15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications: their immutable nature and coarse-grained transformations suit a wide class of workloads.
They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery.
They let the user control each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.).


Editor's Notes

  • #9 Key idea: add “variables” to the “functions” in functional programming
  • #18 Pipelined execution: for example, one can apply a map followed by a filter on an element-by-element basis
  • #20 Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1’s output RDD is already in RAM, so we run stage 2 and then 3.
  • #26 My own summary:
    1. Simple, efficient, and applicable to a broad range of workloads.
    2. Lowers the cost of fault recovery for coarse-grained parallel algorithms.
    3. The user decides which data will be reused and therefore persisted, and with what storage policy; the user can also control the data-partitioning strategy to avoid shuffles and improve efficiency (e.g., co-partitioning, since a shuffle is a relatively slow and time-consuming operation).
    4. More general than typical specialized models: most existing systems were designed specifically to fix MapReduce's poor performance in particular domains, e.g., Google's Pregel. Pregel's data-sharing model is implicitly tailored to graph computation, whereas RDDs provide a more general data-sharing model that can express Pregel's computation model and also serve other application scenarios, making it more general and flexible.
    Difference from Pregel: A third class of systems provide high-level interfaces for specific classes of applications requiring data sharing. For example, Pregel [22] supports iterative graph applications, while Twister [11] and HaLoop [7] are iterative MapReduce runtimes. However, these frameworks perform data sharing implicitly for the pattern of computation they support, and do not provide a general abstraction that the user can employ to share data of her choice among operations of her choice. For example, a user cannot use Pregel or Twister to load a dataset into memory and then decide what query to run on it. RDDs provide a distributed storage abstraction explicitly and can thus support applications that these specialized systems do not capture, such as interactive data mining.
    Differences from MapReduce (summarized from the Shark paper): 1. Like Dryad and Tenzing [17, 9], it supports general computation DAGs, not just the two-stage MapReduce topology. 2. It provides an in-memory storage abstraction called Resilient Distributed Datasets (RDDs) that lets applications keep data in memory across queries, and automatically reconstructs it after failures [33]. 3. The engine is optimized for low latency. It can efficiently manage tasks as short as 100 milliseconds on clusters of thousands of cores, while engines like Hadoop incur a latency of 5-10 seconds to launch each task.
    Four key properties of RDDs (summarized from the Shark paper): The RDD model offers several key benefits in our large-scale in-memory computing setting. First, RDDs can be written at the speed of DRAM instead of the speed of the network, because there is no need to replicate each byte written to another machine for fault tolerance. DRAM in a modern server is over 10x faster than even a 10-Gigabit network. Second, Spark can keep just one copy of each RDD partition in memory, saving precious memory over a replicated system, since it can always recover lost data using lineage. Third, when a node fails, its lost RDD partitions can be rebuilt in parallel across the other nodes, allowing speedy recovery. Fourth, even if a node is just slow (a "straggler"), we can recompute necessary partitions on other nodes because RDDs are immutable, so there are no consistency concerns with having two copies of a partition.