Classical Data Processing
Memory
Disk
CPU
Single Node Machine
1. Data fits into memory
• Load data from disk into memory and then process from memory
2. Data does not fit into memory
• Load part of the data from disk into memory
• Process the data
Motivation: Simple Example
• 10 billion web pages
• Average size of webpage: 20KB
• Total 200 TB
• Disk read bandwidth = 50MB/sec
• Time to read = 4 million seconds = 46+ days
• Longer, if you want to do useful analytics with the data
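The arithmetic above can be checked with a short calculation. This is only a back-of-the-envelope sketch; it assumes decimal units (1 KB = 1000 bytes, and so on), and the constant names are made up for illustration.

```python
# Back-of-the-envelope check of the read-time estimate (decimal units assumed).
NUM_PAGES = 10 * 10**9            # 10 billion web pages
AVG_PAGE_BYTES = 20 * 10**3       # 20 KB per page
DISK_BYTES_PER_SEC = 50 * 10**6   # 50 MB/sec read bandwidth
SECONDS_PER_DAY = 86_400

total_bytes = NUM_PAGES * AVG_PAGE_BYTES
read_seconds = total_bytes / DISK_BYTES_PER_SEC
read_days = read_seconds / SECONDS_PER_DAY

print(total_bytes / 10**12, "TB in total")
print(read_seconds, "seconds to read")
print(round(read_days, 1), "days to read")
```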
What is the Solution?
• Use multiple interconnected machines as follows
BIG DATA
Known as distributed data processing in a cluster of computers
1. Split data into small chunks
2. Send different chunks to different machines and process
3. Collect the results from different machines
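The three steps can be imitated on a single computer. The sketch below is only an illustration: worker threads stand in for the different machines, summing numbers stands in for the real processing, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Step 1: split the data into small chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """The work done on one 'machine' (here: summing the chunk)."""
    return sum(chunk)


def distributed_sum(data, chunk_size=4, num_workers=3):
    chunks = split_into_chunks(data, chunk_size)
    # Step 2: send different chunks to different workers and process them.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Step 3: collect the partial results and combine them.
    return sum(partial_results)


total = distributed_sum(list(range(1, 11)))
print(total)
```

In a real cluster the chunks would live on different machines and the partial results would travel over the network, which is where the challenges on the following slides come from.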
Cluster Architecture: Rack Servers
[Figure: machines grouped into racks (Rack 1, Rack 2), each rack with its own switch; the rack switches are connected through a backbone switch. Labels: typically 2-10 Gbps at the backbone switch; 1 Gbps between any pair of nodes.]
• Each rack contains 16-64 commodity (low cost) computers (also called nodes)
• In 2011, Google had roughly 1 million nodes
Challenge # 1
• Node failures
– Single server lifetime: 1000 days
– 1000 servers in cluster => 1 failure/day
– 1M servers in clusters => 1000 failures/day
• Consequences of node failure
– Data loss
– Node failure in the middle of a long and expensive computation
• Need to restart the computation from scratch
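The failure rates above follow from dividing the number of servers by the lifetime of one server. The sketch below assumes, as a simplification, that failures are independent and spread evenly over a server's lifetime; the function name is hypothetical.

```python
SERVER_LIFETIME_DAYS = 1000  # single server lifetime from the slide


def expected_failures_per_day(num_servers, lifetime_days=SERVER_LIFETIME_DAYS):
    """Average number of servers failing per day in a cluster of num_servers."""
    return num_servers / lifetime_days


for cluster_size in (1_000, 1_000_000):
    print(cluster_size, "servers:", expected_failures_per_day(cluster_size), "failures/day")
```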
Challenge # 2
• Network bottleneck
– Computers in a cluster exchange data through the network
– Example
• Network bandwidth = 1 Gbps
• Moving 10 TB of data takes about 1 day
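A quick check of this estimate, assuming decimal units and a link that is fully used for the whole transfer (both are simplifying assumptions):

```python
# Time to move 10 TB over a 1 Gbps link (decimal units assumed).
BANDWIDTH_BITS_PER_SEC = 10**9   # 1 Gbps
DATA_BYTES = 10 * 10**12         # 10 TB
SECONDS_PER_DAY = 86_400

transfer_seconds = DATA_BYTES * 8 / BANDWIDTH_BITS_PER_SEC
transfer_days = transfer_seconds / SECONDS_PER_DAY

print(transfer_seconds, "seconds")
print(round(transfer_days, 2), "days")
```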
Challenge # 3
• Distributed programming is hard!
• Why?
1. Data distribution across machines is non-trivial
• (It is desirable that machines have roughly the same load)
2. Avoiding race conditions
• Given two tasks T1 and T2,
– Correctness of the result depends on the order in which the tasks execute
– For example, T1 must run before T2, and NOT T2 before T1
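A tiny illustration of why the order matters. The two tasks below are hypothetical stand-ins for T1 and T2: both update the same value, and the final result depends on which one runs first. The sketch runs them sequentially in both orders; it does not use real concurrency.

```python
def t1(value):
    """Hypothetical task T1: add 100 to the shared value."""
    return value + 100


def t2(value):
    """Hypothetical task T2: double the shared value."""
    return value * 2


def run_in_order(tasks, start):
    """Apply the tasks one after another to the shared value."""
    value = start
    for task in tasks:
        value = task(value)
    return value


intended = run_in_order([t1, t2], start=100)   # T1 before T2
wrong = run_in_order([t2, t1], start=100)      # T2 before T1

print("T1 then T2:", intended)
print("T2 then T1:", wrong)
```

If nothing forces T1 to finish before T2 starts, either result can occur, which is what makes such bugs hard to reproduce.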
What is the Solution?
• Map-Reduce
– It is a simple programming model for processing really big data using a cluster of computers
How Map-Reduce Addresses the Challenges
1. Data loss prevention
• By keeping multiple copies of data in different machines
2. Data movement minimization
• By moving computation to the data
– (send your computer program to machines containing data)
3. Simple programming model
• Mainly using two functions
1. Map
2. Reduce
Programmer’s responsibility:
Write only two functions, Map and Reduce, suitable for your problem
You DO NOT need to worry about other things
Redundant Storage Infrastructure
• Distributed File System
– Global file namespaces, redundancy
– Multiple copies of data on different nodes
– Example: Google File System (GFS), HDFS (Hadoop, an open-source map-reduce system)
• Typical usage pattern
– Data is rarely updated in place
Distributed File System: Inside Look
• Data is kept in chunks, spread across machines
• Each chunk is replicated on different machines
– Ensures persistence
– Example:
• We have two files, A and B
• 3 computers
• 2 times replication of data
Files and their chunks: A = a1, a2, a3; B = b1, b2
[Figure: three chunk servers; each of the chunks a1, a2, a3, b1, b2 is stored on two different servers]
• Chunk servers also serve as compute nodes
• Bring computation to the data
Distributed File System: Summary
• Chunk servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated (usually 2 times or 3 times)
– Try to keep replicas in different racks
• Master node
– Stores metadata about where the files are stored
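The sketch below illustrates the idea of chunk replication and master metadata using the earlier example (files A and B, 3 servers, 2 replicas per chunk). The round-robin placement and all names are assumptions made for illustration; this is not how GFS or HDFS actually choose replica locations, and it ignores racks.

```python
def place_replicas(file_name, num_chunks, servers, replication=2):
    """Return master-style metadata: chunk id -> servers holding a replica.

    Hypothetical round-robin policy: replicas of a chunk go to consecutive
    servers, so the copies of one chunk always land on different servers.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    metadata = {}
    for index in range(num_chunks):
        chunk_id = f"{file_name}{index + 1}"
        metadata[chunk_id] = [
            servers[(index + r) % len(servers)] for r in range(replication)
        ]
    return metadata


servers = ["server1", "server2", "server3"]
master_metadata = {}
master_metadata.update(place_replicas("a", 3, servers))  # file A: a1, a2, a3
master_metadata.update(place_replicas("b", 2, servers))  # file B: b1, b2

for chunk_id, locations in master_metadata.items():
    print(chunk_id, "->", locations)
```

With this metadata, losing any single server still leaves one copy of every chunk on another server.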
Example Problem: Counting Words
• We have a huge text document and want to count the number of times each distinct word appears in the file
• Sample application
– Analyze web server logs to find popular URLs
• How would you solve this using a single machine?
Word Count
• Case 1: File too large for memory, but all <word, count> pairs fit in memory
• You can create a big string array OR you can create a hash table
• Case 2: All <word, count> pairs do not fit in memory, but fit on disk
• A possible approach (write computer programs/functions for each step)
1. Break the text document into sequence of words
2. Sort the words
• This will bring same words together
3. Count the frequencies in a single pass
getWords(textFile) → sort → count
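Both cases can be sketched in a few lines. For Case 1 a hash table (a Python `Counter`) holds the counts while the file is streamed. For Case 2 the words are sorted and then counted in a single pass. Note the assumption: the in-memory `sorted()` below only stands in for a disk-based (external) sort, which is what Case 2 would really need. The function names and the word-splitting rule are illustrative choices.

```python
import re
from collections import Counter


def get_words(lines):
    """Step 1: break the text into a sequence of lower-cased words."""
    for line in lines:
        yield from re.findall(r"[a-z']+", line.lower())


def count_with_hash_table(lines):
    """Case 1: all <word, count> pairs fit in memory."""
    return Counter(get_words(lines))


def count_sorted(sorted_words):
    """Step 3: single pass over sorted words; equal words are adjacent."""
    current, count = None, 0
    for word in sorted_words:
        if word == current:
            count += 1
        else:
            if current is not None:
                yield current, count
            current, count = word, 1
    if current is not None:
        yield current, count


lines = ["the crew of the space shuttle", "the crew returned"]
case1 = count_with_hash_table(lines)
# Step 2: sorted() is a stand-in for an external sort on disk.
case2 = list(count_sorted(sorted(get_words(lines))))

print(case1)
print(case2)
```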
Map-Reduce: In a Nutshell
– getWords(dataFile) → sort → count
• Map: extract something you care about (here word and count)
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, etc.
• Save the results
Summary
1. Outline stays the same
2. Map and Reduce to be defined to fit the problem
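To make the summary concrete, here is a minimal single-machine sketch in which the outline (map, group by key, reduce) is fixed and only the two functions change with the problem. The driver `map_reduce` is hypothetical and is not the API of Hadoop or any real system. The example counts URL hits in web server logs, assuming a made-up log format where the URL is the first field of each line.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Fixed outline: map every input, group values by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:                 # Map
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value) # Group by key
    results = []
    for out_key in sorted(groups):            # Reduce
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Problem-specific part: only these two functions are written per problem.
def url_mapper(log_name, log_text):
    for line in log_text.splitlines():
        fields = line.split()
        if fields:
            yield fields[0], 1   # assumed format: URL is the first field


def url_reducer(url, counts):
    yield url, sum(counts)


logs = [
    ("log1", "/home 200\n/about 200\n/home 200"),
    ("log2", "/home 404"),
]
hits = map_reduce(logs, url_mapper, url_reducer)
print(hits)
```

Swapping in a different mapper and reducer solves a different problem without touching `map_reduce`.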
MapReduce: The Map Step
[Figure: each map call takes one input key-value pair (k, v) and produces intermediate key-value pairs]
• Input key-value pairs (file name and its content)
• Intermediate key-value pairs (word and count)
Map-reduce: Word Count
Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Crew members at ……
MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
(the, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (endeavor, 1) (recently, 1) (returned, 1) (to, 1) (earth, 1) (as, 1) (ambassadors, 1) … (crew, 1) …
Group by key: Collect all pairs with the same key
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …
Reduce (provided by the programmer): Collect all values belonging to the key and output (key, value)
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
Word Count Using MapReduce: Pseudocode
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: the counts emitted for that word
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
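The pseudocode translates almost line by line into Python, with `yield` playing the role of `emit`. The small driver `run_word_count` is a hypothetical single-machine stand-in for the system: it sorts the intermediate pairs and groups equal keys, imitating the "sort and shuffle" step. Splitting on whitespace is a simplifying assumption.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.lower().split():
        yield w, 1


def reduce_fn(key, values):
    # key: a word; values: the counts emitted for that word
    result = 0
    for v in values:
        result += v
    yield key, result


def run_word_count(documents):
    """Hypothetical driver: map, then sort and group by key, then reduce."""
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    intermediate.sort(key=itemgetter(0))          # sort brings equal keys together
    output = {}
    for word, pairs in groupby(intermediate, key=itemgetter(0)):
        for out_key, total in reduce_fn(word, (v for _, v in pairs)):
            output[out_key] = total
    return output


counts = run_word_count({
    "doc1": "the crew of the space shuttle",
    "doc2": "the crew",
})
print(counts)
```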
Map-reduce System: Under the Hood
All phases are distributed with many tasks doing the work in parallel
Moving data across machines
Map-Reduce Algorithm Design
• Programmer’s responsibility is to design two functions:
1. Map
2. Reduce
• A very important issue
– Often network is the bottleneck
– Your design should minimize data communications
Problems Suitable for Map-reduce
• Map-reduce is suitable for batch processing
– Updates are made after the whole batch of data is processed
– The mappers do not need data from one another while they are running
– Example
1. Word count
Problems NOT Suitable for Map-reduce
• In general, when the machines need to exchange data too often during computation
• Examples
1. Applications that require very quick response time
• In IR, indexing is okay, but query processing is not suitable for map-reduce
2. Machine learning algorithms that require frequent parameter updates
• Stochastic gradient descent
Warm-up Exercise
• Matrix Addition
• Can it be done in map-reduce?
– YES
• What is the map function (key and value)?
– Key = row number; value = elements of row (as an array)
• What is the reduce function?
– For each key, reducer will have two arrays
– Reduce function simply adds numbers, position-wise
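A minimal single-machine sketch of this answer: the map function emits (row number, row) for every row of both matrices, so each reducer receives the two rows that share a row number and adds them position-wise. The grouping code inside `add_matrices` is a hypothetical stand-in for the shuffle step, and all names are illustrative.

```python
from collections import defaultdict


def map_matrix(matrix):
    """Map: key = row number, value = elements of the row (as a list)."""
    for row_number, row in enumerate(matrix):
        yield row_number, row


def reduce_rows(row_number, rows):
    """Reduce: add the rows that share a row number, position-wise."""
    yield row_number, [sum(elements) for elements in zip(*rows)]


def add_matrices(m1, m2):
    groups = defaultdict(list)
    for matrix in (m1, m2):
        for row_number, row in map_matrix(matrix):
            groups[row_number].append(row)      # group by key
    result = []
    for row_number in sorted(groups):
        for _, summed_row in reduce_rows(row_number, groups[row_number]):
            result.append(summed_row)
    return result


total = add_matrices([[1, 2], [3, 4]], [[10, 20], [30, 40]])
print(total)
```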
Advanced Exercise: Join By Map-Reduce
• Compute the natural join T1(A,B) ⋈ T2(B,C)
(combine rows from T1 and T2 that have a common value in column B)
T1:
A   B
a1  b1
a2  b1
a3  b2
a4  b3
T2:
B   C
b2  c1
b2  c2
b3  c3
T1 ⋈ T2:
A   B   C
a3  b2  c1
a3  b2  c2
a4  b3  c3
Map-Reduce Join
• Map process
– Turn each row (a,b) from T1 into the key-value pair (b, (a,T1))
– Turn each row (b,c) from T2 into the key-value pair (b, (c,T2))
• Reduce process
– For each key b, the reducer matches all the pairs (b, (a,T1)) with all the pairs (b, (c,T2)) and outputs (a,b,c)
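The map and reduce processes can be sketched on one machine using the tables T1 and T2 from the exercise. The grouping inside `natural_join` is a hypothetical stand-in for the shuffle step; the function names are illustrative.

```python
from collections import defaultdict


def map_t1(row):
    a, b = row
    yield b, (a, "T1")     # key = join column B, value tagged with its table


def map_t2(row):
    b, c = row
    yield b, (c, "T2")


def reduce_join(b, tagged_values):
    """Match every T1 value with every T2 value that shares the key b."""
    a_values = [v for v, tag in tagged_values if tag == "T1"]
    c_values = [v for v, tag in tagged_values if tag == "T2"]
    for a in a_values:
        for c in c_values:
            yield a, b, c


def natural_join(t1, t2):
    groups = defaultdict(list)
    for row in t1:
        for key, value in map_t1(row):
            groups[key].append(value)
    for row in t2:
        for key, value in map_t2(row):
            groups[key].append(value)
    result = []
    for b in sorted(groups):
        result.extend(reduce_join(b, groups[b]))
    return result


T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
joined = natural_join(T1, T2)
print(joined)
```

Key b1 appears only in T1, so its reducer finds nothing to match and outputs no rows.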
Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated features in that data.
Take Home Exercises
• Design Map and Reduce functions for the following
1. PageRank
2. HITS