Introduction to Map-Reduce
Classical Data Processing
[Diagram: a single-node machine with its CPU, memory, and disk]
1. Data fits into memory
• Load the data from disk into memory, then process it in memory
2. Data does not fit into memory
• Load part of the data from disk into memory, process it, and repeat for the remaining parts (a minimal sketch follows)
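A minimal single-machine sketch of the second case in Python; the file name and chunk size are illustrative placeholders:

def process(chunk):
    pass  # placeholder for whatever per-chunk work the application needs

with open("big_data.bin", "rb") as f:      # hypothetical input file
    while True:
        chunk = f.read(64 * 1024 * 1024)   # load 64 MB from disk into memory
        if not chunk:
            break
        process(chunk)                     # process this part, then repeat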
Motivation: Simple Example
• 10 Billion web pages
• Average size of webpage: 20KB
• Total 200 TB
• Disk read bandwidth = 50MB/sec
• Time to read = 4 million seconds = 46+ days
• Longer, if you want to do useful analytics with the data
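The back-of-the-envelope arithmetic, as a quick Python check:

pages = 10 * 10**9             # 10 billion web pages
bytes_per_page = 20 * 10**3    # 20 KB average page size
bandwidth = 50 * 10**6         # 50 MB/sec disk read bandwidth

total_bytes = pages * bytes_per_page    # 2e14 bytes = 200 TB
seconds = total_bytes / bandwidth       # 4,000,000 seconds
print(seconds / 86_400)                 # ~46.3 days just to read the data once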
What is the Solution?
• Use multiple interconnected machines, as follows:
1. Split the big data into small chunks
2. Send different chunks to different machines and process them in parallel
3. Collect the results from the different machines
• This is known as distributed data processing on a cluster of computers
How to Organize Cluster of Computers?
Cluster Architecture: Rack Servers
[Diagram: two racks of machines, each rack with its own switch, joined by a backbone switch; typically 1 gbps between any pair of nodes within a rack and 2-10 gbps across the backbone]
• Each rack contains 16-64 commodity (low cost) computers (also called nodes)
• In 2011, Google had roughly 1 million nodes
Google Data Centre
Challenges in Cluster Computing
Challenge # 1
• Node failures
– Single server lifetime: 1000 days
– 1000 servers in cluster => 1 failure/day
– 1M servers in clusters => 1000 failures/day
• Consequences of node failure
– Data loss
– Node failure in the middle of a long and expensive computation
• Need to restart the computation from scratch
Challenge # 2
• Network bottleneck
– Computers in a cluster exchange data through the network
– Example (worked out below)
• Network bandwidth = 1 gbps
• Moving 10 TB of data takes about 1 day
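The same style of back-of-the-envelope check in Python:

data_bits = 10 * 10**12 * 8    # 10 TB expressed in bits
bandwidth = 10**9              # 1 gbps network link
seconds = data_bits / bandwidth
print(seconds / 3600)          # 80,000 s, about 22 hours -- roughly a day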
Challenge # 3
• Distributed Programming is hard!
• Why?
1. Data distribution across machines is non-trivial
• (It is desirable that machines carry roughly the same load)
2. Avoiding race conditions
• Given two tasks T1 and T2, the correctness of the result may depend on their order of execution
– For example, T1 must run before T2, but never T2 before T1
What is the Solution?
• Map-Reduce
– A simple programming model for processing really big data using a cluster of computers
How Does Map-Reduce Address the Challenges?
1. Data loss prevention
• By keeping multiple copies of data in different machines
2. Data movement minimization
• By moving computation to the data
– (send your program to the machines containing the data)
3. Simple programming model
• Mainly using two functions
1. Map
2. Reduce
Programmer’s responsibility:
Write only two functions, Map and Reduce, suited to your problem.
You DO NOT need to worry about anything else.
Redundant Storage Infrastructure
• Distributed File System
– Global file namespace, redundancy
– Multiple copies of data on different nodes
– Examples: Google File System (GFS), HDFS (from Hadoop, an open-source map-reduce system)
• Typical usage pattern
– Data is rarely updated in place
Distributed File System: Inside Look
• Data is kept in chunks, spread across machines
• Each chunk is replicated on different machines
– Ensures persistence
– Example:
• We have two files, A and B
• 3 computers
• 2 times replication of data
[Diagram: file A is split into chunks a1, a2, a3 and file B into chunks b1, b2; each chunk is stored on two of the three chunk servers]
• These machines are the chunk servers
• Chunk servers also serve as compute nodes
– Bring the computation to the data
Distributed File System: Summary
• Chunk servers
– Each file is split into contiguous chunks (16-64 MB)
– Each chunk is replicated (usually 2 or 3 times)
– Replicas are kept in different racks where possible
• Master node
– Stores metadata about where the files are stored
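A toy sketch of the placement policy above; the server and rack names are illustrative, and real systems (GFS, HDFS) use far more elaborate logic:

import random

# Toy placement: give each chunk `copies` replicas on distinct
# servers, preferring servers on distinct racks.
SERVERS = {"s1": "rack1", "s2": "rack1", "s3": "rack2", "s4": "rack2"}

def place_replicas(copies=2):
    chosen, racks_used = [], set()
    candidates = list(SERVERS)
    random.shuffle(candidates)
    # First pass: prefer servers on racks not used yet.
    for s in candidates:
        if len(chosen) < copies and SERVERS[s] not in racks_used:
            chosen.append(s)
            racks_used.add(SERVERS[s])
    # Second pass: fill any remaining slots with unused servers.
    for s in candidates:
        if len(chosen) < copies and s not in chosen:
            chosen.append(s)
    return chosen

for chunk in ["a1", "a2", "a3", "b1", "b2"]:
    print(chunk, "->", place_replicas())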
Map-Reduce Programming Model
Example Problem: Counting Words
• We have a huge text document and want to count the number of times each distinct word appears in it
• Sample application
– Analyze web server logs to find popular URLs
• How would you solve this using a single machine?
Word Count
• Case 1: The file is too large for memory, but all <word, count> pairs fit in memory
• Keep the counts in a hash table (or a big array of strings)
• Case 2: All <word, count> pairs do not fit in memory, but fit on disk
• A possible approach (write a program/function for each step; a sketch follows the list)
1. Break the text document into a sequence of words
2. Sort the words
• This brings identical words next to each other
3. Count the frequencies in a single pass
getWords(textFile) → sort → count
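A compact Python version of the three steps; the file name is a placeholder, and for data larger than memory the sort would be an external (disk-based) sort rather than the in-memory built-in:

def get_words(text_file):
    with open(text_file) as f:
        for line in f:
            yield from line.lower().split()

words = sorted(get_words("document.txt"))   # step 2: identical words become adjacent

# Step 3: count frequencies in a single pass over the sorted words.
prev, count = None, 0
for w in words + [None]:                    # the None sentinel flushes the last group
    if w == prev:
        count += 1
    else:
        if prev is not None:
            print(prev, count)
        prev, count = w, 1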
Map-Reduce: In a Nutshell
getWords(dataFile) → sort → count
• Map: extract something you care about (here, words and counts)
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, etc.; save the results
Summary
1. The outline stays the same
2. Map and Reduce are defined to fit the problem
MapReduce: The Map Step
[Diagram: map tasks turn input key-value pairs (file name, file content) into intermediate key-value pairs (word, count)]
MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs are grouped by key into key-value groups, e.g. (k1, [v1, v3]); each reduce task then aggregates one group into an output key-value pair]
Map-reduce: Word Count
Big document:
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Crew members at …"

MAP (provided by the programmer): read the input and produce a set of (key, value) pairs:
(the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (endeavor, 1), (recently, 1), (returned, 1), (to, 1), (earth, 1), (as, 1), (ambassadors, 1), …

Group by key (done by the system): collect all pairs with the same key:
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), …

Reduce (provided by the programmer): collect all values belonging to each key and output one (key, value) pair per key:
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
Word Count Using MapReduce: Pseudocode
map(key, value):
// key: document name; value: text of the document
for each word w in value
emit(w, 1)
reduce(key, values):
// key: a word; values: a list of counts for the word
result = 0
for each count v in values:
result += v
emit(key, result)
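The pseudocode drops straight into a tiny single-process simulation of the map → group-by-key → reduce pipeline. This is plain Python with no framework, shown only to make the data flow concrete:

from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for word in value.lower().split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word; values: a list of counts for the word
    yield key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: produce intermediate key-value pairs.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Group by key (the shuffle the framework normally does for you).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: aggregate each group.
    return [kv for k, vs in groups.items() for kv in reduce_fn(k, vs)]

doc = "the crew of the space shuttle Endeavor recently returned to Earth"
print(map_reduce([("doc1", doc)], map_fn, reduce_fn))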
Map-reduce System: Under the Hood
• All phases are distributed, with many tasks doing the work in parallel
• The shuffle between the map and reduce phases moves data across machines
Map-Reduce Algorithm Design
• Programmer’s responsibility is to design two
functions:
1. Map
2. Reduce
• A very important issue
– The network is often the bottleneck
– Your design should minimize data communication (see the combiner sketch below)
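One standard trick is local pre-aggregation: combine values inside the map task before anything crosses the network (Hadoop exposes this as a combiner). A sketch that reuses the word-count setting:

from collections import Counter

def map_with_combiner(key, value):
    # Pre-aggregate counts locally so the map task emits one
    # (word, n) pair per distinct word instead of one (word, 1)
    # pair per occurrence -- far less data to shuffle.
    for word, n in Counter(value.lower().split()).items():
        yield word, n

The reduce function is unchanged: summing pre-aggregated counts gives the same totals.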
Problems Suitable for Map-reduce
• Map-reduce is suitable for batch processing
– Updates are made after the whole batch of data is processed
– The mappers do not need data from one another while they are running
– Example
1. Word count
Problems NOT Suitable for Map-reduce
• In general, problems where the machines need to exchange data frequently during the computation
• Examples
1. Applications that require very quick response times
• In IR, indexing is fine, but query processing is not suitable for map-reduce
2. Machine learning algorithms that require frequent parameter updates
• e.g., stochastic gradient descent
Exercises
Warm-up Exercise
• Matrix Addition
• Can it be done in map-reduce?
– YES
• What is the map function (key and value)?
– Key = row number; value = elements of row (as an array)
• What is the reduce function?
– For each key, the reducer receives two arrays (one row from each matrix)
– The reduce function simply adds the numbers position-wise (sketched below)
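A sketch of these answers, reusing the map_reduce driver from the word-count example; matrices are given as lists of rows:

def map_fn(matrix_name, matrix):
    # Emit (row number, row) for every row of the matrix.
    for i, row in enumerate(matrix):
        yield i, row

def reduce_fn(row_number, rows):
    # Each key collects one row from each matrix; add position-wise.
    a, b = rows
    yield row_number, [x + y for x, y in zip(a, b)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(map_reduce([("A", A), ("B", B)], map_fn, reduce_fn))
# [(0, [6, 8]), (1, [10, 12])]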
Advanced Exercise: Join By Map-Reduce
• Compute the natural join T1(A,B) ⋈ T2(B,C)
(combine rows from T1 and T2 that share a common value in column B)
T1:  A   B        T2:  B   C        T1 ⋈ T2:  A   B   C
     a1  b1            b2  c1                 a3  b2  c1
     a2  b1            b2  c2                 a3  b2  c2
     a3  b2            b3  c3                 a4  b3  c3
     a4  b3
Map-Reduce Join
• Map process
– Turn each row (a, b) of T1 into the key-value pair (b, (a, T1))
– Turn each row (b, c) of T2 into the key-value pair (b, (c, T2))
• Reduce process
– For each key b, match every pair (b, (a, T1)) with every pair (b, (c, T2)) and output (a, b, c) (sketched below)
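The same idea in code, again reusing the map_reduce driver from the word-count example:

def map_fn(table_name, rows):
    # Tag each row with its table of origin, keyed by join column B.
    if table_name == "T1":
        for a, b in rows:
            yield b, ("T1", a)
    else:
        for b, c in rows:
            yield b, ("T2", c)

def reduce_fn(b, tagged):
    # Match every T1 value with every T2 value sharing this key b.
    left = [a for tag, a in tagged if tag == "T1"]
    right = [c for tag, c in tagged if tag == "T2"]
    for a in left:
        for c in right:
            yield (a, b, c)

T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
print(map_reduce([("T1", T1), ("T2", T2)], map_fn, reduce_fn))
# [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]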
Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated features in that data.
Take Home Exercises
• Design Map and Reduce functions for the following:
1. PageRank
2. HITS