COMPLETE MAP AND REDUCE FRAMEWORK INTRODUCTION

Classical Data Processing
[Figure: a single-node machine with CPU, memory, and disk]
1. Data fits into memory
• Load data from disk into memory and then process from memory
2. Data does not fit into memory
• Load part of the data from disk into memory
• Process it, then repeat with the next part

Motivation: Simple Example
• 10 Billion web pages
• Average size of webpage: 20KB
• Total 200 TB
• Disk read bandwidth = 50MB/sec
• Time to read = 4 million seconds = 46+ days
• Longer, if you want to do useful analytics with the data
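The arithmetic behind these figures can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 KB = 10^3 bytes, 1 MB = 10^6 bytes); the variable names are illustrative only.

```python
# Figures taken from the slide; decimal units are an assumption.
NUM_PAGES = 10 * 10**9          # 10 billion web pages
PAGE_SIZE_BYTES = 20 * 10**3    # average page size: 20 KB
READ_BANDWIDTH = 50 * 10**6     # disk read bandwidth: 50 MB/sec

total_bytes = NUM_PAGES * PAGE_SIZE_BYTES
read_seconds = total_bytes / READ_BANDWIDTH
read_days = read_seconds / (24 * 60 * 60)

print(f"Total size: {total_bytes / 10**12:.0f} TB")
print(f"Time to read: {read_seconds:,.0f} seconds = {read_days:.1f} days")
```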

What is the Solution?
• Use multiple interconnected machines, as follows
1. Split the big data into small chunks
2. Send different chunks to different machines and process them
3. Collect the results from the different machines
• Known as distributed data processing on a cluster of computers

How to Organize Cluster of Computers?

Cluster Architecture: Rack Servers
[Figure: Rack 1 and Rack 2, each a set of machines attached to a rack switch; the rack switches are connected through a backbone switch]
• Backbone switch: typically 2-10 Gbps
• 1 Gbps between any pair of nodes
• Each rack contains 16-64 commodity (low cost) computers (also called nodes)
• In 2011, Google had roughly 1 million nodes

Challenges in Cluster Computing

Challenge # 1
• Node failures
– Single server lifetime: 1000 days
– 1000 servers in a cluster => 1 failure/day
– 1M servers in a cluster => 1000 failures/day
• Consequences of node failure
– Data loss
– A node may fail in the middle of a long and expensive computation
• The computation then has to be restarted from scratch
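The failure rates quoted above follow from dividing the cluster size by the server lifetime. The sketch below assumes failures are spread evenly over each server's lifetime, which is a simplification; the function name is hypothetical.

```python
def expected_failures_per_day(num_servers, lifetime_days=1000):
    """Expected node failures per day, assuming failures are spread
    evenly over each server's lifetime (a simplifying assumption)."""
    return num_servers / lifetime_days

for n in (1_000, 1_000_000):
    print(f"{n:>9,} servers => {expected_failures_per_day(n):,.0f} failure(s)/day")
```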

Challenge # 2
• Network bottleneck
– Computers in a cluster exchange data through the network
– Example
• Network bandwidth = 1 Gbps
• Moving 10 TB of data takes roughly 1 day
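The "roughly 1 day" figure can be reproduced as follows. This is a back-of-the-envelope sketch that assumes decimal units and a link that sustains its full 1 Gbps with no protocol overhead.

```python
DATA_BYTES = 10 * 10**12      # 10 TB, decimal units assumed
BANDWIDTH_BITS = 1 * 10**9    # 1 Gbps, assumed fully utilized

transfer_seconds = DATA_BYTES * 8 / BANDWIDTH_BITS   # bytes -> bits
transfer_hours = transfer_seconds / 3600
transfer_days = transfer_hours / 24

print(f"{transfer_seconds:,.0f} seconds = {transfer_hours:.1f} hours "
      f"= {transfer_days:.2f} days")
```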

Challenge # 3
• Distributed Programming is hard!
• Why?
1. Distributing data across machines is non-trivial
• (It is desirable that machines have roughly the same load)
2. Avoiding race conditions
• Given two tasks T1 and T2,
– The correctness of the result depends on the order in which the tasks execute
– For example, T1 must run before T2, and NOT T2 before T1

What is the Solution?
• Map-Reduce
– It is a simple programming model for processing really big data using a cluster of computers

How Does Map-Reduce Address the Challenges?
1. Data loss prevention
• By keeping multiple copies of data on different machines
2. Data movement minimization
• By moving computation to the data
– (send your computer program to machines containing data)
3. Simple programming model
• Mainly using two functions
1. Map
2. Reduce
Programmer’s responsibility:
• Write only two functions, Map and Reduce, suited to your problem
• You DO NOT need to worry about anything else

Redundant Storage Infrastructure
• Distributed File System
– Global file namespace, redundancy
– Multiple copies of data on different nodes
– Examples: Google File System (GFS), HDFS (from Hadoop, an open-source map-reduce system)
• Typical usage pattern
– Data is rarely updated in place

Distributed File System: Inside Look
• Data is kept in chunks, spread across machines
• Each chunk is replicated on different machines
– Ensures persistence
– Example:
• We have two files, A and B
• 3 computers
• 2 times replication of data
[Figure: file A is split into chunks a1, a2, a3 and file B into chunks b1, b2; two copies of each chunk are spread across the three chunk servers]
• Chunk servers also serve as compute nodes
• Bring computation to the data

Distributed File System: Summary
• Chunk servers
– File is split into contiguous chunks (16-64 MB)
– Each chunk is replicated (usually 2 or 3 times)
– Try to keep replicas in different racks
• Master node
– Stores metadata about where the files are stored

Example Problem: Counting Words
• We have a huge text document and want to count the number of times each distinct word appears in it
• Sample application
– Analyze web server logs to find popular URLs
• How would you solve this using a single machine?

Word Count
• Case 1: File too large for memory, but all <word, count> pairs fit in memory
• You can keep the counts in a big array of words OR in a hash table
• Case 2: Not all <word, count> pairs fit in memory, but they fit on disk
• A possible approach (write a program/function for each step)
1. Break the text document into a sequence of words
2. Sort the words
• This brings identical words together
3. Count the frequencies in a single pass
getWords(textFile) -> sort -> count
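Both single-machine cases can be sketched in a few lines. The functions below are illustrative assumptions, not part of the slides: the tokenizer is deliberately simple, and in Case 2 an in-memory `sorted` stands in for the on-disk (external) sort that a real solution would need.

```python
import re
from collections import Counter
from itertools import groupby

def get_words(lines):
    """Break text into a sequence of lowercase words, one line at a time."""
    for line in lines:
        yield from re.findall(r"[a-z']+", line.lower())

def count_with_hash_table(lines):
    """Case 1: stream the file and keep <word, count> pairs in a hash table."""
    counts = Counter()
    for word in get_words(lines):
        counts[word] += 1
    return dict(counts)

def count_with_sort(lines):
    """Case 2: sort the words, then count each run of equal words in one pass."""
    sorted_words = sorted(get_words(lines))  # stand-in for an external sort
    return {word: sum(1 for _ in run) for word, run in groupby(sorted_words)}

sample = ["The crew of the space shuttle", "the crew returned to Earth"]
print(count_with_hash_table(sample))
print(count_with_sort(sample))
```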

Map-Reduce: In a Nutshell
– getWords(dataFile) -> sort -> count
• Map
– Extract something you care about (here, word and count)
• Group by key
– Sort and shuffle
• Reduce
– Aggregate, summarize, etc.
– Save the results
Summary
1. The outline stays the same
2. Map and Reduce are defined to fit the problem
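The fixed outline (map, group by key, reduce) can be sketched as one small driver that takes the two problem-specific functions as arguments. This runs on a single machine and is only meant to show the data flow; `run_mapreduce` and the URL-counting functions are hypothetical names, and the example reuses the web-log application mentioned earlier.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce on one machine (illustration only)."""
    # Map: turn every input pair into intermediate (key, value) pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Group by key: collect all values that share a key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce: aggregate each group; keys are visited in sorted order.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

# Problem-specific part: count how often each URL appears in web server logs.
def map_log(log_name, urls):
    for url in urls:
        yield url, 1

def reduce_hits(url, hits):
    yield url, sum(hits)

logs = [
    ("log1", ["/home", "/about", "/home"]),
    ("log2", ["/home", "/contact"]),
]
url_counts = run_mapreduce(logs, map_log, reduce_hits)
print(url_counts)
```

Only `map_log` and `reduce_hits` change from problem to problem; the driver stays the same.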

MapReduce: The Map Step
• Input key-value pairs (file name and its content): (f1, c1), (f2, c2), (f3, c3), …
• map is applied to each input pair
• Intermediate key-value pairs (word and count): (k1, v1), (k2, v2), (k3, v3), (k4, v4), …

MapReduce: The Reduce Step
• Intermediate key-value pairs: (k1, v1), (k2, v2), (k1, v3), (k3, v4), (k2, v5), (k1, v6), …
• Group by key
• Key-value groups: (k1, [v1, v3, v6]), (k2, [v2, v5]), (k3, [v4]), …
• reduce is applied to each group
• Output key-value pairs: (k1, v′), (k2, v″), (k3, v‴), …

Map-reduce: Word Count
Big document:
The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Crew members at ……………………..
MAP (provided by the programmer): Read input and produce a set of (key, value) pairs
(the, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (endeavor, 1) (recently, 1) (returned, 1) (to, 1) (earth, 1) (as, 1) (ambassadors, 1) …. (crew, 1) ……..
Group by key: Collect all pairs with the same key
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …
Reduce (provided by the programmer): Collect all values belonging to the key and output (key, value)
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …

Word Count Using MapReduce: Pseudocode
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)
reduce(key, values):
    // key: a word; values: the counts emitted for that word
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
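The pseudocode above translates almost line for line into Python. The sketch below is a single-machine illustration: `emit` becomes `yield`, the tokenizer is a simplifying assumption, and `word_count` is a hypothetical stand-in for the grouping that a real map-reduce system performs across machines.

```python
import re
from collections import defaultdict

def map_fn(doc_name, text):
    """key: document name; value: text of the document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1

def reduce_fn(word, counts):
    """key: a word; values: the counts emitted for that word."""
    result = 0
    for v in counts:
        result += v
    yield word, result

def word_count(documents):
    """Tiny local driver: map every document, group by word, then reduce."""
    groups = defaultdict(list)
    for doc_name, text in documents.items():
        for word, one in map_fn(doc_name, text):
            groups[word].append(one)
    counts = {}
    for word, values in groups.items():
        for key, total in reduce_fn(word, values):
            counts[key] = total
    return counts

documents = {
    "doc1": "The crew of the space shuttle Endeavor recently returned to "
            "Earth as ambassadors, harbingers of a new era of space exploration."
}
counts = word_count(documents)
print(counts["the"], counts["of"], counts["space"], counts["crew"])
```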

Map-reduce System: Under the Hood
All phases are distributed with many tasks doing the work in parallel
Moving data across machines

Map-Reduce Algorithm Design
• The programmer’s responsibility is to design two functions:
1. Map
2. Reduce
• A very important issue
– Often the network is the bottleneck
– Your design should minimize data communication

Problems Suitable for Map-reduce
• Map-reduce is suitable for batch processing
– Updates are made after the whole batch of data is processed
– The mappers do not need data from one another while they are running
– Example
1. Word count

Problems NOT Suitable for Map-reduce
• In general, problems where the machines need to exchange data too often during computation
• Examples
1. Applications that require very quick response times
• In IR, indexing is okay, but query processing is not suitable for map-reduce
2. Machine learning algorithms that require frequent parameter updates
• Stochastic gradient descent

Warm-up Exercise
• Matrix Addition
• Can it be done in map-reduce?
– YES
• What is the map function (key and value)?
– Key = row number; value = elements of row (as an array)
• What is the reduce function?
– For each key, the reducer receives two arrays (the same row from each matrix)
– The reduce function simply adds the numbers position-wise
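The matrix-addition design above can be tried out locally. In this sketch the map function emits (row number, row) for each matrix and the reduce function adds the rows it receives position-wise; `matrix_add` is a hypothetical single-machine driver, and both matrices are assumed to have the same shape.

```python
from collections import defaultdict

def map_fn(matrix_name, rows):
    """Key = row number; value = elements of the row (as a list)."""
    for row_number, row in enumerate(rows):
        yield row_number, row

def reduce_fn(row_number, row_arrays):
    """Add the arrays that share a row number, position-wise."""
    yield row_number, [sum(column) for column in zip(*row_arrays)]

def matrix_add(a, b):
    """Local driver: map both matrices, group by row number, reduce."""
    groups = defaultdict(list)
    for name, matrix in (("A", a), ("B", b)):
        for row_number, row in map_fn(name, matrix):
            groups[row_number].append(row)
    result = {}
    for row_number, arrays in groups.items():
        for key, summed in reduce_fn(row_number, arrays):
            result[key] = summed
    return [result[i] for i in sorted(result)]

A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]
C = matrix_add(A, B)
print(C)
```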

Advanced Exercise: Join By Map-Reduce
• Compute the natural join T1(A,B) ⋈ T2(B,C)
(combine rows from T1 and T2 that have a common value in column B)
T1:
A   B
a1  b1
a2  b1
a3  b2
a4  b3
T2:
B   C
b2  c1
b2  c2
b3  c3
T1 ⋈ T2:
A   B   C
a3  b2  c1
a3  b2  c2
a4  b3  c3

Map-Reduce Join
• Map process
– Turn each row (a,b) from T1 into the key-value pair (b,(a,T1))
– Turn each row (b,c) from T2 into the key-value pair (b,(c,T2))
• Reduce process
– For each key b, the reducer matches all the pairs (b,(a,T1)) with all the pairs (b,(c,T2)) and outputs (a,b,c)
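The join described above can be sketched with the T1 and T2 tables from the previous slide. This is a single-machine illustration; the function names are hypothetical, and the grouping done in `mapreduce_join` stands in for the shuffle a real system would perform.

```python
from collections import defaultdict

def map_t1(row):
    """Row (a, b) from T1 becomes (b, (a, 'T1'))."""
    a, b = row
    yield b, (a, "T1")

def map_t2(row):
    """Row (b, c) from T2 becomes (b, (c, 'T2'))."""
    b, c = row
    yield b, (c, "T2")

def reduce_join(b, tagged_values):
    """Match every T1 value with every T2 value that shares the key b."""
    left = [v for v, table in tagged_values if table == "T1"]
    right = [v for v, table in tagged_values if table == "T2"]
    for a in left:
        for c in right:
            yield a, b, c

def mapreduce_join(t1, t2):
    """Local driver: map both tables, group by B, reduce each group."""
    groups = defaultdict(list)
    for row in t1:
        for key, value in map_t1(row):
            groups[key].append(value)
    for row in t2:
        for key, value in map_t2(row):
            groups[key].append(value)
    joined = []
    for b in sorted(groups):
        joined.extend(reduce_join(b, groups[b]))
    return joined

T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
result = mapreduce_join(T1, T2)
print(result)
```

Rows with B = b1 produce no output because no T2 row carries that key, which matches the join result shown on the slide.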

Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated features in that data.

Take Home Exercises
• Design Map and Reduce functions for the following
1. PageRank
2. HITS
