This document introduces HDFS and MapReduce, the two core components of Hadoop. HDFS is a distributed file system that stores large datasets as replicated blocks spread across commodity servers (DataNodes). MapReduce is a framework for processing those datasets in parallel by distributing work across the cluster: mappers transform input records into key-value pairs, and reducers aggregate the values collected for each key. The document walks through examples of how HDFS places blocks on DataNodes and how MapReduce jobs use mappers and reducers to analyze data.
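To make the mapper/reducer split concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for each token in its input split, the framework groups the pairs by key, and the reducer sums the counts for each word. Class names and the input/output paths are illustrative, not taken from the document.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs once per input split (typically one HDFS block)
  // and emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts emitted for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output are HDFS paths given on the command line,
    // e.g. /user/alice/input and /user/alice/output (hypothetical paths).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the example ties the two halves of the document together: the input path is an HDFS directory, each mapper is scheduled near the DataNode holding its block, and the shuffle phase between map and reduce is what lets the reducers see every count for a given word.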