The document provides an overview of Hadoop, its core components, and the MapReduce programming model. It defines Hadoop as an open-source software framework for distributed storage and processing of large datasets, describes the main Hadoop components (HDFS, NameNode, DataNode, JobTracker, and Secondary NameNode), and explains MapReduce as a programming model for distributed processing of big data across clusters.
"The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and
pronounce, meaningless and not used
elsewhere: those are my naming criteria.
Kids are good at generating such."
- Doug Cutting, Creator of Hadoop
“Hadoop is the popular open source
implementation of MapReduce, a
powerful tool designed for deep
analysis and transformation of very
large data sets.”
https://hadoop.apache.org/
When to Use Hadoop?
1. For Processing Really BIG Data.
2. For Storing a Diverse Set of Data.
3. For Parallel Data Processing.
When NOT to Use Hadoop?
1. For Real-Time Data Analysis.
2. For a Relational Database System.
3. For a General Network File System.
4. For Non-Parallel Data Processing.
What is JobTracker?
JobTracker is a daemon that runs as
part of Apache Hadoop's MapReduce
engine. It is an essential service that
farms out all MapReduce tasks to the
different nodes in the cluster,
ideally to those nodes that already
contain the data, or at the very
least to nodes located in the same
rack as the nodes containing the data.
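The locality preference described above can be sketched as a three-tier choice: node-local first, then rack-local, then any free node. This is a toy illustration of how a JobTracker-style scheduler might pick a node; all node and rack names are made up.

```python
# Toy data-locality scheduler in the spirit of the JobTracker: prefer a
# free node that already holds the block, then a free node in the same
# rack as a replica, then any free node. All names are illustrative.

def pick_node(block_locations, free_nodes, rack_of):
    # 1. Node-local: a free node that already stores a replica.
    for node in block_locations:
        if node in free_nodes:
            return node
    # 2. Rack-local: a free node in the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_locations}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node
    # 3. Off-rack: any free node at all.
    return free_nodes[0] if free_nodes else None

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
# The block's replicas live on n1 and n3.
print(pick_node(["n1", "n3"], ["n1", "n2"], rack_of))  # n1 (node-local)
print(pick_node(["n1", "n3"], ["n2", "n4"], rack_of))  # n2 (rack-local)
```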
What is NameNode?
NameNode - It is also known as the Master of a Hadoop cluster.
The main functions performed by the NameNode are:
It stores metadata about the actual data, e.g. filename,
path, number of blocks, block IDs, block locations, number of
replicas, and slave-related configuration.
It manages the filesystem namespace.
It regulates client access to files.
It assigns work to the Slaves (DataNodes).
It executes filesystem namespace operations such as
opening/closing files and renaming files/directories.
Because the NameNode keeps metadata in memory for fast
retrieval, it requires a large amount of memory for its
operation.
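The metadata bullets above can be pictured as two in-memory maps: which blocks make up each file, and which DataNodes hold each block. This is a toy sketch, not the real NameNode data structures; the class, field, and path names are made up.

```python
# Toy picture of NameNode metadata: file -> block IDs, block -> replicas.
# Structure and names are illustrative only.

class NameNodeMeta:
    def __init__(self):
        self.files = {}        # file path -> ordered list of block IDs
        self.block_locs = {}   # block ID  -> set of DataNode IDs

    def create(self, path, n_blocks):
        # Register a new file as a sequence of block IDs.
        self.files[path] = [f"{path}#blk{i}" for i in range(n_blocks)]

    def add_replica(self, block_id, datanode):
        # Record that a DataNode holds a copy of this block.
        self.block_locs.setdefault(block_id, set()).add(datanode)

    def locations(self, path):
        # What a client would ask for before reading the file.
        return {b: sorted(self.block_locs.get(b, ())) for b in self.files[path]}

meta = NameNodeMeta()
meta.create("/logs/a.txt", 2)
meta.add_replica("/logs/a.txt#blk0", "dn1")
meta.add_replica("/logs/a.txt#blk0", "dn2")
print(meta.locations("/logs/a.txt"))
# {'/logs/a.txt#blk0': ['dn1', 'dn2'], '/logs/a.txt#blk1': []}
```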
What is SecondaryNameNode?
Its name suggests that the Secondary NameNode is a backup node, but it
is not. First, a brief recap of the NameNode: the NameNode holds the
metadata for HDFS, such as block information and sizes. This information
is stored in main memory, and also on disk for persistent storage.
On disk, the information is stored in two different files:
EditLogs - keeps track of each and every change to HDFS.
FsImage - stores a snapshot of the filesystem. The Secondary NameNode
periodically merges the EditLogs into the FsImage, so the NameNode can
restart from a recent checkpoint instead of replaying a huge edit log.
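The EditLogs/FsImage split can be illustrated with a toy checkpoint that folds logged edits into a snapshot, which is in spirit what the Secondary NameNode does for the real on-disk files. The operation names and paths below are made up, not Hadoop's actual edit-log format.

```python
# Toy checkpoint: fold an edit log into an fsimage snapshot.
# Operations ("create", "delete", "rename") and paths are illustrative.

def checkpoint(fsimage, editlog):
    """Return a new fsimage with every logged edit applied in order."""
    image = dict(fsimage)          # start from the last snapshot
    for op, *args in editlog:
        if op == "create":
            path, size = args
            image[path] = size
        elif op == "delete":
            image.pop(args[0], None)
        elif op == "rename":
            old, new = args
            image[new] = image.pop(old)
    return image

fsimage = {"/a": 100}
editlog = [("create", "/b", 50), ("rename", "/a", "/c"), ("delete", "/b")]
print(checkpoint(fsimage, editlog))  # {'/c': 100}
```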
What is DataNode?
The DataNode is also known as the Slave node.
In the Hadoop HDFS architecture, DataNodes store
the actual data in HDFS.
DataNodes are responsible for serving read and write
requests from clients.
DataNodes can be deployed on commodity hardware.
DataNodes send information to the NameNode
about the files and blocks stored on the node and
respond to the NameNode for all filesystem
operations.
When a DataNode starts up, it announces itself to
the NameNode along with the list of blocks it is
responsible for.
A DataNode is usually configured with a lot of hard
disk space, because the actual data is stored on
the DataNode.
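The startup announcement described above can be sketched as block reports that the NameNode folds into its block-to-location map. This is a toy simulation; the message shape and all IDs are made up.

```python
# Toy registration: the NameNode builds its block -> locations map from
# the block reports DataNodes send at startup. Names are illustrative.

def register_reports(reports):
    """reports: list of (datanode_id, [block ids]) announcements.
    Returns a map of block ID -> set of DataNodes holding it."""
    block_map = {}
    for node, blocks in reports:
        for b in blocks:
            block_map.setdefault(b, set()).add(node)
    return block_map

# Two DataNodes announce themselves; blk2 is replicated on both.
reports = [("dn1", ["blk1", "blk2"]), ("dn2", ["blk2", "blk3"])]
print(register_reports(reports))
```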
What is HDFS?
HDFS (Hadoop Distributed File System) is a distributed file system that allows
many files to be stored and retrieved in parallel across a cluster of machines.
It is one of the basic components of the Hadoop framework.
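As a rough sketch of how HDFS stores a file: the file is chopped into fixed-size blocks (128 MB is the common default block size) that are then spread and replicated across DataNodes. The arithmetic below is illustrative only.

```python
# Back-of-the-envelope: how a file is split into HDFS blocks.
# 128 MB is the usual default block size; the file size is illustrative.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size):
    """Return (number of blocks, size of the last, possibly partial, block)."""
    n = max(1, math.ceil(file_size / BLOCK_SIZE))
    last = file_size - (n - 1) * BLOCK_SIZE
    return n, last

n, last = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
print(n, last // (1024 * 1024))  # 3 blocks; the last block is 44 MB
```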
Big Data Hadoop Real-Life Use Cases:
1. Healthcare
2. Wildlife
3. Retail Industry
4. Income Tax (scrutinizing bank accounts)
5. Fraud Detection
6. Sentiment Analysis
7. Network Security
8. Education, etc.
Why Hadoop?
1. Ability to store and process huge amounts of any kind of data, quickly.
2. Computing model that processes big data fast.
3. Fault tolerance.
4. Flexibility.
5. Low cost.
6. Scalability:
Vertical scaling doesn’t cut it: disk seek times, hardware failures, and processing times all become bottlenecks.
Horizontal scaling is linear.
7. It’s not just for batch processing anymore.
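The "horizontal scaling is linear" point is back-of-the-envelope arithmetic: a full scan is limited by disk throughput, so spreading the data over N disks divides the scan time by roughly N. The 100 MB/s per-disk figure below is an assumption for illustration, not a benchmark.

```python
# Rough arithmetic behind horizontal scaling: reading 1 TB sequentially,
# alone vs. spread over 100 disks. Throughput figure is an assumption.

TB = 10**12
DISK_MB_PER_S = 100  # assumed sequential read speed per disk

def scan_seconds(data_bytes, disks):
    """Time to scan the data if it is split evenly across `disks` disks."""
    per_disk = data_bytes / disks
    return per_disk / (DISK_MB_PER_S * 10**6)

print(round(scan_seconds(TB, 1)))    # 10000 s (~2.8 hours) on one disk
print(round(scan_seconds(TB, 100)))  # 100 s across 100 disks
```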
Hadoop Timeline
• Google published the GFS and MapReduce papers in 2003-2004.
• At the same time, Doug Cutting and Mike Cafarella were building “Nutch”, an open source web search engine.
• Hadoop was split out of Nutch in 2006, primarily driven by Doug Cutting, who joined Yahoo! that year.
• It has been evolving ever since.
What is BIG-DATA?
Big data is a term that describes the
large volume of data – both
structured and unstructured – that
inundates a business on a day-to-day
basis. But it’s not the amount of data
that’s important. It’s what
organizations do with the data that
matters. Big data can be analyzed for
insights that lead to better decisions
and strategic business moves.
Big Data: Current Considerations
Volume. Organizations collect data from a variety of sources, including business transactions, social media
and information from sensor or machine-to-machine data.
Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags,
sensors and smart metering are driving the need to deal with torrents of data in near-real time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to
unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Variability. In addition to the increasing velocities and varieties of data, data flows can be highly
inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered
peak data loads can be challenging to manage. Even more so with unstructured data.
Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and
transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies
and multiple data linkages or your data can quickly spiral out of control.
What is MapReduce?
MapReduce is a programming
model, or pattern, within the
Hadoop framework that is used to
access big data stored in the
Hadoop Distributed File System
(HDFS). It is a core component,
integral to the functioning of the
Hadoop framework.
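The model is easiest to see in the classic word-count example: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums each group. This is a pure-Python simulation of the three phases, not Hadoop's actual API.

```python
# In-process word count following the MapReduce pattern.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce functions run on many nodes and the shuffle moves data across the network, but the contract is the same.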