Big Data
Tech Stack
Big Data 2015
by Abdullah Cetin CAVDAR
Me :)
Graduated from
@HU
PhD Student
@METU
Ex Entrepreneur
I had 3 start-ups
Senior Software
Engineer
@Udemy
Founder and Organizer of
meetup.com/ankara-big-data-meetup
What's Big Data
Big data is data that exceeds the processing capacity
of conventional database systems.
What's Big Data
Big data is when the data itself becomes part of the
problem.
4V's of Big Data
Multitude of Data
Types
Structured
Semi-structured
Unstructured
Data Data Data
What Do We Need?
Store
Join
Index
Analytics
Aggregate
Visualize
Challenge
The challenge in big data analytics is to
dig deeply
quickly (real time?)
and widely
"ilities" or NFR?
Availability
Scalability
Security
Performance
...
Solution?
Big Data Tech
Stack
What are the essential components?
Data Sources
Multiple internal
& external
data sources
They create a
data lake
Different
Volume, Variety,
Velocity
The aim is to create
a funnel after
proper validation
and cleaning
Ingestion Layer
Signal-to-Noise
ratio
10:90
Separate the
noise from the
relevant information
It has the capability to:
Validate
Cleanse
Transform
Reduce
Integrate
Distributed
Storage Layer
Fault tolerance
Parallelization
HDFS
massively scalable distributed
file system
HDFS
HDFS Architecture
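A minimal sketch of how data typically lands in HDFS, driven from Python through the standard `hdfs dfs` CLI. It assumes a configured Hadoop client on the PATH; the file name and HDFS directory are illustrative.

```python
import subprocess

# Minimal sketch: copy a local file into HDFS and list the target directory
# via the standard `hdfs dfs` CLI. Paths below are illustrative.
local_file = "events.log"          # hypothetical local file
hdfs_dir = "/data/raw/events"      # hypothetical HDFS directory

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```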
Non-relational,
distributed data?
NoSQL
CAP theorem
Consistency, Availability,
Partition Tolerance
Under a network partition, a distributed store must trade consistency against availability
Ingestion to DFS
Sqoop, Flume, MapReduce, ETL
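A minimal sketch of RDBMS-to-HDFS ingestion with Sqoop, invoked from Python. The JDBC URL, table name, target directory, and mapper count are illustrative, and credentials would normally come from a password file rather than the command line.

```python
import subprocess

# Minimal sketch: import a relational table into HDFS with Sqoop.
# JDBC URL, table and target directory are illustrative; in practice
# pass credentials via --password-file instead of inline arguments.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",   # hypothetical source database
    "--username", "etl_user",                    # hypothetical user
    "--table", "orders",                         # hypothetical table
    "--target-dir", "/data/raw/orders",          # hypothetical HDFS path
    "--num-mappers", "4",                        # parallel map tasks
], check=True)
```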
Infrastructure &
Platform Layer
Computing &
Scalability
Hadoop?
Vertical Scaling
Horizontal Scaling
MapReduce
is the main computation paradigm
MapReduce
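A classic illustration of the paradigm, as a hedged sketch: a word-count mapper/reducer pair in the Hadoop Streaming style, where the mapper emits (word, 1) pairs and the reducer sums counts per key, relying on the framework to sort by key between the two phases.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: emit (word, 1) for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: sum counts per word
# (input arrives grouped and sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

These two scripts would typically be wired together with the Hadoop Streaming jar (roughly `hadoop jar hadoop-streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input ... -output ...`).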
Hadoop 2
What's new?
Hadoop 1 vs. Hadoop 2
One cluster,
distributed storage,
distributed scheduler,
many types of applications.
Blueprints
NoSQL with HBase
Stream Processing with Storm/Spark
Graph Processing with Giraph
SQL on Hadoop with Impala
Columnar Data Formats
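For the "NoSQL with HBase" blueprint above, a minimal sketch using the happybase client over the HBase Thrift gateway. The host, table name, column family, and row key are assumptions, and the table is assumed to already exist.

```python
import happybase

# Minimal sketch: write and read a row in HBase via the Thrift gateway.
# Host, table name, column family and row key are illustrative;
# the table with column family 'cf' is assumed to exist.
connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
table = connection.table("page_views")                   # hypothetical table

# Row key + column-family:qualifier -> value (HBase stores raw bytes)
table.put(b"user42|2015-01-01", {b"cf:url": b"/home", b"cf:ms": b"120"})

row = table.row(b"user42|2015-01-01")
print(row[b"cf:url"])
connection.close()
```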
Security Layer
Data needs to be protected
Meet compliance requirements
Protect individuals' privacy
Proper authorization and
authentication are needed
What can we do?
Use an authentication protocol like Kerberos
Enable file-layer encryption
Use SSL, certificates, and trusted keys (see the sketch below)
Provision with tools like Chef, Puppet, or Ansible
Log all communication to detect anomalies
Monitor the whole system
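One small illustration of the SSL/certificates point from the list above: a client call that verifies the server against a trusted CA bundle and presents a client certificate for mutual TLS. The endpoint URL and certificate paths are placeholders.

```python
import requests

# Minimal sketch: call an internal big-data service over TLS.
# The URL and certificate paths are placeholders.
response = requests.get(
    "https://namenode.internal:50470/jmx",   # hypothetical HTTPS endpoint
    verify="/etc/security/ca-bundle.pem",    # trust only this CA bundle
    cert=("/etc/security/client.crt",        # client certificate for mutual TLS
          "/etc/security/client.key"),
    timeout=5,
)
response.raise_for_status()
```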
Monitoring Layer
Get a complete
picture
of our Big Data tech stack
Satisfy SLAs with
minimal downtime
DataDog
New Relic (Overview)
New Relic (Databases)
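Beyond the hosted dashboards shown above, custom metrics can be pushed from jobs in the stack. A minimal sketch with the datadog Python library, sending metrics to a local DogStatsD agent; the metric names and tags are illustrative.

```python
from datadog import initialize, statsd

# Minimal sketch: emit custom metrics to a local DogStatsD agent.
# Metric names and tags are illustrative.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

statsd.increment("ingest.records.processed", tags=["layer:ingestion"])
statsd.gauge("hdfs.capacity.used_pct", 63.5, tags=["cluster:prod"])
statsd.histogram("job.duration_seconds", 412, tags=["job:daily_aggregation"])
```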
Analytics Engine
Coexistence
with Traditional
BI
Data warehouse in the traditional way
Distributed MR processing on big data stores
Mediate data in either direction
e.g. use Hive/HBase with Sqoop (see the sketch below)
Real-time analysis can leverage
low-latency NoSQL stores
e.g. Cassandra, Vertica, ...
R may be used for complex
statistical algorithms
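A minimal sketch of the Hive side of that mediation: querying a Hive table from Python with PyHive, which gives BI-style SQL access on top of the distributed store. The host, database, and table names are assumptions.

```python
from pyhive import hive

# Minimal sketch: run a SQL aggregation against a Hive table.
# Host, database and table names are illustrative.
conn = hive.connect(host="hive-server", port=10000, database="warehouse")
cursor = conn.cursor()
cursor.execute(
    "SELECT country, COUNT(*) AS order_count FROM orders GROUP BY country"
)
for country, order_count in cursor.fetchall():
    print(country, order_count)
conn.close()
```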
Search Engines
Huge volume and
variety of data
“needle in a
haystack”
Need a blazing-fast search mechanism
to index and search big data for analytics
Elasticsearch,
Solr, ...
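A minimal sketch of the "needle in a haystack" idea with the Elasticsearch Python client (7.x-style API): index a document, then run a full-text query against it. The cluster address, index name, and fields are illustrative.

```python
from elasticsearch import Elasticsearch

# Minimal sketch: index a document and run a full-text query (7.x-style client).
# Cluster address, index name and fields are illustrative.
es = Elasticsearch(["http://localhost:9200"])

es.index(index="logs", body={"service": "ingest", "message": "checksum mismatch"})
es.indices.refresh(index="logs")

hits = es.search(index="logs", body={"query": {"match": {"message": "checksum"}}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["message"])
```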
Real-time
Processing
In memory?
Apache Spark
Storm, Kinesis,
Flink, ...
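A minimal sketch of real-time processing with Spark Streaming: word counts over micro-batches read from a socket source. The host, port, and batch interval are illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal sketch: word counts over 5-second micro-batches from a socket source.
# Host, port and batch interval are illustrative.
sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```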
Visualization
Layer
Gain insight faster
Look at different aspects of
data visually
Tableau
ChartIO
Lambda
Architecture
Lambda Architecture / MapR
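The Lambda Architecture keeps a batch layer (periodically recomputed, complete but stale views), a speed layer (incremental views over recent data), and a serving layer that merges the two at query time. A purely illustrative sketch of that query-time merge, with made-up keys and counts:

```python
# Purely illustrative sketch of the Lambda Architecture's query-time merge:
# the batch view is recomputed periodically over the full data set,
# while the real-time view covers only events since the last batch run.

batch_view = {"page:/home": 10432, "page:/pricing": 1250}   # hypothetical precomputed counts
realtime_view = {"page:/home": 57, "page:/signup": 3}       # hypothetical recent increments

def query(key):
    """Serving layer: combine the stale-but-complete batch view
    with the fresh-but-partial real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page:/home"))     # 10489
print(query("page:/signup"))   # 3
```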
Don't forget
There is no
"One Size Fits All"
solution
We need
Continuous
Development
Thank You :)
