Big Data Logging Pipeline with Apache Spark and Kafka

Shipping YaaS logs with Apache Spark and Kafka
Dogukan Sonmez
Senior Software Engineer @hybris Software
@dogukansonmez

Agenda
- Introduction to YaaS
- Architecture of Logging pipeline
- Technology behind logging pipeline
- Challenges
- Recap
- Q&A

SAP hybris as a Service (YaaS)
A micro-service based Business PaaS
Integrated with hybris and SAP Solutions
Build
Publish
Fast

Architecture of Logging pipeline

Technology behind logging pipeline: Apache Kafka
High Throughput messaging
Broker
Distributed
Scalable
Fault Tolerant
Topic
Partition
Replicated
Offset
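
The topic, partition and offset terms above can be made concrete with a toy in-memory model. This is only a sketch of the idea, not how Kafka is implemented: `ToyTopic` is a hypothetical class, and CRC32 is an assumed stand-in for the hash a real Kafka client uses to pick a partition.

```python
import zlib
from typing import List, Tuple


class ToyTopic:
    """In-memory model of a topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str) -> Tuple[int, int]:
        """Append value to the partition chosen by key; return (partition, offset)."""
        # Assumption: CRC32 stands in for the hash used by a real Kafka client.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        # The offset is simply the record's position in its partition.
        return partition, len(log) - 1

    def read(self, partition: int, offset: int) -> List[str]:
        """Return every record in a partition from the given offset onwards."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.send("service-a", "GET /products 200")
p2, o2 = topic.send("service-a", "GET /cart 500")
```

Records with the same key land in the same partition and receive increasing offsets, which is what lets a consumer resume from a stored offset.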

Technology behind logging pipeline: Apache Spark
Micro Batching RDD
Streaming
DAG
Reliable
ML
Scalable
Graph
Fast

Big Data pipeline challenges
Reliability of Kafka
BEFORE
- 3 brokers
- 3 ZooKeeper instances
- default.replication.factor=2
- Mostly default configurations
AFTER
- 5 brokers
- 5 ZooKeeper instances
- unclean.leader.election.enable=false
- min.insync.replicas=2
- default.replication.factor=3
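
The effect of the before and after settings above can be checked with a little arithmetic. The sketch below assumes producers send with `acks=all`, and assumes the "before" cluster ran with Kafka's default `min.insync.replicas=1` (the slide only says the configuration was mostly default). Both function names are hypothetical.

```python
def tolerated_replica_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas of one partition that may fail while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return replication_factor - min_insync_replicas


def guaranteed_copies(min_insync_replicas: int) -> int:
    """Minimum number of replicas holding a write once it is acknowledged."""
    return min_insync_replicas


# BEFORE: replication factor 2, assumed default min.insync.replicas=1
before = (tolerated_replica_failures(2, 1), guaranteed_copies(1))

# AFTER: replication factor 3, min.insync.replicas=2
after = (tolerated_replica_failures(3, 2), guaranteed_copies(2))

print("before:", before)
print("after: ", after)
```

Under these assumptions both setups keep accepting writes with one replica down, but only the "after" setup guarantees that an acknowledged write exists on at least two replicas. Setting `unclean.leader.election.enable=false` additionally stops a replica that has fallen out of sync from being elected leader, which would silently drop the records it never received.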

Spark Streaming Checkpointing
BEFORE
- Spark checkpointing
- All RDDs serialized and stored in HDFS
AFTER
- Custom Kafka checkpointing (only the latest offset stored in Kafka)
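
The custom checkpointing idea, storing only the latest offset instead of serialized RDDs, can be sketched as a "process first, commit afterwards" loop. This is a minimal illustration and not the code used in the talk: `OffsetStore` is a hypothetical in-memory stand-in for the place where offsets are persisted (Kafka, per the slide), and `run_batch` is a hypothetical helper.

```python
from typing import Callable, Dict, List, Tuple

TopicPartition = Tuple[str, int]


class OffsetStore:
    """Hypothetical stand-in for the durable offset store (Kafka in the talk)."""

    def __init__(self) -> None:
        self._next_offset: Dict[TopicPartition, int] = {}

    def load(self, tp: TopicPartition) -> int:
        # Start from offset 0 when nothing has been committed yet.
        return self._next_offset.get(tp, 0)

    def commit(self, tp: TopicPartition, next_offset: int) -> None:
        self._next_offset[tp] = next_offset


def run_batch(
    log: List[str],
    tp: TopicPartition,
    store: OffsetStore,
    sink: Callable[[List[str]], None],
    max_records: int,
) -> int:
    """Process one micro-batch; commit the offset only after the sink succeeds."""
    start = store.load(tp)
    batch = log[start:start + max_records]
    if not batch:
        return 0
    sink(batch)  # may raise; the stored offset is then left untouched
    store.commit(tp, start + len(batch))
    return len(batch)


log = ["a", "b", "c", "d", "e"]
tp = ("yaas-logs", 0)  # hypothetical topic name and partition
store = OffsetStore()
delivered: List[str] = []
processed = run_batch(log, tp, store, delivered.extend, max_records=2)
```

Because the offset moves only after the sink call returns, a crash between the two steps replays the batch on restart. That gives at-least-once delivery, so the sink should tolerate duplicates.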

Elasticsearch indexing big data
BEFORE
- Default mapping
- index.refresh_interval = 1s
- indices.memory.index_buffer_size = 10%
AFTER
- Custom mapping with disabled norms
- Mapping using simple analyzer
- index.refresh_interval = 30s
- indices.memory.index_buffer_size = 30%
- spark.streaming.kafka.maxRatePerPartition=10000
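
As a rough sketch, the "after" tuning could look like the index-creation body below. It assumes a recent Elasticsearch version with typeless mappings, where a text field accepts `"norms": false`; older versions, such as those available when this talk was given, use a different mapping syntax. The field name `message`, the partition count and the batch interval are hypothetical. `indices.memory.index_buffer_size` does not appear here because it is a node-level setting that belongs in `elasticsearch.yml`, not in the index body.

```python
import json


def max_records_per_batch(
    max_rate_per_partition: int, num_partitions: int, batch_interval_seconds: int
) -> int:
    """Upper bound on records one Spark Streaming batch reads from Kafka.

    spark.streaming.kafka.maxRatePerPartition is a per-partition, per-second cap.
    """
    return max_rate_per_partition * num_partitions * batch_interval_seconds


# Body for an index-creation request (assumed syntax, see the note above).
index_body = {
    "settings": {"index": {"refresh_interval": "30s"}},
    "mappings": {
        "properties": {
            # "message" is a hypothetical log field.
            "message": {"type": "text", "analyzer": "simple", "norms": False}
        }
    },
}

print(json.dumps(index_body, indent=2))

# Hypothetical deployment: 4 Kafka partitions, 5-second batches.
batch_cap = max_records_per_batch(10000, num_partitions=4, batch_interval_seconds=5)
print(batch_cap)
```

A longer refresh interval and disabled norms reduce the work Elasticsearch does per indexed document, while the rate cap bounds how many records a single batch can push at the cluster.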
