Shipping YaaS logs with Apache Spark and Kafka
Dogukan Sonmez
Senior Software Engineer @hybris Software
@dogukansonmez
Agenda
- Introduction to YaaS
- Architecture of Logging pipeline
- Technology behind logging pipeline
- Challenges
- Recap
- Q&A
What is YaaS
SAP hybris as a Service (YaaS)
A micro-service based Business PaaS
Integrated with hybris and SAP Solutions
Build
Publish
Fast
yaas.io
Architecture of Logging pipeline
Technology behind logging pipeline: Apache Kafka
High Throughput messaging
Broker
Distributed
Scalable
Fault Tolerant
Topic
Partition
Replicated
Offset
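A minimal sketch of how topic, partition, and offset relate, using an in-memory toy rather than a real Kafka client. `ToyTopic` and the sample log lines are hypothetical; Kafka's default partitioner hashes keys with murmur2, and `crc32` only stands in for it here.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a Kafka topic: N append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always lands on the same partition.
        # (crc32 is a stand-in for Kafka's murmur2-based partitioner.)
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_records=10):
        """Read records from one partition, starting at the given offset."""
        return self.partitions[partition][offset:offset + max_records]


topic = ToyTopic(num_partitions=3)
first = topic.append("service-a", "GET /products 200")
second = topic.append("service-a", "GET /cart 500")
```

Records with the same key share a partition and receive increasing offsets, so a consumer only has to remember one number per partition to know where it stopped.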
Technology behind logging pipeline: Apache Spark
Micro Batching RDD
Streaming
DAG
Reliable
ML
Scalable
Graph
Fast
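A rough illustration of micro-batching in plain Python, not Spark code: incoming events are cut into fixed-width time windows and each window is processed as one small batch. `micro_batches` and the sample events are hypothetical, and unlike Spark Streaming this sketch skips intervals that received no events.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into fixed-width time batches.

    Returns a list of (batch_start, [payloads]) sorted by batch start.
    Intervals without any events are skipped in this sketch.
    """
    batches = {}
    for timestamp, payload in events:
        # Floor the timestamp to the start of its batch interval.
        batch_start = (timestamp // batch_interval) * batch_interval
        batches.setdefault(batch_start, []).append(payload)
    return sorted(batches.items())


events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (4.1, "d")]
batches = micro_batches(events, batch_interval=2)

# Each batch is then handled as one unit, here just counted.
counts = [(start, len(payloads)) for start, payloads in batches]
```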
Big Data pipeline challenges
Reliability of Kafka
BEFORE
- 3 Brokers
- 3 ZooKeeper instances
- default.replication.factor=2
- Mainly with Default Configurations
AFTER
- 5 Brokers
- 5 ZooKeeper instances
- unclean.leader.election.enable=false
- min.insync.replicas=2
- default.replication.factor=3
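To see what the replication settings buy, here is a small sketch that classifies a single partition after some of its replicas are lost. `partition_status` is a hypothetical helper; it assumes producers use acks=all (the only case where min.insync.replicas applies) and that every surviving replica is in sync. The BEFORE case assumes Kafka's default of min.insync.replicas=1.

```python
def partition_status(replication_factor, min_insync_replicas, failed_replicas):
    """Classify one partition after some of its replicas are lost.

    Assumes acks=all producers and that all surviving replicas are in sync.
    """
    alive = replication_factor - failed_replicas
    if alive <= 0:
        return "offline"      # no copy of the data is reachable
    if alive < min_insync_replicas:
        return "read-only"    # acks=all writes are rejected
    return "writable"


# BEFORE: replication factor 2, min.insync.replicas left at its default of 1.
before = [partition_status(2, 1, failed) for failed in range(3)]

# AFTER: replication factor 3, min.insync.replicas=2.
after = [partition_status(3, 2, failed) for failed in range(4)]
```

With the AFTER settings an acknowledged write always exists on at least two replicas, and unclean.leader.election.enable=false keeps an out-of-sync replica from becoming leader, trading some availability for not losing acknowledged records.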
Big Data pipeline challenges
Spark Streaming Checkpointing
BEFORE
- Spark checkpointing
- All RDDs serialized and stored in HDFS
AFTER
- Custom Kafka checkpointing (only the latest offset stored in Kafka)
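The offset-only idea can be sketched without Spark or Kafka: after each batch is written out, persist just the next offset to read, and let a restarted job resume from that number. `OffsetStore`, `run_batch`, and the in-memory `log` are hypothetical stand-ins, not the actual implementation behind these slides.

```python
class OffsetStore:
    """Hypothetical stand-in for wherever the latest offset is persisted."""

    def __init__(self):
        self._offsets = {}

    def load(self, partition):
        # Start from offset 0 when nothing has been committed yet.
        return self._offsets.get(partition, 0)

    def save(self, partition, next_offset):
        self._offsets[partition] = next_offset


def run_batch(log, partition, store, sink, max_records):
    """Process one batch, then checkpoint only the next offset to read."""
    start = store.load(partition)
    records = log[start:start + max_records]
    sink.extend(records)                         # output step, e.g. indexing
    store.save(partition, start + len(records))  # commit after output succeeds
    return len(records)


log = ["line-%d" % i for i in range(5)]
store, sink = OffsetStore(), []

run_batch(log, 0, store, sink, max_records=3)  # handles offsets 0..2
# A restarted job resumes from the stored offset; no RDD state is restored.
run_batch(log, 0, store, sink, max_records=3)  # handles offsets 3..4
```

Because the offset is saved after the output step, a crash between the two replays the batch, so this sketch gives at-least-once delivery and the sink should tolerate duplicates.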
Big Data pipeline challenges
Elasticsearch indexing big data
BEFORE
- Default mapping
- index.refresh_interval = 1s
- indices.memory.index_buffer_size = 10%
AFTER
- Custom mapping with disabled norms
- Mapping using simple analyzer
- index.refresh_interval = 30s
- indices.memory.index_buffer_size = 30%
- spark.streaming.kafka.maxRatePerPartition=10000
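A hedged sketch of the AFTER settings as data. `index_body` builds an index-creation body in the newer typeless mapping syntax, which is an assumption: older Elasticsearch versions nest mappings under a type name and spell the norms option differently. The field name `message`, the partition count, and the batch interval are hypothetical. indices.memory.index_buffer_size is a node-level setting in elasticsearch.yml, so it is not part of the body.

```python
import json


def max_records_per_batch(max_rate_per_partition, num_partitions, batch_seconds):
    """Upper bound on records Spark pulls from Kafka in one micro-batch.

    spark.streaming.kafka.maxRatePerPartition is a per-second,
    per-partition limit, so the cap scales with both other factors.
    """
    return max_rate_per_partition * num_partitions * batch_seconds


def index_body(field_name):
    """Index-creation body with a slower refresh and a lean text field."""
    return {
        "settings": {"index": {"refresh_interval": "30s"}},
        "mappings": {
            "properties": {
                field_name: {
                    "type": "text",
                    "analyzer": "simple",  # cheaper than the default analyzer
                    "norms": False,        # skip norms used for scoring
                }
            }
        },
    }


# Hypothetical sizing: 6 Kafka partitions and a 5 second batch interval.
cap = max_records_per_batch(10000, num_partitions=6, batch_seconds=5)
body_json = json.dumps(index_body("message"), indent=2)
```

Capping the intake rate keeps one oversized batch from flooding Elasticsearch, while the longer refresh interval and leaner mapping reduce the work done per indexed log line.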
Recap
Q&A
https://hackingat.hybris.com

Big Data Logging Pipeline with Apache Spark and Kafka
