Moving From MySQL to Elasticsearch for Analytics

Yannick Dawant & Vinh Nguyen
MovingfromMySQLto
ElasticsearchforAnalytics

— What is Analytics, and why is it important to Percolate?
— Analytics 1.0 - MySQL
— Analytics 2.0 - Elasticsearch
— Next Steps
Agenda

WhatdoesAnalyticsmeanto
Percolate? 
Howdoesitwork?

Analytics1.0-Design
Crawlers MySQL
API
UI
Facebook
Twitter
Instagram
LinkedIn
[…]
metrics

MySQLDataModel
post_id service_id tag created_at
1 1 blog 2016-01-01 10:11:15
2 1 blog, video 2016-01-01 12:12:30
3 2 election 2016 2016-01-01 10:10:57
metric_id service_id name
1 1 likes
2 1 comments
3 1 follows
4 2 follows
5 2 mentions
6 2 retweets
post_id metric_id metric_value captured_at
1 1 10 2016-01-01 10:11:15
1 1 20 2016-01-01 12:12:30
2 2 5 2016-01-01 10:10:57
2 2 10 2016-01-01 13:12:20
3 1 15 2016-01-01 13:12:45
3 2 30 2016-01-01 17:05:11
[post]
service_id name
1 facebook
2 twitter
3 instagram
[service]
[post_metrics] [metric_names]

— Relational data models
— Very well known pattern
— Application-level objects map cleanly to DB tables
— Joins are easy to do
— Easy to use
— Amazon RDS for managed hosting/deployment/monitoring
— Very familiar to Ops team and other developers, shared knowledge base
— Lots of support available online
— Met product requirements
WhyMySQL?

Seemsreasonable. 
Whatarethetradeoffs?

— Data Modeling Issues
— Starts easy but becomes complex over time (increasing number of tables)
— Schema inflexibility (dynamic changes, unused columns)
— Hard to modify live schemas, may require downtime
— Slow Queries
— Lots of joins at query time
— Tables grow larger and larger over time
— Hard to partition Time series data
— Expensive post-processing on application side
MySQLTradeoffs

— Scalability Issues
— Database grows larger and larger over time
— Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime
— Hard to scale horizontally
— Not suitable for our Search needs
MySQLTradeoffs

Analytics2.0-Design
Crawlers Elasticsearch
API
UI
Facebook
Twitter
Instagram
LinkedIn
[…]
MySQL
Kafka Data Transformation
metrics
Data Transformation

— Decouples data collection from storage
— Enhances reliability of our data pipelines
— Message queue persistence, replay
— Enhances horizontal scalability of our data pipelines
— Multiple brokers, parallel consumers/producers
WhyKafka?

— Applies data transformation rules
— Validation, enrichment, denormalization, rollups
— Writes data to various indexes in ES
— Error handling
— Network issues, ES load/timeout issues, mapping conflicts
— Multiple workers to increase overall throughput
— Real time and asynchronous workers
DataTransformation

{ 
"_index" : "analytics_2016-11-01", 
"_type" : "post", 
"_id" : "f6065582-a2d7-11e6-bee7-22000ae51cc9", 
"post_id": "19398339",
"service": "facebook", 
"captured_at": "2016-10-31T20:32:17+00:00", 
"metrics": { 
"comments": 13, 
"consumptions": 132,
“engaged": 24,
"impressions": 132,
"likes": 50,
“negative_feedback": 5,
"reach": 93, 
"shares": 76
“video_views": 42 
}, 
"tags": ["blog","video"] 
}
ElasticsearchDataModel

— Document based datastore
— Flexible schemas, dynamic mapping, mapping templates
— JSON, rich data structures, nested objects
— REST APIs make integration simple
— Query performance
— Shards spread across nodes (versus entire MySQL DB/table on single node)
— Rolling indexes for Time series data == querying only the indexes needed (versus entire
MySQL table)
WhyElasticsearch?

— Search
— Rich set of built-in queries
— Powerful aggregations (and sub aggregations)
— Scalability
— More control over shards and indexes
— Horizontally scale by adding more nodes and clusters
— Easy to archive old data/indexes to free up resources
— Meets current and *new* product requirements
WhyElasticsearch?

— Data updates are more complex
— Update by query, upserts, script security issues
— Not truly schema-less
— Reindexing is time consuming
— Adding fields, mapping conflicts
— Still need custom, index management layer
— Index mappings, settings, templates, naming patterns, data retention, backup/restore
— Operating ES requires effort
— Deployment, configuration, performance tuning, monitoring
ElasticsearchTradeoffs

— More index management
— Better support for different types of indexes, each with own settings
— Add APIs + Tools for operations
— Avoid oversharding, which causes cluster stability issues
— More focus on UPDATE operations
— Field updates (i.e. tags) require update by query/script
— Faster reindexing (i.e. adding new fields, changing field mappings)
— Slow updates/reindexing can affect other system operations/transactions
— Data denormalization vs joins
— More production monitoring
NextSteps

https://percolate.com/careers/
We’reHiring!

Moving From MySQL to Elasticsearch for Analytics

More Related Content

Similar to Moving From MySQL to Elasticsearch for Analytics

More from Percolate

Recently uploaded

Moving From MySQL to Elasticsearch for Analytics