Yannick Dawant & Vinh Nguyen
MovingfromMySQLto
ElasticsearchforAnalytics
— What is Analytics, and why is it important to Percolate?
— Analytics 1.0 - MySQL
— Analytics 2.0 - Elasticsearch
— Next Steps
Agenda
TheSystemofRecordforMarketing
WhatdoesAnalyticsmeanto
Percolate?

Howdoesitwork?
Analytics1.0-Design
Crawlers MySQL
API
UI
Facebook
Twitter
Instagram
LinkedIn
[…]
metrics
MySQLDataModel
post_id service_id tag created_at
1 1 blog 2016-01-01 10:11:15
2 1 blog, video 2016-01-01 12:12:30
3 2 election 2016 2016-01-01 10:10:57
metric_id service_id name
1 1 likes
2 1 comments
3 1 follows
4 2 follows
5 2 mentions
6 2 retweets
post_id metric_id metric_value captured_at
1 1 10 2016-01-01 10:11:15
1 1 20 2016-01-01 12:12:30
2 2 5 2016-01-01 10:10:57
2 2 10 2016-01-01 13:12:20
3 1 15 2016-01-01 13:12:45
3 2 30 2016-01-01 17:05:11
[post]
service_id name
1 facebook
2 twitter
3 instagram
[service]
[post_metrics] [metric_names]
— Relational data models
— Very well known pattern
— Application-level objects map cleanly to DB tables
— Joins are easy to do
— Easy to use
— Amazon RDS for managed hosting/deployment/monitoring
— Very familiar to Ops team and other developers, shared knowledge base
— Lots of support available online
— Met product requirements
WhyMySQL?
Seemsreasonable.

Whatarethetradeoffs?
— Data Modeling Issues
— Starts easy but becomes complex over time (increasing number of tables)
— Schema inflexibility (dynamic changes, unused columns)
— Hard to modify live schemas, may require downtime
— Slow Queries
— Lots of joins at query time
— Tables grow larger and larger over time
— Hard to partition Time series data
— Expensive post-processing on application side
MySQLTradeoffs
— Scalability Issues
— Database grows larger and larger over time
— Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime
— Hard to scale horizontally
— Not suitable for our Search needs
MySQLTradeoffs
Wheredowegofromhere?
Analytics1.0-Design
Crawlers MySQL
API
UI
Facebook
Twitter
Instagram
LinkedIn
[…]
metrics
Analytics2.0-Design
Crawlers Elasticsearch
API
UI
Facebook
Twitter
Instagram
LinkedIn
[…]
MySQL
Kafka Data Transformation
metrics
Data Transformation
— Decouples data collection from storage
— Enhances reliability of our data pipelines
— Message queue persistence, replay
— Enhances horizontal scalability of our data pipelines
— Multiple brokers, parallel consumers/producers
WhyKafka?
— Applies data transformation rules
— Validation, enrichment, denormalization, rollups
— Writes data to various indexes in ES
— Error handling
— Network issues, ES load/timeout issues, mapping conflicts
— Multiple workers to increase overall throughput
— Real time and asynchronous workers
DataTransformation
{

"_index" : "analytics_2016-11-01",

"_type" : "post",

"_id" : "f6065582-a2d7-11e6-bee7-22000ae51cc9",

"post_id": "19398339",
"service": "facebook",

"captured_at": "2016-10-31T20:32:17+00:00",

"metrics": {

"comments": 13,

"consumptions": 132,
“engaged": 24,
"impressions": 132,
"likes": 50,
“negative_feedback": 5,
"reach": 93,

"shares": 76
“video_views": 42

},

"tags": ["blog","video"]

}
ElasticsearchDataModel
— Document based datastore
— Flexible schemas, dynamic mapping, mapping templates
— JSON, rich data structures, nested objects
— REST APIs make integration simple
— Query performance
— Shards spread across nodes (versus entire MySQL DB/table on single node)
— Rolling indexes for Time series data == querying only the indexes needed (versus entire
MySQL table)
WhyElasticsearch?
— Search
— Rich set of built-in queries
— Powerful aggregations (and sub aggregations)
— Scalability
— More control over shards and indexes
— Horizontally scale by adding more nodes and clusters
— Easy to archive old data/indexes to free up resources
— Meets current and *new* product requirements
WhyElasticsearch?
Seemsreasonable.

Whatarethetradeoffs?
— Data updates are more complex
— Update by query, upserts, script security issues
— Not truly schema-less
— Reindexing is time consuming
— Adding fields, mapping conflicts
— Still need custom, index management layer
— Index mappings, settings, templates, naming patterns, data retention, backup/restore
— Operating ES requires effort
— Deployment, configuration, performance tuning, monitoring
ElasticsearchTradeoffs
— More index management
— Better support for different types of indexes, each with own settings
— Add APIs + Tools for operations
— Avoid oversharding, which causes cluster stability issues
— More focus on UPDATE operations
— Field updates (i.e. tags) require update by query/script
— Faster reindexing (i.e. adding new fields, changing field mappings)
— Slow updates/reindexing can affect other system operations/transactions
— Data denormalization vs joins
— More production monitoring
NextSteps
https://percolate.com/careers/
We’reHiring!

Moving From MySQL to Elasticsearch for Analytics

  • 1.
    Yannick Dawant &Vinh Nguyen MovingfromMySQLto ElasticsearchforAnalytics
  • 2.
    — What isAnalytics, and why is it important to Percolate? — Analytics 1.0 - MySQL — Analytics 2.0 - Elasticsearch — Next Steps Agenda
  • 3.
  • 4.
  • 5.
  • 6.
    MySQLDataModel post_id service_id tagcreated_at 1 1 blog 2016-01-01 10:11:15 2 1 blog, video 2016-01-01 12:12:30 3 2 election 2016 2016-01-01 10:10:57 metric_id service_id name 1 1 likes 2 1 comments 3 1 follows 4 2 follows 5 2 mentions 6 2 retweets post_id metric_id metric_value captured_at 1 1 10 2016-01-01 10:11:15 1 1 20 2016-01-01 12:12:30 2 2 5 2016-01-01 10:10:57 2 2 10 2016-01-01 13:12:20 3 1 15 2016-01-01 13:12:45 3 2 30 2016-01-01 17:05:11 [post] service_id name 1 facebook 2 twitter 3 instagram [service] [post_metrics] [metric_names]
  • 7.
    — Relational datamodels — Very well known pattern — Application-level objects map cleanly to DB tables — Joins are easy to do — Easy to use — Amazon RDS for managed hosting/deployment/monitoring — Very familiar to Ops team and other developers, shared knowledge base — Lots of support available online — Met product requirements WhyMySQL?
  • 8.
  • 9.
    — Data ModelingIssues — Starts easy but becomes complex over time (increasing number of tables) — Schema inflexibility (dynamic changes, unused columns) — Hard to modify live schemas, may require downtime — Slow Queries — Lots of joins at query time — Tables grow larger and larger over time — Hard to partition Time series data — Expensive post-processing on application side MySQLTradeoffs
  • 10.
    — Scalability Issues —Database grows larger and larger over time — Scaling is mostly vertical (add more CPU/RAM/disk to same node), may require downtime — Hard to scale horizontally — Not suitable for our Search needs MySQLTradeoffs
  • 11.
  • 12.
  • 13.
  • 14.
    — Decouples datacollection from storage — Enhances reliability of our data pipelines — Message queue persistence, replay — Enhances horizontal scalability of our data pipelines — Multiple brokers, parallel consumers/producers WhyKafka?
  • 15.
    — Applies datatransformation rules — Validation, enrichment, denormalization, rollups — Writes data to various indexes in ES — Error handling — Network issues, ES load/timeout issues, mapping conflicts — Multiple workers to increase overall throughput — Real time and asynchronous workers DataTransformation
  • 16.
    {
 "_index" : "analytics_2016-11-01",
 "_type": "post",
 "_id" : "f6065582-a2d7-11e6-bee7-22000ae51cc9",
 "post_id": "19398339", "service": "facebook",
 "captured_at": "2016-10-31T20:32:17+00:00",
 "metrics": {
 "comments": 13,
 "consumptions": 132, “engaged": 24, "impressions": 132, "likes": 50, “negative_feedback": 5, "reach": 93,
 "shares": 76 “video_views": 42
 },
 "tags": ["blog","video"]
 } ElasticsearchDataModel
  • 17.
    — Document baseddatastore — Flexible schemas, dynamic mapping, mapping templates — JSON, rich data structures, nested objects — REST APIs make integration simple — Query performance — Shards spread across nodes (versus entire MySQL DB/table on single node) — Rolling indexes for Time series data == querying only the indexes needed (versus entire MySQL table) WhyElasticsearch?
  • 18.
    — Search — Richset of built-in queries — Powerful aggregations (and sub aggregations) — Scalability — More control over shards and indexes — Horizontally scale by adding more nodes and clusters — Easy to archive old data/indexes to free up resources — Meets current and *new* product requirements WhyElasticsearch?
  • 19.
  • 20.
    — Data updatesare more complex — Update by query, upserts, script security issues — Not truly schema-less — Reindexing is time consuming — Adding fields, mapping conflicts — Still need custom, index management layer — Index mappings, settings, templates, naming patterns, data retention, backup/restore — Operating ES requires effort — Deployment, configuration, performance tuning, monitoring ElasticsearchTradeoffs
  • 21.
    — More indexmanagement — Better support for different types of indexes, each with own settings — Add APIs + Tools for operations — Avoid oversharding, which causes cluster stability issues — More focus on UPDATE operations — Field updates (i.e. tags) require update by query/script — Faster reindexing (i.e. adding new fields, changing field mappings) — Slow updates/reindexing can affect other system operations/transactions — Data denormalization vs joins — More production monitoring NextSteps
  • 23.