Migration from mysql to elasticsearch

Migration
from mysql to
elasticsearch

Our Service
User can find jobs
that fit him/her
Company can find suitable
candidates for its jobs.

Matching on Structured Data
• Both User and Job data are stored in mysql

• Records in Job table corresponds to multiple records in
child tables, so do records in User table (1:N relationship)
Job
Location Salary Others
User
Location Salary Others

Number of records
Table Number of records
User 529,683
child tableX of user 2,984,111
child tableY of user 1,966,161
… …
Table Number of records
Job 160,305
child tableA of job 2,359,512
child tableB of job 232,202
… …
For user searchFor Job search

Complicated SQL Query
SELECT parentTable.id FROM parentTable

INNER JOIN childTableD ON childTableD.id = parentTable.id

LEFT OUTER JOIN childTableA ON childTableA.id = parentTable.id AND childTableA.code = 273230450

….

INNER JOIN (

SELECT id FROM childTableB

WHERE code IN (11232) AND id IN (SELECT id FROM parentTable WHERE code = 1 )

UNION

….

SELECT id FROM childTableC

WHERE code IN (1, 2, 3) AND id IN (SELECT id FROM parentTable WHERE code = 5 )

) temp2 ON parentTable.id = temp2.id

WHERE

AND childTableA.id IS NULL

….

AND parentTable.id IN (SELECT id FROM childTableC WHERE code IN (1, 2, 3))

AND (

parentTable.id IN (SELECT id FROM childTableB WHERE code IN (11232))

OR

….

)

) ORDER BY childTableD.timestamp DESC, parentTable.updatedOn DESC LIMIT 101 OFFSET
• And SQL queries are constructed dynamically in Java.

We are struggling handling job/user search
• Lots of slow ( > 1s) queries of mysql

• Some of them are pretty slow ( > 2s )
160305

Top 20 slow queries of a day
Total Count Average Sec Query
1 209 1.4 SELECT COUNT(*) FROM tableA WHERE tableA.id IN ( SELECT tableA.id FROM tableA LEFT OUTER JOIN tableB ON sc
6 96 1.1 INSERT INTO tableC (id, lastLoginDate, rank, sortKey, createdOn, updatedOn) SELECT id, DATE(FROM_UNIXTIME(IFNUL,
8 85 1.1 INSERT INTO tableD (id, tableD, createdOn, updatedOn) SELECT id, 2, UNIX_TIMESTAMP(), UNIX_TIMESTAMP() FROM
9 84 1.1 INSERT INTO tableC (id, lastLoginDate, rank, sortKey, createdOn, updatedOn) SELECT id,DATE(FROM_UNIXTIME(IFNUL
10 82 1.1 INSERT INTO tableD (id, tableD, createdOn, updatedOn) SELECT id, 1, UNIX_TIMESTAMP(), UNIX_TIMESTAMP() FROM
16 69 3 SELECT COUNT(*) FROM tableA WHERE tableA.id IN ( SELECT tableA.id FROM tableA LEFT OUTER JOIN tableB ON sc
Most of them are related to user search

Solution for the slow query problem
• SQL tuning 
-> Done (4 diﬀerent sorted tables, bit-operation).  
-> But still slow

• Introducing elasticsearch(ES) 
-> Maybe. But ES is a full-text search engine. 
Can ES handle structured data too? Not sure 🤔

• Do an experiment before introducing ES to our project 
-> Create an ES cluster with production data in dev env 

• The result: ES query(0.5s) is faster than mysql one(2s)

The ﬁrst target
• User data characteristics

• Data are updated in real-time.

• A number of target records increases quickly  
Less than 10,000 in Jan/2017  
More than 500,000 in Aug/2017

• Job data characteristics

• Data are updated three times a day

• Data growth isn’t fast

• Data structure will change due to a business reason
Let’s migrate job search ﬁrst!

Basic Strategy
• Data can be recovered from mysql  
-> Data can be recovered by batch even if entire ES cluster is collapsed

• Try to minimize use case of ES 
-> Only focus on searching list of jobs which match the given search
condition 
-> Do not deal with trivial queries for a speciﬁc job (use mysql)

• Use high performance machine 
-> 3 physical data nodes (20CPU,32GB RAM 200GB SSD)

• All job related data stored in ES 
-> We had a option that we get only ids by ES and get necessary data
from other data source (such as mysql, redis). But we didn’t choose it
for simple implementation

Performance Tuning
• Servers in Cluster 
3 master nodes: To avoid split brain 
3 data nodes: 2 maybe enough, but for safety

• Index settings 
1 shard: 1 shard has better performance than 2 shards 
2 replicas: each data node has 1 shard

• Indexing performance setting 
indices.store.throttle.max_bytes_per_sec is set 200mb (default
20mb)

Performance Tuning
• Memory Settings

• Thread Pool Settings
Default Tuned
indices.ﬁelddata.cache.size Unbounded 10%
indices.queries.cache.size 10% 10%
indices.requests.cache.size 1% 1%
Default Tuned
thread_pool.index.queue_size 200 1,000
thread_pool.bulk.queue_size 200 1,000
thread_pool.search.queue_size 1,000 5,000

Preparation Before Migration
• Stress Test 
-> Use gatling (load and performance test tool) 
-> Use queries executed in beta 
-> High load search while indexing 
-> Check the maximum request per second of the cluster (900req/sec) 
-> Long run (more than 24 hours with load)

• HA and failure Test 
-> Search function works even when one data node is available 
-> Top page of our service is available even when entire cluster is down 
-> Circuit breaker behavior

• Monitoring and Alert 
-> Use prometheus and alertmanager 
-> Server metrics, JVM metrics, ES speciﬁc metrics such as query cache size 
-> Alert test before operation

Latency of API for job search
Before Migration (mysql)
After Migration (elasticsearch)

Positive side effects
• Dedicated full text search engine (groonga) is no longer needed 
Everything can be done in ES

• More ﬂexible and precise geo location search than geohash

• Simpler code 
-> No SQL tricks and No need to refer many tables

• Avoid constructing SQL query by string concatenation

Future Plan
• adapt ES to other implementation 
-> Job related features that we avoided adapting before  
-> User search (the main target)

• Better full text search with neologd

Migration from mysql to elasticsearch

More Related Content

What's hot

Similar to Migration from mysql to elasticsearch

Recently uploaded

Migration from mysql to elasticsearch