
Improve eventlogging replication procedure
Closed, ResolvedPublic5 Estimated Story Points

Description

Eventlogging databases (m4 shard): db1046 (m4-master), db1047 (analytics slave 1), dbstore1002 (analytics slave 2), and dbstore2002 (Dallas backup) use a custom replication mechanism for several reasons:

  • Regular mysql replication is too slow and unsuitable for large batches of data
  • Purging is inefficient over the network
  • Especially over WAN, things get very slow
  • If replication stops, it is almost impossible to get them back in sync again
  • Analytics slaves are IO-saturated due to the large number of long-running queries, combined with having data from 8+ shards on a single physical machine (needed to run JOINs)

The current solution is a script (https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/files/mariadb/eventlogging_sync.sh) that lacks several improvements it could have (see the rough sketch after this list), namely:

  • parallel replication of several tables at the same time
  • import and export using LOAD DATA, which is faster than parsing SQL commands
  • Using a 3rd server to offload the process, minimizing the time spent on the mysql servers
  • Using actual temporary files for batches, instead of OS unnamed pipes, which eventually fail because they lock the master for too long
  • Configurable purging
  • Monitoring of the process
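
A rough, hypothetical sketch of what such a sync could look like (this is not the actual eventlogging_sync.sh; host, database and table names are invented, credentials are assumed to live in ~/.my.cnf, and NULL/escaping handling is omitted for brevity):

```bash
#!/bin/bash
# Hypothetical per-table sync run from a third host, with real temporary
# files and parallel workers. All names below are made up for illustration.
MASTER="m4-master.example.org"
SLAVE="analytics-slave.example.org"
DB="log"
TABLES="SchemaA_12345 SchemaB_67890"
TMPDIR="$(mktemp -d)"   # actual temporary files instead of unnamed pipes

sync_table() {
    local t="$1"
    # Export in batch mode (tab-separated rows) to a local file, so the
    # master is only busy for the duration of the SELECT.
    mysql -h "$MASTER" --batch --skip-column-names "$DB" \
        -e "SELECT * FROM $t" > "$TMPDIR/$t.tsv"
    # Import with LOAD DATA instead of parsing one INSERT per row.
    mysql -h "$SLAVE" --local-infile=1 "$DB" \
        -e "LOAD DATA LOCAL INFILE '$TMPDIR/$t.tsv' INTO TABLE $t"
}

# Replicate several tables at the same time.
for t in $TABLES; do
    sync_table "$t" &
done
wait
rm -rf "$TMPDIR"
```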

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: Analytics, SRE, DBA.
jcrespo subscribed.
Milimetric triaged this task as Medium priority. Jan 21 2016, 6:25 PM
Milimetric set Security to None.
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.

@Marostegui ok! So the T125135 auto-increment thing is a very small piece of this larger issue.

Let's see if we can hammer out a way to use regular MySQL replication before we think about making eventlogging_sync.sh better.

Regular mysql replication is too slow and unsuitable for large batches of data

Is the issue large batch inserts, or is the issue just too many inserts? We can do either. IIRC, EventLogging was optimized to do large batch inserts to make things easier on the master. There's no reason we couldn't revert to doing individual inserts, or smaller batch inserts, if that would help. Would it?

@Marostegui ok! So the T125135 auto-increment thing is a very small piece of this larger issue.

Let's see if we can hammer out a way to use regular MySQL replication before we think about making eventlogging_sync.sh better.

It would be great :-)

Regular mysql replication is too slow and unsuitable for large batches of data

Is the issue large batch inserts, or is the issue just too many inserts? We can do either. IIRC, EventLogging was optimized to do large batch inserts to make things easier on the master. There's no reason we couldn't revert to doing individual inserts, or smaller batch inserts, if that would help. Would it?

The issue is normally large batches of inserts (not sure if you also do DELETEs, but those are even harder for MySQL; big DELETEs normally cause replication lag). Normally a high number of INSERTs is fine (as the master will probably have it all in memory), but depending on what "high number" means it could become a problem; I guess we are not at that point just yet.

As Jaime mentions in the original post, using LOAD DATA can also be a benefit here for inserting large amounts of data.
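
To illustrate the point above about big DELETEs causing lag, purges are usually chunked so each statement stays small; a hypothetical sketch (host, database, table, column and retention values are invented):

```bash
#!/bin/bash
# Delete expired rows in small chunks so each statement replicates quickly,
# instead of one huge DELETE that lags the slaves.
HOST="m4-master.example.org"
DB="log"
TABLE="SchemaA_12345"

while true; do
    deleted=$(mysql -h "$HOST" --batch --skip-column-names "$DB" -e "
        DELETE FROM $TABLE WHERE timestamp < NOW() - INTERVAL 90 DAY LIMIT 1000;
        SELECT ROW_COUNT();")
    [ "$deleted" -eq 0 ] && break
    sleep 1   # give replication a chance to catch up between chunks
done
```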

@mforns, can you comment about large DELETEs? Do they happen often? How large are they when they happen?

@Marostegui, would LOAD DATA actually help replication?

@Marostegui, would LOAD DATA actually help replication?

If you need to do massive data imports into the DB, it will help: if you have the file ordered by PK, it can be loaded into the DB a lot faster than parsing all the SQL commands (as Jaime also stated in the original post).

EventLogging is a stream of data. We can do batching because the data is consumed from Kafka, and then inserted into MySQL via a python MySQL client. So we could consume periodically, or wait until N messages are consumed before inserting. Constructing a file and ordering by primary key (I'm not sure what primary key would be, other than an auto-increment id) would be a little hacky. But I'm confused. LOAD DATA is inherently batch, right? Wouldn't that hinder the replication process?

LOAD DATA is a lot faster for bulk-loading lots of data into the DB; there is a lot less overhead from parsing SQL statements and all the processing around that parsing.

This is an interesting blog post from @jcrespo where you can see some performance benchmarks: https://dbahire.com/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/
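
For illustration only (invented host and table names, and an auto-increment `id` column assumed), exporting ordered by the primary key and bulk-loading on the slave could look like:

```bash
# Export a table ordered by its primary key into a tab-separated file.
mysql -h m4-master.example.org --batch --skip-column-names log \
    -e "SELECT * FROM SchemaA_12345 ORDER BY id" > /tmp/SchemaA_12345.tsv

# Bulk-load it; IGNORE skips rows whose primary key already exists on the slave.
mysql -h analytics-slave.example.org --local-infile=1 log \
    -e "LOAD DATA LOCAL INFILE '/tmp/SchemaA_12345.tsv' IGNORE INTO TABLE SchemaA_12345"
```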

@Ottomata: we do not delete data from eventlogging (other than the purging that should happen after 90 days); the system just inserts batches of records. cc @mforns

purging that should happen after 90 days

How do you implement purging? That surely must run deletes or some kind of updates?

Nuria edited projects, added Analytics-Kanban; removed Analytics.

Let's take advantage of the fact that after the rename we now have auto-increment ids on new tables.
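
For example, with an auto-increment id the sync only needs to copy the rows the slave does not have yet; a hypothetical sketch (host, database and table names are invented, and `id` is assumed to be the auto-increment primary key):

```bash
#!/bin/bash
# Incremental copy built on the auto-increment ids.
MASTER="m4-master.example.org"
SLAVE="analytics-slave.example.org"
DB="log"
TABLE="SchemaA_12345"

# Highest id the slave already has (0 if the table is empty).
max_id=$(mysql -h "$SLAVE" --batch --skip-column-names "$DB" \
    -e "SELECT COALESCE(MAX(id), 0) FROM $TABLE")

# Copy only the rows the slave is missing, in primary-key order.
mysql -h "$MASTER" --batch --skip-column-names "$DB" \
    -e "SELECT * FROM $TABLE WHERE id > $max_id ORDER BY id" > "/tmp/$TABLE.tsv"

mysql -h "$SLAVE" --local-infile=1 "$DB" \
    -e "LOAD DATA LOCAL INFILE '/tmp/$TABLE.tsv' INTO TABLE $TABLE"
```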

Change 345646 had a related patch set uploaded (by Ottomata):
[operations/puppet@production] Improvements to eventlogging_sync.sh script

https://gerrit.wikimedia.org/r/345646

Change 345646 merged by Ottomata:
[operations/puppet@production] Improvements to eventlogging_sync.sh script

https://gerrit.wikimedia.org/r/345646

Change 346541 had a related patch set uploaded (by Ottomata):
[operations/puppet@production] Properly default to master database name when slave database not given

https://gerrit.wikimedia.org/r/346541

Change 346541 merged by Ottomata:
[operations/puppet@production] Properly default to master database name when slave database not given

https://gerrit.wikimedia.org/r/346541

Nuria set the point value for this task to 5.