
Fix causes of replica lag and get it to under 5 seconds at peak
Closed, Resolved · Public

Description

Ideally lag would be in the 0-1 range.

Ideally 'max lag' could be set to 5 seconds once the lag is low enough. Doing that now would just result in lots of "DB locked" pages for users.
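
A minimal sketch of where that knob lives, assuming the standard $wgDBservers entry format (host, credentials and weights below are placeholders, not production values):

```php
<?php
// Hedged sketch only: each replica entry in $wgDBservers can carry a
// 'max lag' value (seconds). A replica whose lag exceeds it is taken out
// of read rotation; if every replica exceeds it, users hit the lagged
// database ("DB locked") error mentioned above, which is why the value
// can only be lowered to 5 once real lag is reliably below that.
$wgDBservers[] = [
	'host'     => 'db-replica-1.example.org', // placeholder host name
	'dbname'   => 'wikidb',
	'user'     => 'wikiuser',
	'password' => 'secret',
	'type'     => 'mysql',
	'load'     => 100,
	'max lag'  => 5, // seconds; the target of this task
];
```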

Details

Repo | Branch | Lines +/-
mediawiki/core | master | +45 -5
mediawiki/core | master | +33 -13
mediawiki/core | master | +160 -129
operations/mediawiki-config | master | +1 -1
mediawiki/core | master | +94 -22
operations/mediawiki-config | master | +1 -1
operations/mediawiki-config | master | +2 -2
mediawiki/core | master | +53 -7
operations/mediawiki-config | master | +3 -0
mediawiki/core | master | +37 -3
mediawiki/core | master | +20 -19
mediawiki/core | master | +305 -46
mediawiki/core | master | +5 -1
mediawiki/core | master | +48 -6
mediawiki/core | master | +20 -9
mediawiki/extensions/Renameuser | master | +142 -62
mediawiki/core | master | +3 -2
mediawiki/core | master | +1 -1
mediawiki/core | master | +45 -14
mediawiki/core | master | +10 -6
mediawiki/core | REL1_25 | +150 -174
mediawiki/core | master | +103 -13
mediawiki/core | master | +150 -174
mediawiki/core | wmf/1.26wmf1 | +1 -1
mediawiki/core | master | +1 -1

Related Objects

This task is connected to more than 200 other tasks; only direct parents and subtasks are shown here.

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 260507 merged by jenkins-bot:
Set initial $wgMaxUserDBWriteDuration value

https://gerrit.wikimedia.org/r/260507

aaron lowered the priority of this task from High to Medium. Mar 17 2016, 10:55 PM

Change 242814 abandoned by Aaron Schulz:
[WIP] Lowered "max lag" setting to 5 seconds

Reason:
conflicted

https://gerrit.wikimedia.org/r/242814

Change 275734 had a related patch set uploaded (by Aaron Schulz):
Lowered $wgMaxUserDBWriteDuration to 5

https://gerrit.wikimedia.org/r/275734

aaron closed subtask Restricted Task as Resolved. May 9 2016, 11:40 PM

When/if this succeeds (thanks aaron for working so hard on this), we could enforce (of course, we would start by only logging) that writes/transactions which write data take less than 1 second; a rough sketch of what that logging could look like is included at the end of this comment. This is an essential task for several needs:

  • Transparent failover: long-running transactions should either be killable or take less than 1 second, so that a master failover can happen without a noticeable read-only period
  • Galera and clustering in general: even if we do not use clustering, row-based replication needs small write datasets to be effective
  • Multi-datacenter and/or multi-master writes/routing: as per above

MariaDB 10 masters, semi-sync replication, parallel replication and new hardware (happening now or very soon) will also help.
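
A purely illustrative sketch of the logging-first approach described above (the threshold and the surrounding code are assumptions, not an existing MediaWiki hook):

```php
<?php
// Illustrative only: time the write portion of a request and log (not yet
// reject) anything that exceeds the proposed 1 second budget.
$maxWriteSeconds = 1.0;

$writeStart = microtime( true );
// ... the request's DB write queries / transaction run here ...
$writeSeconds = microtime( true ) - $writeStart;

if ( $writeSeconds > $maxWriteSeconds ) {
	// Start with logging; hard enforcement (rollback/rejection) can come
	// later once the log volume shows it is safe.
	wfDebugLog(
		'DBPerformance',
		sprintf( 'Write transaction took %.2fs (budget %.2fs)', $writeSeconds, $maxWriteSeconds )
	);
}
```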

Change 275734 merged by jenkins-bot:
Lowered $wgMaxUserDBWriteDuration to 5

https://gerrit.wikimedia.org/r/275734
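
In configuration terms the merged change amounts to roughly the following (the exact wmf-config file and any wrapper variables are not shown in this task):

```php
<?php
// Cap, in seconds, on how long the DB writes of a single web request may
// take; write rounds that exceed it are rejected with an error instead of
// being allowed to pile up replication lag.
$wgMaxUserDBWriteDuration = 5;
```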

Change 289992 had a related patch set uploaded (by Aaron Schulz):
[WIP] Make LinksDeletionUpdate use query batching

https://gerrit.wikimedia.org/r/289992

Change 289992 merged by jenkins-bot:
Make LinksDeletionUpdate use query batching

https://gerrit.wikimedia.org/r/289992
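
The general pattern (a sketch, not the actual LinksDeletionUpdate code; $dbw, $lbFactory and $linkIds are assumed to be in scope) is to split one big delete into small chunks and wait for replicas between them:

```php
<?php
// Delete link rows in small batches so no single statement replicates a
// huge write, and let replicas catch up between batches.
$batches = array_chunk( $linkIds, 100 ); // batch size is illustrative

foreach ( $batches as $batch ) {
	$dbw->delete( 'pagelinks', [ 'pl_from' => $batch ], __METHOD__ );
	// Block until replicas have caught up before issuing the next batch.
	$lbFactory->waitForReplication();
}
```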

Change 293674 had a related patch set uploaded (by Aaron Schulz):
Lower $wgAPIMaxLagThreshold to 5

https://gerrit.wikimedia.org/r/293674

Change 293674 merged by jenkins-bot:
Lower $wgAPIMaxLagThreshold to 5

https://gerrit.wikimedia.org/r/293674
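
For context, this is the client-side convention that the API's lag thresholds interact with: bots send maxlag=5 and back off when the API reports that replicas are too far behind (a hedged sketch; the endpoint and back-off interval here are illustrative):

```php
<?php
// Sketch of a lag-aware API client: pass maxlag=5 and retry with a delay
// whenever the API answers with the "maxlag" error code.
$url = 'https://en.wikipedia.org/w/api.php'
	. '?action=query&meta=siteinfo&format=json&maxlag=5';

do {
	$data = json_decode( file_get_contents( $url ), true );
	$lagged = isset( $data['error']['code'] ) && $data['error']['code'] === 'maxlag';
	if ( $lagged ) {
		sleep( 5 ); // back off instead of adding load while replicas catch up
	}
} while ( $lagged );
```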

Phabricator_maintenance renamed this task from Fix causes of slave lag and get it to under 5 seconds at peak (tracking) to Fix causes of slave lag and get it to under 5 seconds at peak.Aug 14 2016, 12:08 AM

On https://grafana.wikimedia.org/dashboard/db/mysql the last 7 days of s1, s3, s4, s5, s6 and s7 look good (s1 has been fine for some time).

I saw a 1 min spike in s2 though.

Change 316733 had a related patch set uploaded (by Aaron Schulz):
Make updateCategoryCounts() lag checks better

https://gerrit.wikimedia.org/r/316733

@aaron how are you looking at per-cluster graphs on that grafana dashboard? I'm only seeing a per-server dropdown.

Tediously, by going through the main loaded slave DBs. There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff Graphite can do).

Please note that there are many times when there is maintenance, or servers in not-normal-but-still-pooled states (such as an ALTER TABLE in progress degrading service), so looking only at https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen can be misleading as a measure of "normal operation". But I have to say congratulations, and thank you for the hard work!

There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff Graphite can do).

I want you to meet a friend we made last quarter; he is called Prometheus monitoring, and I think you will love him :-P
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=slave&from=now-3d&to=now

Sadly, for now this is only SHOW SLAVE STATUS output; I will need to learn some Go to make mysqld_exporter (https://github.com/prometheus/mysqld_exporter) understand pt-heartbeat and multi-source replication.
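
For reference, the pt-heartbeat approach boils down to reading back a timestamp the tool writes on the master every second (a sketch using pt-heartbeat's default table/column names; host and credentials are placeholders):

```php
<?php
// Lag via pt-heartbeat: "now" minus the last replicated heartbeat timestamp.
// With multi-source replication you would additionally filter by the
// server_id of the master you care about.
$pdo = new PDO( 'mysql:host=db-replica.example.org;dbname=heartbeat', 'monitor', 'secret' );

$row = $pdo->query(
	'SELECT ts FROM heartbeat ORDER BY ts DESC LIMIT 1'
)->fetch( PDO::FETCH_ASSOC );

$lagSeconds = max( 0, microtime( true ) - strtotime( $row['ts'] ) );
printf( "Replication lag via pt-heartbeat: %.3f s\n", $lagSeconds );
```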

@aaron based on this discussion and what @jcrespo said about maintenance that can cause spikes, I think we need a plan for what the criteria for success are here. Otherwise, despite all the improvements, it might be difficult to claim at the end of the quarter that "slave lag is under 5 seconds at peak" if we still see larger peaks that we have to investigate manually to determine whether they were caused by maintenance.

I see that @Krinkle helped with improving the display of the graph we care about: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen Is that a good enough graph to measure success, or do you think we need to refine it further?

Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps? If so, maybe we could set up something similar to "show deployments" in Grafana for those, where we could see at a glance whether a spike was likely caused by a maintenance task that happened around the same time.

Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps?

They are definitely logged and scheduled on the Deployments page, but a schema change can easily take a month to be applied and cause issues on day 26, so I do not think that will be possible.

For example, this was clearly an ALTER TABLE that took a day to execute, after which the slave caught up:
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&from=1477033182550&to=1477048917441&panelId=6&fullscreen

The graph shows the lag of all servers, no matter whether they are pooled or under maintenance. It even shows the lag of the vslow/dump servers, which are not considered for lag purposes in MediaWiki (weight => 0), precisely because they get stressed quite a lot during the dump process and we do not care about them for general query serving.

Maybe instead of the max lag, you could plot the 80th percentile of lag (on a separate dashboard) and use that as the metric? Given that 80% is 4/5, it would cover all servers minus 1 or 2 per shard.

It turns out we cannot (at least, not easily) calculate percentiles over servers, only over time. :-/

I have at least created https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag, which may help in understanding what is going on.

It turns out we cannot (at least, not easily) calculate percentiles over servers, only over time. :-/

I have at least created https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag, which may help in understanding what is going on.

That's much more readable than the others. Thanks!

Did some of you put a log scale on the rows written/rows read graphs here? https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated ?

That doesn't make much sense there, because that graph shows the per-shard aggregates stacked on top of each other, which makes s7 look less "high" than s1. It does make sense for the lag graph, because it starts from 0.

Apologies if it was not any of you and this is completely off topic.

More on topic: on https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag I see db1073 consistently lagging around 10 seconds behind, unlike the rest of the slaves. Since it is only a single server, this could be a configuration, schema or hardware issue; we will investigate. CC @Marostegui

@jcrespo I can take care of db1073 if you like - up to you :)

Change 316733 merged by jenkins-bot:
Make updateCategoryCounts() have better lag checks

https://gerrit.wikimedia.org/r/316733

It turned out that db1073 had block device corruption (T149728) despite the hardware RAID 10, so it was not a software or OS issue.

As of now, in the last 7 days, I see the following shards with main-load servers having 5+ second spikes:

shard | spikes
s1 | 0
s2 | 1
s3 | 0
s4 | 0
s5 | 0
s6 | 0
s7 | 0

Per https://grafana-admin.wikimedia.org/dashboard/db/mysql-replication-lag?from=1482570013829&to=1483779613831

It looks like it's just the load-0 and load-1 servers (per db-eqiad.php) that occasionally show these spikes, *if* those graphs are to be trusted.

On the other hand, perhaps due to inadequate reporting intervals or relay lag, the "all slaves lagged" error (a proxy for 6+ seconds of lag on all non-zero-load DBs) shows up fairly often in the background.

Change 355651 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag

https://gerrit.wikimedia.org/r/355651

Change 355814 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler

https://gerrit.wikimedia.org/r/355814

Change 355651 merged by jenkins-bot:
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag

https://gerrit.wikimedia.org/r/355651

Change 355814 merged by jenkins-bot:
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler

https://gerrit.wikimedia.org/r/355814
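
A hedged sketch of how such an expectation might be configured via $wgTrxProfilerLimits (the exact key names used by the merged patch are not shown in this task, so treat 'maxAffected' and the numbers as illustrative):

```php
<?php
// Per-request-type limits checked by TransactionProfiler; exceeding one
// produces a DBPerformance log entry rather than a fatal error.
$wgTrxProfilerLimits['POST'] = [
	'readQueryTime'  => 5,   // seconds per read query (illustrative)
	'writeQueryTime' => 1,   // seconds per write query (illustrative)
	'maxAffected'    => 500, // rows affected by one write before logging
];
```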

@aaron @Imarlier One of us needs to follow up on the subtasks here.

I set T109179 as a child task, and I can confirm that replication lag went down significantly during the month or two we were on ROW-based replication on wikidatawiki/dewiki. However, I am not sure we will be able to migrate fully, as it causes some operational problems (schema changes cannot be parallelized). Please feel free to detach it from here: it would be nice to have and would help a lot, but it is not a hard dependency, and it needs a lot of thinking, probably more from the persistence group than from performance (or in collaboration).

Some things, however, like making sure all tables have a primary key, would be nice to have, and performance can help with those.

@aaron Need to check back on this, and verify whether this is still happening.

aaron removed aaron as the assignee of this task. Jul 16 2018, 9:12 PM
Gilles renamed this task from "Fix causes of slave lag and get it to under 5 seconds at peak" to "Fix causes of replica lag and get it to under 5 seconds at peak". Oct 1 2020, 7:05 AM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

@aaron Can this tracking task be closed? If there are causes remaining, they could be filed individually under Wikimedia-database-issue and Sustainability (Incident Followup) and possibly on our Performance-Team (Radar) if we want to stay notified of them specifically.