
Fix causes of replica lag and get it to under 5 seconds at peak
Closed, Resolved · Public

Description

Ideally lag would be in the 0-1 range.

Ideally 'max lag' could be set to 5 seconds once the lag is low enough. Doing that now would just result in lots of "DB locked" pages for users.
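
A minimal sketch of where that knob lives, assuming the standard $wgDBservers entry format (host, credentials and weights below are placeholders, not production values):

```php
<?php
// Hedged sketch only: each replica entry in $wgDBservers can carry a
// 'max lag' value (seconds). A replica whose lag exceeds it is taken out
// of read rotation; if every replica exceeds it, users hit the lagged
// database ("DB locked") error mentioned above, which is why the value
// can only be lowered to 5 once real lag is reliably below that.
$wgDBservers[] = [
	'host'     => 'db-replica-1.example.org', // placeholder host name
	'dbname'   => 'wikidb',
	'user'     => 'wikiuser',
	'password' => 'secret',
	'type'     => 'mysql',
	'load'     => 100,
	'max lag'  => 5, // seconds; the target of this task
];
```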

Details

Repo | Branch | Lines +/-
mediawiki/core | master | +45 -5
mediawiki/core | master | +33 -13
mediawiki/core | master | +160 -129
operations/mediawiki-config | master | +1 -1
mediawiki/core | master | +94 -22
operations/mediawiki-config | master | +1 -1
operations/mediawiki-config | master | +2 -2
mediawiki/core | master | +53 -7
operations/mediawiki-config | master | +3 -0
mediawiki/core | master | +37 -3
mediawiki/core | master | +20 -19
mediawiki/core | master | +305 -46
mediawiki/core | master | +5 -1
mediawiki/core | master | +48 -6
mediawiki/core | master | +20 -9
mediawiki/extensions/Renameuser | master | +142 -62
mediawiki/core | master | +3 -2
mediawiki/core | master | +1 -1
mediawiki/core | master | +45 -14
mediawiki/core | master | +10 -6
mediawiki/core | REL1_25 | +150 -174
mediawiki/core | master | +103 -13
mediawiki/core | master | +150 -174
mediawiki/core | wmf/1.26wmf1 | +1 -1
mediawiki/core | master | +1 -1

Related Objects

This task is connected to more than 200 other tasks; only direct parents and subtasks are shown here.

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 260507 merged by jenkins-bot:
Set initial $wgMaxUserDBWriteDuration value

https://gerrit.wikimedia.org/r/260507

aaron lowered the priority of this task from High to Medium. Mar 17 2016, 10:55 PM

Change 242814 abandoned by Aaron Schulz:
[WIP] Lowered "max lag" setting to 5 seconds

Reason:
conflicted

https://gerrit.wikimedia.org/r/242814

Change 275734 had a related patch set uploaded (by Aaron Schulz):
Lowered $wgMaxUserDBWriteDuration to 5

https://gerrit.wikimedia.org/r/275734

aaron closed subtask Restricted Task as Resolved. May 9 2016, 11:40 PM

When/if this succeeds (thanks aaron for working so hard on this), we could enforce (of course, we would start by only logging) that writes/transactions which write data take less than 1 second; a rough sketch of what that logging could look like is included at the end of this comment. This is an essential task for several needs:

  • Transparent failover: long-running transactions should either be killable or take less than 1 second, so that a master failover can happen without a noticeable read-only period
  • Galera and clustering in general: even if we do not use clustering, row-based replication needs small write datasets to be effective
  • Multi-datacenter and/or multi-master writes/routing: as per above

MariaDB 10 masters, semi-sync replication, parallel replication and new hardware (happening now or very soon) will also help.
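
A purely illustrative sketch of the logging-first approach described above (the threshold and the surrounding code are assumptions, not an existing MediaWiki hook):

```php
<?php
// Illustrative only: time the write portion of a request and log (not yet
// reject) anything that exceeds the proposed 1 second budget.
$maxWriteSeconds = 1.0;

$writeStart = microtime( true );
// ... the request's DB write queries / transaction run here ...
$writeSeconds = microtime( true ) - $writeStart;

if ( $writeSeconds > $maxWriteSeconds ) {
	// Start with logging; hard enforcement (rollback/rejection) can come
	// later once the log volume shows it is safe.
	wfDebugLog(
		'DBPerformance',
		sprintf( 'Write transaction took %.2fs (budget %.2fs)', $writeSeconds, $maxWriteSeconds )
	);
}
```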

Change 275734 merged by jenkins-bot:
Lowered $wgMaxUserDBWriteDuration to 5

https://gerrit.wikimedia.org/r/275734
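
In configuration terms the merged change amounts to roughly the following (the exact wmf-config file and any wrapper variables are not shown in this task):

```php
<?php
// Cap, in seconds, on how long the DB writes of a single web request may
// take; write rounds that exceed it are rejected with an error instead of
// being allowed to pile up replication lag.
$wgMaxUserDBWriteDuration = 5;
```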

Change 289992 had a related patch set uploaded (by Aaron Schulz):
[WIP] Make LinksDeletionUpdate use query batching

https://gerrit.wikimedia.org/r/289992

Change 289992 merged by jenkins-bot:
Make LinksDeletionUpdate use query batching

https://gerrit.wikimedia.org/r/289992
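
The general pattern (a sketch, not the actual LinksDeletionUpdate code; $dbw, $lbFactory and $linkIds are assumed to be in scope) is to split one big delete into small chunks and wait for replicas between them:

```php
<?php
// Delete link rows in small batches so no single statement replicates a
// huge write, and let replicas catch up between batches.
$batches = array_chunk( $linkIds, 100 ); // batch size is illustrative

foreach ( $batches as $batch ) {
	$dbw->delete( 'pagelinks', [ 'pl_from' => $batch ], __METHOD__ );
	// Block until replicas have caught up before issuing the next batch.
	$lbFactory->waitForReplication();
}
```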

Change 293674 had a related patch set uploaded (by Aaron Schulz):
Lower $wgAPIMaxLagThreshold to 5

https://gerrit.wikimedia.org/r/293674

Change 293674 merged by jenkins-bot:
Lower $wgAPIMaxLagThreshold to 5

https://gerrit.wikimedia.org/r/293674
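
For context, this is the client-side convention that the API's lag thresholds interact with: bots send maxlag=5 and back off when the API reports that replicas are too far behind (a hedged sketch; the endpoint and back-off interval here are illustrative):

```php
<?php
// Sketch of a lag-aware API client: pass maxlag=5 and retry with a delay
// whenever the API answers with the "maxlag" error code.
$url = 'https://en.wikipedia.org/w/api.php'
	. '?action=query&meta=siteinfo&format=json&maxlag=5';

do {
	$data = json_decode( file_get_contents( $url ), true );
	$lagged = isset( $data['error']['code'] ) && $data['error']['code'] === 'maxlag';
	if ( $lagged ) {
		sleep( 5 ); // back off instead of adding load while replicas catch up
	}
} while ( $lagged );
```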

Phabricator_maintenance renamed this task from Fix causes of slave lag and get it to under 5 seconds at peak (tracking) to Fix causes of slave lag and get it to under 5 seconds at peak.Aug 14 2016, 12:08 AM

On https://grafana.wikimedia.org/dashboard/db/mysql the last 7 days of s1, s3, s4, s5, s6 and s7 look good (s1 has been fine for some time).

I saw a 1 min spike in s2 though.

Change 316733 had a related patch set uploaded (by Aaron Schulz):
Make updateCategoryCounts() lag checks better

https://gerrit.wikimedia.org/r/316733

@aaron how are you looking at per-cluster graphs on that grafana dashboard? I'm only seeing a per-server dropdown.

Tediously, by going through the main loaded slave DBs. There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff Graphite can do).

Please note that there are many times when there is maintenance, or servers in not-normal-but-still-pooled states (such as an ALTER TABLE in progress degrading service), so looking only at https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen can be misleading as a measure of "normal operation". But I have to say congratulations, and thank you for the hard work!

There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff Graphite can do).

I want you to meet a friend we made last quarter; he is called Prometheus monitoring, and I think you will love him :-P
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=slave&from=now-3d&to=now

Sadly, for now this is only SHOW SLAVE STATUS output; I will need to learn some Go to make mysqld_exporter (https://github.com/prometheus/mysqld_exporter) understand pt-heartbeat and multi-source replication.
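
For reference, the pt-heartbeat approach boils down to reading back a timestamp the tool writes on the master every second (a sketch using pt-heartbeat's default table/column names; host and credentials are placeholders):

```php
<?php
// Lag via pt-heartbeat: "now" minus the last replicated heartbeat timestamp.
// With multi-source replication you would additionally filter by the
// server_id of the master you care about.
$pdo = new PDO( 'mysql:host=db-replica.example.org;dbname=heartbeat', 'monitor', 'secret' );

$row = $pdo->query(
	'SELECT ts FROM heartbeat ORDER BY ts DESC LIMIT 1'
)->fetch( PDO::FETCH_ASSOC );

$lagSeconds = max( 0, microtime( true ) - strtotime( $row['ts'] ) );
printf( "Replication lag via pt-heartbeat: %.3f s\n", $lagSeconds );
```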

@aaron based on this discussion and what @jcrespo said about maintenance that can cause spikes, I think we need a plan for what the criteria for success are here. Otherwise, despite all the improvements, it might be difficult to claim at the end of the quarter that "slave lag is under 5 seconds at peak" if we still see larger peaks that we have to investigate manually to determine whether they were caused by maintenance.

I see that @Krinkle helped with improving the display of the graph we care about: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen Is that a good enough graph to measure success, or do you think we need to refine it further?

Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps? If so, maybe we could set up something similar to "show deployments" in Grafana for those, where we could see at a glance whether a spike was likely caused by a maintenance task that happened around the same time.

Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps?

They are definitely logged and scheduled on the Deployments page, but a schema change can easily take a month to be applied and cause issues on day 26, so I do not think that will be possible.

For example, this was clearly an ALTER TABLE that took a day to execute, after which the slave caught up:
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&from=1477033182550&to=1477048917441&panelId=6&fullscreen

The graph shows the lag of all servers, no matter whether they are pooled or under maintenance. It even shows the lag of the vslow/dump servers, which are not considered for lag purposes in MediaWiki (weight => 0), precisely because they get stressed quite a lot during the dump process and we do not care about them for general query serving.

Maybe instead of the max lag, you could plot the 80th percentile of lag (on a separate dashboard) and use that as the metric? Given that 80% is 4/5, it would cover all servers minus 1 or 2 per shard.

It turns out we cannot (at least, not easily) calculate percentiles over servers, only over time. :-/

I have at least created https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag, which may help in understanding what is going on.

It turns out we cannot (at least, not easily) calculate percentiles over servers, only over time. :-/

I have at least created https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag, which may help in understanding what is going on.

That's much more readable than the others. Thanks!

Did some of you put a log scale on the rows written/rows read graphs here? https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated ?

That doesn't make much sense there, because that graph shows the per-shard aggregates stacked on top of each other, which makes s7 look less "high" than s1. It does make sense for the lag graph, because it starts from 0.

Apologies if it was not any of you and this is completely off topic.

More on topic: on https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag I see db1073 consistently lagging around 10 seconds behind, unlike the rest of the slaves. Since it is only a single server, this could be a configuration, schema or hardware issue; we will investigate. CC @Marostegui

@jcrespo I can take care of db1073 if you like - up to you :)

Change 316733 merged by jenkins-bot:
Make updateCategoryCounts() have better lag checks

https://gerrit.wikimedia.org/r/316733

It turned out that db1073 had block device corruption (T149728) despite the hardware RAID 10, so it was not a software or OS issue.

As of now, in the last 7 days, I see the following shards with main-load servers having 5+ second spikes:

shard | spikes
s1 | 0
s2 | 1
s3 | 0
s4 | 0
s5 | 0
s6 | 0
s7 | 0

Per https://grafana-admin.wikimedia.org/dashboard/db/mysql-replication-lag?from=1482570013829&to=1483779613831

It looks like it's just the load-0 and load-1 servers (per db-eqiad.php) that occasionally show these spikes, *if* those graphs are to be trusted.

On the other hand, perhaps due to inadequate reporting intervals or relay lag, the "all slaves lagged" error (a proxy for 6+ seconds of lag on all non-zero-load DBs) shows up fairly often in the background.

Change 355651 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag

https://gerrit.wikimedia.org/r/355651

Change 355814 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler

https://gerrit.wikimedia.org/r/355814

Change 355651 merged by jenkins-bot:
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag

https://gerrit.wikimedia.org/r/355651

Change 355814 merged by jenkins-bot:
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler

https://gerrit.wikimedia.org/r/355814
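
A hedged sketch of how such an expectation might be configured via $wgTrxProfilerLimits (the exact key names used by the merged patch are not shown in this task, so treat 'maxAffected' and the numbers as illustrative):

```php
<?php
// Per-request-type limits checked by TransactionProfiler; exceeding one
// produces a DBPerformance log entry rather than a fatal error.
$wgTrxProfilerLimits['POST'] = [
	'readQueryTime'  => 5,   // seconds per read query (illustrative)
	'writeQueryTime' => 1,   // seconds per write query (illustrative)
	'maxAffected'    => 500, // rows affected by one write before logging
];
```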

@aaron @Imarlier One of us needs to follow up on the subtasks here.

I set T109179 as a child task, and I can confirm that replication lag went down significantly during the month or two we were on ROW-based replication on wikidatawiki/dewiki. However, I am not sure we will be able to migrate fully, as it causes some operational problems (schema changes cannot be parallelized). Please feel free to detach it from here: it would be nice to have and would help a lot, but it is not a hard dependency, and it needs a lot of thinking, probably more from the persistence group than from performance (or in collaboration).

Some things, however, like making sure all tables have a primary key, would be nice to have, and performance can help with those.

@aaron Need to check back on this, and verify whether this is still happening.

aaron removed aaron as the assignee of this task. Jul 16 2018, 9:12 PM
Gilles renamed this task from "Fix causes of slave lag and get it to under 5 seconds at peak" to "Fix causes of replica lag and get it to under 5 seconds at peak". Oct 1 2020, 7:05 AM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

@aaron Can this tracking task be closed? If there are causes remaining, they could be filed individually under Wikimedia-database-issue and Sustainability (Incident Followup) and possibly on our Performance-Team (Radar) if we want to stay notified of them specifically.