Ideally lag would be in the 0-1 range.
Ideally 'max lag' could be set to 5 seconds once the lag is low enough. Doing that now would just result in lots of "DB locked" pages for users.
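For context, the "max lag" mechanism referred to here works roughly like this: MediaWiki compares the lag of the most lagged pooled replica against the threshold, and defers or rejects work when it is exceeded. A minimal sketch of that check, with hypothetical names (this is not MediaWiki's actual API):

```python
def should_defer_writes(replica_lags, max_lag=5):
    """Return True if the most lagged pooled replica exceeds max_lag.

    replica_lags: mapping of server name -> current lag in seconds.
    Hypothetical sketch of the 'max lag' check discussed in this task.
    """
    return max(replica_lags.values(), default=0) > max_lag

# With one 6-second straggler, a 5-second threshold trips; 10 seconds does not.
lags = {"db1052": 0.3, "db1066": 0.8, "db1073": 6.2}
print(should_defer_writes(lags))               # True
print(should_defer_writes(lags, max_lag=10))   # False
```

This illustrates why lowering the threshold to 5 seconds before lag is under control would produce lots of "DB locked" pages: any single lagged replica above the threshold trips the check.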
Change 242814 abandoned by Aaron Schulz:
[WIP] Lowered "max lag" setting to 5 seconds
Reason:
conflicted
Change 275734 had a related patch set uploaded (by Aaron Schulz):
Lowered $wgMaxUserDBWriteDuration to 5
When/if this succeeds (thanks, Aaron, for working so hard on this), we could enforce (starting, of course, by only logging) that queries and transactions which write data take less than 1 second. This is an essential task for several needs.
MariaDB 10 masters, semi-sync, parallel replication and new hardware (happening now or very soon) will also help.
Change 289992 had a related patch set uploaded (by Aaron Schulz):
[WIP] Make LinksDeletionUpdate use query batching
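Query batching, as in the LinksDeletionUpdate patch above, means splitting one large delete into small chunks and waiting for replicas to catch up between chunks, so the master never races far ahead. A minimal sketch of the pattern, assuming hypothetical `execute_delete` and `wait_for_replication` callbacks (the real change is PHP in MediaWiki core):

```python
def delete_in_batches(row_ids, batch_size, execute_delete, wait_for_replication):
    """Delete rows in chunks, pausing for replication between chunks.

    Returns the number of batches issued. Sketch only; not MediaWiki code.
    """
    batches = 0
    for start in range(0, len(row_ids), batch_size):
        chunk = row_ids[start:start + batch_size]
        execute_delete(chunk)       # one short write transaction per chunk
        wait_for_replication()      # let replicas catch up before continuing
        batches += 1
    return batches

deleted = []
n = delete_in_batches(list(range(10)), 3, deleted.extend, lambda: None)
print(n)             # 4 batches: 3 + 3 + 3 + 1
print(len(deleted))  # all 10 rows deleted
```

The point of the wait between chunks is that each replication round only has to absorb `batch_size` rows, keeping per-transaction write time (and thus lag) bounded.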
Change 293674 had a related patch set uploaded (by Aaron Schulz):
Lower $wgAPIMaxLagThreshold to 5
Last 7 days of s1,s3,s4,s5,s6,s7 look good (s1 has been fine for some time) on https://grafana.wikimedia.org/dashboard/db/mysql .
I saw a 1 min spike in s2 though.
Change 316733 had a related patch set uploaded (by Aaron Schulz):
Make updateCategoryCounts() lag checks better
@aaron how are you looking at per-cluster graphs on that grafana dashboard? I'm only seeing a per-server dropdown.
Tediously, by going through the main loaded slave DBs. There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff graphite can do).
Please note there are many times when there is maintenance, or even not-normal-but-pooled states (such as an ongoing alter table degrading service), so looking only at https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen can be misleading as a measure of "normal operation". But I have to say congratulations, and thank you for the hard work!
There should be some graph showing the X servers with the highest lag in the period, or something similar (normal stuff graphite can do).
I want you to meet a friend we made last quarter; he is called Prometheus monitoring, and I think you will love him :-P
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=slave&from=now-3d&to=now
Sadly, for now this is only SHOW SLAVE STATUS output, I will need to learn some go to make mysqld-exporter (https://github.com/prometheus/mysqld_exporter) understand pt-heartbeat and multi-source replication.
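For context, pt-heartbeat measures lag by writing the current timestamp into a heartbeat table on the master every second; the replica-side measurement is simply its own clock minus the last replicated heartbeat, which also captures relay lag that `SHOW SLAVE STATUS`'s `Seconds_Behind_Master` can miss. A minimal sketch of that computation (hypothetical names, not mysqld_exporter code):

```python
from datetime import datetime, timezone

def heartbeat_lag(heartbeat_ts, now):
    """Lag in seconds: replica clock minus last replicated master heartbeat.

    Clamped at 0 to absorb small clock skew. Sketch only.
    """
    return max(0.0, (now - heartbeat_ts).total_seconds())

master_wrote = datetime(2016, 10, 21, 12, 0, 0, tzinfo=timezone.utc)
replica_now = datetime(2016, 10, 21, 12, 0, 4, tzinfo=timezone.utc)
print(heartbeat_lag(master_wrote, replica_now))  # 4.0 seconds behind
```

Note this approach assumes master and replica clocks are kept in sync (e.g. via NTP), which is why the result is clamped at zero.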
@aaron based on this discussion and what @jcrespo said about maintenance that can cause spikes, I think we need a plan for what the criteria for success are here. Otherwise, despite all the improvements, it might be difficult to claim at the end of the quarter that "slave lag is under 5 seconds at peak" if we still have greater peaks that we need to manually investigate to check whether they were maintenance or not.
I see that @Krinkle helped with improving the display of the graph we care about: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen Is that a good enough graph to measure success, or do you think we need to refine it further?
Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps? If so, maybe we could set up something similar to "show deployments" in grafana for those, where we could see at a glance if a spike was likely caused by a maintenance task that happened around the time of a spike.
Regarding maintenance that can cause spikes, @jcrespo are DB maintenance actions logged with timestamps?
They are definitely logged and scheduled on the Deployments page, but a schema change can easily take a month to be applied and cause issues on day 26, so I do not think that will be possible.
For example, this was clearly an alter table that required a day to be executed, then the slave caught up:
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&from=1477033182550&to=1477048917441&panelId=6&fullscreen
The graph shows the lag of all servers, no matter whether they are pooled or not, or under maintenance. It even shows the lag of vslow/dump servers, which are not considered for lag purposes in MediaWiki (weight => 0), precisely because they certainly get stressed quite a lot during the dump process, and we do not care about them for general query serving.
Maybe instead of the max lag, you can plot the 80th percentile of lag (on a separate dashboard) and use that as a metric? Given that 80% is 4/5, it means it would monitor all servers minus 1 or 2 per shard.
It turns out we cannot (at least, easily) calculate percentiles over servers, only over time. :-/
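Outside Graphite, computing a percentile across servers is straightforward if a snapshot of per-server lag can be exported. A sketch using the nearest-rank method (assumed data; the server names and values are illustrative only):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples
    at or below it. Sketch for a one-off analysis, not a Grafana feature."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Ten servers in a shard; the 80th percentile ignores the worst stragglers.
lags = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.8, 6.0, 12.0]
print(percentile(lags, 80))  # 0.8 - the two outliers are excluded
print(max(lags))             # 12.0 - what a max-lag graph would show
```

This illustrates why the 80th percentile was proposed: the max is dominated by one or two servers under maintenance, while the percentile reflects the serving majority.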
I have at least created:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag
which may help us understand what is going on.
Did some of you put a log scale on the rows written/rows read graphs here? https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated
That doesn't make much sense, because that graph shows the aggregates of those per shard stacked on top of each other, which makes s7 look less "high" than s1. It makes sense for the lag graph because it starts from 0.
Apologies if it was not any of you and this is completely off topic.
More on topic: I see on https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag that db1073 is consistently lagging around 10 seconds behind, unlike the rest of the slaves. Since it is only a single server, this could be a configuration, schema or hardware issue; we will investigate. CC @Marostegui
For the record, db1073 has compressed tables: https://phabricator.wikimedia.org/T139055#2730773
Change 316733 merged by jenkins-bot:
Make updateCategoryCounts() have better lag checks
It turned out that db1073 had block device corruption (T149728), despite the hardware RAID 10; so it was not a software or OS issue.
As of now, in the last 7 days, I see the following shards with main-load servers having 5+ second spikes:
| shard | spikes |
|-------|--------|
| s1    | 0      |
| s2    | 1      |
| s3    | 0      |
| s4    | 0      |
| s5    | 0      |
| s6    | 0      |
| s7    | 0      |
It looks like it's just the load-0 and load-1 (db-eqiad.php) servers that occasionally show these spikes, *if* those graphs are to be trusted.
On the other hand, perhaps due to inadequate reporting intervals or relay lag, the "all slaves lagged" error (a proxy for 6+ seconds of lag on all non-zero-load DBs) still shows up fairly often in the background.
Change 355651 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag
Change 355814 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler
Change 355651 merged by jenkins-bot:
[mediawiki/core@master] Add $wgMaxJobDBWriteDuration setting for avoiding replication lag
Change 355814 merged by jenkins-bot:
[mediawiki/core@master] Log when transactions affect many rows in TransactionProfiler
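The TransactionProfiler change above logs transactions that affect many rows; the general pattern is a per-transaction counter checked at commit time. A minimal sketch with hypothetical names and threshold (the real code is PHP in MediaWiki's Rdbms library, and the actual limit is configurable):

```python
import logging

class TransactionProfilerSketch:
    """Track rows written in a transaction and warn past a threshold.

    Hypothetical sketch of the logging pattern; not MediaWiki's class.
    """
    def __init__(self, max_affected_rows=500):
        self.max_affected_rows = max_affected_rows
        self.affected = 0

    def record_write(self, rows):
        self.affected += rows

    def commit(self, fname):
        over = self.affected > self.max_affected_rows
        if over:
            logging.warning("%s: transaction affected %d rows (limit %d)",
                            fname, self.affected, self.max_affected_rows)
        self.affected = 0  # reset for the next transaction
        return over        # True if the threshold was exceeded

tp = TransactionProfilerSketch(max_affected_rows=500)
tp.record_write(200)
tp.record_write(400)
print(tp.commit("LinksUpdate::doUpdate"))  # True: 600 rows > 500 limit
```

Logging (rather than aborting) lets the offending call sites be found and batched before any hard limit is enforced, matching the "start by logging" approach discussed earlier in this task.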
I set T109179 as a child task, and I can confirm that replication lag went down significantly during the month or two we were on ROW-based replication on wikidatawiki/dewiki. However, I am not sure we will be able to migrate fully, as it causes some operational problems (schema changes cannot be parallelized). Please feel free to detach it from here: it would be nice to have and would help a lot, but it is not a hard dependency and needs a lot of thinking, probably more from the Persistence group than from Performance (or in collaboration).
Some things, however, like making sure all tables have a primary key, would be nice to have, and Performance can help with those.
@aaron Can this tracking task be closed? If there are causes remaining, they could be filed individually under Wikimedia-database-issue and Sustainability (Incident Followup) and possibly on our Performance-Team (Radar) if we want to stay notified of them specifically.