
High replication lag causing read only mode on commons
Closed, ResolvedPublic

Description

Today, Oct 12, between about 7:10 and 10:10 UTC there was a large amount of lag (>6 seconds) for Commons, causing the database to intermittently go into read-only mode. https://logstash.wikimedia.org/goto/2395ce2a93ebfab066887548f18ed62b and https://commons.wikimedia.org/wiki/Commons:Village_pump#Extended_read-only_of_Commons

Additionally, between 15:00 and now (17:00) there seems to be a (much) smaller number of lag-related errors. (https://logstash.wikimedia.org/goto/3b982cffdf77978581c578ac50e818e8 ) These were actually for db1034 (s7) and db1070 (s5). I suppose Commons was doing something Wikidata- or CentralAuth-related which triggered read-only mode for the request to the foreign DB. Edit: I just got confused here by logstash. db1034 had about 7.8 seconds of lag for a very short period (between 2017-10-13T01:42:38 and 2017-10-13T01:42:43). The other s7 DBs were fine, so read-only mode was not triggered. Similarly, db1070 intermittently had some bouts of lag, but this didn't cause any real problems as (I think) the other s5 replicas were fine.

  • Perhaps the purging of Commons recentchanges needs to go slower? (T177772)
  • Shouldn't this have triggered some sort of alerting, especially for the 7:10-10:10 period? There were no IRC alerts at that time in the #wikimedia-operations IRC channel (I don't know how alerting for DBs is managed, so this comment may be silly)

Event Timeline

Bawolff updated the task description. (Show Details)

Could this have been caused by T178040: Purge html5-misnesting tags from the database, which disproportionately affected commonswiki? I started it at 7:18 UTC according to SAL, and commons was the first large wiki it hit. But I was deleting 500 rows at a time and then calling wfWaitForSlaves() in between, so I thought it would have avoided replica lag...
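For reference, that batching pattern is roughly the following (a minimal sketch only; the table, column, and condition names are made up and the actual script differs):

  <?php
  // Sketch of a batched purge: delete 500 rows at a time, waiting for
  // replicas to catch up between batches (illustrative names only).
  $dbw = wfGetDB( DB_MASTER );
  $batchSize = 500;

  do {
      $ids = $dbw->selectFieldValues(
          'some_table',               // hypothetical table
          'st_id',                    // hypothetical primary key
          [ 'st_needs_purge' => 1 ],  // hypothetical condition
          __METHOD__,
          [ 'LIMIT' => $batchSize ]
      );

      if ( $ids ) {
          $dbw->delete( 'some_table', [ 'st_id' => $ids ], __METHOD__ );
          // Block until replicas have caught up before starting the next batch.
          wfWaitForSlaves();
      }
  } while ( count( $ids ) >= $batchSize );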

The RC purge was stopped today at 9:18 UTC, so it is unlikely it caused this. It has also been running for a few days without causing any trouble.
I would be more inclined to think that it was due to @Legoktm's script.

As to why it didn't alert: we normally do several checks to avoid false positives, so in order to trigger an alert the lag must be consistent (and high), and it looks like it was spiking: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=4&fullscreen&orgId=1&from=1507750727567&to=1507837127567&var-dc=eqiad%20prometheus%2Fops

We are now rethinking how we alert, so you might want to keep an eye on T177782.

@Legoktm when do you think your script will be finished?

Thanks

So I guess part of the problem with the alerting is that it's looking at individual lag, whereas the real issue (MediaWiki going into read-only mode) comes up when all the slaves are lagged more than 6 seconds - and as it stands, MediaWiki doesn't even create a log entry for that situation.
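(If I understand the configuration right, that 6-second threshold is the per-replica 'max lag' value in the $wgDBservers entries - sketched below with made-up values; the real settings live in operations/mediawiki-config and may differ.)

  // Sketch of where the lag threshold lives in a standard $wgDBservers entry
  // (illustrative values only; not the actual WMF configuration).
  $wgDBservers[] = [
      'host'     => 'db1234',        // hypothetical replica
      'dbname'   => 'commonswiki',
      'user'     => $wgDBuser,
      'password' => $wgDBpassword,
      'type'     => 'mysql',
      'load'     => 100,
      // A replica lagging beyond this many seconds is taken out of rotation;
      // once every replica is over its limit, MediaWiki serves the wiki
      // read-only until replication catches up.
      'max lag'  => 6,
  ];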

So I guess part of the problem with the alerting is that it's looking at individual lag, whereas the real issue (MediaWiki going into read-only mode) comes up when all the slaves are lagged more than 6 seconds - and as it stands, MediaWiki doesn't even create a log entry for that situation.

I believe there have been discussions about whether MediaWiki handles well the case where only one slave is lagged.
There have also been discussions on Operations about whether we should alert if MediaWiki goes into read only, but that can cause so many false positives that it might be paging every hour :-)

Slaves can lag (for a bit) at any time; that can happen and will probably keep happening in the future. Not a massive number of seconds (as you saw and described in the original task description). There is not much DBAs can do about it, as it is always a possibility.
Not sure what to do with this task, as we (DBAs) cannot do anything about it.

There was a ticket talking about read only on mediawiki, some sort of tracking, but I cannot find it now.

There have also been discussions on Operations about whether we should alert if MediaWiki goes into read only, but that can cause so many false positives that it might be paging every hour :-)

I don't think actual paging is needed in this situation. DBA attention was not needed - but having some spam in #wikimedia-operations when going into read-only mode probably would have stopped the issue, as the person running the maintenance script would notice the alert spam and connect it to the script they had just started running. But as you say, getting an appropriate threshold to avoid false positives can be challenging (something like >5000 requests served in read-only mode in the last 5 minutes would maybe seem appropriate).

There have also been discussions on Operations about whether we should alert if MediaWiki goes into read only, but that can cause so many false positives that it might be paging every hour :-)

I don't think actual paging is needed in this situation. DBA attention was not needed - but having some spam in #wikimedia-operations when going into read-only mode probably would have stopped the issue, as the person running the maintenance script would notice the alert spam and connect it to the script they had just started running.

Assuming that person was on IRC at the time :-)

But as you say, getting an appropriate threshold to avoid false positives can be challenging (something like >5000 requests served in read-only mode in the last 5 minutes would maybe seem appropriate).

It would be hard to set a threshold without studying it for months, depending on the traffic patterns, spikes, duration etc...

ops is looking into adjusting how the alerts work to make sure that such things are alerted for in the future.

That is not true; "ops" do not care about databases lagging 10 seconds, especially when the lag is created by MediaWiki maintenance not run by them. We do not have a say in such jobs, and even existing rules (like announcing them on Deployments) are not followed. Ops are not going to police and monitor jobs they have no control over or awareness of. Performance set the read-only mode to 5 seconds without taking ops alerts into account - they should take care of monitoring their own configuration changes (or someone else at the MediaWiki level should, if the code checking it is not working). It seems to me that their monitoring is not that useful: https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?orgId=1&from=1507785949387&to=1507812249394

Other than that, IRC alerts about databases lagging would be completely ignored.

The funny thing is that DBAs have been running rather intensive purging all week long, and lag only appears when they are on holidays :-D.

Most jobs/scripts seem to be executing WaitForReplication() instead of commitAndWaitForReplication(), or COMMIT + WaitForReplication(). Hopefully I am misunderstanding the code. But if I am not, that would make things worse (much worse) in terms of performance (lagging, general query latency), not better. I would expect WaitForReplication() to commit whatever was done automatically; otherwise, the rows written pile up on the master on every loop iteration until they create lag once the full run is done.

In the last 7 days there have been ~12M events of databases lagging and 213K events of MediaWiki deciding to put a wiki into read only. https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication , for context.

To clarify, this is not a rant against Performance - who are already doing more than they should - but about the lack of maintenance for database code, the often reckless editing of the WMF databases, and the lack of coordination among teams.

assuming that person was on IRC at the time :-)

Well, it's usually considered very bad form to do any sort of maintenance without being in #wikimedia-operations, at least at the beginning of it.


Most jobs/scripts seem to be executing WaitForReplication() instead of commitAndWaitForReplication(), or COMMIT + WaitForReplication(). Hopefully I am misunderstanding the code. But if I am not, that would make things worse (much worse) in terms of performance (lagging, general query latency), not better. I would expect WaitForReplication() to commit whatever was done automatically; otherwise, the rows written pile up on the master on every loop iteration until they create lag once the full run is done.

Most scripts run with DBO_TRX off, so they are in autocommit mode (I think; I'm not sure about this).
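For comparison, the commit-then-wait pattern described above would look roughly like this against the LBFactory API (a sketch only; the table and conditions are made up, and I may well be missing how DBO_TRX interacts with it):

  use MediaWiki\MediaWikiServices;

  // Sketch: commit each batch of writes, then wait for replicas, so the
  // writes replicate as the script goes instead of piling up on the master.
  $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
  $ticket = $lbFactory->getEmptyTransactionTicket( __METHOD__ );
  $dbw = wfGetDB( DB_MASTER );

  foreach ( $batches as $batch ) {  // $batches: hypothetical list of ID batches
      $dbw->delete( 'some_table', [ 'st_id' => $batch ], __METHOD__ );
      // Commits any open transactions on all handles, then waits for replication.
      $lbFactory->commitAndWaitForReplication( __METHOD__, $ticket );
  }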

ops is looking into adjusting how the alerts work to make sure that such things are alerted for in the future.

That is not true; "ops" do not care about databases lagging 10 seconds, especially when the lag is created by MediaWiki maintenance not run by them. We do not have a say in such jobs, and even existing rules (like announcing them on Deployments) are not followed. Ops are not going to police and monitor jobs they have no control over or awareness of. Performance set the read-only mode to 5 seconds without taking ops alerts into account - they should take care of monitoring their own configuration changes (or someone else at the MediaWiki level should, if the code checking it is not working). It seems to me that their monitoring is not that useful: https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?orgId=1&from=1507785949387&to=1507812249394

Other than that, IRC alerts about databases lagging would be completely ignored.

The funny thing is that DBAs have been running rather intensive purging all week long, and lag only appears when they are on holidays :-D.

Right. I didn't mean to imply that this was necessarily a DBA/ops responsibility, just that it should be someone's responsibility, and yours was the closest team I could think of. But maybe I should be pinging Performance-Team or Release-Engineering-Team instead (?).

In any case, MediaWiki devs are not DB experts. We are going to screw up from time to time. It's important to have some sort of automated feedback when we do, so we know we've done something wrong and can stop what we're doing and figure out what happened. I consider it rather unfortunate that users had their wiki disrupted for 3 hours over something very easily avoided; we just didn't realize it was happening until after it was done.

In the last 7 days there have been ~12M events of databases lagging and 213K events of MediaWiki deciding to put a wiki into read only. https://logstash.wikimedia.org/app/kibana#/dashboard/DBReplication , for context.

Of which roughly half correspond to the 3.5 hour period of this incident.


just that it should be someone's responsibility, and yours was the closest team I could think of

At https://www.mediawiki.org/wiki/Developers/Maintainers it says the party responsible for "Back-end database issues: SQL errors, compatibility, etc." is "no one". Maybe we should tell someone about that.

One could fill a book with the areas of wikimedia that nobody or everybody jointly (which is the same as nobody) is responsible for...

jcrespo claimed this task.

@Bawolff I am going to close this - there are many nuances in the load balancer (reported on other tickets) and the databases to tune, but this report is not actionable by itself unless other specifics are added. The most likely outcome is that the specific problem mentioned will not appear again, but others will. As this is more of a "one time X happened" report (incident documentation), I think it is not really relevant anymore. In particular, there have been documented cases of lag reports that were misreported, and those have been fixed, while the specific recentchanges purging issues are most definitely not happening now. Also, the load balancer has been improved and changed many times since then. Feel free to reopen if you think there is something to keep here (for now I consider it resolved).

One last answer - replication lag, unless it is sustained, does not alert. E.g. if replication breaks and a replica lags beyond 300 seconds (or something else gives it sustained lag), it pages, but otherwise ops do not monitor lag, as it is considered more on the performance side of things. Performance added monitoring on Prometheus https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?panelId=1&fullscreen&orgId=1&from=now-24h&to=now , because in most cases MediaWiki thinks there is lag, but it exists only at the MediaWiki layer and not in the database. Paging also should not happen for individual servers, but for the service as a whole (e.g. "no edits are possible", not "one server has lag").