
Dumps generation without prefetch causes disruption to the production environment
Open, In Progress, High, Public

Description

Today, June 20th 2024, we got paged because of MariaDB Replica Lag on the s1 cluster, on db1206.

The configured threshold for paging is 300 seconds and the actual value was 300.70 seconds. Ever so slightly above the limit.

Shortly after the issue resolved itself without manual intervention.

First investigation by the DBAs showed:

18:50 < Amir1> db1206 starts lagging a lot when dumps start, something in the query is either wrong or it bombard the replica. Either way, it needs to be investigated.
18:55 < Amir1> > | 236130644 | wikiadmin2023   | 10.64.0.157:37742    | enwiki | Query     |       1 | Creating sort index                                    | SELECT /* WikiExporter::dumpPages  */  /*!  .. STRAIGHT_JOIN */ re
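
For reference, the second line above is truncated MariaDB process-list output. A minimal way to repeat the check on the replica (a sketch only; the exact client wrapper used on our DB hosts may differ):

sudo mysql -e "SHOW PROCESSLIST" | grep -i dumpPages                 # spot the WikiExporter::dumpPages query
sudo mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master     # current replication lag in seconds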

Notification Type: PROBLEM

Service: MariaDB Replica Lag: s1 #page
Host: db1206 #page
Address: 10.64.16.89
State: CRITICAL

Date/Time: Thu Jun 20 18:15:09 UTC 2024

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

Acknowledged by :

Additional Info:

CRITICAL slave_sql_lag Replication lag: 300.70 seconds


The query:

SELECT /* WikiExporter::dumpPages */ /*! STRAIGHT_JOIN */
  rev_id, rev_page, rev_actor,
  actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`,
  rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1,
  comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`,
  comment_rev_comment.comment_id AS `rev_comment_cid`,
  page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len,
  slot_revision_id, slot_content_id, slot_origin, slot_role_id,
  content_size, content_sha1, content_address, content_model
FROM `revision`
JOIN `page` ON ((rev_page=page_id))
JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor))
JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id))
JOIN `slots` ON ((slot_revision_id = rev_id))
JOIN `content` ON ((content_id = slot_content_id))
WHERE (page_id >= 673734 AND page_id < 673816) AND ((rev_page > 0 OR (rev_page = 0 AND rev_id > 0)))
ORDER BY rev_page ASC, rev_id ASC
LIMIT 50000

In addition, dumps generation for English Wikipedia also caused network saturation in eqiad:

 <Amir1>	yup it's dumps: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535
<Amir1>	it seems to be only snapshot1012 and that host has enwiki dump running

That had severe consequences, including a full outage for editing that persisted for more than half an hour.

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2024-06-21T07:03:59Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1206 depool for debugging T368098', diff saved to https://phabricator.wikimedia.org/P65293 and previous config saved to /var/cache/conftool/dbconfig/20240621-070358-arnaudb.json

I depooled the host by reflex; it's currently repooling.

ABran-WMF changed the task status from Open to In Progress. Fri, Jun 21, 12:15 PM
ABran-WMF triaged this task as Medium priority.
ABran-WMF moved this task from Triage to Blocked on the DBA board.

This is clearly a bug in dumps.

I suspect significantly more pressure is being generated because we deleted the 20240520 dumps in T365501 due to T365155.

Now that the 20240620 run of the dumps is going, there is no content for the dumps process to prefetch from.

This should only happen on this run of the dumps; the next run, on 20240720, will have prefetch available.

SRE got paged again.

22:38 <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.30 seconds 
                   https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
22:42 < mutante> !incidents
22:42 < sirenbot> 4769 (ACKED)  db1206 (paged)/MariaDB Replica Lag: s1 (paged)

The task is marked as in progress but has no person assigned to it.

This is impacting the wikis - bots are receiving maxlag errors.

If we expect that this particularly expensive dumps run is going to take a while, and as a result will cause db1206 to lag behind significantly, would it be possible / make sense to drop the nominal weight on the instance to zero?

IIUC this would avoid using the replica for generic / interactive traffic from MW while it's lagging, but still leave it available for dumps (i.e., it's fully pooled and the dumps group weight is non-zero).
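
A rough sketch of what that could look like with dbctl (the set-weight syntax matches what was run below at 23:28; whether weight 0 with the dump group still pooled behaves as intended, and how the dump group weight itself is edited, would need confirming, possibly via dbctl instance db1206 edit):

sudo dbctl instance db1206 set-weight 0 --section s1      # stop routing generic MW traffic to the replica
sudo dbctl config diff                                    # sanity-check the generated config
sudo dbctl config commit -m "db1206: zero main weight during prefetch-less dumps run T368098"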

23:28 < brett> !log # dbctl instance db1206 set-weight 10 --section s1

Mentioned in SAL (#wikimedia-operations) [2024-06-21T23:43:28Z] <brett@puppetmaster1001> dbctl commit (dc=all): 'set db1206 s1 weight to 1 - T368098', diff saved to https://phabricator.wikimedia.org/P65334 and previous config saved to /var/cache/conftool/dbconfig/20240621-234328-brett.json

Lag went down from over 8 minutes to 4 minutes shortly after the commit.

23:54 <+icinga-wm> RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

23:53 < JJMC89> even with a weight of 1, I'm still hitting db1206

Joe raised the priority of this task from Medium to Unbreak Now! (Edited) Mon, Jun 24, 5:36 AM
Joe subscribed.

Dumps for English Wikipedia caused a full-scale outage as they saturated the network on Saturday night:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535&viewPanel=8

We definitely need to work on fixing whatever causes this amount of work to be done. Even if this is a relative one-off, we might need to kill the dumps generation if it causes issues again, so the logic, which clearly has no cap on the amount of data fetched in parallel, needs to be fixed.

To clarify further, our running hypothesis regarding the outage is that dumps put an extraordinary strain on the databases, which in turn caused a flurry of errors to be logged from MediaWiki, which overwhelmed the network (so a new case of T365655).

Joe renamed this task from "db1206 - replica lag - page - 20240620" to "Dumps generation without prefetch causes disruption to the production environment". Mon, Jun 24, 5:41 AM
Joe updated the task description.
Joe edited projects, added Data-Engineering; removed DBA.

[if a proper fix is going to be time-consuming, at least only running dumps when there are staff around would be an improvement]

[if a proper fix is going to be time-consuming, at least only running dumps when there are staff around would be an improvement]

Or simply running it slower (with a lower number of parallel jobs; currently there are something like five to ten of them on snapshot1012).

[if a proper fix is going to be time-consuming, at least only running dumps when there are staff around would be an improvement]

I don't think that's an option - dumps run for a long time.

Thinking of immediate stopgaps, maybe we can limit the bandwidth for the dumps servers for the time being?

I think we need netops to take a look at what exactly got saturated and why, and maybe take some immediate-term countermeasures. That won't be harmless to dumps, as it can of course make them less performant.
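
Purely as an illustration of a host-level stopgap (a sketch only; the interface name and rate are placeholders, and netops may well prefer switch-side policing instead):

sudo tc qdisc add dev eno1 root tbf rate 2gbit burst 10mb latency 50ms    # cap egress from a snapshot host
sudo tc qdisc del dev eno1 root                                           # lift the cap once things settle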

I think actually @Ladsgroup's proposal seems like the easier solution in the short term. @xcollazo do you see any complication with reducing parallelism of the enwiki dumps?

enwiki finished dumping at 2024-06-22 19:52:29, which doesn't correlate with the network event, which started around 2024-06-22 22:46:00.

However snapshot1012 is currently dumping commonswiki, so I suspect this was the one running when the event happened.

Additionally, we are running both the 20240620 (i.e. partial dump) and the 20240601 (i.e. full dump) of wikidatawiki. The full dump of wikidatawiki did have a network spike around the event time.

All these backed-up runs are unfortunate; sorry for the outage, folks.

@xcollazo do you see any complication with reducing parallelism of the enwiki dumps?

enwiki is now done, but I will monitor the run of commonswiki while I look into whether we can temporarily lower the parallelism.

That would fit with the observed outage of the s4 DBs (which host Commons) at the start of the incident.

Change #1049182 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Introduce rsyslog udp2log rate limiting

https://gerrit.wikimedia.org/r/1049182

Ok, I want to revise my previous assessment.

Nominal network usage with bursts of up to ~90 MB/s is common from the snapshot10* servers. This can be seen in the plots over the last 30 days:

snapshot1010: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1010&var-datasource=thanos&var-cluster=dumps&from=1716652461762&to=1719244461762&viewPanel=8
snapshot1011: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1011&var-datasource=thanos&var-cluster=dumps&from=1716652461762&to=1719244461762&viewPanel=8
snapshot1012: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1716652461762&to=1719244461762&viewPanel=8
snapshot1013: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1013&var-datasource=thanos&var-cluster=dumps&from=1716652461762&to=1719244461762&viewPanel=8

After observing the current commonswiki dump, I can see that there are ~50 MB/s bursts attributable to MD5 and SHA1 calculations against the NFS share that hosts the dump artifacts. Similarly, every 4 hours the rsync process that copies the artifacts to the clouddumps servers kicks in, which also causes high network usage for a relatively short time span.

So it seems from a network usage perspective the dumps are working normally.

Coming back to:

I think actually @Ladsgroup's proposal seems like the easier solution in the short term. @xcollazo do you see any complication with reducing parallelism of the enwiki dumps?

After perusing the code, the only existing mechanism to reduce parallelism in the current dumps infrastructure seems to be the slots mechanism. For each snapshot server, we can control how many concurrent jobs are allowed by modifying this setting; the current maximum is 28 slots per host, set via the profile::dumps::generation::worker::dumper::maxjobs Puppet config.

However, this config is global, meaning all jobs from the dumps will be affected. Dumping already takes around ~18 days to finish a full dump and around ~8 days to finish a partial dump, which leaves us with just a couple more days in the month to do reruns or restarts when needed.
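
For reference, lowering that cap would presumably be a one-line Hiera override of the setting named above (the value 20 here is only an example, not a recommendation):

profile::dumps::generation::worker::dumper::maxjobs: 20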

I want to offer a solution, though, in the next comment.

Ok after observing the dumps for a bit now, here are some conclusions:

  • I still believe the extra pressure against the databases is attributable to not having prefetch, as discussed in T368098#9913625.
  • I do not think network usage bursts coming from the snapshot servers are an indication of a bug, as discussed in T368098#9918320.
  • There is an existing mechanism to control parallelism in dumps, called slots (unrelated to MCR slots), however this is currently a global setting that I would rather not lower. More discussion in T368098#9918849.

The hypothesis I have is that we will not see this issue once we are done with this particular run of the dumps that doesn't have prefetch. And in fact, up until now in my tenure as (reluctant) dumps maintainer, this is the first time I have seen a run affecting a database so heavily.

Here is a plan of action, considering this is an unbreak now:

  1. I will continue monitoring the remaining two wikis related to the offending 20240620 run: commonswiki and wikidatawiki.
  2. In the event that we have another SRE being paged, I will kill the remaining 20240620 dumps runs.
  3. In a couple of days, we will start the next run of the dumps, namely the 20240701 run. I do not expect that run to be problematic, as it will leverage prefetching. However, in the event that we do see similar issues, I will work with @BTullis to lower the dumps slots, and/or pause the dumps as necessary.

Let me know if this sounds reasonable.

Mentioned in SAL (#wikimedia-operations) [2024-06-24T20:21:44Z] <mutante> snapsho1017 - systemctl mask commonsrdf-dump ; systemctl mask commonsjson-dump T368098

This happened again today, starting at 20:05, and resulted in basically the same issues we observed over the weekend and in the previous incident. I have not identified the query and I leave that to people who know better, but the steps we took to resolve this were:

taavi@snapshot1017 ~ $ sudo systemctl stop commons*.service
<mutante>	snapsho1017 - systemctl mask commonsrdf-dump ; systemctl mask commonsjson-dump T368098

(Please note: Puppet is still disabled there so that it doesn't start the service.)

And then subsequently, a patch by @Ladsgroup to reduce the log severity https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049261.

This took basically everything down: wikis returned errors, Phabricator was down, Toolforge was down. Large user impact. We had to interrupt the dumps and ensure they don't start again this time.

The working theory is that DBs get overwhelmed by the expensive query and then MW starts _massively_ logging which saturates the interface on mwlog*.

I just spot-checked some of the runPrimaryTransactionIdleCallbacks logs that made it through the 99% throttle in logstash during the "precursor" event starting shortly after 19:30 UTC (logs during the "main" event starting shortly after 20:00 are pretty incomplete, at least at the moment).

At least during that earlier event, this looks quite similar to what we saw T365655#9823954: every stack trace I looked at was handling EmergencyTimeoutException by way of MWExceptionHandler::rollbackPrimaryChangesAndLog.

That first event is also clearly visible in the traffic graphs for cr1-eqiad <> asw2-c-eqiad (i.e., toward mwlog1002), but much smaller than the later event - ~ 10 Gbps / no out-discards vs. nearly 40 Gbps / 520k pps out-discards.

Edit: sflow data that shows both events is now available: https://w.wiki/AUec (aggregate volume toward mwlog1002).

Ok after observing the dumps for a bit now, here are some conclusions:

  • I still believe the extra pressure against the databases is attributable to not having prefetch, as discussed in T368098#9913625.
  • I do not think network usage bursts coming from the snapshot servers are an indication of a bug, as discussed in T368098#9918320.
  • There is an existing mechanism to control parallelism in dumps, called slots (unrelated to MCR slots), however this is currently a global setting that I would rather not lower. More discussion in T368098#9918849.

The hypothesis I have is that we will not see this issue once we are done with this particular run of the dumps that doesn't have prefetch. And in fact, up until now in my tenure as (reluctant) dumps maintainer, this is the first time I have seen a run affecting a database so heavily.

Here is a plan of action, considering this is an unbreak now:

  1. I will continue monitoring the remaining two wikis related to the offending 20240620 run: commonswiki and wikidatawiki.
  2. In the event that we have another SRE being paged, I will kill the remaining 20240620 dumps runs.
  3. In a couple of days, we will start the next run of the dumps, namely the 20240701 run. I do not expect that run to be problematic, as it will leverage prefetching. However, in the event that we do see similar issues, I will work with @BTullis to lower the dumps slots, and/or pause the dumps as necessary.

Let me know if this sounds reasonable.

Overall the plan sounds reasonable, although we did cause another outage last night, and thus commons dumps are disabled at the moment. Let's wait for the SRE mitigations to be in place before restarting them.

The trigger seems to be heavy pressure imposed on s4:
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1719253892503&to=1719262567130&viewPanel=9&var-site=eqiad&var-group=core&var-shard=s4&var-role=All


This caused a general slowdown, which led MediaWiki to start spitting out 5 billion logs, which congested the network and made everything slow, sustaining the downward spiral.

I need to check what kind of query to commons caused this.


Some notes:
I think the current dumps have logical errors that make them really inefficient. Let me give you an example from during the outage. Here are the hits to the WAN cache during the peak of the commons dumper: https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=1719088287268&to=1719102615017


That's around 100M hits per minute. For comparison, commons only has 100M images in total, so it basically goes through the whole of commons once a minute, for twenty minutes. Even if we assume it's going through all revisions of commons, at that rate it should have been done dumping all of commons with all of its history in the time it took to cause the outage. The fact that it's taking days or weeks is quite scary to me.

Another inefficiency brought up by @daniel is that it uses memcached for getting revisions of all pages. That fills memcached with entries that would never be accessed (see miss rate in the graph above) and reduces overall efficiency of our memcached by forcing eviction of useful entries.

The other note is that the dumps are really big hammers, e.g. they are making queries with a limit of 50,000. This is way too high; the highest we go in production is 10,000. Daniel tracked it down to T207628#4690924, but I think that needs revisiting: that benchmark was done in isolation, once, on a small wiki. The dumps are currently doing ten to fifteen of these queries at the same time for wikis like enwiki and commons. This large hammer wasn't a big issue until we actually had to change something; it's exposing a problem that was masked by tuning in other places. I highly encourage you to lower that limit to 10,000. It won't make the dumps that much slower (certainly not five times slower).

I will dig even deeper but I have to go afk right now.

Change #1049182 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Introduce rsyslog udp2log rate limiting

https://gerrit.wikimedia.org/r/1049182

Change #1049532 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log

https://gerrit.wikimedia.org/r/1049532

Change #1049532 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log

https://gerrit.wikimedia.org/r/1049532

Mentioned in SAL (#wikimedia-operations) [2024-06-25T12:46:28Z] <cgoubert@deploy1002> Started scap: Deploy udp2log rate-limiting - T365655 - T368098

Mentioned in SAL (#wikimedia-operations) [2024-06-25T12:51:40Z] <cgoubert@deploy1002> Finished scap: Deploy udp2log rate-limiting - T365655 - T368098 (duration: 05m 49s)

Question: what went wrong with the dump replica groups? I believe (or at least it used to be the case) that dumps only use databases in the dump replica group. Was it overloaded so that the load jumped to other DBs? If yes, can that be prevented or used to help throttle the dumps?

Question: what went wrong with the dump replica groups? I believe (or at least it used to be the case) that dumps only use databases in the dump replica group. Was it overloaded so that the load jumped to other DBs? If yes, can that be prevented or used to help throttle the dumps?

We still have dumps group in each section but:

  • We have had many bugs where the dumpers sent at least a portion of their queries to the general group
  • Replicas in the dump group also serve general traffic (usually with a lower weight), and a dump replica becoming slow can slow down a certain portion of normal requests as well. It should be a low percentage, but combined with bugs in MW, I assume it can trigger a flood. (A rough sketch of that group configuration is below.)
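
A rough sketch of how that coupling looks in a dbctl instance record (simplified and from memory; treat the field names and values as illustrative rather than the exact dbctl schema). The point is that the same instance carries both a non-zero main weight and a dump group weight:

db1206:
  sections:
    s1:
      pooled: true
      weight: 50            # generic MW traffic still lands here
      groups:
        dump:
          pooled: true
          weight: 100       # dumps traffic
        vslow:
          pooled: true
          weight: 1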

...
That's around 100M hits per minute. For comparison, commons only has 100M images in total, so it basically goes through the whole of commons once a minute, for twenty minutes. Even if we assume it's going through all revisions of commons, at that rate it should have been done dumping all of commons with all of its history in the time it took to cause the outage. The fact that it's taking days or weeks is quite scary to me.

Another inefficiency brought up by @daniel is that it uses memcached for getting revisions of all pages. That fills memcached with entries that would never be accessed (see miss rate in the graph above) and reduces overall efficiency of our memcached by forcing eviction of useful entries.

The current dump implementation is certainly not ideal, but at least from my side there are no cycles to modify its general behavior. It is in maintenance mode. We are working on a new implementation that will not be coupled to MediaWiki and thus avoid having a bespoke batch system hitting the same databases that production workloads run against.

...
The other note is that the dumps are really big hammers, e.g. they are making queries with a limit of 50,000. This is way too high; the highest we go in production is 10,000. Daniel tracked it down to T207628#4690924, but I think that needs revisiting: that benchmark was done in isolation, once, on a small wiki. The dumps are currently doing ten to fifteen of these queries at the same time for wikis like enwiki and commons. This large hammer wasn't a big issue until we actually had to change something; it's exposing a problem that was masked by tuning in other places. I highly encourage you to lower that limit to 10,000. It won't make the dumps that much slower (certainly not five times slower).

Thank you for finding this. I agree this should not make dumps significantly slower. I will go ahead and change it.

I wonder what changed, though, because that 50k setting has not changed since Oct 2018, so ~6 years. I want to understand better why it is that the databases can't take the load from this particular run. In T368098#9919385 we disabled the commons RDF dumps, but that is not explained by my prefetch theory, as the commons RDF dumps do not rely on that mechanism.

...

  • Replicas in the dump group also serve general traffic

Can we modify this behavior? Do we need to couple dumps with general production traffic?

Change #1049617 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[mediawiki/core@master] On T207628, as part of performance tuning, WikiExporter's BATCH_SIZE was set to 50000. There was no decisive benchmarking done to support that size. We have recently seen a lot of database pressure coming from this configuration.

https://gerrit.wikimedia.org/r/1049617

Change #1049623 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Dumps: suppress blob cache while generating dumps

https://gerrit.wikimedia.org/r/1049623

Change #1049617 merged by jenkins-bot:

[mediawiki/core@master] Modify WikiExporter's BATCH_SIZE from 50000 to 10000

https://gerrit.wikimedia.org/r/1049617

Change #1049982 had a related patch set uploaded (by Ladsgroup; author: Xcollazo):

[mediawiki/core@wmf/1.43.0-wmf.11] Modify WikiExporter's BATCH_SIZE from 50000 to 10000

https://gerrit.wikimedia.org/r/1049982

Change #1049984 had a related patch set uploaded (by Ladsgroup; author: Xcollazo):

[mediawiki/core@wmf/1.43.0-wmf.10] Modify WikiExporter's BATCH_SIZE from 50000 to 10000

https://gerrit.wikimedia.org/r/1049984

WDoranWMF lowered the priority of this task from Unbreak Now! to High. Wed, Jun 26, 4:19 PM

Change #1049984 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.10] Modify WikiExporter's BATCH_SIZE from 50000 to 10000

https://gerrit.wikimedia.org/r/1049984

Change #1049982 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.11] Modify WikiExporter's BATCH_SIZE from 50000 to 10000

https://gerrit.wikimedia.org/r/1049982

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:05:56Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:08:53Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwd

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:14:49Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] (duration: 08m 52s)