Page MenuHomePhabricator

UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages
Closed, ResolvedPublic

Description

We have had partial outages (could have been full outages if we didn't have circuit breaking) because of these queries. See P74902 for the slow log details.

The queries scan tens of millions of rows while it can be rewritten to scan only a couple thousands. We have disabled the script to protect the infrastructure

Details

Related Changes in Gerrit:
SubjectRepoBranchLines +/-
mediawiki/extensions/GrowthExperimentsmaster+72 -48
mediawiki/extensions/GrowthExperimentsmaster+1 -1
operations/puppetproduction+0 -1
operations/puppetproduction+1 -0
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.24+130 -110
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.24+72 -48
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.24+149 -30
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.25+130 -110
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.25+72 -48
mediawiki/extensions/GrowthExperimentswmf/1.44.0-wmf.25+149 -30
mediawiki/extensions/GrowthExperimentsmaster+130 -110
mediawiki/extensions/GrowthExperimentsmaster+149 -30
Show related patches Customize query in gerrit

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change #1135983 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mediawiki: Absent updatementeedata jobs

https://gerrit.wikimedia.org/r/1135983

Specifically the last query in the log scans 704 million rows.

Aklapper renamed this task from UncachedMenteeOverviewDataProvider query is extremely aggersive causing partial outages to UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages.Apr 11 2025, 3:48 PM

Change #1135983 merged by Ladsgroup:

[operations/puppet@production] mediawiki: Absent updatementeedata jobs

https://gerrit.wikimedia.org/r/1135983

KStoller-WMF added a subscriber: Etonkovidova.

Thanks, @Ladsgroup.

@Etonkovidova - am I correct in thinking that this means the Mentor Dashboard data will load but no longer get refreshed regularly now?

If that's true it sounds like this is something we should look into ASAP (next week).

@Ladsgroup: is this something new that changed recently or has this problem going on for a long time?

@Ladsgroup: is this something new that changed recently or has this problem going on for a long time?

It has been going on for almost a month now but I think the issue has been around for a long time, something tipped the balance to make it actually cause outages recently (more people signed up as mentor for example) but it doesn't really matter.

I suggest doing multiple queries: you have the user ids, get the list of indef blocked ones via one query, get list of the ones that have enabled the preferences via another query and so on. On top of that, it's better to chunk up the list to a smaller set of let's say 100 user ids

more people signed up as mentor for example

On enwiki, there are only 10 more mentors than in January, and most of them joined when T384505 was announced.

Michael moved this task from Incoming to Doing on the Growth-Team (Current Sprint) board.
Michael edited projects, added GrowthExperiments-Mentorship; removed GrowthExperiments.

more people signed up as mentor for example

On enwiki, there are only 10 more mentors than in January, and most of them joined when T384505 was announced.

Mh, this might be what tipped the scales in the sense that it causes significantly more mentees for enwiki.

I suggest doing multiple queries: you have the user ids, get the list of indef blocked ones via one query, get list of the ones that have enabled the preferences via another query and so on. On top of that, it's better to chunk up the list to a smaller set of let's say 100 user ids

Thanks! I'll see what I can do.

While I have your attention. Please take a look at T391405: CommunityStructuredMentorProvider::getMentors is making uncached database query on every logged-in page view twice too (of course much lower priority). Thanks!

Goes onto the list. (Still need to create the actual list, for now the one in my head.)

Change #1136393 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136393

Change #1136412 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136412

Thanks, @Ladsgroup.

@Etonkovidova - am I correct in thinking that this means the Mentor Dashboard data will load but no longer get refreshed regularly now?

If that's true it sounds like this is something we should look into ASAP (next week).

I didn't observe it for wmf.24. I was doing quite a bit of testing for updating Mentor Dashboard when mentees are re-assigned for Mentors being away. I checked testwiki wmf.24 - also, there was no issues. But, of course, on a wikis with different scale of mentor/mentees numbers, it could an issue.

Change #1136664 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136664

Change #1136665 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] metrics(Mentorship): use detailed profiling data from updateMenteeData

https://gerrit.wikimedia.org/r/1136665

Change #1136393 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136393

Change #1136412 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136412

Change #1136664 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136664

Change #1136696 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136696

Change #1136698 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136698

Change #1136700 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136700

Change #1136701 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136701

Change #1136702 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136702

Change #1136703 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136703

Change #1136701 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136701

Change #1136702 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136702

Change #1136703 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.25] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136703

Mentioned in SAL (#wikimedia-operations) [2025-04-15T13:38:02Z] <cgoubert@deploy1003> Started scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-15T13:49:42Z] <cgoubert@deploy1003> migr, cgoubert: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia

Mentioned in SAL (#wikimedia-operations) [2025-04-15T13:56:29Z] <cgoubert@deploy1003> Finished scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 27s)

Change #1136696 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136696

Change #1136698 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136698

Change #1136700 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.24] perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider

https://gerrit.wikimedia.org/r/1136700

Mentioned in SAL (#wikimedia-operations) [2025-04-15T13:59:57Z] <cgoubert@deploy1003> Started scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:rOPUP1136698535bb|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-15T14:11:50Z] <cgoubert@deploy1003> migr, cgoubert: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:rOPUP1136698535bb|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia

Mentioned in SAL (#wikimedia-operations) [2025-04-15T14:18:28Z] <cgoubert@deploy1003> Finished scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:rOPUP1136698535bb|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 30s)

Change #1136970 had a related patch set uploaded (by Michael Große; author: Michael Große):

[operations/puppet@production] Revert "mediawiki: Absent updatementeedata jobs"

https://gerrit.wikimedia.org/r/1136970

Mentioned in SAL (#wikimedia-operations) [2025-04-16T10:17:03Z] <MichaelG_WMF> migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.25$ mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose #T391695

Mentioned in SAL (#wikimedia-operations) [2025-04-16T10:23:21Z] <MichaelG_WMF> migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki frwiki --verbose #T391695

Mentioned in SAL (#wikimedia-operations) [2025-04-16T10:29:50Z] <MichaelG_WMF> migr@mwmaint1002:/srv/mediawiki/php-1.44.0-wmf.24$ time mwscript ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki ruwiki --verbose #T391695

Logged results from running the script as a test at some wikis:

testwiki:

Profiling data:
  * filtermentees: 1.32 seconds
  * edittimestampinternal: 0.13 seconds
  * usernames: 0.02 seconds
  * reverted: 0.12 seconds
  * questions: 0.11 seconds
  * editcount: 0.02 seconds
  * registration: 0.02 seconds
  * blocks: 0.17 seconds
===============
Done. Took 11 seconds.

frwiki:

Profiling data:
  * filtermentees: 108.66 seconds
  * edittimestampinternal: 0.96 seconds
  * usernames: 0.47 seconds
  * reverted: 18.26 seconds
  * questions: 24.8 seconds
  * editcount: 0.41 seconds
  * registration: 0.38 seconds
  * blocks: 3.5 seconds
===============
Done. Took 261 seconds.

ruwiki

Profiling data:
  * filtermentees: 86.66 seconds
  * edittimestampinternal: 0.8 seconds
  * usernames: 0.25 seconds
  * reverted: 9.03 seconds
  * questions: 12.02 seconds
  * editcount: 0.22 seconds
  * registration: 0.21 seconds
  * blocks: 1.96 seconds
===============
Done. Took 173 seconds.

I'm not seeing any major bumps during these runs on db1180:
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1180&var-port=9104&viewPanel=3&from=1744782042787&to=1744803642787

grafik.png (1×3 px, 250 KB)

so I think it's good to go now, maybe keep an eye on it after the re-enabling it for a while though.

Change #1136970 merged by Ladsgroup:

[operations/puppet@production] Revert "mediawiki: Absent updatementeedata jobs"

https://gerrit.wikimedia.org/r/1136970

I will wait for a bit to see if all is well.

Re-enabling seems to have mostly worked well, except we're getting some mixed signals for s1/enwiki:

On the one hand, there are success messages:

migr@mwmaint1002:~$ tail /var/log/mediawiki/mediawiki_job_growthexperiments-updateMenteeData-s1/syslog.log
Apr 16 15:15:00 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: -----------------------------------------------------------------
Apr 16 15:15:00 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: enwiki
Apr 16 15:15:00 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: -----------------------------------------------------------------
Apr 16 15:55:16 mwmaint1002 mediawiki_job_growthexperiments-updateMenteeData-s1[31703]: enwiki:  Done. Took 2416 seconds.

and

sudo systemctl status mediawiki_job_growthexperiments-updateMenteeData-s1.service
Main PID: 31703 (code=exited, status=0/SUCCESS)

but on the other hand there are also error messges:

FIRING: [2x] SystemdUnitFailed: mediawiki_job_growthexperiments-updateMenteeData-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

and

Apr 16 15:18:28 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-updateMenteeData-s1.service: Current command vanished from the unit file, execution of the command list won't be resumed.

See today's log of #wikimedia-operations for the chatter: https://wm-bot.wmcloud.org/logs/%23wikimedia-operations/20250416.txt

Not sure what to make of all that.

I'm not seeing the alert in alert manager anymore (https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed) let's call it a transient error? As long as enwiki is getting updated, it should be fine.

Change #1136665 abandoned by Michael Große:

[mediawiki/extensions/GrowthExperiments@master] metrics(Mentorship): use detailed profiling data from updateMenteeData

Reason:

This has bigger issues, in particular: we need to add more buckets to be able to show durations above 60s seconds: https://wikimedia.slack.com/archives/C06HBTGUM2S/p1744715294350989

s1 / enwiki currently takes are 10 minutes (based on Graphite tracking)

https://gerrit.wikimedia.org/r/1136665