User impact API: Maintenance scripts should defer work to the job queue
Closed, Resolved, Public

Description

Per Amir:

The architecture of "let's update data from services by introducing regular cron maint scripts" is okay for small cases or a small number of wikis, but it has been creeping up in many places, including Growth experiments, and is quite unsustainable in so many ways:

  • It's not distributed: all of our mw crons are on mwmaint1002, which is basically a single point of failure. Any noisy neighbor can cause wide-scale disruption.
  • It's quite wasteful. The updates usually happen by checking the whole wiki or something like that. It needs a more robust event-driven architecture: you backfill the data once, and with any change you trigger a job to update that page.
  • Time-wise it is problematic. We don't yet have a central catalog of mw crons and when they get started. They put different levels of pressure on our system, and if this way of doing things continues, in no time we will have outages caused by concurrent mw scripts bringing down a database or something like that. The distribution of such changes must be automatic, not done by guessing or picking "low-load" times and crossing our fingers.
  • There are no criticality levels in mw maint scripts. Higher-priority scripts are being run in the same place as low-priority ones. It is quite possible that a low-prio script could cause issues for high-prio scripts (manual or automatic), e.g. the ones that clean up old private data so we can comply with data retention policies.
  • This is basically taking a system that is already fragile and making it even more fragile.
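
The event-driven pattern described in the second bullet (backfill once, then update only what changed) could be sketched roughly as below. This is a minimal Python illustration and not MediaWiki code: `JobQueue`, `backfill`, `on_page_changed`, and `recompute` are hypothetical names, and the in-memory queue only stands in for a real job queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable


class JobQueue:
    """Hypothetical in-memory stand-in for a real job queue."""

    def __init__(self) -> None:
        self._jobs: Deque[str] = deque()

    def push(self, page: str) -> None:
        self._jobs.append(page)

    def run_all(self, handler: Callable[[str], None]) -> int:
        """Run every queued job and return how many were executed."""
        count = 0
        while self._jobs:
            handler(self._jobs.popleft())
            count += 1
        return count


def recompute(page: str) -> str:
    """Placeholder for the expensive per-page computation."""
    return f"data for {page}"


def backfill(pages: Iterable[str], store: Dict[str, str]) -> None:
    """One-off pass over everything, done a single time."""
    for page in pages:
        store[page] = recompute(page)


def on_page_changed(page: str, queue: JobQueue) -> None:
    """Change hook: queue work for the changed page only."""
    queue.push(page)


store: Dict[str, str] = {}
queue = JobQueue()
backfill(["A", "B", "C"], store)   # initial backfill
on_page_changed("B", queue)        # later, a single page changes
executed = queue.run_all(lambda page: store.update({page: recompute(page)}))
```

After the backfill, only the one changed page is recomputed, instead of every page being rechecked on a schedule.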

Generally I'm okay with having crons that clean up data, but regular updates from services seem wrong. They should build pipelines to update the database (mostly through MediaWiki jobs), and then they can have monthly "let's update everything" crons.

One thing on top of that: if you think there is no way that you can avoid maint scripts, it's fine to keep them. One rather simple solution would be to trigger a maint script that queues jobs. That's good.
Part of this problem is also something SRE should fix. With the move to mw-on-k8s, a maint script will become a Docker container that runs and, once done, just dies. That's much more robust and scalable, but we are not there yet.
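
The "maint script that queues jobs" idea could look something like the sketch below: the cron-triggered script only selects users and enqueues work, while the actual computation happens later in job runners. This is an illustrative Python sketch under stated assumptions. `RefreshJob`, `JobQueue`, `maintenance_script`, and `run_job` are hypothetical names and do not reflect the GrowthExperiments implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class RefreshJob:
    """Hypothetical job spec: one unit of deferred work."""
    user_id: int


@dataclass
class JobQueue:
    """Hypothetical in-memory stand-in for a distributed job queue."""
    pending: List[RefreshJob] = field(default_factory=list)

    def push(self, job: RefreshJob) -> None:
        self.pending.append(job)


def maintenance_script(user_ids: Iterable[int], queue: JobQueue) -> int:
    """Cron entry point: selects users and enqueues jobs, nothing more."""
    queued = 0
    for user_id in user_ids:
        queue.push(RefreshJob(user_id))
        queued += 1
    return queued


def run_job(job: RefreshJob, store: Dict[int, str]) -> None:
    """Executed later by a job runner, possibly on another host."""
    store[job.user_id] = f"impact data for user {job.user_id}"


queue = JobQueue()
queued = maintenance_script([1, 2, 3], queue)

# Simulate the job runners draining the queue.
store: Dict[int, str] = {}
for job in queue.pending:
    run_job(job, store)
```

The script itself stays cheap and short-lived, and the heavy work is spread across whatever capacity the job queue has.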

Event Timeline

It needs a more robust event-driven architecture.

Unfortunately that doesn't seem possible for UserImpact data, because every day we need to recalculate page view data for each article the user has edited, in order to show the user their most-viewed articles.
If it weren't for that, I think we could do something like:

  • UserImpactHandler would compute and store impact data on the fly for eligible users (created in the last year, edited in the last X days, with homepage enabled, etc.) when it is not found in the database.
  • This chain of patches would then take care of refreshing data when the user edits or receives thanks.
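
The first bullet of this hypothetical alternative is essentially a compute-on-miss lookup. A rough Python sketch, assuming a simple dictionary as the data store and a single `eligible` flag in place of the real eligibility checks (all names here are illustrative, not the extension's API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class User:
    user_id: int
    eligible: bool  # stands in for the real eligibility checks


def get_impact(
    user: User,
    store: Dict[int, dict],
    compute: Callable[[User], dict],
) -> Optional[dict]:
    """Return stored data, computing and storing it on a miss."""
    cached = store.get(user.user_id)
    if cached is not None:
        return cached
    if not user.eligible:
        return None  # nothing is computed for ineligible users
    data = compute(user)
    store[user.user_id] = data
    return data


calls: List[int] = []


def compute_impact(user: User) -> dict:
    """Placeholder computation that records when it is invoked."""
    calls.append(user.user_id)
    return {"user": user.user_id, "edits": 0}  # placeholder values


store: Dict[int, dict] = {}
first = get_impact(User(7, True), store, compute_impact)
second = get_impact(User(7, True), store, compute_impact)  # served from store
skipped = get_impact(User(8, False), store, compute_impact)
```

The sketch does not cover refreshing stale data, which is the part that the daily page view recalculation makes difficult here.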

But as it stands, we need the daily update for refreshing the page view data. Converting the maintenance script to use jobs seems fairly straightforward.

kostajh triaged this task as Medium priority.

Change 854980 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] UserImpact: Add option to use job queue

https://gerrit.wikimedia.org/r/854980

Change 855546 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/puppet@production] GrowthExperiments: Use job queue for refreshUserImpact script

https://gerrit.wikimedia.org/r/855546

Change 855525 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.8] refreshUserImpactData: Add option to use job queue

https://gerrit.wikimedia.org/r/855525

Change 854980 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] refreshUserImpactData: Add option to use job queue

https://gerrit.wikimedia.org/r/854980

Change 855525 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.8] refreshUserImpactData: Add option to use job queue

https://gerrit.wikimedia.org/r/855525

Mentioned in SAL (#wikimedia-operations) [2022-11-10T14:58:23Z] <kharlan@deploy1002> Started scap: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]]

Mentioned in SAL (#wikimedia-operations) [2022-11-10T14:58:42Z] <kharlan@deploy1002> kharlan and kharlan: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-11-10T15:03:10Z] <kharlan@deploy1002> Finished scap: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]] (duration: 04m 47s)

Change 855546 merged by RLazarus:

[operations/puppet@production] GrowthExperiments: Use job queue for refreshUserImpact script

https://gerrit.wikimedia.org/r/855546

Mentioned in SAL (#wikimedia-operations) [2022-11-10T17:18:41Z] <rzl> rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered.service # test run for T322706 T322541

Mentioned in SAL (#wikimedia-operations) [2022-11-10T17:23:38Z] <rzl> rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited.service # test run for T322706 T322541

Mentioned in SAL (#wikimedia-operations) [2022-11-10T17:34:27Z] <rzl> rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactDelete.service # test run for T322706 T322541

Change 858557 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] [WIP] RefreshUserImpactJob: De-duplicate based on user ID only

https://gerrit.wikimedia.org/r/858557

Change 858670 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Batch user impact data updates via RefreshUserImpactJob

https://gerrit.wikimedia.org/r/858670

Change 858670 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Batch user impact data updates via RefreshUserImpactJob

https://gerrit.wikimedia.org/r/858670

Change 858557 abandoned by Kosta Harlan:

[mediawiki/extensions/GrowthExperiments@master] RefreshUserImpactJob: De-duplicate based on user ID only

Reason:

https://gerrit.wikimedia.org/r/858557

Change 859970 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Makeshift deduplication for RefreshUserImpactJob

https://gerrit.wikimedia.org/r/859970

Change 859970 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Makeshift deduplication for RefreshUserImpactJob

https://gerrit.wikimedia.org/r/859970
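
The patch titles above mention batching and deduplication for RefreshUserImpactJob. Their actual implementation is not shown in this task, so the following is only a generic Python sketch of the two ideas: dropping repeated user IDs, then grouping the remainder into fixed-size batches, each of which would become one job. The function names and the batch size are assumptions.

```python
from typing import Iterable, Iterator, List


def dedupe(user_ids: Iterable[int]) -> List[int]:
    """Drop repeated user IDs, keeping first-seen order."""
    return list(dict.fromkeys(user_ids))


def batches(user_ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` user IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(user_ids), size):
        yield user_ids[start:start + size]


# Each inner list would become the payload of one queued job.
job_payloads = list(batches(dedupe([5, 3, 5, 9, 3, 1, 7]), size=2))
```

Batching reduces the number of jobs pushed to the queue, and deduplicating first avoids refreshing the same user twice in one run.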