
[L] Change how we send image-suggestions notifications to experienced users
Closed, Resolved, Public

Description

At the moment we run a scheduled maintenance script to send image-suggestions notifications to experienced users.

We need to change how this is done before we start sending section-image-suggestions notifications:

From @Ladsgroup:

The architecture of "let's update data from services by introducing regular cron maint scripts" is okay for small cases or a small number of wikis, but it has been creeping up in many places, including Growth experiments, and is quite unsustainable in many ways:

  • It's not distributed: all of our mw crons run on mwmaint1002, which is basically a single point of failure. Any noisy neighbor can cause wide-scale disruption.
  • It's quite wasteful. The updates usually happen by checking all of a wiki or something like that. It needs a more robust event-driven architecture: you backfill the data once, and with any change you trigger a job to update that page.
  • Time-wise it is problematic. We don't yet have a central catalog of mw crons and when they start. They put different levels of pressure on our system, and if this way of doing things continues, in no time we will have outages caused by concurrent mw scripts bringing down a database or something like that. The distribution of such changes must be automatic, not done by guessing or picking "low-load" times and crossing our fingers.
  • There are no criticality levels in mw maint scripts. Higher-priority scripts are being run in the same place as low-priority ones. It is quite possible that a low-prio script could cause issues for high-prio scripts (manual or automatic), e.g. the ones that clean up old private data so we can comply with data retention policies.
  • This is basically taking a system that is already fragile and making it even more fragile.

Generally I'm okay with having crons that clean up data, but regular updates from services seem wrong. They should build pipelines to update the database (mostly through mediawiki jobs), and then they can have monthly "let's update everything" crons.
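The pattern described above (backfill once, update a single page whenever it changes, keep a rare full refresh as a safety net) can be sketched as follows. This is a minimal illustration in Python, not MediaWiki code: `SuggestionStore`, `fetch`, `on_page_changed` and the sample data are hypothetical names introduced here, and in a real deployment the change handler would presumably queue a job rather than do the work inline.

```python
from typing import Callable, Dict, Iterable


class SuggestionStore:
    """Hypothetical local copy of data that lives in an external service."""

    def __init__(self, fetch: Callable[[int], str]) -> None:
        self._fetch = fetch  # stand-in for a call to the external service
        self.rows: Dict[int, str] = {}
        self.fetch_calls = 0  # counts how much work we ask of the service

    def update_page(self, page_id: int) -> None:
        """Refresh the stored row for a single page."""
        self.fetch_calls += 1
        self.rows[page_id] = self._fetch(page_id)

    def backfill(self, page_ids: Iterable[int]) -> None:
        """One-off (or rare, e.g. monthly) full refresh of every page."""
        for page_id in page_ids:
            self.update_page(page_id)

    def on_page_changed(self, page_id: int) -> None:
        """Event handler: only the page that changed is refreshed."""
        self.update_page(page_id)


# Assumed sample data standing in for the service's view of three pages.
versions = {1: "a", 2: "b", 3: "c"}
store = SuggestionStore(lambda page_id: versions[page_id])

store.backfill(sorted(versions))  # initial backfill touches every page
versions[2] = "b2"                # page 2 changes upstream
store.on_page_changed(2)          # one targeted update, not a full re-scan
```

After the initial backfill, each change costs one fetch for the affected page instead of a pass over the whole wiki.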

Event Timeline

Hi,
You can do something rather simple by hollowing out the job. That's what we do in refreshlinks.

  • Queue a job with start id = 1 and batch size of 1000
  • Check 1k articles in that batch, check for notification, etc.
  • Queue next job with start id of 1001
  • return true

With that you wouldn't queue hundreds of thousands of jobs, you wouldn't run a "master job" that could take hours to finish (and be killed after a timeout), and you would avoid queuing potentially thousands of jobs at once, which could choke the whole job queue.
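The steps above can be sketched as a self-chaining batch job. This is a rough Python simulation under stated assumptions, not the refreshlinks or ImageSuggestions implementation: the job queue is modelled as a plain `deque`, and `BatchJob`, `run_job`, `drain` and `process_page` are hypothetical names introduced for illustration.

```python
from bisect import bisect_left
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass(frozen=True)
class BatchJob:
    start_id: int
    batch_size: int


def run_job(
    job: BatchJob,
    queue: Deque[BatchJob],
    page_ids: List[int],
    process_page: Callable[[int], None],
) -> bool:
    """Handle one batch of page ids, then queue the follow-up job."""
    end_id = job.start_id + job.batch_size  # exclusive upper bound
    # page_ids must be sorted; locate the slice covered by this batch.
    lo = bisect_left(page_ids, job.start_id)
    hi = bisect_left(page_ids, end_id)
    for page_id in page_ids[lo:hi]:
        process_page(page_id)  # e.g. check whether a notification is due
    if hi < len(page_ids):
        # Pages remain beyond this batch: chain exactly one more job.
        queue.append(BatchJob(end_id, job.batch_size))
    return True


def drain(
    queue: Deque[BatchJob],
    page_ids: List[int],
    process_page: Callable[[int], None],
) -> Tuple[int, int]:
    """Simulated job runner; returns (jobs run, largest pending queue size)."""
    jobs_run = 0
    max_pending = 0
    while queue:
        max_pending = max(max_pending, len(queue))
        run_job(queue.popleft(), queue, page_ids, process_page)
        jobs_run += 1
    return jobs_run, max_pending


page_ids = list(range(1, 2501))  # assumed sample: 2,500 consecutive page ids
checked: List[int] = []
queue: Deque[BatchJob] = deque([BatchJob(start_id=1, batch_size=1000)])
jobs_run, max_pending = drain(queue, page_ids, checked.append)
```

In this run the 2,500 pages are handled by three jobs, and the queue never holds more than one pending job, which is the property the chaining is meant to provide.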

I assume this is not critical stuff dealing with canonical data, so if one of the jobs fails, it's fine: the next week's run fills the gap.

Does that sound good to you?

CBogen renamed this task from Change how we send image-suggestions notifications to experienced users to [L] Change how we send image-suggestions notifications to experienced users. Mar 22 2023, 4:38 PM

Change 908567 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/ImageSuggestions@master] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/908567

Change 909213 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/CirrusSearch@master] Allow setting of an initial value for search_after

https://gerrit.wikimedia.org/r/909213

Change 909213 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Allow setting of an initial value for search_after

https://gerrit.wikimedia.org/r/909213

Change 916527 had a related patch set uploaded (by Cparle; author: Cparle):

[integration/config@master] Add CirrusSearch to CI so unit tests can run

https://gerrit.wikimedia.org/r/916527

Change 916527 merged by jenkins-bot:

[integration/config@master] Add CirrusSearch to CI so unit tests can run

https://gerrit.wikimedia.org/r/916527

Change 908567 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/908567

Change 922853 had a related patch set uploaded (by Matthias Mullie; author: Cparle):

[mediawiki/extensions/ImageSuggestions@wmf/1.41.0-wmf.9] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/922853

Change 922853 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@wmf/1.41.0-wmf.9] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/922853

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:35:45Z] <mlitn@deploy1002> Started scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:37:16Z] <mlitn@deploy1002> mlitn: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:51:57Z] <mlitn@deploy1002> Finished scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] (duration: 16m 12s)

Change 923250 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@master] Don't forward console logs to other providers

https://gerrit.wikimedia.org/r/923250

Change 923250 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Don't forward console logs to other providers

https://gerrit.wikimedia.org/r/923250

Change 924877 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[operations/puppet@production] [ImageSuggestions] Process suggestions via job queue rather than sync

https://gerrit.wikimedia.org/r/924877

Change 924562 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@master] Report accurate amount of pages

https://gerrit.wikimedia.org/r/924562

Quick update: the main patch and a couple of tiny follow-ups have been merged (one more minor follow-up, not functionally relevant, remains).
A manual (synchronous; not via job queue) dry run of all the new logic succeeded. The scheduled weekly runs completed successfully as well.
Now that we know the refactored logic works, we'll let it process via the job queue from next week on.

@Ladsgroup can you CR+2 this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/924877/
That change will instruct the maintenance script to submit batches to the job queue instead of executing them immediately (relevant code here).

Change 924562 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Report accurate amount of pages

https://gerrit.wikimedia.org/r/924562

Change 924877 merged by Ladsgroup:

[operations/puppet@production] [ImageSuggestions] Process suggestions via job queue rather than sync

https://gerrit.wikimedia.org/r/924877