[L] Change how we send image-suggestions notifications to experienced users
Closed, Resolved · Public

Assigned To

Authored By

	Cparle
	Nov 10 2022, 6:03 PM

Description

At the moment we run a scheduled maintenance script to send image-suggestions notifications to experienced users.

We need to change how this is done before we start sending section-image-suggestions notifications:

From @Ladsgroup:

The architecture of "let's update data from services by introducing regular cron maint scripts" is okay for small cases or a small number of wikis, but it has been creeping up in many places, including Growth experiments, and is quite unsustainable in many ways:

It's not distributed: all of our mw crons run on mwmaint1002, which is basically a single point of failure. Any noisy neighbor can cause wide-scale disruption.

It's quite wasteful. The updates usually happen by checking the whole wiki or something like that. It needs a more robust event-driven architecture: you backfill the data once, and then any change triggers a job to update that page.

Time-wise it is problematic. We don't yet have a central catalog of mw crons and when they start. They put different levels of pressure on our systems, and if this way of doing things continues, in no time we will have outages caused by concurrent mw scripts bringing down a database or something like that. The distribution of such changes must be automatic, not done by guessing or picking "low-load" times and crossing our fingers.

There are no criticality levels in mw maint scripts. Higher-priority scripts are run in the same place as low-priority ones. It is quite possible that a low-prio script could cause issues for high-prio scripts (manual or automatic), e.g. the ones that clean up old private data so we can comply with data retention policies.

This is basically taking a system that is already fragile and making it even more fragile.

Generally I'm okay with having crons that clean up data, but regular updates from services seem wrong. They should build pipelines to update the database (mostly through MediaWiki jobs), and then they can have monthly "let's update everything" crons.
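
As a rough illustration of the "backfill once, then update per change" idea described above, here is a minimal Python sketch. It is not code from ImageSuggestions or MediaWiki: the class `SuggestionIndex`, its methods, and the placeholder computation are all hypothetical, and the in-memory list only stands in for a real job queue.

```python
class SuggestionIndex:
    """Hypothetical stand-in for derived data that must track wiki pages."""

    def __init__(self, pages):
        self.pages = dict(pages)   # page_id -> content (stand-in for the wiki)
        self.derived = {}          # page_id -> derived value kept in sync
        self.pending = []          # stand-in for the job queue
        self.pages_checked = 0     # how many pages we had to look at

    def _refresh(self, page_id):
        # Placeholder for the real per-page work (assumption for illustration).
        self.pages_checked += 1
        self.derived[page_id] = len(self.pages[page_id])

    def backfill(self):
        # One-off full pass, comparable to a "let's update everything" cron.
        for page_id in self.pages:
            self._refresh(page_id)

    def on_page_changed(self, page_id, new_content):
        # Event-driven path: a change queues a job for that one page only.
        self.pages[page_id] = new_content
        self.pending.append(page_id)

    def run_jobs(self):
        # Drain the queue; each job refreshes a single page.
        while self.pending:
            self._refresh(self.pending.pop(0))
```

In this sketch a single edit costs one page check rather than a scan of every page, which is the waste the comment above points at.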

Details

Subject	Repo	Branch	Lines +/-
Add CirrusSearch to CI so unit tests can run	integration/config	master	+1 -1
[ImageSuggestions] Process suggestions via job queue rather than sync	operations/puppet	production	+7 -7
Report accurate amount of pages	mediawiki/extensions/ImageSuggestions	master	+7 -4
Don't forward console logs to other providers	mediawiki/extensions/ImageSuggestions	master	+5 -1
Change maint script to do work via jobs	mediawiki/extensions/ImageSuggestions	wmf/1.41.0-wmf.9	+1 K -400
Change maint script to do work via jobs	mediawiki/extensions/ImageSuggestions	master	+1 K -400
Allow setting of an initial value for search_after	mediawiki/extensions/CirrusSearch	master	+54 -1

Related Objects

Status	Assigned	Task
Resolved	None	T318017 [EPIC] Section-level Image Suggestions Notifications for More Experienced Contributors
Resolved	Cparle	T330945 [S] Schedule section-level image suggestions notifications
Resolved	Cparle	T322872 [L] Change how we send image-suggestions notifications to experienced users

Event Timeline

Cparle created this task. Nov 10 2022, 6:03 PM

Restricted Application added a subscriber: Aklapper. Nov 10 2022, 6:03 PM

CBogen added a project: Image-Suggestions. Nov 10 2022, 6:04 PM

Cparle added subscribers: mfossati, matthiasmullie. Nov 10 2022, 6:10 PM

AUgolnikova-WMF moved this task from Triage to Image Suggestions on the Structured-Data-Backlog board. Nov 14 2022, 5:36 PM

AUgolnikova-WMF moved this task from Image Suggestions to Section Level Image Suggestions on the Structured-Data-Backlog board.

Hi,
You can do something rather simple in hollowing out the job. That's what we do in refreshlinks.

Queue a job with start id = 1 and a batch size of 1000
Check the 1k articles in that batch, check for notifications, etc.
Queue the next job with a start id of 1001
Return true

That way you wouldn't queue hundreds of thousands of jobs, you wouldn't run a "master job" that could take hours to finish (and be killed after a timeout), and you would avoid queuing potentially thousands of jobs at once, which could choke the whole job queue.
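
The self-requeuing pattern in the steps above can be sketched roughly as follows. This is a plain-Python simulation, not MediaWiki job code: `process_batch`, `run_queue`, and the in-memory `deque` are hypothetical stand-ins, and the per-page work is a placeholder. It assumes page IDs are contiguous and that a batch starting at ID 1 covers IDs 1 to 1000, so the next job starts at 1001.

```python
from collections import deque

BATCH_SIZE = 1000  # batch size suggested in the comment above


def process_batch(start_id, batch_size, max_page_id, handled):
    """Handle one batch of page IDs; return the next start ID, or None when done."""
    end_id = min(start_id + batch_size - 1, max_page_id)
    for page_id in range(start_id, end_id + 1):
        handled.append(page_id)  # placeholder for "check for notifications, etc."
    next_start = end_id + 1
    return next_start if next_start <= max_page_id else None


def run_queue(max_page_id, batch_size=BATCH_SIZE):
    """Simulate a queue where each job enqueues only its own successor."""
    queue = deque([1])  # seed job: start id = 1
    handled = []
    jobs_run = 0
    max_pending = 0     # largest number of jobs waiting at any one time
    while queue:
        max_pending = max(max_pending, len(queue))
        start_id = queue.popleft()
        next_start = process_batch(start_id, batch_size, max_page_id, handled)
        jobs_run += 1
        if next_start is not None:
            queue.append(next_start)  # the job queues the next batch, then returns
    return handled, jobs_run, max_pending
```

Because each job enqueues at most one follow-up, the queue in this simulation never holds more than one pending job of this kind, and no single job runs longer than one batch.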

I assume this is not critical stuff dealing with canonical data, so if one of the jobs fails, that's fine: the next week's run fills the gap.

Does that sound good to you?

AUgolnikova-WMF moved this task from Section Level Image Suggestions to Image Suggestions on the Structured-Data-Backlog board. Nov 23 2022, 3:17 PM

matthiasmullie added a parent task: T330945: [S] Schedule section-level image suggestions notifications. Mar 7 2023, 10:11 AM

CBogen moved this task from Image Suggestions to Triage on the Structured-Data-Backlog board. Mar 7 2023, 1:54 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

CBogen moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

matthiasmullie updated the task description. Mar 22 2023, 4:38 PM

CBogen renamed this task from Change how we send image-suggestions notifications to experienced users to [L] Change how we send image-suggestions notifications to experienced users. Mar 22 2023, 4:38 PM

CBogen moved this task from Ready for Estimation to Ready for Development on the Structured-Data-Backlog (Current Work) board. Mar 22 2023, 5:07 PM

CBogen mentioned this in T330945: [S] Schedule section-level image suggestions notifications. Mar 22 2023, 5:11 PM

Cparle claimed this task. Mar 30 2023, 1:25 PM

Cparle moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board.

CBogen mentioned this in T330931: [L] Section Image suggestions notification UI. Apr 3 2023, 4:27 PM

CBogen mentioned this in T330934: [L] Send image suggestion notification (for article + section) to experienced users.

Change 908567 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/ImageSuggestions@master] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/908567

gerritbot added a project: Patch-For-Review. Apr 13 2023, 2:41 PM

Change 909213 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/CirrusSearch@master] Allow setting of an initial value for search_after

https://gerrit.wikimedia.org/r/909213

Change 909213 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Allow setting of an initial value for search_after

https://gerrit.wikimedia.org/r/909213

SimoneThisDot moved this task from Doing to Design QA on the Structured-Data-Backlog (Current Work) board. Apr 21 2023, 3:03 PM

SimoneThisDot moved this task from Design QA to Doing on the Structured-Data-Backlog (Current Work) board.

Change 916527 had a related patch set uploaded (by Cparle; author: Cparle):

[integration/config@master] Add CirrusSearch to CI so unit tests can run

https://gerrit.wikimedia.org/r/916527

Change 916527 merged by jenkins-bot:

[integration/config@master] Add CirrusSearch to CI so unit tests can run

https://gerrit.wikimedia.org/r/916527

Cparle moved this task from Doing to Code Review on the Structured-Data-Backlog (Current Work) board. May 8 2023, 11:15 AM

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.9; 2023-05-15). May 15 2023, 2:47 PM

Change 908567 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/908567

Cparle mentioned this in rEISU4ef09c63b710: Change maint script to do work via jobs. May 24 2023, 3:26 PM

Maintenance_bot removed a project: Patch-For-Review. May 24 2023, 3:30 PM

Change 922853 had a related patch set uploaded (by Matthias Mullie; author: Cparle):

[mediawiki/extensions/ImageSuggestions@wmf/1.41.0-wmf.9] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/922853

gerritbot added a project: Patch-For-Review. May 24 2023, 3:35 PM

Change 922853 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@wmf/1.41.0-wmf.9] Change maint script to do work via jobs

https://gerrit.wikimedia.org/r/922853

matthiasmullie mentioned this in rEISU87646d95c50b: Change maint script to do work via jobs. May 25 2023, 7:35 AM

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:35:45Z] <mlitn@deploy1002> Started scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:37:16Z] <mlitn@deploy1002> mlitn: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-25T07:51:57Z] <mlitn@deploy1002> Finished scap: Backport for [[gerrit:922853|Change maint script to do work via jobs (T322872)]] (duration: 16m 12s)

Maintenance_bot removed a project: Patch-For-Review. May 25 2023, 8:10 AM

Change 923250 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@master] Don't forward console logs to other providers

https://gerrit.wikimedia.org/r/923250

gerritbot added a project: Patch-For-Review. May 25 2023, 8:11 AM

Change 923250 merged by jenkins-bot:

[mediawiki/extensions/ImageSuggestions@master] Don't forward console logs to other providers

https://gerrit.wikimedia.org/r/923250

matthiasmullie mentioned this in rEISU1757c6cc11b5: Don't forward console logs to other providers. May 25 2023, 8:58 AM

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.11; 2023-05-30); removed MW-1.41-notes (1.41.0-wmf.9; 2023-05-15). May 25 2023, 9:00 AM

Maintenance_bot removed a project: Patch-For-Review. May 25 2023, 9:10 AM

Change 924877 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[operations/puppet@production] [ImageSuggestions] Process suggestions via job queue rather than sync

https://gerrit.wikimedia.org/r/924877

Change 924562 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/extensions/ImageSuggestions@master] Report accurate amount of pages

https://gerrit.wikimedia.org/r/924562

Quick update: the main patch and a couple of tiny follow-ups have been merged (one more minor follow-up, not functionally relevant, remains).
A manual (synchronous; not via the job queue) dry run of all the new logic succeeded. The scheduled weekly runs completed successfully as well.
Now that we know the refactored logic works, we'll let it process via the job queue from next week on.

@Ladsgroup can you CR+2 this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/924877/
That change will instruct the maintenance script to submit batches to the job queue instead of executing them immediately (relevant code here).
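
For readers unfamiliar with the distinction, the difference between executing batches immediately and submitting them to a queue can be sketched as below. This is a hypothetical Python illustration only: the actual maintenance script, its option name, and the puppet change are not reproduced in this task, and `dispatch_batches`, `handler`, and `job_queue` are invented names.

```python
def dispatch_batches(batches, handler, job_queue=None):
    """Run each batch right away, or defer it if a queue is supplied.

    `job_queue` is any list-like object standing in for a real job queue
    (an assumption for illustration).
    """
    for batch in batches:
        if job_queue is None:
            handler(batch)           # synchronous: work happens in this process
        else:
            job_queue.append(batch)  # deferred: a job runner picks it up later
```

Calling `dispatch_batches(batches, handler)` corresponds to the synchronous dry run described above; passing a `job_queue` leaves the work for a separate runner.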