Select candidate jobs for transferring to the new infrastructure
Closed, Resolved · Public

Description

Out of all the job types that are run in production we need to select candidates for being the first transferred to the new EventBus infrastructure. Requirements:

  • Low volume
  • Idempotence - the job would initially be double-processed by old and new infra, so doing it twice shouldn't cause any trouble
  • Preferably low importance - if something goes wrong it should be either easily fixable or possible to ignore
  • As simple as possible - no delayed executions, root/leaf job splitting, no recursion and no importance for deduplication.
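
The requirements above amount to a filter over per-job metadata. Below is a minimal sketch of that idea, not the process actually used for this task. The `JobType` record, its field names, the `max_rate` cut-off for "low volume" and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobType:
    """Hypothetical description of one job type (fields are illustrative)."""
    name: str
    jobs_per_second: float        # average production volume
    idempotent: bool              # safe to run twice during double-processing
    low_importance: bool          # failures are easy to fix or ignore
    delayed_execution: bool = False
    root_leaf_splitting: bool = False
    recursive: bool = False
    relies_on_deduplication: bool = False


def first_wave_candidates(jobs, max_rate=1.0):
    """Return names of jobs meeting the hard requirements, best candidates first.

    max_rate is an assumed cut-off for "low volume"; the task does not define one.
    """
    def is_simple(job):
        return not (job.delayed_execution or job.root_leaf_splitting
                    or job.recursive or job.relies_on_deduplication)

    eligible = [j for j in jobs
                if j.idempotent and is_simple(j) and j.jobs_per_second <= max_rate]
    # "Preferably low importance" is a preference, so it only affects ordering.
    eligible.sort(key=lambda j: (not j.low_importance, j.jobs_per_second))
    return [j.name for j in eligible]


# Made-up entries, not real production figures.
EXAMPLE_JOBS = [
    JobType("exampleCounterRefresh", 0.1, idempotent=True, low_importance=True),
    JobType("exampleDelayedRetry", 0.2, idempotent=True, low_importance=True,
            delayed_execution=True),
    JobType("exampleHighVolume", 200.0, idempotent=True, low_importance=False),
    JobType("exampleImportantButSimple", 0.5, idempotent=True, low_importance=False),
]
```

Calling `first_wave_candidates(EXAMPLE_JOBS)` keeps only the two simple, low-volume entries and lists the low-importance one first.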

For reference, here's the list of job types currently executed in production, with some notes (full list available as P5964):

I've looked through the following jobs (struck-through jobs have been moved):

Restricted Application added a subscriber: Aklapper. · Sep 6 2017, 8:02 PM
mobrovac raised the priority of this task from Normal to High. · Sep 7 2017, 9:14 AM
mobrovac edited projects, added EventBus, ChangeProp, MediaWiki-JobQueue; removed Goal, Epic.
mobrovac updated the task description.
mobrovac removed a subscriber: Aklapper.
Restricted Application added a project: Analytics. · Sep 7 2017, 9:14 AM
mobrovac updated the task description. · Sep 7 2017, 9:19 AM
Pchelolo updated the task description. · Sep 7 2017, 8:53 PM
Pchelolo added a subscriber: EBernhardson.
Pchelolo updated the task description. · Sep 7 2017, 9:38 PM
Pchelolo updated the task description. · Sep 7 2017, 10:04 PM
mobrovac updated the task description. · Sep 8 2017, 12:07 PM

cirrusSearchCheckerJob - basically idempotent. It verifies that the data in Elasticsearch matches MySQL and creates new jobs if they don't match. Uses delayed execution.
cirrusSearchDeleteArchive - idempotent. Checks the database to verify that archive indexing is still appropriate when run.
cirrusSearchDeletePages - idempotent
cirrusSearchElasticaWrite - idempotent. Issued to retry failed write requests to Elasticsearch. Uses delayed execution.
cirrusSearchIncomingLinkCount - idempotent. Expensive, high volume duplicates
cirrusSearchLinksUpdate - idempotent, expensive
cirrusSearchLinksUpdatePrioritized - idempotent, expensive
cirrusSearchMassIndex - idempotent, expensive, low volume
cirrusSearchOtherIndex - can't use versioning, so out-of-order updates could be problematic

Pchelolo updated the task description. · Sep 8 2017, 9:57 PM

Thank you @EBernhardson, updated the task with your info. Now we've got a complete list of jobs executed in production.

elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. · Sep 11 2017, 2:47 PM

IMHO, updateBetaFeaturesUserCounts is the perfect candidate here. It's very lightweight (one SELECT, one UPDATE), it's idempotent and low-volume.
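
To illustrate why a "one SELECT, one UPDATE" job of this shape is safe to double-process, here is a minimal sketch using an in-memory SQLite database. The table names, column names and the `update_feature_user_count` function are hypothetical and are not the BetaFeatures extension's actual schema or code. The point is only that recomputing a count and storing it as an absolute value gives the same result no matter how many times it runs.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_features (user_id INTEGER, feature TEXT)")
    conn.execute(
        "CREATE TABLE feature_counts (feature TEXT PRIMARY KEY, user_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO user_features VALUES (?, ?)",
        [(1, "demo-feature"), (2, "demo-feature"), (3, "other-feature")],
    )
    conn.execute("INSERT INTO feature_counts VALUES ('demo-feature', 0)")
    conn.commit()
    return conn


def update_feature_user_count(conn, feature):
    """One SELECT to recompute the count, one UPDATE to store it."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM user_features WHERE feature = ?", (feature,)
    ).fetchone()
    # The UPDATE writes an absolute value (not an increment),
    # so running the job a second time leaves the stored count unchanged.
    conn.execute(
        "UPDATE feature_counts SET user_count = ? WHERE feature = ?",
        (count, feature),
    )
    conn.commit()
    return count
```

Running `update_feature_user_count(conn, "demo-feature")` twice on the demo database stores the same count both times, which is the property the old and new infrastructure rely on while both process the job.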

Sounds like a solid choice to me. Not terribly sexy, but straightforward.

Change 377518 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable processing of the updateBetaFeaturesUserCounts job.

https://gerrit.wikimedia.org/r/377518

Change 377518 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable processing of the updateBetaFeaturesUserCounts job.

https://gerrit.wikimedia.org/r/377518

Mentioned in SAL (#wikimedia-operations) [2017-09-13T14:22:51Z] <mobrovac@tin> Started deploy [cpjobqueue/deploy@60d0a78]: Start using the EventBus infrastructure for the updateBetaFeaturesUserCounts job - T175210

Mentioned in SAL (#wikimedia-operations) [2017-09-13T14:23:24Z] <mobrovac@tin> Finished deploy [cpjobqueue/deploy@60d0a78]: Start using the EventBus infrastructure for the updateBetaFeaturesUserCounts job - T175210 (duration: 00m 33s)

mobrovac closed this task as Resolved.

The job is being double-produced now, so resolving.

Given the useful information we have in this task, I am proposing to widen the scope beyond the first job, towards generally coordinating the order of migrating individual jobs. @mobrovac, does that sound reasonable to you?

mobrovac reopened this task as Open. · Sep 14 2017, 1:18 PM
mobrovac edited projects, added Services (doing); removed Services (done).

Sure.

I honestly don't have a strong preference between the other "hearted" tasks. Given that all of them are fairly low volume, would it make sense to just deploy all of the hearted ones in the next wave?

Good idea. Once we fully switch the first one to EB, there is no need to go one by one for low-risk and straightforward jobs.

mforns moved this task from Incoming to Radar on the Analytics board. · Sep 28 2017, 3:41 PM

Mentioned in SAL (#wikimedia-operations) [2017-11-02T16:04:16Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Use only EventBus for processing updateBetaFeatureUserCount - T175210 (duration: 00m 51s)

Change 388139 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] JobQueue: Use EventBus for all "hearted" jobs

https://gerrit.wikimedia.org/r/388139

Change 388139 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueue: Use EventBus for all "hearted" jobs

https://gerrit.wikimedia.org/r/388139

Change 389491 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable more 'hearted' jobs

https://gerrit.wikimedia.org/r/389491

Change 389491 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable more 'hearted' jobs

https://gerrit.wikimedia.org/r/389491

Mentioned in SAL (#wikimedia-operations) [2017-11-06T14:49:23Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@e93feba]: Start processing all 'hearted' jobs T175210

Mentioned in SAL (#wikimedia-operations) [2017-11-06T14:50:07Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@e93feba]: Start processing all 'hearted' jobs T175210 (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2017-11-06T14:50:20Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Switch MessageIndexRebuildJob, flaggedrevs_CacheUpdate and deleteLinks jobs to the EventBus infrastructure - T175210 (duration: 00m 46s)

Change 389495 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Correct the regex for the consumed topics

https://gerrit.wikimedia.org/r/389495

Change 389495 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Correct the regex for the consumed topics

https://gerrit.wikimedia.org/r/389495

Change 389497 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Finally use correct regexes for matching the jobs

https://gerrit.wikimedia.org/r/389497

Change 389497 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Finally use correct regexes for matching the jobs

https://gerrit.wikimedia.org/r/389497

mobrovac updated the task description. · Nov 6 2017, 3:38 PM

Change 389669 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Remove the Host header from the request

https://gerrit.wikimedia.org/r/389669

Change 389669 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Remove the Host header from the request

https://gerrit.wikimedia.org/r/389669

Out of the IRC discussion we've got 3 candidates for the next migration:

  • wikibase-UpdateUsagesForPage - super high traffic, well tested on beta, but super easy. TODO talk to Wikidata
  • ORESFetchScoreJob - low traffic, quite problematic
  • recentChangesUpdate - decent traffic, very high user-visible effect.

The wikibase-UpdateUsagesForPage job sounds like a perfect candidate to be the next one. It has averaged ~220 jobs/s over the last month, it was well tested in beta labs, it seems idempotent, and it doesn't seem to use any of the advanced JobQueue features like root job deduplication or delayed execution.

Additionally, it's the biggest user of EnqueueJob, so in reality it creates 2 jobs per execution: one actual job and one EnqueueJob. This job therefore accounts for 440 jobs/s, which is 40% of all the jobs in the queue.
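
The arithmetic behind these figures can be checked in a few lines. The 220 jobs/s average, the factor of 2 and the 40% share come from the comment above; the total queue rate is only implied by them and is derived here as an estimate, not a measured number.

```python
avg_job_rate = 220          # jobs/s, average quoted above
jobs_per_execution = 2      # the job itself plus its EnqueueJob
share_percent = 40          # quoted share of all jobs in the queue

# Effective load the job places on the queue.
effective_rate = avg_job_rate * jobs_per_execution          # 440 jobs/s

# Total queue rate implied by the quoted share (derived, not stated in the task).
implied_total_rate = effective_rate * 100 / share_percent   # 1100.0 jobs/s
```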

With the new Kafka-based queue, EnqueueJob is not needed any more (see T181216), so transferring this job will move a very significant portion of the load out of the Redis queue.

@daniel, what do you think about moving wikibase-UpdateUsagesForPage to the Kafka-based queue? Am I correct in thinking that this job is idempotent?

Pchelolo updated the task description. · Nov 24 2017, 10:37 AM
mobrovac updated the task description. · Dec 4 2017, 5:21 PM
Pchelolo updated the task description. · Jan 31 2018, 12:11 AM
EBernhardson updated the task description. · Jan 31 2018, 4:56 AM
Pchelolo updated the task description.
Pchelolo updated the task description. · Mar 5 2018, 6:58 PM
Pchelolo updated the task description. · Mar 9 2018, 2:57 PM

While the CirrusSearch issues are being resolved, the next batch of jobs can be switched. Here's what I propose:

  • recentChangesUpdate - 28/s
  • categoryMembershipChange - 12/s
  • EchoNotificationDeleteJob - 7/s
  • ORESFetchScoreJob - 6/s
  • wikibase-InjectRCRecords - 3/s

These are among the highest-frequency jobs left on the old queue. In total they account for 56/s, which is 1/3 of all the jobs left on the old queue. All of them seem idempotent and all of them seem to have simple parameters.
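
As a quick check of the total, the per-job rates listed above can be summed directly. The remaining load on the old queue is not stated in the task; the value below is only what the quoted 1/3 share implies.

```python
from fractions import Fraction

# Rates in jobs/s, copied from the list above.
proposed_rates = {
    "recentChangesUpdate": 28,
    "categoryMembershipChange": 12,
    "EchoNotificationDeleteJob": 7,
    "ORESFetchScoreJob": 6,
    "wikibase-InjectRCRecords": 3,
}

total_rate = sum(proposed_rates.values())              # 56 jobs/s

# Load left on the old queue implied by the quoted 1/3 share (derived estimate).
implied_remaining_rate = total_rate / Fraction(1, 3)   # 168 jobs/s
```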

I propose to switch the test wikis first, as always, and then do a bulk switch for all the wikis.
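
The two-stage rollout described here can be sketched as a small routing decision. This is an illustrative model only and does not reflect how wmf-config/jobqueue.php or the cpjobqueue configuration actually express it. The `uses_new_queue` function, the stage names and the contents of `TEST_WIKIS` are assumptions.

```python
# Hypothetical set of wikis treated as test wikis in stage one.
TEST_WIKIS = frozenset({"testwiki", "test2wiki"})

STAGES = ("test_wikis", "all_wikis")


def uses_new_queue(wiki, job_type, switched_jobs, stage):
    """Decide whether a job on a given wiki goes to the new queue.

    switched_jobs is the set of job types included in the current batch;
    anything outside it stays on the old queue regardless of stage.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown rollout stage: {stage!r}")
    if job_type not in switched_jobs:
        return False
    if stage == "test_wikis":
        return wiki in TEST_WIKIS
    return True  # stage == "all_wikis": bulk switch for every wiki
```

With `switched_jobs = {"recentChangesUpdate"}`, stage `"test_wikis"` routes the job to the new queue only on the test wikis, and stage `"all_wikis"` routes it there everywhere.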

Change 423486 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Switch remaining high traffic jobs for test wikis.

https://gerrit.wikimedia.org/r/423486

Change 423487 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch remaining high traffic jobs for test wikis.

https://gerrit.wikimedia.org/r/423487

Pchelolo updated the task description. · Apr 2 2018, 6:49 PM
mobrovac updated the task description. · Apr 16 2018, 7:31 PM
Pchelolo updated the task description. · Apr 16 2018, 9:36 PM
Pchelolo updated the task description. · May 22 2018, 10:19 AM

Mentioned in SAL (#wikimedia-operations) [2018-05-22T10:30:45Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@b45cd3b]: Switch cross-wiki posting jobs for everything T175210

Mentioned in SAL (#wikimedia-operations) [2018-05-22T10:31:48Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@b45cd3b]: Switch cross-wiki posting jobs for everything T175210 (duration: 01m 03s)

Pchelolo updated the task description. · May 29 2018, 10:10 AM
Pchelolo updated the task description. · Jun 5 2018, 9:22 AM
Pchelolo closed this task as Resolved. · Jun 5 2018, 11:41 AM
Pchelolo edited projects, added Services (done); removed Services (doing).

We have switched all jobs except certain outstanding problematic ones and we have tickets for all of them, so this ticket has served its purpose. Resolving.