
Update the "waiting period" implementation so as not to block the job queue
Closed, Resolved · Public

Description

The introduction of fetchGoogleCloudVisionAnnotations jobs from the MachineVision extension blocked the job queue due to how it was using job release timestamps (T240518).
We need an alternative method of implementing the 48-hour "waiting period" before labels are requested for a newly uploaded image. In the meantime, the extension has been prevented from enqueuing label request jobs.

Event Timeline

Mholloway renamed this task from Investigate why fetchGoogleCloudVisionAnnotations jobs blocked the job queue to Investigate why fetchGoogleCloudVisionAnnotations jobs blocked the job queue, and fix it. Dec 18 2019, 4:00 PM
Mholloway triaged this task as High priority.

@Pchelolo Is there a step to register the job in cpjobqueue that I missed? I didn't do any Kafka configuration at all, just set up the job in the extension.

From what I can see in the code, the MachineVisionFetchGoogleCloudVisionAnnotationsJobFactory accepts a job delay, which is currently set in production to 172800 seconds (48 hours). @Mholloway could you please explain why that is?

TL;DR: this is not going to work with the Kafka-based job queue. Job delays are implemented as a blocking wait in the change-prop queue controller: we fetch the job, see the release timestamp, block the execution slot for a certain period of time, and then re-enqueue the job back into the queue. So with a 48-hour delay, each job ties up a single concurrency slot for its topic (topic group) for 48 hours, eating up all the concurrency slots as more and more jobs are added and completely blocking the queue for that group of topics.

Before proposing any solutions or alternative implementations, I would like to understand the reasoning for delaying the job by 48 hours before executing it.

The product requirement this is meant to fulfill is a 48-hour "waiting period", so that we're not sending out labeling requests for new files that are going to be quickly deleted for one reason or another. I can imagine a few different ways of implementing this, but I picked jobReleaseTimestamp since it appeared to be available for use. I thought I'd seen at least one other existing job using the release timestamp feature, but I might be mistaken.
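
For context, this is roughly how the waiting period was expressed on the MediaWiki side: a minimal sketch of the release-timestamp approach, not the actual MachineVision code. Apart from jobReleaseTimestamp itself, the job name, parameters, and the $file variable are illustrative assumptions.

```
// Sketch only: enqueue a label request job that the queue is asked not to run
// for 48 hours. Everything except the jobReleaseTimestamp mechanism is hypothetical.
$delay = 172800; // 48 hours, the value configured in production

$job = new JobSpecification(
	'fetchGoogleCloudVisionAnnotations',
	[
		'title' => $file->getTitle()->getDBkey(),
		'namespace' => NS_FILE,
		// Ask the queue to hold the job until now + 48h. In the Kafka/change-prop
		// queue this turns into a blocking wait on a concurrency slot, as described below.
		'jobReleaseTimestamp' => (int)wfTimestamp() + $delay,
	]
);

JobQueueGroup::singleton()->push( $job );
```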

I thought I'd seen at least one other existing job using the release timestamp feature, but I might be mistaken.

There are a couple, but they're different.

  • cdnPurge uses a really small delay, so blocking the queue is not really a problem.
  • cirrusSearchCheckerJob has bigger delays, but it's separated into its own queue.

It's possible to make this work with delayed jobs if we split it out into a separate rule in change-prop. But overall, the delayed-jobs implementation in the Kafka queue is very inefficient and I wouldn't encourage using it in the long term. I remember there was an RFC about a cron-like service floating around; that might be a much better technology for what's needed here.

cc @Joe for SRE perspective.

Change 559559 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] MachineVision: Remove new upload labeling job delay

https://gerrit.wikimedia.org/r/559559

Change 559559 merged by jenkins-bot:
[operations/mediawiki-config@master] MachineVision: Remove new upload labeling job delay

https://gerrit.wikimedia.org/r/559559

@Pchelolo There's one other issue that needs resolution before we can try reenabling these jobs. Separate from the immediate job queue delay caused by the use of jobReleaseTimestamp, these jobs were failing at a high rate with DB duplicate entry errors, indicating that data for the same image had already been inserted by a previous job. This doesn't make sense from the MediaWiki side, where only a single job is submitted in the UploadComplete hook for a newly uploaded file. Were these most likely retries caused by the misbehaving release timestamp, or might there be something else going wrong?

@Pchelolo There's one other issue that needs resolution before we can try reenabling these jobs. Separate from the immediate job queue delay caused by the use of jobReleaseTimestamp, these jobs were failing at a high rate with DB duplicate entry errors, indicating that data for the same image had already been inserted by a previous job. This doesn't make sense from the MediaWiki side, where only a single job is submitted in the UploadComplete hook for a newly uploaded file. Were these most likely retries caused by the misbehaving release timestamp, or might there be something else going wrong?

The jobs should aim to be idempotent, so it would be nice to add a check for whether the data is already there before calling Google. Also, a refusal to write a duplicate entry should be treated as a success, not a failure. We do our best not to repeat jobs, but it's sometimes unavoidable (change-prop restarts, for example), and failing on a duplicate insert would make change-prop retry the job two more times.

As for the source of the initial retries, we can investigate further when we re-enable the job. One possibility that comes to mind is that the HTTP timeout on the change-prop side is not high enough: change-prop posts the job, doesn't wait long enough for the result, and drops the connection, marking the job as failed, while in reality the job eventually finishes successfully and the result is stored in the DB. How long does the request to Google take?
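
To make the "duplicate entry should be a success" point concrete, here is a minimal sketch assuming MediaWiki's IDatabase::insert() with the IGNORE option. The table and column names are hypothetical, and this is not the actual patch that followed.

```
// Sketch only: an idempotent insert, so a retried job that finds its row
// already present counts as success instead of failing with a duplicate key error.
// Table and column names are hypothetical.
$dbw = wfGetDB( DB_MASTER );

$dbw->insert(
	'machine_vision_annotation',
	[
		'mva_image_sha1' => $sha1,
		'mva_label' => $labelId,
	],
	__METHOD__,
	[ 'IGNORE' ] // a duplicate key becomes a no-op rather than a DBQueryError
);

// affectedRows() reports whether this run actually wrote the row; either way
// the job can report success.
$wroteRow = $dbw->affectedRows() > 0;
```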

Change 559585 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@master] Ignore duplicate key errors when inserting data from annotation jobs

https://gerrit.wikimedia.org/r/559585

@Pchelolo Sounds good. We'll ignore duplicate key errors upon inserting data from these jobs, then, since they're an expected condition of using the job queue.

As for the jobs that were previously submitted with the 48-hour delay but have not yet been processed, what options do we have for dealing with those? Can they be modified at this point? If necessary, they can be dropped and we can start from a clean queue with no delay.

Change 559585 merged by jenkins-bot:
[mediawiki/extensions/MachineVision@master] Ignore duplicate key errors when inserting data from annotation jobs

https://gerrit.wikimedia.org/r/559585

Change 559635 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.11] Ignore duplicate key errors when inserting data from annotation jobs

https://gerrit.wikimedia.org/r/559635

Change 559635 merged by Mholloway:
[mediawiki/extensions/MachineVision@wmf/1.35.0-wmf.11] Ignore duplicate key errors when inserting data from annotation jobs

https://gerrit.wikimedia.org/r/559635

Ramsey-WMF subscribed.

Assigning to Cormac for now since he's also looking into the edit conflict issue. The key point we're interested in investigating now is this suggestion from @Pchelolo earlier:

It's possible to make this work with delayed jobs if we split it out into a separate rule in change-prop. But overall, the delayed-jobs implementation in the Kafka queue is very inefficient and I wouldn't encourage using it in the long term. I remember there was an RFC about a cron-like service floating around; that might be a much better technology for what's needed here.

Mholloway renamed this task from Investigate why fetchGoogleCloudVisionAnnotations jobs blocked the job queue, and fix it to Update the "waiting period" implementation so as not to block the job queue. Jan 13 2020, 12:10 PM
Mholloway updated the task description.

My recommendation would be to go ahead with splitting into a separate queue/changeprop rule, if that's acceptable from a service ops perspective. Relatively speaking, this is a pretty low-volume job type.

(Updated the task description since what went wrong is no longer a mystery.)

My recommendation would be to go ahead with splitting into a separate queue/changeprop rule, if that's acceptable from a service ops perspective. Relatively speaking, this is a pretty low-volume job type.

How low volume are we talking?

The Kafka job queue handles delayed jobs in the following way: a message is read from the queue. If it's a delayed message, it's held in memory for up to reenqueue_delay seconds. If the reenqueue_delay limit is reached, the message is put back into the queue and a new message is fetched. This is done mostly to support queues with variable delays: we don't want a set of messages with a huge delay to block the queue for later messages with a small delay. While a message is held in memory, it occupies one concurrency slot of the queue, so up to concurrency messages can be held simultaneously.

It seems like in this case all jobs will have the same delay, so they will mostly be ordered by release timestamp in the queue ("mostly" because they're submitted separately, so they can get slightly out of order). In that case, if we set reenqueue_delay to something huge, effectively disabling this mechanism, and set the concurrency to the maximum we can allow without exceeding Google's limits, we should get the queue moving reliably without excess re-enqueueing.

One thing needs to be done in preparation: we need to make sure that the delays of retry jobs are ignored. We don't want to block the retry queue for 48 hours as well.
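
Putting that together, a dedicated change-prop rule for this job might look roughly like the sketch below. The option names (concurrency, reenqueue_delay) come from this discussion; the values and the surrounding config layout are assumptions, not the rule that was eventually deployed.

```
# Hypothetical sketch of a dedicated cpjobqueue rule for this job type.
# Values are illustrative only, not the deployed configuration.
fetchGoogleCloudVisionAnnotations:
  # Low concurrency is enough: the jobs arrive roughly ordered by release
  # timestamp, so the held jobs act as a sliding window, and this also caps
  # the request rate against Google's limits.
  concurrency: 2
  # All jobs share the same 48-hour delay, so setting reenqueue_delay well
  # above 48 hours (here ~72h) effectively disables the hold-and-requeue cycle.
  reenqueue_delay: 259200
  # Retries should ignore the original delay so a failed job does not block
  # the retry queue for another 48 hours.
```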

Looks like we're getting ~500 File page creations a day, so we're talking at least 1000 delayed items being in the queue at all times. Is that a lot?

Looks like we're getting ~500 File page creations a day, so we're talking at least 1000 delayed items being in the queue at all times. Is that a lot?

No, and we don't really need to hold all of the waiting jobs in memory. Since they will all be almost sorted by release timestamp, a normal, low concurrency should work; it will behave like a sliding window. So set the concurrency to whatever would seem reasonable if all the jobs were posted at once with no delay, bearing in mind how much you want to limit job concurrency.

Change 565303 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Add separate rule for machine vision jobs

https://gerrit.wikimedia.org/r/565303

Change 565615 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Re-enable delayed new upload jobs for MachineVision extension

https://gerrit.wikimedia.org/r/565615

Change 565303 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Add separate rule for machine vision jobs

https://gerrit.wikimedia.org/r/565303

Mentioned in SAL (#wikimedia-operations) [2020-01-21T19:45:00Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072

Mentioned in SAL (#wikimedia-operations) [2020-01-21T19:46:11Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@1ca3071]: Add separate rule for machine vision jobs T241072 (duration: 01m 11s)

The job now has its own rule/queue, so feel free to start producing these jobs; at the very least it shouldn't break anything else.

Change 565615 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable delayed new upload jobs for MachineVision extension

https://gerrit.wikimedia.org/r/565615

Change 566780 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Revert "Re-enable delayed new upload jobs for MachineVision extension"

https://gerrit.wikimedia.org/r/566780

Change 566780 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Re-enable delayed new upload jobs for MachineVision extension"

https://gerrit.wikimedia.org/r/566780

Change 570287 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[operations/mediawiki-config@master] Re-enable delayed new upload jobs for MachineVision extension

https://gerrit.wikimedia.org/r/570287

Change 570287 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable delayed new upload jobs for MachineVision extension

https://gerrit.wikimedia.org/r/570287

The update to the job queue code seems to be working after a couple of weeks of testing. Will consider this closed.