
[8 hours] Investigate JobQueue
Closed, ResolvedPublicSpike

Description

Deferrable updates can choose to implement the EnqueueableDataUpdate interface. Such updates can be automatically converted to a job as-needed.

For example, if the update fails (e.g. because a rate limit was hit), MediaWiki will convert it to a job and queue it to retry later.

There are also other situations in which we can improve reliability or optimise throughput by proactively converting updates to jobs where possible.
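As a rough illustration of the mechanism, a deferred update opts into this conversion by implementing `EnqueueableDataUpdate` and describing itself as a job specification. This is a hedged sketch only: the `PhonosFileUpdate` class and `phonosFileJob` job type are invented names, and the actual Phonos implementation may differ.

```php
<?php
// Hypothetical sketch of an enqueueable deferred update.
// "PhonosFileUpdate" and "phonosFileJob" are invented names.
class PhonosFileUpdate extends DataUpdate implements EnqueueableDataUpdate {
	/** @var array Parameters needed to (re)generate the audio file */
	private $params;

	public function __construct( array $params ) {
		parent::__construct();
		$this->params = $params;
	}

	/** Attempt the work synchronously, e.g. fetch and store the audio file. */
	public function doUpdate() {
		// ... call the TTS engine and save the file; may fail on rate limit ...
	}

	/**
	 * If doUpdate() fails, DeferredUpdates can use this to enqueue the
	 * same work as a job instead of losing it.
	 * @return array [ 'domain' => string, 'job' => IJobSpecification ]
	 */
	public function getAsJobSpecification() {
		return [
			'domain' => WikiMap::getCurrentWikiDbDomain()->getId(),
			'job' => new JobSpecification( 'phonosFileJob', $this->params ),
		];
	}
}
```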

Event Timeline

TheresNoTime updated the task description.
TheresNoTime changed the subtype of this task from "Task" to "Spike".
JMcLeod_WMF renamed this task from Investigate JobQueue to [8 hours] Investigate JobQueue.Sep 7 2022, 2:46 PM

I don't think DeferredUpdates are what we want, since they run after the HTML is sent back to the user, so we won't have the upload URL to give to the client. We're basically after the same idea as EnqueueableDataUpdate, though: if a job fails (e.g. because we hit the rate limit), we reschedule it. What I worry about is not being able to have any sort of throttling, i.e. run this type of job only N times a minute. That doesn't seem to be a thing with our job queue, but we'll have to ask around and see what's doable, as I'm sure we're not the first to face this sort of issue.

My initial thoughts are we should simply see if Google can give us the quota we need temporarily for our rollout. That would save us a lot of engineering time. The drawback is we have to sort of hope communities won't meddle with the "meta" template that uses the Phonos tag. If they make an edit that changes the cache key, we will be in the same scenario as the initial rollout. So it would probably be wise to have some sort of "retry" logic, at the very least.

So perhaps for the initial rollout, we ask Google for the quota we know we will need. The fallback system is to always make requests to Google and store the files on parse (as we're doing now), but if we hit the rate limit, queue a job. If that job fails, it should requeue itself, I guess. Assuming that works, that seems like the minimal engineering needed to ensure Phonos stays stable without sysadmin intervention.
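The "queue a job, and requeue it if it fails" fallback described above could look roughly like the following. This is a sketch under stated assumptions: the `PhonosFileJob` class name is hypothetical, and `RateLimitException` stands in for however the Phonos engine signals a quota hit.

```php
<?php
// Hedged sketch of the fallback job; names are hypothetical.
class PhonosFileJob extends Job {
	public function __construct( array $params ) {
		parent::__construct( 'phonosFileJob', $params );
	}

	public function run() {
		try {
			// ... request audio from the TTS engine and upload the file ...
			return true;
		} catch ( RateLimitException $e ) {
			// Returning false marks the job as failed; job runners will
			// retry it, subject to the queue's retry/backoff settings.
			$this->setLastError( $e->getMessage() );
			return false;
		}
	}
}
```

Note that `Job::allowRetries()` returns true by default, so a failed run is eligible for retry without extra code.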

I'm going to put this in the sprint because I believe it's the last major chunk of work for this project.

> DeferredUpdates I don't think are what we want, since that will happen after the HTML is sent back to the user and thus we won't have the upload URL to give to the client.

The URL isn't actually dependent on the file being saved, is it? I mean, technically we could return the URL before the file-saving has actually happened. I'm not sure we'd want to, because it'd then be a bit weird if clicking the button didn't actually work for some amount of time (although we will have a nice loading animation soon). Then, a job could be run and at some point the file would become available.

Although, the other thing to bear in mind here is previewing: it'd be annoying to not be able to play the audio on preview, I guess?

FYI while looking at estimating the new quota limit for T316009: Request increase to Google TTS rate limits, I found "The job executor endpoint in MediaWiki responds with an HTTP error code in case the execution has failed. In that case ChangePropagation posts a retry event into a special topic and retries executing the job with exponentially growing delay up to a configurable number of times." (per wikitech)

> retries executing the job with exponentially growing delay up to a configurable number of times

sounds useful for when a job fails due to hitting the quota limit?

Also https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 shows a fairly fast processing rate but also a fairly high backlog (for some jobs)?

So as it turns out, we do have a way to rate limit a job via $wgJobBackoffThrottling. It could be set to something like $wgJobBackoffThrottling['phonosGoogleEngineSomething'] = 1000 / 60 (TBD). It should be set below the actual limit so as to account for regular edits, previews, etc.
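For reference, $wgJobBackoffThrottling maps a job type to a maximum number of job runs per second (per runner), and fractional values are allowed. A minimal LocalSettings.php fragment, keeping the placeholder job name from above:

```php
// LocalSettings.php — 'phonosGoogleEngineSomething' is a placeholder name.
// The value is jobs per second; 1000 / 60 ≈ 16.7/s, i.e. ~1000 per minute.
$wgJobBackoffThrottling['phonosGoogleEngineSomething'] = 1000 / 60;
```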

>> DeferredUpdates I don't think are what we want, since that will happen after the HTML is sent back to the user and thus we won't have the upload URL to give to the client.

> The URL isn't actually dependent on the file being saved, is it? I mean, technically we could return the URL before the file-saving has actually happened. I'm not sure we'd want to, because it'd then be a bit weird if clicking the button didn't actually work for some amount of time (although we will have a nice loading animation soon). Then, a job could be run and at some point the file would become available.

> Although, the other thing to bear in mind here is previewing: it'd be annoying to not be able to play the audio on preview, I guess?

You are right, the URL is independent of the file being saved, and for that reason I would favor using DeferredUpdates with EnqueueableDataUpdate. But, as you mentioned above, we would need to find an intuitive way to handle the scenario where the file is not ready yet. If that's not possible, I guess it's not terrible to delay some actions (edit/preview/etc.) by a couple hundred milliseconds the first time the file is generated. One thing to consider is how many files a single page could have, and how that could add up for those first few edits.

As for when to queue a job versus fetching the file immediately (DeferredUpdates or not), we could use $wgCommandLineMode to schedule the work as a job if Phonos is triggered from the CLI.
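That branching could be sketched roughly as below. This assumes the hypothetical `PhonosFileUpdate`/`phonosFileJob` names used earlier in this task; the real Phonos code paths may differ.

```php
<?php
// Hedged sketch: enqueue a job on the CLI, defer the update on web requests.
global $wgCommandLineMode;

if ( $wgCommandLineMode ) {
	// Triggered from the CLI (e.g. a maintenance script): there is no web
	// response to worry about, so hand the work to the job queue.
	JobQueueGroup::singleton()->lazyPush(
		new JobSpecification( 'phonosFileJob', $params )
	);
} else {
	// Web request: run after the response is sent; if it fails,
	// EnqueueableDataUpdate lets it fall back to the job queue.
	DeferredUpdates::addUpdate( new PhonosFileUpdate( $params ) );
}
```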