
[8 hours] Investigate why phonosIPAFilePersist jobs aren't being generated
Closed, Resolved | Public | Spike

Description

On both the beta cluster and test.wikipedia, htmlCacheUpdate jobs (and thus phonosIPAFilePersist jobs) are not being triggered as expected.

Testing on a local MediaWiki install, these jobs are triggered as expected:

$ php maintenance/showJobs.php --group
phonosIPAFilePersist: 2 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
refreshLinks: 22 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
htmlCacheUpdate: 19 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
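When comparing environments, it can help to turn this output into numbers that a script can check. The following is a minimal sketch that parses lines in the format shown above; the function name `parse_show_jobs` and the regular expression are illustrative assumptions, not part of MediaWiki.

```python
import re

# Matches one line of `showJobs.php --group` output in the format quoted above.
LINE_RE = re.compile(
    r"^(?P<type>\S+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_show_jobs(output: str) -> dict:
    """Return {job_type: {counter_name: count}} for every recognised line."""
    counts = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            job_type = fields.pop("type")
            counts[job_type] = {name: int(value) for name, value in fields.items()}
    return counts


# The sample below is the output quoted in the task description.
SAMPLE = """\
phonosIPAFilePersist: 2 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
refreshLinks: 22 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
htmlCacheUpdate: 19 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
"""

counts = parse_show_jobs(SAMPLE)
print(counts["phonosIPAFilePersist"]["queued"])
```

A job type that is absent from the returned dictionary was not listed in the output at all, which matches the symptom described for the beta cluster.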

Event Timeline

• JMcLeod_WMF renamed this task from Investigate why phonosIPAFilePersist jobs aren't being generated to [8 hours] Investigate why phonosIPAFilePersist jobs aren't being generated. Jan 4 2023, 6:29 PM
• JMcLeod_WMF added a project: Spike.
• Restricted Application changed the subtype of this task from "Task" to "Spike". Jan 4 2023, 6:29 PM

I'm now concluding the results of this long-winded investigation; in summary, the issue was:

  • phonosIPAFilePersist jobs (via htmlCacheUpdate) are not triggered on the modification of a template which uses the Phonos feature

Investigation found that:

  • This issue is consistently reproducible on the beta cluster (T325786) as well as on the production cluster (as of today, on both test.wikipedia and en.wikipedia).

This means that:

  • The use of a job queue for Phonos jobs triggered on htmlCacheUpdate may not be appropriate
  • Large scale template rollouts (T324563) may require a manual purge of the cache to (re)generate Phonos audio
  • This finding is corroborated by en/other large projects often needing to make "null edits" on high-use templates and their transcluded pages
• NRodriguez subscribed.

@TheresNoTime do you believe a new task is needed to address what came of that investigation?

TheresNoTime do you believe a new task is needed to address what came of that investigation?

Yeah, specifically this part:

The use of a job queue for Phonos jobs triggered on htmlCacheUpdate may not be appropriate
… may require a manual purge of the cache to (re)generate Phonos audio

I ran into this when working on T326163/T324233. For that, we're setting page props, which happens during the LinksUpdate phase. Perhaps instead of pushing the job on htmlCacheUpdate we could use the LinksUpdate hook? That would seemingly eliminate the need for null edits. cc @dmaza who worked on the job queuing (T318086).

I'm now concluding the results of this long-winded investigation; in summary, the issue was:

  • phonosIPAFilePersist jobs (via htmlCacheUpdate) are not triggered on the modification of a template which uses the Phonos feature

Investigation found that:

  • This issue is consistently reproducible on the beta cluster (T325786) as well as on the production cluster (as of today, on both test.wikipedia and en.wikipedia).

This means that:

  • The use of a job queue for Phonos jobs triggered on htmlCacheUpdate may not be appropriate
  • Large scale template rollouts (T324563) may require a manual purge of the cache to (re)generate Phonos audio
  • This finding is corroborated by en/other large projects often needing to make "null edits" on high-use templates and their transcluded pages

@TheresNoTime: Sorry, I have many questions:

  • Do we understand why it works locally but not in production or beta? If htmlCacheUpdate never runs, then phonosIPAFilePersist will never be scheduled, since it needs the parser to run from another job or script. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Phonos/+/refs/heads/master/includes/Phonos.php#186
  • I see that you reported T325786 and it looks like it's being treated as a bug. Meaning that htmlCacheUpdate should in fact run when editing a template for those pages transcluding it. Is this not the case?
  • QUESTION: in production right now, if I edit a template, how are the pages transcluding that template re-parsed? Is it not via htmlCacheUpdate?

The use of a job queue for Phonos jobs triggered on htmlCacheUpdate may not be appropriate

  • Is this only because htmlCacheUpdate is not running?

This finding is corroborated by en/other large projects often needing to make "null edits" on high-use templates and their transcluded pages

That article implies that this should not be an issue for less-used templates, which makes me think that they are somehow being re-parsed, if not by htmlCacheUpdate then by something else(?). The template in our tests is definitely not "high-use", so it should work without having to do a null edit.

I ran into this when working on T326163/T324233. For that, we're setting page props, which happens during the LinksUpdate phase. Perhaps instead of pushing the job on htmlCacheUpdate we could use the LinksUpdate hook? That would seemingly eliminate the need for null edits. cc @dmaza who worked on the job queuing (T318086).

@MusikAnimal: The idea behind phonosIPAFilePersist was to avoid hitting the API rate limit and to unblock batch operations from waiting on Phonos to generate the files. I'm not 100% sure how LinksUpdate works, but from a brief look it seems that it only gets triggered for the article that's being edited. Can you elaborate on how hooking into it can prevent hitting the API rate limit on a large-scale template rollout or an update to the "IPA" template including the phonos tag?

Do we understand why it works locally but not in production or beta? If htmlCacheUpdate never runs, then phonosIPAFilePersist will never be scheduled, since it needs the parser to run from another job or script. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Phonos/+/refs/heads/master/includes/Phonos.php#186

The "out of the box" job queue setup is very different to the job queue runners we have configured in production/beta: "By default, jobs are run at the end of a web request." WMF uses a Kafka job queue.

I see that you reported T325786 and it looks like it's being treated as a bug. Meaning that htmlCacheUpdate should in fact run when editing a template for those pages transcluding it. Is this not the case?

I recently found and fixed T332211: deployment-docker-changeprop01: `worker died, restarting`, which was preventing some/all jobs from running on the Beta Cluster. My understanding remains that htmlCacheUpdate should run when "Changing a template (all the pages that transclude the template need updating)".

QUESTION: in production right now, if I edit a template, how are the pages transcluding that template re-parsed? Is it not via htmlCacheUpdate?

I assume so, yes. In testing, this "often, but not always, works" on the production cluster (bar test.wikipedia). However, htmlCacheUpdate is a throttled job type, so the combination of being thrown off by the beta cluster bug, $wgJobBackoffThrottling/$wgUpdateRowsPerJob, and not being able to test in production proper may have skewed the above findings.
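To get a feel for how throttling could delay results in testing, here is a rough back-of-the-envelope sketch. It assumes that $wgUpdateRowsPerJob caps how many pages a single job handles and that $wgJobBackoffThrottling caps work items per second per job runner. Every number below is hypothetical and none is taken from WMF configuration.

```python
import math


def leaf_jobs(backlinks: int, rows_per_job: int) -> int:
    """Number of jobs needed if each job handles at most rows_per_job pages."""
    return math.ceil(backlinks / rows_per_job)


def min_seconds(backlinks: int, items_per_second: float, runners: int = 1) -> float:
    """Rough lower bound on processing time under a per-runner rate cap."""
    return backlinks / (items_per_second * runners)


# Hypothetical values, chosen only to illustrate the arithmetic.
backlinks = 12_000        # pages transcluding the edited template
rows_per_job = 300        # stands in for $wgUpdateRowsPerJob
items_per_second = 20     # stands in for a $wgJobBackoffThrottling entry

print(leaf_jobs(backlinks, rows_per_job))        # 40 jobs
print(min_seconds(backlinks, items_per_second))  # 600.0 seconds on one runner
```

Under these assumed numbers a single runner would need at least ten minutes, so a check made shortly after the template edit could easily conclude that nothing had run.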


Given T332211: deployment-docker-changeprop01: `worker died, restarting`, I can take another look at this on the beta cluster and see if anything different happens now?

The "out of the box" job queue setup is very different to the job queue runners we have configured in production/beta: "By default, jobs are run at the end of a web request." WMF uses a Kafka job queue.

But why is that an issue? It should eventually be picked up anyway. Correct me if I'm wrong about the order of events when using the Kafka job queue:

  1. Template gets edited
  2. At some point in the future htmlCacheUpdate runs and should schedule phonosIPAFilePersist
  3. At some point in a more distant future phonosIPAFilePersist runs and the files are created
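The three steps above can be sketched as a toy in-memory queue. This models only the order of events as described in this comment. None of the names below are MediaWiki or Kafka APIs, and the page titles are made up for illustration.

```python
from collections import deque

# Hypothetical transclusion map: template -> pages that transclude it.
TRANSCLUSIONS = {"Template:TestPhonos2": ["Page A", "Page B"]}

queue = deque()        # pending (job_type, target) pairs
log = []               # job types in the order they were run
generated_files = []   # stand-in for audio files written to storage


def edit_template(template: str) -> None:
    # Step 1: editing the template enqueues a cache-update job.
    queue.append(("htmlCacheUpdate", template))


def run_next_job() -> None:
    job_type, target = queue.popleft()
    log.append(job_type)
    if job_type == "htmlCacheUpdate":
        # Step 2 (as described above): handling this job leads to a
        # persist job being scheduled for each transcluding page.
        for page in TRANSCLUSIONS.get(target, []):
            queue.append(("phonosIPAFilePersist", page))
    elif job_type == "phonosIPAFilePersist":
        # Step 3: the audio file is finally created.
        generated_files.append(f"audio for {target}")


edit_template("Template:TestPhonos2")
while queue:
    run_next_job()

print(log)
print(generated_files)
```

In this model the files always appear eventually, however slow the queue is. If step 2 never schedules the persist jobs, no amount of waiting helps, which is the distinction the question is trying to draw.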

The reason I'm asking is that we need to understand whether we have to find an alternative to what we already have in place because it doesn't work, or whether test/beta are buggy or throttled, so that we don't see consistent results there but it should be fine everywhere else.
Would adding some logging and deploying to mwdebug help troubleshoot this?

Change 901236 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/Phonos@master] Phonos: Add LoggerFactory, and log PhonosIPAFilePersistJob actions

https://gerrit.wikimedia.org/r/901236

Change 901236 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Phonos: Add LoggerFactory, and log PhonosIPAFilePersistJob actions

https://gerrit.wikimedia.org/r/901236

Change 901248 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/mediawiki-config@master] InitialiseSettings-labs: Add `Phonos` channel to `debug`

https://gerrit.wikimedia.org/r/901248

Change 901248 merged by jenkins-bot:

[operations/mediawiki-config@master] InitialiseSettings-labs: Add `Phonos` channel to `debug`

https://gerrit.wikimedia.org/r/901248

Change 901580 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/Phonos@master] Phonos: Debug logging on handleNewFile()

https://gerrit.wikimedia.org/r/901580

Change 901580 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Phonos: Debug logging on handleNewFile()

https://gerrit.wikimedia.org/r/901580

Related: T325594#8714808

CdnCacheUpdate::purge: https://en.wikipedia.beta.wmflabs.org/wiki/Template:TestPhonos2 https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Template:TestPhonos2&action=history https://en.m.wikipedia.beta.wmflabs.org/wiki/Template:TestPhonos2 https://en.m.wikipedia.beta.wmflabs.org/w/index.php?title=Template:TestPhonos2&action=history

Hi, I found this task while working on T313841, and investigating the use of $wgCommandLineMode in Phonos. I assume it's still a problem?

I think this is because RunSingleJob.php, the script used to run jobs in Wikimedia production, does not actually set $wgCommandLineMode (except for T353262, which doesn't help here), because it's actually called over HTTP, not from the command line. The built-in runJobs.php does set it. As a result, your code depending on that never runs.
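A minimal sketch of that failure mode follows. It is a toy model, not the Phonos code. The `Context` class and `on_parse` function are hypothetical, and the "inline" branch is an assumption about what happens when the flag is unset.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    command_line_mode: bool                 # stands in for $wgCommandLineMode
    queued_jobs: list = field(default_factory=list)


def on_parse(ctx: Context, ipa: str) -> str:
    # Assumed gate: audio generation is deferred to the job queue only
    # when the parse happens in command-line mode.
    if ctx.command_line_mode:
        ctx.queued_jobs.append(("phonosIPAFilePersist", ipa))
        return "deferred"
    # Assumption: otherwise the work is attempted inline, with no job queued.
    return "inline"


cli_runner = Context(command_line_mode=True)     # like the built-in runJobs.php
http_runner = Context(command_line_mode=False)   # like a runner invoked over HTTP

print(on_parse(cli_runner, "/foʊnoʊs/"), len(cli_runner.queued_jobs))
print(on_parse(http_runner, "/foʊnoʊs/"), len(http_runner.queued_jobs))
```

In this model the same parse queues a job under the command-line runner but queues nothing under the HTTP-invoked runner, which would explain jobs appearing on a local install and not in production.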

Is there a reason why this work can't just be done via the job queue unconditionally? That seems normal to me for a parser extension; for example, thumbnailing images works in a similar way, and the parser doesn't wait for thumbnails to be ready. Life would be much simpler if we could do that here.

Change 982443 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/Phonos@master] [WIP] Always generate audio files in a job

https://gerrit.wikimedia.org/r/982443

Change 984513 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/Phonos@master] Fix deciding whether to use job queue

https://gerrit.wikimedia.org/r/984513

Change 984513 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Fix deciding whether to use job queue

https://gerrit.wikimedia.org/r/984513

matmarex claimed this task.

Closing optimistically.