Create maintenance script to count/delete orphaned Phonos files
Open, Needs Triage | Public | 8 Estimated Story Points

Description

See T320675: Establish Phonos production storage requirements for background info. The concern is that Phonos could generate many orphaned files, especially during preview, that are forever lost in the abyss of Swift. During our initial rollout, we want to monitor how many orphaned files there are. The technical plan to accomplish this includes:

  1. Have Phonos store a page property for file usage (T326163)
  2. Loop through to collect which files are in-use
  3. Loop through directories under /phonos-render and surface files that aren't being used on the wiki
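
Steps 2 and 3 of the plan amount to a set difference between what is stored and what is in use. A minimal sketch in Python, assuming both the in-use file names and the storage listing are already available as lists of strings; the function name and the sample file names are hypothetical and not taken from the Phonos code:

```python
def find_orphans(in_use_files, stored_files):
    """Return stored file names that no page references, sorted for stable output."""
    in_use = set(in_use_files)
    return sorted(name for name in set(stored_files) if name not in in_use)


# Hypothetical data: two files referenced on the wiki, three present in storage.
in_use = ["a1b2.mp3", "c3d4.mp3"]
stored = ["a1b2.mp3", "c3d4.mp3", "e5f6.mp3"]
orphans = find_orphans(in_use, stored)
print(len(orphans), "unused files found.")
```

Building the in-use set once keeps each lookup cheap, which matters if the storage listing is large.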

Acceptance criteria

  • Running the maintenance script should count the files that aren't being used.
  • A --delete flag should be available allowing you to delete these files, in addition to simply counting them.
  • A --wikis flag should be available to limit the script to run only against a supplied comma-separated list of DB names.
  • Go by the sites table to iterate through all wikis (or only those supplied via the --wikis flag)
  • For single-wiki installations or those without a populated sites table, the script just runs on the current wiki
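
The flag handling and the fallback order described in these criteria could look roughly like the sketch below. It is written in Python for illustration only (the real script is a PHP maintenance script). `build_parser` and `resolve_wikis` are hypothetical names, and treating --wikis as an override of the sites table is an assumption.

```python
import argparse


def build_parser():
    """Illustrative parser with the two flags named in the acceptance criteria."""
    parser = argparse.ArgumentParser(description="Count orphaned files (illustrative only).")
    parser.add_argument("--delete", action="store_true", help="Also delete the unused files.")
    parser.add_argument("--wikis", default="", help="Comma-separated list of DB names.")
    return parser


def resolve_wikis(wikis_arg, sites_table, current_wiki):
    """Pick the wikis to scan: explicit list, else the sites table, else the current wiki."""
    requested = [w.strip() for w in wikis_arg.split(",") if w.strip()]
    if requested:
        return requested
    if sites_table:
        return list(sites_table)
    # Single-wiki installation or unpopulated sites table.
    return [current_wiki]


args = build_parser().parse_args(["--wikis", "enwiki,enwiktionary"])
print(resolve_wikis(args.wikis, ["enwiki", "metawiki"], "metawiki"))
```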

QA notes

  • Run the script with php extensions/Phonos/maintenance/countOrphanedFiles.php
  • You can create orphaned files by simply using Phonos, observing that a file was generated (you can do this using only preview, if you want), then removing the <phonos> tag or changing the parameters to it so that a new file is created.
  • If you're testing on your local, you may wish to first run the addSite.php script so that the code goes off of that like it would in production. The command would be something like: php maintenance/addSite.php --language=en --pagepath='http://localhost:8080/wiki/$1' --filepath='http://localhost:8080/w/$1' my_wiki wikipedia, replacing the paths and my_wiki accordingly (with the latter being the database name). wikipedia here is a wiki "family" name and shouldn't matter on your local.

Event Timeline

Cheers! We already have a deleteOldPhonosFiles.php that was essentially copied from Score. I will do a similar thing here for this maintenance script.

MusikAnimal renamed this task from Create maintenance script to count number of orhpaned Phonos files to Create maintenance script to fetch/delete orhpaned Phonos files.Dec 1 2022, 11:27 PM
dom_walden renamed this task from Create maintenance script to fetch/delete orhpaned Phonos files to Create maintenance script to fetch/delete orphaned Phonos files.Dec 2 2022, 8:11 AM

Change 864869 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] Create countOrphanedFiles.php maintenance script with option to delete

https://gerrit.wikimedia.org/r/864869

MusikAnimal renamed this task from Create maintenance script to fetch/delete orphaned Phonos files to Create maintenance script to count/delete orphaned Phonos files.Dec 6 2022, 12:15 AM
MusikAnimal updated the task description. (Show Details)

Loop through the pages on a wiki that are in Category:Pages_that_use_Phonos
Fetch the HTML via RESTBase (as it's really fast)

Sorry for being late to the party, but I'm just wondering if there isn't a nicer way to check usage than parsing the wikitext and extracting it from the HTML? For example, could we set a page property (or multiple, one per file) when the audio file is persisted, and then query for that property? Or have a separate tracking category, just for generated audio usage (to separate it from manual file inclusion)? I'm probably not thinking through it all properly!

Change 864869 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Create countOrphanedFiles.php maintenance script with option to delete

https://gerrit.wikimedia.org/r/864869

Using page props sounds like a great idea! I think that would be a nicer follow-up to this for long-term tracking, though it still wouldn't surface unused files, only the used ones. That's still better than scraping the HTML, though! I have filed T326163: Add page properties for Phonos usage data to look into this further.

QA notes (in addition to task description): The --restbase flag has only been tested in simulation, i.e. by having a page on my local wiki with the same title as a page on a production wiki and setting the --restbase flag to point to that wiki. I don't have access to a deploy host to test it more thoroughly, but it should work on Beta with a value like --restbase="https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/". The option to use RESTBase was just to speed it up a bit; if it doesn't work, I think it's not a big deal, as this maintenance script isn't meant to scale anyway. If we move forward with T326163 then we'll be removing the RESTBase integration entirely.

@MusikAnimal On beta, I guess because Swift is a shared resource, if I run the script on en.wiktionary (which has no phonos tags) it will delete all the files from Swift, even if they are being used on another wiki (e.g. enwiki).

Correct, and the same will happen in production. Having it loop through wikis is a possible future iteration for this script, but for now it's really only meant to be used on a single wiki. Say for the two pilot wikis, we do just a count; both in theory should have a similar count of unused files, since presumably the unused ones are used on the other wiki.

I think T326163 will make it more feasible to run across a wiki farm.

OK, should we warn people not to use the --delete switch for now?

And maybe the --delete param description should be more explicit about how it works so there are no accidents(?)

Having it loop through wikis is a possible future iteration for this script, but for now it's really only meant to be used on a single wiki.

Would we do this by querying the sites table? We would loop through all 1000 sites (or however many there are; I don't think we could query at this point to find out whether Phonos is actually installed, could we?), collecting a list of all files in use, and then check that against the files that are stored. There would be the problem of files that get created while the script is running (they could end up being deleted as unused), but they would be recreated, so this shouldn't be too annoying. The main inefficiency, I think, would be looping through each file and checking it against a massive list of in-use filenames.
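
The cross-wiki approach discussed here can be sketched as follows. The sketch is in Python for illustration only, with hypothetical wiki contents and function names. It assumes each wiki can report its in-use file names and that storage is shared across the farm. Holding the combined in-use names in a hash set makes each per-file check constant time on average, which addresses the worry about comparing against a massive list.

```python
def collect_in_use(files_by_wiki):
    """Union the in-use file names reported by every wiki into one set."""
    in_use = set()
    for files in files_by_wiki.values():
        in_use.update(files)
    return in_use


def unused_in_storage(stored_files, in_use):
    """Stored files referenced by no wiki; each membership test is O(1) on average."""
    return [name for name in stored_files if name not in in_use]


# Hypothetical farm: storage is shared, and enwiktionary uses no files at all.
files_by_wiki = {
    "enwiki": ["a.mp3", "b.mp3"],
    "en_rtlwiki": ["b.mp3", "c.mp3"],
    "enwiktionary": [],
}
stored = ["a.mp3", "b.mp3", "c.mp3", "d.mp3"]
print(unused_in_storage(stored, collect_in_use(files_by_wiki)))
```

Restricting the input to a wiki with no usage, such as enwiktionary in this toy data, marks every stored file as unused, which is the risk raised earlier about running with --delete on a single wiki.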

Change 893838 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] Store usage of Phonos files as page properties

https://gerrit.wikimedia.org/r/893838

Change 893838 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Store usage of Phonos files as page properties

https://gerrit.wikimedia.org/r/893838

Mentioned in SAL (#wikimedia-releng) [2023-03-10T14:28:25Z] <TheresNoTime> (deployment-prep) [samtar@deployment-deploy03 ~]$ scap lock --all --verbose Debugging issue with extension maintenance script, T324233

Mentioned in SAL (#wikimedia-releng) [2023-03-10T15:07:50Z] <TheresNoTime> (deployment-prep) [samtar@deployment-deploy03 ~]$ scap lock --all --verbose Debugging issue with extension maintenance script, T324233

Mentioned in SAL (#wikimedia-releng) [2023-03-10T16:37:40Z] <TheresNoTime> [samtar@deployment-deploy03 ~]$ scap lock --all --verbose Debugging issue with extension maintenance script, T324233

Change 896407 had a related patch set uploaded (by Samtar; author: Samtar):

[mediawiki/extensions/Phonos@master] countOrphanedFiles.php: Handle exception for bad sites table data

https://gerrit.wikimedia.org/r/896407

Change 896407 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] countOrphanedFiles.php: Handle exception for bad sites table data

https://gerrit.wikimedia.org/r/896407

Change 896444 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] CountOrphanedFiles: add 'wikis' flag, use API:Siteinfo for Phonos check

https://gerrit.wikimedia.org/r/896444

Mentioned in SAL (#wikimedia-releng) [2023-03-11T01:22:45Z] <TheresNoTime> [samtar@deployment-deploy03 ~]$ scap lock --all --verbose Debugging issue with extension maintenance script, T324233

Change 896444 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] CountOrphanedFiles: add 'wikis' flag, use API:Siteinfo for Phonos check

https://gerrit.wikimedia.org/r/896444

TheresNoTime subscribed.

QA notes for beta cluster
I ran a few iterations of this script on the beta cluster during testing:

samtar@deployment-mwmaint02:~$ mwscript maintenance/run.php --wiki metawiki /srv/mediawiki/php-master/extensions/Phonos/maintenance/countOrphanedFiles.php --wikis enwiki,enwiktionary
196 in-use files found.
Finding unused files in storage...
33 unused files found.

samtar@deployment-mwmaint02:~$ mwscript maintenance/run.php --wiki metawiki /srv/mediawiki/php-master/extensions/Phonos/maintenance/countOrphanedFiles.php
Error 1049: Unknown database 'dawiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `dawiki`


Error 1049: Unknown database 'idwiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `idwiki`


Error 1049: Unknown database 'minwiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `minwiki`


Error 1049: Unknown database 'mswiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `mswiki`


Error 1049: Unknown database 'nnwiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `nnwiki`


Error 1049: Unknown database 'nowiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `nowiki`


Error 1049: Unknown database 'uzwiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `uzwiki`


196 in-use files found. 7 sites skipped due to errors.
Finding unused files in storage...
34 unused files found.

QA notes for beta cluster
I ran a few iterations of this script on the beta cluster during testing:

…

Error 1049: Unknown database 'dawiki'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::doSelectDomain
Query: USE `dawiki`

And to clarify, these errors are expected as the sites table apparently contains either invalid data or sites that don't actually exist on Beta.

@MusikAnimal I am finding some discrepancies between what the maintenance script reports and my own counts when querying the database and using python.

I think PHP's union operator behaves unexpectedly.

From https://www.php.net/manual/en/language.operators.array.php

The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored.

php > var_dump( ["foo", "bar"] + ["quux", "blah"] );
php shell code:1:
array(2) {
  [0] =>
  string(3) "foo"
  [1] =>
  string(3) "bar"
}
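
To make the difference concrete, here is a small Python emulation of the two PHP behaviours for list-style arrays. It is an illustration only: `php_union` and `php_array_merge_lists` are hypothetical helper names, and the second one covers only integer-keyed arrays, where array_merge appends the values and renumbers the keys.

```python
def php_union(left, right):
    """Mimic PHP's `+` on arrays: keep the left value when a key exists in both."""
    result = dict(left)
    for key, value in right.items():
        result.setdefault(key, value)
    return result


def php_array_merge_lists(left, right):
    """Mimic array_merge() for integer-keyed arrays: append and renumber from 0."""
    values = list(left.values()) + list(right.values())
    return dict(enumerate(values))


# PHP list-style arrays are maps keyed 0, 1, 2, ...
a = dict(enumerate(["foo", "bar"]))
b = dict(enumerate(["quux", "blah"]))
print(list(php_union(a, b).values()))
print(list(php_array_merge_lists(a, b).values()))
```

With an empty left-hand array the two operations yield the same values, so the difference only shows once a second non-empty list is combined with the first.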

Change 902138 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] CountOrphanedFiles: use array_merge instead of union operator

https://gerrit.wikimedia.org/r/902138

…
From https://www.php.net/manual/en/language.operators.array.php

The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored.

…

TIL! Thank you for pointing this out! I wonder in how many other codebases I've incorrectly used the union operator instead of array_merge...

For the record, I tested against the state of my local where I know which files are unused. In my case, this bug didn't surface and it reported the correct unused files. So it's an interesting bug that probably would have gone unnoticed had you not spoken up! Thanks again :)

Patch is now awaiting review.

Change 902138 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] CountOrphanedFiles: use array_merge instead of union operator

https://gerrit.wikimedia.org/r/902138

@MusikAnimal I am getting some discrepancies between the number of files reported by countOrphanedFiles.php and Swift. I believe the number of used files is correct (based on my own calculations), so I assume the discrepancy is in the number of unused files.

dwalden@deployment-ms-fe04:~$ swift list -l global-data-phonos-render -A http://deployment-ms-fe04.deployment-prep.eqiad1.wikimedia.cloud/auth/v1.0 -U mw:media -K ******* | wc
   1667    8331  163281

Total number of files reported by Swift = 1666 (swift list returns an extra line).

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki
...
355 in-use files found. 7 sites skipped due to errors.
Finding unused files in storage...
1322 unused files found.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki,en_rtlwiki,enwiktionary
355 in-use files found.
Finding unused files in storage...
1322 unused files found.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki,en_rtlwiki
355 in-use files found.
Finding unused files in storage...
1322 unused files found.

Total number of files = 355 + 1322 = 1677

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki
338 in-use files found.
Finding unused files in storage...
1334 unused files found.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki,enwiktionary
338 in-use files found.
Finding unused files in storage...
1334 unused files found.

Total number of files = 338 + 1334 = 1672

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis en_rtlwiki
102 in-use files found.
Finding unused files in storage...
1569 unused files found.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis en_rtlwiki,enwiktionary
102 in-use files found.
Finding unused files in storage...
1569 unused files found.

Total number of files = 102 + 1569 = 1671

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiktionary
0 in-use files found.
Finding unused files in storage...
1666 unused files found.

This is the same number as reported by Swift.
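
The arithmetic in the runs above can be tabulated with a short Python sketch. The figures are copied from the output in this comment and only the helper name `excess` is made up. If the script computes "unused" as stored files minus the set of in-use names (an assumption about its internals), then the excess over the Swift total equals the number of in-use entries that have no matching stored file or that were counted more than once.

```python
SWIFT_TOTAL = 1666  # files reported by `swift list`, minus the extra line

# (--wikis value, in-use count, unused count) copied from the runs above
runs = [
    ("(all sites)", 355, 1322),
    ("enwiki", 338, 1334),
    ("en_rtlwiki", 102, 1569),
    ("enwiktionary", 0, 1666),
]


def excess(in_use, unused, total=SWIFT_TOTAL):
    """How far in-use + unused overshoots the number of files in storage."""
    return in_use + unused - total


for wikis, in_use, unused in runs:
    print(f"{wikis}: {in_use} + {unused} = {in_use + unused} (excess {excess(in_use, unused)})")
```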

@dom_walden Before I do more debugging, let me double-check that you've recently run the refreshLinks.php script as you did for T326163? Beta is weird so I wouldn't be surprised if there's data missing there. Also it looks like there's at least one page on metawiki that uses Phonos, so you'll want to include that in the --wikis list as well. I know it's a bit confusing because you're running the script itself on metawiki.

This is really hard to test locally because I don't have a wiki farm, or Swift for that matter. If you're still seeing discrepancies after running refreshLinks again and including metawiki, then I might seek deployment rights for Beta so I can debug there.

@MusikAnimal I just ran refreshLinks.php on enwiki, en_rtlwiki, enwiktionary and metawiki.

swift list returns 1055 files in total.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki
...
356 in-use files found. 7 sites skipped due to errors.
Finding unused files in storage...
711 unused files found.

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki,en_rtlwiki,enwiktionary,metawiki
356 in-use files found.
Finding unused files in storage...
711 unused files found.

356 + 711 = 1067

dwalden@deployment-deploy03:~$ mwscript extensions/Phonos/maintenance/countOrphanedFiles.php --wiki=metawiki --wikis enwiki
338 in-use files found.
Finding unused files in storage...
723 unused files found.

338 + 723 = 1061

This has been sitting here for a while. I was not able to reproduce the possible remaining bug Dom found at T324233#8784518 on my local, and diagnosing it further may require assistance from a deployer. I'm going to unlick this cookie.