
Siteinfo v2 format job needs to be fixed up
Closed, Resolved, Public

Description

The original patch, via the default api jobs yaml file, went in with the same job name for both siteinfo and siteinfov2. As a result, the v2 format job was attempted repeatedly on each wiki even though its output file was present and the job was marked as complete.

Event Timeline

ArielGlenn created this task.

In the short run I adjusted the default api jobs yaml file (see https://gerrit.wikimedia.org/r/c/operations/dumps/+/761287) and edited the job name in the dumpruninfo status file to reflect the change for all wikis except wikidatawiki, which has a dump process running. All other dump processes were stopped first: I added the file "exit.txt" in the dumps branch on the hosts snapshot1010, 1012 and 1013, waited a few minutes for those jobs to exit, then shot the remaining scheduler and let the systemd timer exit on its own. (1009 is a testbed; 1011 is running wikidata.)
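For anyone unfamiliar with the "exit.txt" approach, here is a minimal sketch of the general stop-file pattern: a runner checks for the file between jobs and exits cleanly if it is present. This is not the actual dumps scheduler code; the function names and the directory layout are hypothetical, and only the file name "exit.txt" comes from the comment above.

```python
import os
import tempfile


def stop_requested(directory, stop_file="exit.txt"):
    """Return True if the stop file exists in the given directory."""
    return os.path.exists(os.path.join(directory, stop_file))


def run_until_stopped(directory, jobs):
    """Run (name, callable) jobs in order, checking for the stop file before each one.

    Hypothetical runner: the real scheduler may check at different points.
    """
    finished = []
    for name, job in jobs:
        if stop_requested(directory):
            break  # exit between jobs rather than killing one mid-run
        job()
        finished.append(name)
    return finished


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        def first_job():
            # Simulate an operator dropping the stop file while a job is running.
            open(os.path.join(workdir, "exit.txt"), "w").close()

        print(run_until_stopped(workdir, [("first", first_job), ("second", lambda: None)]))
```

The point of the pattern is that a job already in progress finishes normally, which is why a few minutes of waiting were needed before the remaining scheduler could be dealt with.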

I then ran a noop job on all small wikis and on enwiki; the "big" wikis (/etc/dumps/dblists/bigwikis.dblist) have not reached this job yet so nothing needed to be done there, same for wikidatawiki.

I have a sneaking suspicion that the output file name may be "wrong": it now looks like e.g. 20220201/elwiki-20220201-siteinfo2-namespaces.json.gz when it should probably look like 20220201/elwiki-20220201-siteinfo2-namespacesv2.json.gz. I'll check that with some testing and either adjust the yaml file again if possible, or just let the next run straighten out the files from that run forward. In any case the current files should have proper links and status info and be downloadable, and that's what matters most.

The name in the file (namespaces or namespacesv2) is taken from the job name, so the current file names are wrong, but moving them would be fiddly, since we'd need to regenerate all the status files. I think I'll let this run stay as is, given that this job is new anyway, and everything should be OK for the run on the 20th of the month.
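To make the dependency on the job name concrete, here is a rough sketch of how the file name appears to be put together, based only on the two example paths above. The function and its parameters are hypothetical and are not the real code in the dumps repo.

```python
def siteinfo_output_path(wiki, date, job_name, prefix="siteinfo2"):
    """Build a path relative to the wiki's dump directory.

    Assumption: the last component before the extension is the job name,
    as suggested by the example file names in this task.
    """
    return f"{date}/{wiki}-{date}-{prefix}-{job_name}.json.gz"


if __name__ == "__main__":
    # Job name as adjusted for the current run:
    print(siteinfo_output_path("elwiki", "20220201", "namespaces"))
    # Job name that would give the probably intended file name:
    print(siteinfo_output_path("elwiki", "20220201", "namespacesv2"))
```

Since the status files record the file names too, changing the job name after the fact means touching both the files and the status records, which is the fiddly part.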

Change 761532 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/dumps@master] Add a fixup script that does bulk noop jobs across wikis

https://gerrit.wikimedia.org/r/761532

Change 761532 merged by jenkins-bot:

[operations/dumps@master] Add a fixup script that does bulk noop jobs across wikis

https://gerrit.wikimedia.org/r/761532

Change 761600 had a related patch set uploaded (by ArielGlenn; author: ArielGlenn):

[operations/dumps@master] don't allow api jobs to have the same name

https://gerrit.wikimedia.org/r/761600

Change 761600 merged by jenkins-bot:

[operations/dumps@master] don't allow api jobs to have the same name

https://gerrit.wikimedia.org/r/761600
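For illustration, a check of the kind this change describes could look roughly like the sketch below. It is not the merged patch; the shape of the job entries (a list of dicts with "type" and "job" keys) is an assumption made for the example, and the real yaml layout may differ.

```python
from collections import Counter


def find_duplicate_job_names(jobs):
    """Return the sorted list of job names that occur more than once."""
    counts = Counter(entry["job"] for entry in jobs)
    return sorted(name for name, count in counts.items() if count > 1)


def validate_job_names(jobs):
    """Raise ValueError if any two api jobs share a name."""
    duplicates = find_duplicate_job_names(jobs)
    if duplicates:
        raise ValueError("duplicate api job names: " + ", ".join(duplicates))


if __name__ == "__main__":
    # Hypothetical config resembling the original problem: two job types, one name.
    bad_config = [
        {"type": "siteinfo", "job": "namespaces"},
        {"type": "siteinfov2", "job": "namespaces"},
    ]
    try:
        validate_job_names(bad_config)
    except ValueError as err:
        print(err)
```

Failing at config load time is cheaper than finding out mid-run, when status files have already been written under the colliding name.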

It was bothering me that siteinfo2-namespacesv2.json in the 2022-02-01 dump had two 2s in it. Seems like siteinfo-namespacesv2.json would be a less confusing name. But I looked at the code for the Dump and SiteInfoDump and SiteInfoV2Dump classes and couldn't figure out where the v2 is coming from, and maybe it's too late to change it.

I would have preferred siteinfov2-namespaces. But we can't do that, and at this point things are set. The problem is probably that we rely on the job name as part of the filename, in the specific way that we do, and it's definitely too late to change that.

As long as all the files have the same name this time around and don't conflict with the v1 files, I'll consider this fixed. We'll know in a few days.

The new run is complete for almost all wikis and the filenames look like they should, even if I don't love the duplicate "v2" in there either. So I'm going to close this out. We can look at how jobs and classes are named sometime in the glorious future when the whole infra for dumps gets rewritten :-)