
Filename convention is not easy to follow for dumps using a `precombine` step
Closed, Resolved · Public

Description

When dumps use a precombine step, files are sometimes generated that contain the whole precombined data.
For instance, for the enwiki dumps of 20210301:

"xmlstubsdumprecombine": {
      "status": "done",
      "updated": "2021-03-02 10:06:33",
      "files": {
        "enwiki-20210301-stub-meta-history.xml.gz": {
          "size": 71649377017,
          "url": "/enwiki/20210301/enwiki-20210301-stub-meta-history.xml.gz",
          "md5": "6b36c6a987f037fd9796733d68db01e2",
          "sha1": "84c049dc89bbbe84e728b7555e7cde4e2c36639f"
        },
        ...

This precombine step means that another step has also run, generating split data:

"xmlstubsdump": {
      "status": "done",
      "updated": "2021-03-02 08:55:50",
      "files": {
        "enwiki-20210301-stub-meta-history1.xml.gz": {
          "size": 2671067034,
          "url": "/enwiki/20210301/enwiki-20210301-stub-meta-history1.xml.gz",
          "md5": "ff7bc25cf94f6a5c62dd068955253bbc",
          "sha1": "f4715663da4fd4ee4197252ee8277243672af2e0"
        },
        "enwiki-20210301-stub-meta-history2.xml.gz": {
          "size": 2784139219,
          "url": "/enwiki/20210301/enwiki-20210301-stub-meta-history2.xml.gz",
          "md5": "f82059cb1b9412d92c660361c09fd3e1",
          "sha1": "0066a70085c82519c1e91bf774b97a66c81dde29"
        },
        "enwiki-20210301-stub-meta-history3.xml.gz": {
          "size": 2773741969,
          "url": "/enwiki/20210301/enwiki-20210301-stub-meta-history3.xml.gz",
          "md5": "1c9f37b2fc9f2afafe8d13965f6b2911",
          "sha1": "4145ecf7726eea338787a2cf92e8f39a3a251857"
        },
        ...

The filename used for the precombined data is the same as the one that would be produced by a job with no precombine step, as for srwiki on 20210301 for instance:

"xmlstubsdump": {
      "status": "done",
      "updated": "2021-03-01 15:11:08",
      "files": {
        "srwiki-20210301-stub-meta-history.xml.gz": {
          "size": 1459478238,
          "url": "/srwiki/20210301/srwiki-20210301-stub-meta-history.xml.gz",
          "md5": "b54224a5c74aba5a91631f411b2fd0c3",
          "sha1": "aa3b63c8b2eaf8c62186bf597f833702678cebdf"
        },
        ...

The fact that jobs with a `precombine` step generate the same filename pattern as jobs without such a step makes grabbing files across all projects harder: we need extra logic to decide whether to fetch the global file or the split files (and not both).
See T278551 for a problem due to this exact case.
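
For illustration only, a minimal sketch of the extra per-wiki logic this implies. The helper is hypothetical and assumes the job entries shown above sit under the top-level "jobs" key of dumpstatus.json:

```python
import json
import urllib.request

def stub_history_files(wiki, date):
    """Return the stub-meta-history URLs to fetch for one wiki and date.

    Prefer the recombined single file when the xmlstubsdumprecombine job
    exists and is done (big wikis); otherwise fall back to the plain
    xmlstubsdump output.
    """
    url = f"https://dumps.wikimedia.org/{wiki}/{date}/dumpstatus.json"
    with urllib.request.urlopen(url) as resp:
        jobs = json.load(resp)["jobs"]  # assumed layout: jobs -> job name -> files

    recombine = jobs.get("xmlstubsdumprecombine")
    if recombine and recombine.get("status") == "done":
        return sorted(f["url"] for f in recombine["files"].values())
    return sorted(f["url"] for f in jobs["xmlstubsdump"]["files"].values())
```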

Event Timeline

We don't split data; what happens is that we generate pieces of the stubs in parallel for larger wikis and then "recombine" them together. For most wikis there is no need to do this, so we just generate one file with all the data immediately.

If there is no recombine step, then you have all the data directly from the xmlstubsdump job. If the filenames end in .num.xml.gz then they are partial. If you are looking for all the data in one file, you always want the stub-<type> name without the numbers.

The wikis where you need to worry about recombined files are the so-called "big wikis" (see https://github.com/wikimedia/puppet/blob/production/modules/snapshot/manifests/dumps/dblists.pp ) plus enwiki and wikidatawiki.

You can consider that if there is only one file output from the xmlstubsdump job, get that; if there is more than one, use the recombine job output.
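
For illustration, a minimal sketch of that rule. The helper is hypothetical, operates on the same "jobs" mapping shown in the description, and assumes both jobs have already finished:

```python
def pick_stub_files(jobs):
    """Rule above: a single xmlstubsdump output file means take it directly;
    several numbered pieces means take the recombine job's output instead."""
    pieces = jobs["xmlstubsdump"]["files"]
    if len(pieces) == 1:
        return list(pieces)
    return list(jobs["xmlstubsdumprecombine"]["files"])
```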

At some point in the future we might move everything to names of the form pnnnpmmm but that's a ways off and would need some thought.

Thank you for the explanation @ArielGlenn.
Let me clarify my two concerns (they are minor):

  • job names differ for the same output in dumpstatus.json: for small wikis you should look at xmlstubsdump, while for big wikis you should look at xmlstubsdumprecombine (this makes it harder to monitor all projects).
  • filenames share the same pattern across different jobs, which makes it confusing to fetch data across multiple projects with a single rule. For pages-meta-current, you should get PROJECT-DATE-pages-meta-current.xml.bz2 even if PROJECT-DATE-pages-meta-current*.xml*.bz2 files exist, since small projects won't have the split files and you want all projects to match. For pages-meta-history, you should get PROJECT-DATE-pages-meta-history*.xml*.bz2, as there are supposedly never both a single file and split-by-pages files (a sketch of these two rules follows below).
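
A minimal sketch of those two selection rules. The helper and its signature are hypothetical; `filenames` is assumed to be the flat list of names taken from the project's dumpstatus.json:

```python
import fnmatch

def wanted_files(wiki, date, filenames):
    """Select pages-meta-current and pages-meta-history downloads for one project."""
    # pages-meta-current: take only the unsplit file, which every project is
    # expected to have (either directly or from the recombine step).
    current = [n for n in filenames
               if n == f"{wiki}-{date}-pages-meta-current.xml.bz2"]
    # pages-meta-history: the wildcard matches both the single file (small wikis)
    # and the split-by-pages files (big wikis); the assumption above is that the
    # two forms never coexist for one project.
    history = [n for n in filenames
               if fnmatch.fnmatch(n, f"{wiki}-{date}-pages-meta-history*.xml*.bz2")]
    return current + history
```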

For history we never recombine; those files are too large, in our judgment, to be useful for download, and they are also much too slow to create. This is the side effect of having healthy wiki communities creating lots of data. :-)

As I understand it, your main concern here is to have a simple sort of logic to decide which file(s) to get.
And while you can figure that out by waiting a little later in the run to see whether each file you want shows up or not, that's not very convenient for people who don't just pick up all files as they appear for their favorite wiki(s).
Would it help you to have a list of the jobs that will be run for a given wiki, written out to a status file like the dumpruninfo.json and report.json files? Then you could see right away that one wiki has a recombine job for step X and grab that file when it shows up, while some other wiki doesn't have a recombine step for that job, so you grab the files from the regular job when they show up.

Alternatively you can decide to always get the file(s) from the non-recombine version of the job, and the names of the downloadable file(s) associated with the job show up in report.json when the job is complete.

What are your thoughts?

I have implemented some more logic to get the files we need, so no real need to change here.
This task was more about things to keep in mind if, for instance, filenames change at some point :)
Feel free to close it if it's not useful. Thank you for your explanations :)

ArielGlenn claimed this task.

Ah ha! Well, at some point in the future we might try to move all files to the -pnnnpmmm naming format, and recombine only where the final output file would be smaller than some cutoff, but that's quite some time off for this project. Which I guess means that I can close this now :-) Thanks for your comments, and I will bear them in mind when that time comes!