
investigate why content history dump of certain wikidata page ranges is so slow
Open, High, Public

Description

During the Apr 20th run I noticed that the dump of certain page ranges took an extremely long time, even using lbzip2 for compression. Check if this is due to a large number of revisions per page, long revision text, or some other reason. If possible, account for this when splitting jobs into page ranges so that no jobs take an abnormally long period of time.

Example slow range: wikidatawiki-20190401-pages-meta-history27.xml-p56915353-p56950553.bz2: 35k pages, 40GB of data (compressed), over 12 hours.

Event Timeline

ArielGlenn triaged this task as High priority. Apr 20 2019, 7:52 PM
ArielGlenn created this task.
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Apr 20 2019, 8:44 PM

There are an awful lot of entries that look like this one:

https://www.wikidata.org/w/index.php?title=Q57009452&offset=&limit=500&action=history
Hundreds of revisions with 1.8 million bytes each! How is this happening?

@ArielGlenn It appears that particle physics is a massively collaborative enterprise, so that the results presented in a single paper can have thousands of people behind them, all of whom are credited (hence the particularly large revision size).

I looked at the author list. But even with around 2000 authors, if we gave each one of them 80 bytes (plenty for first name, last name and an id) we'd have 160k of data, not 1.5 megabytes.
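The arithmetic behind that estimate can be spelled out; the 2000-author count, the 80-byte budget, and the 1.5 MB revision size are the rough figures from the comment above:

```python
# Back-of-the-envelope check using the rough figures quoted above.
authors = 2000
compact_bytes_per_author = 80      # enough for first name, last name, and an id
observed_revision_bytes = 1_500_000

# What a compact credit list would cost vs. what we actually see per revision.
compact_total = authors * compact_bytes_per_author
actual_per_author = observed_revision_bytes / authors

print(compact_total)       # 160000 -> ~160k of data
print(actual_per_author)   # 750.0  -> many times the 80-byte budget
```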

But one author is represented this way:

{
    "mainsnak": {
        "snaktype": "value",
        "property": "P2093",
        "hash": "daf91abd8b2cac6e9057fd945ca97ff06624",
        "datavalue": {
            "value": "G. Akimoto",
            "type": "string"
        }
    },
    "type": "statement",
    "qualifiers": {
        "P1545": [
            {
                "snaktype": "value",
                "property": "P1545",
                "hash": "d8baedaa705c5d31356a6c9dd39d4b5b185d1882",
                "datavalue": {
                    "value": "28",
                    "type": "string"
                }
            }
        ]
    },
    "qualifiers-order": ["P1545"],
    "id": "Q57009452$47838677-5214-47D0-AB3B-9F1F35EE82BB",
    "rank": "normal"
},

(spaces added for readability). That's over 400 bytes per author, and no wonder the article size is ballooning.

Isn't there some way to be more concise in these entries? So far there's only around 250 of them, but each one of them is over 1GB of data for all of its revisions, *compressed*. We kind of expect articles to take their time to get huge...
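The 400-bytes-per-author figure can be checked by serializing the statement above compactly; the hashes and statement id here are simply the ones quoted above, reused for illustration:

```python
import json

# The single author statement quoted above, as a Python dict.
statement = {
    "mainsnak": {
        "snaktype": "value",
        "property": "P2093",
        "hash": "daf91abd8b2cac6e9057fd945ca97ff06624",
        "datavalue": {"value": "G. Akimoto", "type": "string"},
    },
    "type": "statement",
    "qualifiers": {
        "P1545": [
            {
                "snaktype": "value",
                "property": "P1545",
                "hash": "d8baedaa705c5d31356a6c9dd39d4b5b185d1882",
                "datavalue": {"value": "28", "type": "string"},
            }
        ]
    },
    "qualifiers-order": ["P1545"],
    "id": "Q57009452$47838677-5214-47D0-AB3B-9F1F35EE82BB",
    "rank": "normal",
}

# Even with all whitespace stripped, one author still costs over 400 bytes.
size = len(json.dumps(statement, separators=(",", ":")))
print(size)
```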

Change 507268 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] split up page content jobs with max bytes per page range

https://gerrit.wikimedia.org/r/507268

Change 507268 merged by ArielGlenn:
[operations/dumps@master] split up page content jobs with max bytes per page range

https://gerrit.wikimedia.org/r/507268

The above change was deployed last night and will take effect for the new run starting today. We should see results for the wikidata pages-meta-history dump, with files a lot smaller than the 40GB files of some ranges last month. With numerous small jobs we can at least run them in parallel.
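The idea behind the patch can be sketched as a greedy split on cumulative bytes per range. This is a simplified illustration, not the actual operations/dumps code, and `split_page_ranges` is a hypothetical name:

```python
def split_page_ranges(page_sizes, max_bytes):
    """Greedily group (page_id, total_revision_bytes) pairs into page ranges
    whose cumulative size stays near max_bytes, so that no single job
    balloons to tens of gigabytes. A sketch of the approach only."""
    ranges = []
    start = None
    total = 0
    for page_id, size in page_sizes:
        if start is None:
            start, total = page_id, 0
        total += size
        last = page_id
        if total >= max_bytes:
            # Close out the current range; an oversized page gets its own.
            ranges.append((start, last))
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges

# Pages 1-3 are small and share a range; page 4 alone exceeds the cap.
print(split_page_ranges([(1, 10), (2, 10), (3, 10), (4, 100)], max_bytes=25))
# [(1, 3), (4, 4)]
```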

That still doesn't resolve the underlying issue, which is that some more concise way to represent this data is needed.

Perhaps this issue of conciseness in the data model is something worth addressing to @Smalyshev and @Gehel?

I agree that a more efficient format is probably needed, but unfortunately I don't have any immediate ideas; all the dumps I've worked with were RDF and the like, and this one is completely different.

In general, I'm not even sure Wikidata is a good fit for storing data like these, but this is a topic for a different conversation...

> Isn't there some way to be more concise in these entries? So far there's only around 250 of them, but each one of them is over 1GB of data for all of its revisions, *compressed*. We kind of expect articles to take their time to get huge...

I guess this is why for the RDF and JSON dumps we currently only do current revisions, not all revisions.

There are a couple of angles that could make this situation better.

Right now many clients perform multiple sequential API calls in a row to complete a set of edits on an entity, resulting in more revisions than are probably necessary. One ticket that I found covering this is T216881 (I'm sure there are more but I can't find them). Partly related here also is the desire to summarize changes well and automatically in edit summaries, T67846.
I suspect that even if we strongly pursued this route and managed to combine more changes into single revisions, the overall revision creation rate probably wouldn't change all that much.

There is also the size of the JSON that is stored in revisions. There is likely room for optimization here, with some added overhead on the development side of things. In fact, the storage serialization used to be different from the generally exposed serialization, but that changed quite some time ago, I believe to simplify things.

I guess for the all-revisions dumps, we start the process from revision 1 whenever generating a new dump? Is there a reason we don't have some system in place to create dumps in batches of revisions or entities, check each time we generate a dump whether something has been revdeleted in a batch, and regenerate only that batch, otherwise reusing the previously generated dump? Or something similar?

Old revisions are re-read from previous dumps. But even that takes plenty of time: decompressing the old file, then recompressing the already existing content so it can be written out to the new file. Dumps are done by page, so the likelihood of a batch of pages having no new revisions is pretty small for our batch size. And having people download tens of thousands of files, just so that our batch sizes are small enough to get lucky here and there, is pretty unsavory as a solution.
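The decompress-and-recompress cost described above can be illustrated with Python's bz2 module; this is a toy example of the round trip, not the actual dumps code:

```python
import bz2

# Stand-in for a chunk of last month's compressed history file.
old_text = b"<revision>...existing text...</revision>" * 1000
old_dump_chunk = bz2.compress(old_text)

# Even when revision text is reused rather than re-fetched from the database,
# both of these CPU costs are paid again on every run:
text = bz2.decompress(old_dump_chunk)   # reading the old file
new_dump_chunk = bz2.compress(text)     # writing the new file

assert bz2.decompress(new_dump_chunk) == old_text
```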