Wikimedia Commons structured data dump does not contain all fields, e.g., title
Closed, Resolved · Public · BUG REPORT

Description

If you compare the output of https://commons.wikimedia.org/wiki/Special:EntityData/M76.json with the same entity in the Wikimedia Commons structured data (entities) dump, you will notice that some fields are missing. The most important one for me is the "title" field, which tells you which file the entity belongs to. Without it, it is hard to determine what the entity is about (you can infer it from the entity ID, because the number in it matches the page_id, but that requires additional data to resolve).
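
For illustration, a minimal sketch of the extra lookup that is currently needed; it only assumes the public APIs named above plus curl and jq, and the jq paths are my own reading of the JSON shapes involved:

# Top-level fields the API exposes for M76 ("title", "pageid", "ns", "modified", ... are among them):
curl -s 'https://commons.wikimedia.org/wiki/Special:EntityData/M76.json' \
  | jq '.entities.M76 | keys'

# Workaround while the dump lacks "title": strip the "M" prefix to get the page_id,
# then resolve it with a separate API request.
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&pageids=76&format=json' \
  | jq -r '.query.pages["76"].title'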

Event Timeline

Restricted Application added a subscriber: Aklapper.

Is there any way I could help to push this further?

I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.

@Mitar: Just to avoid misunderstandings, do you plan to work on this at the Hackathon? :)

I would be interested in doing that, but I will probably need a helping hand. I have a programming background, but zero understanding of where and how this could be fixed. My understanding is that the hackathon would be suitable for this? Do I have to create a session? How do I find other people who might be able to help me?

Just had a chat with @Mitar on IRC about a possible approach to this and they will write something up here now! :)

So the plan is:

  • Take addPageInfoToRecord from repo/includes/Api/ResultBuilder.php (https://github.com/wikimedia/Wikibase/blob/44b2d731c507d40472cf6f1392bc378166e2a45f/repo/includes/Api/ResultBuilder.php#L354-L362) and move it out to repo/includes.
  • Then have both the API and the dump generation in generateDumpForEntityId call into it, so those fields are added to the dumped entity as well. Currently there is already $data['lastrevid'] = $revision->getRevisionId(); in generateDumpForEntityId, but reusing that function will also add title and a few other fields (ns, modified, pageid), which I think is great for getting parity between the API and the dumps (see the sketch after this list).
  • It was suggested to me that this should be behind a flag/setting. I am not sure that is really needed, but I could add it as an opt-out setting?
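
To make the intended parity concrete, here is a small sketch of the page-level fields in question as the API already exposes them; lastrevid is the one the dump already has, and pageid, ns, title and modified are the ones it would gain. The jq expression is mine, not part of the proposed change:

curl -s 'https://commons.wikimedia.org/wiki/Special:EntityData/M76.json' \
  | jq '.entities.M76 | {pageid, ns, title, lastrevid, modified}'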

Change 793934 had a related patch set uploaded (by Mitar; author: Mitar):

[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields

https://gerrit.wikimedia.org/r/793934

I made a first pass. Feedback welcome.

I ran the tests using composer run-script test, but I do not think they actually ran: I had a bug in JsonDumpGeneratorTest and no error was reported. I am also not sure whether getMockEntityTitleStoreLookup has been implemented correctly.
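
In case it helps with reproducing this: a hypothetical way to run just that test class from a MediaWiki core checkout with Wikibase installed. The composer script name is an assumption about a standard development setup, not something verified here:

# Locate the test class first, rather than assuming its path:
find extensions/Wikibase -name JsonDumpGeneratorTest.php

# Run only that class through MediaWiki's PHPUnit entry point
# (script name assumed to exist in core's composer.json):
composer phpunit:entrypoint -- --filter JsonDumpGeneratorTest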

I made another pass, adding a configuration option to not include page metadata (so the dump is then generated without title and the other page metadata fields).

Change 793934 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields.

https://gerrit.wikimedia.org/r/793934

So the fix to the dump script has been merged into the Wikibase extension. It is gated behind a CLI switch. What is the process for getting this turned on for the dumps from Wikimedia Commons (and ideally also for Wikidata)?

After the code change rolls out with the deployment train next week, you could submit a Gerrit change to Puppet to add the flag (to dumpwikibasejson.sh, found via Codesearch), and add it to a Puppet request window. It should be safe enough to try it out for one week’s dumps – if it doesn’t work as expected, or degrades performance by too much, it can be reverted again.
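
A hedged sketch of how one might locate the relevant file in a local checkout; the clone URL is the standard anonymous Gerrit one, and the grep is only a way to find where dumpwikibasejson.sh lives, not a statement about its contents:

git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
grep -rln dumpwikibasejson.sh .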

I can also try out the change in production (create a tiny partial dump just to see what the JSON looks like), once the code change has rolled out (next Thursday or Friday, probably). Feel free to remind me if I forget ^^

Awesome. Thanks for explaining.

Change 802921 had a related patch set uploaded (by Mitar; author: Mitar):

[operations/puppet@production] Add page metadata to Wikibase JSON dumps

https://gerrit.wikimedia.org/r/802921

Done. Added it to June 7 puppet request window. Please review/advise if I did something wrong.

Seems to work in production:

lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet 2>/dev/null | jq . | tail
      "badges": []
    },
    "hewikiquote": {
      "site": "hewikiquote",
      "title": "אפריקה",
      "badges": []
    }
  },
  "lastrevid": 1652816527
}
lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet --page-metadata 2>/dev/null | jq . | tail
      "title": "אפריקה",
      "badges": []
    }
  },
  "pageid": 111,
  "ns": 0,
  "title": "Q15",
  "lastrevid": 1652816527,
  "modified": "2022-06-02T12:56:04Z"
}

Just flagging up that this was originally about the Commons data dumps, but I think this Puppet change would apply to both Wikidata and Commons.
@Lucas_Werkmeister_WMDE I guess Lydia should give this a stamp of approval.
I wonder if it would increase the compressed dump size much? (probably not)

Yes, this change should fix both this issue and T278031.

So what is the next step here?

@Lucas_Werkmeister_WMDE I guess Lydia should give this a stamp of approval.

🆗

I wonder if it would increase the compressed dump size much? (probably not)

Yeah that is also my only worry. But I fear we'll have to deal with subsetting sooner or later anyway now.

I grabbed a set of 500 random entity IDs from the API (P29604) and ran a dump for them; if I did my maths right, the dumps would grow by ca. 0.5% before compression:

lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | wc -c
8160550
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | wc -c
8200010

Or by ca. 2% after compression:

lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | gzip -9 | wc -c
736735
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | gzip -9 | wc -c
751205
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | bzip2 | wc -c
521147
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | bzip2 | wc -c
532060

I tried it with a larger set of 10000 entity IDs (P29605) and got fairly similar results – 0.5% before compression, 1.9% after gzip, 1.7% after bzip2. For comparison, between 2022-04-25 and 2022-06-06 (the oldest and most recent dumps we still store), the dumps grew by 1.04% (wikidata-20220425-all.json.gz: 111585957308 bytes; wikidata-20220606-all.json.gz: 112743720511 bytes). So, in terms of dump size, this would be a jump ahead of roughly three months: when we deploy this patch, the next dump will immediately reach a size that we otherwise wouldn’t expect for roughly three months.
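
For transparency, the percentages can be reproduced from the byte counts above (the 500-ID measurements) and the full dump sizes just quoted; bc is only used as a calculator here:

# ~0.48% uncompressed
echo 'scale=6; (8200010 - 8160550) / 8160550 * 100' | bc
# ~1.96% after gzip, ~2.09% after bzip2
echo 'scale=6; (751205 - 736735) / 736735 * 100' | bc
echo 'scale=6; (532060 - 521147) / 521147 * 100' | bc
# ~1.04% growth of the full wikidata dump between 2022-04-25 and 2022-06-06
echo 'scale=6; (112743720511 - 111585957308) / 111585957308 * 100' | bc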

That’s more than I would’ve expected, but tolerable and on the whole still worth it, I think. (As Lydia says, the need for subsetting already looms on the horizon with or without this change.)

What is this subsetting you are talking about?

Thanks for doing the measurements.

It is about providing several dumps, each containing a subset of the data, to make them smaller and easier to work with.

@Mitar your CR is all approved, please ping me on IRC (jbond) when you are around and I can merge it.

Awesome. I will try to do so when you are online, but feel free also to just merge it without me. I do not know if I can be of much help being around anyway. :-)

Change 802921 merged by Jbond:

[operations/puppet@production] Add page metadata to Wikibase JSON dumps

https://gerrit.wikimedia.org/r/802921

So will this now be included in the next dump run? Or is some deployment still necessary?

I checked commons-20220620-mediainfo.json.bz2 and it contains the title field (alongside the other fields that are present in the API).
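
One way to spot-check this, assuming the usual layout of Wikibase JSON dumps (an opening "[" on the first line, then one entity per line, each ending with a comma); the file path is wherever your local copy lives:

# Show the page metadata of the first entity record in the dump.
bzcat commons-20220620-mediainfo.json.bz2 \
  | sed -n 2p | sed 's/,$//' \
  | jq '{pageid, ns, title, modified, lastrevid}'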

Mitar claimed this task.

And the dump size has indeed increased somewhat:

date      gz (bytes)    bz2 (bytes)
20220530  112573430297  74122394696
20220606  112743720511  74236824815
20220613  113026643896  74443521341
20220620  115432876924  75966141297

That’s 2.1% for gz and 2.0% for bz2 compared to the last dump without these fields (so the ~2% growth includes both the new fields and whatever other data was added that week).