Wikimedia Commons structured data dump does not contain all fields, e.g., title
Closed, Resolved · Public · BUG REPORT

Description

If you compare the output of https://commons.wikimedia.org/wiki/Special:EntityData/M76.json with the same entity in the Wikimedia Commons structured data (entities) dump, you will notice that some fields are missing. The most important one for me is the "title" field, which tells you which file the entity belongs to. Without it, it is hard to determine what the entity is about (you can infer it from the entity ID, because the number in it matches the page_id, but that requires additional data to resolve).
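
For illustration, a minimal sketch of the extra lookup that is currently needed; it only assumes the public APIs named above plus curl and jq, and the jq paths are my own reading of the JSON shapes involved:

# Top-level fields the API exposes for M76 ("title", "pageid", "ns", "modified", ... are among them):
curl -s 'https://commons.wikimedia.org/wiki/Special:EntityData/M76.json' \
  | jq '.entities.M76 | keys'

# Workaround while the dump lacks "title": strip the "M" prefix to get the page_id,
# then resolve it with a separate API request.
curl -s 'https://commons.wikimedia.org/w/api.php?action=query&pageids=76&format=json' \
  | jq -r '.query.pages["76"].title'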

Event Timeline

Restricted Application added a subscriber: Aklapper.

Is there any way I could help to push this further?

I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.

@Mitar: Just to avoid misunderstandings, do you plan to work on this at the Hackathon? :)

I would be interested in doing that, but I will probably need a helping hand. I have a programming background, but zero understanding of where and how this could be fixed. My understanding is that the hackathon would be suitable for this? Do I have to create a session? How do I find other people who might be able to help me?

Just had a chat with @Mitar on IRC about a possible approach to this and they will write something up here now! :)

So the plan is:

  • Take addPageInfoToRecord from repo/includes/Api/ResultBuilder.php (https://github.com/wikimedia/Wikibase/blob/44b2d731c507d40472cf6f1392bc378166e2a45f/repo/includes/Api/ResultBuilder.php#L354-L362) and move it out to repo/includes.
  • Then have both the API and the dump generation in generateDumpForEntityId call into it, so those fields are added to the dumped entity as well. Currently there is already $data['lastrevid'] = $revision->getRevisionId(); in generateDumpForEntityId, but reusing that function will also add title and a few other fields (ns, modified, pageid), which I think is great for getting parity between the API and the dumps (see the sketch after this list).
  • It was suggested to me that this should be behind a flag/setting. I am not sure that is really needed, but I could add it as an opt-out setting?
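
To make the intended parity concrete, here is a small sketch of the page-level fields in question as the API already exposes them; lastrevid is the one the dump already has, and pageid, ns, title and modified are the ones it would gain. The jq expression is mine, not part of the proposed change:

curl -s 'https://commons.wikimedia.org/wiki/Special:EntityData/M76.json' \
  | jq '.entities.M76 | {pageid, ns, title, lastrevid, modified}'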

Change 793934 had a related patch set uploaded (by Mitar; author: Mitar):

[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields

https://gerrit.wikimedia.org/r/793934

I made a first pass. Feedback welcome.

I ran the tests using composer run-script test, but I do not think they actually ran: I had a bug in JsonDumpGeneratorTest and no error was reported. I am also not sure whether getMockEntityTitleStoreLookup has been implemented correctly.
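
In case it helps with reproducing this: a hypothetical way to run just that test class from a MediaWiki core checkout with Wikibase installed. The composer script name is an assumption about a standard development setup, not something verified here:

# Locate the test class first, rather than assuming its path:
find extensions/Wikibase -name JsonDumpGeneratorTest.php

# Run only that class through MediaWiki's PHPUnit entry point
# (script name assumed to exist in core's composer.json):
composer phpunit:entrypoint -- --filter JsonDumpGeneratorTest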

I made another pass, adding a configuration option to not include page metadata (so the dump is then generated without title and the other page metadata fields).

Change 793934 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make sure both API and dump include same page metadata fields.

https://gerrit.wikimedia.org/r/793934

So the fix to the dump script has been merged into the Wikibase extension. It is gated behind a CLI switch. What is the process for getting this turned on for the dumps from Wikimedia Commons (and ideally also for Wikidata)?

After the code change rolls out with the deployment train next week, you could submit a Gerrit change to Puppet to add the flag (to dumpwikibasejson.sh, found via Codesearch), and add it to a Puppet request window. It should be safe enough to try it out for one week’s dumps – if it doesn’t work as expected, or degrades performance by too much, it can be reverted again.
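
A hedged sketch of how one might locate the relevant file in a local checkout; the clone URL is the standard anonymous Gerrit one, and the grep is only a way to find where dumpwikibasejson.sh lives, not a statement about its contents:

git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
grep -rln dumpwikibasejson.sh .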

I can also try out the change in production (create a tiny partial dump just to see what the JSON looks like), once the code change has rolled out (next Thursday or Friday, probably). Feel free to remind me if I forget ^^

Awesome. Thanks for explaining.

Change 802921 had a related patch set uploaded (by Mitar; author: Mitar):

[operations/puppet@production] Add page metadata to Wikibase JSON dumps

https://gerrit.wikimedia.org/r/802921

Done. Added it to June 7 puppet request window. Please review/advise if I did something wrong.

Seems to work in production:

lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet 2>/dev/null | jq . | tail
      "badges": []
    },
    "hewikiquote": {
      "site": "hewikiquote",
      "title": "אפריקה",
      "badges": []
    }
  },
  "lastrevid": 1652816527
}
lucaswerkmeister-wmde@mwdebug1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --limit 1 --snippet --page-metadata 2>/dev/null | jq . | tail
      "title": "אפריקה",
      "badges": []
    }
  },
  "pageid": 111,
  "ns": 0,
  "title": "Q15",
  "lastrevid": 1652816527,
  "modified": "2022-06-02T12:56:04Z"
}

Just flagging up that this was originally about the Commons data dumps, but I think this Puppet change would apply to both Wikidata and Commons.
@Lucas_Werkmeister_WMDE I guess Lydia should give this a stamp of approval.
I wonder if it would increase the compressed dump size much? (probably not)

Yes, this change should fix both this issue and T278031.

So what is the next step here?

@Lucas_Werkmeister_WMDE I guess Lydia should give this a stamp of approval.

🆗

I wonder if it would increase the compressed dump size much? (probably not)

Yeah that is also my only worry. But I fear we'll have to deal with subsetting sooner or later anyway now.

I grabbed a set of 500 random entity IDs from the API (P29604) and ran a dump for them; if I did my maths right, the dumps would grow by ca. 0.5% before compression:

lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | wc -c
8160550
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | wc -c
8200010

Or by ca. 2% after compression:

lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | gzip -9 | wc -c
736735
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | gzip -9 | wc -c
751205
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet 2>/dev/null | bzip2 | wc -c
521147
lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/dumpJson.php wikidatawiki --list-file entityids-T301104 --snippet --page-metadata 2>/dev/null | bzip2 | wc -c
532060

I tried it with a larger set of 10000 entity IDs (P29605) and got fairly similar results – 0.5% before compression, 1.9% after gzip, 1.7% after bzip2. For comparison, between 2022-04-25 and 2022-06-06 (the oldest and most recent dumps we still store), the dumps grew by 1.04% (wikidata-20220425-all.json.gz: 111585957308 bytes; wikidata-20220606-all.json.gz: 112743720511 bytes). So, in terms of dump size, this would be a jump ahead of roughly three months: when we deploy this patch, the next dump will immediately reach a size that we otherwise wouldn’t expect for roughly three months.
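
For transparency, the percentages can be reproduced from the byte counts above (the 500-ID measurements) and the full dump sizes just quoted; bc is only used as a calculator here:

# ~0.48% uncompressed
echo 'scale=6; (8200010 - 8160550) / 8160550 * 100' | bc
# ~1.96% after gzip, ~2.09% after bzip2
echo 'scale=6; (751205 - 736735) / 736735 * 100' | bc
echo 'scale=6; (532060 - 521147) / 521147 * 100' | bc
# ~1.04% growth of the full wikidata dump between 2022-04-25 and 2022-06-06
echo 'scale=6; (112743720511 - 111585957308) / 111585957308 * 100' | bc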

That’s more than I would’ve expected, but tolerable and on the whole still worth it, I think. (As Lydia says, the need for subsetting already looms on the horizon with or without this change.)

What is this subsetting you are talking about?

Thanks for doing the measurements.

It is about providing several dumps, each containing a subset of the data, to make them smaller and easier to work with.

@Mitar your CR is all approved, please ping me on IRC (jbond) when you are around and I can merge it.

Awesome. I will try to do so when you are online, but feel free also to just merge it without me. I do not know if I can be of much help being around anyway. :-)

Change 802921 merged by Jbond:

[operations/puppet@production] Add page metadata to Wikibase JSON dumps

https://gerrit.wikimedia.org/r/802921

So will this now be included in the next dump run? Or is some deployment still necessary?

I checked commons-20220620-mediainfo.json.bz2 and it contains the title field (alongside the other fields that are present in the API).
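
One way to spot-check this, assuming the usual layout of Wikibase JSON dumps (an opening "[" on the first line, then one entity per line, each ending with a comma); the file path is wherever your local copy lives:

# Show the page metadata of the first entity record in the dump.
bzcat commons-20220620-mediainfo.json.bz2 \
  | sed -n 2p | sed 's/,$//' \
  | jq '{pageid, ns, title, modified, lastrevid}'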

Mitar claimed this task.

And the dump size has indeed increased somewhat:

date      gz (bytes)    bz2 (bytes)
20220530  112573430297  74122394696
20220606  112743720511  74236824815
20220613  113026643896  74443521341
20220620  115432876924  75966141297

That’s 2.1% for gz and 2.0% for bz2 compared to the last dump without these fields (so the ~2% growth includes both the new fields and whatever other data was added that week).