When exporting articles from a Wikipedia (through [[Special:Export]]) there is no mention of the Wikidata qid corresponding to each article.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T197090 [CLIENT][SW] Wikidata qid of articles is not present in export/dump |
| Resolved | | None | T379615 Exporting articles from a Wikipedia (through [[Special:Export]]) |
Event Timeline
Can we please clarify the requirements for where this info should appear in the XML and how it should look?
e.g. we could add something like this
<wikidataid>Q1</wikidataid>
For an example XML, you can go to Tools/Special pages/Page tools/Export Pages.
Here is an example XML file; we can move/rename the wikidataid as needed.
Hi @SuzanneWood-WMDE, thanks for the questions, the example and the instructions. I was able to create my own XML files, and that was really helpful. I'm missing some info which might have already been discussed last week. Apologies if I'm repeating stuff. I have a few questions that might help me answer where the Wikidata ID should be placed and what it should be called. Some of the questions are UX-related, so we might not be able to get through all of them right now, I'll list them here as a starting point, and we can discuss them from a technical point of view.
- When do editors export articles from Wikipedia?
- Can we tell how often or what types of communities use this feature?
- What is present in the XML export?
It seems that things in the tools bar are not present. So, the info in the "in other projects section" would need to find its way there.
- Are there dependencies that make this a barrier?
- I can imagine that it would be useful to have a QID assuming editors are doing bulk dumps to new wikis or for wikiprojects or other similar initiatives
- What are the impacts of putting the ID in different places?
- Where might editors expect to see the Wikidata QID?
- We can go back and ask whoever reported this for their opinion
- Is there a reason it wasn't there in the first place? e.g. too expensive, the exports might take too long...
- What kinds of workarounds could we propose for these?
Rather than wikidataid we suggest wikibaseitemid, as this will apply to any Wikibase, not specifically to Wikidata.
Backwards compatibility
TLDR: from the documentation below, it looks like we can add the new field to the XML without giving notice, as it would not be a breaking change.
This page about the wikibase dump format says that we follow the Stable Interface Policy, which mentions about XML specifically that “For data formats that allow namespacing, like XML does, names (attribute names, element names) that belong to a namespace not explicitly mentioned by the specification of the data format can be ignored by consumers. Addition and changes to data structures from other namespaces are not considered breaking changes.”
It also says "MediaWiki XML Dumps are not considered a stable interface. MediaWiki XML dumps contain the raw data of page revisions in their internal representation. The internal representation of Wikibase entities is not a stable interface. It has changed significantly in the past, and it may change again in the future. Several different representations of Wikibase content may be present in the same XML dump."
Impact on size of files/time to export
Something to consider: the XML dumps of Wikipedia currently take 2-3 days to run. We would be adding about 100 million more rows of data to these, so it could increase the time they take to run.
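To put a rough number on the size side of this, here is a back-of-the-envelope sketch. It only estimates uncompressed bytes, not export time, and it rests on assumptions that are not settled in this thread: one added line per page, the tag name `wikibaseitem_id`, four spaces of indentation, and a nine-digit QID as a worst case.

```python
# Back-of-the-envelope estimate of extra uncompressed bytes in a dump.
# Assumptions (illustrative only): one added line per page, the tag name
# `wikibaseitem_id`, 4 spaces of indentation, and a 9-digit QID.

PAGES = 100_000_000  # "about 100 million more rows", as stated above


def line_bytes(qid: str, indent: int = 4) -> int:
    """Return the UTF-8 size of one added line, including the newline."""
    line = " " * indent + f"<wikibaseitem_id>{qid}</wikibaseitem_id>\n"
    return len(line.encode("utf-8"))


def total_gigabytes(per_line: int, pages: int = PAGES) -> float:
    """Return the total added size in decimal gigabytes."""
    return per_line * pages / 1e9


if __name__ == "__main__":
    print("with QID :", total_gigabytes(line_bytes("Q123456789")), "GB")
    print("empty tag:", total_gigabytes(line_bytes("")), "GB")
```

The dumps are normally compressed, and a repetitive tag compresses well, so the on-disk effect would likely be smaller than this figure. The time impact would still need to be measured separately.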
A question I have is:
Should we add the additional info into all dumps/exports, or provide a checkbox to opt in to the additional data? The two considerations for this are:
- Backwards compatibility: from the above documentation it looks like adding a new field is fine so for this a checkbox wouldn't be necessary
- Impact on size of files/time to export: it would be good to check what this would be, both internally and externally
From some research I found the following points:
- MediaWiki:Help:Export - Why export? - more efficient than database download as avoids converting HTML to wikicode
- Backup / archive an article (may be particularly relevant for contentious issues where you expect a lot of edits, reversions, vandalism)
- Import article to another Wiki
- Use content in your own MediaWiki instance / fork
- Copy content and wikisyntax to another lang. wiki for in-place translation.
- Special:Export has checkboxes to export the content + Templates + SubTemplates + Gadgets
- Offline access: intermittent or limited internet connectivity, an offline-available version of the article.
Thanks, @Danny_Benjafield_WMDE and @SuzanneWood-WMDE. I've just asked the person who submitted the initial ticket what they want to use the Wikidata QIDs for. Hopefully, we will get a response soon.
Hi Suzanne,
To follow up directly on the questions: adding a new field sounds like a good idea in terms of giving people the ability to opt in or out. Could we follow up on finding out how much longer it would take to export with the QID included?
The users we are keen to provide with Wikidata information include editors who might want to, as Danny noted, copy content and wikisyntax to another language wiki for in-place translation. In an ideal world, we would do some research to see what other ways of doing this exist.
Happy to continue discussing this :)
Thank you :)
About making it opt-in/opt out - I’d suggest we make it non-optional - see details below:
The wording of this original request from Sylvain_WMFr is:
“When exporting articles from a Wikipedia (through [[Special:Export]]), or in the XML dump, there is no mention of the Wikidata qid corresponding to each article. It would be great to have it added, to save further API queries to get it or to avoid having to download a full dump of Wikidata on top of the Wikipedia(s) one.”
Here, when he says ‘the XML dump’ he means the huge dumps of Wikipedia that happen each week, to which this would add about 100 million lines.
Since it needs to be added to those anyway, the largest kind of export, that makes me think it would not be worth adding a checkbox on whether to add it or not, since the biggest impact area would already have it switched on.
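For context, the "further API queries" mentioned in the original request can be sketched as follows. This is a minimal illustration of the current workaround, using the MediaWiki Action API's `prop=pageprops` query with `ppprop=wikibase_item`. The helper names and the sample response are hand-written for illustration; no request is actually sent here.

```python
# Sketch of the per-page lookup that an in-export QID would make unnecessary.
# `build_pageprops_url` and `extract_qids` are hypothetical helper names;
# SAMPLE_RESPONSE is a hand-written illustration, not captured output.
from typing import Dict, List, Optional
from urllib.parse import urlencode


def build_pageprops_url(api_base: str, titles: List[str]) -> str:
    """Build an Action API URL asking for the wikibase_item page property."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


def extract_qids(response: dict) -> Dict[str, Optional[str]]:
    """Map each page title to its QID, or None if the page has no item."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in pages.values()
    }


SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "101": {"pageid": 101, "ns": 0, "title": "Universe",
                    "pageprops": {"wikibase_item": "Q1"}},
            "102": {"pageid": 102, "ns": 0, "title": "Unlinked page"},
        }
    }
}

if __name__ == "__main__":
    print(build_pageprops_url("https://en.wikipedia.org/w/api.php", ["Universe"]))
    print(extract_qids(SAMPLE_RESPONSE))
```

Anyone working from a full dump would have to repeat this lookup in batches for every page, which is the overhead the request is trying to avoid.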
@Ifeatu_Nnaobi_WMDE we can consider whether to include the new tag if the QID is blank. Currently we are assuming we will add it for all pages, so that the schema is consistent (Option 1 below).
Option 1:
Include the tag like this if the QID is blank
<wikibaseitem_id></wikibaseitem_id>
and like this if it is present
<wikibaseitem_id>Q1</wikibaseitem_id>
Advantage: consistent schema, easier to work with the data
Disadvantage: This means that every page exported in the huge dump would still have a line added even if they don't have a wikidata item associated. That might mean the amount of time to export would increase.
Option 2:
Don't include the tag
This means the size/time impact of exporting the big dumps could be less
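To make the trade-off between the two options concrete from a consumer's point of view, here is a small sketch of a reader that copes with both. The sample XML is hand-written and simplified: it omits the export namespace and most page fields, and the position of `wikibaseitem_id` directly under `page` is an assumption, since placement has not been decided.

```python
# Sketch of a consumer that handles Option 1 (empty tag) and Option 2
# (tag omitted) the same way. SAMPLE is a simplified, hand-written fragment;
# the tag's position under <page> is an assumption for illustration.
import xml.etree.ElementTree as ET
from typing import Dict, Optional

SAMPLE = """<mediawiki>
  <page>
    <title>Linked page</title>
    <wikibaseitem_id>Q1</wikibaseitem_id>
  </page>
  <page>
    <title>Empty tag (Option 1)</title>
    <wikibaseitem_id></wikibaseitem_id>
  </page>
  <page>
    <title>No tag (Option 2)</title>
  </page>
</mediawiki>"""


def page_qids(xml_text: str) -> Dict[str, Optional[str]]:
    """Map each page title to its QID, or None when blank or absent."""
    root = ET.fromstring(xml_text)
    result: Dict[str, Optional[str]] = {}
    for page in root.findall("page"):
        title = page.findtext("title", default="")
        # findtext gives "" for an empty element and None for a missing one;
        # `or None` folds both cases into None.
        qid = page.findtext("wikibaseitem_id")
        result[title] = qid or None
    return result


if __name__ == "__main__":
    for title, qid in page_qids(SAMPLE).items():
        print(title, "->", qid)
```

With a tolerant reader like this, either option is workable. The consistency advantage of Option 1 matters mostly for consumers that validate strictly against the schema.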
About the above stable interface policy:
I've realised that this applies to the exports from Wikibase.
However, when that is used for the Wikipedia/Wikidata dumps, they have a different policy here.
The XML should follow a schema, which has changed 11 times in the past. So we would need to update that schema, get the Foundation to communicate the change, etc.
(So a less smooth change than I'd hoped)