MCR: Import all slots from XML dumps
Open, Medium, Public

Description

Once we have T174031: MCR: Include all slots in XML dumps, we need to also be able to read/import slots other than the main slot from dumps. This means implementing support for XML schema version 0.11 in WikiImporter. To enable this, we'll probably want to turn WikiRevision into a wrapper for MutableRevisionRecord.
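
As a rough illustration of what reading slots other than the main slot could involve, here is a minimal Python sketch (not MediaWiki code, and not how WikiImporter works). It assumes that a 0.11-style `<revision>` keeps the main slot's `model`, `format` and `text` directly under the revision and carries each additional slot in a `<content>` child with `role`, `origin`, `model`, `format` and `text` elements. That layout, the sample fragment and the `read_slots` helper are all assumptions made for illustration; check them against the actual schema before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical <revision> fragment. The element layout is an assumption
# for illustration and is not copied from a real dump.
SAMPLE_REVISION = """\
<revision>
  <id>1</id>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text>Main slot text</text>
  <content>
    <role>mediainfo</role>
    <origin>1</origin>
    <model>wikibase-mediainfo</model>
    <format>application/json</format>
    <text>{"labels": {}}</text>
  </content>
</revision>
"""


def read_slots(revision_xml):
    """Return {role: {"model", "format", "text"}} for one <revision> element."""
    rev = ET.fromstring(revision_xml)

    def fields(node):
        # findtext() with a bare tag name only looks at direct children.
        return {name: node.findtext(name) for name in ("model", "format", "text")}

    # Main slot: assumed to stay directly under <revision>, as in 0.10.
    slots = {"main": fields(rev)}
    # Extra slots: assumed to be one <content> child per slot, keyed by <role>.
    for content in rev.findall("content"):
        slots[content.findtext("role")] = fields(content)
    return slots


slots = read_slots(SAMPLE_REVISION)
print(sorted(slots))
```

An importer that only understands 0.10 would read the main-slot fields here and ignore the `<content>` child, which is the gap this task is about.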

Event Timeline

daniel created this task. Apr 9 2019, 3:57 PM
Restricted Application added a subscriber: Aklapper. Apr 9 2019, 3:57 PM
daniel added a comment. Apr 9 2019, 3:58 PM

I just realized we never did that. It seems kind of important ;)

WDoranWMF moved this task from MCR to mop on the Core Platform Team board. Jul 26 2019, 6:38 PM
Pchelolo claimed this task. Aug 23 2019, 6:45 PM
Nuria added a subscriber: Nuria. Nov 21 2019, 10:24 PM

From other tickets I gather that there was agreement on a format that would contain all slots; it seems this was version 0.11: https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Schema

But the current dumps on Commons are version="0.10", so are there any dumps that include the slots that in turn contain the structured data?

cc @ArielGlenn, who might know the answer to this question

@Nuria 0.10 is still the default format; we should probably change that. Maybe this could even still make it into 1.34. I suppose we just forgot to move it forward. @CCicalese_WMF, thoughts?

@daniel: just so I understand, since I know little about all this: at this time, the slots that contain the structured data items on, say, a page on Commons are NOT included in the dumps with the page itself. Correct?

Is that structured data being dumped elsewhere on its own?

Not yet; there's a task for that, but it's blocked on a performance issue. See https://phabricator.wikimedia.org/T222497 (the blocker) and https://phabricator.wikimedia.org/T221917 (the dumps task).

Data in slots other than the main slot is not dumped anywhere right now. This was tagged as Not A Blocker (tm) for the MVP. Ask @Abit and @Ramsey-WMF about the reasoning.

daniel added a comment. Edited Nov 22 2019, 11:39 AM

To clarify: the blocker is for the RDF dumps. Including the MediaInfo slot in the XML dump is not blocked on anything; we could just do it. Or am I missing something?

That's right; this was an answer to the question "Is that structured data being dumped elsewhere on its own?" (like the Wikidata entity dumps).

mforns moved this task from Incoming to Radar on the Analytics board. Nov 25 2019, 5:14 PM

Putting this on the CPT clinic duty board as a "small project".

Pchelolo removed Pchelolo as the assignee of this task. Apr 15 2020, 4:04 PM
Pchelolo added a subscriber: Pchelolo.
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM