
Provide appropriate dumps of Commons including the structured data
Open, Low, Public

Description

Coming out of a quick conversation between @Jdforrester-WMF and @ArielGlenn, we need to determine what, if any, specific work is needed to adjust the dumps for Commons. This is not for immediate decision and execution – we don't have to climb the whole mountain at once – and indeed it is blocked on work to determine what data will exist (e.g. virtual properties).

  • Binaries – Already done; any changes expected?
  • Wikitext – Already done; will continue, changed via T198706
  • SDC data – Raw JSON will come along as part of T174031

This will get us the raw content of the slot, but what more do we need to do, if anything?
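To make that concrete for downstream consumers, here is a minimal sketch (not a tested tool) of pulling the raw MediaInfo JSON back out of a multi-slot XML dump. It assumes the post-MCR export schema in which each <revision> carries one <content> element per slot ("main" for wikitext, "mediainfo" for the SDC data); the namespace/schema version and element names are assumptions to check against the actual dump header.

```
# Sketch: extract raw MediaInfo JSON from a multi-slot XML dump.
import json
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # assumed schema version

def mediainfo_entities(dump_path):
    """Yield (page title, parsed MediaInfo entity) for revisions with an SDC slot."""
    title = None
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "content":
            if elem.findtext(NS + "role") == "mediainfo":
                raw = elem.findtext(NS + "text") or ""
                if raw.strip():
                    yield title, json.loads(raw)
        elif elem.tag == NS + "page":
            elem.clear()  # drop the finished page subtree to limit memory use

for page, entity in mediainfo_entities("commonswiki-pages-articles.xml"):
    # A MediaInfo entity carries an M-prefixed id, captions under "labels",
    # and statements keyed by property id (e.g. P180 "depicts").
    print(page, entity.get("id"), sorted(entity.get("statements", {})))
```

Of course, anything along these lines only works once the slot actually appears in the XML dumps, so this is a consumer-side illustration rather than a description of what the dumps contain today.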

  • We'll need to dump local properties, presumably. (Are there going to be any local non-media items?)
  • Can virtual properties be ignored (and calculated on import like EXIF), or will they need exporting?
  • Is it OK to ask people to grab the Wikidata XML or entity dumps to use with the Commons dump? Will users demand a special de-referenced walk of Wikidata from Commons properties (see the sketch after this list)? What about recursion? (Oy.)
  • Will users want a separate sort of 'commons media info' weekly or bimonthly run with information specially formatted for folks working with media file meta-data?
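To make the de-referencing question more concrete, here is a rough sketch of what such a consumer-side walk could look like without any special dump: collect the Q-ids used as statement values in a MediaInfo entity, then resolve them in one pass over the separately downloaded Wikidata JSON entity dump. The field names follow the usual Wikibase JSON layout but should be verified against real data, and recursion (items referencing further items) is deliberately left out.

```
# Sketch: resolve Wikidata items referenced from Commons SDC statements.
import bz2
import json

def referenced_items(mediainfo_entity):
    """Return the set of Q-ids used as statement values in a MediaInfo entity."""
    qids = set()
    for statements in mediainfo_entity.get("statements", {}).values():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and value.get("entity-type") == "item":
                qids.add(value["id"])
    return qids

def resolve_from_wikidata_dump(qids, wikidata_json_dump):
    """One pass over the Wikidata JSON dump, keeping only the requested items."""
    wanted = dict.fromkeys(qids)  # qid -> entity (None until found)
    with bz2.open(wikidata_json_dump, "rt") as fh:
        for line in fh:
            line = line.rstrip().rstrip(",")
            if not line.startswith("{"):
                continue  # skip the surrounding "[" / "]" lines of the dump
            entity = json.loads(line)
            if entity.get("id") in wanted:
                wanted[entity["id"]] = entity
    return wanted
```

If nothing else, this illustrates the cost we would be pushing onto users by pointing them at the existing Wikidata dumps, which is part of weighing whether a pre-joined or specially formatted media-info dump is worth producing.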

Event Timeline

Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board.

See T221917 where this is actually being done (1/2; the other half, inclusion in the XML dumps, depends on some pending changes to core that are still in the works). Should I merge this task into the other one?

Aklapper subscribed.

Adding the missing Multi-Content-Revisions code project tag, as the Core Platform Team Initiatives (MCR) team tag is archived and its parent Platform Engineering team no longer exists.