
consider generating an empty abstract file for wikidata
Closed, Resolved (Public)

Description

We only produce an abstract for articles in the main namespace that are not redirects and that have a content model of text or wikitext. For Wikidata, all items in the main namespace are Q-items with content model wikibase-item.
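The selection rule above can be sketched as a small predicate. This is only an illustration, not the actual dumps code: the function name and parameters are hypothetical, and the only facts taken from this task are the three conditions (main namespace, not a redirect, content model of text or wikitext).

```python
# Hypothetical sketch of the abstract selection rule described above.
# Names are illustrative and do not come from the dumps codebase.

MAIN_NAMESPACE = 0  # the main namespace has id 0 in MediaWiki
ABSTRACT_CONTENT_MODELS = {"text", "wikitext"}


def is_abstract_candidate(namespace, is_redirect, content_model):
    """Return True if a page would get a real abstract under the rule above."""
    return (
        namespace == MAIN_NAMESPACE
        and not is_redirect
        and content_model in ABSTRACT_CONTENT_MODELS
    )


# A Wikidata Q-item lives in the main namespace but has the
# wikibase-item content model, so it never qualifies.
print(is_abstract_candidate(0, False, "wikibase-item"))
print(is_abstract_candidate(0, False, "wikitext"))
```

Under this rule every main-namespace page on Wikidata fails the content model check, which is why the full run only ever emits not-applicable entries.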

If this is not something we expect to change, then there's no point in spending 36 hours generating a bunch of files that contain, for each entry, <abstract not-applicable="" />. We should just generate an empty abstract file and be done with it. I guess we'd generate the 27 empty partial files and one complete 'recombined' file, each containing only the mediawiki and siteinfo information and the mediawiki footer.
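A rough sketch of what that could look like, assuming the header and footer text are already available as strings. The function name, the file names, the gzip compression, and the placeholder header and footer are all assumptions made for illustration; none of them are taken from the dumps code or its real output.

```python
import gzip
import os
import tempfile

# Placeholder header/footer; the real files would carry the actual
# mediawiki and siteinfo content, which is not reproduced here.
HEADER = "<mediawiki>\n  <siteinfo>placeholder</siteinfo>\n"
FOOTER = "</mediawiki>\n"


def write_empty_abstracts(out_dir, header, footer, parts=27):
    """Write `parts` empty partial files plus one combined file.

    Each file contains only the header followed by the footer.
    Returns the list of paths written. File names are hypothetical.
    """
    names = [f"abstract-part{i}.xml.gz" for i in range(1, parts + 1)]
    names.append("abstract-combined.xml.gz")
    paths = []
    for name in names:
        path = os.path.join(out_dir, name)
        with gzip.open(path, "wt", encoding="utf-8") as out:
            out.write(header)
            out.write(footer)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_empty_abstracts(tmp, HEADER, FOOTER)
        print(len(written), "files written")
```

Writing the header and footer directly skips the page-by-page pass entirely, which is where the 36 hours and the database load come from.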

I'd like to get the input of folks on the xmldatadumps mailing list as well as @hoo to see what people think.

Event Timeline

ArielGlenn triaged this task as Medium priority. Oct 21 2019, 5:34 AM
ArielGlenn created this task.

@hoo if you're not the right person to ping for this, can you point me to the right person? Basically I'm interested in knowing if the configuration can reasonably ever change so that anything besides a Q-item can be in the main namespace, and in particular anything with a content model that is text or wikitext. If not, as the task description states, I'm seriously considering generating 'empty' abstract files and saving wear and tear on the db servers. What do you think?

Adding @WMDE-leszek for comments too, in case you are more active on Wikidata; if you're the wrong person to answer about the contents of the Wikidata main namespace, please redirect me and/or remove yourself.

@ArielGlenn: for Wikidata, this is not expected to change: Q-items and P-properties are not expected to go into a namespace with the text or wikitext content model. So optimizing the process in this area should be absolutely fine.
Note that Commons/MediaInfo are a bit different, as they store structured data (M-entities) in a separate slot in the File namespace, which, I believe, uses the wikitext content model. Not sure if this is relevant for the question here.

@WMDE-leszek Thanks for the answer; is it expected that at some time in the future other things might go into the main namespace for Wikidata, that might have a text or wikitext content model?

For MediaInfo items, those are in a secondary slot so they will never show up for abstracts for Commons, and we don't have to worry about them at all :-)

@WMDE-leszek Thanks for the answer; is it expected that at some time in the future other things might go into the main namespace for Wikidata, that might have a text or wikitext content model?

Negative.

Thanks! I will send an email to the xmldatadumps list and see what people think, though I do not expect any objections.

(Updated) Message sent.

Change 547197 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] ability to configure a wiki to produce empty abstract files

https://gerrit.wikimedia.org/r/547197

A week has passed and no one has commented. Silence equals consent, and the above patch has been tested with the config setting enabled and disabled, so it's ready to go.

This will be merged shortly before the Nov 20th run unless something else derails things.

Change 547197 merged by ArielGlenn:
[operations/dumps@master] ability to configure a wiki to produce empty abstract files

https://gerrit.wikimedia.org/r/547197

Change 551172 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] configure wikidata dumps to generate empty abstracts files

https://gerrit.wikimedia.org/r/551172

Change 551172 merged by ArielGlenn:
[operations/puppet@production] configure wikidata dumps to generate empty abstracts files

https://gerrit.wikimedia.org/r/551172

This is now complete. Nov 20th wikidata abstract files are nice little empty files as expected.