Setup dump for categories RDF representation
Closed, Resolved (Public)

Description

After https://gerrit.wikimedia.org/r/#/c/327862/ is merged, we need to set up regular dumps of categories into RDF. We should have a list of the wikis that are dumped, probably as a list in the MediaWiki config.

Proposed lists of wikis:

Initial:

  • testwiki
  • test2wiki

Then:

  • enwiki
  • dewiki
  • commonswiki

After those work, we can ask people whether to enable it on more wikis, or just enable it on all wikis.
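For illustration, here is a minimal sketch of how a script might read such a wiki list. It assumes a plain-text format with one wiki database name per line and `#` starting a comment. The `parse_dblist` helper and the sample contents are hypothetical (the sample just mirrors the proposed initial list) and are not taken from the actual patches.

```python
def parse_dblist(text):
    """Return wiki database names from dblist-style text, in file order.

    Assumed format: one name per line; '#' starts a comment; blank
    lines are ignored.
    """
    wikis = []
    for line in text.splitlines():
        # Drop any trailing comment, then surrounding whitespace.
        name = line.split("#", 1)[0].strip()
        if name:
            wikis.append(name)
    return wikis


# Hypothetical sample contents, mirroring the proposed initial list.
SAMPLE = """\
# wikis whose categories are dumped into RDF (illustrative)
testwiki
test2wiki
"""

if __name__ == "__main__":
    for wiki in parse_dblist(SAMPLE):
        print(wiki)
```

Keeping the list in a separate file means further wikis (enwiki, dewiki, commonswiki) can be added later without touching the dump script.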

Change 373167 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add list for wikis that would have categories dumped into RDF

https://gerrit.wikimedia.org/r/373167

Change 373354 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Add RDF dumps for categories

https://gerrit.wikimedia.org/r/373354

Smalyshev moved this task from Backlog to Waiting on the User-Smalyshev board.

Change 373167 merged by jenkins-bot:
[operations/mediawiki-config@master] Add list for wikis that would have categories dumped into RDF

https://gerrit.wikimedia.org/r/373167

Mentioned in SAL (#wikimedia-operations) [2017-08-28T23:07:04Z] <ebernhardson@tin> Synchronized dblists/categories-rdf.dblist: T173892: Add list for wikis that would have categories dumped into RDF (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2017-08-28T23:08:34Z] <ebernhardson@tin> Synchronized docroot/noc/conf/categories-rdf.dblist: T173892: Add list for wikis that would have categories dumped into RDF (duration: 00m 43s)

@ArielGlenn could you take a look and see if this (https://gerrit.wikimedia.org/r/373354) makes sense?

It looks generally ok, I'll have a close look today.

Just a couple nits, see gerrit. Once that's sorted, would you be ok with me merging this whenever? Also, any estimate on how long the job would take to run across all wikis? Thanks!

Once that's sorted, would you be ok with me merging this whenever?

Yes, please!

Also, any estimate on how long the job would take to run across all wikis?

Hmm, I can't find the timings now. I remember enwiki being done in a matter of hours, but I can't locate where I recorded it. I'll retest and add it. Most wikis have fewer categories than enwiki or commons, so they will probably be much faster. I'll add the figures in a bit.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Sep 4 2017, 11:03 AM

Change 373354 merged by ArielGlenn:
[operations/puppet@production] Add RDF dumps for categories

https://gerrit.wikimedia.org/r/373354

ArielGlenn added a comment. Edited Sep 4 2017, 11:49 AM

I've merged and deployed this, after making a few (mostly cosmetic) changes. I'm going to update the script now so that it uses the clean new way of getting config settings, and I'll leave this ticket open until cron runs once successfully.

Smalyshev added a comment. Edited Sep 5 2017, 7:49 PM

@ArielGlenn thank you, will wait for first dump to happen. If that works fine, I'll enable it for more wikis. The timing for enwiki is:

real    40m49.040s
user    29m37.468s
sys     0m9.160s

This seems to be pretty reasonable. Dump size for enwiki is ~50M (gzipped).
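As a rough aid for reading those figures, the sketch below converts the `time` output above into seconds and computes how much of the wall-clock time was spent on CPU. The `to_seconds` helper is hypothetical and assumes the `NmS.SSSs` format shown above; only the three timing values come from this task.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '40m49.040s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values reported for the enwiki run.
real = to_seconds("40m49.040s")
user = to_seconds("29m37.468s")
sys_time = to_seconds("0m9.160s")

# Fraction of wall-clock time spent on CPU (user + sys).
cpu_share = (user + sys_time) / real

if __name__ == "__main__":
    print(f"wall clock: {real:.0f} s, CPU share: {cpu_share:.0%}")
```

By this measure about 73% of the run was CPU time. The timings alone do not say where the remaining wall-clock time went.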

ArielGlenn closed this task as Resolved. Sep 11 2017, 3:47 PM
ArielGlenn claimed this task.

The testwiki dumps are present, so I'm closing this ticket.

Change 377369 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Add categories RDF dump into the index page

https://gerrit.wikimedia.org/r/377369

Change 377369 merged by ArielGlenn:
[operations/puppet@production] Add categories RDF dump into the index page

https://gerrit.wikimedia.org/r/377369