
Create mechanism to update categories database in graph storage
Closed, Resolved · Public

Description

As categories change, we need to update the contents of the graph database that hosts them. To do that, we need to work out a mechanism for applying those updates.

Event Timeline

Restricted Application added projects: Wikidata, Discovery. Aug 21 2017, 7:07 PM
Restricted Application added a subscriber: Aklapper.

Current thinking is:

  • Every day, generate the RDF for updated categories as a SPARQL Update file.
  • Load it into Blazegraph once it has been created.
  • This will be done for each wiki that has the functionality enabled.
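A rough sketch of what the first two steps could look like, just to make the idea concrete. This is not the actual script: the function names, the triple strings and the endpoint URL passed to `post_update` are all made up for illustration. The only thing relied on is that a SPARQL 1.1 endpoint accepts an update as a form-encoded `update` parameter over POST.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update(deleted, inserted):
    """Build a SPARQL Update string from two lists of triple statements.

    Each statement is a complete 'subject predicate object .' string.
    (Hypothetical helper; the real dump format may differ.)
    """
    parts = []
    if deleted:
        parts.append("DELETE DATA {\n" + "\n".join(deleted) + "\n}")
    if inserted:
        parts.append("INSERT DATA {\n" + "\n".join(inserted) + "\n}")
    # Operations in one update request are separated by ';'.
    return ";\n".join(parts)


def post_update(endpoint, update):
    """POST the update to a SPARQL endpoint (endpoint URL is an assumption)."""
    body = urlencode({"update": update}).encode("utf-8")
    req = Request(
        endpoint,
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urlopen(req) as resp:
        return resp.status
```

The daily job would then call `build_update` once per enabled wiki, write the result to a file, and hand the same text to `post_update`.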

Reality check:
enwiki had 73662 category updates and 498 category creations on August 19th 2017, and other days show similar numbers. That looks like a completely workable volume to process daily. Moreover, many of the updates hit the same categories: the real number of distinct categories updated on enwiki seems to be around 25K/day.
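The gap between raw updates and distinct categories is just deduplication by category title. A minimal sketch, assuming the change log can be read as `(kind, title)` pairs (that event shape is hypothetical, not the actual log format):

```python
def summarize(events):
    """Return (raw update count, distinct updated categories, distinct created)."""
    raw_updates = sum(1 for kind, _ in events if kind == "update")
    updated = {title for kind, title in events if kind == "update"}
    created = {title for kind, title in events if kind == "create"}
    return raw_updates, len(updated), len(created)


# Three update events touch only two distinct categories.
sample = [
    ("update", "Category:Physics"),
    ("update", "Category:Physics"),
    ("update", "Category:Chemistry"),
    ("create", "Category:Biology"),
]
print(summarize(sample))
```

Only the distinct set needs to be regenerated, which is why the daily workload is closer to 25K categories than to 73K events.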

On commons, the numbers seem to be about 2-3x higher for modifications and about 5x higher for creations. That still seems workable, and commons is probably the upper bound of what we're going to get.

debt triaged this task as Medium priority. Sep 7 2017, 5:20 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Smalyshev added a subscriber: User-Smalyshev.
debt added a subscriber: debt. Oct 17 2017, 5:30 PM

This is still blocked on a merge...waiting.

Change 392736 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Create script for automatic reload of categories

https://gerrit.wikimedia.org/r/392736

Change 394021 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] wdqs: schedule cronjob to reload categories

https://gerrit.wikimedia.org/r/394021

Change 392736 merged by Gehel:
[operations/puppet@production] Create script for automatic reload of categories

https://gerrit.wikimedia.org/r/392736

Change 394021 merged by Gehel:
[operations/puppet@production] wdqs: schedule cronjob to reload categories

https://gerrit.wikimedia.org/r/394021

Smalyshev closed this task as Resolved. Dec 11 2017, 6:11 PM

Categories are now auto-updated weekly, on Monday.