
Create script to dump recently changed categories
Closed, Resolved (Public)

Description

Given a wiki and a date interval, produce a SPARQL update statement that contains all updates to categories and memberships relevant to the RDF representation between those dates.
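As a rough illustration of the idea (not the script from the patch), the sketch below turns a list of membership changes into a single SPARQL update. The record layout, the function name `build_category_update`, and the example IRIs are hypothetical, and the `mediawiki:isInCategory` predicate and prefix are assumptions that should be checked against the actual categories RDF dump.

```python
from datetime import datetime

# Assumed ontology prefix; verify against the real categories RDF output.
PREFIX = "PREFIX mediawiki: <https://www.mediawiki.org/ontology#>"


def build_category_update(changes, start, end):
    """Build one SPARQL update from hypothetical membership change records.

    Each record is (timestamp, action, page_iri, category_iri), where
    action is "add" or "remove". Only records with start <= timestamp < end
    are used, and only the last action per (page, category) pair is kept.
    """
    final = {}
    for ts, action, page, category in sorted(changes, key=lambda c: c[0]):
        if start <= ts < end:
            final[(page, category)] = action

    deletes, inserts = [], []
    for (page, category), action in final.items():
        triple = f"<{page}> mediawiki:isInCategory <{category}> ."
        (inserts if action == "add" else deletes).append(triple)

    parts = [PREFIX]
    if deletes:
        parts.append("DELETE DATA {\n  " + "\n  ".join(deletes) + "\n};")
    if inserts:
        parts.append("INSERT DATA {\n  " + "\n  ".join(inserts) + "\n};")
    return "\n".join(parts)


# Hypothetical input: one day of changes plus one record outside the interval.
example_changes = [
    (datetime(2017, 8, 20, 1), "add",
     "https://en.wikipedia.org/wiki/A", "https://en.wikipedia.org/wiki/Category:X"),
    (datetime(2017, 8, 20, 2), "remove",
     "https://en.wikipedia.org/wiki/B", "https://en.wikipedia.org/wiki/Category:X"),
    (datetime(2017, 8, 22, 2), "add",
     "https://en.wikipedia.org/wiki/C", "https://en.wikipedia.org/wiki/Category:X"),
]
update = build_category_update(
    example_changes, datetime(2017, 8, 20), datetime(2017, 8, 21)
)
print(update)
```

The sketch covers only page-to-category memberships. A real implementation would also need to handle category creation, deletion, and renames, and would read the changes from the wiki's database rather than from an in-memory list.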

Event Timeline

Smalyshev triaged this task as Medium priority. Aug 21 2017, 7:27 PM
Smalyshev created this task.
Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board.

Change 372905 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Create update SPARQL for category changes

https://gerrit.wikimedia.org/r/372905

Another reality check: the current dumping script dumps one day of changes from enwiki in 6.5s. This performance seems acceptable even if we run it for all wikis; commons might take longer, but even at 10x the time it looks OK.

Change 378355 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Generate daily diffs for categories RDF

https://gerrit.wikimedia.org/r/378355

We are currently running with the weekly updates. Looping back around to this: do we still think we need daily updates in addition, or is the weekly data good enough?

@Lydia_Pintscher, @Lea_WMDE I think your opinion on the above question would be helpful.

@Smalyshev @EBernhardson sorry, I seem to have overlooked the comments before. From a purely user-experience perspective, I would of course like to see daily rather than weekly updates. What are the reasons that we keep it weekly for now? Is there a big downside to daily updates?

What are the reasons that we keep it weekly for now?

We only have a weekly dump, so we need to implement either the daily diffs from this patch or daily dumps. I'll check whether daily dumps are feasible.

Is there a big downside to daily updates?

It needs daily dumps/diffs implemented :) Which means one more dump process (with accompanying data) and one more process to load them. Nothing that we can't do, just wanted to ensure we need it before we invest the time, since we've been running with weeklies for a while and nobody complained so far :)

If you plan to implement daily dumps, I would strongly encourage you to do them as diffs, assuming that these will be much more efficient than full dumps every day.

Yes, this is the general idea, but I wanted to measure how large the difference is. Unless it turns out there's no difference (unlikely), the plan is for daily diffs.

I had a talk with @Charlie_WMDE (UX) about it and we do feel daily is the better way to go :)

Change 372905 merged by jenkins-bot:
[mediawiki/core@master] Create update SPARQL for category changes

https://gerrit.wikimedia.org/r/372905

Oops, this is actually only half-done - the script is there, but dumps are still not generated.