Automate data reload for SPARQL Endpoint for Commons
Closed, Resolved | Public | 2 Estimated Story Points

Description

We can regularly reload data from dumps while waiting for the streaming updates to be ready. This can probably be achieved with a simple bash script and a cron job. Blazegraph needs to be shut down during data reload, so this will imply some regular downtime. Dumps are generated weekly, so it does not make sense to reload data more frequently.

Acceptance Criteria:

  • data is reloaded from dumps weekly without human intervention

Event Timeline

Gehel created this task. Apr 30 2020, 12:00 PM
Mstyles claimed this task. May 18 2020, 6:45 PM
Mstyles added a subscriber: Zbyszko.

As discussed by email with @Zbyszko, the script should do the following:

  1. Download the newest dump (maybe - we can provide it manually, but where's the fun in that?)
  2. Munge the data (running a script with appropriate data)
  3. Stop Blazegraph
  4. Delete Blazegraph journal
  5. Start (now empty) Blazegraph
  6. Execute loadData.sh script with sdc namespace and pointing to the munged data (see Igor's comment).
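The six steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was merged: the dump URL, file paths, journal location, and the `wcqs-blazegraph` service name are hypothetical placeholders, and the `-f`/`-d`/`-n` options passed to `munge.sh` and `loadData.sh` are assumed and should be checked against the scripts themselves. By default the plan is only printed, not executed.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class ReloadConfig:
    # All values below are hypothetical placeholders, not taken from the task.
    dump_url: str = "https://example.org/dumps/latest-mediainfo.ttl.gz"
    dump_file: str = "/tmp/wcqs/dump.ttl.gz"
    munged_dir: str = "/tmp/wcqs/munged"
    journal_file: str = "/tmp/wcqs/wcqs.jnl"
    service: str = "wcqs-blazegraph"
    namespace: str = "sdc"


def build_reload_plan(cfg: ReloadConfig) -> List[List[str]]:
    """Return the reload steps, in order, as argv lists."""
    return [
        # 1. Download the newest dump.
        ["curl", "--fail", "--silent", "--show-error",
         "--output", cfg.dump_file, cfg.dump_url],
        # 2. Munge the data (script options are assumed).
        ["./munge.sh", "-f", cfg.dump_file, "-d", cfg.munged_dir],
        # 3. Stop Blazegraph.
        ["systemctl", "stop", cfg.service],
        # 4. Delete the Blazegraph journal.
        ["rm", "-f", cfg.journal_file],
        # 5. Start the (now empty) Blazegraph.
        ["systemctl", "start", cfg.service],
        # 6. Load the munged data into the namespace (options are assumed).
        ["./loadData.sh", "-n", cfg.namespace, "-d", cfg.munged_dir],
    ]


def run_plan(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each step; execute it only when dry_run is False."""
    for argv in plan:
        print(shlex.join(argv))
        if not dry_run:
            # check=True aborts the reload as soon as one step fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_plan(build_reload_plan(ReloadConfig()))
```

Building the plan separately from running it keeps the ordering easy to review and test. The important constraint is that the journal is deleted only while Blazegraph is stopped.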

Change 597398 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/deploy@master] add script to automate data load

https://gerrit.wikimedia.org/r/597398

Change 597398 abandoned by Mstyles:
add script to automate data load

Reason:
moving to source repo

https://gerrit.wikimedia.org/r/597398

Change 598134 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] add script to automate sdoc data load

https://gerrit.wikimedia.org/r/598134

Change 598134 merged by jenkins-bot:
[wikidata/query/rdf@master] add script to automate wcqs data load

https://gerrit.wikimedia.org/r/598134

Mstyles closed this task as Resolved. Jul 27 2020, 5:11 PM
dcausse reopened this task as Open. Jul 29 2020, 7:45 AM
dcausse added a subscriber: dcausse.

Re-opening this task just to address the last bit of work: scheduling this script weekly through cron or a systemd timer.

Mstyles removed Mstyles as the assignee of this task. Jul 29 2020, 3:38 PM
Mstyles added a subscriber: Mstyles.
CBogen added a subscriber: CBogen. Aug 3 2020, 5:13 PM

Scope of this last piece is to schedule a simple cron job managed by puppet.
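For illustration only, a weekly entry in the system crontab format (`minute hour day-of-month month day-of-week user command`) could be rendered as below. The weekday, time, user, and script path are hypothetical, and the helper `weekly_cron_entry` is not part of any patch on this task; the real job is defined in puppet.

```python
def weekly_cron_entry(weekday: int, hour: int, minute: int,
                      user: str, command: str) -> str:
    """Render one /etc/cron.d-style line that fires once a week.

    weekday follows cron's convention: 0 or 7 is Sunday, 1 is Monday, etc.
    """
    if not 0 <= weekday <= 7:
        raise ValueError("weekday must be in 0..7")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    # Day-of-month and month are wildcards, so only the weekday restricts the run.
    return f"{minute} {hour} * * {weekday} {user} {command}"


if __name__ == "__main__":
    # Hypothetical values: Mondays at 03:00, run as a placeholder user.
    print(weekly_cron_entry(1, 3, 0, "blazegraph", "/path/to/reload-script.sh"))
```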

Zbyszko set the point value for this task to 2. Aug 10 2020, 12:30 PM

Change 619289 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/puppet@production] Add a weekly reload job for wcqs data reload

https://gerrit.wikimedia.org/r/619289

Zbyszko claimed this task. Aug 10 2020, 4:04 PM

During the first data reload, the data was not restored properly for some reason. I couldn't find the root cause, so I'm making some small changes to better understand the issue if it happens again.

Change 621003 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Changes to debug auto reload issue

https://gerrit.wikimedia.org/r/621003

Change 621003 merged by jenkins-bot:
[wikidata/query/rdf@master] Changes to debug auto reload issue

https://gerrit.wikimedia.org/r/621003

Change 619289 merged by Ryan Kemper:
[operations/puppet@production] Add a weekly reload job for wcqs data reload

https://gerrit.wikimedia.org/r/619289

Gehel closed this task as Resolved. Sep 1 2020, 12:24 PM