
Automate data reload for SPARQL Endpoint for Commons
Closed, Resolved · Public · 2 Estimated Story Points

Description

We can regularly reload data from dumps while waiting for the streaming updates to be ready. This can probably be achieved with a simple bash script and a cron job. Blazegraph needs to be shut down during data reload, so this will imply some regular downtime. Dumps are generated weekly, so it does not make sense to reload data more frequently.

Acceptance Criteria:

  • data is reloaded from dumps weekly without human intervention

Event Timeline

Mstyles added a subscriber: Zbyszko.

As discussed by email with @Zbyszko, the script should do the following:

  1. Download the newest dump (maybe - we can provide it manually, but where's the fun in that?)
  2. Munge the data (run the munge script on the downloaded dump)
  3. Stop Blazegraph
  4. Delete Blazegraph journal
  5. Start (now empty) Blazegraph
  6. Execute the loadData.sh script with the sdc namespace, pointing it at the munged data (see Igor's comment).
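
The six steps above could be sketched roughly as follows. This is only an illustration, not the script that was eventually merged: the dump URL, working directory, journal path and service name are hypothetical placeholders, and the `-f`/`-d` flags for `munge.sh` and the `-n`/`-d` flags for `loadData.sh` are assumed from the tools' usual usage and should be checked against the actual scripts. By default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the reload sequence; every default below is a hypothetical placeholder.
set -euo pipefail

DUMP_URL="${DUMP_URL:-https://example.org/dumps/latest-dump.ttl.gz}"  # hypothetical URL
WORK_DIR="${WORK_DIR:-/tmp/wcqs-reload}"         # hypothetical, assumed to exist
JOURNAL="${JOURNAL:-/tmp/wcqs-reload/data.jnl}"  # hypothetical journal path
SERVICE="${SERVICE:-blazegraph}"                 # hypothetical service name
NAMESPACE="${NAMESPACE:-sdc}"                    # namespace named in the task
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reload_data() {
  run curl -fsSL -o "$WORK_DIR/dump.ttl.gz" "$DUMP_URL"            # 1. download newest dump
  run ./munge.sh -f "$WORK_DIR/dump.ttl.gz" -d "$WORK_DIR/munged"  # 2. munge (flags assumed)
  run systemctl stop "$SERVICE"                                    # 3. stop Blazegraph
  run rm -f "$JOURNAL"                                             # 4. delete the journal
  run systemctl start "$SERVICE"                                   # 5. start empty Blazegraph
  run ./loadData.sh -n "$NAMESPACE" -d "$WORK_DIR/munged"          # 6. load (flags assumed)
}
```

Calling `reload_data` with the default `DRY_RUN=1` lists the six commands in order. Setting `DRY_RUN=0` would execute them, including the journal deletion, so the paths need to be right before doing that.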

Change 597398 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/deploy@master] add script to automate data load

https://gerrit.wikimedia.org/r/597398

Change 597398 abandoned by Mstyles:
add script to automate data load

Reason:
moving to source repo

https://gerrit.wikimedia.org/r/597398

Change 598134 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[wikidata/query/rdf@master] add script to automate sdoc data load

https://gerrit.wikimedia.org/r/598134

Change 598134 merged by jenkins-bot:
[wikidata/query/rdf@master] add script to automate wcqs data load

https://gerrit.wikimedia.org/r/598134

dcausse added a subscriber: dcausse.

Re-opening this task just to address the last bit of work: scheduling this script weekly through cron or a systemd timer.

Mstyles added a subscriber: Mstyles.

Scope of this last piece is to schedule a simple cron job managed by puppet.
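
For illustration only, a weekly cron job could call a small wrapper like the one below, so that a slow reload is not started a second time while the first is still running. The schedule, the paths and the `run_locked` helper are all hypothetical; the job actually deployed is defined in the puppet change and may look quite different.

```shell
# Hypothetical crontab line (Mondays at 03:00); schedule and path are placeholders:
#   0 3 * * 1 /usr/local/bin/reload-wrapper.sh

# Run a command while holding a simple directory-based lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_locked() {
  local lock_dir="$1"
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "lock $lock_dir exists; another reload may be running, skipping" >&2
    return 1
  fi
  local status=0
  "$@" || status=$?   # keep the command's exit status for the caller
  rmdir "$lock_dir"   # release the lock whether or not the command succeeded
  return "$status"
}
```

A wrapper script would then contain something like `run_locked /tmp/reload.lock /path/to/reload-script.sh`, where both paths are again placeholders. If the process is killed mid-run, the lock directory stays behind and has to be removed by hand.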

Change 619289 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/puppet@production] Add a weekly reload job for wcqs data reload

https://gerrit.wikimedia.org/r/619289

During the first data reload, the data was not restored properly for some reason. I couldn't find the root cause, so I'm making some small changes to better understand the issue if it happens again.

Change 621003 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Changes to debug auto reload issue

https://gerrit.wikimedia.org/r/621003

Change 621003 merged by jenkins-bot:
[wikidata/query/rdf@master] Changes to debug auto reload issue

https://gerrit.wikimedia.org/r/621003

Change 619289 merged by Ryan Kemper:
[operations/puppet@production] Add a weekly reload job for wcqs data reload

https://gerrit.wikimedia.org/r/619289