Prepare deployment of JSON dumps for Lexeme
Closed, Resolved (Public)

Description

  • changes to operations-puppet (modules/snapshot/; the bash script for the JSON dumps is modules/snapshot/files/cron/dumpwikidatajson.sh)
  • some coordination with ops (ArielGlenn)
  • get an overview of the code there and create a ticket for refactoring it, so that we can make future changes with more confidence (put it under test, ...)

Access to "the dump generation server" may be exclusive to Marius at the moment.

Event Timeline

Change 637895 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Do weekly dumps of Wikidata Lexeme

https://gerrit.wikimedia.org/r/637895

I still need to test my changes (sadly there are no automated tests for these scripts and it would be excruciatingly hard to write some for the current bash scripts). Once that is done, this is good to go from my side.

Change 637895 merged by ArielGlenn:
[operations/puppet@production] Do weekly dumps of Wikidata Lexeme

https://gerrit.wikimedia.org/r/637895

Dump generation is currently scheduled for Wednesdays at 3:15 UTC (the non-lexeme JSON dumps run at the same time on Mondays).

Are you sure they ran? That directory only contains RDF dumps as far as I can tell (Turtle and NTriples), we’ve been generating those for a while (compare 20210122 with 20201218). I haven’t found any lexeme JSON dumps yet.

Ah crap. Yeah I see that now.

I didn't get any failure emails about it, but when I looked in the log I saw this:
root@snapshot1008:~# more /var/log/wikidatadump/dumpwikidatajson-wikidata-20210127-lexemes-main.log
File size for shard 0 is only 26086402. Aborting.

I guess those values need to be adjusted for lexemes.

Ah, thanks. Looks like the threshold was last adjusted in June 2019.

I guess for the inaugural lexeme JSON dumps it would be best to skip the check completely? And then use the first dumps’ sizes going forward. (Or we could try to estimate the size from the existing RDF dumps?)

…or maybe use the file size that the error message already tells us as a guideline 🤦
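For illustration, the kind of minimum-size check being discussed could look roughly like the sketch below. This is not the code in dumpwikidatajson.sh: the function name `check_min_size`, the variable `LEXEME_MIN_BYTES`, and the 50% safety margin are all assumptions made up for this example. Only the message format and the observed shard size (26086402 bytes) come from the log above.

```shell
#!/bin/bash
# Hypothetical sketch, not taken from dumpwikidatajson.sh.

# Assumed threshold: half of the shard size reported in the log (26086402 bytes).
# The 50% margin is an arbitrary illustration, not the value that was deployed.
LEXEME_MIN_BYTES=$((26086402 / 2))

# check_min_size SHARD FILE MIN_BYTES
# Returns 0 if FILE has at least MIN_BYTES bytes.
# Returns 1 and prints an abort message if FILE is smaller.
# Returns 2 if FILE does not exist.
check_min_size() {
    local shard="$1" file="$2" min_bytes="$3"
    local size

    [ -f "$file" ] || return 2

    # wc -c prints the byte count; tr strips the padding some platforms add.
    size=$(wc -c < "$file" | tr -d ' ')

    if [ "$size" -lt "$min_bytes" ]; then
        echo "File size for shard $shard is only $size. Aborting."
        return 1
    fi
    return 0
}
```

A caller would run something like `check_min_size 0 "$shardFile" "$LEXEME_MIN_BYTES" || exit 1` after each shard is written, where `$shardFile` is again a placeholder name. Deriving the threshold from a known-good run keeps the check meaningful without it rejecting a dump that is legitimately much smaller than the full Wikidata JSON dump.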

Change 659945 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/puppet@production] Update minimum expected file size for lexeme JSON dumps

https://gerrit.wikimedia.org/r/659945

Change 659945 merged by ArielGlenn:
[operations/puppet@production] Update minimum expected file size for lexeme JSON dumps

https://gerrit.wikimedia.org/r/659945

All set. We should check on these again in the middle of next week, as the run starts on Monday at ridiculous-o-clock when we are all sleeping.

I am doing some prep work before I try to test this on buster. Getting close!

Lydia_Pintscher added a subscriber: hoo.

This seems to be looking good on our side. Ariel: I'm assigning this to you so you can do anything remaining you would like to do and then close it. If there is anything you need from the WD team still please let us know.

Also: Yay! Lexeme dumps :D

These look fine to me from today, and I've done all the buster-side testing so that's ok too. Closing this! Ah, do we want to announce it anywhere though? Maybe I won't close it pending that answer. Places it could be announced: xmldatadumps-l, wikitech-l, research list, wikidata list.

amy_rc added a subscriber: amy_rc.

Thanks. We'll send a short note to the Wikidata mailing list and add it to the next weekly summary. \o/