
Test new Updater service
Closed, Declined (Public)

Description

Before we deploy the new Updater service into production, we want to test it thoroughly, since problems in the updater would lead to wrong data being queried and, ultimately, desynchronization between the Wikidata and WDQS databases.

The current test plan is as follows:

Note: at this point wdqs1004 should still be running the merging updater.
After all these tests have been run and returned satisfactory results, we can start enabling the new updater option on production hosts.

Event Timeline

Procedure for comparing journals:

Given two journals, wikidata.jnl.1 and wikidata.jnl.2, with their accompanying properties files, dump the triples:

# dump jnl1
java -Dlogback.configurationFile=./logback.xml -cp warlib/*:warlib/*.jar com.bigdata.rdf.store.RebuildJournal -dump RWStore.1.properties
# dump jnl2
java -Dlogback.configurationFile=./logback.xml -cp warlib/*:warlib/*.jar com.bigdata.rdf.store.RebuildJournal -dump RWStore.2.properties

That produces wikidata.jnl.1.data.gz and wikidata.jnl.2.data.gz.
Sort:

zcat wikidata.jnl.1.data.gz | sed 's/ : .*$//' | sort | gzip > wikidata.jnl.1.sorted.nt.gz
zcat wikidata.jnl.2.data.gz | sed 's/ : .*$//' | sort | gzip > wikidata.jnl.2.sorted.nt.gz

Clean up volatile triples:

zcat wikidata.jnl.1.sorted.nt.gz | grep -v http://wikiba.se/ontology#timestamp | grep -v " _:t" | gzip -c > wikidata.jnl.1.cleandata.gz
zcat wikidata.jnl.2.sorted.nt.gz | grep -v http://wikiba.se/ontology#timestamp | grep -v " _:t" | gzip -c > wikidata.jnl.2.cleandata.gz

Compare:

comm -3 <(zcat wikidata.jnl.1.cleandata.gz) <(zcat wikidata.jnl.2.cleandata.gz) > wikidata.jnl.12.diff
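
For repeatability, the dump/sort/cleanup steps above can be wrapped in a small script. This is only a sketch reusing the exact commands from this procedure; the script name and the assumption that everything lives in the current working directory are mine.

#!/bin/bash
# prepare.sh <n> -- dump, sort and clean journal <n> (1 or 2)
set -euo pipefail
n="$1"

# dump the journal described by RWStore.$n.properties -> wikidata.jnl.$n.data.gz
java -Dlogback.configurationFile=./logback.xml -cp 'warlib/*:warlib/*.jar' \
    com.bigdata.rdf.store.RebuildJournal -dump "RWStore.$n.properties"

# strip statement metadata and sort
zcat "wikidata.jnl.$n.data.gz" | sed 's/ : .*$//' | sort | gzip > "wikidata.jnl.$n.sorted.nt.gz"

# drop volatile triples (timestamps and bnodes)
zcat "wikidata.jnl.$n.sorted.nt.gz" \
    | grep -v 'http://wikiba.se/ontology#timestamp' \
    | grep -v ' _:t' \
    | gzip -c > "wikidata.jnl.$n.cleandata.gz"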

Loading 1 hour 25 minutes of updates starting from 201908010000 under both updaters shows no differences except ones that can be attributed to edits (since we always load the latest revision of an entity, even when replaying old changes). So this first test seems to be a success.

Smalyshev triaged this task as Medium priority. Aug 29 2019, 6:11 AM

Differences in bnodes might be tolerated with an additional replacement. The cleanup stage can be merged into the initial sed+sort:

zcat wikidata.jnl.1.data.gz | sed 's/ : .*$//;s/_:t[^,>]*/bnode/g' | grep -v http://wikiba.se/ontology#timestamp | sort | gzip > wikidata.jnl.1.sorted.gz
zcat wikidata.jnl.2.data.gz | sed 's/ : .*$//;s/_:t[^,>]*/bnode/g' | grep -v http://wikiba.se/ontology#timestamp | sort | gzip > wikidata.jnl.2.sorted.gz

Then the comparison is just:

comm -3 <(zcat wikidata.jnl.1.sorted.gz) <(zcat wikidata.jnl.2.sorted.gz) > wikidata.jnl.diff
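
To illustrate what the extra substitution does (the line shape below is invented purely for the example; the real dump format may differ): any label of the form _:t… is collapsed to the constant bnode, so two journals that differ only in bnode identifiers compare as equal.

echo '<s> <p> _:t4f2a1c, <o>' | sed 's/_:t[^,>]*/bnode/g'
echo '<s> <p> _:t9b7d03, <o>' | sed 's/_:t[^,>]*/bnode/g'
# both print: <s> <p> bnode, <o>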

Change 551167 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] query_service: add updater mode option

https://gerrit.wikimedia.org/r/551167

Change 551169 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] Switch wdqs1004 to merging updater mode

https://gerrit.wikimedia.org/r/551169

Change 551549 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] add logging config for exporting updated entities

https://gerrit.wikimedia.org/r/551549

Change 551167 merged by Gehel:
[operations/puppet@production] query_service: add updater mode option

https://gerrit.wikimedia.org/r/551167

Change 551169 merged by Gehel:
[operations/puppet@production] Switch wdqs1004 to merging updater mode

https://gerrit.wikimedia.org/r/551169

Mentioned in SAL (#wikimedia-operations) [2019-11-19T11:16:57Z] <gehel> restarting wdqs updater on wdqs1004 - T231411

Lag is climbing and the updater logs are quiet on wdqs1004; something is wrong. Thread dumps:

And the blazegraph threads:
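
A minimal sketch of how such dumps can be captured with jstack (the pgrep pattern, service user and output path are assumptions, not the exact commands used here):

# run jstack as the user that owns the JVM
pid="$(pgrep -f -o blazegraph)"
sudo -u blazegraph jstack "$pid" > "blazegraph.threads.$(date +%s).txt"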

Change 551796 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "Switch wdqs1004 to merging updater mode"

https://gerrit.wikimedia.org/r/551796

What about the new UPDATED_ENTITY_IDS logger? Does it track updated entity IDs, and how many per minute/hour?
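
If that logger writes one entity ID per line with a leading ISO timestamp (an assumption about its format, and the log path below is a placeholder), a rough per-minute count could be obtained with:

# bucket lines by minute, assuming a prefix like 2019-11-19T11:16:57
awk '{ print substr($1, 1, 16) }' /var/log/wdqs/updated-entity-ids.log | sort | uniq -c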

Change 551796 merged by Gehel:
[operations/puppet@production] Revert "Switch wdqs1004 to merging updater mode"

https://gerrit.wikimedia.org/r/551796

Output of iostat -x 1 and sudo iotop?
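
A bounded sample is easy to attach here, e.g. (durations and output file names are arbitrary):

# 30 one-second samples of extended device statistics
iostat -x 1 30 > iostat.wdqs1004.txt
# 30 batch-mode iterations, only processes actually doing I/O
sudo iotop -b -n 30 -o > iotop.wdqs1004.txt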

Mentioned in SAL (#wikimedia-operations) [2019-11-19T11:37:57Z] <gehel> restarting wdqs blazegraph on wdqs1004 - T231411

The new updater did not seem to process any updates. The overall number of triples dropped significantly during the switch to the new updater. This requires more analysis before trying again.

wdqs1004 is depooled at the moment; let's make sure it is in good shape before repooling, maybe by copying the journal from another host first.

Mentioned in SAL (#wikimedia-operations) [2019-11-19T14:34:58Z] <gehel> restarting blazegraph with additional logging on wdqs1004 - T231411

Zbyszko subscribed.

We're going in a new direction: rewriting the updater as a streaming application.