
Reload wikidata journal from fresh dumps
Closed, Resolved · Public · 3 Estimated Story Points

Description

As a WDQS user, I want to query a dataset that is consistent with the state of Wikidata.

During the past months, several issues (T255657, T266211, T264042, T267924, T267175, T268408, T272120, T273937) caused the WDQS journal to fall out of sync. Salvaging the current journal seems difficult, since detecting all the discrepancies is not possible; reloading the data from a dump is probably the best approach.

The approach is to run the wdqs.data-reload cookbook on one machine, catch up on the update lag, and then copy the fresh journal over to the other machines using the wdqs.data-transfer cookbook.
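As a rough sketch (assembled from the cookbook invocations recorded later in this task, so the hostnames are just the ones used in this reload), the sequence looks like:

# 1) Reload one depooled host from the dumps (run from a cumin host)
sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 --depool

# 2) Once the reloaded host has caught up on lag, copy its journal to the other hosts
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927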

AC:

  • Blazegraph is running with a journal created from a recent RDF dump on all servers (except wdqs1009)

Current status

Note: We need to fix https://phabricator.wikimedia.org/T280382 first, which will require starting from a clean state, so the status below is moot for now: we'll need to re-image these servers anyway.

[EQIAD PUBLIC]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

NEXT sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1012.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[CODFW PUBLIC]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
NEXT sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[EQIAD INTERNAL]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1003.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1008.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
NEXT sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[CODFW INTERNAL]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
NEXT sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927


Event Timeline


I pre-fetched the dumps required for the reload on wdqs1010 & wdqs1009.

  • wdqs1009 needs to be reloaded using --reuse-downloaded-dump --reload-data wikidata --skolemize
  • wdqs1010 with --reuse-downloaded-dump --reload-data wikidata

Mentioned in SAL (#wikimedia-operations) [2021-02-05T22:42:01Z] <ryankemper> T267927 sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 failing with ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF

sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 is failing with:

ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 && echo 'done with 1009' >> /home/ryankemper/wdqs_wikidata_reload.log && echo 'done!'
START - Cookbook sre.wdqs.data-reload
----- OUTPUT of 'test -f /srv/wdq...test-all.ttl.bz2' -----
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.92hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'test -f /srv/wdq...test-all.ttl.bz2'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected dump (/srv/wdqs/latest-all.ttl.bz2). Skipping download
----- OUTPUT of 'test -f /srv/wdq...-lexemes.ttl.bz2' -----
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.42hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'test -f /srv/wdq...-lexemes.ttl.bz2'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected dump (/srv/wdqs/latest-lexemes.ttl.bz2). Skipping download
checking available disk space
----- OUTPUT of 'dump_size=`du /s...avail+$db_size))' -----
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.35hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'dump_size=`du /s...avail+$db_size))'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Running munger for main database and then lexeme
munging /srv/wdqs/munged (skolemizaton: True)
----- OUTPUT of 'rm -rf /srv/wdqs...d -- --skolemize' -----
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
22:56:53.717 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected an RDF value here, found '{' [line 2]
        at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
        at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
        at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:662)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicate(TurtleParser.java:505)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:420)
        at org.openrdf.rio.turtle.TurtleParser.parseImplicitBlank(TurtleParser.java:613)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:394)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:259)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
        at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
        at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
================
PASS |                                                                                                                                                                                                                                                                                                                          |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.10hosts/s]
100.0% (1/1) of nodes failed to execute command 'rm -rf /srv/wdqs...d -- --skolemize': wdqs1009.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'rm -rf /srv/wdqs...d -- --skolemize'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.wdqs.data-reload:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 226, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 19, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 225, in run
    munge(remote_host, args.skolemize)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 123, in munge
    .format(path=dump['path'], munge_path=dump['munge_path'], skolemize="--skolemize" if skolemize else ""),
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)

sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 is failing with:

ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927
START - Cookbook sre.wdqs.data-reload
----- OUTPUT of 'test -f /srv/wdq...test-all.ttl.bz2' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.92hosts/s]
FAIL |                                                                                                                                    |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'test -f /srv/wdq...test-all.ttl.bz2'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected dump (/srv/wdqs/latest-all.ttl.bz2). Skipping download
----- OUTPUT of 'test -f /srv/wdq...-lexemes.ttl.bz2' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.44hosts/s]
FAIL |                                                                                                                                    |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'test -f /srv/wdq...-lexemes.ttl.bz2'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected dump (/srv/wdqs/latest-lexemes.ttl.bz2). Skipping download
checking available disk space
----- OUTPUT of 'dump_size=`du /s...avail+$db_size))' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.45hosts/s]
FAIL |                                                                                                                                    |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'dump_size=`du /s...avail+$db_size))'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Running munger for main database and then lexeme
munging /srv/wdqs/munged (skolemizaton: False)
----- OUTPUT of 'rm -rf /srv/wdqs.../wdqs/munged -- ' -----
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
22:50:03.137 [main] ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF
org.openrdf.rio.RDFParseException: Expected an RDF value here, found '{' [line 2]
        at org.openrdf.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:440)
        at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:685)
        at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1405)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:662)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicate(TurtleParser.java:505)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:420)
        at org.openrdf.rio.turtle.TurtleParser.parseImplicitBlank(TurtleParser.java:613)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:394)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:259)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
        at org.wikidata.query.rdf.tool.Munge.run(Munge.java:105)
        at org.wikidata.query.rdf.tool.Munge.main(Munge.java:59)
22:50:03.134 [org.wikidata.query.rdf.tool.rdf.AsyncRDFHandler$RDFActionsReplayer] INFO  org.wikidata.query.rdf.tool.Munge - Switching to /srv/wdqs/munged/wikidump-000000001.ttl.gz
================
PASS |                                                                                                                                    |   0% (0/1) [00:01<?, ?hosts/s]
FAIL |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.10s/hosts]
100.0% (1/1) of nodes failed to execute command 'rm -rf /srv/wdqs.../wdqs/munged -- ': wdqs1010.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'rm -rf /srv/wdqs.../wdqs/munged -- '. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.wdqs.data-reload:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 226, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 19, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 225, in run
    munge(remote_host, args.skolemize)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 123, in munge
    .format(path=dump['path'], munge_path=dump['munge_path'], skolemize="--skolemize" if skolemize else ""),
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)

Looks like the same error for both; it seems to be occurring at the munging step.
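The parse error ("Expected an RDF value here, found '{' [line 2]") suggests the munger isn't seeing valid Turtle at the start of the dump file. A quick sanity check one could run on the host before retrying (hypothetical, not part of the cookbook):

bzcat /srv/wdqs/latest-all.ttl.bz2 | head -n 5   # a healthy dump starts with @prefix declarations, not '{'
bzip2 -t /srv/wdqs/latest-all.ttl.bz2            # check the archive itself isn't truncated or corrupt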

Mentioned in SAL (#wikimedia-operations) [2021-02-05T23:35:16Z] <ryankemper> T267927 Re-downloading latest dumps (main database, lexeme) in tmux session downloads_dumps on ryankemper@wdqs1009.eqiad.wmnet

Still waiting for the latest dumps to finish downloading (a few more hours), then we need to reboot the WDQS hosts as part of https://phabricator.wikimedia.org/T274213, and then we can do the actual data-reload.

Mentioned in SAL (#wikimedia-operations) [2021-02-09T18:37:40Z] <ryankemper> T267927 [WDQS Data Reload] Clearing old wikidata journal file to free disk space before beginning data reload: sudo systemctl status wdqs-blazegraph && sudo systemctl stop wdqs-blazegraph && sudo rm -fv /srv/wdqs/wikidata.jnl && sudo systemctl start wdqs-blazegraph on wdqs100[9,10]

Mentioned in SAL (#wikimedia-operations) [2021-02-09T18:40:25Z] <ryankemper> T267927 [WDQS Data Reload] sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 on ryankemper@cumin1001 tmux session wdqs_data_reload_1009

Mentioned in SAL (#wikimedia-operations) [2021-02-09T18:40:57Z] <ryankemper> T267927 [WDQS Data Reload] sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 on ryankemper@cumin1001 tmux session wdqs_data_reload_1009

Mentioned in SAL (#wikimedia-operations) [2021-02-09T18:41:53Z] <ryankemper> T267927 [WDQS Data Reload] Small typo in previous SAL log message, see subsequent SAL line for correction:

Mentioned in SAL (#wikimedia-operations) [2021-02-09T18:42:04Z] <ryankemper> T267927 [WDQS Data Reload] sudo cookbook sre.wdqs.data-reload wdqs1010.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 on ryankemper@cumin1001 tmux session wdqs_data_reload_1010

Gehel added a subscriber: Gehel.

Data reload failed on both wdqs1009 and wdqs1010.

wdqs1009: the host crashed; the data reload was continued manually, starting from chunk 407 (/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 407).

wdqs1010: looks like a disk error; investigation will continue in T274788. In the meantime, the data reload will be done on wdqs2008 instead.

Data reload started on wdqs2008:

gehel@cumin2001:~$ sudo -i cookbook sre.wdqs.data-reload --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' wdqs2008.codfw.wmnet --depool --proxy-server http://webproxy.eqiad.wmnet:8080

Mentioned in SAL (#wikimedia-operations) [2021-02-24T04:10:51Z] <ryankemper> T267927 [WDQS Data Reload] Running /srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 864 on ryankemper@wdqs2008 tmux session data_reload
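For context, a rough sketch of how the manual resume works, assuming the munged chunks are the wikidump-*.ttl.gz files mentioned in the munge log above (the -s value is the chunk number to resume from):

ls /srv/wdqs/munged/ | grep -c 'wikidump-.*\.ttl\.gz'   # count the munged chunks available
/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 864   # resume loading at chunk 864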

The last Puppet run was at Tue Feb 16 13:22:35 UTC 2021 (11219 minutes ago). Puppet is disabled. T267927: Reload wikidata jnl from fresh dumps - gehel@cumin2001 - T267927

Because Puppet has been disabled for more than 2 weeks, the host has been removed from Puppet, and thus is alerting in the PhysicalHosts Netbox report.
Is it possible to re-enable Puppet?


Yep, will do.

Loading the dump is complete on wdqs1009; restarting the updater.

The data load on wdqs2008 has resulted (again) in what looks like a corrupted journal. It will need to be restarted, but I'll wait on feedback from @dcausse / @Zbyszko to see if we have any short-term ideas on how to improve this.

Thanks, thinking a bit more about it, it might be better to put whatever blocks those reloads in Puppet behind a Hiera option. So puppet can stay enabled while not interfering with those actions.

Mentioned in SAL (#wikimedia-operations) [2021-02-25T18:50:05Z] <ryankemper> T267927 Trying to kick off the data reload on wdqs2008 from cumin2001 fails with spicerack.remote.RemoteError: No hosts provided. Some spelunking through IRC history suggests this happens when a host is not present in PuppetDB. I've confirmed wdqs2008 is absent from puppetboard, so running puppet agent to get it re-registered (hopefully)

Mentioned in SAL (#wikimedia-operations) [2021-02-25T18:59:18Z] <ryankemper> T267927 Manual puppet run got wdqs2008 present in puppetdb again. Now being blocked by lack of host key for wdqs2008 present on cumin2001, so I'm running puppet on cumin2001 to get the latest state of /etc/ssh/ssh_known_hosts

Mentioned in SAL (#wikimedia-operations) [2021-02-25T19:16:19Z] <ryankemper> T267927 Downloading dumps: sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2 on ryankemper@wdqs2008 tmux session download_latest_dumps

Downtimed wdqs2008 until 2021-03-04 21:56:59

Mentioned in SAL (#wikimedia-operations) [2021-02-26T04:23:31Z] <ryankemper> T267927 [WDQS Data Reload] sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --reuse-downloaded-dump --depool on ryankemper@cumin2001 tmux session wdqs_data_reload_2008

Note: Puppet is still disabled on wdqs2008 while the reload runs. It occurred to me that I'm not sure if puppet actually needs to be disabled during data reloads or if that's just a precaution we've historically taken - any insight here @Gehel?


I'm not sure what the answer to this is. However, I think that in general routine maintenance shouldn't require disabling Puppet, so if possible I think it would be good to implement ayounsi's suggestion above.


I'm not that familiar with Search, but I'm happy to help out on the Puppet side of things if I can.

FYI, I noticed the following comment in the reload cookbook:

# FIXME: this cookbook is expected to run for weeks, we don't want to disable puppet for that long
#        but we want to ensure that the various services aren't restarted by puppet along the way.

Change 672383 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] C:query_service: Add paramter to control if we manage services

https://gerrit.wikimedia.org/r/672383

Change 672384 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] P:query_service: add ability to disable managing services

https://gerrit.wikimedia.org/r/672384

I drafted a couple of changes (see above) that try to add a Hiera parameter to control this. Not sure if I got everything, but I think it should be a good start.

Mentioned in SAL (#wikimedia-operations) [2021-03-24T19:42:09Z] <ryankemper> T267927 Re-enabled puppet on wdqs2008 and ran puppet agent

Mentioned in SAL (#wikimedia-operations) [2021-03-24T19:56:58Z] <ryankemper> T267927 Host key is missing for wdqs2008 leading to data-transfer cookbook failing, looking into resolving

wdqs1009, wdqs1010, and wdqs2008 are done, so we need to data-transfer to the remaining instances.

It looks like wdqs1010 is warning (not critical) for disk space on /srv. Glancing at the box, it still has 44G left but is at 96% utilization:

/dev/mapper/vg0-srv 1.1T 999G 44G 96% /srv

This is essentially entirely attributable to wikidata.jnl, which is roughly 1 TB in size.
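A quick way to confirm this on the host (standard tools, nothing cookbook-specific):

df -h /srv                          # overall usage of the /srv volume
sudo ls -lh /srv/wdqs/wikidata.jnl  # size of the Blazegraph journal itself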

Change 675003 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/cookbooks@master] wdqs: only disable puppet during blazegraph rstrt

https://gerrit.wikimedia.org/r/675003

Change 672384 abandoned by Jbond:
[operations/puppet@production] P:query_service: add ability to disable managing services

Reason:

https://gerrit.wikimedia.org/r/672384

Change 672383 abandoned by Jbond:
[operations/puppet@production] C:query_service: Add paramter to control if we manage services

Reason:

https://gerrit.wikimedia.org/r/672383

Change 675003 merged by Ryan Kemper:
[operations/cookbooks@master] wdqs: only disable puppet during blazegraph rstrt

https://gerrit.wikimedia.org/r/675003

Mentioned in SAL (#wikimedia-operations) [2021-03-27T06:10:03Z] <ryankemper> T267927 sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O /srv/wdqs/latest-all.ttl.bz2 && sudo https_proxy=webproxy.codfw.wmnet:8080 wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 -O /srv/wdqs/latest-lexemes.ttl.bz2 on ryankemper@wdqs2008 tmux session download_dumps_2020-03-26

Mentioned in SAL (#wikimedia-operations) [2021-03-29T09:16:24Z] <ryankemper> T267927 sudo -i cookbook sre.wdqs.data-reload wdqs2008.codfw.wmnet --task-id T267927 --reload-data wikidata --reason 'T267927: Reload wikidata jnl from fresh dumps' --reuse-downloaded-dump --depool

Mentioned in SAL (#wikimedia-operations) [2021-04-14T09:06:28Z] <ryankemper> T267927 depool wdqs2001 following data transfer (catching up on lag)

Mentioned in SAL (#wikimedia-operations) [2021-04-14T09:12:03Z] <ryankemper> T267927 depooled wdqs1004 following data transfer (catching up on lag), current round of data transfers is done so there shouldn't be any left to depool

[EQIAD PUBLIC]
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1012.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[CODFW PUBLIC]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2002.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[EQIAD INTERNAL]
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1003.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1008.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

[CODFW INTERNAL]
DONE sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927
sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

Running multiple transfers from the same source host is not supported. The various cookbook instances don't synchronize, so Blazegraph is likely to be restarted before all transfers are completed, leading to data corruption. Note that in any case this would not speed up operations much, since the bottleneck is the outgoing network bandwidth on the source server.

At the moment, wdqs1003 and wdqs1004 have corrupted data and have been depooled.

I was having trouble understanding what the restart status of blazegraph has to do with the transfer of /srv/wdqs/wikidata.jnl. (As an aside, the point that running multiple simultaneous transfers from the same source is bottlenecked by outgoing bandwidth anyway makes perfect sense to me).

It just clicked for me though: the cookbook transfers the file while Blazegraph is disabled, so if one cookbook instance re-enables Blazegraph while another instance is still receiving /srv/wdqs/wikidata.jnl, the journal gets mutated during the transfer.
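In other words, transfers sharing a source have to run strictly one after another. A hypothetical way to serialize them from a cumin host (the loop itself is not a cookbook feature; the hosts are the eqiad public ones from this task):

for dest in wdqs1005.eqiad.wmnet wdqs1006.eqiad.wmnet wdqs1007.eqiad.wmnet; do
  sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest "$dest" --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927 || break
done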

Note: I updated my comment above to indicate that wdqs100[3,4] are not properly done due to the data corruption.

Mentioned in SAL (#wikimedia-operations) [2021-04-15T04:14:14Z] <ryankemper> T280108 T267927 wdqs2008 (source) caught up on lag, xfering to wdqs1004: sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

Mentioned in SAL (#wikimedia-operations) [2021-04-15T06:33:53Z] <ryankemper> !log T280108 T267927 data-transfer to wdqs1004 was successful; cookbook failed due to a newly introduced minor type error that didn't affect the transfer itself

Since we've seeded wdqs1004.eqiad.wmnet, we can start using it to transfer to the other eqiad public nodes. We still need to seed the other three clusters (codfw public, codfw internal, eqiad internal) before we'll be at peak "velocity".

Mentioned in SAL (#wikimedia-operations) [2021-04-15T16:17:13Z] <ryankemper> T280108 T267927 Merged https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/679702 and ran puppet-agent on cumin2001 before next round of wdqs data-transfers

Mentioned in SAL (#wikimedia-operations) [2021-04-15T16:21:44Z] <ryankemper> T280108 T267927 Current wdqs transfers in progress: wdqs1004->wdqs1005, wdqs2008->wdqs2001

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:33:28Z] <ryankemper> T280108 T267927 Data transfers completed successfully; small issue with new wait_for_updater logic is preventing termination so I ctrl+c'd manually

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:46:40Z] <ryankemper> T280108 T267927 Manually re-enabled and ran puppet on wdqs1005 (had closed the tmux pane which terminated the cookbook without letting it do its final cleanup)

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:48:47Z] <ryankemper> T267927 pooled wdqs1005 (all caught up on lag)

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:56:19Z] <ryankemper> T267927 WDQS kicked off next round of data-transfers: wdqs1004->wdqs1006, wdqs2001->wdqs2002, wdqs2008->wdqs1003

Mentioned in SAL (#wikimedia-operations) [2021-04-16T03:05:21Z] <ryankemper> T267927 Last round of data-transfers finished successfully, proceeding to next round

Mentioned in SAL (#wikimedia-operations) [2021-04-16T03:09:50Z] <ryankemper> T267927 kicked off next round of data-transfers: wdqs1004->wdqs1007, wdqs2001->wdqs2003, wdqs1003->wdqs1008, wdqs2008->wdqs2004

Mentioned in SAL (#wikimedia-operations) [2021-04-16T03:22:22Z] <ryankemper> T267927 Pooled wdqs1006 and wdqs2002

Mentioned in SAL (#wikimedia-operations) [2021-04-16T17:00:36Z] <ryankemper> T267927 Following data transfers complete: wdqs1004->wdqs1007, wdqs2001->wdqs2003, wdqs1003->wdqs1008, wdqs2008->wdqs2004

Mentioned in SAL (#wikimedia-operations) [2021-04-16T17:03:46Z] <ryankemper> T267927 Pooled wdqs1007, wdqs2003, wdqs1008, wdqs2004

We can't proceed any further on this ticket until we address https://phabricator.wikimedia.org/T280382, because the short-term fix for disk space will require re-imaging the affected wdqs hosts, which in turn will require starting the data-transfer process over (not the data-reload, fortunately).

Mentioned in SAL (#wikimedia-operations) [2021-04-16T17:48:15Z] <ryankemper> T267927 Transferring from wdqs2008->wdqs2003 to resolve the data corruption on wdqs2003

Mentioned in SAL (#wikimedia-operations) [2021-04-17T00:08:09Z] <ryankemper> T267927 Reload of wdqs2003 complete

Mentioned in SAL (#wikimedia-operations) [2021-04-17T00:14:54Z] <ryankemper> T267927 sudo run-puppet-agent and sudo pool on wdqs2003