
reload data on wdqs1004
Closed, Resolved, Public

Description

Looks like after the investigation in T188045: wdqs1004 broken, wdqs1004 is back to life. We now need to reload the data on it. We have the following options:

  1. Stop the Updater on one of the wdqs2* servers and copy the database from it
  2. Load data on one of the new wdqs2* servers (T187800) and then copy the DB from there
  3. Wait until the latest dump https://dumps.wikimedia.org/wikidatawiki/entities/20180312/ is available (probably Thursday, dumps are slow right now), then load and parse the dump to rebuild the database.

Event Timeline

Restricted Application added projects: Wikidata, Discovery. Mar 13 2018, 12:47 AM
Restricted Application added a subscriber: Aklapper.
Smalyshev triaged this task as High priority. Mar 13 2018, 12:47 AM

@Gehel is there any easy way to copy the .jnl file between machines? It's 413G but probably compressible, so we'd be moving about 100-200G of data if we zip it up. But I am not sure how to do it. Or we reload from the dump, but then we'd have to wait till Thursday/Friday, or use a week-old dump and let it catch up for a while.
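One way to sanity-check that compressibility estimate before committing to a full transfer would be to gzip a sample of the journal. This is a hypothetical sketch; the journal file name and the 100 MB sample size are assumptions:

```shell
# Estimate the compression ratio from the first 100 MB of the journal.
JNL=wikidata.jnl                     # assumed journal file name
SAMPLE=$(head -c 100000000 "$JNL" | gzip -9 | wc -c)
echo "first 100000000 bytes compress to $SAMPLE bytes"
```

If the sample compresses to roughly half its size, the 100-200G estimate for the full 413G journal is plausible.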

Gehel added a comment. Mar 13 2018, 9:16 AM

@Smalyshev yes, there is a way to copy the data between wdqs nodes. I'll take care of it and document it here. The new wdqs cluster is not yet done reloading, so I'll take the data from wdqs2001 (I prefer shutting down a node in codfw to running on a single node in eqiad, even if that should be fine).

Mentioned in SAL (#wikimedia-operations) [2018-03-13T10:00:04Z] <gehel> shuttind down blazegraph on wdqs2001 for data transfer to wdqs1004 - T189548

Data transfer done with:

wdqs1004 (receiving):

nc -l -p 9876 | gunzip | pv -b -r > wikidata.jnl

wdqs2001 (sending):

cat wikidata.jnl | gzip -9 | nc -w 3 wdqs1004.eqiad.wmnet 9876

The transfer is not encrypted (but does not contain any PII). Checking the result with sha256sum.
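The sha256sum check mentioned above amounts to running the same command on both hosts and comparing the digests, roughly:

```shell
# Compute the digest of the journal; run on both wdqs2001 and wdqs1004.
sha256sum wikidata.jnl
# Output format: "<64-hex-char digest>  wikidata.jnl"
# The two digests must be identical before restarting blazegraph on the
# receiving side; any difference means the transfer was corrupted.
```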

After experimenting a bit, I removed gzip from the pipeline. It looks like gzip is CPU bound (and not multi-threaded). Even with gzip -1, the transfer rate is slower than with no compression.

Data transfer completed from wdqs2001 to wdqs1004. Procedure is documented on wiki. The updater is catching up on a few hours of changes. Things look stable.

@Smalyshev: do you want to do a specific check before I re-pool wdqs1004?

Mentioned in SAL (#wikimedia-operations) [2018-03-13T18:42:18Z] <gehel> repool wdqs1004 & wdqs2001 now that data reload is completed T189548

Nodes are repooled, all seems good. This can be closed!

Smalyshev closed this task as Resolved. Mar 13 2018, 6:51 PM
Smalyshev claimed this task.