
Deploy the wdqs streaming updater to production
Closed, ResolvedPublic

Description

Values tracked during the process (please fill them when making progress)

  • Week of sept. 20:
    • Send a communication about the rollout
    • increase retention to 1month on codfw.rdf-streaming-updater.mutation (topic name is about to change) in kafka-main@codfw
    • make sure retention is 1month on eqiad.rdf-streaming-updater.mutation (topic name is about to change) in kafka-main@eqiad
  • Week of sept. 27:
    • (friday oct. 1): bootstrap flink, note DUMP_START_DATE and FLINK_(EQIAD|CODFW)_JOB_START
    • (friday oct. 1): pre-fetch the dumps to wdqs1009 and wdqs2008 and note LEXEME_DUMP and ENTITY_DUMP
    • (friday oct. 1): start the data-reload cookbook with --reload-data wikidata --skolemize [TODO: new options to manage kafka offsets with FLINK_EQIAD_JOB_START] on wdqs1009
    • (friday oct. 1): start the data-reload cookbook with --reload-data wikidata --skolemize [TODO: new options to manage kafka offsets with FLINK_CODFW_JOB_START] on wdqs2008
    • (friday oct. 1): merge the activation of the streaming updater profile on wdqs2008 while the reload is happening there
  • Week of oct. 4:
    • Monitor that the reload is progressing properly
  • Week of oct. 11:
    • Send a quick reminder to users
    • Start data transfer (use the new option to activate kafka offsets propagation and always activate the streaming_updater profile via puppet on the target machine)
    • (internal cluster)
      • wdqs1009 -> wdqs1003
      • wdqs2008 -> wdqs2005
      • wdqs1003 -> wdqs1008
      • wdqs2008 -> wdqs2006
      • wdqs1003 -> wdqs1011
    • (external cluster)
      • wdqs1003 -> wdqs1004
      • wdqs2008 -> wdqs2001
      • wdqs1003 -> wdqs1005
      • wdqs2008 -> wdqs2002
      • wdqs1003 -> wdqs1006
      • wdqs2008 -> wdqs2003
      • wdqs1003 -> wdqs1007
      • wdqs2008 -> wdqs2004
      • wdqs1003 -> wdqs1012
      • wdqs2008 -> wdqs2007
      • wdqs1003 -> wdqs1013
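
The transfer order above matters because wdqs1003 is used as a source only after it has received its copy from wdqs1009. A minimal sketch of a consistency check on that plan follows. The `check_plan` helper is hypothetical, and treating the first digit of the host number as the datacenter is an assumption made for illustration.

```python
# Transfer plan copied from the list above, in the order given: (source, destination).
TRANSFERS = [
    # internal cluster
    ("wdqs1009", "wdqs1003"), ("wdqs2008", "wdqs2005"),
    ("wdqs1003", "wdqs1008"), ("wdqs2008", "wdqs2006"),
    ("wdqs1003", "wdqs1011"),
    # external cluster
    ("wdqs1003", "wdqs1004"), ("wdqs2008", "wdqs2001"),
    ("wdqs1003", "wdqs1005"), ("wdqs2008", "wdqs2002"),
    ("wdqs1003", "wdqs1006"), ("wdqs2008", "wdqs2003"),
    ("wdqs1003", "wdqs1007"), ("wdqs2008", "wdqs2004"),
    ("wdqs1003", "wdqs1012"), ("wdqs2008", "wdqs2007"),
    ("wdqs1003", "wdqs1013"),
]

# Hosts populated by the data-reload cookbook rather than by a transfer.
RELOADED = {"wdqs1009", "wdqs2008"}


def check_plan(transfers, reloaded):
    """Return a list of problems; an empty list means the plan is consistent."""
    ready = set(reloaded)
    problems = []
    for source, dest in transfers:
        if source not in ready:
            problems.append(f"{source} has no data yet when copying to {dest}")
        # Assumption: "wdqs1..." and "wdqs2..." denote different datacenters.
        if source[:5] != dest[:5]:
            problems.append(f"{source} -> {dest} crosses datacenters")
        if dest in ready:
            problems.append(f"{dest} is populated twice")
        ready.add(dest)
    return problems


if __name__ == "__main__":
    for problem in check_plan(TRANSFERS, RELOADED):
        print(problem)
```

Running it on the plan as written should report nothing. Moving the wdqs1009 -> wdqs1003 step later in the list would flag every transfer that reads from wdqs1003 before it is populated.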

Note: Migrations ended at 6am Oct 19.

Note: wdqs1010 is kept with the old updater and the old journal.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +3 -3
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +28 -6

Event Timeline

Change 721280 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: Add a streaming updater profile

https://gerrit.wikimedia.org/r/721280

Change 721281 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: activate the streaming_updater role on wdqs2008

https://gerrit.wikimedia.org/r/721281

Change 721280 merged by Ryan Kemper:

[operations/puppet@production] wdqs: Prepare streaming updater settings

https://gerrit.wikimedia.org/r/721280

Change 721281 merged by Gehel:

[operations/puppet@production] wdqs: activate the streaming_updater role on wdqs2008

https://gerrit.wikimedia.org/r/721281

Change 730794 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2005

https://gerrit.wikimedia.org/r/730794

Change 730795 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2006

https://gerrit.wikimedia.org/r/730795

Change 730796 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2001

https://gerrit.wikimedia.org/r/730796

Change 730797 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2002

https://gerrit.wikimedia.org/r/730797

Change 730798 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2003

https://gerrit.wikimedia.org/r/730798

Change 730799 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2004

https://gerrit.wikimedia.org/r/730799

Change 730800 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2007

https://gerrit.wikimedia.org/r/730800

Change 730814 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1003

https://gerrit.wikimedia.org/r/730814

Change 730815 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1008

https://gerrit.wikimedia.org/r/730815

Change 730816 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1011

https://gerrit.wikimedia.org/r/730816

Change 730817 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1004

https://gerrit.wikimedia.org/r/730817

Change 730818 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1005

https://gerrit.wikimedia.org/r/730818

Change 730819 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1006

https://gerrit.wikimedia.org/r/730819

Change 730820 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1007

https://gerrit.wikimedia.org/r/730820

Change 730821 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1012

https://gerrit.wikimedia.org/r/730821

Change 730822 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1013

https://gerrit.wikimedia.org/r/730822

Mentioned in SAL (#wikimedia-operations) [2021-10-14T15:52:51Z] <ryankemper> T288231 ryankemper@wdqs2005:~$ sudo depool

Mentioned in SAL (#wikimedia-operations) [2021-10-14T15:54:18Z] <ryankemper> T288231 ryankemper@wdqs2008:~$ sudo depool

Change 730794 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2005

https://gerrit.wikimedia.org/r/730794

Mentioned in SAL (#wikimedia-operations) [2021-10-14T16:04:53Z] <ryankemper> T288231 ryankemper@wdqs2005:~$ sudo run-puppet-agent --force

Mentioned in SAL (#wikimedia-operations) [2021-10-14T16:07:07Z] <ryankemper> T288231 About to ctrl+c out of ongoing data transfer because puppet run following merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/730794 restarted blazegraph; we'll manually disable updater and kick off the transfer again

Mentioned in SAL (#wikimedia-operations) [2021-10-14T16:44:13Z] <ryankemper> T288231 Manually killed dangling pigz / nc processes on wdqs2008 (and wdqs2005 implicitly). Should be in the right state to re-start the data-transfer cookbook again

[data-transfer cookbook -> kafka offset output]

(including this output since it's the first time we've relied on the new kafka offset part of the data-transfer process; won't be posting the output for any subsequent cookbooks runs)

Also note that the output is a little tricky to parse. I can't tell whether it succeeded and the Error sending OffsetCommitRequest_v2... log lines were errors that were resolved upon retry, or whether the cookbook falsely reported that the offsets were set. But since wdqs2005 is reporting the correct number of triples, I have to assume that it worked properly. (This also lines up with the fact that we do have log lines that seem to report success: Offsets set for target cluster "main", site "codfw" and consumer group "wdqs2005")

Transferring Kafka offsets
<BrokerConnection node_id=bootstrap-2 host=kafka-main2004.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: connecting to kafka-main2004.codfw.wmnet:9093 [('2620:0:860:104:10:192:48:38', 9093, 0, 0) IPv6]
Probing node bootstrap-2 broker version
<BrokerConnection node_id=bootstrap-2 host=kafka-main2004.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Connection complete.
Broker version identifed as 1.0.0
Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
<BrokerConnection node_id=bootstrap-1 host=kafka-main2004.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: connecting to kafka-main2004.codfw.wmnet:9093 [('2620:0:860:104:10:192:48:38', 9093, 0, 0) IPv6]
Probing node bootstrap-1 broker version
<BrokerConnection node_id=bootstrap-1 host=kafka-main2004.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Connection complete.
Broker version identifed as 1.0.0
Set configuration api_version=(1, 0, 0) to skip auto check_version requests on startup
Same cluster, setting offsets...
<BrokerConnection node_id=2004 host=kafka-main2004.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: connecting to kafka-main2004.codfw.wmnet:9093 [('2620:0:860:104:10:192:48:38', 9093, 0, 0) IPv6]
<BrokerConnection node_id=2001 host=kafka-main2001.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:101:10:192:0:17', 9093, 0, 0)]>: connecting to kafka-main2001.codfw.wmnet:9093 [('2620:0:860:101:10:192:0:17', 9093, 0, 0) IPv6]
<BrokerConnection node_id=2004 host=kafka-main2004.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Connection complete.
<BrokerConnection node_id=bootstrap-2 host=kafka-main2004.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=2005 host=kafka-main2005.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: connecting to kafka-main2005.codfw.wmnet:9093 [('2620:0:860:104:10:192:48:46', 9093, 0, 0) IPv6]
Group coordinator for wdqs2008 is BrokerMetadata(nodeId='coordinator-2003', host='kafka-main2003.codfw.wmnet', port=9093, rack=None)
Discovered coordinator coordinator-2003 for group wdqs2008
<BrokerConnection node_id=2001 host=kafka-main2001.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:101:10:192:0:17', 9093, 0, 0)]>: Connection complete.
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: connecting to kafka-main2003.codfw.wmnet:9093 [('2620:0:860:103:10:192:32:136', 9093, 0, 0) IPv6]
<BrokerConnection node_id=2005 host=kafka-main2005.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: Connection complete.
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: Connection complete.
Extracted offsets from source cluster "main", site "codfw" and consumer group "wdqs2008".
<BrokerConnection node_id=bootstrap-3 host=kafka-main2005.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: connecting to kafka-main2005.codfw.wmnet:9093 [('2620:0:860:104:10:192:48:46', 9093, 0, 0) IPv6]
Group coordinator for wdqs2005 is BrokerMetadata(nodeId='coordinator-2003', host='kafka-main2003.codfw.wmnet', port=9093, rack=None)
Discovered coordinator coordinator-2003 for group wdqs2005
Error sending OffsetCommitRequest_v2 to node coordinator-2003 [NodeNotReadyError: coordinator-2003]
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <connecting> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: connecting to kafka-main2003.codfw.wmnet:9093 [('2620:0:860:103:10:192:32:136', 9093, 0, 0) IPv6]
Error sending OffsetCommitRequest_v2 to node coordinator-2003 [NodeNotReadyError: coordinator-2003]
Error sending OffsetCommitRequest_v2 to node coordinator-2003 [NodeNotReadyError: coordinator-2003]
<BrokerConnection node_id=bootstrap-3 host=kafka-main2005.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: Connection complete.
Error sending OffsetCommitRequest_v2 to node coordinator-2003 [NodeNotReadyError: coordinator-2003]
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <handshake> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: Connection complete.
<BrokerConnection node_id=bootstrap-1 host=kafka-main2004.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=bootstrap-3 host=kafka-main2005.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: Closing connection.
Offsets set for target cluster "main", site "codfw" and consumer group "wdqs2005".
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=2004 host=kafka-main2004.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:104:10:192:48:38', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=2001 host=kafka-main2001.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:101:10:192:0:17', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=2005 host=kafka-main2005.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:104:10:192:48:46', 9093, 0, 0)]>: Closing connection.
<BrokerConnection node_id=coordinator-2003 host=kafka-main2003.codfw.wmnet:9093 <connected> [IPv6 ('2620:0:860:103:10:192:32:136', 9093, 0, 0)]>: Closing connection.
Done.
Starting services [systemctl start wdqs-blazegraph && sleep 10 && systemctl start wdqs-updater]
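
Since the output above is hard to read by eye, here is a rough sketch of how it could be summarized. It relies only on the three message prefixes visible in this log; `summarize_offset_transfer` is a hypothetical helper, not part of the cookbook, and a "success" here only means the cookbook printed its success line after the last commit error. It does not prove the committed offsets are correct (the triple count comparison remains the real check).

```python
import re

GROUP_RE = re.compile(r'consumer group "([^"]+)"')


def _group(line):
    """Extract the consumer group name from a log line, or None if absent."""
    match = GROUP_RE.search(line)
    return match.group(1) if match else None


def summarize_offset_transfer(log_text):
    """Summarize the kafka offset part of the data-transfer cookbook output."""
    summary = {
        "source_group": None,
        "target_group": None,
        "commit_errors": 0,
        "reported_success": False,
    }
    last_error = -1
    success_at = -1
    for i, line in enumerate(log_text.splitlines()):
        if line.startswith("Error sending OffsetCommitRequest"):
            summary["commit_errors"] += 1
            last_error = i
        elif line.startswith("Extracted offsets from source cluster"):
            summary["source_group"] = _group(line)
        elif line.startswith("Offsets set for target cluster"):
            summary["target_group"] = _group(line)
            success_at = i
    # Only count success if the success line came after the last commit error.
    summary["reported_success"] = success_at > last_error
    return summary
```

Applied to the run above, this would report four commit errors followed by a success line for consumer group wdqs2005, which matches the "errors resolved upon retry" reading.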

Mentioned in SAL (#wikimedia-operations) [2021-10-14T22:28:52Z] <ryankemper> T288231 ryankemper@wdqs2005:~$ sudo pool: transfer completed successfully; tests passing on host (used ssh -L 9999:localhost:80 wdqs2005.codfw.wmnet to establish tunnel)

Change 730795 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2006

https://gerrit.wikimedia.org/r/730795

Mentioned in SAL (#wikimedia-operations) [2021-10-14T22:32:47Z] <ryankemper> T288231 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/730795; proceeding to data-transfer on wdqs2006: sudo rm -fv /srv/wdqs/data_loaded on wdqs2006 followed by ryankemper@cumin1001:~$ sudo cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2006.codfw.wmnet --reason "streaming updater cutover for wdqs2005" --blazegraph_instance blazegraph --task-id T288231

Mentioned in SAL (#wikimedia-operations) [2021-10-14T22:33:48Z] <ryankemper> T288231 Forgot about running puppet-agent on wdqs2006; aborted cookbook run

Mentioned in SAL (#wikimedia-operations) [2021-10-14T22:35:00Z] <ryankemper> T288231 Ran puppet on wdqs2006, now back to the cookbook run
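
To make the per-host sequence harder to get wrong (puppet has to run on the target before the transfer starts, otherwise the puppet run restarts blazegraph mid-transfer as happened on wdqs2005), the steps could be generated instead of typed. This is only a sketch: `transfer_steps` and `fqdn` are hypothetical helpers, the commands are the ones quoted in the SAL entries above, the eqiad mapping for wdqs1xxx hosts is an assumption, and the relative order of the puppet run and the `rm` is a guess.

```python
import shlex

# Assumption: the first digit of the host number selects the site.
SITES = {"1": "eqiad", "2": "codfw"}


def fqdn(host):
    """Map e.g. 'wdqs2006' to 'wdqs2006.codfw.wmnet'."""
    return f"{host}.{SITES[host[len('wdqs')]]}.wmnet"


def transfer_steps(source, dest, task="T288231",
                   reason="streaming updater cutover"):
    """Return (where, command) pairs to run once the puppet patch is merged."""
    cookbook = " ".join([
        "sudo cookbook sre.wdqs.data-transfer",
        "--source", fqdn(source),
        "--dest", fqdn(dest),
        "--reason", shlex.quote(f"{reason} for {dest}"),
        "--blazegraph_instance blazegraph",
        "--task-id", task,
    ])
    return [
        (dest, "sudo run-puppet-agent --force"),
        (dest, "sudo rm -fv /srv/wdqs/data_loaded"),
        ("cumin1001", cookbook),
    ]


if __name__ == "__main__":
    for where, command in transfer_steps("wdqs2008", "wdqs2006"):
        print(f"{where}: {command}")
```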

Mentioned in SAL (#wikimedia-operations) [2021-10-15T02:14:38Z] <ryankemper> T288231 wdqs2006 data transfer complete and all tests passing on the host. All of codfw wdqs-internal is on the new streaming updater

Change 730796 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2001

https://gerrit.wikimedia.org/r/730796

Change 730797 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2002

https://gerrit.wikimedia.org/r/730797

Change 730798 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2003

https://gerrit.wikimedia.org/r/730798

Change 730799 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2004

https://gerrit.wikimedia.org/r/730799

Change 730814 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1003

https://gerrit.wikimedia.org/r/730814

Change 730800 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs2007

https://gerrit.wikimedia.org/r/730800

Change 730815 merged by Gehel:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1008

https://gerrit.wikimedia.org/r/730815

Change 730816 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1011

https://gerrit.wikimedia.org/r/730816

Change 730817 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1004

https://gerrit.wikimedia.org/r/730817

Change 730818 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1005

https://gerrit.wikimedia.org/r/730818

Change 730819 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1006

https://gerrit.wikimedia.org/r/730819

Hi folks, I was trying to figure out why we had so many UNKNOWNs related to wdqs nodes in icinga and I noticed this task. The new streaming updater metrics seem to have a typo:

wdqs_streaming_updater_kafka_stream_consumer_lag_Value

Meanwhile the puppet config queries the name without the _Value suffix:

query           => "scalar(wdqs_streaming_updater_kafka_stream_consumer_lag{instance=\"${::hostname}:9101\"})",
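
A mismatch like this (a monitor querying a metric name that is never exported, so the check returns UNKNOWN) can be caught with a small comparison. The sketch below is illustrative only: `queried_metric` and `check_query` are hypothetical helpers, and the set of exported names would in practice have to come from the exporter on port 9101.

```python
import re

# A metric name immediately followed by a label selector, e.g. name{instance="..."}.
METRIC_RE = re.compile(r"([A-Za-z_:][A-Za-z0-9_:]*)\{")


def queried_metric(query):
    """Return the first metric name used with a label selector in the query."""
    match = METRIC_RE.search(query)
    return match.group(1) if match else None


def check_query(query, exported_names):
    """Return (metric, candidates).

    candidates is empty when the metric is exported as-is; otherwise it lists
    exported names that start with the queried name.
    """
    name = queried_metric(query)
    if name in exported_names:
        return name, []
    near = sorted(n for n in exported_names if name and n.startswith(name))
    return name, near


if __name__ == "__main__":
    query = ('scalar(wdqs_streaming_updater_kafka_stream_consumer_lag'
             '{instance="wdqs2005:9101"})')
    exported = {"wdqs_streaming_updater_kafka_stream_consumer_lag_Value"}
    print(check_query(query, exported))
```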

Change 731282 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::query_service::monitor::wikidata: update streaming lag monitor

https://gerrit.wikimedia.org/r/731282

Change 731282 merged by Ryan Kemper:

[operations/puppet@production] profile::query_service::monitor::wikidata: update streaming lag monitor

https://gerrit.wikimedia.org/r/731282

Change 730820 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1007

https://gerrit.wikimedia.org/r/730820

Change 730821 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1012

https://gerrit.wikimedia.org/r/730821

Change 730822 merged by Ryan Kemper:

[operations/puppet@production] wdqs: enable the streaming updater on wdqs1013

https://gerrit.wikimedia.org/r/730822

Gehel claimed this task.