
Service implementation for elastic10[68-83].eqiad.wmnet
Closed, Resolved · Public · 8 Estimated Story Points

Description

See T279158 for dc-ops procurement and T281989 for dc-ops racking. This ticket tracks the search team's part: taking the fresh nodes and bringing them properly into service.

Step 1: Set up hieradata

  • allocate the new hosts between psi/omega, keeping rows as balanced as possible

Step 2: Enable cirrus roles

  • after completion of this step, the new hosts should have joined the cirrus elasticsearch clusters (see the verification sketch below)
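A quick way to verify is to ask each cluster's HTTP API which nodes have joined (a sketch; the 9200/9400/9600 port convention for chi/omega/psi matches the curl examples further down this task):

# Run from any elastic host; a given node should appear in chi plus exactly one of omega/psi
curl -s 'localhost:9200/_cat/nodes?h=name' | grep elastic1068   # chi (main)
curl -s 'localhost:9400/_cat/nodes?h=name' | grep elastic1068   # omega
curl -s 'localhost:9600/_cat/nodes?h=name' | grep elastic1068   # psi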

Step 3: Prepare to decom old hosts

  • set new master configuration - https://phabricator.wikimedia.org/T294805#7473840
  • set new cluster replication seeds to the new masters (note: we didn't realize this was needed until later, so we actually did it manually as part of step 4)
  • manually ban the old hosts from each cluster (example command after this list)
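The ban itself is a persistent shard-allocation exclusion. A minimal sketch against the main (chi) cluster follows; the hostnames are illustrative, the same call on ports 9400/9600 applies to omega/psi, and which exclude attribute to use (_host, _ip or _name) depends on how the nodes are registered:

curl -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.exclude._host": "elastic1032.eqiad.wmnet,elastic1033.eqiad.wmnet"
  }
}'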

Step 4: Actually decom hosts

  • remove the cirrus role and run the decom cookbooks (example invocation below); then open decom tickets for dc-ops
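A sketch of the decom cookbook invocation from a cumin host, matching the host set shown in the cookbook output further down (the task-id flag is an assumption):

# run from cumin1001; -t attaches the Phabricator task (flag assumed)
sudo cookbook sre.hosts.decommission 'elastic[1032-1038,1040-1042,1044-1047].eqiad.wmnet' -t T294805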

Event Timeline

RKemper triaged this task as Medium priority. Nov 2 2021, 6:18 AM
RKemper updated the task description.
[Hosts that will be decom'd]
elastic1032     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A3  2620:0:861:101:10:64:0:233/64
elastic1033     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A3  2620:0:861:101:10:64:0:234/64
elastic1034     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A3  2620:0:861:101:10:64:0:235/64
elastic1035     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A3  2620:0:861:101:10:64:0:236/64
elastic1036     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B3  2620:0:861:102:10:64:16:45/64
elastic1037     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B3  2620:0:861:102:10:64:16:46/64
elastic1038     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B3  2620:0:861:102:10:64:16:47/64
elastic1039     Failed  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B3  2620:0:861:102:10:64:16:48/64
elastic1040     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     C5  2620:0:861:103:10:64:32:108/64
elastic1041     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     C5  2620:0:861:103:10:64:32:109/64
elastic1042     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     C5  2620:0:861:103:10:64:32:110/64
elastic1043     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     C5  2620:0:861:103:10:64:32:111/64
elastic1044     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A6  2620:0:861:101:10:64:0:85/64
elastic1045     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     A6  2620:0:861:101:10:64:0:86/64
elastic1046     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B6  2620:0:861:102:10:64:16:70/64
elastic1047     Active  —   Server  HP ProLiant DL360 Gen9  Equinix Ashburn     B6  2620:0:861:102:10:64:16:71/64
[New hosts w/ psi vs omega assignment]
elastic1068     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A4  2620:0:861:101:10:64:0:72/64   omega
elastic1069     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A4  2620:0:861:101:10:64:0:73/64   psi
elastic1070     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:74/64   omega
elastic1071     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:76/64   omega
elastic1072     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:77/64   psi
elastic1073     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:78/64   psi
elastic1074     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B2  2620:0:861:102:10:64:16:42/64  omega
elastic1075     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B2  2620:0:861:102:10:64:16:49/64  psi
elastic1076     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:50/64  omega
elastic1077     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:51/64  omega
elastic1078     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:52/64  psi
elastic1079     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:53/64  psi
elastic1080     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C4  2620:0:861:103:10:64:32:29/64  omega
elastic1081     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C4  2620:0:861:103:10:64:32:166/64 psi
elastic1082     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C7  2620:0:861:103:10:64:32:167/64 omega
elastic1083     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C7  2620:0:861:103:10:64:32:168/64 psi
[New hosts w/ psi vs omega assignment, separated into rows for visual convenience]
elastic1068     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A4  2620:0:861:101:10:64:0:72/64   omega
elastic1070     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:74/64   omega
elastic1071     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:76/64   omega

elastic1074     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B2  2620:0:861:102:10:64:16:42/64  omega
elastic1076     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:50/64  omega
elastic1077     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:51/64  omega

elastic1080     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C4  2620:0:861:103:10:64:32:29/64  omega
elastic1082     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C7  2620:0:861:103:10:64:32:167/64 omega


elastic1069     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A4  2620:0:861:101:10:64:0:73/64   psi
elastic1072     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:77/64   psi
elastic1073     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     A7  2620:0:861:101:10:64:0:78/64   psi

elastic1075     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B2  2620:0:861:102:10:64:16:49/64  psi
elastic1078     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:52/64  psi
elastic1079     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     B4  2620:0:861:102:10:64:16:53/64  psi

elastic1081     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C4  2620:0:861:103:10:64:32:166/64 psi
elastic1083     Staged  —   Server  Dell PowerEdge R440     Equinix Ashburn     C7  2620:0:861:103:10:64:32:168/64 psi
[new conftool-data entries corresponding to the above]
elastic1068.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1069.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1070.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1071.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1072.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1073.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1074.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1075.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1076.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1077.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1078.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1079.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1080.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1081.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]
elastic1082.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-omega-ssl]
elastic1083.eqiad.wmnet: [elasticsearch, elasticsearch-ssl, elasticsearch-psi-ssl]

Step 3

(Old master configuration)
(main cluster)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1036.eqiad.wmnet (B3)
      - elastic1040.eqiad.wmnet (C5)
      - elastic1054.eqiad.wmnet

(omega)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1034.eqiad.wmnet (A3)
      - elastic1038.eqiad.wmnet (B3)
      - elastic1040.eqiad.wmnet (C5)

(psi)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1048.eqiad.wmnet
      - elastic1050.eqiad.wmnet
      - elastic1052.eqiad.wmnet

->

(New master configuration)
(main cluster)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1074.eqiad.wmnet (B2)
      - elastic1081.eqiad.wmnet (C4)
      - elastic1054.eqiad.wmnet

(omega)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1068.eqiad.wmnet (A4)
      - elastic1076.eqiad.wmnet (B2)
      - elastic1080.eqiad.wmnet (C4)

(psi)
    unicast_hosts: # this is also the list of master eligible nodes
      - elastic1048.eqiad.wmnet
      - elastic1050.eqiad.wmnet
      - elastic1052.eqiad.wmnet
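Once the new config has rolled out and the affected nodes are restarted, the elected master per cluster can be double-checked with the _cat APIs (a sketch; ports follow the chi/omega/psi convention of 9200/9400/9600):

curl -s 'localhost:9200/_cat/master?v'   # chi (main)
curl -s 'localhost:9400/_cat/master?v'   # omega
curl -s 'localhost:9600/_cat/master?v'   # psi
# master-eligible nodes show 'm' in node.role; the elected master is marked '*'
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master'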

Change 736116 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: hiera for new eqiad nodes (step 1)

https://gerrit.wikimedia.org/r/736116

Change 736117 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/736117

Change 736118 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: new master config (step 3)

https://gerrit.wikimedia.org/r/736118

Change 736119 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: decom elastic10[32-47] (step 4)

https://gerrit.wikimedia.org/r/736119

MPhamWMF set the point value for this task to 8. Nov 8 2021, 4:24 PM

Mentioned in SAL (#wikimedia-operations) [2022-01-12T19:14:40Z] <mutante> elastic10180 - one power supply seemingly failed - see icinga IPMI alert - [Status = Critical, PS Redundancy = Critical] T294805

Change 736116 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: hiera for new eqiad nodes (step 1)

https://gerrit.wikimedia.org/r/736116

Mentioned in SAL (#wikimedia-operations) [2022-01-25T23:20:23Z] <ryankemper> T294805 [Elastic] Merged https://gerrit.wikimedia.org/r/736116, step 1 of bringing new eqiad 10G refresh hosts into service

Change 736117 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/736117

Change 757003 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/757003

Change 757003 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/757003

Mentioned in SAL (#wikimedia-operations) [2022-01-26T00:03:36Z] <ryankemper> T294805 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/757003; running puppet on elastic1068 to make it join the fleet

Change 757005 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "elasticsearch: activate role (step 2)"

https://gerrit.wikimedia.org/r/757005

Change 757005 merged by Ryan Kemper:

[operations/puppet@production] Revert "elasticsearch: activate role (step 2)"

https://gerrit.wikimedia.org/r/757005

Mentioned in SAL (#wikimedia-operations) [2022-01-26T00:11:43Z] <ryankemper> T294805 Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/757003 (elasticsearch-oss dependency issues, will pick this back up tomorrow); re-enabling puppet across elastic1*

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1068.eqiad.wmnet with OS stretch completed:

  • elastic1068 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201280434_ryankemper_14645_elastic1068.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 759317 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/759317

Change 759317 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: activate role (step 2)

https://gerrit.wikimedia.org/r/759317

Mentioned in SAL (#wikimedia-operations) [2022-02-03T20:16:17Z] <ryankemper> T294805 Disabled puppet on elastic1* in preparation for bringing new hosts into service: ryankemper@cumin1001:~$ sudo cumin 'elastic1*' 'sudo disable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805"'

Mentioned in SAL (#wikimedia-operations) [2022-02-03T20:22:56Z] <ryankemper> T294805 Running puppet on single elastic host: ryankemper@elastic1068:~$ sudo run-puppet-agent --force

Mentioned in SAL (#wikimedia-operations) [2022-02-03T20:26:46Z] <ryankemper> T294805 Running puppet on elastic1068 failed, looks like /usr/share/elasticsearch/lib wasn't there: https://phabricator.wikimedia.org/P20138

Change 759588 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: make wmf-es-search-plugins req es package

https://gerrit.wikimedia.org/r/759588

Change 759588 merged by Ryan Kemper:

[operations/puppet@production] elastic: make wmf-es-search-plugins req es package

https://gerrit.wikimedia.org/r/759588

Mentioned in SAL (#wikimedia-operations) [2022-02-03T21:21:30Z] <ryankemper> T294805 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/759588; hoping this resolves dependency issues. Running puppet agent on elastic1068

Change 759617 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: es pkg needs 3rd party comp

https://gerrit.wikimedia.org/r/759617

Change 759617 merged by Ryan Kemper:

[operations/puppet@production] elastic: es pkg needs 3rd party comp

https://gerrit.wikimedia.org/r/759617

Mentioned in SAL (#wikimedia-operations) [2022-02-03T22:13:38Z] <ryankemper> T294805 https://gerrit.wikimedia.org/r/c/operations/puppet/+/759617/ fixed the dependency issues, going to start bringing new hosts into service

Mentioned in SAL (#wikimedia-operations) [2022-02-03T22:18:16Z] <ryankemper> T294805 Bringing in new eqiad hosts in batches of 4, with 15-20 mins between batches: ryankemper@cumin1001:~$ sudo -E cumin -b 4 'elastic1*' 'sudo run-puppet-agent --force; sudo run-puppet-agent; sleep 900' tmux session es_eqiad

Mentioned in SAL (#wikimedia-operations) [2022-02-03T23:15:36Z] <ryankemper> T294805 Added a silence on alerts.wikimedia.org for CirrusSearchJVMGCOldPoolFlatlined

Change 759637 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic2035

https://gerrit.wikimedia.org/r/759637

Change 736118 merged by Bking:

[operations/puppet@production] elasticsearch: new master config (step 3)

https://gerrit.wikimedia.org/r/736118

Mentioned in SAL (#wikimedia-operations) [2022-02-07T22:48:15Z] <ryankemper> T294805 Disabled puppet across all of elastic1* in preparation for bringing new master hosts in

Mentioned in SAL (#wikimedia-operations) [2022-02-07T22:57:20Z] <ryankemper> T294805 Running puppet agent on new master elastic1074.eqiad.wmnet: sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent

Mentioned in SAL (#wikimedia-operations) [2022-02-07T22:59:52Z] <ryankemper> T294805 sudo systemctl restart elasticsearch_6@production-search-eqiad.service elasticsearch_6@production-search-omega-eqiad.service on elastic1074

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:04:36Z] <ryankemper> T294805 Bringing in new master elastic1081: sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:04:51Z] <ryankemper> T294805 Bringing in new master elastic1081: sudo systemctl restart elasticsearch_6@production-search-eqiad.service elasticsearch_6@production-search-psi-eqiad.service

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:06:20Z] <ryankemper> T294805 Running puppet and restarting elasticsearch services on elastic1040 to make it no longer a master

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:09:54Z] <ryankemper> T294805 Kicking out the final master elastic1036 (which is also the currently elected leader); after this we'll be back to 3 masters as intended

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:27:43Z] <ryankemper> T294805 Main search cluster all done, proceeding to omega cluster

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:27:57Z] <ryankemper> T294805 Bringing in new master elastic1068

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:31:15Z] <ryankemper> T294805 Bringing in new omega master elastic1076

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:35:34Z] <ryankemper> T294805 Bringing in new omega master elastic1057

Mentioned in SAL (#wikimedia-operations) [2022-02-07T23:39:22Z] <ryankemper> T294805 Removed old masters elastic1034 and elastic1038 (and elastic1040 was removed earlier)

Change 760684 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: new masters for psi cluster

https://gerrit.wikimedia.org/r/760684

Change 760684 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: new masters for psi cluster

https://gerrit.wikimedia.org/r/760684

Mentioned in SAL (#wikimedia-operations) [2022-02-08T00:05:40Z] <ryankemper> T294805 new psi masters elastic1073, elastic1075, and elastic1083 are in

Mentioned in SAL (#wikimedia-operations) [2022-02-08T00:12:25Z] <ryankemper> T294805 old psi masters are out, done with all elastic master operations

Mentioned in SAL (#wikimedia-operations) [2022-02-08T00:12:29Z] <ryankemper> T294805 Re-enabling puppet across eqiad elastic fleet: ryankemper@cumin1001:~$ sudo cumin -b 8 'elastic1*' 'sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent' tmux session elastic

Mentioned in SAL (#wikimedia-operations) [2022-02-08T20:33:34Z] <ryankemper> T294805 Banned elastic10[32-47] from main, omega, and psi elasticsearch clusters. Shards are relocating on main and omega clusters as expected, but they don't seem to be moving on psi. Investigating that currently. Might have to do with row allocation constraints, but unsure currently

Mentioned in SAL (#wikimedia-operations) [2022-02-08T21:59:51Z] <ryankemper> T294805 elastic10[68-83] erroneously weren't in pybal, added them just now: sudo confctl select 'cluster=elasticsearch' set/pooled=yes:weight=10 (there's no hosts in the conftool-data list that we want depooled so we're okay setting all to pooled w/ equal weight)

Change 736119 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: decom elastic10[32-47] (step 4)

https://gerrit.wikimedia.org/r/736119

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: elastic[1032-1038,1040-1042,1044-1047].eqiad.wmnet

  • elastic1032.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1033.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1034.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1035.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1036.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1037.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1038.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1040.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1041.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1042.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1044.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1045.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1046.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Failed to power off, manual intervention required: Remote IPMI for elastic1046.mgmt.eqiad.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1047.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

We realized the cross-cluster replication seeds don't auto-update. They were still set to the old masters, so I set them to the new ones via the method shown below.

Setting remote seeds for cross-cluster replication

Commands were run from mwmaint.

Here's the directory state before running the commands:

ryankemper@mwmaint1002:~/elastic$ ls
chi_eqiad_masters.lst  omega_eqiad_masters.lst  psi_eqiad_masters.lst  push_cross_cluster_conf.py

Here's the example content of chi_eqiad_masters.lst (this should match the masters listed in cirrus.yaml):

elastic1068.eqiad.wmnet:9500
elastic1076.eqiad.wmnet:9500
elastic1057.eqiad.wmnet:9500

And here are the commands:

python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9243/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9443/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9643/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst
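(For reference, the script is effectively pushing persistent cluster.remote.<alias>.seeds settings. A hand-rolled equivalent for the chi endpoint, using the seed lists from the new state below and assuming TLS trust is already configured on the host, would be:)

curl -H 'Content-Type: application/json' -XPUT 'https://search.svc.eqiad.wmnet:9243/_cluster/settings' -d '{
  "persistent": {
    "cluster.remote.omega.seeds": ["elastic1068.eqiad.wmnet:9500", "elastic1076.eqiad.wmnet:9500", "elastic1057.eqiad.wmnet:9500"],
    "cluster.remote.psi.seeds": ["elastic1073.eqiad.wmnet:9700", "elastic1075.eqiad.wmnet:9700", "elastic1083.eqiad.wmnet:9700"]
  }
}'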
New state:
Chi (main)
ryankemper@elastic1073:~$ curl -s localhost:9200/_cluster/settings | jq .persistent.cluster.remote
{
  "omega": {
    "seeds": [
      "elastic1068.eqiad.wmnet:9500",
      "elastic1076.eqiad.wmnet:9500",
      "elastic1057.eqiad.wmnet:9500"
    ]
  },
  "psi": {
    "seeds": [
      "elastic1073.eqiad.wmnet:9700",
      "elastic1075.eqiad.wmnet:9700",
      "elastic1083.eqiad.wmnet:9700"
    ]
  }
}
Omega
ryankemper@elastic1053:~$ curl -s localhost:9400/_cluster/settings | jq .persistent.cluster.remote
{
  "chi": {
    "seeds": [
      "elastic1054.eqiad.wmnet:9300",
      "elastic1074.eqiad.wmnet:9300",
      "elastic1081.eqiad.wmnet:9300"
    ]
  },
  "omega": {
    "seeds": [
      "elastic1068.eqiad.wmnet:9500",
      "elastic1076.eqiad.wmnet:9500",
      "elastic1057.eqiad.wmnet:9500"
    ]
  }
}
Psi
ryankemper@elastic1073:~$ curl -s localhost:9600/_cluster/settings | jq .persistent.cluster.remote
{
  "chi": {
    "seeds": [
      "elastic1054.eqiad.wmnet:9300",
      "elastic1074.eqiad.wmnet:9300",
      "elastic1081.eqiad.wmnet:9300"
    ]
  },
  "psi": {
    "seeds": [
      "elastic1073.eqiad.wmnet:9700",
      "elastic1075.eqiad.wmnet:9700",
      "elastic1083.eqiad.wmnet:9700"
    ]
  }
}

(Side note: you can always zero out these settings like so, from an elastic host: curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"persistent":{"cluster.remote.*":null}}')

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: elastic1046.eqiad.wmnet

Change 765575 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: officially decom 10[32-47]

https://gerrit.wikimedia.org/r/765575

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: elastic[1039,1043].eqiad.wmnet

  • elastic1039.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1043.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run Homer on asw2-b-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'asw2-b-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - ryankemper@cumin1001 - T294805']' returned non-zero exit status 1.

ERROR: some step on some host failed, check the bolded items above

Change 765575 merged by Ryan Kemper:

[operations/puppet@production] elastic: officially decom 10[32-47]

https://gerrit.wikimedia.org/r/765575

Mentioned in SAL (#wikimedia-operations) [2022-06-02T19:53:04Z] <ryankemper> T294805 Marked elastic10[68-83] as Active in netbox (all except elastic10[77,80] were erroneously marked as Staged)