
Q1:rerack elastic10[53-67]
Closed, Resolved (Public)

Description

This task will track the re-racking of elastic10[53-67] to enable 10G networking (T317816)

We'll need a couple of days' heads-up to get these hosts offline before we're ready to physically re-rack them. For now, this ticket just provides the information we need to figure out where there is availability in 10G-capable racks.

Re-racking details

Each of these hosts needs to be relocated into a 10G rack. Ideally we want them distributed roughly evenly.

But first it'd be helpful to know what the general availability of 10G ports is in rows A-F.
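Once port availability is known, "roughly evenly" could be worked out along these lines. This is only a minimal Python sketch: the `free_ports` numbers are placeholders invented for illustration (not real netbox data), and `distribute` is a hypothetical helper, not an existing tool. It assigns each host to the row that has the fewest hosts so far and still has a free 10G port.

```python
from collections import Counter

def distribute(hosts, free_ports):
    """Assign each host to the least-loaded row that still has a free port."""
    remaining = dict(free_ports)   # free 10G ports left per row
    assigned = Counter()           # hosts placed per row so far
    plan = {}
    for host in hosts:
        candidates = [row for row, n in remaining.items() if n > 0]
        if not candidates:
            raise ValueError(f"no free 10G port left for {host}")
        # Fewest hosts first; ties broken alphabetically by row name.
        row = min(candidates, key=lambda r: (assigned[r], r))
        plan[host] = row
        assigned[row] += 1
        remaining[row] -= 1
    return plan

hosts = [f"elastic{n}" for n in range(1053, 1067)]
# Placeholder availability: hypothetical numbers, for illustration only.
free_ports = {"A": 4, "B": 4, "C": 2, "D": 1, "E": 6, "F": 6}
plan = distribute(hosts, free_ports)
per_row = Counter(plan.values())
for row in sorted(per_row):
    print(f"row {row}: {per_row[row]} host(s)")
```

With real availability numbers plugged in, the result is only a starting point; rack-level constraints (power, space, switch ports) would still need a human look.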

For context, here are the existing row allocations of these hosts from netbox:

Name         Status  Tenant  Site             Location     Rack  Role    Manufacturer  Type            IP address
elastic1053  Active  —       Equinix Ashburn  eqiad row A  A5    Server  Dell          PowerEdge R440  2620:0:861:101:10:64:0:114/64
elastic1054  Active  —       Equinix Ashburn  eqiad row A  A3    Server  Dell          PowerEdge R440  2620:0:861:101:10:64:0:115/64
elastic1055  Active  —       Equinix Ashburn  eqiad row B  B1    Server  Dell          PowerEdge R440  2620:0:861:102:10:64:16:131/64
elastic1056  Active  —       Equinix Ashburn  eqiad row B  B5    Server  Dell          PowerEdge R440  2620:0:861:102:10:64:16:132/64
elastic1057  Active  —       Equinix Ashburn  eqiad row C  C3    Server  Dell          PowerEdge R440  2620:0:861:103:10:64:32:93/64
elastic1058  Active  —       Equinix Ashburn  eqiad row C  C3    Server  Dell          PowerEdge R440  2620:0:861:103:10:64:32:94/64
elastic1059  Active  —       Equinix Ashburn  eqiad row C  C8    Server  Dell          PowerEdge R440  2620:0:861:103:10:64:32:95/64
elastic1060  Active  —       Equinix Ashburn  eqiad row D  D1    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:130/64
elastic1061  Active  —       Equinix Ashburn  eqiad row D  D1    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:131/64
elastic1062  Active  —       Equinix Ashburn  eqiad row D  D3    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:132/64
elastic1063  Active  —       Equinix Ashburn  eqiad row D  D3    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:133/64
elastic1064  Active  —       Equinix Ashburn  eqiad row D  D4    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:134/64
elastic1065  Active  —       Equinix Ashburn  eqiad row D  D3    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:135/64
elastic1066  Active  —       Equinix Ashburn  eqiad row D  D6    Server  Dell          PowerEdge R440  2620:0:861:107:10:64:48:136/64
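
To make the current skew easier to see, here is a small Python sketch that tallies the hosts per row. The rack values are copied from the netbox listing above; the `racks` dict and the assumption that the first letter of the rack name is the row are just for illustration.

```python
from collections import Counter

# Rack per host, copied from the netbox listing above.
racks = {
    "elastic1053": "A5", "elastic1054": "A3",
    "elastic1055": "B1", "elastic1056": "B5",
    "elastic1057": "C3", "elastic1058": "C3", "elastic1059": "C8",
    "elastic1060": "D1", "elastic1061": "D1", "elastic1062": "D3",
    "elastic1063": "D3", "elastic1064": "D4", "elastic1065": "D3",
    "elastic1066": "D6",
}

# Assumption: the row is the first character of the rack name (A5 -> row A).
per_row = Counter(rack[0] for rack in racks.values())
for row, count in sorted(per_row.items()):
    print(f"row {row}: {count} host(s)")
```

This shows 2 hosts each in rows A and B, 3 in row C, and 7 in row D, so half of the 14 hosts currently sit in a single row.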

Per host setup checklist

I took a swing at an abridged checklist, since AFAICT the template is about racking new hosts.

elastic1053.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1054.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1055.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1056.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1057.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1058.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1059.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1060.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1061.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1062.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1063.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1064.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1065.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems
elastic1066.eqiad.wmnet:
  • re-rack system per the proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • network port setup via netbox, run homer from an active cumin host to commit
  • BIOS/DRAC/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • operations/puppet update - this should include updates to netboot.pp and site.pp: role(insetup), or role(insetup::nofirm) for cp systems

Details

Other Assignee
Jclark-ctr

Event Timeline

RKemper renamed this task from Q1:rack/setup/install elastic10[53-67] to Q1:rerack elastic10[53-67]. Oct 31 2022, 9:29 PM
RKemper created this task.
RKemper updated the task description.
RKemper updated the task description.
RKemper added a subscriber: Jclark-ctr.

Mentioned in SAL (#wikimedia-operations) [2023-03-03T14:09:23Z] <inflatador> bking@cumin2002 banning elastic1053-59 from the cluster in preparation for T322082

Icinga downtime and Alertmanager silence (ID=86f268fc-ff2d-4948-aa3f-f9d831ed4c29) set by bking@cumin2002 for 1 day, 0:00:00 on 14 host(s) and their services with reason: rerack

elastic[1053-1066].eqiad.wmnet
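
For anyone double-checking which hosts a range like the one above covers, here is a minimal Python sketch that expands it into individual hostnames. `expand` is a hypothetical helper written for this note; it handles only a single bracketed numeric range and is not a replacement for the host-selection syntax the real tooling supports.

```python
import re

def expand(pattern):
    """Expand one 'prefix[start-end]suffix' numeric range into hostnames."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        # No bracketed range: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = m.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]

hosts = expand("elastic[1053-1066].eqiad.wmnet")
print(len(hosts), hosts[0], hosts[-1])
```

For this range the list has 14 entries, from elastic1053.eqiad.wmnet through elastic1066.eqiad.wmnet, which matches the "14 host(s)" in the downtime message.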

Mentioned in SAL (#wikimedia-operations) [2023-03-03T16:09:14Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1053 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T16:10:20Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1053 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T17:01:30Z] <inflatador> bking@cumin2002 ban elastic1059-1066 T322082

Mentioned in SAL (#wikimedia-operations) [2023-03-03T17:35:29Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1054 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T17:37:12Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1054 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T18:42:25Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1056 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T18:43:31Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1056 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:32:32Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:36:04Z] <bking@cumin2002> END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:36:28Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:39:10Z] <bking@cumin2002> END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:49:57Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:51:45Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082"
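
The START/END pairs in these SAL entries can be summarized mechanically. Below is a minimal Python sketch, assuming the line format seen in this timeline; `SAL_RE` and `summarize` are names made up for illustration, and the sample lines are the three sync-netbox-hiera runs logged just above. It pairs each END with the preceding START and reports status, exit code, and duration in seconds, without interpreting what the exit codes mean.

```python
import re
from datetime import datetime

# Matches the timestamp, START/END, optional (STATUS), cookbook name,
# and optional (exit_code=N) as they appear in the entries above.
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <[^>]+> (?P<kind>START|END)"
    r"(?: \((?P<status>\w+)\))? - Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=(?P<code>\d+)\))?"
)

def summarize(lines):
    """Pair each END with the preceding START: (status, exit_code, seconds)."""
    results, started = [], None
    for line in lines:
        m = SAL_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        if m["kind"] == "START":
            started = ts
        elif started is not None:
            code = int(m["code"]) if m["code"] else None
            seconds = int((ts - started).total_seconds())
            results.append((m["status"], code, seconds))
            started = None
    return results

SAMPLE = [
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:32:32Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"',
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:36:04Z] <bking@cumin2002> END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"',
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:36:28Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"',
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:39:10Z] <bking@cumin2002> END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Update location of elastic1055 - bking@cumin2002 - T322082"',
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:49:57Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082"',
    'Mentioned in SAL (#wikimedia-operations) [2023-03-03T19:51:45Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic hosts - bking@cumin2002 - T322082"',
]

for status, code, seconds in summarize(SAMPLE):
    print(f"{status:5} exit_code={code} duration={seconds}s")
```

For the three runs above this gives ERROR (97) after 212 s, FAIL (99) after 162 s, and PASS (0) after 108 s.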

Mentioned in SAL (#wikimedia-operations) [2023-03-03T20:23:27Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1058 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T20:25:05Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1058 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T20:52:56Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update location of elastic1059 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T20:55:21Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update location of elastic1059 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-03T20:58:15Z] <inflatador> bking@cumin2002 persistently unban all elastic nodes in eqiad T322082

Update: elastic1053-59 have been re-racked. The remaining hosts (elastic1060-66, all in row D) should be finished by Wednesday. See the Etherpad page for more details on progress.

Mentioned in SAL (#wikimedia-operations) [2023-03-07T21:41:22Z] <inflatador> bking@cumin2002 ban elastic row D hosts to prepare for T322082

Icinga downtime and Alertmanager silence (ID=2faab0f0-8bed-4101-9f19-d26f3c99b3d7) set by bking@cumin2002 for 1 day, 0:00:00 on 7 host(s) and their services with reason: re-rack

elastic[1060-1066].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-07T21:58:40Z] <inflatador> bking@cumin2002 depool elastic row D hosts to prepare for T322082

Mentioned in SAL (#wikimedia-operations) [2023-03-08T14:25:32Z] <inflatador> bking@cumin2002 powering down elastic1060-66 for re-rack T322082

Icinga downtime and Alertmanager silence (ID=5c6bb325-a116-457c-9a58-2cbd8dfcfd42) set by bking@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: re-rack

elastic1061.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=3fca5960-ee3f-4527-8324-4e3bbd02c3f7) set by bking@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: re-rack

elastic1062.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:23:58Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1061 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:25:14Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1061 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:28:02Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1060 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:29:08Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1060 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:34:01Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1062 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T16:34:46Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1062 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T17:59:11Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update location of elastic1066 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:05:38Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update location of elastic1066 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:05:57Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1064 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:12:02Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1064 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:13:35Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1065 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:13:40Z] <bking@cumin2002> END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "update locatoin of elastic1065 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:18:06Z] <bking@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "update locatoin of elastic1064-65 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:19:09Z] <bking@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "update locatoin of elastic1064-65 - bking@cumin2002 - T322082"

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:27:22Z] <inflatador> bking@cumin2002 unban elastic1060-1066 to finish off T322082

Mentioned in SAL (#wikimedia-operations) [2023-03-08T18:28:17Z] <inflatador> bking@cumin2002 repool elastic1060-1066 to finish off T322082

This work is complete! Thanks @Jclark-ctr and everyone else who helped. Moving to "needs reporting" on the discovery-search board...

Jclark-ctr updated the task description.