
Upgrade eqiad rack D4 to 10G switch
Closed, Resolved (Public)

Description

This task will track/coordinate the work between ops-eqiad and netops to upgrade the new row D switch stack to have 3 10G switches, rather than just 2. Row D should be made to match the other rows in the setup of (5) 1G racks and (3) 10G racks.

Scheduled for Sept. 28th, 1pm UTC, 2h

This means a hard downtime of ~1h for all hosts in D4, see the full list on https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=38&status=active&role=server
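The host list link above is a filtered Netbox device view. As a minimal sketch, the same link can be rebuilt from its filter values. The base URL and parameters are copied from the link itself; the function name is hypothetical, and the mapping of `rack_id=38` to rack D4 is only inferred from this task.

```python
from urllib.parse import urlencode

# Base URL taken from the link in the task description.
NETBOX_DEVICES_URL = "https://netbox.wikimedia.org/dcim/devices/"


def rack_server_list_url(rack_id: int) -> str:
    """Build the Netbox device-list URL for active servers in one rack.

    Hypothetical helper: the filter names (q, rack_id, status, role) are
    the ones that appear in the link above, nothing more.
    """
    params = {"q": "", "rack_id": rack_id, "status": "active", "role": "server"}
    return f"{NETBOX_DEVICES_URL}?{urlencode(params)}"


# rack_id=38 is assumed to be eqiad D4, as in the link above.
d4_url = rack_server_list_url(38)
print(d4_url)
```

Only the URL is constructed here; no request is sent to Netbox.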

  • [DCops] Connect switch to console
  • [DCops] Pre-populate SFP-Ts
  • [Netops] Check new member's config and OS version (and turn on/off VC ports)

During the maintenance window, preferably during a DC failover, as row recabling can bring temporary instability:

  • [Service owners] Depool services
  • [Netops] Power off existing member
  • [DCops] Unplug existing member
  • [DCops] Rack and power on new member in final location
  • [Netops] Update switch stack config with new serial number
  • [DCops] Connect VC-cables
  • [DCops] Connect access ports
  • [Netops] Verify everything is online
  • [Service owners] repool services
  • [DCops] Update Netbox
  • [DCops] Wipe/decom old switch
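The maintenance-window checklist above alternates between teams, so the handoffs matter as much as the steps. The following is a rough sketch, not tooling used for this task: it models the list as data, groups the steps by owning team, and counts how often responsibility changes hands. The `Step` type and helper names are hypothetical; the step text is copied from the checklist.

```python
from collections import defaultdict
from typing import NamedTuple


class Step(NamedTuple):
    owner: str
    action: str


# Steps in the order given in the task description.
MAINTENANCE_STEPS = [
    Step("Service owners", "Depool services"),
    Step("Netops", "Power off existing member"),
    Step("DCops", "Unplug existing member"),
    Step("DCops", "Rack and power on new member in final location"),
    Step("Netops", "Update switch stack config with new serial number"),
    Step("DCops", "Connect VC-cables"),
    Step("DCops", "Connect access ports"),
    Step("Netops", "Verify everything is online"),
    Step("Service owners", "repool services"),
    Step("DCops", "Update Netbox"),
    Step("DCops", "Wipe/decom old switch"),
]


def steps_by_owner(steps):
    """Map each team to its (step number, action) pairs, numbered from 1."""
    grouped = defaultdict(list)
    for number, step in enumerate(steps, start=1):
        grouped[step.owner].append((number, step.action))
    return dict(grouped)


def count_handoffs(steps):
    """Count how often the owning team changes between consecutive steps."""
    return sum(1 for prev, cur in zip(steps, steps[1:]) if prev.owner != cur.owner)


for owner, items in steps_by_owner(MAINTENANCE_STEPS).items():
    print(owner, [number for number, _ in items])
print("handoffs:", count_handoffs(MAINTENANCE_STEPS))
```

Each handoff is a point where one team waits on another, which is one reason to coordinate the window over a shared channel.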

Event Timeline

RobH triaged this task as Medium priority. Jun 5 2018, 5:36 PM
RobH created this task.
RobH mentioned this in Unknown Object (Task).
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jun 5 2018, 6:15 PM
Cmjohnson updated the task description. Jun 7 2018, 5:55 PM

I racked the switch in D4, updated racktables

Vvjjkkii renamed this task from upgrade row d to have 3 10G switches to clbaaaaaaa. Jul 1 2018, 1:06 AM
Vvjjkkii removed ayounsi as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from clbaaaaaaa to upgrade row d to have 3 10G switches. Jul 2 2018, 6:54 AM
CommunityTechBot assigned this task to ayounsi.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

@RobH @ayounsi Let's get the procurement items we need to move this task along please.

RobH removed a subscriber: RobH. Mar 3 2020, 6:23 PM
ayounsi updated the task description. May 11 2020, 2:05 PM
ayounsi mentioned this in Unknown Object (Task). Aug 26 2020, 12:14 PM
ayounsi renamed this task from upgrade row d to have 3 10G switches to Upgrade eqiad rack D4 to 10G switch. Aug 26 2020, 12:22 PM
ayounsi updated the task description.

Any expected downtime for row D hosts?

ayounsi updated the task description. Edited Aug 26 2020, 12:26 PM

@Marostegui I'm going to send an email, but partially yes, this means a hard downtime of ~1h for all hosts in D4, see the full list on https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=38&status=active&role=server

Kormat added a subscriber: Kormat.

I'll be the contact person for the data-persistence team for this.

Change 623177 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] move dumps around on the snapshots in prep network upgrade work

https://gerrit.wikimedia.org/r/623177

Change 623177 merged by ArielGlenn:
[operations/puppet@production] move dumps around on the snapshots in prep for network upgrade work

https://gerrit.wikimedia.org/r/623177

RobH added a subtask: Unknown Object (Task). Aug 31 2020, 5:14 PM
ayounsi updated the task description. Sep 8 2020, 10:03 AM

Postponed to Sept. 17th, 1pm Eastern, 17:00 UTC

Everything ok from the DB point of view. All the DB hosts in D4 can have a hard downtime, nothing will be impacted from our side.

Change 626111 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] pybal: Move from conf1006 to conf1005 as config_host in esams

https://gerrit.wikimedia.org/r/626111

Change 626113 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/dns@master] Temporarily remove conf1006 from client SRV records

https://gerrit.wikimedia.org/r/626113

ayounsi updated the task description. Sep 15 2020, 3:50 PM
ayounsi reassigned this task from ayounsi to Cmjohnson. Thu, Sep 24, 3:04 PM
ayounsi updated the task description.

Change 626111 merged by JMeybohm:
[operations/puppet@production] pybal: Move from conf1006 to conf1005 as config_host in esams

https://gerrit.wikimedia.org/r/626111

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:02:29Z] <jayme> restarting pybal on lvs3007 for switching to conf1005 - T196487

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:06:09Z] <jayme> restarting pybal on lvs3006 for switching to conf1005 - T196487

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:07:02Z] <jayme> restarting pybal on lvs3005 for switching to conf1005 - T196487

Change 626113 merged by JMeybohm:
[operations/dns@master] Temporarily remove conf1006 from client SRV records

https://gerrit.wikimedia.org/r/626113

ayounsi added a comment. Edited Mon, Sep 28, 8:50 AM

@Cmjohnson the console port is still not responding; could you please have a look before today's maintenance? We still need to configure the switch (and maybe upgrade it).

I also updated Netbox with the console port info you provided on IRC: "connected new switch to current d4 console and mgmt"

Mentioned in SAL (#wikimedia-operations) [2020-09-28T11:59:05Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1114 depooling: prep for rack switch upgrade T196487', diff saved to https://phabricator.wikimedia.org/P12815 and previous config saved to /var/cache/conftool/dbconfig/20200928-115904-kormat.json

@ayounsi I am not able to get the console to work on the new switch. It's plugged in, and I verified the console connection works by connecting it to the current asw in d4 and getting the prompt. I am not sure what I can do from there just yet. I also tried swapping the switch with a new spare, with the same result.

Mentioned in SAL (#wikimedia-operations) [2020-09-28T13:45:09Z] <XioNoX> downtiming all eqiad row D hosts - T196487

ayounsi updated the task description. Mon, Sep 28, 2:23 PM

Mentioned in SAL (#wikimedia-operations) [2020-09-28T15:26:36Z] <kormat@cumin1001> dbctl commit (dc=all): 'Repool db1114 T196487', diff saved to https://phabricator.wikimedia.org/P12818 and previous config saved to /var/cache/conftool/dbconfig/20200928-152635-kormat.json
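The two dbctl entries in the SAL give the depool and repool times for db1114. As a small worked example, the time the host spent depooled can be computed from those two timestamps. The timestamps are copied from the log lines; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_sal_timestamp(stamp: str) -> datetime:
    """Parse a SAL timestamp such as 2020-09-28T11:59:05Z as UTC."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps from the dbctl depool and repool entries for db1114.
depooled_at = parse_sal_timestamp("2020-09-28T11:59:05Z")
repooled_at = parse_sal_timestamp("2020-09-28T15:26:36Z")

depool_window = repooled_at - depooled_at
print(depool_window)  # 3:27:31
```

This measures the time between the two dbctl commits only, which is longer than the ~1h hard downtime planned for the rack itself.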

Mentioned in SAL (#wikimedia-operations) [2020-09-29T08:21:47Z] <jayme> switching esams pybal back to conf1006 - T196487

Cmjohnson closed this task as Resolved. Thu, Oct 1, 7:01 PM

This has been completed

ayounsi reopened this task as Open. Thu, Oct 1, 8:18 PM

From the task description:

[DCops] Update Netbox

At least the status and name are incorrect (should be asw2-d4 for consistency)

[DCops] Wipe/decom old switch

Related to Arzhel's previous comment, getting these Netbox errors:

test_missing_assets_from_accounting
asw3-d4-eqiad Device with s/n TA3716160376 (WMF5429) not present in Accounting

test_offline_rack
asw2-d4-eqiad rack defined for status Offline device: eqiad-D4

test_connected_unracked
asw2-a5-eqiad connected console ports attached to unracked device asw2-a5-eqiad: console0

Cmjohnson closed this task as Resolved. Wed, Oct 7, 3:23 PM
Cmjohnson updated the task description.

Ran script for the old asw2-d4 and changed its name to old-asw2-d4. Changed the name in Netbox from asw3-d4 to asw2-d4.