eqiad/codfw virtual-chassis upgrades
Closed, Resolved, Public

Description

Even though services are supposed to be designed to handle a row failure, those rows have become close to "too big to fail".
A planned row downtime is extremely costly in terms of engineering time (preparation, communication, sync-ups, depools, monitoring, repools), even more so in eqiad (primary DC for many services, plus the data-engineering presence). See for example T172459: eqiad row D switch upgrade.
The same goes for outages, as we've seen with the codfw row B issues.

As a reminder, eqiad (except rows E and F) as well as codfw use Juniper's virtual-chassis technology, which creates one large logical switch out of all the top-of-rack switches of a given row. This was convenient back before we had automation and before routing between top-of-rack switches was common. The downside is that all switches in a row share a failure domain and use proprietary protocols.
Even though they have been quite stable when left alone, a small change or failure (T327001: asw-b2-codfw down) can have dramatic consequences.

We've also seen bugs creeping up, such as the one documented in T320566: Cr1-eqiad comms problem when moving to 40G row D handoff.

All those downsides are being fixed with the "new network design" currently live in eqiad rows E/F, which will be rolled out to codfw rows A and B before the end of 2023 (we have received part of the hardware). Other rows will follow progressively (exact timeline depending on hardware availability and budget).

Because of this apparent stability, combined with the impact of an upgrade and the lack of significant reasons to upgrade (features, security, etc.), those switches are now running a quite old and now unsupported software version.

However, upgrading is becoming more urgent now, because of the issues listed above, new features needed for network automation, security, and the fact that the switch replacements won't happen soon enough.

So better bite the bullet and do it. It will also help service owners make sure their servers are row redundant.
We should also look at whether we can leverage the recent automation work to help with the depools, repools, communication, etc.

Schedule-wise, it makes sense to do codfw first, then eqiad, ideally during a DC switchover if the timelines are compatible.

Tentative schedule (every 2 weeks on Thursday):

|Row|Date|Task|Status|
|---|---|---|---|
|codfw A|Feb 7th - 14:00-16:00 UTC|T327925| |
|codfw B|Feb 21st - 14:00-16:00 UTC|T327991| |
|eqiad A|March 7th - 14:00-16:00 UTC*|T329073| |
|eqiad B|March 28th - 14:00-16:00 UTC*|T330165| |
|eqiad C|April 4th - 13:00-15:00 UTC*|T331882| |
|eqiad D|April 18th - 13:00-15:00 UTC*|T333377| |
|codfw C|May 2nd - 13:00-15:00 UTC|T334049| |
|codfw D|May 16th - 13:00-15:00 UTC|T335042| |

(*) During DC switchover

Event Timeline

Script used to generate the servers lists:

from collections import defaultdict
from pprint import pprint
import re

# NOTE: `worker` is not defined in this snippet; it must already exist in the
# session the script is run from.

# Map each team (or " and "-joined combination of teams) to its server groups.
servers_per_teams = defaultdict(list)
for nodes, output in worker.get_results():
    # Drop the first two characters (presumably a leading "- ") and join
    # multi-line "- team" output with " and ".
    teams = output.message().decode()[2:].replace('\n- ', ' and ')
    # Split on commas that are not preceded by a digit, so that bracketed
    # ranges such as [2001,2003] stay in one piece.
    servers_per_teams[teams].extend(re.split(r'(?<![0-9]),', str(nodes)))

pprint(servers_per_teams)

# Team name -> Phabricator project tag.
phab_tag = {'Traffic': '#traffic',
            'Infrastructure Foundations': '#infrastructure-foundations',
            'WMCS': '#cloud-services-team',
            'ServiceOps-Collab': '#serviceops-collab',
            'Data Engineering': '#data-engineering',
            'Search Platform': '#discovery-search',
            'Observability': '#sre_observability',
            'Core Platform': '#core-platform-team',
            'Machine Learning': '#machine-learning-team',
            'Data Persistence': '#data-persistence',
            'ServiceOps': '#serviceops',
            }

# Print one section per team combination, with a table of its servers.
for teams, servers_groups in servers_per_teams.items():
    print(f"\n== {teams} ==")
    for team in teams.split(' and '):
        print(f"{phab_tag[team]}", end=' ')
    print("\n|Servers|Depool action needed|Repool action needed|Status|")
    print("|---|---|---|---|")
    for servers in servers_groups:
        servers_short = servers.replace('.codfw.wmnet', '').replace('.wikimedia.org', '')
        print(f"|{servers_short}| | | |")
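The least obvious part of the script above is the `(?<![0-9]),` split. Here is a minimal, self-contained sketch of what it does. The host list is made up for illustration and is not taken from a real run, and `split_groups` and `shorten` are hypothetical helper names. The pattern splits only on commas that are not preceded by a digit, so commas inside a bracketed range stay with their group.

```python
import re

# Hypothetical folded host list, written in the bracketed-range style the
# script receives from str(nodes). Not real output.
folded = (
    "cp[2027-2030].codfw.wmnet,"
    "db[2097,2101-2103].codfw.wmnet,"
    "lvs2007.codfw.wmnet"
)


def split_groups(nodes: str) -> list[str]:
    """Split on commas that are NOT preceded by a digit.

    Commas inside brackets follow a digit and are left alone; commas
    between groups follow the domain suffix and are split on.
    """
    return re.split(r"(?<![0-9]),", nodes)


def shorten(group: str) -> str:
    """Strip the domain suffixes, as the script does for the table rows."""
    return group.replace(".codfw.wmnet", "").replace(".wikimedia.org", "")


groups = split_groups(folded)
short = [shorten(g) for g in groups]

for name in short:
    print(f"|{name}| | | |")
```

One caveat of this approach: a comma that follows a bare hostname ending in a digit (for example `host1,host2`, with no domain suffix) is not split either, so the pattern relies on the groups being fully qualified names.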

Volans added a parent task: Restricted Task. (Apr 17 2023, 10:29 AM)
ayounsi removed a parent task: Restricted Task. (Apr 17 2023, 11:11 AM)
ayounsi claimed this task.
ayounsi updated the task description.

All stacks have been upgraded. Hopefully for the last time!