mw2420-mw2451 service implementation tracking
Closed, Resolved (Public)

Description

This task is created at the request of serviceops and tied as a dependency to parent racking task T326362.

Once T326362 shows complete, the service ops team can use this task to track service implementation.

Final mapping

TBA

Tentative Mapping of hosts to clusters (appservers, api, jobrunner)

These hosts are replacing mw22[59-90]. Thus we should try to go for a 1:1 mapping, correcting for possible imbalances.

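| Old host | Old cluster | Old rack | Proposed new host | New rack | Proposed new cluster | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| mw2259 | jobrunner | B3 | mw2420 | A5 | appserver | change from jobrunner |
| mw2260 | jobrunner | B3 | mw2421 | A5 | appserver | change from jobrunner |
| mw2261 | api | B3 | mw2422 | A5 | api | |
| mw2262 | api | B3 | mw2423 | A5 | api | |
| mw2263 | jobrunner | B3 | mw2424 | A6 | api | change from jobrunner |
| mw2264 | jobrunner | B3 | mw2425 | A6 | appserver | change from jobrunner |
| mw2265 | jobrunner | B3 | mw2426 | A6 | jobrunner | |
| mw2266 | jobrunner | B3 | mw2427 | A6 | jobrunner | |
| mw2267 | jobrunner | B3 | mw2428 | B6 | jobrunner | |
| mw2268 | appserver | B3 | mw2429 | B6 | jobrunner | change from appserver |
| mw2269 | appserver | B3 | mw2430 | B6 | jobrunner | change from appserver |
| mw2270 | appserver | B3 | mw2431 | B6 | appserver | |
| mw2271 | appserver | D3 | mw2432 | B8 | appserver | |
| mw2272 | appserver | D3 | mw2433 | B8 | appserver | |
| mw2273 | appserver | D3 | mw2434 | B8 | api | change from appserver |
| mw2274 | appserver | D3 | mw2435 | B8 | api | change from appserver |
| mw2275 | appserver | D3 | mw2436 | C1 | api | change from appserver |
| mw2276 | appserver | D3 | mw2437 | C1 | api | change from appserver |
| mw2277 | appserver | D3 | mw2438 | C1 | appserver | |
| mw2278 | jobrunner | D3 | mw2439 | C1 | appserver | change from jobrunner |
| mw2279 | jobrunner | D3 | mw2440 | C5 | api | change from jobrunner |
| mw2280 | appserver | D4 | mw2441 | C5 | appserver | |
| mw2281 | jobrunner | D4 | mw2442 | C5 | api | change from jobrunner |
| mw2282 | jobrunner | D4 | mw2443 | C5 | api | change from jobrunner |
| mw2283 | api | D4 | mw2444 | D5 | jobrunner | change from api |
| mw2284 | api | D4 | mw2445 | D5 | jobrunner | change from api |
| mw2285 | api | D4 | mw2446 | D5 | jobrunner | change from api |
| mw2286 | api | D4 | mw2447 | D5 | appserver | change from api |
| mw2287 | api | D4 | mw2448 | D6 | appserver | change from api |
| mw2288 | api | D4 | mw2449 | D6 | appserver | change from api |
| mw2289 | api | D4 | mw2450 | D6 | api | |
| mw2290 | api | D4 | mw2451 | D6 | api | |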
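
The host pairing in the tentative mapping follows a fixed numeric offset: mw2259 pairs with mw2420, mw2290 with mw2451. A minimal sketch of generating that pairing, assuming the offset holds for the whole range (the constant and function names are hypothetical, chosen only for illustration):

```python
# Hypothetical helper: pair each old host with its replacement by a fixed offset.
# The ranges come from the task description (mw2259-mw2290 and mw2420-mw2451).
OLD_FIRST, OLD_LAST = 2259, 2290
NEW_FIRST = 2420


def one_to_one_mapping():
    """Return {old_host: new_host} for the 32 hosts being replaced."""
    offset = NEW_FIRST - OLD_FIRST  # 161
    return {f"mw{n}": f"mw{n + offset}" for n in range(OLD_FIRST, OLD_LAST + 1)}


mapping = one_to_one_mapping()
print(len(mapping), mapping["mw2259"], mapping["mw2290"])
```

This only reproduces the host-to-host pairing. The cluster assignment per host still has to come from the table, since it is adjusted by hand to correct imbalances.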

Some stats

Resource allocation

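| Cluster | codfw nodes | codfw CPUs | codfw memory | eqiad nodes | eqiad CPUs | eqiad memory |
| --- | --- | --- | --- | --- | --- | --- |
| appserver | 66 | 3048 | 6.23 TB | 73 | 3152 | 7.05 TB |
| api | 64 | 3048 | 6.18 TB | 62 | 2704 | 6.18 TB |
| jobrunner | 22 | 1048 | 1.88 TB | 17 | 728 | 2.09 TB |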

Old hosts mapping of clusters to racks (that is, without mw2420-mw2451 but with mw2259-mw2290 taken into account)

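| Cluster | Rack | Count | Percent (%) |
| --- | --- | --- | --- |
| api | A3 | 16 | 10.67 |
| appserver | A3 | 15 | 10.00 |
| appserver | D3 | 12 | 8.00 |
| api | B3 | 10 | 6.67 |
| appserver | B3 | 10 | 6.67 |
| appserver | C3 | 9 | 6.00 |
| api | D4 | 8 | 5.33 |
| api | C6 | 8 | 5.33 |
| jobrunner | B3 | 7 | 4.67 |
| api | D3 | 6 | 4.00 |
| appserver | A6 | 5 | 3.33 |
| appserver | B6 | 5 | 3.33 |
| api | B6 | 5 | 3.33 |
| api | A6 | 4 | 2.67 |
| jobrunner | C6 | 4 | 2.67 |
| appserver | C6 | 4 | 2.67 |
| jobrunner | A3 | 4 | 2.67 |
| api | A5 | 4 | 2.67 |
| appserver | A5 | 4 | 2.67 |
| api | C3 | 3 | 2.00 |
| jobrunner | D3 | 2 | 1.33 |
| jobrunner | D4 | 2 | 1.33 |
| jobrunner | A5 | 2 | 1.33 |
| jobrunner | C3 | 1 | 0.67 |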

Old hosts mapping of clusters to rows (that is, without mw2420-mw2451 but with mw2259-mw2290 taken into account)

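| Cluster | Location | Count | Percent (%) |
| --- | --- | --- | --- |
| api | codfw row A | 24 | 16.00 |
| appserver | codfw row A | 24 | 16.00 |
| api | codfw row B | 15 | 10.00 |
| appserver | codfw row B | 15 | 10.00 |
| api | codfw row D | 14 | 9.33 |
| appserver | codfw row C | 13 | 8.67 |
| appserver | codfw row D | 12 | 8.00 |
| api | codfw row C | 11 | 7.33 |
| jobrunner | codfw row B | 7 | 4.67 |
| jobrunner | codfw row A | 6 | 4.00 |
| jobrunner | codfw row C | 5 | 3.33 |
| jobrunner | codfw row D | 4 | 2.67 |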

New hosts mapping of clusters to racks (that is, with mw2420-mw2451 included and mw2259-mw2290 removed)

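| Cluster | Rack | Count | Percent (%) |
| --- | --- | --- | --- |
| api | A3 | 16 | 10.60 |
| appserver | A3 | 15 | 9.93 |
| appserver | C3 | 9 | 5.96 |
| api | B3 | 8 | 5.30 |
| appserver | B6 | 8 | 5.30 |
| api | C6 | 8 | 5.30 |
| appserver | B3 | 7 | 4.64 |
| api | D3 | 6 | 3.97 |
| api | A5 | 6 | 3.97 |
| appserver | A6 | 5 | 3.31 |
| api | B6 | 5 | 3.31 |
| appserver | D3 | 5 | 3.31 |
| api | A6 | 4 | 2.65 |
| jobrunner | C6 | 4 | 2.65 |
| appserver | C6 | 4 | 2.65 |
| jobrunner | A3 | 4 | 2.65 |
| appserver | A5 | 4 | 2.65 |
| jobrunner | A5 | 4 | 2.65 |
| jobrunner | A6 | 4 | 2.65 |
| appserver | B8 | 4 | 2.65 |
| api | D5 | 4 | 2.65 |
| api | D6 | 4 | 2.65 |
| api | C3 | 3 | 1.99 |
| appserver | C1 | 3 | 1.99 |
| jobrunner | C5 | 3 | 1.99 |
| jobrunner | C3 | 1 | 0.66 |
| jobrunner | B6 | 1 | 0.66 |
| jobrunner | C1 | 1 | 0.66 |
| api | C5 | 1 | 0.66 |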

New hosts mapping of clusters to rows (that is, with mw2420-mw2451 included and mw2259-mw2290 removed)

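| Cluster | Location | Count | Percent (%) |
| --- | --- | --- | --- |
| api | codfw row A | 26 | 17.22 |
| appserver | codfw row A | 24 | 15.89 |
| appserver | codfw row B | 19 | 12.58 |
| appserver | codfw row C | 16 | 10.60 |
| api | codfw row D | 14 | 9.27 |
| api | codfw row B | 13 | 8.61 |
| api | codfw row C | 12 | 7.95 |
| jobrunner | codfw row A | 12 | 7.95 |
| jobrunner | codfw row C | 9 | 5.96 |
| appserver | codfw row D | 5 | 3.31 |
| jobrunner | codfw row B | 1 | 0.66 |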

Event Timeline

Icinga downtime and Alertmanager silence (ID=7c189d79-c66e-4544-923a-2145f8cedf2f) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 32 host(s) and their services with reason: In setup

mw[2420-2451].codfw.wmnet

We decided we'll put these into service after the upcoming DC switchover, so we'll make a plan at the March 6 serviceops meeting.

@akosiaris and @Clement_Goubert will come up with a cluster layout this week, and @Clement_Goubert wanted to try putting at least one or two into service themselves. Feel free to assign to me afterward to churn through the rest.

akosiaris triaged this task as Medium priority. Mar 9 2023, 7:56 AM

While I did provide data on specific racks, given our availability zones are centered around rows right now, I am gonna focus on rows. Looking at the data I note the following.

Using the 1:1 mapping as is:

  1. appservers end up with very low presence in row D (5 hosts). That's a regression from the previous situation, where they had 12. They also have very high presence in row A (24 hosts), although that is not a regression: it is the same as before.
  2. api has a more uniform distribution overall, but with higher presence in row A.
  3. jobrunners end up with 0 presence in row D, very high presence in row A (12 hosts), and minimal presence (1 host) in row B.

Of the above, points 1 and 3, if left as is, will cause severe issues if a row in codfw becomes unavailable, so they need to be addressed.
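
To make the three observations easier to check, here is a rough sketch that turns the per-row counts of the 1:1 mapping into per-cluster shares. The counts are copied from the "New hosts mapping of clusters to rows" table in the description; the function name `row_shares` is made up for this example.

```python
from collections import defaultdict

ROWS = ("A", "B", "C", "D")

# (cluster, codfw row) -> host count, copied from the 1:1 mapping row table.
counts = {
    ("api", "A"): 26, ("api", "B"): 13, ("api", "C"): 12, ("api", "D"): 14,
    ("appserver", "A"): 24, ("appserver", "B"): 19,
    ("appserver", "C"): 16, ("appserver", "D"): 5,
    ("jobrunner", "A"): 12, ("jobrunner", "B"): 1, ("jobrunner", "C"): 9,
}


def row_shares(counts):
    """Return {cluster: {row: fraction of that cluster's hosts in the row}}."""
    totals = defaultdict(int)
    for (cluster, _row), n in counts.items():
        totals[cluster] += n
    return {
        cluster: {row: counts.get((cluster, row), 0) / total for row in ROWS}
        for cluster, total in totals.items()
    }


shares = row_shares(counts)
for cluster, by_row in shares.items():
    print(cluster, {row: f"{share:.0%}" for row, share in by_row.items()})
```

With these numbers, jobrunners have more than half of their hosts in row A and none in row D, and appservers have under 10% of their hosts in row D, which is what points 1 and 3 describe.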

Playing around with data using the following constraints:

  • We are 40%+ skewed towards using row A across all mw2* hosts (this isn't easily fixable right now)
  • I can only easily mess around with the allocation of mw2420-mw2451 (the new ones) in clusters

I have the following proposal:

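| Cluster | Host |
| --- | --- |
| appserver | mw2420 |
| appserver | mw2421 |
| api | mw2422 |
| api | mw2423 |
| api | mw2424 |
| appserver | mw2425 |
| jobrunner | mw2426 |
| jobrunner | mw2427 |
| jobrunner | mw2428 |
| jobrunner | mw2429 |
| jobrunner | mw2430 |
| appserver | mw2431 |
| appserver | mw2432 |
| appserver | mw2433 |
| api | mw2434 |
| api | mw2435 |
| api | mw2436 |
| api | mw2437 |
| appserver | mw2438 |
| appserver | mw2439 |
| api | mw2440 |
| appserver | mw2441 |
| api | mw2442 |
| api | mw2443 |
| jobrunner | mw2444 |
| jobrunner | mw2445 |
| jobrunner | mw2446 |
| appserver | mw2447 |
| appserver | mw2448 |
| appserver | mw2449 |
| api | mw2450 |
| api | mw2451 |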

This ends up with the following overall stats:

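| Cluster | Count | Percent (%) |
| --- | --- | --- |
| api | 66 | 43.71 |
| appserver | 66 | 43.71 |
| jobrunner | 19 | 12.58 |

| Cluster | Location | Count | Percent (%) |
| --- | --- | --- | --- |
| api | codfw row A | 27 | 17.88 |
| appserver | codfw row A | 27 | 17.88 |
| appserver | codfw row C | 16 | 10.60 |
| api | codfw row C | 16 | 10.60 |
| appserver | codfw row B | 15 | 9.93 |
| api | codfw row B | 15 | 9.93 |
| api | codfw row D | 8 | 5.30 |
| appserver | codfw row D | 8 | 5.30 |
| jobrunner | codfw row A | 8 | 5.30 |
| jobrunner | codfw row C | 5 | 3.31 |
| jobrunner | codfw row B | 3 | 1.99 |
| jobrunner | codfw row D | 3 | 1.99 |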

That looks a lot better balanced. Even without touching the row A skew, we wouldn't dip below 50% capacity in any cluster if we lose row A (which was the concern for jobrunners).
We're at around 20 to 30% php-fpm worker usage with codfw pooled RW and serving alone (as is the case at the moment). We'd probably saturate on spikes, but we should still be OK.
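
A quick way to sanity-check the 50% claim is to compute, per cluster, what is left after losing each row. The sketch below uses the host counts from the proposal's row table and assumes every host in a cluster contributes equal capacity, which is a simplification (CPU counts and php-fpm worker settings are not taken into account). The name `remaining_after_loss` is hypothetical.

```python
ROWS = ("A", "B", "C", "D")

# Hosts per cluster and codfw row under the proposal (from the stats table).
proposal = {
    "api": {"A": 27, "B": 15, "C": 16, "D": 8},
    "appserver": {"A": 27, "B": 15, "C": 16, "D": 8},
    "jobrunner": {"A": 8, "B": 3, "C": 5, "D": 3},
}


def remaining_after_loss(by_row, lost_row):
    """Fraction of a cluster's hosts still available if lost_row goes away."""
    total = sum(by_row.values())
    return (total - by_row.get(lost_row, 0)) / total


for cluster, by_row in proposal.items():
    # The worst case is the row whose loss leaves the smallest fraction.
    worst = min(ROWS, key=lambda row: remaining_after_loss(by_row, row))
    left = remaining_after_loss(by_row, worst)
    print(f"{cluster}: losing row {worst} leaves {left:.1%} of hosts")
```

Under that equal-capacity assumption, row A is the worst case for all three clusters, and each keeps roughly 58 to 59% of its hosts.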

Change 896063 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Assign mediawiki roles to mw2420-mw2451

https://gerrit.wikimedia.org/r/896063

Icinga downtime and Alertmanager silence (ID=c5ba1cf2-f027-43f9-8672-b4eb30f98ddc) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet

Change 896063 merged by Clément Goubert:

[operations/puppet@production] Assign mediawiki roles to mw2420-mw2451

https://gerrit.wikimedia.org/r/896063

Icinga downtime and Alertmanager silence (ID=17f33514-0b87-4f50-abfa-6cd2e1548410) set by cgoubert@cumin1001 for 5:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet

Icinga downtime and Alertmanager silence (ID=f7f64d19-c64a-4fb5-a8ab-f3218dfd9862) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet

Icinga downtime and Alertmanager silence (ID=33992616-b446-4bc5-bf17-27cb8c47e8d7) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-16T15:15:15Z] <claime> Pooling new mw hosts mw24[20-51].codfw.wmnet - T326363

Mentioned in SAL (#wikimedia-operations) [2023-03-16T15:39:50Z] <claime> Pooled new mw hosts mw24[20-51].codfw.wmnet - T326363

All done

{"mw2422.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2423.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2424.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2434.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2435.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2436.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2437.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2440.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2442.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2443.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2450.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2451.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2420.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2421.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2425.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2431.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2432.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2433.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2438.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2439.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2441.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2447.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2448.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2449.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2426.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2427.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2428.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2429.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2430.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2444.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2445.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2446.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw2426.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2427.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2428.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2429.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2430.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2444.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2445.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2446.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=jobrunner,service=nginx"}
{"mw2426.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2427.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2428.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2429.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2430.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2444.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2445.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2446.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2426.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2427.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2428.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2429.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2430.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2444.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2445.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}
{"mw2446.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=videoscaler,service=nginx"}