
bring 25 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2401)
Closed, Resolved (Public)

Description

35 new servers were procured in T271156.

They are then being racked and installed with an OS in T274171, which became possible after we made room for them in T277119 et al.

This brings them to the "insetup" state. From there they still need appserver puppet roles: we need to decide exactly which hardware gets which role,
add regexes in site.pp matching that, add the hosts to conftool-data, create mcrouter certificates, and more.

This type of ticket is often a missing step in our existing workflows.

  • mw2377.codfw.wmnet - jobrunner
  • mw2378.codfw.wmnet - jobrunner
  • mw2379.codfw.wmnet - jobrunner
  • mw2380.codfw.wmnet - jobrunner
  • mw2381.codfw.wmnet - jobrunner
  • mw2382.codfw.wmnet - jobrunner
  • mw2383.codfw.wmnet - app
  • mw2384.codfw.wmnet - app
  • mw2385.codfw.wmnet - app
  • mw2386.codfw.wmnet - app
  • mw2387.codfw.wmnet - app
  • mw2388.codfw.wmnet - app
  • mw2389.codfw.wmnet - app
  • mw2390.codfw.wmnet - app
  • mw2391.codfw.wmnet - app
  • mw2392.codfw.wmnet - app
  • mw2393.codfw.wmnet - app
  • mw2394.codfw.wmnet - app
  • mw2395.codfw.wmnet - api
  • mw2396.codfw.wmnet - api
  • mw2397.codfw.wmnet - api
  • mw2398.codfw.wmnet - api
  • mw2399.codfw.wmnet - api
  • mw2400.codfw.wmnet - api
  • mw2401.codfw.wmnet - api
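
The role assignment in the list above can be expressed as hostname regexes, similar in spirit to the matching done in site.pp. The following is only a sketch: the patterns, the `ROLE_PATTERNS` table, and the `role_for` helper are hypothetical and written to mirror this list. They are not the actual site.pp regexes, which are not shown in this task.

```python
import re
from collections import Counter

# Hypothetical patterns derived from the role list in this task description.
# The real regexes in site.pp may be written differently.
ROLE_PATTERNS = [
    (re.compile(r"^mw23(7[7-9]|8[0-2])\.codfw\.wmnet$"), "jobrunner"),  # mw2377-mw2382
    (re.compile(r"^mw23(8[3-9]|9[0-4])\.codfw\.wmnet$"), "app"),        # mw2383-mw2394
    (re.compile(r"^mw(239[5-9]|240[01])\.codfw\.wmnet$"), "api"),       # mw2395-mw2401
]


def role_for(fqdn):
    """Return the planned role for a host, or "insetup" if no pattern matches.

    Falling back to "insetup" is an assumption made for this sketch.
    """
    for pattern, role in ROLE_PATTERNS:
        if pattern.match(fqdn):
            return role
    return "insetup"


if __name__ == "__main__":
    hosts = [f"mw{n}.codfw.wmnet" for n in range(2377, 2402)]
    # Tally how many hosts end up in each role.
    print(Counter(role_for(h) for h in hosts))
```

Counting the roles over mw2377 through mw2401 is a quick way to check that the patterns cover all 25 hosts with no gaps or overlaps.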

More servers are being racked in A5 in T279599.

Event Timeline

Change 674727 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site/conftool-data: turn new servers mw2377,mw2378 into jobrunners

https://gerrit.wikimedia.org/r/674727

Change 674732 had a related patch set uploaded (by Dzahn; author: Dzahn):
[labs/private@master] add fake mcrouter certs for mw2377,mw2378

https://gerrit.wikimedia.org/r/674732

Change 674732 merged by Dzahn:
[labs/private@master] add fake mcrouter certs for mw2377,mw2378

https://gerrit.wikimedia.org/r/674732

Change 674727 merged by Dzahn:
[operations/puppet@production] site/conftool-data: turn new servers mw2377,mw2378 into jobrunners

https://gerrit.wikimedia.org/r/674727

Dzahn renamed this task from bring 35 new mediawiki appserver in codfw into production (mw2377 and up) to bring 35 new mediawiki appserver in codfw into production (mw2377 - mw2402). Mar 25 2021, 12:20 AM
Dzahn updated the task description. (Show Details)
Dzahn added a project: SRE.
Dzahn added subscribers: wkandek, Papaul.
jijiki triaged this task as High priority. Mar 29 2021, 9:01 PM

codfw:

number of appservers ("apaches"): 49

number of API appservers ("api"): 54

number of jobrunners/videoscalers ("jobrunner"): 18

eqiad:

number of appservers ("apaches"): 63

number of API appservers ("api"): 63

number of jobrunners/videoscalers ("jobrunner"): 24

Dzahn renamed this task from bring 35 new mediawiki appserver in codfw into production (mw2377 - mw2402) to bring 35 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402). Mar 31 2021, 8:14 PM
Dzahn updated the task description. (Show Details)
Dzahn renamed this task from bring 35 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) to bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402). Mar 31 2021, 8:18 PM
Dzahn updated the task description. (Show Details)

Adding 4 more jobrunners, 12 app servers and 8 API servers.

This will let us finish the decom task T277780 and bring the number of app servers to 61 and API servers to 62 (vs. 63/63 in eqiad).
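
The arithmetic behind those targets can be checked with a short sketch. It assumes the codfw and eqiad counts posted earlier in this task as the baseline and simply adds the new hosts, ignoring anything removed by the decom task. The dictionary names and the `projected` helper are made up for illustration.

```python
# Baseline counts as posted earlier in this task (assumed to be the starting point).
codfw = {"apaches": 49, "api": 54}
eqiad = {"apaches": 63, "api": 63}

# New servers being added in this step, per the comment above.
added = {"apaches": 12, "api": 8}


def projected(current, extra):
    """Return per-cluster counts after adding the extra servers."""
    return {cluster: current[cluster] + extra.get(cluster, 0) for cluster in current}


if __name__ == "__main__":
    after = projected(codfw, added)
    for cluster, count in after.items():
        # Remaining difference to eqiad for the same cluster.
        gap = eqiad[cluster] - count
        print(f"{cluster}: codfw {count} vs eqiad {eqiad[cluster]} (gap {gap})")
```

Under these assumptions codfw ends up with 61 app servers and 62 API servers, leaving it 2 and 1 short of eqiad respectively.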

Change 676153 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: add 12 more appserver and 8 more API servers

https://gerrit.wikimedia.org/r/676153

Change 676153 merged by Dzahn:

[operations/puppet@production] site/conftool-data: add 24 new codfw appservers with insetup role

https://gerrit.wikimedia.org/r/676153

Change 676437 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: turn 4 more new servers into jobrunners

https://gerrit.wikimedia.org/r/676437

Change 676437 merged by Dzahn:

[operations/puppet@production] site/conftool-data: turn 4 more new servers into jobrunners

https://gerrit.wikimedia.org/r/676437

Change 676442 had a related patch set uploaded (by Dzahn; author: Dzahn):

[labs/private@master] add fake mcrouter certs for mw2379 through mw2402

https://gerrit.wikimedia.org/r/676442

Change 676442 merged by Dzahn:

[labs/private@master] add fake mcrouter certs for mw2379 through mw2402

https://gerrit.wikimedia.org/r/676442

Change 676484 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: turn 12 new codfw servers into mw appservers

https://gerrit.wikimedia.org/r/676484

Change 676484 merged by Dzahn:

[operations/puppet@production] site: turn 12 new codfw servers into mw appservers

https://gerrit.wikimedia.org/r/676484

Change 676673 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: turn 8 new codfw servers into API appservers

https://gerrit.wikimedia.org/r/676673

Mentioned in SAL (#wikimedia-operations) [2021-04-02T21:19:05Z] <mutante> generating mcrouter certs for mw2395 through mw2404 (T278396)

Change 676673 merged by Dzahn:

[operations/puppet@production] site/conftool-data: turn 8 new codfw servers into API appservers

https://gerrit.wikimedia.org/r/676673

Change 676677 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet

https://gerrit.wikimedia.org/r/676677

Dzahn updated the task description. (Show Details)

Change 676677 merged by Dzahn:

[operations/puppet@production] site/conftool-data: mw2397 through mw2402 back to insetup, not ready yet

https://gerrit.wikimedia.org/r/676677

20:44 < mutante> !log mw2385 through mw2394 - serial rebooting
20:58 < mutante> !log mw238* - scap pull via cumin not possible because it doesnt work as root
21:07 < mutante> !log mw2383 through mw2394 - 'uptime && scap pull' via ssh -C (not cumin because it needs to run as non-root)
21:19 < mutante> !log generating mcrouter certs for mw2395 through mw2404 (T278396)
21:42 < mutante> !log pooled 12 brand-new codfw appservers running on new hardware generation
21:48 < mutante> !log mw2395, mw2396 - reboot - becoming API servers
22:08 < mutante> !log pooled mw2395,mw2396 as API appservers running on new hardware
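
Since the log above notes that scap pull could not be run through cumin as root, the per-host ssh approach can be sketched as a loop that builds one command per host. This is only an illustration: `host_range` and `scap_pull_commands` are hypothetical helpers, the script only prints the commands rather than running them, and it assumes the ssh session is opened as a suitable non-root user.

```python
import shlex


def host_range(first, last, domain="codfw.wmnet"):
    """Expand a numeric range into mwNNNN FQDNs, inclusive on both ends."""
    return [f"mw{n}.{domain}" for n in range(first, last + 1)]


def scap_pull_commands(hosts):
    """Build one ssh argv per host running 'uptime && scap pull' remotely.

    The remote command and the -C (compression) flag are taken from the
    log line above; nothing is executed here.
    """
    remote = "uptime && scap pull"
    return [["ssh", "-C", host, remote] for host in hosts]


if __name__ == "__main__":
    # mw2383 through mw2394, the range named in the log entry.
    for argv in scap_pull_commands(host_range(2383, 2394)):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review the host list before running anything, for example by passing each argv to `subprocess.run` one host at a time.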

Mentioned in SAL (#wikimedia-operations) [2021-04-07T20:30:51Z] <mutante> mw2397 through mw2402 - new hardware moving into production, initial puppet runs as appservers, added to monitoring etc (T278396)

Mentioned in SAL (#wikimedia-operations) [2021-04-07T21:22:33Z] <mutante> mw2397 through mw2402 - pooled as new API appservers after scap pull and all monitoring green (T278396)

Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description. (Show Details)

rack A3 completed

mw2397 - mw2402 set to Active in Netbox

Dzahn renamed this task from bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) to bring 25 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2401). Apr 7 2021, 10:13 PM
Dzahn updated the task description. (Show Details)

Change 678926 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners

https://gerrit.wikimedia.org/r/678926

Change 678926 merged by Dzahn:

[operations/puppet@production] site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners

https://gerrit.wikimedia.org/r/678926