
move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet)
Closed, Resolved (Public)

Description

After all 86 new servers (mw2[291-2377].codfw.wmnet) have been racked by dcops in T241852, serviceops takes over here and carries out the steps to move them into production, up to the point where they are pooled and serving traffic.

We want a separate ticket because different people do this work, and otherwise tickets stay open on the dcops workboard even though the dcops part is done.

15 servers are blocked by T247018


mw2291 through mw2324 are pooled and status active in netbox (34 servers) https://gerrit.wikimedia.org/r/q/topic:%22appservers-codfw%22+(status:open%20OR%20status:merged)

mw2325 through mw2334 are pooled and status active in netbox (10 servers) https://gerrit.wikimedia.org/r/c/operations/puppet/+/577408

mw2335 through mw2349 are not pooled, not in site.pp and status planned in netbox (15 servers) (blocked by T247018)

mw2350 through mw2376 are pooled, in site.pp and status active in netbox (27 servers) (https://gerrit.wikimedia.org/r/c/operations/puppet/+/577409)

total: 86 servers
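The per-range counts above can be cross-checked with a few lines of Python. This is only a sketch: the range boundaries are copied from the list above, while the `RANGES` name, its layout and the short labels are made up for illustration and are not Netbox or conftool values.

```python
# Host-number ranges copied from the task description (inclusive on both ends).
# The labels are informal summaries, not Netbox or conftool values.
RANGES = [
    (2291, 2324, "pooled, active"),
    (2325, 2334, "pooled, active"),
    (2335, 2349, "blocked by T247018"),
    (2350, 2376, "pooled, active"),
]


def count(first, last):
    """Number of hosts in the inclusive range mwFIRST..mwLAST."""
    return last - first + 1


counts = [count(first, last) for first, last, _ in RANGES]
total = sum(counts)

for (first, last, label), n in zip(RANGES, counts):
    print(f"mw{first} through mw{last}: {n} servers ({label})")
print(f"total: {total} servers")
```

The four ranges come out as 34, 10, 15 and 27 servers, which sums to 86. Subtracting the 15 blocked servers leaves 71.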

Event Timeline

Only 71 of the 86 servers can be done until T247018 is resolved, because there is not enough rack space for the remaining 15.

Change 577388 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add mw2301 through mw2309 as api and appservers

https://gerrit.wikimedia.org/r/577388

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install

mw[2301-2309].codfw.wmnet

Change 577388 merged by Dzahn:
[operations/puppet@production] site: add mw2301 through mw2309 as api and appservers

https://gerrit.wikimedia.org/r/577388

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install

mw[2301-2309].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2020-03-06T01:38:12Z] <mutante> added 9 more appservers to codfw pool split between appserver and API appservers, weight 15 (like all in codfw) T247021

mw2301 through mw2309 pooled and set to active in netbox

Dzahn renamed this task from move all 86 new codfw appservers into production to move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet). Mar 6 2020, 1:42 AM
Dzahn updated the task description.

mw2291 through mw2324 are now pooled and status active in netbox (34 servers)

mw2325 through mw2334 are not pooled but in site.pp and status staged in netbox (10 servers)

mw2335 through mw2349 are not pooled, not in site.pp and status planned in netbox (15 servers) (blocked by T247018)

mw2350 through mw2376 are not pooled, in site.pp and status staged in netbox (27 servers)

total: 86 servers

Change 577408 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add mw2325-mw2334 as API and appservers, codfw rack B6

https://gerrit.wikimedia.org/r/577408

Change 577409 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add mw2350-2376 as API and appservers, codfw rack C6

https://gerrit.wikimedia.org/r/577409

Dzahn triaged this task as High priority.

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install

mw[2325-2329].codfw.wmnet

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install

mw[2331-2334].codfw.wmnet

Change 577408 merged by Dzahn:
[operations/puppet@production] add mw2325-mw2334 as API and appservers, codfw rack B6

https://gerrit.wikimedia.org/r/577408

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install

mw[2331-2334].codfw.wmnet

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install

mw[2325-2329].codfw.wmnet

{"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=apache2"}
{"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2326.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}
{"mw2326.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2327.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=apache2"}
{"mw2327.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2328.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2328.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}
{"mw2329.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=apache2"}
{"mw2329.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2330.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}
{"mw2330.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2331.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=apache2"}
{"mw2331.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2332.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw2332.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}
{"mw2333.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=apache2"}
{"mw2333.codfw.wmnet": {"pooled": "yes", "weight": 15}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2334.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}
{"mw2334.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
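
The pooling output above is one JSON object per line, with the hostname as one key next to a `tags` key. A minimal sketch of how such lines could be summarised per cluster, assuming exactly that shape; the `summarize` helper is hypothetical, written for illustration only, and is not part of conftool. The sample lines are the first three lines of the output above.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Group pooled hostnames by cluster tag.

    Assumes each line is a JSON object with one hostname key and a "tags"
    key of the form "dc=...,cluster=...,service=...", as in the output above.
    """
    pooled = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        tags = dict(item.split("=", 1) for item in obj.pop("tags").split(","))
        # Whatever key remains after removing "tags" is the hostname.
        for host, state in obj.items():
            if state.get("pooled") == "yes":
                pooled[tags["cluster"]].add(host)
    return {cluster: sorted(hosts) for cluster, hosts in pooled.items()}


sample = [
    '{"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=apache2"}',
    '{"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}',
    '{"mw2326.codfw.wmnet": {"weight": 15, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=apache2"}',
]
result = summarize(sample)
print(result)
```

Each host appears twice in the output (once for apache2, once for nginx), so collecting hostnames in a set avoids double counting. In the full output above, the odd-numbered hosts carry the appserver tag and the even-numbered hosts carry api_appserver, five of each.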

mw2325 through mw2334 set to Active in Netbox

10 servers pooled at 18:05 UTC, March 6th.

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install

mw[2350-2376].codfw.wmnet

Change 577409 merged by Dzahn:
[operations/puppet@production] add mw2350-2376 as API and appservers, codfw rack C6

https://gerrit.wikimedia.org/r/577409

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install

mw[2350-2376].codfw.wmnet

Change 578630 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: remove duplicate regex for mw2366-mw2376

https://gerrit.wikimedia.org/r/578630

Change 578630 merged by Dzahn:
[operations/puppet@production] site: fix duplicate regex and row for mw2366-mw2376

https://gerrit.wikimedia.org/r/578630

Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 6 host(s) and their services with reason: new_install

mw[2366,2368,2370,2372,2374,2376].codfw.wmnet
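
The host lists in these log entries use a bracket shorthand, either a range such as mw[2350-2376].codfw.wmnet or a comma-separated list as above. A rough sketch of expanding one bracket group into full hostnames follows; `expand` is a hypothetical helper written for this note, not the parser used by the tooling on cumin1001, and it only handles a single bracket group of plain numbers and inclusive ranges.

```python
import re


def expand(expr):
    """Expand 'prefix[a-b,c,...]suffix' into a list of hostnames.

    Handles exactly one bracket group containing comma-separated
    integers and inclusive ranges; anything else is returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([0-9,\-]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            numbers = range(int(first), int(last) + 1)
        else:
            numbers = [int(part)]
        hosts.extend(f"{prefix}{n}{suffix}" for n in numbers)
    return hosts


even_hosts = expand("mw[2366,2368,2370,2372,2374,2376].codfw.wmnet")
range_hosts = expand("mw[2350-2376].codfw.wmnet")
print(len(even_hosts), even_hosts[0], even_hosts[-1])
print(len(range_hosts), range_hosts[0], range_hosts[-1])
```

The two expressions expand to 6 and 27 hostnames, matching the host counts reported in the corresponding Icinga downtime entries.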

mw2350 through mw2376 are all pooled in production and set to "Active" in netbox now.

Dzahn changed the task status from Open to Stalled. Apr 10 2020, 9:32 AM

stalled by T247018

Dzahn changed the task status from Stalled to Open. May 22 2020, 12:19 PM

Change 599749 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new appservers mw2336 through mw2339

https://gerrit.wikimedia.org/r/599749

Change 604339 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add management and production IPs for mw2335-mw2339

https://gerrit.wikimedia.org/r/604339

Change 604339 merged by Dzahn:
[operations/dns@master] add management and production IPs for mw2335-mw2339

https://gerrit.wikimedia.org/r/604339

Change 599749 merged by Dzahn:
[operations/puppet@production] site: add new appservers mw2335 through mw2339

https://gerrit.wikimedia.org/r/599749

Mentioned in SAL (#wikimedia-operations) [2020-06-17T14:13:33Z] <mutante> generating new mcrouter certs for mw2335 - mw2339 (T247021)

Change 606195 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake mcrouter certs for mw2335 - mw2339

https://gerrit.wikimedia.org/r/606195

Change 606195 merged by Dzahn:
[labs/private@master] add fake mcrouter certs for mw2335 - mw2339

https://gerrit.wikimedia.org/r/606195

Change 606197 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool: add mw2335 - mw2339

https://gerrit.wikimedia.org/r/606197

Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install

mw[2335-2339].codfw.wmnet

Change 606197 merged by Dzahn:
[operations/puppet@production] conftool: add mw2335 - mw2339

https://gerrit.wikimedia.org/r/606197

Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install

mw[2335-2339].codfw.wmnet

mw2335 through mw2339 in rack C3 have also been taken into production now.

This should complete the ticket. All hosts from mw2291 through mw2376 are pooled.

Just mw2377 seems to be missing.

76 servers are pooled as appservers. 10 have been used for kubernetes. Adds up to 86.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2339.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006181816_dzahn_145765_mw2339_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2339.codfw.wmnet']

and were ALL successful.