
Bring 43 new MediaWiki appservers in eqiad into production
Closed, Resolved, Public

Description

45 (43) new MediaWiki appservers were procured in T271155.

43 are being racked in T273915.

Just like T278396 for codfw, this is the task to:

  • make a plan for how many become appservers, API servers, jobrunners/videoscalers, canaries, or ... dedicated jobrunners (T279100)
  • add appropriate regexes to site.pp, matching the puppet roles above
  • add hosts to conftool-data to the right sections
  • initial puppet run, reboot, check monitoring, set weight, pool
  • set up canaries and recreate hieradata (removed in 702659)
  • old servers have to be decom'ed in parallel (-> T280203)
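The regex step above can be sketched in Python. This is only an illustration of matching hostnames to roles by pattern, not the actual site.pp content: the role labels (`appserver`, `api_appserver`, and so on) and the fallback to `insetup` for unmatched hosts are assumptions, and the number ranges are taken from the rack listing in this task description.

```python
import re

# Hypothetical role labels; the host ranges come from the rack listing in this
# task description, not from the real site.pp.
ROLE_PATTERNS = [
    (re.compile(r"^mw14(1[4-9]|20|29|3[0-6]|39|4[0-2]|5[1-6])\.eqiad\.wmnet$"), "appserver"),
    (re.compile(r"^mw14(2[1-8]|4[34])\.eqiad\.wmnet$"), "api_appserver"),
    (re.compile(r"^mw143[78]\.eqiad\.wmnet$"), "jobrunner_canary"),
    (re.compile(r"^mw144[56]\.eqiad\.wmnet$"), "jobrunner"),
    (re.compile(r"^mw14(4[7-9]|50)\.eqiad\.wmnet$"), "api_canary"),
]


def classify(fqdn):
    """Return the first role whose pattern matches the FQDN."""
    for pattern, role in ROLE_PATTERNS:
        if pattern.match(fqdn):
            return role
    # Assumption: hosts not yet assigned a role stay in a setup role.
    return "insetup"


if __name__ == "__main__":
    for number in (1420, 1421, 1438, 1446, 1450):
        fqdn = f"mw{number}.eqiad.wmnet"
        print(fqdn, "->", classify(fqdn))
```

Keeping the patterns anchored with `^` and `$` avoids a short pattern accidentally matching a longer hostname.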

Rack A3

  • mw1414.eqiad.wmnet - appserver
  • mw1415.eqiad.wmnet - appserver
  • mw1416.eqiad.wmnet - appserver
  • mw1417.eqiad.wmnet - appserver
  • mw1418.eqiad.wmnet - appserver
  • mw1419.eqiad.wmnet - appserver
  • mw1420.eqiad.wmnet - appserver
  • mw1421.eqiad.wmnet - API server
  • mw1422.eqiad.wmnet - API server

Rack B3

  • mw1423.eqiad.wmnet - API server
  • mw1424.eqiad.wmnet - API server
  • mw1425.eqiad.wmnet - API server
  • mw1426.eqiad.wmnet - API server
  • mw1427.eqiad.wmnet - API server
  • mw1428.eqiad.wmnet - API server
  • mw1429.eqiad.wmnet - appserver
  • mw1430.eqiad.wmnet - appserver
  • mw1431.eqiad.wmnet - appserver
  • mw1432.eqiad.wmnet - appserver
  • mw1433.eqiad.wmnet - appserver

Rack C3

  • mw1434.eqiad.wmnet - appserver
  • mw1435.eqiad.wmnet - appserver
  • mw1436.eqiad.wmnet - appserver

Rack D8

  • mw1437.eqiad.wmnet - jobrunner canary
  • mw1438.eqiad.wmnet - jobrunner canary
  • mw1439.eqiad.wmnet - appserver
  • mw1440.eqiad.wmnet - appserver
  • mw1441.eqiad.wmnet - appserver
  • mw1442.eqiad.wmnet - appserver
  • mw1443.eqiad.wmnet - API server
  • mw1444.eqiad.wmnet - API server (!) - NOT REACHABLE via SSH - FIXED
  • mw1445.eqiad.wmnet - jobrunner
  • mw1446.eqiad.wmnet - jobrunner
  • mw1447.eqiad.wmnet - Canary API server
  • mw1448.eqiad.wmnet - Canary API server
  • mw1449.eqiad.wmnet - Canary API server
  • mw1450.eqiad.wmnet - Canary API server

Rack A1

  • mw1451.eqiad.wmnet - appserver
  • mw1452.eqiad.wmnet - appserver

Rack A8

  • mw1453.eqiad.wmnet - appserver
  • mw1454.eqiad.wmnet - appserver
  • mw1455.eqiad.wmnet - appserver
  • mw1456.eqiad.wmnet - appserver
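As a cross-check of the listing above, the following sketch tallies the hosts per row and per role. The data structure and the `tally` helper are illustrative assumptions; the ranges and role names are copied from the rack listing, and the row is taken to be the first letter of the rack name.

```python
from collections import Counter

# (first host number, last host number, role) per rack, as listed above.
RACKS = {
    "A3": [(1414, 1420, "appserver"), (1421, 1422, "API server")],
    "B3": [(1423, 1428, "API server"), (1429, 1433, "appserver")],
    "C3": [(1434, 1436, "appserver")],
    "D8": [
        (1437, 1438, "jobrunner canary"),
        (1439, 1442, "appserver"),
        (1443, 1444, "API server"),
        (1445, 1446, "jobrunner"),
        (1447, 1450, "Canary API server"),
    ],
    "A1": [(1451, 1452, "appserver")],
    "A8": [(1453, 1456, "appserver")],
}


def tally(racks):
    """Count hosts per row (first letter of the rack) and per role."""
    per_row, per_role = Counter(), Counter()
    for rack, ranges in racks.items():
        row = rack[0]
        for first, last, role in ranges:
            count = last - first + 1
            per_row[row] += count
            per_role[role] += count
    return per_row, per_role


if __name__ == "__main__":
    per_row, per_role = tally(RACKS)
    print("per row:", dict(per_row))
    print("per role:", dict(per_role))
    print("total:", sum(per_row.values()))
```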

Details

Repo               Branch      Lines +/-
operations/puppet  production  +3 -13
operations/puppet  production  +1 -1
operations/puppet  production  +4 -899
operations/puppet  production  +16 -1
operations/puppet  production  +10 -1
operations/puppet  production  +1 -1
operations/puppet  production  +45 -0
operations/puppet  production  +1 -1
operations/puppet  production  +13 -11
operations/puppet  production  +4 -4
operations/puppet  production  +7 -0
labs/private       master      +0 -0
operations/puppet  production  +5 -1
operations/puppet  production  +9 -0
operations/puppet  production  +8 -3
operations/puppet  production  +7 -0
operations/puppet  production  +7 -0
operations/puppet  production  +14 -7
operations/puppet  production  +1 -1
operations/puppet  production  +9 -0
operations/puppet  production  +4 -1
operations/puppet  production  +1 -1
operations/puppet  production  +9 -1
operations/puppet  production  +3 -2
labs/private       master      +0 -0
labs/private       master      +0 -0
operations/puppet  production  +14 -1

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 705721 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: add mw1437,mw1438 as canary jobrunners

https://gerrit.wikimedia.org/r/705721

Jelto updated the task description.

Change 705721 merged by Dzahn:

[operations/puppet@production] site/conftool: add mw1437,mw1438 as canary jobrunners

https://gerrit.wikimedia.org/r/705721

Dzahn updated the task description.

Change 705927 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: add mw1439, mw1440 as jobrunners

https://gerrit.wikimedia.org/r/705927

FYI, in case it helps, this is the current row distribution of the API appservers in eqiad:

{'B': 19, 'D': 18, 'C': 17, 'A': 9}

Full details at P16841
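A minimal sketch of how a distribution like this can guide placement: it picks the row with the fewest servers and reports the spread between the fullest and emptiest row. The helper names are hypothetical, and treating "fewest hosts" as the only balancing criterion is an assumption made for illustration.

```python
def least_loaded_rows(distribution):
    """Return the row(s) with the fewest servers, sorted by name."""
    lowest = min(distribution.values())
    return sorted(row for row, count in distribution.items() if count == lowest)


def spread(distribution):
    """Difference between the most and least populated rows."""
    return max(distribution.values()) - min(distribution.values())


if __name__ == "__main__":
    # Row distribution quoted in the comment above.
    api_rows = {"B": 19, "D": 18, "C": 17, "A": 9}
    print("least loaded:", least_loaded_rows(api_rows))
    print("spread:", spread(api_rows))
```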

Change 705943 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] conftool: convert mw1421, mw1422 from app to API servers for balance

https://gerrit.wikimedia.org/r/705943

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1421.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211607_dzahn_31447_mw1421_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1421.eqiad.wmnet']

Of which those FAILED:

['mw1421.eqiad.wmnet']

FYI I've updated the pastes for eqiad and codfw with some more detailed data, all yours now :)

Change 705943 merged by Dzahn:

[operations/puppet@production] conftool: convert mw1421, mw1422 from app to API servers for balance

https://gerrit.wikimedia.org/r/705943

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1421.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107221018_dzahn_8762_mw1421_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1422.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107221024_dzahn_13185_mw1422_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1421.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw1422.eqiad.wmnet']

and were ALL successful.

Change 706485 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] site/conftool: add mw1439,mw1440,mw1441,mw1442 as canary API appservers

https://gerrit.wikimedia.org/r/706485

Change 707252 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: add mw1437 through mw1440 as appservers, rack D8

https://gerrit.wikimedia.org/r/707252

Change 707252 merged by Dzahn:

[operations/puppet@production] site/conftool: add mw1439 through mw1442 as appservers, rack D8

https://gerrit.wikimedia.org/r/707252

Change 707298 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: add mw1443 through mw1446 as API appservers

https://gerrit.wikimedia.org/r/707298

Dzahn updated the task description.

Change 707300 had a related patch set uploaded (by Jelto; author: Jelto):

[labs/private@master] add mcrouter certs for mw1422.eqiad.wmnet to mw1442.eqiad.wmnet

https://gerrit.wikimedia.org/r/707300

Change 707298 merged by Dzahn:

[operations/puppet@production] site/conftool: add mw1443 through mw1446 as API appservers

https://gerrit.wikimedia.org/r/707298

Change 707300 merged by Dzahn:

[labs/private@master] add mcrouter certs for mw1422.eqiad.wmnet to mw1446.eqiad.wmnet

https://gerrit.wikimedia.org/r/707300

Change 705927 abandoned by Dzahn:

[operations/puppet@production] site/conftool: add mw1439, mw1440 as jobrunners

Reason:

already used as appservers

https://gerrit.wikimedia.org/r/705927

Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:15:39Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw1439.eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:15:47Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1439.eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:16:12Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw[1440-1442].eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:16:19Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw[1440-1442].eqiad.wmnet with reason: setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309
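The SAL entries above follow a regular shape: a UTC timestamp in brackets, the author in angle brackets, then the message. A small sketch, assuming that shape holds, extracts those fields and measures how long a cookbook run took between its START and END entries. The regex and the `parse_sal` helper are illustrative, not part of any Wikimedia tooling.

```python
import re
from datetime import datetime, timezone

SAL_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\] <(?P<who>[^>]+)> (?P<msg>.*)"
)


def parse_sal(line):
    """Return (timestamp, author, message) or None if the line does not match."""
    match = SAL_RE.search(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
    return ts.replace(tzinfo=timezone.utc), match.group("who"), match.group("msg")


# The first START/END pair from the timeline above.
START_LINE = (
    "Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:15:39Z] <jelto@cumin1001> "
    "START - Cookbook sre.hosts.downtime for 3:00:00 on mw1439.eqiad.wmnet with reason: "
    "setup new canary mw api servers in eqiad D8 https://phabricator.wikimedia.org/T279309"
)
END_LINE = (
    "Mentioned in SAL (#wikimedia-operations) [2021-07-23T12:15:47Z] <jelto@cumin1001> "
    "END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1439.eqiad.wmnet "
    "with reason: setup new canary mw api servers in eqiad D8 "
    "https://phabricator.wikimedia.org/T279309"
)

if __name__ == "__main__":
    start, end = parse_sal(START_LINE), parse_sal(END_LINE)
    print("author:", start[1])
    print("cookbook runtime:", end[0] - start[0])
```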

@wiki_willy It would be great if someone could next finish the setup of mw1447 through mw1450 and take a look at the special case mw1444, which should be ready but isn't reachable yet. Thanks!

Dzahn triaged this task as High priority.Jul 28 2021, 11:14 AM

Change 708526 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: convert mw1434-mw1436 from API to app servers

https://gerrit.wikimedia.org/r/708526

mw1434 has an issue with IPMI

Remote IPMI failed for mgmt 'mw1434.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'mw1434.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1.
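The wording of that error matches the string form of Python's `subprocess.CalledProcessError`. The sketch below rebuilds the same `ipmitool` command and shows how such a failure could be caught and reported. Whether the reimage tooling wraps the call exactly this way is an assumption, and the function names are hypothetical; `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess


def ipmi_power_status_cmd(mgmt_host, user="root"):
    """Build the ipmitool command shown in the error message above."""
    return [
        "ipmitool", "-I", "lanplus", "-H", mgmt_host,
        "-U", user, "-E", "chassis", "power", "status",
    ]


def check_power_status(mgmt_host):
    """Run the command and return its output, or a readable failure message."""
    cmd = ipmi_power_status_cmd(mgmt_host)
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # str(exc) is "Command '[...]' returned non-zero exit status N."
        return f"Remote IPMI failed for mgmt '{mgmt_host}': {exc}"
    except FileNotFoundError:
        # Raised when ipmitool is not installed on the machine running this.
        return "ipmitool is not installed"
    return result.stdout.strip()


if __name__ == "__main__":
    print(check_power_status("mw1434.mgmt.eqiad.wmnet"))
```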

Change 708526 merged by Dzahn:

[operations/puppet@production] site/conftool: convert mw1434-mw1436 from API to app servers

https://gerrit.wikimedia.org/r/708526

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107281345_dzahn_32210.log.

Completed auto-reimage of hosts:

['mw1435.eqiad.wmnet', 'mw1436.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw1434.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107281439_dzahn_10859.log.

Completed auto-reimage of hosts:

['mw1434.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-07-28T15:53:58Z] <mutante> mw1434,mw1435,mw1436: scap pull, repooled, reimaged, converted from API to appserver for balancing (T279309)

@wiki_willy It would be great if someone could next finish the setup of mw1447 through mw1450 and take a look at the special case mw1444, which should be ready but isn't reachable yet. Thanks!

Due to the balancing between rows, when you install the remaining servers, please put them in A, or at least _not_ in D. Thanks!

@Dzahn Thanks! Putting the rest in A will speed up racking; currently I was waiting on rack C.

Change 709041 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: convert 4 appservers to jobrunners in row D for balance

https://gerrit.wikimedia.org/r/709041

Change 709041 merged by Dzahn:

[operations/puppet@production] site/conftool: convert 4 appservers to jobrunners in row D for balance

https://gerrit.wikimedia.org/r/709041

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['mw1439.eqiad.wmnet', 'mw1440.eqiad.wmnet', 'mw1445.eqiad.wmnet', 'mw1446.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107301356_dzahn_2612.log.

Completed auto-reimage of hosts:

['mw1439.eqiad.wmnet', 'mw1440.eqiad.wmnet', 'mw1445.eqiad.wmnet', 'mw1446.eqiad.wmnet']

and were ALL successful.

Change 709064 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove already installed servers from insetup regex

https://gerrit.wikimedia.org/r/709064

Change 709064 merged by Dzahn:

[operations/puppet@production] site: remove already installed servers from insetup regex

https://gerrit.wikimedia.org/r/709064

Mentioned in SAL (#wikimedia-operations) [2021-08-13T09:35:42Z] <mutante> mw1444 - signed puppet cert, initial run (after hardware fix) T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T09:42:23Z] <mutante> mw1448, mw1449, mw1450 - powering on via mgmt - OS install, initial setup (T279309, T273915)

Change 712928 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: add MAC addresses for mw1448 through mw1456

https://gerrit.wikimedia.org/r/712928

Change 712928 merged by Dzahn:

[operations/puppet@production] DHCP: add MAC addresses for mw1448 through mw1456

https://gerrit.wikimedia.org/r/712928

Mentioned in SAL (#wikimedia-operations) [2021-08-13T11:11:35Z] <jelto> mw1455 - powering on via mgmt - OS install, initial setup (T279309, T273915)

Script wmf-auto-reimage was launched by jelto on cumin1001.eqiad.wmnet for hosts:

mw1455.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202108131211_jelto_16075_mw1455_eqiad_wmnet.log.

Change 712939 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove mw1444 from 'insetup' role

https://gerrit.wikimedia.org/r/712939

Completed auto-reimage of hosts:

['mw1455.eqiad.wmnet']

and were ALL successful.

Change 712939 merged by Dzahn:

[operations/puppet@production] site: remove mw1444 from 'insetup' role

https://gerrit.wikimedia.org/r/712939

Mentioned in SAL (#wikimedia-operations) [2021-08-13T13:21:37Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1447-1449].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T13:21:45Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1447-1449].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T13:21:54Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on mw1450.eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T13:22:03Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1450.eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Change 706485 merged by Jelto:

[operations/puppet@production] site/conftool: add mw1447,mw1448,mw1449,mw1450 as canary API appservers

https://gerrit.wikimedia.org/r/706485

Change 712970 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool: add mw1451 through mw1456 as appservers, A1, A8

https://gerrit.wikimedia.org/r/712970

Change 712970 merged by Dzahn:

[operations/puppet@production] site/conftool: add mw1451 through mw1456 as appservers, A1, A8

https://gerrit.wikimedia.org/r/712970

Mentioned in SAL (#wikimedia-operations) [2021-08-13T17:05:44Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T17:06:02Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T17:32:12Z] <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309

Mentioned in SAL (#wikimedia-operations) [2021-08-13T17:32:19Z] <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on mw[1451-1452,1454-1455].eqiad.wmnet with reason: setup new mediawiki servers in eqiad https://phabricator.wikimedia.org/T279309
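Host expressions such as `mw[1451-1452,1454-1455].eqiad.wmnet` in the entries above pack several hostnames into one string. The following is a simplified sketch of expanding one bracket group into individual hostnames. It is not the implementation the cookbooks use, and the `expand_hosts` helper is hypothetical.

```python
import re

# Exactly one [..] group with numeric ranges or single numbers, comma-separated.
_BRACKET = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expr):
    """Expand 'mw[1451-1452,1454].eqiad.wmnet' into a list of hostnames."""
    match = _BRACKET.match(expr)
    if match is None:
        # No bracket group: treat the expression as a single hostname.
        return [expr]
    hosts = []
    for part in match.group("body").split(","):
        first, _, last = part.partition("-")
        last = last or first  # a single number is a range of one
        width = len(first)    # preserve zero padding, if any
        for number in range(int(first), int(last) + 1):
            hosts.append(
                f"{match.group('prefix')}{number:0{width}d}{match.group('suffix')}"
            )
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("mw[1451-1452,1454-1455].eqiad.wmnet"):
        print(host)
```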

Dzahn updated the task description.

Done! :) All new servers are finally in production now.

Change 713607 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] hieradata::hosts::mw1 cleanup old canary api server hieradata

https://gerrit.wikimedia.org/r/713607

Change 713607 merged by Jelto:

[operations/puppet@production] hieradata: cleanup old canary api server

https://gerrit.wikimedia.org/r/713607

Change 767787 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] scap: Switch mw1306 to mw1318 for scap proxy role

https://gerrit.wikimedia.org/r/767787

Change 767788 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] mw130[2-6]: Remove and decommission

https://gerrit.wikimedia.org/r/767788

Change 767787 merged by Alexandros Kosiaris:

[operations/puppet@production] scap: Switch mw1306 to mw1318 for scap proxy role

https://gerrit.wikimedia.org/r/767787

Change 767788 merged by Alexandros Kosiaris:

[operations/puppet@production] mw130[2-6]: Remove and decommission

https://gerrit.wikimedia.org/r/767788