rack and setup mw1307-1348
Closed, Resolved (Public)

Description

This task will track the racking/setup/installation of 42 new mw hosts for eqiad. These were ordered on T159963. @Joe, please update the task with the desired racking locations. Right now I am thinking of putting half in row A and half in row B.

mw1307-28

  • receive in system on procurement task T159963
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added
  • network port setup
  • operations/puppet update
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation

mw1329-48

  • receive in system on procurement task T159963
  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • production dns entries added
  • network port setup
  • operations/puppet update
  • OS installation
  • puppet/salt accept/initial run
  • handoff for service implementation

Plan to follow outlined in https://phabricator.wikimedia.org/T165519#3289089

Event Timeline

Change 381172 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hieradata::regex: remove some new appservers from the downtime list

https://gerrit.wikimedia.org/r/381172

Change 381172 merged by Elukey:
[operations/puppet@production] hieradata::regex: remove some new appservers from the downtime list

https://gerrit.wikimedia.org/r/381172

Completed auto-reimage of hosts:

['mw1308.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1309.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201709290732_elukey_16598.log.

Completed auto-reimage of hosts:

['mw1309.eqiad.wmnet']

and were ALL successful.

Change 381464 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hieradata::regex: remove mw130[89] from the whitelist appservers

https://gerrit.wikimedia.org/r/381464

Change 381464 merged by Elukey:
[operations/puppet@production] hieradata::regex: remove mw130[89] from the whitelist appservers

https://gerrit.wikimedia.org/r/381464

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1310.eqiad.wmnet', 'mw1311.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710020743_elukey_8217.log.

Completed auto-reimage of hosts:

['mw1310.eqiad.wmnet', 'mw1311.eqiad.wmnet']

and were ALL successful.

Change 381969 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add mw videoscaler hiera config for the new eqiad hosts

https://gerrit.wikimedia.org/r/381969

Change 381969 merged by Elukey:
[operations/puppet@production] Add mw videoscaler hiera config for the new eqiad hosts

https://gerrit.wikimedia.org/r/381969

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1307.eqiad.wmnet', 'mw1318.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710040915_elukey_32136.log.

Completed auto-reimage of hosts:

['mw1307.eqiad.wmnet', 'mw1318.eqiad.wmnet']

and were ALL successful.

elukey updated the task description. Oct 4 2017, 12:11 PM

The mw1307-28 batch has been completed!

elukey updated the task description. Oct 4 2017, 12:24 PM

Change 383315 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove new appserver rule from the hiera regex yaml

https://gerrit.wikimedia.org/r/383315

Change 383315 merged by Elukey:
[operations/puppet@production] Remove new appserver rule from the hiera regex yaml

https://gerrit.wikimedia.org/r/383315

elukey moved this task from In Progress to Stalled on the User-Elukey board. Oct 10 2017, 3:29 PM

After re-reading the task, mw1329-48 are the only hosts left (20). As far as I understand, they should all go in Row C, in the following config:

  • 4 jobrunners
  • 5 app servers
  • 10 api appservers
  • 1 videoscaler

After this batch we'll also be able to decom mw1180-mw1189 (app/api-servers).

Chris, whenever you are ready, let's:

  1. decom mw1161-9 with T177387
  2. rack mw1329-48
  3. decom mw1180-mw1189

Change 392650 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries for new mw hosts mw13[61-69] T165519

https://gerrit.wikimedia.org/r/392650

Change 392650 merged by Cmjohnson:
[operations/dns@master] Adding dns entries for new mw hosts mw13[61-69] T165519

https://gerrit.wikimedia.org/r/392650

Change 393787 had a related patch set uploaded (by BBlack; owner: Cmjohnson):
[operations/dns@master] [Corrected] Adding dns entries for new mw hosts mw13[29-37] T165519

https://gerrit.wikimedia.org/r/393787

Change 393789 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1110 for maintenance

https://gerrit.wikimedia.org/r/393789

Change 393789 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1110 for maintenance

https://gerrit.wikimedia.org/r/393789

Change 393787 merged by BBlack:
[operations/dns@master] [Corrected] Adding dns entries for new mw hosts mw13[29-37] T165519

https://gerrit.wikimedia.org/r/393787

Change 394028 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1110 after arp problems with minimal load

https://gerrit.wikimedia.org/r/394028

Change 394028 abandoned by Jcrespo:
mariadb: Repool db1110 after arp problems with minimal load

https://gerrit.wikimedia.org/r/394028

elukey moved this task from Stalled to In Progress on the User-Elukey board. Dec 12 2017, 8:27 AM

Change 397749 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add mw13[29-37] to site.pp and conftool

https://gerrit.wikimedia.org/r/397749

Change 397749 merged by Elukey:
[operations/puppet@production] Add mw13[29-37] to site.pp and conftool

https://gerrit.wikimedia.org/r/397749

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

mw1329.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201712181322_elukey_14924_mw1329_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1329.eqiad.wmnet']

Of which those FAILED:

['mw1329.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1330.eqiad.wmnet', 'mw1331.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712191034_elukey_29017.log.

Next steps:

  1. image all the hosts in https://gerrit.wikimedia.org/r/397749 and put them in production (January)
  2. decom old row C appservers mw118[0-9]
  3. rack / image / productionize mw13[38-48] (10 api + 1 vs)
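The bracket notation used for host groups in this task (for example mw13[38-48] or mw118[0-9]) can be expanded into individual host names with a small helper. This is only a sketch: the function name `expand_host_range` is hypothetical, and it assumes the bracket holds one inclusive numeric range, not a regex character class such as mw130[89].

```python
import re


def expand_host_range(pattern, domain="eqiad.wmnet"):
    """Expand a pattern like 'mw13[38-48]' into fully qualified host names.

    Assumption: the bracket contains a single inclusive numeric range
    'start-end'. Character classes such as 'mw130[89]' are rejected.
    """
    match = re.fullmatch(r"(\w+?)\[(\d+)-(\d+)\](\w*)", pattern)
    if match is None:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the bounds use it
    return [
        f"{prefix}{number:0{width}d}{suffix}.{domain}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_host_range("mw13[38-48]"):
        print(host)
```

Under that reading, mw13[38-48] covers eleven hosts (mw1338 through mw1348), which matches the "10 api + 1 vs" split in step 3.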

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1330.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712191112_elukey_5719.log.

Completed auto-reimage of hosts:

['mw1330.eqiad.wmnet']

Of which those FAILED:

['mw1330.eqiad.wmnet']

Completed auto-reimage of hosts:

['mw1330.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1332.eqiad.wmnet', 'mw1333.eqiad.wmnet', 'mw1334.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712191306_elukey_608.log.

Completed auto-reimage of hosts:

['mw1334.eqiad.wmnet', 'mw1333.eqiad.wmnet']

Of which those FAILED:

['mw1334.eqiad.wmnet', 'mw1333.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1333.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712201323_elukey_24110.log.

Mentioned in SAL (#wikimedia-operations) [2017-12-20T17:48:56Z] <elukey> new mw jobrunner in production (mw1334) - T165519

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1333.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201712210824_elukey_4338.log.

Completed auto-reimage of hosts:

['mw1333.eqiad.wmnet']

Of which those FAILED:

['mw1333.eqiad.wmnet']

Current status of the hosts:

elukey@puppetmaster1001:~$ sudo -i confctl select 'name=mw133.*.eqiad.wmnet' get | sort
{"mw1330.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}
{"mw1330.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1331.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}
{"mw1331.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1332.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}
{"mw1332.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1333.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}
{"mw1333.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1334.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1334.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1335.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1335.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1336.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1336.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}
{"mw1337.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1337.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}

So the new appservers will go into production in January, but they are essentially ready and on standby. The jobrunners start working right after their first puppet run, so they will be reimaged and put in service in January. mw1334 is the exception: it is the only jobrunner that I mistakenly reimaged and enabled for live traffic in the past few days.
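For a quicker read of output like the confctl listing above, the lines can be grouped by cluster and pooled state. The following is a rough sketch that assumes each line is a standalone JSON object shaped exactly like the pasted output (one host key plus a "tags" string); the helper name `summarize_pool_state` is hypothetical and not part of conftool.

```python
import json
from collections import defaultdict


def summarize_pool_state(lines):
    """Group host names by (cluster, pooled state).

    Assumption: every non-empty line is a JSON object with one host key
    mapping to {"pooled": ..., "weight": ...} and a "tags" key holding a
    comma-separated 'key=value' string, as in the output pasted above.
    """
    summary = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        tags = dict(item.split("=", 1) for item in record.pop("tags").split(","))
        for host, state in record.items():
            summary[(tags["cluster"], state["pooled"])].add(host)
    # Sets remove the apache2/nginx duplicates; sort for stable output.
    return {key: sorted(hosts) for key, hosts in summary.items()}


sample = [
    '{"mw1333.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}',
    '{"mw1334.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=nginx"}',
    '{"mw1335.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}',
]

if __name__ == "__main__":
    for (cluster, pooled), hosts in sorted(summarize_pool_state(sample).items()):
        print(cluster, pooled, hosts)
```

The sample lines are copied from the listing above. Applied to the full listing, this would show mw1334 as the only pooled jobrunner.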

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1335.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801021335_elukey_26408.log.

Mentioned in SAL (#wikimedia-operations) [2018-01-02T13:41:43Z] <elukey> enable live traffic for new appservers mw1329->mw1333 (T165519)

Completed auto-reimage of hosts:

['mw1335.eqiad.wmnet']

Of which those FAILED:

['mw1335.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1335.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801021415_elukey_2254.log.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1335.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801021430_elukey_5399.log.

Completed auto-reimage of hosts:

['mw1335.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1336.eqiad.wmnet', 'mw1337.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801030819_elukey_6168.log.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['mw1336.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801030949_elukey_26329.log.

Completed auto-reimage of hosts:

['mw1336.eqiad.wmnet']

Of which those FAILED:

['mw1336.eqiad.wmnet']

Completed auto-reimage of hosts:

['mw1336.eqiad.wmnet']

and were ALL successful.

Next steps:

  1. image all the hosts in https://gerrit.wikimedia.org/r/397749 and put them in production (January)
  2. decom old row C appservers mw118[0-9]
  3. rack / image / productionize mw13[38-48] (10 api + 1 vs)

Step 1) done, waiting for 2) and 3) now.

elukey moved this task from In Progress to Stalled on the User-Elukey board. Jan 3 2018, 1:30 PM

Change 403425 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] adding dns entries both production and mgmt for mw1338-mw1348.

https://gerrit.wikimedia.org/r/403425

Change 403425 merged by Cmjohnson:
[operations/dns@master] adding dns entries both production and mgmt for mw1338-mw1348.

https://gerrit.wikimedia.org/r/403425

Change 403691 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mac addresses for mw1338-48

https://gerrit.wikimedia.org/r/403691

Change 403691 merged by Cmjohnson:
[operations/puppet@production] Adding mac addresses for mw1338-48

https://gerrit.wikimedia.org/r/403691

Cmjohnson updated the task description. Jan 11 2018, 4:30 PM

The final 10 servers have been racked, and 9 of the 10 are now ready to be installed. There is an issue with the iDRAC setup on mw1340, but it will be addressed today.
You can start installing the 9 now or wait for mw1340.

Change 403715 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding mw1340 to dhcp file

https://gerrit.wikimedia.org/r/403715

Change 403715 merged by Cmjohnson:
[operations/puppet@production] Adding mw1340 to dhcp file

https://gerrit.wikimedia.org/r/403715

Cmjohnson assigned this task to elukey. Jan 11 2018, 6:01 PM

Assigning this to @elukey to complete the installs.

Cmjohnson closed subtask T183895: Decommission mw1180-1200 as Resolved.

Change 403928 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] site.pp: add mw1338->48

https://gerrit.wikimedia.org/r/403928

Change 403928 merged by Giuseppe Lavagetto:
[operations/puppet@production] site.pp: add mw1338->48

https://gerrit.wikimedia.org/r/403928

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

mw1338.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201801151143_oblivian_2512_mw1338_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1338.eqiad.wmnet']

Of which those FAILED:

['mw1338.eqiad.wmnet']

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

mw1338.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201801151326_oblivian_25183_mw1338_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['mw1339.eqiad.wmnet', 'mw1340.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801151344_oblivian_25264.log.

Completed auto-reimage of hosts:

['mw1338.eqiad.wmnet']

Of which those FAILED:

['mw1338.eqiad.wmnet']

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['mw1339.eqiad.wmnet', 'mw1341.eqiad.wmnet', 'mw1342.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801160725_oblivian_12938.log.

Completed auto-reimage of hosts:

['mw1341.eqiad.wmnet', 'mw1342.eqiad.wmnet', 'mw1339.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['mw1343.eqiad.wmnet', 'mw1344.eqiad.wmnet', 'mw1345.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801160925_oblivian_10565.log.

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

['mw1346.eqiad.wmnet', 'mw1347.eqiad.wmnet', 'mw1348.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201801161014_oblivian_14174.log.

Completed auto-reimage of hosts:

['mw1345.eqiad.wmnet', 'mw1344.eqiad.wmnet', 'mw1343.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

mw1347.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201801161134_oblivian_5271_mw1347_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1347.eqiad.wmnet']

Of which those FAILED:

['mw1347.eqiad.wmnet']

Completed auto-reimage of hosts:

['mw1347.eqiad.wmnet']

Of which those FAILED:

['mw1347.eqiad.wmnet']

Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts:

mw1347.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201801161148_oblivian_8341_mw1347_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1347.eqiad.wmnet']

Of which those FAILED:

['mw1347.eqiad.wmnet']
Joe moved this task from Blocked on others to Doing on the User-Joe board. Jan 16 2018, 1:26 PM

Joe updated the task description.

Joe closed this task as Resolved. Jan 16 2018, 1:31 PM