Reconfigure hardware and reimage restbase201[3-8].codfw.wmnet
Closed, Resolved (Public)

Description

During the approval process, the Platform Engineering team was still debating per-host storage density internally. In the interest of making sure the hardware would arrive on time, @mark approved the order as-is, with five 1.92T SSDs per host. The outcome of those discussions was ultimately that we standardize on 5T/host, so 2 SSDs should have been removed from each host before their setup was completed. This was not communicated; it was entirely my fault (@Eevans), and I apologize for the wasted effort and lost time (I know we're pressed for time on this).

None of the 6 hosts are in use, so they can be taken down at any time. Can we please have 2 SSDs removed per host, leaving each with 3 x 1.92T SSDs, and then have them re-imaged?

NOTE: This will require partman changes

Event Timeline

Eevans triaged this task as High priority. Nov 30 2018, 5:57 PM
Eevans created this task.
Eevans added a project: User-Eevans.

Change 476912 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Partman: added 3SSD JBOD config for restbase201[3-8]

https://gerrit.wikimedia.org/r/476912

Change 476915 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] hieradata: reconfigure restbase2013 for 3-SSD JBOD

https://gerrit.wikimedia.org/r/476915

Set downtime in Icinga for 4 days (345600 seconds) for restbase2014 through restbase2018:

[icinga1001:~] $ for server in $(seq 14 18); do echo restbase20${server}.codfw.wmnet; sudo icinga-downtime -h restbase20${server} -r https://phabricator.wikimedia.org/T210863 -d 345600; done
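For anyone repeating this, here is a minimal dry-run sketch of the same loop. It only prints the commands rather than running them, and it derives the `-d` value from a number of days instead of hard-coding it. The variable names (`days`, `duration`, `host`) are illustrative and not part of the original command; the `icinga-downtime` flags are copied from the command above as-is.

```shell
# Dry-run sketch: prints the downtime commands instead of executing them.
# Variable names are illustrative; flags are taken from the command above.
days=4
duration=$((days * 24 * 60 * 60))   # 4 days expressed in seconds

for n in $(seq 14 18); do
    host="restbase20${n}"
    echo "sudo icinga-downtime -h ${host} -r https://phabricator.wikimedia.org/T210863 -d ${duration}"
done
```

Once the printed commands look right, dropping the `echo` wrapper would run them for real on the Icinga host.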

Change 476912 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: add 3SSD JBOD config for restbase201[3-8]

https://gerrit.wikimedia.org/r/476912

Change 476915 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: reconfigure restbase2013 for 3-SSD JBOD

https://gerrit.wikimedia.org/r/476915

Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts:

restbase2013.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201812030809_filippo_36226_restbase2013_codfw_wmnet.log.

Change 477214 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use 3 ssd as jbod for restbase201[3-8]

https://gerrit.wikimedia.org/r/477214

Change 477214 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use 3 ssd as jbod for restbase201[3-8]

https://gerrit.wikimedia.org/r/477214

This is completed, thanks @Papaul and all involved.

Completed auto-reimage of hosts:

['restbase2013.codfw.wmnet']

Of which those FAILED:

['restbase2013.codfw.wmnet']

This is expected: I disabled puppet after the reboot, so wmf-reimage-host couldn't complete successfully.

RobH closed subtask Unknown Object (Task) as Declined. Jan 25 2019, 9:00 PM