
implement wdqs1001/1002 disk upgrades (extend lvm)
Closed, Resolved (Public)

Description

Once T120712 is completed, the newly installed disks will need to be set up as a software RAID1, and the LVM extended onto these disks for increased storage space.

Discussion on past task T119579 with @Smalyshev shows that we need to extend the space in place rather than reimage these systems at this time. Once the holidays are past and everyone on his team is back from vacation, we can look at reimaging these one at a time (in January).

So this task can track initial implementation, and then stall until January for reimage.
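
For reference, the in-place extension described above (software RAID1 on the new disks, then growing the LVM onto it) could be sketched roughly as below. This is only a dry-run plan builder, not what was actually run on these hosts: the device names, array name, volume group and logical volume are all hypothetical placeholders, and nothing is executed.

```python
import shlex


def plan_lvm_extension(new_disks, md_device, volume_group, logical_volume):
    """Return the commands (as argument lists) for an in-place LVM extension.

    Nothing is executed here; the caller reviews the plan and runs it by hand.
    All device and volume names passed in are assumptions for illustration.
    """
    if len(new_disks) != 2:
        raise ValueError("this sketch expects exactly two new disks for RAID1")
    return [
        # Mirror the two new disks into one software RAID1 array.
        ["mdadm", "--create", md_device, "--level=1", "--raid-devices=2", *new_disks],
        # Turn the new array into an LVM physical volume.
        ["pvcreate", md_device],
        # Add that physical volume to the existing volume group.
        ["vgextend", volume_group, md_device],
        # Grow the logical volume over all free extents; -r also resizes the filesystem.
        ["lvextend", "-r", "-l", "+100%FREE", f"/dev/{volume_group}/{logical_volume}"],
    ]


if __name__ == "__main__":
    plan = plan_lvm_extension(
        new_disks=["/dev/sdc", "/dev/sdd"],  # hypothetical new disks
        md_device="/dev/md2",                # hypothetical array name
        volume_group="vg0",                  # hypothetical volume group
        logical_volume="wdqs",               # hypothetical logical volume
    )
    for cmd in plan:
        print(shlex.join(cmd))
```

Printing the plan instead of running it keeps the sketch safe to try anywhere; the real names would have to be checked on the host (for example with `lsblk`, `vgs` and `lvs`) before running anything.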

Event Timeline

RobH raised the priority of this task to Medium.
RobH updated the task description.

@RobH this has been somewhat neglected, can we resurrect it and do the space extension? Or do we have to do a full reimage?

At first glance, neither of these is showing the extra disks. It turns out these have the H310 controller, which means we need to enable the disks in the controller. If this were a typical RAID controller, I would just do this while they are in service. The H310 is not one, however, which is why we run it in bypass mode. Due to my well-founded caution in dealing with the H310 controllers, I would rather we schedule a potential downtime window for attempting to enable the disks and extending the LVM manually. Please note that manually extending the LVM is non-ideal compared to a re-image (which was proposed to happen before or around now anyhow, if possible).

The benefit of the re-image is that we would apply a new automatic partitioning recipe to these systems in production use. If a system crashes and requires a replacement/re-installation in the future, we have a known good recipe. Right now we have a known good recipe, but we would then manually extend the other disks into it. This is non-ideal for long-term use, as we cannot automatically re-create the setup on another system.

Please advise if we can schedule potential downtime on one of the hosts to tinker with the H310 to enable the disks. Alternatively, we can take one offline and reimage it entirely, which would give the above listed benefits.

How long would reimaging take? In principle, I'm OK with it but would like to know when it can happen and for how long it would be offline.

Barring any issues during the reimage (as long as nothing breaks), it can typically be reinstalled in around an hour. This includes enabling the new disks, adding/updating the partman recipe (which we can do the bulk of before downtime), reinstalling the OS, and the initial post-OS puppet runs.

This doesn't include service re-implementation or the transfer of any data to or from the machine before or after reinstallation.

OK, that sounds good. I spoke today with the Blazegraph folks, and it looks like we will have an implementation of geospatial indexing by the end of the month, which will then require a full database reload. So we probably want to do these two things at the same time. The timing, however, is not great, since we have the datacenter switch during the week of the 21st, then the hackathon during the week of the 28th and the following week. So for now I'd target the second week of April. Does that sound good?

That works for me, and I'll be around that entire week to assist on the operations end of things.

Would we be able to include the re-image in that downtime window, or are we not certain about that quite yet?

Yes, that's the plan - to do both reimage & full reload.

Change 282966 had a related patch set uploaded (by Gehel):
Increase max size for /var/lib/wdqs to take all remaining space

https://gerrit.wikimedia.org/r/282966

Change 282966 merged by Gehel:
Increase max size for /var/lib/wdqs to take all remaining space

https://gerrit.wikimedia.org/r/282966

Reinstall in progress to account for new disks in wdqs1002. The wdqs1001 server will still need to be reinstalled later on...

Change 288402 had a related patch set uploaded (by Gehel):
Depooling wdqs1001 for reinstall and new disks

https://gerrit.wikimedia.org/r/288402

Change 288407 had a related patch set uploaded (by Gehel):
Increased wdqs partition now that we have new disks ready.

https://gerrit.wikimedia.org/r/288407

Change 288407 merged by Gehel:
Increased wdqs partition now that we have new disks ready.

https://gerrit.wikimedia.org/r/288407

Change 288402 merged by Gehel:
Depooling wdqs1001 for reinstall and new disks

https://gerrit.wikimedia.org/r/288402

Mentioned in SAL [2016-05-12T16:23:58Z] <gehel> removing wdqs1001 from rotation for disk upgrade (T120714)

Smalyshev claimed this task.