
reimage wdqs1003 / wdqs200[123] with RAID
Closed, Resolved · Public

Description

Disk configuration on the current production WDQS servers is inconsistent:

  • wdqs200[12] have a RAID controller (LSI Logic / Symbios Logic MegaRAID SAS-3 3108) but we are not using it. (I do remember a discussion with @RobH about those controllers, where the conclusion was that they are low-end enough that software raid would be a better idea.)
  • wdqs100[45] are configured with software raid (raid1-lvm-ext4-srv-noswap)
  • wdqs1003 / wdqs200[123] are configured with no raid (lvm-ext-srv)

Obviously, we should have some form of raid on all those servers. Moving all of them to raid1-lvm-ext4-srv-noswap seems like the right solution. This will require a reimage.
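
The host groups above use bracket shorthand (wdqs200[123] stands for wdqs2001, wdqs2002 and wdqs2003). As a minimal sketch, the following Python expands that shorthand and lists the hosts that are not yet on the proposed recipe. The helpers `expand_hosts` and `hosts_to_reimage` are hypothetical names for illustration, not part of any existing tooling; the recipe mapping is copied from the list above.

```python
import re


def expand_hosts(pattern):
    """Expand a single-bracket shorthand such as 'wdqs200[123]' into hostnames.

    Only one bracket group of single characters is handled; ranges such as
    [1-3] are not supported in this sketch.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: already a plain hostname
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{c}{suffix}" for c in chars]


# Recipe per host group, as listed in the task description.
CURRENT_RECIPES = {
    "wdqs100[45]": "raid1-lvm-ext4-srv-noswap",
    "wdqs1003": "lvm-ext-srv",
    "wdqs200[123]": "lvm-ext-srv",
}


def hosts_to_reimage(recipes, target):
    """Return the sorted hostnames whose current recipe differs from target."""
    hosts = []
    for pattern, recipe in recipes.items():
        if recipe != target:
            hosts.extend(expand_hosts(pattern))
    return sorted(hosts)


if __name__ == "__main__":
    print(hosts_to_reimage(CURRENT_RECIPES, "raid1-lvm-ext4-srv-noswap"))
```

With the mapping above, this lists wdqs1003 and wdqs200[123], which matches the hosts named in the task title.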

@RobH : could you confirm that using raid1-lvm-ext4-srv-noswap for all nodes and not the hardware raid controller is reasonable?

Details

Related Gerrit Patches:
[operations/puppet@production] wdqs: migrate to stretch
[operations/puppet@production] changing all wdqs nodes to use similar partman recipes

Event Timeline

Gehel created this task. Mar 8 2018, 10:20 AM
Restricted Application added a subscriber: Aklapper. Mar 8 2018, 10:20 AM
MoritzMuehlenhoff triaged this task as Medium priority. Mar 8 2018, 11:03 AM
RobH added a comment. Mar 8 2018, 6:16 PM

Ok, just to confirm everything:

wdqs200[12] have a RAID controller (LSI Logic / Symbios Logic MegaRAID SAS-3 3108) but we are not using it. (I do remember a discussion with @RobH about those controllers, where the conclusion was that they are low-end enough that software raid would be a better idea.)

That is correct. Those nodes were ordered on T139482 and, according to the order linked there, have just the onboard SATA controller. We should set those to sw raid, as it's not a real hw raid controller.

As far as standardization goes, there has been growing support since our last SRE on-site for including swap in our partman configurations again. I'd therefore recommend we move to raid1-lvm-ext4-srv for all of the wdqs nodes. I realize that using swap is now the opposite of my stance a year ago, but things change!

Change 417346 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] changing all wdqs nodes to use similar partman recipes

https://gerrit.wikimedia.org/r/417346

Change 417346 merged by RobH:
[operations/puppet@production] changing all wdqs nodes to use similar partman recipes

https://gerrit.wikimedia.org/r/417346

EBjune added a subscriber: EBjune. Mar 8 2018, 7:49 PM
RobH added a comment. Mar 8 2018, 8:00 PM

I had a typo in that patchset, but I followed it up with a fix (just neglected to link the bug in the fix's commit message.)

Change 417899 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: migrate to stretch

https://gerrit.wikimedia.org/r/417899

Change 417899 merged by Gehel:
[operations/puppet@production] wdqs: migrate to stretch

https://gerrit.wikimedia.org/r/417899

Restricted Application added projects: Wikidata, Discovery. Mar 13 2018, 12:48 AM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Mar 13 2018, 4:39 PM
Gehel claimed this task. Mar 13 2018, 5:27 PM

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804170756_gehel_7421.log.

Completed auto-reimage of hosts:

['wdqs1003.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804171424_gehel_1099.log.

Completed auto-reimage of hosts:

['wdqs2001.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2018-04-17T14:55:49Z] <gehel> starting data reimport after re-image for wdqs2001 - T189192

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804190930_gehel_11239.log.

Completed auto-reimage of hosts:

['wdqs2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804191254_gehel_24890.log.

Completed auto-reimage of hosts:

['wdqs2003.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804200748_gehel_8274.log.

Completed auto-reimage of hosts:

['wdqs1005.eqiad.wmnet']

and were ALL successful.

All wdqs servers are now running RAID on Debian Stretch. Data is fully reloaded.
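
One way to spot-check a statement like this on an individual host is to read /proc/mdstat, where the Linux md driver lists software RAID arrays. The following is a minimal sketch that assumes the usual mdstat layout. `raid1_arrays` is a hypothetical helper, and the sample text is illustrative, not captured from a wdqs host.

```python
import re


def raid1_arrays(mdstat_text):
    """Return {array_name: member_state} for active raid1 arrays.

    member_state is the bracketed string on the line after the array header,
    e.g. 'UU' when both mirrors are up or 'U_' when one member is missing.
    Header variants such as 'active (auto-read-only)' are not handled here.
    """
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : active raid1 ", line)
        if header:
            current = header.group(1)
            arrays[current] = None
            continue
        state = re.search(r"\[([U_]+)\]\s*$", line)
        if current and state:
            arrays[current] = state.group(1)
            current = None
    return arrays


# Illustrative sample only; on a real host, read the text from /proc/mdstat.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(raid1_arrays(SAMPLE))  # prints {'md0': 'UU'}
```

An empty result would indicate that no active raid1 array was found, and any state containing `_` would indicate a degraded mirror.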

Great, thanks!

Smalyshev closed this task as Resolved. Apr 26 2018, 4:24 AM