Audit/fix hosts with no RAID configured
Closed, Resolved · Public

Description

As part of the RAID work (T84050 etc.), I added a fact that encodes all the RAID configured on every server, in a comma-separated list.

Auditing that list via servermon's fact query, filtered to hosts where no RAID is configured and is_virtual is false, results in this:

  • bast4001.wikimedia.org (T133699)
  • chromium.wikimedia.org & hydrogen.wikimedia.org (T123727)
  • labnet1002.eqiad.wmnet (T136718)
  • lvs100[123456].wikimedia.org, being decom'd in favor of new hardware in T184293 (T136737)
  • magnesium.wikimedia.org
  • maps-test200[1234].codfw.wmnet (T140440) actually had RAID but was not detected
  • mw* (T106381)
  • osmium.eqiad.wmnet (T132530)
  • rcs100[12].eqiad.wmnet (T140441)
  • rdb100[56].eqiad.wmnet (T140442)
  • snapshot100[1234].eqiad.wmnet (T140439)
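The filter described above can be sketched roughly as follows. This is only an illustration: the input shape (a dict of per-host facts), the fact name `raid`, and the example hostnames other than chromium are assumptions, not servermon's actual query interface.

```python
from typing import Mapping


def hosts_without_raid(facts: Mapping[str, Mapping[str, str]]) -> list[str]:
    """Return physical hosts whose (hypothetical) 'raid' fact is missing or empty."""
    missing = []
    for host, host_facts in facts.items():
        # Facts are assumed to be strings, e.g. "true" / "false".
        is_virtual = str(host_facts.get("is_virtual", "false")).lower() == "true"
        # The RAID fact is a comma-separated list; an empty list means no RAID.
        raid = [r.strip() for r in host_facts.get("raid", "").split(",") if r.strip()]
        if not is_virtual and not raid:
            missing.append(host)
    return sorted(missing)


# Hypothetical sample data, for illustration only.
example = {
    "chromium.wikimedia.org": {"is_virtual": "false", "raid": ""},
    "example-md.eqiad.wmnet": {"is_virtual": "false", "raid": "md"},
    "example-vm.eqiad.wmnet": {"is_virtual": "true", "raid": ""},
}
print(hosts_without_raid(example))
```

Virtual machines are skipped because their redundancy is expected to come from the hypervisor's storage, not from the guest.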

This list is quite troubling; it's also troubling that for many of those hosts, we do have second disks, but they were never configured. That's the case for e.g. rdb1005/rdb1006 or chromium:

root@chromium:~# fdisk  -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000081

   Device Boot      Start         End      Blocks   Id  System
root@chromium:~#

All of the above should be audited and reformatted to use RAID, where needed.

The list is by no means exhaustive; there are other hosts that are reported as having a RAID controller, but perhaps they are configured as JBOD (or single-disk RAID0s). Hosts with just the "mpt" controller reported are especially susceptible to that (e.g. labnet1001 was not on the list above, but has no RAID configured).
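A possible way to flag such hosts for a manual look, as a minimal sketch: it assumes the RAID fact is the comma-separated string described earlier and treats a host reporting only "mpt" as suspect. The function name and the default suspect set are hypothetical. A reported controller says nothing about whether the disks are actually mirrored, so this only narrows down where to look.

```python
def needs_manual_check(raid_fact: str, suspect=frozenset({"mpt"})) -> bool:
    """True if every reported controller is in the suspect set (e.g. only 'mpt')."""
    controllers = {c.strip() for c in raid_fact.split(",") if c.strip()}
    # An empty fact is already covered by the "no RAID" audit, so it is not flagged here.
    return bool(controllers) and controllers <= suspect


print(needs_manual_check("mpt"))      # only the mpt controller reported
print(needs_manual_check("mpt, md"))  # software RAID is also present
```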

Event Timeline

faidon created this task. · May 30 2016, 3:36 PM
Restricted Application added subscribers: Zppix, Aklapper. · May 30 2016, 3:36 PM
faidon updated the task description. · May 30 2016, 3:53 PM
faidon updated the task description. · Jun 1 2016, 5:13 PM
faidon added subscribers: ArielGlenn, chasemp.

@Dzahn, could you perhaps take on some of these as part of your jessie-fication/un-precise-fication effort?

yep, magnesium will soon be gone. then chromium and hydrogen will be next

Dzahn updated the task description. · Jun 4 2016, 1:05 AM

magnesium shut down, so that's off the list

faidon updated the task description. · Jun 4 2016, 1:52 AM
faidon assigned this task to Dzahn. · Jun 30 2016, 8:50 AM

Change 299879 had a related patch set uploaded (by Dzahn):
install_server: let osmium use mw-raid1 partman

https://gerrit.wikimedia.org/r/299879

Change 299879 merged by Dzahn:
install_server: let osmium use mw-raid1 partman

https://gerrit.wikimedia.org/r/299879

Dzahn updated the task description. · Jul 20 2016, 5:27 AM

osmium has software RAID1 now (from mw-raid1 partman recipe)

Dzahn added a comment. · Edited · Aug 23 2016, 10:47 PM

[hydrogen:~] $ grep active /proc/mdstat
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
md0 : active raid1 sda1[0] sdb1[1]

Dzahn updated the task description. · Aug 24 2016, 11:28 PM

root@chromium:~# grep active /proc/mdstat
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
md0 : active raid1 sda1[0] sdb1[1]

Change 307755 had a related patch set uploaded (by Andrew Bogott):
Switch primary nova-network and nova-api server over to labnet1001

https://gerrit.wikimedia.org/r/307755

Change 307755 merged by Andrew Bogott:
Switch primary nova-network and nova-api server over to labnet1001

https://gerrit.wikimedia.org/r/307755

Change 307772 had a related patch set uploaded (by Andrew Bogott):
Install labs::openstack::nova::network on labnet1001

https://gerrit.wikimedia.org/r/307772

Change 307772 merged by Andrew Bogott:
Install labs::openstack::nova::network on labnet1001

https://gerrit.wikimedia.org/r/307772

ArielGlenn updated the task description. · Oct 26 2016, 3:33 PM
faidon updated the task description. · Nov 8 2016, 4:57 PM
fgiunchedi updated the task description. · Apr 30 2018, 7:33 AM

@Dzahn, what's the status of this? I did a cursory search and saw both existing critical hosts with no RAID (like rdb1005/6, mentioned above) and a couple of new hosts that were introduced since this task was filed (lvs2009/lvs2010). Perhaps we need a monitoring check for this, to avoid such hosts being introduced again?

Regarding rdb1005/1006 i have pinged in May, August and September on T140442#4186806 because to me it was the last one keeping this ticket open and i would like to finally close it.

I have admittedly not checked for new ones being introduced meanwhile. I agree adding monitoring for it makes sense, i will create a separate task for that.
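For the software RAID case, a monitoring check along these lines could look roughly like the sketch below. It is not the check tracked in the separate task. It only parses /proc/mdstat text in the format shown earlier in this thread, so it says nothing about hardware controllers. The function names are hypothetical, and the return codes follow the usual Nagios-style convention (0 = OK, 2 = CRITICAL).

```python
import re

# Matches lines such as "md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]".
ARRAY_RE = re.compile(
    r"^(md\d+) : active (?:\([^)]*\) )?(raid\d+|linear)\b", re.MULTILINE
)


def active_arrays(mdstat_text: str) -> dict[str, str]:
    """Map each active md device to its RAID level."""
    return dict(ARRAY_RE.findall(mdstat_text))


def check(mdstat_text: str) -> tuple[int, str]:
    """Return (status_code, message); raid0/linear are not counted as redundant."""
    arrays = active_arrays(mdstat_text)
    redundant = sorted(n for n, lvl in arrays.items() if lvl not in ("raid0", "linear"))
    if redundant:
        return 0, "OK: redundant md arrays active: " + ", ".join(redundant)
    return 2, "CRITICAL: no redundant md array active"


# Sample taken from the mdstat output pasted above.
sample = (
    "md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]\n"
    "md0 : active raid1 sda1[0] sdb1[1]\n"
)
print(check(sample))
```

On a real host the input would come from reading /proc/mdstat; the sample string is used here so the sketch runs anywhere.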

Regarding lvs2009/2010 there is now T205970 but i would say it's part of T196560 which is still open / in progress.

In general i would say we should use or reopen the existing installation tickets when this happens rather than keeping this mega ticket for all things RAID.

Dzahn updated the task description. · Dec 4 2018, 5:40 PM
Dzahn closed this task as Resolved. · Dec 4 2018, 5:43 PM

all subtasks here are resolved, the specific ones mentioned were rdb1005/1006 (closed) and lvs2009/2010 (setup in progress). We still want T206131 to add monitoring for it, so that would show us any new ones. ok to resolve here?