
rack upgraded storage capacity in labstore100[67].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the addition of one D3600 disk shelf to each of labstore1006 and labstore1007 (the order came in with 2 shelves, one for each system).

Due to the location of labstore systems, some other systems may require downtime or shifting within the rack to ensure the shelves are located close enough to labstore systems for use.

labstore1006 upgrade:

  • - receive in D3600 on procurement task T193079, name labstore1006-array2
  • - review placement of labstore1006, new shelf should be able to chain into the system off the existing shelf. The rack for this is VERY full, so this may require a bit of shuffling to make this work. Please note it can move within the same row without any software changes to the system (just switch stack)
  • - downtime and relocate any other servers as needed to ensure the new shelf fits in the rack with its labstore system
  • - rack new shelf, daisy chain off existing shelf
  • - update raid to add new shelf to array for the other external shelf
  • - boot system into OS and ensure OS can see new shelf
  • - handoff to cloud team for them to format and extend their LVM of data onto the new shelf

labstore1007 upgrade:

  • - receive in D3600 on procurement task T193079, name labstore1007-array2
  • - review placement of labstore1007; it is likely easier to downtime labstore1007 and its shelf, moving them to U14-19 and relocating backup1001 (new; it and its shelves aren't racked yet) up in the rack. Please note backup1001 will have two 2U shelves.
  • - downtime and relocate any other servers as needed to ensure the new shelf fits in the rack with its labstore system
  • - rack new shelf, daisy chain off existing shelf
  • - update raid to add new shelf to array for the other external shelf
  • - boot system into OS and ensure OS can see new shelf
  • - handoff to cloud team for them to format and extend their LVM of data onto the new shelf

Event Timeline

RobH triaged this task as Lowest priority.
RobH raised the priority of this task from Lowest to Medium.
RobH created this task.
RobH created this object in space Restricted Space.
RobH updated the task description. (Show Details)

@faidon Can I move flerovium higher in the rack in D2? Since the disk shelves are not here, it's just the 1U server. This would be much easier than having to move both labstore1007 and its current array to make room for another disk shelf.

I am not sure whether this was thought out, or by whom, but labstore1006 will need to have ganeti1005, contint1001 and oresdb1002 moved to make room for the additional disk shelf. The downtime for each will be about 5 minutes.

@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)

RobH shifted this object from the Restricted Space space to the S1 Public space.
Cmjohnson renamed this task from upgrade storage capacity in labstore100[67].eqiad.wmnet to rack upgraded storage capacity in labstore100[67].eqiad.wmnet. Jun 11 2018, 3:39 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.

I was able to relocate a few servers in D2 to make room for the new disk shelf (LS1007). For LS1006, I just removed 2 decom'd servers from U24 and U25. This is close enough to labstore1006 that I do not need to move any other servers. They are racked w/ asset tags.

We ran into trouble here:

  • RAID issues and errors were reported, and the /srv/dumps path went read-only (ro)
  • Chris set the shelves back to how they were before
  • labstore1006 & 7 are in the same seemingly bad state w/ /srv/dumps in ro
    • We are able to read spot-checked dumps data from 1006 in toolforge
  • we decided to reboot labstore1007 and it came up in a recovery console
  • @Ariel called faidon, who hopped on to help debug

Currently labstore1007 is in recovery mode with @faidon on console, and labstore1006 is yet to be rebooted.

labstore1007 has been restored to service and NFS clients and web users are pointed at it (https://gerrit.wikimedia.org/r/c/operations/puppet/+/442913)

At the console @faidon chose the option to 'repair' disks that had been marked bad. We think that, since these disks were not actually bad, this is essentially a no-op. The repair seems to have restored the disks to normal use. We determined that bringing labstore1007 back online and verifying data integrity was most important. Once all clients were pointed at labstore1007, labstore1006 was rebooted with the same treatment. We did determine that the new shelves were originally cabled incorrectly, resulting in the errors.

Remaining issues:

@Ariel is currently running fsck on labstore1006 while it is depooled.

We need to schedule work on the remaining issues on labstore1006, and then do the same for labstore1007.

Big thank you to @faidon, @Cmjohnson, @Bstorm, and @Ariel

Cabling information grabbed from these two documents: D3600 manual: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04219600-1.pdf
D3000 series wiring guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05252635

Both seem to suggest that a redundant path configuration is possible with multiple enclosures, with no such option directly suggested for a single enclosure (which is surprising, but not impossible, considering the wiring paths). That is probably a moot point once the additional shelves are set up.

As a follow-up, I had to force umount labstore1006's NFS mountpoint on stat100[5,6] and notebook100[3,4], since the system load skyrocketed and there were no more open files left (probably some application was reading from dumps when the issue happened?).

Fsck on labstore1006 completed last night; it did not take long.

labstore1006 after 'repair' of logical drives during hp boot sequence, and umount of the filesystem:
root@labstore1006:~# fsck -fv /dev/mapper/data-dumps
fsck from util-linux 2.25.2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

     1564624 inodes used (0.14%, out of 1098842112)
       50870 non-contiguous files (3.3%)
        4548 non-contiguous directories (0.3%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 1521049/13568/1
 11065992062 blocks used (62.94%, out of 17581465600)
           0 bad blocks
        2686 large files

     1474006 regular files
       53022 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
       37587 symbolic links (29998 fast symbolic links)
           0 sockets
------------
     1564615 files

I have since reenabled puppet (which would have tried to remount in the middle of the fsck).
@chasemp I added labstore1006 back to statistics_servers in hiera to shut up the incessant whining from the rsync on there that tries to pull from stat1005.
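For reference, the block-usage percentage in the fsck summary above can be reproduced from the two block counts it reports. This is only a small shell sketch: the helper name `pct_used` is made up for illustration, and it just redoes the arithmetic rather than running fsck.

```shell
#!/bin/sh
# Hypothetical helper (not from the task): percentage of blocks in use,
# computed from the "blocks used" and "out of" figures that e2fsck prints.
pct_used() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * used / total }'
}

# Figures copied from the labstore1006 fsck output above.
pct_used 11065992062 17581465600
```

The result should agree with the 62.94% that e2fsck printed, i.e. a bit over a third of the filesystem was still free before the new shelf was added.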

Vvjjkkii renamed this task from rack upgraded storage capacity in labstore100[67].eqiad.wmnet to sgbaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
ArielGlenn renamed this task from sgbaaaaaaa to rack upgraded storage capacity in labstore100[67].eqiad.wmnet. Jul 1 2018, 6:26 AM
ArielGlenn assigned this task to Cmjohnson.
ArielGlenn lowered the priority of this task from High to Medium.
ArielGlenn updated the task description. (Show Details)

New shelf is now live and part of the /srv/dumps filesystem on labstore1006. It isn't fully restored to service yet, but everything looks good to do so.
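The task does not record the exact commands used to fold the new shelf into /srv/dumps. The sketch below is one plausible outline of the "format and extend their LVM" handoff step, not what was actually run. It assumes the volume group is `data` and the logical volume is `dumps` (inferred from the `/dev/mapper/data-dumps` path in the fsck output), that the filesystem is ext4, and that the new shelf shows up as `/dev/sdX`, which is a placeholder. It only prints the plan and changes nothing.

```shell
#!/bin/sh
# Dry-run sketch: prints the LVM steps instead of executing them.
# NEW_PV is a placeholder device name, NOT taken from the task.
NEW_PV="${NEW_PV:-/dev/sdX}"
VG="data"    # assumed from /dev/mapper/data-dumps
LV="dumps"   # assumed from /dev/mapper/data-dumps

plan_extend() {
    echo "pvcreate $NEW_PV"                     # label the new logical drive as a PV
    echo "vgextend $VG $NEW_PV"                 # add it to the existing volume group
    echo "lvextend -l +100%FREE /dev/$VG/$LV"   # grow the LV into the new space
    echo "resize2fs /dev/mapper/$VG-$LV"        # grow the ext4 filesystem to match
}

plan_extend
```

Printing the plan first makes it easy to review the device name before anything touches the array.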

Change 446476 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps: fail over dumps web to labstore1006

https://gerrit.wikimedia.org/r/446476

Change 446497 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446497

The cert on labstore1006 was valid from Mar 14 through Jun 12, dead already. The command

ariel@labstore1006:/etc/acme/cert$ openssl x509 -inform PEM -text -in dumps.crt

gave me the info.

I just assumed it probably was, if acme wasn't running.
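A narrower way to get the same answer, sketched with standard `openssl x509` options: `-dates` prints only the validity window, and `-checkend 0` sets the exit status according to whether the certificate has already expired. The function names are made up for illustration; the path in the usage comment is the one from the comment above.

```shell
#!/bin/sh
# Hypothetical helpers (names are not from the task).

# Print only the notBefore/notAfter lines of a PEM certificate.
cert_dates() {
    openssl x509 -noout -dates -in "$1"
}

# Exit 0 if the certificate is still valid right now, non-zero if expired.
cert_still_valid() {
    openssl x509 -noout -checkend 0 -in "$1" >/dev/null
}

# Usage, on the host in question:
#   cert_dates /etc/acme/cert/dumps.crt
#   cert_still_valid /etc/acme/cert/dumps.crt || echo "dumps.crt has expired"
```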

Change 446497 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446497

Change 446692 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: set the ttl lower to prepare for web failover on dumps

https://gerrit.wikimedia.org/r/446692

Change 446692 merged by Bstorm:
[operations/dns@master] dumps distribution: set the ttl lower to prepare for web failover on dumps

https://gerrit.wikimedia.org/r/446692

Change 446991 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446991

Change 446991 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446991

Change 446476 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps web to labstore1006

https://gerrit.wikimedia.org/r/446476

Change 447467 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: remove labstore1007 from NFS etc.

https://gerrit.wikimedia.org/r/447467

Change 447467 merged by Bstorm:
[operations/puppet@production] dumps distribution: remove labstore1007 from NFS etc.

https://gerrit.wikimedia.org/r/447467

Change 447484 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: put back labstore1007 as a stats server

https://gerrit.wikimedia.org/r/447484

Change 447484 merged by Bstorm:
[operations/puppet@production] dumps distribution: put back labstore1007 as a stats server

https://gerrit.wikimedia.org/r/447484

Change 448059 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: set labstore1007 as the VPS NFS host

https://gerrit.wikimedia.org/r/448059

Change 448059 merged by Bstorm:
[operations/puppet@production] dumps distribution: set labstore1007 as the VPS NFS host

https://gerrit.wikimedia.org/r/448059

I think all tasks on this are done at this point @Cmjohnson -- the servers are both done, in service and the handoff task was accomplished in the middle of it.

Bstorm updated the task description. (Show Details)