⚓ T196651 rack upgraded storage capacity in labstore100[67].eqiad.wmnet

Subject	Repo	Branch	Lines +/-
dumps distribution: set labstore1007 as the VPS NFS host	operations/puppet	production	+1 -1
dumps distribution: put back labstore1007 as a stats server	operations/puppet	production	+1 -1
dumps distribution: remove labstore1007 from NFS etc.	operations/puppet	production	+3 -2
dumps distribution: fail over dumps web to labstore1006	operations/dns	master	+1 -1
dumps distribution: failing web services over to labstore1006	operations/puppet	production	+3 -3
dumps distribution: set the ttl lower to prepare for web failover on dumps	operations/dns	master	+1 -1
dumps distribution: failing web services over to labstore1006	operations/puppet	production	+3 -3

RobH reassigned this task from RobH to • Cmjohnson.Jun 7 2018, 4:22 PM

RobH triaged this task as Lowest priority.

RobH raised the priority of this task from Lowest to Medium.

RobH created this task.

RobH created this object in space Restricted Space.

RobH updated the task description. (Show Details)

• Cmjohnson added a project: ops-eqiad.Jun 7 2018, 4:34 PM

@faidon Can I move flerovium higher in the rack in D2? Since the disk shelves are not here, it's just the 1U server. This would be much easier than having to move both labstore1007 and its current array to make room for another disk shelf.

I am not sure whether this was thought through, or by whom, but labstore1006 will need to have ganeti1005, contint1001 and oresdb1002 moved to make room for the additional disk shelf. The downtime for each will be about 5 minutes.

@Cmjohnson regarding flerovium, sure, no problem, go ahead. (The others would need coordination with their respective service owners)

RobH reassigned this task from faidon to • Cmjohnson.Jun 8 2018, 3:01 PM

RobH shifted this object from the Restricted Space space to the S1 Public space.

RobH removed a project: procurement.Jun 8 2018, 6:17 PM

• Cmjohnson renamed this task from upgrade storage capacity in labstore100[67].eqiad.wmnet to rack upgraded storage capacity in labstore100[67].eqiad.wmnet.Jun 11 2018, 3:39 PM

• Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.

• Cmjohnson moved this task from Up next to Racking Tasks on the ops-eqiad board.Jun 26 2018, 3:57 PM

I was able to relocate a few servers in D2 to make room for the new disk shelf (labstore1007). For labstore1006, I just removed 2 decommissioned servers from U24 and U25. This is close enough to labstore1006 that I do not need to move any other servers. They are racked with asset tags.

• chasemp mentioned this in T198407: Degraded RAID on labstore1007.Jun 28 2018, 3:18 PM

We ran into trouble here:

RAID issues and errors were reported, and the /srv/dumps path changed to read-only (ro)
Chris set the shelves back to how they were before
labstore1006 and labstore1007 are seemingly in the same bad state, with /srv/dumps in ro
- We are able to read spot-checked dumps data from labstore1006 in Toolforge
We decided to reboot labstore1007, and it came up in a recovery console
@Ariel called faidon, who hopped on to help debug

Currently labstore1007 is in recovery mode with @faidon on console, and labstore1006 is yet to be rebooted.
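
For anyone wanting to confirm the read-only state described above without eyeballing `mount` output, here is a minimal sketch that scans /proc/mounts-style text for the `ro` option. The function names and the sample text are hypothetical (the ext4 type and the option strings are assumptions, not copied from either host); only the /srv/dumps path and the /dev/mapper/data-dumps device come from this task.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # keep the last matching entry
    return found


def is_read_only(mounts_text, mount_point):
    """True only if the mount exists and carries the bare 'ro' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "ro" in opts


# Hypothetical sample; on a real host, read the text from /proc/mounts instead.
SAMPLE = """\
/dev/mapper/data-dumps /srv/dumps ext4 ro,relatime,data=ordered 0 0
/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0
"""

print(is_read_only(SAMPLE, "/srv/dumps"))  # True
print(is_read_only(SAMPLE, "/"))           # False
```

Splitting the options on commas, rather than searching for the substring "ro", avoids a false positive on options such as `errors=remount-ro`.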

labstore1007 has been restored to service and NFS clients and web users are pointed at it (https://gerrit.wikimedia.org/r/c/operations/puppet/+/442913)

At the console, @faidon chose the option to 'repair' disks that had been marked bad. We think that, since these disks were not actually bad, this is essentially a no-op. The repair seems to have restored the disks to normal use. We determined that bringing labstore1007 back online and verifying data integrity was most important. Once all clients were pointed at labstore1007, labstore1006 was rebooted and given the same treatment. We did determine that the new shelves were originally cabled incorrectly, which resulted in the errors.

Remaining issues:

Both labstore1006 and labstore1007 warn that they are not using redundant paths for storage cabling
We need to add the new shelves
We need to bring labstore1006 back online with a revert of https://gerrit.wikimedia.org/r/c/operations/puppet/+/442913

@Ariel is currently running fsck on labstore1006 while it is depooled.

We need to schedule work on the remaining issues on labstore1006 and then do the same for labstore1007.

Big thank you to @faidon, @Cmjohnson, @Bstorm, and @Ariel

Cabling information grabbed from these two documents: D3600 manual: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04219600-1.pdf
D3000 series wiring guide: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05252635

Both seem to suggest that a redundant path configuration is possible with multiple enclosures, with no option directly suggested for a single enclosure (which is surprising, but not impossible, considering the wiring paths). That is probably a moot point once the additional shelves are set up.

• Bstorm mentioned this in T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers.Jun 28 2018, 8:22 PM

ArielGlenn subscribed.Jun 28 2018, 9:01 PM

Nemo_bis added a project: Datasets-General-or-Unknown.Jun 28 2018, 10:43 PM

Nemo_bis subscribed.

As a follow-up, I had to force-unmount labstore1006's NFS mountpoint on stat100[5,6] and notebook100[3,4], since the system load skyrocketed and there were no more open files left (probably some application was reading from dumps when the issue happened?).

Fsck on labstore1006 completed last night, it did not take long.

labstore1006 after 'repair' of logical drives during hp boot sequence, and umount of the filesystem:

root@labstore1006:~# fsck -fv /dev/mapper/data-dumps
fsck from util-linux 2.25.2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

     1564624 inodes used (0.14%, out of 1098842112)
       50870 non-contiguous files (3.3%)
        4548 non-contiguous directories (0.3%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 1521049/13568/1
 11065992062 blocks used (62.94%, out of 17581465600)
           0 bad blocks
        2686 large files

     1474006 regular files
       53022 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
       37587 symbolic links (29998 fast symbolic links)
           0 sockets
------------
     1564615 files
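
As a sanity check on the summary above, here is a small sketch that recomputes the usage percentages from the raw counts and confirms that the per-type file counts add up to the reported total. The function name and the regular expression are my own assumptions about the line layout; the numbers are copied from the output above.

```python
import re


def usage_percent(summary_line):
    """Parse 'N <things> used (P%, out of M)' and return (used, total, recomputed %)."""
    m = re.search(r"(\d+)\s+\w+\s+used\s+\(([\d.]+)%, out of (\d+)\)", summary_line)
    if m is None:
        raise ValueError("not an e2fsck usage line")
    used, total = int(m.group(1)), int(m.group(3))
    return used, total, round(100.0 * used / total, 2)


blocks = usage_percent(" 11065992062 blocks used (62.94%, out of 17581465600)")
inodes = usage_percent("     1564624 inodes used (0.14%, out of 1098842112)")
print(blocks[2])  # 62.94
print(inodes[2])  # 0.14

# Regular files + directories + symbolic links; the device, fifo and socket
# counts are all 0 in the output above.
file_total = 1474006 + 53022 + 37587
print(file_total == 1564615)  # True
```

Both recomputed percentages match what e2fsck printed, and the file-type counts sum to the "1564615 files" total.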

I have since reenabled puppet (which would have tried to remount in the middle of the fsck).
@chasemp I added labstore1006 back to statistics_servers in hiera to shut up the incessant whining from the rsync on there that tries to pull from stat1005.

• Vvjjkkii renamed this task from rack upgraded storage capacity in labstore100[67].eqiad.wmnet to sgbaaaaaaa.Jul 1 2018, 1:05 AM

• Vvjjkkii removed • Cmjohnson as the assignee of this task.

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description. (Show Details)

ArielGlenn renamed this task from sgbaaaaaaa to rack upgraded storage capacity in labstore100[67].eqiad.wmnet.Jul 1 2018, 6:26 AM

ArielGlenn assigned this task to • Cmjohnson.

ArielGlenn lowered the priority of this task from High to Medium.

ArielGlenn removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

ArielGlenn updated the task description. (Show Details)

ArielGlenn removed a subscriber: Ariel.Jul 1 2018, 2:59 PM

New shelf is now live and part of the /srv/dumps filesystem on labstore1006. It isn't fully restored to service yet, but everything looks good to do so.

• Bstorm added a subtask: T199248: Smart alert on labstore1006 and labstore1007.Jul 10 2018, 4:48 PM

Change 446476 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps: fail over dumps web to labstore1006

https://gerrit.wikimedia.org/r/446476

gerritbot added a project: Patch-For-Review.Jul 17 2018, 8:29 PM

Change 446497 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446497

The cert on labstore1006 was valid from Mar 14 through Jun 12, so it has already expired. The command

ariel@labstore1006:/etc/acme/cert$ openssl x509 -inform PEM -text -in dumps.crt

gave me the info.

I just assumed it probably was, if acme wasn't running.
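
For future failovers, the expiry check could be scripted rather than read off the `-text` dump. A minimal sketch, assuming the date line comes from `openssl x509 -noout -enddate -in dumps.crt` (which prints a `notAfter=` line): the function names are hypothetical, and the time of day and the year in the sample line are assumptions, since the comment above only gives "Jun 12". Month-name parsing with `%b` assumes an English/C locale.

```python
from datetime import datetime, timezone


def parse_not_after(line):
    """Parse a 'notAfter=Jun 12 00:00:00 2018 GMT' line into an aware UTC datetime."""
    prefix = "notAfter="
    if not line.startswith(prefix):
        raise ValueError("expected a notAfter= line")
    stamp = datetime.strptime(line[len(prefix):].strip(), "%b %d %H:%M:%S %Y GMT")
    return stamp.replace(tzinfo=timezone.utc)


def is_expired(not_after_line, now):
    """True if the certificate's notAfter time is at or before 'now'."""
    return now >= parse_not_after(not_after_line)


# Hypothetical sample line: only the "Jun 12" date is from this task.
sample = "notAfter=Jun 12 00:00:00 2018 GMT"
# Illustrative check time, roughly when the failover patches were uploaded.
check_time = datetime(2018, 7, 17, 20, 29, tzinfo=timezone.utc)
print(is_expired(sample, check_time))  # True
```

Passing `now` in explicitly keeps the check reproducible; in practice you would pass `datetime.now(timezone.utc)`.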

Change 446497 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446497

Change 446692 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: set the ttl lower to prepare for web failover on dumps

https://gerrit.wikimedia.org/r/446692

Change 446692 merged by Bstorm:
[operations/dns@master] dumps distribution: set the ttl lower to prepare for web failover on dumps

https://gerrit.wikimedia.org/r/446692

Change 446991 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446991

Change 446991 merged by Bstorm:
[operations/puppet@production] dumps distribution: failing web services over to labstore1006

https://gerrit.wikimedia.org/r/446991

Change 446476 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps web to labstore1006

https://gerrit.wikimedia.org/r/446476

Change 447467 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: remove labstore1007 from NFS etc.

https://gerrit.wikimedia.org/r/447467

Change 447467 merged by Bstorm:
[operations/puppet@production] dumps distribution: remove labstore1007 from NFS etc.

https://gerrit.wikimedia.org/r/447467

Change 447484 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: put back labstore1007 as a stats server

https://gerrit.wikimedia.org/r/447484

Change 447484 merged by Bstorm:
[operations/puppet@production] dumps distribution: put back labstore1007 as a stats server

https://gerrit.wikimedia.org/r/447484

Change 448059 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: set labstore1007 as the VPS NFS host

https://gerrit.wikimedia.org/r/448059

Change 448059 merged by Bstorm:
[operations/puppet@production] dumps distribution: set labstore1007 as the VPS NFS host

https://gerrit.wikimedia.org/r/448059

I think all tasks on this are done at this point @Cmjohnson -- the servers are both done, in service and the handoff task was accomplished in the middle of it.

• Bstorm closed this task as Resolved.Aug 1 2018, 4:35 PM

• Bstorm updated the task description. (Show Details)

• Bstorm mentioned this in T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server.Sep 4 2018, 3:12 PM

• Bstorm mentioned this in T217473: labstore1006 spontaneous reboot.Mar 2 2019, 3:47 PM

• Bstorm closed subtask T199248: Smart alert on labstore1006 and labstore1007 as Resolved.May 7 2020, 6:04 PM

Maintenance_bot removed a project: Patch-For-Review.May 7 2020, 6:11 PM

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	• Cmjohnson	T196651 rack upgraded storage capacity in labstore100[67].eqiad.wmnet
Resolved	• Bstorm	T199248 Smart alert on labstore1006 and labstore1007
Resolved	colewhite	T199236 Handle SMART for multiple shelves and controllers
