
restbase-dev1006 has a broken disk
Closed, Resolved (Public)

Description

/srv is mounted read-only and I see the typical errors in the logs that indicate that /dev/sdd is broken, and needs to be replaced.

The server will likely need to be reimaged after the disk is changed, given it will break several layers of things configured there.

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:38:55Z] <mobrovac> restbase-dev1006 puppet disabled - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:39:48Z] <mobrovac> restbase-dev1006 stop restbase - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:40:01Z] <mobrovac> restbase-dev1006 decommission cass-a - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:43:29Z] <_joe_> disable notifications in icinga for restbase-dev1006 T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:45:37Z] <mobrovac> restbase-dev1006 decommission cass-b - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T07:05:04Z] <mobrovac> restbase-dev1006 force-stop the cassandra instances, fsync exception during decomm - T224260

mobrovac added a subscriber: Eevans.
mobrovac subscribed.

The node is now ready to be taken over by DC-Ops for disk replacement.

Change 513259 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Temporarily remove restbase-dev1006 from targets

https://gerrit.wikimedia.org/r/513259

Change 513259 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Temporarily remove restbase-dev1006 from targets

https://gerrit.wikimedia.org/r/513259

Cmjohnson added subscribers: RobH, Cmjohnson.

@RobH this disk will need to be ordered outside of the warranty. These servers were shipped without disks; the procurement task states that the disks from RBDEV1001-1003 will be used. They are 800GB Intel SSDs.

description: ATA Disk
product: INTEL SSDSC2BX80
physical id: 0.0.0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 0150
serial: BTHC632208JK800NGN
size: 745GiB (800GB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=3bc4fb3b

Can we please move forward with ordering a replacement disk? The broken disk causes subtle errors for all fleet-wide Cumin/debdeploy runs that touch e.g. dpkg, because it stalls I/O almost indefinitely.

I have manually disabled the /usr/local/sbin/smart-data-dump cron job to reduce spam.

@RobH @Cmjohnson I had an issue today again running cumin to the fleet because restbase-dev1006 was stalling due to disk errors :(

RobH mentioned this in Unknown Object (Task). Jun 27 2019, 6:57 PM
RobH added a subtask: Unknown Object (Task).

I've put in T226756. In the future, please follow up with me directly on orders (or file hardware-requests or procurement tasks). Assigning a random task to me doesn't always get it looked at in a timely manner.

I assume that this host will be reimaged, but in case it's not, please manually run:

apt-get remove python-conftool

once fixed.

The replacement SSD has been ordered on T226756. It should arrive within a week or so; once that task is resolved, this one can move forward.

I've assigned this to @Cmjohnson for him to install the SSD once it arrives on the linked procurement task (which is also assigned to him for receiving.)

@Volans I have the new ssd, are you positive that /dev/sda is in slot 0?

@Cmjohnson I've nothing to do with this host; I only commented because I was unable to remove a package as part of the conftool upgrade.

From a quick look at mdstat and dmesg it looks to me that sdd is the broken one:

$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
      29279232 blocks super 1.2 [4/3] [UUU_]

unused devices: <none>
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 Sense Key : Illegal Request [current]
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 Add. Sense: Logical block address out of range
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 CDB: Read(10) 28 00 03 9b e0 00 00 00 08 00
[Thu Jun 13 21:40:36 2019] blk_update_request: I/O error, dev sdd, sector 60547072

In /dev/disk/by-id/ that device is:

ata-INTEL_SSDSC2BX800G4_BTHC632208WF800NGN -> ../../sdd

and other commands to get more info get stuck...

So please double check and coordinate with the host service owners.
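The reasoning above (an `(F)` flag on a member in /proc/mdstat, plus I/O errors naming the same device in dmesg) can be sketched as a small check. This is only an illustration, not a tool used on this host: the function names are made up, the sample text is copied from the output pasted above, and stripping the trailing digits to get the disk name assumes `sdX`-style device names.

```python
import re

# Sample text copied from the outputs pasted in this task.
MDSTAT_SAMPLE = """\
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
      29279232 blocks super 1.2 [4/3] [UUU_]
"""

DMESG_SAMPLE = """\
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Jun 13 21:40:36 2019] blk_update_request: I/O error, dev sdd, sector 60547072
"""

# A member entry looks like "sdd1[3]"; a trailing "(F)" marks it as failed.
MEMBER_RE = re.compile(r"(?P<part>[a-z]+\d+)\[\d+\](?P<failed>\(F\))?")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector \d+")


def failed_members(mdstat_text):
    """Return {array: [failed partitions]} for arrays with members flagged (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if not header:
            continue
        bad = [m.group("part") for m in MEMBER_RE.finditer(line) if m.group("failed")]
        if bad:
            failed[header.group(1)] = bad
    return failed


def io_error_devices(dmesg_text):
    """Return the set of block devices named in 'I/O error' lines."""
    return set(IO_ERROR_RE.findall(dmesg_text))


def suspect_disks(mdstat_text, dmesg_text):
    """Disks that are both flagged failed in mdstat and reporting I/O errors."""
    # Assumption: sdX-style names, so dropping trailing digits yields the disk.
    from_md = {re.sub(r"\d+$", "", part)
               for parts in failed_members(mdstat_text).values()
               for part in parts}
    return from_md & io_error_devices(dmesg_text)


if __name__ == "__main__":
    print(failed_members(MDSTAT_SAMPLE))
    print(suspect_disks(MDSTAT_SAMPLE, DMESG_SAMPLE))
```

Mapping the resulting kernel name to a physical slot still needs the serial number from /dev/disk/by-id/ and a check on site, as noted above.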

This server is still on Jessie, might be the best option to simply reimage as Stretch and re-bootstrap?

WFM

This host pops up because it's the only one where I can't upgrade scap with debdeploy. I tried to ssh to it manually and it asks me for a password, so it is running but not properly reinstalled yet.

@Eevans is this your server? As I understand it, the server is going to be re-installed anyway, so if I pull the wrong disk to replace I won't be breaking anything. Is that correct?

Correct.

@Eevans the disk has been replaced. I am resolving this task; if you find the problem is not fixed, please re-open and assign to me.

Cmjohnson closed subtask Unknown Object (Task) as Resolved. Jul 24 2019, 7:17 PM

Thanks @Cmjohnson! The machine still needs to be reimaged (preferably with Stretch). Let me know if I should find someone else to do that.

Change 525351 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Switch restbase-dev1006 to Stretch

https://gerrit.wikimedia.org/r/525351

Change 525351 merged by Dzahn:
[operations/puppet@production] Switch restbase-dev1006 to Stretch

https://gerrit.wikimedia.org/r/525351

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907242110_dzahn_180052_restbase-dev1006_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T21:22:22Z] <mutante> <+icinga-wm> RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. (T224260)

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

Removing the ops-eqiad and DC-Ops tags; if a hardware issue presents itself, please add the tags back.

Please note this has a netbox state of 'active' when it is NOT actually active or online.

As such, I've changed the status to 'planned'. Once the system is online, please change the status in netbox to active.

https://netbox.wikimedia.org/dcim/devices/858/

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907262314_dzahn_253389_restbase-dev1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907262316_dzahn_253819_restbase-dev1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

Mentioned in SAL (#wikimedia-operations) [2019-07-26T23:27:33Z] <mutante> restbase-dev1006 - does not boot - hangs at "attempting to boot from C:" - entering "Legacy BIOS One Time Boot Menu" (T224260)

Mentioned in SAL (#wikimedia-operations) [2019-07-26T23:51:05Z] <mutante> restbase-dev1006 - manually booting into PXE to debug boot issue / start Debian installer (T224260)

@Eevans Alright, despite the issues above, the server has now been reinstalled and is on Stretch. I checked that puppet got far enough to create all the shell users, so you should have access again.

Currently puppet fails because the scap deploy-local step fails. Would you know how to take it from here?

Actually, the puppet run is not failing anymore now. :)

Though I had to restart nagios-nrpe-server once, and the Icinga checks are in a weird mixed state:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=restbase-dev1006

I'll have a look; Thanks!

The Cassandra instances are now restored.

Icinga is still not happy though, and there are some ACPI errors in dmesg.

[ 1155.458730] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1155.511851] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1155.569835] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1215.459082] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1215.512970] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1215.572356] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1275.462866] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1275.517937] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1275.579099] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1335.468198] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
...
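One detail in the excerpt above is that the same ACPI error recurs at a steady interval. The sketch below only illustrates that observation by parsing the dmesg timestamps; the sample lines are copied from the paste, the function name is made up, and reading the regular spacing as a periodic poll of the power meter is an assumption rather than something confirmed in this task.

```python
import re

# The "exfield-427" lines copied from the dmesg excerpt above.
ACPI_SAMPLE = """\
[ 1155.458730] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1215.459082] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1275.462866] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1335.468198] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
"""

# dmesg prefixes each line with seconds since boot, e.g. "[ 1155.458730]".
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def error_intervals(dmesg_text, marker="exfield-427"):
    """Return the gaps in seconds between consecutive lines containing marker."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if match and marker in line:
            stamps.append(float(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Rounded to whole seconds to show the recurrence period.
    print([round(gap) for gap in error_intervals(ACPI_SAMPLE)])
```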

Mentioned in SAL (#wikimedia-operations) [2019-07-29T06:31:28Z] <_joe_> restarting nrpe on restbase-dev1006 T224260

Curious that these show up; we explicitly blacklist the acpi power meter kernel module. A reboot should make the messages go away.

I see all Icinga alerts are back to OK now. Looks like this ticket is done. Is that right?

@Eevans Please reopen if something is missing.

Change 534519 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] staging/sessionstore: restbase-dev1006 is back online

https://gerrit.wikimedia.org/r/534519

Change 534519 merged by Eevans:
[operations/deployment-charts@master] staging/sessionstore: restbase-dev1006 is back online

https://gerrit.wikimedia.org/r/534519