restbase-dev1006 has a broken disk
Closed, Resolved · Public

Assigned To

Authored By

	Joe
	May 24 2019, 6:37 AM

Description

/srv is mounted read-only and I see the typical errors in the logs that indicate that /dev/sdd is broken, and needs to be replaced.

The server will likely need to be reimaged after the disk is changed, given it will break several layers of things configured there.
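
As a rough illustration of the first symptom above, the sketch below scans text in the format of /proc/mounts and reports mount points whose option list contains `ro`. The sample lines, including the device names, are made up for the example and are not taken from this host.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains the exact option 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # 'errors=remount-ro' is a different option and must not match
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format (not captured from restbase-dev1006)
SAMPLE = """\
/dev/md0 / ext4 rw,relatime,errors=remount-ro 0 0
/dev/md2 /srv ext4 ro,relatime 0 0
"""

print(read_only_mounts(SAMPLE))  # ['/srv']
```

On a live host the same function could be fed the contents of /proc/mounts, assuming reads from that file still complete while the disk is stalling.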

Details

Subject	Repo	Branch	Lines +/-
staging/sessionstore: restbase-dev1006 is back online	operations/deployment-charts	master	+1 -2
Switch restbase-dev1006 to Stretch	operations/puppet	production	+0 -1
Scap: Temporarily remove restbase-dev1006 from targets	mediawiki/services/restbase/deploy	master	+0 -1


Related Objects

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTBase Sunset
Resolved	DAlangi_WMF	T324678 Migrate proton (chromium-render) away from restbase
Open	None	T167603 Any Chinese Wiki's projects about "Download as PDF" can not auto change to Simplified Chinese or Traditional Chinese
Resolved	ovasileva	T147553 [EPIC] Page previews broken on many projects
Open	ovasileva	T244262 [Epic] Enable page previews and reference previews as a beta feature on all projects
Open	None	T111231 Page previews for Wikidata
Invalid	None	T148854 Use RESTBase for zhwiki
Resolved	• Pchelolo	T188164 Popups don‘t support language variant conversion and {{lang}} template
Resolved	• mobrovac	T190689 FY17/18 Q4 Program 7 Services Goal: Language variants support
Resolved	Eevans	T186751 Restablish RESTBase dev environment with Cassandra 3.11.2
Resolved	Dzahn	T185494 Degraded RAID on restbase-dev1006
Resolved	Eevans	T224260 restbase-dev1006 has a broken disk
		Unknown Object (Task)

Event Timeline


Restricted Application added a project: SRE. · May 24 2019, 6:37 AM

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:38:55Z] <mobrovac> restbase-dev1006 puppet disabled - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:39:48Z] <mobrovac> restbase-dev1006 stop restbase - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:40:01Z] <mobrovac> restbase-dev1006 decommission cass-a - T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:43:29Z] <_joe_> disable notifications in icinga for restbase-dev1006 T224260

Mentioned in SAL (#wikimedia-operations) [2019-05-24T06:45:37Z] <mobrovac> restbase-dev1006 decommission cass-b - T224260

• mobrovac edited projects, added Services (watching), Platform Team Legacy (Watching / External), Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)); removed Services. May 24 2019, 7:01 AM

Mentioned in SAL (#wikimedia-operations) [2019-05-24T07:05:04Z] <mobrovac> restbase-dev1006 force-stop the cassandra instances, fsync exception during decomm - T224260

The node is now ready to be taken over by DC-Ops for disk replacement.

• MoritzMuehlenhoff assigned this task to • Cmjohnson. May 24 2019, 7:11 AM

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 28 2019, 2:53 PM

• mobrovac mentioned this in T224554: Migrate Restbase-dev cluster to Stretch. May 29 2019, 12:11 PM

Change 513259 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Scap: Temporarily remove restbase-dev1006 from targets

https://gerrit.wikimedia.org/r/513259

Change 513259 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Scap: Temporarily remove restbase-dev1006 from targets

https://gerrit.wikimedia.org/r/513259

Maintenance_bot removed a project: Patch-For-Review. May 30 2019, 8:10 AM

@RobH this disk will need to be ordered outside of the warranty. These servers were shipped without disks; the procurement task states that the disks from RBDEV1001-1003 will be used. They are 800GB Intel SSDs:

description: ATA Disk
product: INTEL SSDSC2BX80
physical id: 0.0.0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 0150
serial: BTHC632208JK800NGN
size: 745GiB (800GB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=3bc4fb3b
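
The two figures on the `size:` line above describe the same capacity in different units: 800GB counts decimal gigabytes (10^9 bytes), while 745GiB counts binary gibibytes (2^30 bytes). A minimal sketch of the conversion, with a helper name chosen only for this example:

```python
def decimal_gb_to_gib(gigabytes):
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gigabytes * 10**9 / 2**30


# 800 GB is a little over 745 GiB; truncating gives the figure shown above
print(int(decimal_gb_to_gib(800)))  # 745
```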

• Cmjohnson moved this task from Hardware Failure / Troubleshoot to Blocked on the ops-eqiad board. Jun 11 2019, 3:50 PM

Can we please move forward with ordering a replacement disk? This broken disk causes subtle errors for all fleet-wide Cumin/debdeploy runs touching e.g. dpkg, as it stalls I/O almost indefinitely.

I have manually disabled the /usr/local/sbin/smart-data-dump cron job to reduce spam.

@RobH @Cmjohnson I had an issue today again running cumin to the fleet because restbase-dev1006 was stalling due to disk errors :(

jijiki merged a task: T223825: Degraded RAID on restbase-dev1006. Jun 27 2019, 3:09 PM

jijiki added subscribers: ops-monitoring-bot, Volans, fgiunchedi.

I've put in T226756. In the future, please follow up with me directly on orders (or file hardware-requests or procurement tasks). Assigning a random task to me doesn't always get it looked at in a timely manner.

I assume that this host will be reimaged, but in case it's not, please manually run:

apt-get remove python-conftool

once fixed.

• MoritzMuehlenhoff mentioned this in T227394: Degraded RAID on restbase-dev1006. Jul 8 2019, 6:46 AM

The replacement SSD has been ordered on T226756. It should arrive within a week or so; once T226756 is resolved, this task can move forward.

I've assigned this to @Cmjohnson for him to install the SSD once it arrives on the linked procurement task (which is also assigned to him for receiving.)

@Volans I have the new ssd, are you positive that /dev/sda is in slot 0?

• Cmjohnson moved this task from Blocked to Hardware Failure / Troubleshoot on the ops-eqiad board. Jul 16 2019, 8:24 PM

@Cmjohnson I've nothing to do with this host; I just commented because I was unable to remove a package as part of the conftool upgrade.

From a quick look at mdstat and dmesg it looks to me that sdd is the broken one:

$ cat /proc/mdstat
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      3004026880 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
      29279232 blocks super 1.2 [4/3] [UUU_]

unused devices: <none>
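
For readers less used to this format: a member followed by `(F)` has been marked faulty by md. The sketch below, whose regular expression and function name are choices made for this example only, pulls the flagged members out of the array lines shown above.

```python
import re

# A member looks like 'sdd1[3]' and carries '(F)' when md marked it faulty
MEMBER = re.compile(r"(\w+)\[\d+\](\(F\))?")


def failed_members(mdstat_text):
    """Map each md array name to the list of members flagged (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        if not line.startswith("md") or " : " not in line:
            continue  # only array header lines list members
        array, rest = line.split(" : ", 1)
        bad = [name for name, flag in MEMBER.findall(rest) if flag]
        if bad:
            failed[array] = bad
    return failed


# Array lines copied from the output above
SAMPLE = """\
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
md0 : active raid1 sda1[0] sdd1[3](F) sdc1[2] sdb1[1]
"""

print(failed_members(SAMPLE))  # {'md0': ['sdd1']}
```

Only md0 reports a faulty member here, which matches its `[4/3] [UUU_]` status line.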

[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 Sense Key : Illegal Request [current]
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 Add. Sense: Logical block address out of range
[Thu Jun 13 21:40:36 2019] sd 3:0:0:0: [sdd] tag#7 CDB: Read(10) 28 00 03 9b e0 00 00 00 08 00
[Thu Jun 13 21:40:36 2019] blk_update_request: I/O error, dev sdd, sector 60547072
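
The CDB bytes and the failing sector in the lines above can be cross-checked. In a READ(10) command, byte 0 is the opcode (0x28), bytes 2 to 5 hold the logical block address in big-endian order, and bytes 7 to 8 hold the transfer length in blocks. A minimal decoding sketch, with a function name chosen for this example:

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) from a space-separated READ(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the kernel log line above
print(decode_read10("28 00 03 9b e0 00 00 00 08 00"))  # (60547072, 8)
```

The decoded address equals the sector reported by blk_update_request, so the rejected command was an 8-block read starting at that sector.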

According to /dev/disk/by-id/, that device is:

ata-INTEL_SSDSC2BX800G4_BTHC632208WF800NGN -> ../../sdd

and other commands to get more info get stuck...

So please double check and coordinate with the host service owners.

This server is still on Jessie, might be the best option to simply reimage as Stretch and re-bootstrap?

In T224260#5349319, @MoritzMuehlenhoff wrote:

This server is still on Jessie, might be the best option to simply reimage as Stretch and re-bootstrap?

WFM

This host pops up because it's the only one where I can't upgrade scap with debdeploy. I tried to ssh to it manually and it asks me for a password, so it is running but hasn't been properly reinstalled yet.

@Eevans is this your server? I think I understand that the server is going to be re-installed anyway, so if I pull the wrong disk to replace I won't be breaking anything. Is that correct?

In T224260#5362558, @Cmjohnson wrote:

@Eevans is this your server? I think I understand that the server is going to be re-installed anyway, so if I pull the wrong disk to replace I won't be breaking anything. Is that correct?

Correct.

@Eevans the disk has been replaced. I am resolving this task, if you find the problem is not fixed, please re-open and assign to me

Eevans awarded a token. Jul 24 2019, 7:07 PM

• Cmjohnson closed subtask Unknown Object (Task) as Resolved. Jul 24 2019, 7:17 PM

In T224260#5362801, @Cmjohnson wrote:

@Eevans the disk has been replaced. I am resolving this task, if you find the problem is not fixed, please re-open and assign to me

Thanks @Cmjohnson! The machine still needs to be reimaged (preferably with Stretch). Let me know if I should find someone else to do that.

Change 525351 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Switch restbase-dev1006 to Stretch

https://gerrit.wikimedia.org/r/525351

gerritbot added a project: Patch-For-Review. Jul 24 2019, 7:48 PM

Change 525351 merged by Dzahn:
[operations/puppet@production] Switch restbase-dev1006 to Stretch

https://gerrit.wikimedia.org/r/525351

Maintenance_bot removed a project: Patch-For-Review. Jul 24 2019, 9:10 PM

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907242110_dzahn_180052_restbase-dev1006_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T21:22:22Z] <mutante> <+icinga-wm> RECOVERY - Device not healthy -SMART- on restbase-dev1006 is OK: All metrics within thresholds. (T224260)

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

The failed install might be due to https://phabricator.wikimedia.org/T222960#5327461 ?

Removing the ops-eqiad and DC-Ops tags. If a hardware issue presents itself, please add the tags back.

CCicalese_WMF edited projects, added Platform Engineering (Needs Cleaning - Cassandra Operational); removed Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), Platform Team Legacy (Watching / External). Jul 26 2019, 5:44 PM

Please note this has a netbox state of 'active' when it is NOT actually active or online.

As such, I've changed the status to 'planned'. Once the system is online, please change the status in netbox to active.

https://netbox.wikimedia.org/dcim/devices/858/

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907262314_dzahn_253389_restbase-dev1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase-dev1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907262316_dzahn_253819_restbase-dev1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1006.eqiad.wmnet']

Of which those FAILED:

['restbase-dev1006.eqiad.wmnet']

Dzahn added a parent task: T185494: Degraded RAID on restbase-dev1006. Jul 26 2019, 11:26 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-26T23:27:33Z] <mutante> restbase-dev1006 - does not boot - hangs at "attempting to boot from C:" - entering "Legacy BIOS One Time Boot Menu" (T224260)

Mentioned in SAL (#wikimedia-operations) [2019-07-26T23:51:05Z] <mutante> restbase-dev1006 - manually booting into PXE to debug boot issue / start Debian installer (T224260)

@Eevans Alright, despite the issues above the server has been reinstalled now and is on stretch. I checked that puppet got so far to create all the shell users. So you should have access again.

Currently puppet fails because the scap deploy-local step fails. Would you know how to take it from here?

cc: @fgiunchedi

actually.. puppet run is not failing anymore now. :)

though.. i had to restart nagios-nrpe-server once and Icinga checks are in a weird mixed state:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=restbase-dev1006

In T224260#5370545, @Dzahn wrote:

@Eevans Alright, despite the issues above the server has been reinstalled now and is on stretch. I checked that puppet got so far to create all the shell users. So you should have access again.

Currently puppet fails because the scap deploy-local step fails. Would you know how to take it from here?

I'll have a look; Thanks!

In T224260#5370545, @Dzahn wrote:

@Eevans Alright, despite the issues above the server has been reinstalled now and is on stretch. I checked that puppet got so far to create all the shell users. So you should have access again.

Currently puppet fails because the scap deploy-local step fails. Would you know how to take it from here?

The Cassandra instances are now restored.

Icinga is still not happy though, and there are some ACPI errors in dmesg.

[ 1155.458730] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1155.511851] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1155.569835] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1215.459082] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1215.512970] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1215.572356] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1275.462866] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1275.517937] ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff896e7edc5e60), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
[ 1275.579099] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)
[ 1335.468198] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
...
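
The bracketed dmesg timestamps above are seconds since boot, and the same three messages repeat at a steady interval. A small sketch of how that interval can be read off the SMBus lines; the regular expression and function name are choices made for this example:

```python
import re

# Matches the seconds-since-boot stamp on the recurring SMBus error line
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\] ACPI Error: SMBus")


def intervals(log_text):
    """Return the rounded gaps, in seconds, between successive SMBus errors."""
    times = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            times.append(float(match.group(1)))
    return [round(later - earlier) for earlier, later in zip(times, times[1:])]


# SMBus error lines copied from the log excerpt above
SAMPLE = """\
[ 1155.458730] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1215.459082] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1275.462866] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
[ 1335.468198] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
"""

print(intervals(SAMPLE))  # [60, 60, 60]
```

A fixed 60-second spacing suggests something polls the power meter once a minute, though the excerpt alone does not identify what.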

Mentioned in SAL (#wikimedia-operations) [2019-07-29T06:31:28Z] <_joe_> restarting nrpe on restbase-dev1006 T224260

In T224260#5370616, @Eevans wrote:

In T224260#5370545, @Dzahn wrote:

@Eevans Alright, despite the issues above the server has been reinstalled now and is on stretch. I checked that puppet got so far to create all the shell users. So you should have access again.

Currently puppet fails because the scap deploy-local step fails. Would you know how to take it from here?

The Cassandra instances are now restored.

Icinga is still not happy though, and there are some ACPI errors in dmesg.

Curious that these show up; we explicitly blacklist the ACPI power meter kernel module. A reboot should make the messages go away.

I see all Icinga alerts are back to OK now. Looks like this ticket is done. Is that right?

@Eevans Please reopen if something is missing.

Change 534519 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] staging/sessionstore: restbase-dev1006 is back online

https://gerrit.wikimedia.org/r/534519

Change 534519 merged by Eevans:
[operations/deployment-charts@master] staging/sessionstore: restbase-dev1006 is back online

https://gerrit.wikimedia.org/r/534519

Eevans mentioned this in rDEPLOYCHARTS7673dc22599b: staging/sessionstore: restbase-dev1006 is back online. Sep 4 2019, 8:00 PM

Maintenance_bot removed a project: Patch-For-Review. Sep 4 2019, 8:10 PM

• Cmjohnson mentioned this in T253607: Degraded RAID on restbase-dev1004. Jun 12 2020, 12:54 PM