Degraded RAID on ocg1001
Closed, Resolved (Public)
Assigned To

Authored By

	ops-monitoring-bot
	Mar 22 2017, 9:13 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ocg1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sdb3[1] sda3[0]
      477511488 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sdb1[1] sda1[0](F)
      9756544 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
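
For readers unfamiliar with the mdstat format above, here is a minimal sketch of how a snapshot like this one can be checked for degraded arrays. It is not the actual Nagios/Icinga event handler; the function name `degraded_arrays` and the sample constant are hypothetical, and the parsing only covers the simple layout shown in this task.

```python
import re

# Sample taken from the snapshot in this task (trailing whitespace dropped).
SAMPLE_MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      477511488 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0](F)
      9756544 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to the members flagged (F) in its header line."""
    result = {}
    current = None
    members = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Member tokens look like "sda1[0]" or "sda1[0](F)".
            members = [tok for tok in header.group(2).split() if "[" in tok]
            continue
        # Status looks like "[2/1] [_U]": expected members / active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = [m.split("[")[0] for m in members if m.endswith("(F)")]
            current = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On the snapshot above this reports only md0, with sda1 as the failed member, which matches the `[2/1] [_U]` status line.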

Details

	Subject	Repo	Branch	Lines +/-
	ocg1001: put back into rotation	operations/puppet	production	+0 -1
	ocg: enable ocg1003, disable ocg1001	operations/puppet	production	+0 -0

Related Objects

Mentioned In: T84723: reinstall OCG servers
Mentioned Here: T150160: Remote IPMI doesn't work for ~2% of the fleet
T155692: ocg1001.eqiad.wmnet ipmi error
T84723: reinstall OCG servers

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Mar 22 2017, 9:13 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Mar 22 2017, 9:13 PM

MoritzMuehlenhoff assigned this task to Cmjohnson. Mar 23 2017, 7:45 AM

The disk is internal and the server will need to be powered off to replace it. Please coordinate scheduled downtime with cmjohnson for the replacement.

Someone, unfortunately, needs to follow the process outlined here: https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host

@Dzahn, can I ask you to have a look at that and coordinate with Chris? Thanks!

I read the instructions, but I have questions:

"remove the host from the round-robin DNS name specified in the Collection extension configuration, so it is no longer the target of new job requests from PHP. This is the $wgCollectionMWServeURL variable"

This sounds like it needs a code deployment of an extension. I have not done a software deployment in years, and never of an extension, so I don't know how to and would have to learn from scratch. I wasn't really planning on that.

"Once the DNS change has propagated and you've restarted OCG with the decommission configuration (restarting will wait for any existing jobs on that host to complete), you would run something like: .. sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-host-cache.js"

Where is this script located and where do you run it?

Change 347781 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

gerritbot added a project: Patch-For-Review. Apr 11 2017, 10:25 PM

Change 347781 merged by Dzahn:
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

Mentioned in SAL (#wikimedia-operations) [2017-04-11T22:36:34Z] <mutante> ocg1003 started picking up jobs (mw-ocg-latexer) after it was enabled with gerrit:347781, ocg1001 was disabled in the same change. Also ganglia graphs confirm it. T84723 T161158

Mentioned in SAL (#wikimedia-operations) [2017-04-11T23:11:23Z] <mutante> ocg1001 - scheduled downtime in icinga for host and all services, confirmed it's not actively doing things anymore, shutting down for hardware replacement (T161158)

Mentioned in SAL (#wikimedia-operations) [2017-04-11T23:23:00Z] <mutante> ocg: clearing host cache for ocg1001 which is shutdown for hardware repair. (on ocg1003: sudo -u ocg -g ocg nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1001) T161158

To answer my own questions from above:

"Once the DNS change has propagated and you've restarted OCG with the decommission configuration (restarting will wait for any existing jobs on that host to complete),"

This happens automatically nowadays. Puppet did the service restart (Info: /Stage[main]/Ocg/File[/etc/ocg/mw-ocg-service.js]: Scheduling refresh of Service[ocg]) after the config change (a Hiera setting) was merged. There is no DNS change, just pybal/confctl.

"sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-host-cache.js"

I ran this on ocg1003: sudo -u ocg -g ocg nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1001

"remove the host from the round-robin DNS name specified in the Collection extension configuration.."

This means "[puppetmaster1001:~] $ sudo -i confctl select name=ocg1001.eqiad.wmnet set/pooled=no" nowadays. (As far as I can tell; I added this to wikitech.)

@Cmjohnson See above, i have deactivated and shut down this host. Please replace the disk / fix the RAID and let me know when done so we can activate it again. You can do this anytime without further sync. I was able to activate ocg1003 instead, so we still have 2 servers running as before.
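
For future reference, the two manual commands quoted in the answers above can be collected as follows. This is only a sketch that builds and prints the command lines exactly as they were used in this task; the helper name `decommission_commands` and the `domain` default are hypothetical. It does not execute anything, and it does not cover the Hiera change (gerrit 347781) or the Puppet-triggered service restart. Note that in this task the confctl command was run on puppetmaster1001 and the cache-clearing script on ocg1003.

```python
import shlex


def decommission_commands(old_host, domain="eqiad.wmnet"):
    """Return the depool and cache-clearing command lines for old_host (not executed)."""
    # Depool via confctl, as run on puppetmaster1001 in this task.
    depool = [
        "sudo", "-i", "confctl", "select",
        f"name={old_host}.{domain}", "set/pooled=no",
    ]
    # Clear the host cache, as run on a remaining OCG host (ocg1003 here).
    clear_cache = [
        "sudo", "-u", "ocg", "-g", "ocg", "nodejs-ocg",
        "/srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-host-cache.js",
        "-c", "/etc/ocg/mw-ocg-service.js", old_host,
    ]
    return [depool, clear_cache]


if __name__ == "__main__":
    for cmd in decommission_commands("ocg1001"):
        print(shlex.join(cmd))
```

Printing rather than running keeps the sketch safe to try anywhere; the commands still have to be pasted on the right hosts by hand.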

@Dzahn I replaced the disk in slot 0, which is /dev/sda. I changed the BIOS boot order to boot from /dev/sdb, but grub does not appear to be installed there. If I leave it be, it defaults to a fresh install. Please take a look.

Relating it also to T155692

thanks @Cmjohnson! I'll take a look at it today. Pretty sure we can just reinstall this.

@Cmjohnson I attempted a reinstall but it consistently fails at the partitioning step with:

┌─────────────┤ [!!] Partition disks ├─────────────┐
│                                                  │
│ Input/output error during read on /dev/sda       │
│                                                  │
│ ERROR!!!                                         │
│                                                  │
│                    Retry                         │
│                    Ignore                        │
│                    Cancel                        │
│                                                  │
│     <Go Back>                                    │
│                                                  │
└──────────────────────────────────────────────────┘

I changed the boot order in BIOS (port A was still first, so I switched to port B), but that did not change the error. It is still "during read on /dev/sda" at the partitioning step.

Both identical drives are detected during boot and in BIOS though....

@Cmjohnson Somehow the new /dev/sda also seems to be broken. Maybe it was used in something else before? Or it was this disk that was broken the whole time and we replaced the wrong one? I dunno, but from /var/log/syslog in installer shell this looks pretty much like broken hardware (again?/still?).

Apr 25 01:01:38 kernel: [   55.257648]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.257649] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.257650] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.259626] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.259629] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.259720] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.259722] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.259723] ata1.00: failed command: READ DMA
Apr 25 01:01:38 kernel: [   55.259725] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 14 dma 4096 in
Apr 25 01:01:38 kernel: [   55.259725]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.259726] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.259727] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.261056] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.261059] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.261152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.261153] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.261155] ata1.00: failed command: READ DMA
Apr 25 01:01:38 kernel: [   55.261157] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 17 dma 4096 in
Apr 25 01:01:38 kernel: [   55.261157]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.261158] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.261159] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.262484] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.262487] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.262586] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.262587] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.262589] ata1.00: failed command: READ DMA
Apr 25 01:01:38 kernel: [   55.262591] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 20 dma 4096 in
Apr 25 01:01:38 kernel: [   55.262591]          res 51/04:08:00:00:00/00:00:00:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.262592] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.262593] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.264628] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.264631] sd 0:0:0:0: [sda]  
Apr 25 01:01:38 kernel: [   55.264632] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 25 01:01:38 kernel: [   55.264633] sd 0:0:0:0: [sda]  
Apr 25 01:01:38 kernel: [   55.264634] Sense Key : Aborted Command [current] [descriptor]
Apr 25 01:01:38 kernel: [   55.264635] Descriptor sense data with sense descriptors (in hex):
Apr 25 01:01:38 kernel: [   55.264635]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Apr 25 01:01:38 kernel: [   55.264640]         00 00 00 00 
Apr 25 01:01:38 kernel: [   55.264642] sd 0:0:0:0: [sda]  
Apr 25 01:01:38 kernel: [   55.264643] Add. Sense: No additional sense information
Apr 25 01:01:38 kernel: [   55.264644] sd 0:0:0:0: [sda] CDB: 
Apr 25 01:01:38 kernel: [   55.264644] Read(10): 28 00 00 00 00 00 00 00 08 00
Apr 25 01:01:38 kernel: [   55.264651] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.264759] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.264761] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.264762] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.264765] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 23 dma 4096 in
Apr 25 01:01:38 kernel: [   55.264765]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.264766] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.264767] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.266098] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.266101] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.266196] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.266197] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.266198] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.266201] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 25 dma 4096 in
Apr 25 01:01:38 kernel: [   55.266201]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.266202] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.266203] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.267528] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.267531] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.267628] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.267629] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.267630] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.267633] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 28 dma 4096 in
Apr 25 01:01:38 kernel: [   55.267633]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.267634] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.267635] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.269636] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.269638] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.269732] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.269734] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.269735] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.269737] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 0 dma 4096 in
Apr 25 01:01:38 kernel: [   55.269737]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.269738] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.269739] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.271080] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.271084] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.271177] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.271178] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.271179] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.271182] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 3 dma 4096 in
Apr 25 01:01:38 kernel: [   55.271182]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)
Apr 25 01:01:38 kernel: [   55.271183] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.271184] ata1.00: error: { ABRT }
Apr 25 01:01:38 kernel: [   55.272508] ata1.00: configured for UDMA/133
Apr 25 01:01:38 kernel: [   55.272511] ata1: EH complete
Apr 25 01:01:38 kernel: [   55.272602] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 25 01:01:38 kernel: [   55.272603] ata1.00: irq_stat 0x40000001
Apr 25 01:01:38 kernel: [   55.272604] ata1.00: failed command: READ DMA EXT
Apr 25 01:01:38 kernel: [   55.272607] ata1.00: cmd 25/00:08:28:60:38/00:00:3a:00:00/e0 tag 6 dma 4096 in
Apr 25 01:01:38 kernel: [   55.272607]          res 51/04:08:28:60:38/00:00:3a:00:00/e0 Emask 0x1 (device error)

Any other disk to try? Can we replace sda one more time?
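
Since the syslog excerpt above is long and repetitive, here is a minimal sketch for summarising such a log by counting the "failed command" entries per ATA device and command type. The function name `count_failed_commands` and the three-line sample are illustrative only; the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as "... ata1.00: failed command: READ DMA EXT".
FAILED = re.compile(r"(ata\d+\.\d+): failed command: (.+?)\s*$")

SAMPLE_LOG = """\
Apr 25 01:01:38 kernel: [   55.259723] ata1.00: failed command: READ DMA
Apr 25 01:01:38 kernel: [   55.259726] ata1.00: status: { DRDY ERR }
Apr 25 01:01:38 kernel: [   55.264762] ata1.00: failed command: READ DMA EXT
"""


def count_failed_commands(log_text):
    """Count failed ATA commands, keyed by (device, command name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (device, command), n in count_failed_commands(SAMPLE_LOG).items():
        print(f"{device}: {command} failed {n} time(s)")
```

A summary like this only shows which device and which commands are failing; it does not by itself distinguish a bad disk from a bad cable or port.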

I will replace the disk again. The disk I used was a "used" disk but was wiped.

The disk has been replaced

Thanks, i will do a reinstall later today.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T20:20:38Z] <mutante> ocg1001 - re-added to puppet, initial run, reinstall ongoing (T161158)

Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Apr 27 2017, 8:28 PM

Please also ensure that remote IPMI is working, applying the fix in T150160 if necessary, because right now it is not:

neodymium  0 ~$ ipmitool -I lanplus -H ocg1001.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

It has been reinstalled and re-added to puppet and salt, but I saw errors about a failed deployment of the ocg app, so I didn't repool it yet. After being afk for a while I came back and looked at it, and now ocg is running and puppet runs without errors. Who fixed it? :)

looks like it's ok.

<+icinga-wm> RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 72745 msg: ocg_render_job_queue 0 msg

@ocg1001:~# ps aux | grep ocg
root      3667  0.0  0.0  11876   912 pts/1    S+   22:32   0:00 grep --color=auto ocg
ocg      27042  0.0  0.0 910480 27060 ?        Ssl  20:51   0:00 /usr/bin/nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service.js -c /etc/ocg/mw-ocg-service.js
ocg      27051  0.1  0.0 992684 63708 ?        Sl   20:51   0:07 /usr/bin/nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service.js -c /etc/ocg/mw-ocg-service.js
ocg      27053  0.0  0.0 852824 39428 ?        Sl   20:51   0:00 /usr/bin/nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service.js -c /etc/ocg/mw-ocg-service.js
ocg      27054  0.1  0.0 856380 60376 ?        Sl   20:51   0:08 /usr/bin/nodejs-ocg /srv/deploymen

So we could pool it if needed, but I won't yet, so that we can get the IPMI issue resolved first.

@Cmjohnson it's still ok to take it down, it's not getting traffic. i'll wait for the IPMI issue before putting it back to work. would be nice though if it can be prioritized, since both 1002 and 1003 are in row D as we saw with today's outage.

@Dzahn, IPMI works on the host itself but is not reachable from neodymium. This is intermittent and not necessarily related to this task. There is already an open IPMI task, so I am resolving this one.

root@neodymium:/home/cmjohnson# ipmitool -I lanplus -H ocg1001.mgmt.eqiad.wmnet -U root -E chassis status -v
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

@Cmjohnson Yes, that is what @Volans asked about on T161158#3219587. Can you apply the fix on T150160 please?

It would be nice to keep this open until ocg1001 is actually back in service.

Change 351301 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] ocg1001: put back into rotation

https://gerrit.wikimedia.org/r/351301

Change 351301 merged by Giuseppe Lavagetto:
[operations/puppet@production] ocg1001: put back into rotation