analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted
Closed, Resolved · Public

Description

Icinga alerted that analytics1045 has both a RAID failure and degraded systemd state.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1045&service=MegaRAID
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1045&service=Check+systemd+state

The failed systemd service is:

● var-lib-hadoop-data-j.mount loaded failed failed start /var/lib/hadoop/data/j
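
For anyone triaging a similar alert, the failed mount units and their mount points can be pulled out of `systemctl` output with a small filter. This is only a sketch: `failed_mounts` is a hypothetical helper name, and it assumes the line layout shown above, where the last column is the mount path.

```shell
#!/bin/sh
# Hypothetical helper: reads `systemctl list-units` lines on stdin and prints
# "<unit> <last column>" for every line that contains a .mount unit name.
# Assumes the last column is the mount path, as in the line quoted above.
failed_mounts() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\.mount$/) { print $i, $NF; break }
        }
    }'
}
```

Possible usage on the host: `systemctl list-units --type=mount --state=failed --no-legend | failed_mounts`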

Event Timeline

Dzahn created this task. Sep 5 2019, 5:51 AM
Restricted Application added a project: Analytics. Sep 5 2019, 5:51 AM
Restricted Application added a subscriber: Aklapper.
Dzahn removed a subscriber: Dzahn. Sep 5 2019, 5:52 AM
jbond added a subscriber: jbond. Sep 9 2019, 10:21 AM

I tried running the following command on the server; however, the Current Cache Policy remains WriteThrough:

analytics1045 ~ % sudo megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll                  [10:19:33]
                                     
Set Write Policy to Forced WriteBack on Adapter 0, VD 0 (target id: 0) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 1 (target id: 1) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 2 (target id: 2) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 3 (target id: 3) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 4 (target id: 4) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 5 (target id: 5) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 6 (target id: 6) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 7 (target id: 7) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 8 (target id: 8) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 10 (target id: 10) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 11 (target id: 11) success
Set Write Policy to Forced WriteBack on Adapter 0, VD 12 (target id: 12) success

Exit Code: 0x00
analytics1045 ~ % sudo megacli -LDInfo -LAll -aAll | grep "Cache Policy:"                   [10:20:36]
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
Default Cache Policy: WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, Write Cache OK if Bad BBU
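
Rather than reading the pairs by eye, the comparison can be scripted. The following is a minimal sketch, assuming the `Default Cache Policy:` / `Current Cache Policy:` line layout shown above; `check_cache_policy` is a hypothetical helper name, not part of megacli.

```shell
#!/bin/sh
# Hypothetical helper: reads `megacli -LDInfo -LAll -aAll` output on stdin and
# reports each Default/Current pair whose write policy (first field) differs.
# "pair N" is the position in the output, not the virtual drive's target id.
# Exit status is 1 if any pair differs, 0 otherwise.
check_cache_policy() {
    awk -F': *' '
        /^Default Cache Policy:/ { split($2, d, ","); def = d[1] }
        /^Current Cache Policy:/ {
            n++
            split($2, c, ",")
            if (c[1] != def) {
                printf "pair %d: default=%s current=%s\n", n, def, c[1]
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Possible usage on the host: `sudo megacli -LDInfo -LAll -aAll | check_cache_policy`. Against the output above it would flag all twelve pairs, since every one shows WriteBack as the default and WriteThrough as the current policy.
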
jbond triaged this task as Normal priority. Sep 9 2019, 10:21 AM
jbond added projects: DC-Ops, ops-eqiad.

Hi @Dzahn @jbond, it looks like this host is out of warranty and about three quarters of a year away from a hardware refresh, so I wanted to double-check: are you considering retiring this system soon, or would you like us to purchase a replacement part? Thanks, Willy

Dzahn added a comment. Sep 16 2019, 7:29 PM

That's a question for the analytics team, please.

Ottomata added subscribers: Nuria, Ottomata. Edited Sep 16 2019, 8:40 PM

Hello!

In our FY2019-2020 hardware budgeting, we had planned to replace these nodes in Q4, when they actually need to be replaced.

We could probably operate down a worker node for a while, but 9 months is a long time. I'll ask my team tomorrow what we should do. Ping @Nuria @elukey

Dzahn removed a subscriber: Dzahn. Sep 16 2019, 8:48 PM

This host can keep running with one disk fewer; I fixed it with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538303/

Thanks @elukey . Should we ignore/resolve this alert then? Thanks, Willy

elukey closed this task as Resolved. Tue, Oct 8, 5:00 AM

Yes, correct!