Suspected faulty SSD on graphite1001
Closed, Resolved (Public)

Description

I noticed graphite1001 was reporting unusually high iowait, and people reported slower-than-usual results. Looking further, it seems one of the SSDs (sdc) is constantly at 100% utilization and generally slower than the rest:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    3.00     0.00  1144.00   762.67   134.06 8116.00    0.00 8116.00 333.33 100.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
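
For anyone who wants to spot this kind of outlier without eyeballing the table, here is a minimal sketch (not something that was run for this task) that filters extended iostat output for devices at or above a utilization threshold. The function name busy_devices and the default threshold of 90 are assumptions for illustration. It finds the %util column from the header row instead of hard-coding its position.

```shell
# Hypothetical helper: reads `iostat -x` style output on stdin and prints
# "<device> <%util>" for every device at or above the given threshold.
busy_devices() {
    awk -v t="${1:-90}" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Data rows: compare the %util column numerically against the threshold.
        col && NF >= col && $col + 0 >= t + 0 { print $1, $col }
    '
}

# Example (on a live host): iostat -x 5 2 | busy_devices 90
```

Fed the table above, this would report only sdc.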

smartctl is also somewhat slow to report results, I suspect because the SSD is busy:

root@graphite1001:~# smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-100-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSC2BB600G4
Serial Number:    XXX
LU WWN Device Id: 5 5cd2e4 04ba1d41f
Firmware Version: D2010370
User Capacity:    600,127,266,816 bytes [600 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Feb  2 12:51:34 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x79) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17925
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always       -       455676396140
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   080   080   000    Old_age   Always       -       20 (Min/Max 16/25)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       31
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       61499962
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       102400
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       30
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1075505
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       61499962
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       27237993

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@graphite1001:~#
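
The attribute table is long, so a small filter can help pick out attributes whose normalized VALUE is low. This is only a sketch: the function name low_smart_attrs and the cutoff of 10 are assumptions, and whether a low normalized value actually indicates a problem depends on how the vendor defines each attribute.

```shell
# Hypothetical helper: reads `smartctl -A` style output on stdin and prints
# "<attribute name> <normalized VALUE>" for attributes at or below a cutoff.
low_smart_attrs() {
    awk -v limit="${1:-10}" '
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 4 is the normalized VALUE.
        $1 ~ /^[0-9]+$/ && NF >= 10 && $4 ~ /^[0-9]+$/ && $4 + 0 <= limit + 0 { print $2, $4 }
    '
}

# Example (on a live host): smartctl -A /dev/sdc | low_smart_attrs 10
```

On the output above, the only attribute this would list is Media_Wearout_Indicator, with a value of 001.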

@Cmjohnson do you have a spare SSD to try and swap with this?

Event Timeline

RobH created subtask Unknown Object (Task). Feb 2 2017, 6:28 PM

Note that the same behaviour is now showing up on sdb too. I've asked @RobH to bump the quantity to order in T157065, assuming that in the worst case all SSDs will eventually show the same behaviour and need replacement.

In terms of mitigation strategies:

  • graphite1001 appears to be slower than usual but otherwise working
  • graphite1001 is the main entry point for statsd traffic (statsd.eqiad.wmnet) and carbon traffic (graphite-in.eqiad.wmnet)
  • statsite aggregates statsd traffic and sends it to carbon-c-relay via graphite-in.eqiad.wmnet
  • graphite1003 / graphite2002 are dedicated to cassandra metrics, cassandra sends directly to graphite1003
  • the carbon-c-relay on both graphite1001 and graphite1003 takes care of sending the traffic locally (to eqiad) and remotely (to its counterpart in codfw)

If graphite1001's disks are slow but the machine itself is usable, we can point grafana to graphite2001 instead and stop carbon-cache on graphite1001 from writing to disk. graphite1001/graphite1003 will still receive metrics and send to eqiad/codfw as usual; clients (grafana), however, will send their read queries to codfw.

A full failover (i.e. pointing metrics producers to codfw, in case graphite1001 was fully down) requires switching DNS for statsd.eqiad.wmnet to point to graphite2001 and ditto for diamond (in puppet). Essentially like it was done for tungsten -> graphite1001 in https://gerrit.wikimedia.org/r/#/c/188539 https://gerrit.wikimedia.org/r/#/c/188036 and related task T85909.
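
Since the full failover hinges on where statsd.eqiad.wmnet points, a quick check of the alias target before and after the DNS flip can be handy. The sketch below parses the "is an alias for" line that `host` prints for a CNAME. The function name cname_target is an assumption, and so is the exact target hostname used in the example comment.

```shell
# Hypothetical helper: reads `host <name>` output on stdin and prints the
# CNAME target (without the trailing dot), if the name is an alias.
cname_target() {
    awk '/ is an alias for / { sub(/\.$/, "", $NF); print $NF; exit }'
}

# Example (on a live host):
#   host statsd.eqiad.wmnet | cname_target
# After the failover this would be expected to print the codfw host
# (the exact name, e.g. graphite2001.codfw.wmnet, is an assumption here).
```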

Change 335761 had a related patch set uploaded (by Filippo Giunchedi):
cache: move graphite/performance backends to graphite2001

https://gerrit.wikimedia.org/r/335761

Change 335762 had a related patch set uploaded (by Filippo Giunchedi):
graphite: move performance::site to graphite2001

https://gerrit.wikimedia.org/r/335762

Change 335763 had a related patch set uploaded (by Filippo Giunchedi):
graphite: move alerts to graphite2001

https://gerrit.wikimedia.org/r/335763

Change 335764 had a related patch set uploaded (by Filippo Giunchedi):
diamond: switch to graphite2001

https://gerrit.wikimedia.org/r/335764

Change 335765 had a related patch set uploaded (by Filippo Giunchedi):
graphite: switch graphite alerts to graphite2001

https://gerrit.wikimedia.org/r/335765

Change 335766 had a related patch set uploaded (by Filippo Giunchedi):
graphite: switch to graphite2001

https://gerrit.wikimedia.org/r/335766

I've staged the patches needed for failover in the series of reviews above. There's also a graphite-codfw dashboard at https://grafana.wikimedia.org/dashboard/db/graphite-codfw; some of its graphite-related metrics won't be right until the failover happens, but the system metrics are correct.

The trickiest part is likely to be DNS changes since not all statsd/graphite clients might pick up the CNAME change. Also note that graphite1003 / graphite2002 (i.e. cassandra metrics) are unaffected and don't need to be failed over.
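
One possible way to find clients that did not pick up the CNAME change is to look, on each client host, for sockets still pointed at the old address. A rough sketch, assuming the clients use connected UDP sockets (unconnected senders show no peer address in `ss`) and the common statsd default port 8125. The function name and the addresses in the example are hypothetical.

```shell
# Hypothetical helper: reads `ss -unp` style output on stdin and prints the
# lines whose peer is still the old statsd address.
#   $1: old statsd IP address
#   $2: UDP port (defaults to 8125, the usual statsd port; an assumption here)
stale_statsd_clients() {
    awk -v peer="$1:${2:-8125}" '
        # Match on any whitespace-separated field so the exact column layout
        # of the ss output does not matter.
        { for (i = 1; i <= NF; i++) if ($i == peer) { print; break } }
    '
}

# Example (on a client host, with the old IP filled in):
#   ss -unp | stale_statsd_clients 192.0.2.10
```

Any process listed would still need a restart to re-resolve the name.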

Change 335762 merged by Filippo Giunchedi:
graphite: move performance::site to graphite2001

https://gerrit.wikimedia.org/r/335762

Change 335761 merged by Filippo Giunchedi:
cache: move graphite/performance backends to graphite2001

https://gerrit.wikimedia.org/r/335761

Mentioned in SAL (#wikimedia-operations) [2017-02-03T17:05:34Z] <godog> fail over read traffic from graphite1001 to graphite2001 https://gerrit.wikimedia.org/r/335761 - T157022

Read traffic has been switched over to graphite2001 now and seems to work.

Note that graphite2001 was unable to talk to eventlog1001. The root cause is that stat1001 was still in ferm's configuration and @resolve failed for it, so the rules were not reloaded (puppet didn't fail on this either).

I've also searched for graphite1001's address in router configs and the only place it shows up is the analytics-in4 filter for carbon/statsd traffic.

graphite2001 has been added to cr1/cr2 for analytics-in4

@fgiunchedi I have the SSDs on-site. The disk is in a 3.5" internal disk bay, so the server will need to be powered off for the replacement.

Can you elaborate? Where was stat1001 used in ferm configuration? resolve() is only resolved during ferm startup/restarts, i.e. if an IP behind a CNAME changes, ferm needs a reload.

The replacement SSDs have arrived onsite, and planning for replacing them can take place on this task.

Thanks @RobH and @Cmjohnson! We'll need to switch the write traffic to graphite2001 before taking down graphite1001. I'm planning to do that tomorrow (Wed 8th, still traveling) or Thurs 9th at the latest, once I'm back home.

stat1001 is still used on eventlog1001 in ferm's rsync rules, and indeed that was what prevented a ferm reload. The only place I can still see stat1001 is in statistics_servers in hieradata:

eventlog1001:~$ sudo grep -ir stat1001 /etc/ferm/
/etc/ferm/conf.d/10_eventlogging_rsyncd:&R_SERVICE(tcp, 873, @resolve((stat1001.eqiad.wmnet stat1002.eqiad.wmnet stat1003.eqiad.wmnet analytics1027.eqiad.wmnet dataset1001.wikimedia.org thorium.eqiad.wmnet)));
eventlog1001:~$ host stat1001.eqiad.wmnet
Host stat1001.eqiad.wmnet not found: 3(NXDOMAIN)
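
To catch other stale names of this kind, one could list every hostname passed to @resolve in the ferm rules and check that each still resolves. The following is a sketch under assumptions: the function name resolve_hosts is made up, it relies on `grep -o` (available in GNU and BSD grep), and it only handles the double-parenthesis list form shown in the rule above.

```shell
# Hypothetical helper: reads ferm configuration on stdin and prints each
# hostname found inside @resolve((...)) lists, one per line.
resolve_hosts() {
    grep -o '@resolve(([^)]*))' \
        | sed -e 's/^@resolve((//' -e 's/))$//' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

# Example (on a live host): report names that no longer resolve.
#   cat /etc/ferm/conf.d/* | resolve_hosts | while read -r h; do
#       host "$h" >/dev/null 2>&1 || echo "unresolvable: $h"
#   done
```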

Change 335764 merged by Filippo Giunchedi:
diamond: switch to graphite2001

https://gerrit.wikimedia.org/r/335764

Mentioned in SAL (#wikimedia-operations) [2017-02-09T14:48:58Z] <godog> move diamond traffic to graphite2001 - T157022

Change 336804 had a related patch set uploaded (by Filippo Giunchedi):
diamond: reload on handler config file changes

https://gerrit.wikimedia.org/r/336804

Change 335766 merged by Filippo Giunchedi:
graphite: switch to graphite2001

https://gerrit.wikimedia.org/r/335766

Mentioned in SAL (#wikimedia-operations) [2017-02-09T16:38:11Z] <godog> flip dns records for statsd/carbon to graphite2001 - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-09T18:35:35Z] <godog> bounce zuul to pick up statsd DNS change - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:06:53Z] <godog> roll-restart restbase in codfw to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:20:29Z] <godog> restart of jmxtrans on analytics by elukey - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:30:39Z] <godog> roll-restart restbase in eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:34:01Z] <godog> roll-restart ocg to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:37:14Z] <elukey> roll-restart of aqs to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:39:18Z] <godog> roll-restart parsoid in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T10:54:44Z] <godog> roll-restart kartotherian in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:10:01Z] <godog> restart navtiming ve asset-check statsd-mw-js-deprecate on hafnium to pick up statsd.eqiad.wmnet change - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:16:53Z] <godog> roll-restart tilerator in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:19:12Z] <godog> roll-restart jmxtrans on conf* in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:23:00Z] <godog> roll-restart parsoid on ruthenium to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:25:16Z] <godog> roll-restart nodepool on labnodepool1001 to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:30:13Z] <godog> roll-restart changeprop on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:36:21Z] <godog> roll-restart mathoid/citoid/mobileapps/cxserver/eventstreams/graphoid on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:41:20Z] <godog> roll-restart trendingedits on scb in eqiad/codfw to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T11:46:29Z] <godog> roll-restart tileratorui in codfw/eqiad to pick up new statsd.eqiad.wmnet - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-10T12:00:10Z] <godog> bounce mwerrors on eventlog1001 to pick up statsd cname change - T157022

@Cmjohnson graphite1001 is now drained of traffic, so we can shut it down and swap the SSDs.

Change 335763 merged by Filippo Giunchedi:
graphite: move alerts to graphite2001

https://gerrit.wikimedia.org/r/335763

Mentioned in SAL (#wikimedia-operations) [2017-02-10T20:11:37Z] <godog> silence graphite1001 for ssd reinstall - T157022

Change 337077 had a related patch set uploaded (by Filippo Giunchedi):
install_server: reinstall graphite1001 with jessie

https://gerrit.wikimedia.org/r/337077

Change 337077 merged by Filippo Giunchedi:
install_server: reinstall graphite1001 with jessie

https://gerrit.wikimedia.org/r/337077

All 4 disks have been swapped. The server is on and accessible via mgmt.

Mentioned in SAL (#wikimedia-operations) [2017-02-10T22:42:36Z] <godog> start rsync of whisper metrics graphite2001 -> graphite1001 - T157022

FWIW I think the reason the failed ferm reload on eventlog1001 went unnoticed is something similar to https://gerrit.wikimedia.org/r/#/c/337384/: 'ferm reload' will fail only a single puppet run, but not subsequent ones if the rule files don't change.

Change 337386 had a related patch set uploaded (by Filippo Giunchedi):
coal: run on jessie

https://gerrit.wikimedia.org/r/337386

Change 337386 merged by Filippo Giunchedi:
coal: run on jessie

https://gerrit.wikimedia.org/r/337386

Change 336804 merged by Filippo Giunchedi:
diamond: require $handler to be defined

https://gerrit.wikimedia.org/r/336804

Change 338733 had a related patch set uploaded (by Filippo Giunchedi):
Revert "diamond: switch to graphite2001"

https://gerrit.wikimedia.org/r/338733

Change 338733 merged by Filippo Giunchedi:
Revert "diamond: switch to graphite2001"

https://gerrit.wikimedia.org/r/338733

Mentioned in SAL (#wikimedia-operations) [2017-02-20T11:00:11Z] <godog> switch diamond traffic to graphite1001 - T157022

Change 338745 had a related patch set uploaded (by Filippo Giunchedi):
cache: move graphite/performance to graphite1001

https://gerrit.wikimedia.org/r/338745

Change 338745 merged by Filippo Giunchedi:
cache: move graphite/performance to graphite1001

https://gerrit.wikimedia.org/r/338745

Change 338938 had a related patch set uploaded (by Filippo Giunchedi):
Revert "graphite: switch to graphite2001"

https://gerrit.wikimedia.org/r/338938

Change 338938 merged by Filippo Giunchedi:
Revert "graphite: switch to graphite2001"

https://gerrit.wikimedia.org/r/338938

Mentioned in SAL (#wikimedia-operations) [2017-02-21T08:53:27Z] <godog> switch statsd/graphite DNS to graphite1001 - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T10:55:50Z] <elukey> rolling restart of the analytics jmxtrans daemons for T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T11:02:14Z] <elukey> rolling restart of cassandra-metrics-collector on aqs1* for T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:06:30Z] <godog> roll-restart restbase after statsd move to graphite1001 - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:18:00Z] <mobrovac@tin> Started restart [mathoid/deploy@ba3217e]: Restarting for Graphite DNS switch T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:19:23Z] <mobrovac@tin> Started restart [citoid/deploy@95df861]: Restarting for Graphite DNS switch T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:21:17Z] <mobrovac@tin> Started restart [cxserver/deploy@0e4ae4f]: Restarting for Graphite DNS switch T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:30:19Z] <mobrovac@tin> Started restart [graphoid/deploy@da37386]: Restarting for Graphite DNS switch T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:34:41Z] <mobrovac@tin> Started restart [mobileapps/deploy@cd3b897]: Restarting for Graphite DNS switch T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:39:23Z] <elukey> restart jmxtrans on kafka[12]00[123] for T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:40:05Z] <godog> restart navtiming ve asset-check statsd-mw-js-deprecate on hafnium to pick up statsd.eqiad.wmnet change - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T17:35:55Z] <godog> roll-restart parsoid in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T17:47:12Z] <godog> roll-restart jmxtrans in codfw/eqiad on conf* to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T17:50:36Z] <godog> roll-restart ocg in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T18:03:50Z] <godog> roll-restart trendingedits in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T18:04:41Z] <godog> roll-restart eventstreams in codfw/eqiad to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T18:12:10Z] <godog> roll-restart zuul on cont1001 to pick up statsd.eqiad.wmnet DNS changes - T157022

Mentioned in SAL (#wikimedia-operations) [2017-02-21T18:12:56Z] <godog> roll-restart nodepool on labnodepool1001 to pick up statsd.eqiad.wmnet DNS changes - T157022

The switchback to graphite1001 has been completed. I've updated T88997 (Improve graphite failover) with follow-up on which services didn't follow the CNAME change correctly.

carbon-cache alerts on graphite2001 - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=carbon-cache

Saw puppet is disabled there with a link to this ticket. Acked in icinga.

Change 335765 abandoned by Filippo Giunchedi:
graphite: switch graphite alerts to graphite2001

Reason:
Not needed anymore

https://gerrit.wikimedia.org/r/335765

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:53 PM