
Investigate slowness on an-worker1132
Closed, Resolved (Public)

Description

an-worker1132 is raising multiple alerts like the one below:

PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

CPU load is indeed high, which seems to be linked to very slow disks: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1132&var-datasource=thanos&var-cluster=analytics&from=1677659659968&to=1677694130819

We need to investigate the root cause and remove the node from the cluster.

Event Timeline

Change 893644 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] hadoop: exclude an-worker1132 node from hdfs and yarn

https://gerrit.wikimedia.org/r/893644

Change 893644 merged by Nicolas Fraison:

[operations/puppet@production] hadoop: exclude an-worker1132 node from hdfs and yarn

https://gerrit.wikimedia.org/r/893644

Node an-worker1132 is being decommissioned.

Downtimed the node to avoid false alerts.

For reference, here is a disk benchmark from an-worker1131:

nfraison@an-worker1131:/var/lib/hadoop/data/d/test$ sudo sysbench fileio --file-test-mode=rndrw run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      7863.21
    writes/s:                     5242.14
    fsyncs/s:                     16781.84

Throughput:
    read, MiB/s:                  122.86
    written, MiB/s:               81.91

General statistics:
    total time:                          10.0066s
    total number of events:              299078

Latency (ms):
         min:                                    0.00
         avg:                                    0.03
         max:                                   24.48
         95th percentile:                        0.16
         sum:                                 9901.14

Threads fairness:
    events (avg/stddev):           299078.0000/0.00
    execution time (avg/stddev):   9.9011/0.00

On an-worker1132, all disks show the same stats: no more than 4 MiB/s for reads or writes, and 238/158 read/write IOPS. Example for one of them:

/var/lib/hadoop/data/l
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      238.00
    writes/s:                     158.67
    fsyncs/s:                     516.56

Throughput:
    read, MiB/s:                  3.72
    written, MiB/s:               2.48

General statistics:
    total time:                          10.0796s
    total number of events:              9081

Latency (ms):
         min:                                    0.00
         avg:                                    1.10
         max:                                   51.37
         95th percentile:                        8.13
         sum:                                 9994.87

Threads fairness:
    events (avg/stddev):           9081.0000/0.00
    execution time (avg/stddev):   9.9949/0.00
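To put a number on the gap between the two hosts, the sysbench figures above can be compared directly. This is only a minimal sketch: `parse_sysbench` is a hypothetical helper written for this comment, not part of sysbench, and the input is an excerpt of the two outputs pasted above.

```python
import re


def parse_sysbench(text: str) -> dict[str, float]:
    """Extract 'name: value' numeric fields from sysbench fileio output."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z0-9/, ]+?):\s+([0-9.]+)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Excerpt of the an-worker1131 run (healthy reference host).
healthy = parse_sysbench("""
    reads/s:                      7863.21
    writes/s:                     5242.14
    read, MiB/s:                  122.86
    written, MiB/s:               81.91
""")

# Excerpt of the an-worker1132 run on /var/lib/hadoop/data/l.
slow = parse_sysbench("""
    reads/s:                      238.00
    writes/s:                     158.67
    read, MiB/s:                  3.72
    written, MiB/s:               2.48
""")

for name, value in healthy.items():
    print(f"{name}: {value / slow[name]:.1f}x lower on an-worker1132")
```

Every metric comes out roughly 33 times lower on an-worker1132, which points at something common to all disks rather than at a single failing drive.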

The broken disk is /dev/sdg1, which is mounted read-only: /dev/sdg1 on /var/lib/hadoop/data/g type ext4 (ro,noatime)

The RAID cache policy currently in effect on an-worker1132 is WriteThrough instead of WriteBack.

  • On an-worker1131
nfraison@an-worker1131:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
  • On an-worker1132
nfraison@an-worker1132:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 446.625 GB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 446.625 GB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 2
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
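The difference between the default and the current cache policy can also be spotted mechanically. Below is a minimal sketch, assuming the `megacli -LDInfo` text has already been captured as a string; `cache_policy_drift` is a hypothetical helper name, and the sample is an excerpt of the an-worker1132 output above.

```python
def cache_policy_drift(ldinfo_output: str) -> list[tuple[str, str, str]]:
    """Return (virtual drive, default policy, current policy) where they differ."""
    drifts = []
    drive = None
    default = None
    for line in ldinfo_output.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Virtual Drive":
            drive, default = value, None
        elif key == "Default Cache Policy":
            default = value
        elif key == "Current Cache Policy":
            if default is not None and value != default:
                drifts.append((drive, default, value))
    return drifts


# Excerpt of the an-worker1132 output pasted above.
sample = """\
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

for drive, default, current in cache_policy_drift(sample):
    print(f"VD {drive}: default is '{default}' but current is '{current}'")
```

On an-worker1131 the two policy lines are identical, so the same function would return an empty list there.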

BBU looks fine

nfraison@an-worker1132:~$ sudo megacli -AdpBbuCmd -aALL
                                     
BBU status for Adapter: 0

BatteryType: BBU
Voltage: 3858 mV
Current: 0 mA
Temperature: 31 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Learn Cycle Requested	                  : No
  Learn Cycle Active                      : No
  Learn Cycle Status                      : OK
  Learn Cycle Timeout                     : No
  I2c Errors Detected                     : No
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No
  Periodic Learn Required                 : No
  Transparent Learn                       : No
  No space to cache offload               : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required  : No
  Module microcode update required        : No

BBU GasGauge Status: 0x0138 
Relative State of Charge: 87 %
Charger Status: Complete
Remaining Capacity: 360 mAh
Full Charge Capacity: 417 mAh
isSOHGood: Yes
  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

  Relative State of Charge: 87 %
  Absolute State of charge: 0 %
  Remaining Capacity: 360 mAh
  Full Charge Capacity: 417 mAh
  Run time to empty: Battery is not being charged.  
  Average time to empty: 29 Min. 
  Estimated Time to full recharge: Battery is not being charged.  
  Cycle Count: 3
Max Error = 0 %
Remaining Capacity Alarm = 0 mAh
Remining Time Alarm = 0 Min

BBU Design Info for Adapter: 0

  Date of Manufacture: 00/00, 0000
  Design Capacity: 0 mAh
  Design Voltage: 0 mV
  Specification Info: 0
  Serial Number: 0
  Pack Stat Configuration: 0x0000
  Manufacture Name: 0x129
  Firmware Version   : 0.6
  Device Name: 
  Device Chemistry: 
  Battery FRU: N/A
Module Version = 0.6
  Transparent Learn = 1
  App Data = 0

BBU Properties for Adapter: 0

  Auto Learn Period: 90 Days
  Next Learn time: None  Learn Delay Interval:0 Hours
  Auto-Learn Mode: Disabled

Exit Code: 0x00
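The "BBU looks fine" reading of the output above can be expressed as a quick check of the firmware status flags. This is only a sketch: `bbu_firmware_warnings` is a hypothetical helper, and treating `No`, `OK` and `None` as the healthy values is an assumption based on the output above, not on megacli documentation.

```python
def bbu_firmware_warnings(bbu_output: str) -> list[str]:
    """List 'BBU Firmware Status' fields whose value is not a known-healthy one."""
    ok_values = {"No", "OK", "None"}  # assumption: values seen on a healthy BBU
    warnings = []
    in_section = False
    for line in bbu_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("BBU Firmware Status"):
            in_section = True
            continue
        if stripped.startswith("BBU GasGauge Status"):
            break  # end of the firmware status section
        if not in_section or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        if value.strip() not in ok_values:
            warnings.append(f"{key.strip()}: {value.strip()}")
    return warnings


# Excerpt of the an-worker1132 output pasted above.
sample = """\
Battery State: Optimal
BBU Firmware Status:

  Charging Status              : None
  Voltage                                 : OK
  Temperature                             : OK
  Battery Pack Missing                    : No
  Battery Replacement required            : No
  Remaining Capacity Low                  : No

BBU GasGauge Status: 0x0138
Relative State of Charge: 87 %
"""

print(bbu_firmware_warnings(sample) or "no firmware status warnings")
```

An empty result only means that none of the firmware flags is raised; it does not explain why the controller fell back to WriteThrough.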

Forcing the cache policy to WriteBack doesn't work: sudo megacli -LDSetProp -WB -Immediate -Lall -aAll

Change 895830 had a related patch set uploaded (by Nicolas Fraison; author: Nicolas Fraison):

[operations/puppet@production] hadoop: add back an-worker1132

https://gerrit.wikimedia.org/r/895830

Adding the node back, as all partitions are available and the disk cache is back.

Change 895830 merged by Nicolas Fraison:

[operations/puppet@production] hadoop: add back an-worker1132

https://gerrit.wikimedia.org/r/895830