
Investigate OOM and elevated read latencies on 1007
Closed, Resolved · Public

Assigned To
Authored By
GWicke
Oct 27 2015, 10:39 PM
Referenced Files
F2897788: pasted_file
Oct 30 2015, 2:28 PM
F2897798: pasted_file
Oct 30 2015, 2:28 PM
F2892929: rx_tx.png
Oct 29 2015, 6:25 PM
F2885569: cf-latency.png
Oct 28 2015, 3:01 PM
F2885548: requests.png
Oct 28 2015, 3:01 PM
F2885539: network.png
Oct 28 2015, 3:01 PM
F2885532: latency.png
Oct 28 2015, 3:01 PM
F2883284: pasted_file
Oct 27 2015, 10:39 PM

Description

Earlier this morning Cassandra on restbase1007 OOMed unexpectedly. Grafana also shows elevated read latencies in the last seven days:

pasted_file (230 KB)

The Cassandra instance was restarted after the OOM, but read latencies are still elevated.

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke updated the task description.

A couple more notes:

  • iowait is slightly higher on 100[7-9], but did not change recently; this is not surprising, as those nodes have half the RAM compared to 100[1-6]
  • read request throughput on 1007 has been the lowest in the cluster for several months
    • disk read throughput has been mid-range for the cluster over the same period, suggesting that much of the disk read load is compaction-driven (see the compaction check sketched below)
  • logstash shows messages corresponding to memory pressure on 1007 on Oct 23. These might be caused by a combination of new mobile HTML tables being created while several large compactions were in progress.
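A quick way to check whether compaction is keeping up on a node (a minimal sketch; assumes shell access and a working nodetool on the host):

nodetool compactionstats            # pending and active compactions on this node
nodetool compactionhistory | head   # recently completed compactions
nodetool cfstats                    # per-table SSTable counts and partition sizes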

[15:52] <urandom> some angry red disk errors in dmesg
[15:52] <urandom> for ata2
[15:52] <gwicke> yeah, but those are very old
[15:53] <gwicke> latest timestamp is 7933825, last error at 4575459
[15:54] <gwicke> given 92 day uptime, that was about 38 days ago
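For reference, the arithmetic above is just the difference between dmesg timestamps (seconds since boot); a minimal shell sketch of the same check:

last_error=4575459                   # dmesg timestamp of the last ata2 error
now=$(cut -d' ' -f1 /proc/uptime)    # current seconds since boot
echo "last error ~$(( (${now%.*} - last_error) / 86400 )) days ago"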

Not sure if this is relevant, but it seems that the resync of the swap array (md1) is pending on 1007:

gwicke@restbase1007:~$ cat /proc/mdstat 
Personalities : [raid0] [raid1] 
md0 : active raid1 sdb1[1] sda1[0]
      29279232 blocks super 1.2 [2/2] [UU]
      
md2 : active raid0 sda3[0] sdb3[1]
      1939599360 blocks super 1.2 512k chunks
      
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
        resync=PENDING
      
unused devices: <none>

Rebooted 1007 now, restarted instances, and ran sudo mdadm --readwrite md1 to get md1 back into read-write mode. Let's see if this changes anything about the read latencies.
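For reference, one way to confirm that md1 actually left the auto-read-only state (illustrative commands, run on restbase1007):

grep -A 2 '^md1' /proc/mdstat                 # resync=PENDING should be gone
sudo mdadm --detail /dev/md1 | grep -i state  # reports the current array state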

Cassandra on 1007 died yet again today at 11:35 UTC:

(11:35:01) icinga-wm: PROBLEM - service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
(11:36:00) icinga-wm: PROBLEM - Cassandra CQL query interface on restbase1007 is CRITICAL: Connection refused

Grafana shows elevated GC times.
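For reference, the kind of quick triage that goes with such an alert (illustrative commands; the heap dump location is an assumption):

systemctl status cassandra                       # confirm the unit is in a failed state
sudo journalctl -u cassandra --since today | grep -iE 'OutOfMemory|Killed'
ls -lh /var/lib/cassandra/*.hprof 2>/dev/null    # any heap dump left behind? (path assumed)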

Just to add to this: this was another OOM exception. It looks like things started going south somewhere around 05:00 and got progressively worse until the crash around 11:30. Since heap dump files are not currently getting unique names, I've moved the heap dump to a sub-directory of my home on 1007 (/home/eevans/2015-10-28_OOM), and will begin an analysis.
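One way to avoid that naming collision going forward: when -XX:HeapDumpPath points at a directory, the JVM writes dumps as java_pid<pid>.hprof inside it, so dumps from different process runs no longer overwrite each other. A sketch of the relevant cassandra-env.sh lines (the directory is an assumption):

JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/srv/cassandra/heap-dumps"   # a directory, not a fixed file name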

GWicke raised the priority of this task from High to Unbreak Now!. Oct 28 2015, 2:52 PM

Update:

I'm still working this problem, and don't yet have an answer, but thought it might be helpful to summarize where we are.

The read request latency metric seems to be worsening, and write latency is looking quite bad as well.

latency.png (102 KB)

Interestingly, received bytes on this node are higher too, and the timing correlates with the increased latency.

network.png (297×948 px, 55 KB)

This additional traffic does not appear to be explained by additional requests; request counts seem on par with the other nodes.

requests.png (150 KB)

Also of note: the request latency metric comes from Cassandra's StorageProxy#read; essentially, it times how long the local and remote reads take. Local reads don't seem to be the issue...

cf-latency.png (127 KB)

... so next steps will be to try to isolate which non-local parts of read() are the issue.
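One way to split the two views with nodetool (keyspace/table names are placeholders):

nodetool proxyhistograms                   # coordinator-level latency, including waits on remote replicas
nodetool cfhistograms <keyspace> <table>   # local read path only, per table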

As for the heap dumps:

Yesterday's (3.8G) dump is... inconclusive; I can't make sense of what I'm seeing. Today's (16G) dump has proven too large for me to open, so @fgiunchedi is presently transferring it to his machine so we can try to analyze it there.
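If opening the full dump in a GUI keeps failing, one option is Eclipse MAT's headless parser, which builds the indices and a leak-suspects report from the command line (a sketch; assumes a MAT install on a machine with enough RAM, and the dump file name is a placeholder):

# From the Eclipse MAT installation directory; raise -Xmx in MemoryAnalyzer.ini first
./ParseHeapDump.sh /home/eevans/2015-10-28_OOM/<dump>.hprof org.eclipse.mat.api:suspects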

I have restarted 1007 with a 20G heap, as well as a slightly lower initiation threshold (30%) and target latency (150ms). The hope is that this will buy us time and possibly lower latencies while we establish what exactly is causing the memory shortage. https://gerrit.wikimedia.org/r/#/c/248960/ also just lowered the heap size from 16G to 12G; we might want to reconsider that heap size change.
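For reference, with G1 those settings map roughly onto JVM options like the following (a sketch of cassandra-env.sh lines, assuming G1 is the collector in use here):

JVM_OPTS="$JVM_OPTS -Xms20G -Xmx20G"                        # fixed 20G heap
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=30"  # start concurrent marking at 30% heap occupancy
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=150"               # 150 ms pause-time target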

The increased inbound traffic is interesting.

It doesn't seem to be matched by an equivalent increase in outbound traffic. That makes it less likely that the Cassandra driver is selecting restbase1007 as the coordinator more often, as that should also show up as an increase in outgoing traffic.

A change in request sizes could perhaps explain this. The gradual increase in incoming traffic suggests a cumulative effect. We had a bug causing repeated escaping before, so I wonder if a similar issue could be at work here. On the other hand, the effect seems to be localized to one node, which would be less likely if this were a general bug across the cluster.

> Rebooted 1007 now, restarted instances, and ran sudo mdadm --readwrite md1 to get md1 back into read-write mode. Let's see if this changes anything about the read latencies.

FTR, that is an mdadm state where the underlying devices haven't been written to yet. In md1's case that's because the machine hasn't touched swap yet; it will go read-write by itself as soon as that happens. It isn't related to the Cassandra latencies in any way.

Something of note: the 'aberrant data / one bad row' theory seemed to be contradicted by the lack of a corresponding outbound bandwidth increase on the other nodes. However (courtesy of @fgiunchedi), there does appear to be a corresponding increase on 1002, 1004, and 1006. It's easier to see this in a per-host breakdown on Ganglia.

You can see this in Grafana as well, if you fiddle with scale, and isolate the hosts:

rx_tx.png (87 KB)
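For a quick per-host sanity check outside Ganglia/Grafana, the raw interface counters can also be compared directly (a rough sketch; assumes eth0 is the primary interface, and the counters are cumulative, so sample twice to get a rate):

for h in restbase100{1..9}; do
  printf '%s ' "$h"
  ssh "$h" 'cat /sys/class/net/eth0/statistics/rx_bytes /sys/class/net/eth0/statistics/tx_bytes' | xargs
done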

As of about 5:30 am (UTC) this morning, client read latency has nearly converged across the cluster (1007's latency appears to have come down somewhat, while the other eight machines have increased slightly).

pasted_file (749 KB)

Received bytes have similarly converged.

pasted_file (487 KB)

GWicke lowered the priority of this task from Unbreak Now! to Medium. Nov 11 2015, 6:14 AM
GWicke claimed this task.

Here is a recollection of what I learned while looking into this:

Because of a bug, the page_revisions table got a write on each re-render, accumulating hundreds of thousands of revision entries per title over time. The write rate was around 200-300/s, with some articles being re-rendered about once per minute on average. An additional bug in the monitoring spec deployed earlier in the week (requests sent with a no-cache header) caused the page 'Foobar' to be re-rendered on each monitoring script run, on any node.

Reads and compaction of those very wide rows caused high IO load, which in turn caused compaction to fall behind a little, leading to a higher density of tombstones (except in the revision table) and a higher number of SSTables involved in reads. Range reads for revisions were accessing a large number of entries across many SSTables, resulting in large memory consumption.
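For reference, the kind of check that makes such wide rows visible (illustrative; the keyspace/table names are placeholders, not the actual RESTBase names):

# Look for very large "Compacted partition maximum bytes" and high
# "Maximum live cells per slice" / "Maximum tombstones per slice" values
nodetool cfstats <keyspace>.<revisions_table>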

Steps we took to resolve this:

  • Removed the no-cache header from the monitoring script.
  • Fixed the revision save bug (a broken deep equality check), and deleted the old revision entries. IOWait dropped by half, latency by almost half.
  • Moved from leveled compaction to date-tiered compaction (DTCS) to improve compaction efficiency and handle highly skewed update patterns more robustly (sketched below): T117115
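The compaction strategy switch itself is a per-table schema change along these lines (keyspace/table names are placeholders, not the actual RESTBase table names):

cqlsh restbase1007 -e "ALTER TABLE <keyspace>.<table>
  WITH compaction = {'class': 'DateTieredCompactionStrategy'};"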

Created T118524 for proactively alerting on similar anomalous storage trends in the future.