wdqs1001 and wdqs1003 unresponsive
Closed, ResolvedPublic


Today, WDQS became unresponsive, leading to HTTP 502 errors. wdqs1003 was the first to exhibit this behaviour (at 10am UTC) and recovered after a restart. wdqs1001 showed similar behaviour at ~3:30pm UTC.

wdqs1001 was depooled at 3:29pm UTC.

A few thread dumps were taken before wdqs1001 was restarted (available in journalctl: journalctl -u wdqs-blazegraph -o cat --since="2017-02-28 15:31:00" --until="2017-02-28 16:00:00"). A report is available on fastthread.
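For the record, a sketch of how dumps like these can be captured without stopping the service (the service name is taken from above; sending SIGQUIT makes HotSpot print a thread dump to stdout, which journald collects — the interval and dump count here are arbitrary):

```shell
#!/bin/sh
# Sketch: capture a few thread dumps from the running wdqs-blazegraph JVM.
# Assumes systemd >= 230 (for --value) and a HotSpot JVM.
pid=$(systemctl show -p MainPID --value wdqs-blazegraph)
for i in 1 2 3; do
  kill -QUIT "$pid"   # SIGQUIT = thread dump to stdout; does NOT stop the JVM
  sleep 10
done
# Later, pull the dumps back out of the journal:
journalctl -u wdqs-blazegraph -o cat --since "10 minutes ago"
```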

At 3:58pm, wdqs1001 started raising OutOfMemoryError. The JVM was not restarted (it seems this can now be done without an external wrapper, via ExitOnOutOfMemoryError). We are also short on GC metrics (GC logs, heap region metrics, ...). This OutOfMemoryError indicates that a thread was still running and allocating memory after the depool (maybe the updater).
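Both gaps could be closed with JVM flags. A sketch of what that might look like on Java 8 (the flags are real HotSpot options, but the java invocation, jar name and log path are placeholders, and -XX:+ExitOnOutOfMemoryError requires 8u92 or later):

```shell
# Sketch only: how the service's JVM could be launched so that it exits on OOM
# (letting systemd restart it) and writes rotating GC logs for later analysis.
# Jar name and log path are assumptions, not the real wdqs-blazegraph setup.
java \
  -XX:+ExitOnOutOfMemoryError \
  -Xloggc:/var/log/wdqs/gc.log \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=20M \
  -jar blazegraph.jar
```

With Restart=on-failure in the systemd unit, an OOM then turns into an automatic restart instead of a wedged process.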

Gehel created this task. Feb 28 2017, 5:02 PM

Mentioned in SAL (#wikimedia-operations) [2017-02-28T17:04:18Z] <gehel> restarting blazegraph on wdqs1001 - T159245

debt triaged this task as High priority. Mar 2 2017, 11:02 PM

I've bumped the memory, so it should be better now. ExitOnOutOfMemoryError is an interesting possibility. @Gehel let's talk about it.

Gehel added a comment. Mar 7 2017, 10:47 AM

An early report based on GC logs is available. This report analyses only 14h of data, so no hard conclusions yet. Still, a few things to note:

  • We use G1; I don't have much experience with it yet, so I need to learn!
  • Over the period analysed, we rarely go over 8GB of heap after GC. This tends to indicate that 16GB is overkill and probably has a negative influence on throughput. More data is needed to validate this hypothesis.
  • Over the period analysed, we have an average allocation rate of ~400MB/sec. This is high! Not crazy high, but still high. I suspect that some of the issues we have seen might be related to increased activity (user traffic, updates, ...), which would push the allocation rate beyond what the GC can cope with. This is pure speculation at this point. It is bad news, as it looks like this might become our bottleneck, and there is no easy way to overcome memory throughput issues.
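The heap-after-GC and allocation-rate figures above can be derived directly from G1 pause lines once GC logging is enabled. A minimal sketch of the arithmetic (allocation between two collections ≈ heap-before of the current GC minus heap-after of the previous one, divided by elapsed time; the log lines below are synthetic sample data in the Java 8 -XX:+PrintGCDetails format, not real wdqs1001 output):

```shell
#!/bin/sh
# Synthetic G1 log sample: <uptime>: [GC pause ... <before>M-><after>M(<total>M), <t> secs]
cat > /tmp/gc-sample.log <<'EOF'
100.0: [GC pause (G1 Evacuation Pause) (young) 6000M->2000M(16384M), 0.05 secs]
110.0: [GC pause (G1 Evacuation Pause) (young) 6000M->2000M(16384M), 0.05 secs]
120.0: [GC pause (G1 Evacuation Pause) (young) 6000M->2000M(16384M), 0.05 secs]
EOF

# For each GC line: allocation rate = (heap before this GC - heap after the
# previous GC) / (time since the previous GC), in MB/s.
awk '/->/ {
  t = $1; sub(/:$/, "", t)
  split($8, a, /M->|M\(/)        # a[1] = heap before GC, a[2] = heap after GC (MB)
  if (prev_t != "")
    printf "%.0f MB/s\n", (a[1] - prev_after) / (t - prev_t)
  prev_t = t; prev_after = a[2]
}' /tmp/gc-sample.log
```

On this sample the script prints "400 MB/s" twice: 4000MB allocated in each 10-second window, the same order of magnitude as the rate observed above.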
Smalyshev closed this task as Resolved. Mar 7 2017, 9:42 PM
Smalyshev claimed this task.

I think we can resolve this for now, since bumping the memory limit solved the immediate issue. I agree that we need to watch GC stats and make sure we don't run into trouble there.