wdqs1001 and wdqs1003 unresponsive
Closed, Resolved (Public)

Authored By: Gehel, Feb 28 2017, 5:02 PM

Description

Today, WDQS became unresponsive, leading to HTTP 502 errors. wdqs1003 was the first to exhibit this behaviour (at 10am UTC) and recovered after a restart. wdqs1001 showed similar behaviour at ~3:30pm UTC.

wdqs1001 was depooled at 3:29pm UTC.

A few thread dumps were taken before wdqs1001 was restarted (available in journalctl: journalctl -u wdqs-blazegraph -o cat --since="2017-02-28 15:31:00" --until="2017-02-28 16:00:00"). A report is available on fastthread.
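For a quick first look at such dumps before uploading them anywhere, thread states can be tallied locally. This is only a minimal sketch: it assumes the journal output was saved as text and that it contains standard HotSpot `java.lang.Thread.State:` lines. The function name and the sample dump are hypothetical, not taken from the actual wdqs1001 logs.

```python
import re
from collections import Counter

# HotSpot thread dumps print one "java.lang.Thread.State: <STATE>" line per thread.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(dump_text):
    """Count how many threads are in each state in a thread dump."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical sample, for illustration only (not real wdqs output).
sample = '''\
"worker-1" #12 prio=5 tid=0x01 nid=0x0a runnable
   java.lang.Thread.State: RUNNABLE
"worker-2" #13 prio=5 tid=0x02 nid=0x0b waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"worker-3" #14 prio=5 tid=0x03 nid=0x0c waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-4" #15 prio=5 tid=0x04 nid=0x0d waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

states = count_thread_states(sample)
for state, count in states.most_common():
    print(state, count)
```

A large share of BLOCKED threads would point towards lock contention, while mostly RUNNABLE threads alongside OutOfMemoryError would be more consistent with memory pressure.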

At 3:58pm, wdqs1001 started to raise OutOfMemoryError. The JVM was not restarted (it seems this can now be done without an external wrapper, via the ExitOnOutOfMemoryError JVM option). We are also short on GC metrics (GC logs, heap region metrics, ...). This OutOfMemoryError indicates that a thread was still running and allocating memory after the depool (maybe the updater).

Related Objects

Mentioned In: T158874: wdqs replication lagging behind on some nodes

Event Timeline

Gehel created this task. Feb 28 2017, 5:02 PM

Restricted Application added projects: Wikidata, Discovery-ARCHIVED. Feb 28 2017, 5:02 PM

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2017-02-28T17:04:18Z] <gehel> restarting blazegraph on wdqs1001 - T159245

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Mar 2 2017, 11:01 PM

debt triaged this task as High priority. Mar 2 2017, 11:02 PM

I've bumped the memory so it should be better now. ExitOnOutOfMemoryError is an interesting possibility. @Gehel let's talk about it.

Smalyshev merged a task: T158874: wdqs replication lagging behind on some nodes. Mar 2 2017, 11:30 PM

Smalyshev mentioned this in T158874: wdqs replication lagging behind on some nodes.

Smalyshev moved this task from Incoming to Current work on the Wikidata-Query-Service board.

Smalyshev added a subscriber: debt.

Smalyshev moved this task from Backlog to In progress on the Discovery-Wikidata-Query-Service-Sprint board. Mar 4 2017, 1:07 AM

An early report based on GC logs is available. This report analyses only 14h of data, so no hard conclusions yet. Still, a few things to note:

We use G1. I do not have much experience with it, so I need to learn!
Over the period analysed, we rarely go over 8 GB of heap after GC. This tends to indicate that 16 GB is overkill and probably has a negative influence on throughput. More data is needed to validate this hypothesis.
Over the period analysed, we have an average allocation rate of ~400 MB/sec. This is high! Not crazy high, but still high. I suspect that some of the issues we have seen might be related to increased activity (user traffic, updates, ...), which would push the allocation rate above what the GC can cope with. This is pure speculation at this point. If it holds, it is bad news, as this might become our bottleneck and there is no easy way to overcome memory throughput issues.
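For reference, the two figures discussed above (heap occupancy after GC and average allocation rate) can be derived from per-collection records roughly as follows. This is a minimal sketch, not the method used by the GC report: the `GcEvent` structure, the function names and the sample numbers are all assumptions for illustration, and parsing the real GC log into such records is left out.

```python
from dataclasses import dataclass


@dataclass
class GcEvent:
    """One collection, as it might be extracted from a GC log (hypothetical shape)."""
    time_s: float          # seconds since JVM start
    heap_before_mb: float  # heap occupancy just before the collection
    heap_after_mb: float   # heap occupancy just after the collection


def allocation_rate_mb_per_s(events):
    """Average allocation rate over the period covered by the events."""
    if len(events) < 2:
        raise ValueError("need at least two GC events")
    allocated = 0.0
    for prev, cur in zip(events, events[1:]):
        # Memory allocated between two collections: occupancy before this
        # collection minus occupancy left after the previous one.
        allocated += cur.heap_before_mb - prev.heap_after_mb
    elapsed = events[-1].time_s - events[0].time_s
    if elapsed <= 0:
        raise ValueError("events must span a positive time interval")
    return allocated / elapsed


def peak_heap_after_gc_mb(events):
    """Largest heap occupancy observed right after a collection."""
    return max(e.heap_after_mb for e in events)


# Made-up sample values, chosen only to illustrate the arithmetic.
events = [
    GcEvent(0.0, 6000.0, 2000.0),
    GcEvent(10.0, 6000.0, 2100.0),
    GcEvent(20.0, 6100.0, 2050.0),
]

print(allocation_rate_mb_per_s(events))  # 8000 MB allocated over 20 s
print(peak_heap_after_gc_mb(events))
```

If the peak after GC stays well below the configured heap size over a longer window, that would support the hypothesis that the heap is oversized. A sustained allocation rate close to what the collector can reclaim would support the bottleneck concern.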

I think we can resolve the issue for now, since bumping the memory limit solved the immediate issue. I agree that we need to watch GC stats and see whether we have trouble there.

Smalyshev removed a project: Discovery-Wikidata-Query-Service-Sprint. Jul 14 2017, 10:31 PM
