Today, WDQS became unresponsive, leading to HTTP 502 errors. wdqs1003 was the first server to exhibit this behaviour (at 10am UTC) and recovered after a restart. wdqs1001 showed similar behaviour at ~3:30pm UTC.
wdqs1001 was depooled at 3:29pm UTC.
A few thread dumps were taken before wdqs1001 was restarted. They are available in the journal: journalctl -u wdqs-blazegraph -o cat --since="2017-02-28 15:31:00" --until="2017-02-28 16:00:00". A report is available on fastthread.
At 3:58pm, wdqs1001 started to raise OutOfMemoryError. The JVM was not restarted automatically (it seems this can now be done without an external wrapper, using the -XX:+ExitOnOutOfMemoryError JVM option). We are also short on GC metrics (GC logs, heap region metrics, ...). This OutOfMemoryError indicates that a thread was still running and allocating memory after the depool (possibly the updater).
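
To make the two gaps above concrete (no automatic exit on OutOfMemoryError, no GC logs), here is a minimal sketch of JVM options that could be added to the Blazegraph startup command. It assumes a Java 8 HotSpot JVM at 8u92 or later, which is what -XX:+ExitOnOutOfMemoryError requires. The LOG_DIR path, the JVM_OPTS variable name and the rotation sizes are hypothetical, not taken from the actual WDQS configuration.

```shell
# Hypothetical log directory; adjust to the real deployment.
LOG_DIR="/var/log/wdqs"

# Assumption: Java 8 HotSpot (8u92+). These flags are the Java 8 spellings;
# Java 9+ replaces the GC logging flags with -Xlog:gc*.
JVM_OPTS="-XX:+ExitOnOutOfMemoryError \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=${LOG_DIR} \
-XX:+PrintGCDetails \
-XX:+PrintGCDateStamps \
-Xloggc:${LOG_DIR}/gc.log \
-XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=10 \
-XX:GCLogFileSize=20M"

# Print the resulting option string so it can be reviewed before use.
echo "$JVM_OPTS"
```

With -XX:+ExitOnOutOfMemoryError the JVM exits on the first OutOfMemoryError, so the service manager (for example systemd with a Restart= policy) would be responsible for bringing it back up. Whether that is the desired behaviour here is a judgment call for the service owners.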