wdqs-updater crashing not cleanly
Closed, Resolved, Public

Description

wdqs-updater crashed on wdqs100[345] at mostly the same time (Apr 23, 2018, ~9:45 UTC).

Looking at the logs on wdqs1004, we can see a non-200 response from Blazegraph (P7025). It looks like Blazegraph was unable to create threads.

The updater did not recover. A thread dump (P7026) shows a non-daemon thread that seems to be related to an HTTP server. We previously thought that this thread was related to jolokia (T188413), but jolokia has since been removed, so we need to find another explanation.
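
For reference, a live non-daemon thread is what keeps a JVM from exiting once its main work has stopped. The following is a minimal sketch (not part of the updater; the class and method names are made up for illustration) of how such threads can be listed from inside a process, as a complement to reading a thread dump:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class NonDaemonThreads {

    // Returns the names of all live threads that are not daemon threads.
    // Any thread listed here can keep the JVM alive after main() has ended.
    static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : liveNonDaemonThreadNames()) {
            System.out.println(name);
        }
    }
}
```

In a dump like P7026 the same information is in the thread header: daemon threads carry the word "daemon" after the thread name, non-daemon threads do not.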

Event Timeline

Gehel created this task. Apr 23 2018, 11:15 AM
Restricted Application added projects: Wikidata, Discovery. Apr 23 2018, 11:15 AM
Restricted Application added a subscriber: Aklapper.

A thread dump from Blazegraph on wdqs1004 does not show anything suspicious (P7027), but that dump was taken ~30 minutes after the wdqs-updater failure.

This might or might not be related to T192759.

The error seems to be "Caused by: java.lang.OutOfMemoryError: unable to create new native thread" - not sure what caused it or why Java didn't reset the process on OOM.

Looking at the thread dump for Updater, I see this:

"Thread-2" #7 prio=5 os_prio=0 tid=0x00007f173c5aa000 nid=0x7c76 runnable [0x00007f170644d000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x000000008006faf8> (a sun.nio.ch.Util$3)
	- locked <0x000000008006fb08> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000008006fab0> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:352)
	at java.lang.Thread.run(Thread.java:748)

Not sure what an HTTP server is doing inside Updater - is it Prometheus?

It looks like the Prometheus agent is at fault: locally, when I add the agent, Updater gets stuck on an error just as it does in production, while without the agent it exits properly. Looking at the Prometheus sources (https://github.com/prometheus/jmx_exporter/blob/master/jmx_prometheus_javaagent/src/main/java/io/prometheus/jmx/JavaAgent.java), the agent is likely the source of the HTTP server. Not sure how to fix it, ideas welcome.
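
To illustrate the mechanism, here is a small sketch using the JDK's built-in com.sun.net.httpserver.HttpServer, which is the server class visible in the stack trace above. It is not the agent's actual code, and the class and method names are made up. It relies on the fact that a new Java thread inherits the daemon flag of the thread that creates it, so the server's dispatcher thread is non-daemon when the server is started from a normal thread and daemon when it is started from a daemon thread. Whether the upstream fix works this way is an assumption, not something confirmed here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, for illustration only.
public class DaemonHttpServerSketch {

    // Starts an HttpServer from a helper thread and reports whether any
    // thread created in the process is still alive and non-daemon, i.e.
    // whether the server would keep the JVM from exiting.
    static boolean leavesNonDaemonThreads(boolean startFromDaemon) {
        try {
            Set<Thread> before = new HashSet<>(Thread.getAllStackTraces().keySet());

            // Port 0 lets the OS pick a free port; nothing is served here.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);

            // The dispatcher thread is created inside server.start() and
            // inherits the daemon flag of the thread that calls start().
            Thread starter = new Thread(server::start, "http-starter");
            starter.setDaemon(startFromDaemon);
            starter.start();
            starter.join();

            boolean found = false;
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                if (!before.contains(t) && t.isAlive() && !t.isDaemon()) {
                    found = true;
                }
            }
            server.stop(0); // clean up so the demo itself can exit
            return found;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("started from normal thread: " + leavesNonDaemonThreads(false));
        System.out.println("started from daemon thread: " + leavesNonDaemonThreads(true));
    }
}
```

If the first case reports a leftover non-daemon thread and the second does not, that matches the behaviour seen with the agent: an HTTP server started from a normal thread holds the JVM open after the application has failed.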

Smalyshev added a comment. Edited Apr 24 2018, 3:59 AM

Judging from this: https://github.com/prometheus/jmx_exporter/releases/tag/parent-0.1.0

[BUGFIX] Prevent hanging on JVM exit for java agent

It is a known issue - we may just need to upgrade the build. The latest release is https://github.com/prometheus/jmx_exporter/releases/tag/parent-0.3.0
We currently have prometheus-jmx-exporter 0.10-3.

I built the master branch from source and the problem does not reproduce with it. So if upgrading the package is hard, we can just include a jar in the deployment dir and load it from there.

@fgiunchedi might have a planned upgrade already, let's see what he has to say on the subject...

No planned upgrades ATM, though a newer upstream version might help with understanding (hopefully fixing) T192456: Prometheus metrics missing for some hosts too, so definitely welcome!

Change 428638 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/debs/prometheus-jmx-exporter@master] upgrade to upstream version 0.3.0

https://gerrit.wikimedia.org/r/428638

Change 428638 merged by Gehel:
[operations/debs/prometheus-jmx-exporter@master] upgrade to upstream version 0.3.0

https://gerrit.wikimedia.org/r/428638

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:52:15Z] <gehel> restarting wdqs-updater on all nodes for prometheus jmx exporter update - T192768

Smalyshev triaged this task as High priority. Apr 24 2018, 10:00 PM
Smalyshev closed this task as Resolved. Apr 26 2018, 4:22 AM
Smalyshev claimed this task.