We've had another case where Blazegraph gets to 10K threads and gets stuck there:
It would be good if we could figure out what caused it and how we could amend our configs to prevent this from happening.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Smalyshev | T206108 Limit CPU consumption of blazegraph
Resolved | | Smalyshev | T206189 Set sensible thread limit to Blazegraph
Resolved | | Smalyshev | T206880 Investigate runaway Blazegraph threads
From a small-scale test, which only got to about 400 threads: P7685
We have:
The last one is the worrying one for me - it looks like, if I understand it right, Jetty has some unbounded pool there. I'm not sure what these threads are or what they're used for, but the backtraces look like:
"qtp1058025095-33988" #33988 prio=5 os_prio=0 tid=0x00007febcc00c000 nid=0x2578 waiting on condition [0x00007fed4cbca000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00007feeecbe8780> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:389) at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:516) at org.eclipse.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:47) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) at java.lang.Thread.run(Thread.java:748)
So they don't look like Blazegraph-owned ones. I wonder why there are so many of them.
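For what it's worth, here is a minimal sketch of how a Jetty `QueuedThreadPool` (the pool that owns the `qtp...` threads above) can be capped when the server is built programmatically. The class name, pool sizes and the lack of connectors/handlers are all illustrative; Blazegraph's actual jetty.xml wiring may configure this differently.

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class BoundedQtpSketch {
    public static void main(String[] args) throws Exception {
        // Cap the pool that produces the "qtp..." threads seen in the dump:
        // at most 200 threads, at least 8 kept alive, 60s idle timeout.
        QueuedThreadPool qtp = new QueuedThreadPool(200, 8, 60_000);
        qtp.setName("qtp-bounded");

        Server server = new Server(qtp);
        // ... connectors and handlers would be added here ...
        server.start();
        server.join();
    }
}
```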
Will try to run a test on a bigger scale tomorrow.
@Smalyshev if you could take a heap dump of blazegraph under load, we might be able to trace more precisely where this unnamed thread pool is coming from. Feel free to send me the dump for analysis.
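For reference, one way to trigger a heap dump from inside the same JVM is the HotSpot diagnostic MXBean shown below; for a separate Blazegraph process, `jmap -dump:live,format=b,file=<path> <pid>` against its PID is the more usual route. The output path here is just an example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpSketch {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, which keeps the file smaller;
        // the target file must not already exist.
        diag.dumpHeap("/tmp/blazegraph-under-load.hprof", true);
    }
}
```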
Analyzing a bit bigger spike, up to 3100 threads:
Looks like executorService is the one responsible for the thread growth. It also looks like the thread count continues to rise after the load source is removed - maybe related to accept backlog processing? It takes about 20 minutes to reach the peak, and then it slowly recedes to normal levels over the following 2 hours.
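If the accept backlog is indeed a factor, the relevant Jetty knob is the connector's accept queue size. A hedged sketch follows; the port and queue size are made up, and I don't know how the WDQS jetty.xml actually sets this.

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class AcceptQueueSketch {
    public static void main(String[] args) throws Exception {
        Server server = new Server();

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(9999);
        // The accept queue (TCP backlog) holds connections the OS has completed
        // the handshake for but Jetty has not accepted yet; a smaller value makes
        // overload fail fast at connect time instead of queueing work that will go stale.
        connector.setAcceptQueueSize(128);

        server.addConnector(connector);
        server.start();
        server.join();
    }
}
```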
Doing a full replay of the recorded traffic, I get 9658 executorService threads. Again, the load and the thread count keep ramping up long after the traffic source is removed. I suspect the tasks accumulate in Jetty's processing queue and the executor then tries to process them even though they should have timed out long ago and the client has long since disconnected.
Timeout detection also seems to be unreliable under load, as I get queries with 500+s runtimes, even though they should have been cancelled long before that.
The running query queue itself seems to be limited to 32 parallel queries, but this does not constrain the growth of the executorService thread set. Eventually it reaches the maximum of 10k and new threads fail with `java.lang.OutOfMemoryError: unable to create new native thread`.
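As a rough illustration of the mechanism only (not Blazegraph's actual executor wiring, which I haven't inspected): a thread-per-task executor keeps creating threads for every submitted task that finds no idle worker, until the native thread limit is hit, whereas a `ThreadPoolExecutor` with a bounded pool and a bounded queue pushes back instead. All sizes below are made up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorGrowthSketch {
    public static void main(String[] args) {
        // Unbounded pattern: every submitted task that finds no idle worker
        // gets a new thread, so a burst of slow queries turns into thread count.
        ExecutorService unbounded = Executors.newCachedThreadPool();

        // Bounded pattern: at most 64 workers and 1024 queued tasks; anything
        // beyond that is rejected instead of accumulating threads
        // (CallerRunsPolicy would throttle the submitter instead of rejecting).
        ExecutorService bounded = new ThreadPoolExecutor(
                16, 64, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1024),
                new ThreadPoolExecutor.AbortPolicy());

        unbounded.shutdown();
        bounded.shutdown();
    }
}
```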
Also, the pattern seems to be fully reproducible; I used:

```
./goreplay --input-file perf.gor --output-http http://localhost:9999 -stats -verbose
```

on wdqs1010.
I think it's pretty clear what is going on, so the next thing would be to set some limits in T206189. Bryan advises against setting hard limits on the executor, so the options are: