
WDQS timeout on the public eqiad cluster
Closed, Resolved · Public


WDQS has paged for the eqiad public cluster. Symptoms are:

  • high response times
  • timeouts on our LVS checks
  • high numbers of threads (~10K)

Note that the number of banned requests dropped in the same timeframe as the problem started. Maybe an abusive client has started to evade our ban.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The situation is better, but still not entirely stable (I just restarted blazegraph on wdqs1004). Looking at logstash, the number of Haltable errors is high, but it has been just as high over the last 7 days without major issues.

Side note: we might want a log of long running queries, to make it easier to spot abusive patterns...

wdqs1005 was lagging on updates. A few thread dumps for further analysis before restarting it:

Mentioned in SAL (#wikimedia-operations) [2018-06-24T21:51:23Z] <gehel> restarting wdqs1005 after taking a few thread dumps - T198042
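For the record, taking a few spaced thread dumps before a restart could be sketched roughly as below. The `jstack` path and the pid lookup via `pgrep` are assumptions; on the actual wdqs hosts the Blazegraph pid would come from the service manager.

```shell
# Hypothetical sketch: capture three thread dumps of the Blazegraph JVM,
# one second apart, before restarting the service.
JSTACK=${JSTACK:-jstack}
PID=${PID:-$(pgrep -f blazegraph | head -n1)}
for i in 1 2 3; do
  # '|| true' so a transient jstack failure does not abort the loop
  "$JSTACK" "$PID" > "/tmp/wdqs1005-threads-$i.txt" 2>&1 || true
  sleep 1
done
```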

Looking at thread dumps on wdqs1005, there are > 5000 threads waiting on logging (see stack trace below). We could improve the situation with an AsyncAppender (probably a good idea anyway), but that would only treat the symptom, not the root cause.

com.bigdata.journal.Journal.executorService20427 - priority:5 - threadId:0x00007f5c8013f800 - nativeId:0x350c - state:WAITING
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f5d69220380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(
at java.util.concurrent.locks.ReentrantLock.lock(
at ch.qos.logback.core.OutputStreamAppender.writeBytes(
at ch.qos.logback.core.OutputStreamAppender.subAppend(
at ch.qos.logback.core.OutputStreamAppender.append(
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(
at ch.qos.logback.classic.Logger.callAppenders(
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(
at ch.qos.logback.classic.Logger.log(
at org.apache.log4j.Category.differentiatedLog(
at org.apache.log4j.Category.warn(
at com.bigdata.util.concurrent.Haltable.logCause(
at com.bigdata.util.concurrent.Haltable.halt(
at com.bigdata.bop.join.PipelineJoin$JoinTask$
at com.bigdata.bop.join.PipelineJoin$JoinTask.consumeSource(
at com.bigdata.bop.join.PipelineJoin$
at com.bigdata.bop.join.PipelineJoin$
at com.bigdata.bop.engine.ChunkedRunningQuery$
at com.bigdata.bop.engine.ChunkedRunningQuery$
at java.util.concurrent.Executors$
at com.bigdata.bop.engine.ChunkedRunningQuery$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
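The AsyncAppender mitigation mentioned above would roughly look like the fragment below, assuming a logback version that supports `neverBlock`. `FILE` is a placeholder for whichever synchronous appender the threads are currently contending on; this is a sketch, not the actual wdqs logging config.

```xml
<!-- Hypothetical logback.xml fragment: wrap the existing synchronous
     appender in an AsyncAppender so query threads hand events to a
     queue instead of parking on the OutputStreamAppender lock. -->
<appender name="ASYNC" class="ch.qos.logback.core.AsyncAppender">
  <queueSize>8192</queueSize>
  <!-- drop events rather than block callers when the queue is full -->
  <neverBlock>true</neverBlock>
  <appender-ref ref="FILE"/>
</appender>

<root level="WARN">
  <appender-ref ref="ASYNC"/>
</root>
```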

The pattern of banned / throttled requests as seen on wdqs matches a pattern of HTTP 500s seen on varnish. It is the same user agent / IP. I was expecting all those banned / throttled requests to be 403 / 429, but it looks like this is not the case. Something is wrong...
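One way to check this would be to tally status codes per client from the access logs. The snippet below is a sketch against a simplified three-field log (ip, status, user agent); the real varnish log format differs, and `bad-bot` is a stand-in for the actual suspect user agent.

```shell
# Hypothetical sample data; the real varnish log format differs.
cat > /tmp/varnish-sample.log <<'EOF'
192.0.2.10 500 bad-bot
192.0.2.10 429 bad-bot
203.0.113.5 200 good-client
EOF

# Tally status codes for the suspect user agent: a large share of 500s
# here would confirm that banned requests are not surfacing as 403/429.
awk '$3 == "bad-bot" { c[$2]++ } END { for (s in c) print s, c[s] }' \
  /tmp/varnish-sample.log
```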

Mentioned in SAL (#wikimedia-operations) [2018-06-25T07:46:25Z] <gehel> depooling wdqs1005 to allow it to catch up on updates - T198042

Mentioned in SAL (#wikimedia-operations) [2018-06-25T12:00:11Z] <gehel> repooling wdqs1005 it has catched up on updates - T198042

Do we have anything left to do here?

JJMC89 renamed this task from 5daaaaaaaa to WDQS timeout on the public eqiad cluster.Jul 1 2018, 3:14 AM
JJMC89 closed this task as Resolved.
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description.
JJMC89 added a subscriber: Aklapper.