
WDQS timeout on the public eqiad cluster
Closed, ResolvedPublic

Description

WDQS has paged for the eqiad public cluster. Symptoms are:

  • high response times
  • timeouts on our LVS checks
  • high numbers of threads (~10K)

Note that the number of banned requests dropped in the same timeframe in which the problem started. An abusive client may have started to evade our ban.

Event Timeline

Gehel created this task. Jun 24 2018, 4:54 PM
Restricted Application added projects: Wikidata, Discovery. · View Herald Transcript · Jun 24 2018, 4:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Gehel added a comment. Jun 24 2018, 7:47 PM

The situation is better, but still not entirely stable (I just restarted Blazegraph on wdqs1004). Looking at logstash, the number of Haltable errors is high, but it has been just as high over the last 7 days without major issues.

Side note: we might want to keep a log of long-running queries, to make it easier to spot a pattern...

Gehel added a comment. Jun 24 2018, 9:50 PM

wdqs1005 was lagging on updates. I took a few thread dumps for further analysis before restarting it:

Mentioned in SAL (#wikimedia-operations) [2018-06-24T21:51:23Z] <gehel> restarting wdqs1005 after taking a few thread dumps - T198042

Mentioned in SAL (#wikimedia-operations) [2018-06-24T22:26:46Z] <gehel> restarting wdqs1004 - T198042

Gehel added a comment. Jun 25 2018, 6:40 AM

Looking at the thread dumps from wdqs1005, there are > 5000 threads waiting on the logging appender lock (see the stack trace below). We could improve the situation with an AsyncAppender (probably a good idea anyway; a sketch follows the stack trace), but that only treats the symptom, not the root cause.

com.bigdata.journal.Journal.executorService20427 - priority:5 - threadId:0x00007f5c8013f800 - nativeId:0x350c - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f5d69220380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at ch.qos.logback.core.OutputStreamAppender.writeBytes(OutputStreamAppender.java:197)
at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:231)
at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:102)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
at ch.qos.logback.classic.Logger.log(Logger.java:765)
at org.apache.log4j.Category.differentiatedLog(Category.java:193)
at org.apache.log4j.Category.warn(Category.java:257)
at com.bigdata.util.concurrent.Haltable.logCause(Haltable.java:474)
at com.bigdata.util.concurrent.Haltable.halt(Haltable.java:197)
at com.bigdata.bop.join.PipelineJoin$JoinTask$BindingSetConsumerTask.call(PipelineJoin.java:1022)
at com.bigdata.bop.join.PipelineJoin$JoinTask.consumeSource(PipelineJoin.java:739)
at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:623)
at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:382)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTask.call(ChunkedRunningQuery.java:1346)
at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTaskWrapper.run(ChunkedRunningQuery.java:926)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkFutureTask.run(ChunkedRunningQuery.java:821)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
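
A minimal sketch of that idea, assuming the appenders were wired programmatically (the actual WDQS logback configuration lives in an XML file and is not shown here; the log path, pattern, and queue size below are placeholders): wrap the existing synchronous file appender in a ch.qos.logback.classic.AsyncAppender, so query threads only enqueue log events instead of contending on the OutputStreamAppender lock seen in the dump.

import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.FileAppender;
import org.slf4j.LoggerFactory;

public class AsyncLoggingSketch {
    public static void main(String[] args) {
        LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();

        // Encoder + file appender: the synchronous sink that the ~5000 parked
        // threads in the dump were all queueing on.
        PatternLayoutEncoder enc = new PatternLayoutEncoder();
        enc.setContext(ctx);
        enc.setPattern("%d %-5level [%thread] %logger - %msg%n");
        enc.start();

        FileAppender<ILoggingEvent> file = new FileAppender<>();
        file.setContext(ctx);
        file.setFile("/var/log/wdqs/wdqs-blazegraph.log"); // placeholder path
        file.setEncoder(enc);
        file.start();

        // AsyncAppender: callers only enqueue events; a single worker thread
        // takes the OutputStreamAppender lock.
        AsyncAppender async = new AsyncAppender();
        async.setContext(ctx);
        async.setQueueSize(8192);
        async.setDiscardingThreshold(0); // 0 = never discard INFO and below early
        async.setNeverBlock(true);       // drop events instead of blocking callers when full
        async.addAppender(file);
        async.start();

        Logger root = ctx.getLogger(Logger.ROOT_LOGGER_NAME);
        root.detachAndStopAllAppenders();
        root.addAppender(async);
    }
}

Even with such a change, the flood of Haltable warnings from failing queries would remain; the async appender only keeps it from blocking query threads.
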
Gehel added a comment. Jun 25 2018, 7:11 AM

The pattern of banned / throttled requests as seen on wdqs matches a pattern of HTTP 500s seen on varnish, from the same user agent / IP. I was expecting all those banned / throttled requests to be answered with 403 / 429, but it looks like this is not the case. Something is wrong...
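
For reference, a hypothetical way to double-check which status code the service actually hands back to a throttled client (the User-Agent below is a placeholder standing in for the suspect client, not the real one; the expectation is a 403 or 429 rather than the 500s showing up at varnish):

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ThrottleStatusCheck {
    public static void main(String[] args) throws Exception {
        // Trivial SPARQL query, URL-encoded for the GET endpoint.
        String query = URLEncoder.encode(
                "SELECT * WHERE { ?s ?p ?o } LIMIT 1", StandardCharsets.UTF_8.name());
        URL url = new URL("https://query.wikidata.org/sparql?query=" + query);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Placeholder User-Agent standing in for the suspect client.
        conn.setRequestProperty("User-Agent", "suspect-client/1.0");
        conn.setRequestProperty("Accept", "application/sparql-results+json");
        // Expected for a banned/throttled client: 403 or 429, not a 500.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}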

Mentioned in SAL (#wikimedia-operations) [2018-06-25T07:46:25Z] <gehel> depooling wdqs1005 to allow it to catch up on updates - T198042

Mentioned in SAL (#wikimedia-operations) [2018-06-25T12:00:11Z] <gehel> repooling wdqs1005 it has catched up on updates - T198042

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Jun 26 2018, 7:37 AM

Do we have anything left to do here?

I think we're all done here.

Vvjjkkii renamed this task from WDQS timeout on the public eqiad cluster to 5daaaaaaaa. Jul 1 2018, 1:01 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from 5daaaaaaaa to WDQS timeout on the public eqiad cluster. Jul 1 2018, 3:14 AM
JJMC89 closed this task as Resolved.
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description. (Show Details)
JJMC89 added a subscriber: Aklapper.