Page MenuHomePhabricator

MemoryManagerClosedException is produced even with no external queries
Open, Stalled, LowPublic

Description

Due to T198042: WDQS timeout on the public eqiad cluster, I have temporarily depooled wdq1004, but even after removing it from pool and restarting Blazegraph, I still get these exceptions:

Jun 25 03:44:15 wdqs1004 bash[20997]: 03:44:14.804 [com.bigdata.journal.Journal.executorService62] ERROR com.bigdata.util.concurrent.Haltable IP: UA: - com.bigdata.bop.join.PipelineJoin$JoinTask{ joinOp=com.bigdata.bop.join.PipelineJoin[2]()[ BOp.bopId=2, JoinAnnotations.constraints=null, AST2BOpBase.simpleJoin=true, BOp.evaluationContext=ANY, AccessPathJoinAnnotations.predicate=com.bigdata.rdf.spo.SPOPredicate[1](x=null, y=null, z=null)[ IPredicate.relationName=[wdq.spo], IPredicate.timestamp=1529898254773, BOp.bopId=1, AST2BOpBase.estimatedCardinality=5514016070, AST2BOpBase.originalIndex=SPO, IPredicate.flags=[KEYS,VALS,READONLY,PARALLEL]]]} : isFirstCause=true : com.bigdata.rwstore.sector.MemoryManagerClosedException
Jun 25 03:44:15 wdqs1004 bash[20997]: com.bigdata.rwstore.sector.MemoryManagerClosedException: null
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.MemoryManager.assertOpen(MemoryManager.java:110)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.MemoryManager.allocate(MemoryManager.java:671)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.AllocationContext.allocate(AllocationContext.java:195)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.AllocationContext.allocate(AllocationContext.java:169)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.AllocationContext.allocate(AllocationContext.java:159)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.sector.AllocationContext.alloc(AllocationContext.java:359)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.PSOutputStream.save(PSOutputStream.java:335)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.rwstore.PSOutputStream.getAddr(PSOutputStream.java:416)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.solutions.SolutionSetStream.put(SolutionSetStream.java:297)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.LocalNativeChunkMessage.<init>(LocalNativeChunkMessage.java:213)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.LocalNativeChunkMessage.<init>(LocalNativeChunkMessage.java:147)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.StandaloneChunkHandler.handleChunk(StandaloneChunkHandler.java:92)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$HandleChunkBuffer.outputChunk(ChunkedRunningQuery.java:1699)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$HandleChunkBuffer.addReorderAllowed(ChunkedRunningQuery.java:1628)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$HandleChunkBuffer.add(ChunkedRunningQuery.java:1569)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$HandleChunkBuffer.add(ChunkedRunningQuery.java:1453)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.relation.accesspath.UnsyncLocalOutputBuffer.handleChunk(UnsyncLocalOutputBuffer.java:59)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.relation.accesspath.UnsyncLocalOutputBuffer.handleChunk(UnsyncLocalOutputBuffer.java:14)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.relation.accesspath.AbstractUnsynchronizedArrayBuffer.overflow(AbstractUnsynchronizedArrayBuffer.java:287)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.relation.accesspath.AbstractUnsynchronizedArrayBuffer.add2(AbstractUnsynchronizedArrayBuffer.java:215)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.relation.accesspath.AbstractUnsynchronizedArrayBuffer.add(AbstractUnsynchronizedArrayBuffer.java:173)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask$AccessPathTask.handleJoin2(PipelineJoin.java:1868)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask$AccessPathTask.call(PipelineJoin.java:1684)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask$BindingSetConsumerTask.runOneTask(PipelineJoin.java:1087)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask$BindingSetConsumerTask.call(PipelineJoin.java:995)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask.consumeSource(PipelineJoin.java:729)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:623)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:382)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTask.call(ChunkedRunningQuery.java:1346)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTaskWrapper.run(ChunkedRunningQuery.java:926)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
Jun 25 03:44:15 wdqs1004 bash[20997]:         at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkFutureTask.run(ChunkedRunningQuery.java:821)
Jun 25 03:44:16 wdqs1004 bash[20997]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jun 25 03:44:16 wdqs1004 bash[20997]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jun 25 03:44:16 wdqs1004 bash[20997]:         at java.lang.Thread.run(Thread.java:748)

There should not be any queries running that time out, only Updater queries that do not timeout, but the exception still happens.

Log attached:

Event Timeline

Smalyshev added a subscriber: Gehel.

Maybe caused by the system "ping" from icinga? But that shouldn't be timing out either, so it's weird it is appearing. If it's normal, it should be suppressed and not pollute the logs at the minimum.

Change 441774 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: don't log MemoryManagerClosedException

https://gerrit.wikimedia.org/r/441774

The above change would suppress the MemoryManagerClosedExceptions and all log messages from com.bigdata.util.concurrent.Haltable. I'm not entirely convinced this is the right thing to do, since it is hiding an error that we don't entirely understand. But if there is nothing we can do about it, it does not make sense to log it.

Change 441774 merged by Gehel:
[operations/puppet@production] wdqs: don't log MemoryManagerClosedException

https://gerrit.wikimedia.org/r/441774

Smalyshev lowered the priority of this task from High to Medium.Jun 25 2018, 5:35 PM

Still seeing those exceptions, e.g. see on wdqs1009 (that serves no traffic, therefore should not have any timeouts?)

We've disabled it on the logger level, but I'd probably still do two things there:

  1. maybe change the cutoff from com.bigdata.util.concurrent.Haltable to only the one caused by com.bigdata.rwstore.sector.MemoryManagerClosedException - using evaluators in logback config.
  1. Figure out if it's a legit exception and shouldn't it be caught and handled somewhere more quietly? I feel it may be some overzealous handling of LIMIT (i.e. some background threads trying to supply data that aren't needed already because of LIMIT) but can be wrong.
Vvjjkkii renamed this task from MemoryManagerClosedException is produced even with no external queries to 1daaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 1daaaaaaaa to MemoryManagerClosedException is produced even with no external queries.Jul 2 2018, 10:22 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.
Smalyshev changed the task status from Open to Stalled.Jun 18 2019, 10:30 PM
Smalyshev lowered the priority of this task from Medium to Low.