
Deadlock in blazegraph blocking all queries and updates
Open, Needs Triage · Public

Description

Apparently a deadlock inside blazegraph itself:

Found one Java-level deadlock:
=============================
"GASEngine4":
  waiting for ownable synchronizer 0x00007fcbf9dbc3c0, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "com.bigdata.journal.Journal.executorService1539347"
"com.bigdata.journal.Journal.executorService1539347":
  waiting to lock monitor 0x00007fc555798e18 (object 0x00007fcfda000320, a java.lang.Object),
  which is held by "GASEngine2"
"GASEngine2":
  waiting to lock monitor 0x00007fc57c22e358 (object 0x00007fcbf9b97710, a java.lang.Object),
  which is held by "com.bigdata.journal.Journal.executorService1539347"

Full stack trace: P10117

The problem went undetected by the system, but started around 2020-01-10T15:44.
The machine stopped handling updates and queries, and the lag stopped being reported as well.
Blazegraph was restarted around 19:44.
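
Since the deadlock went unnoticed until someone looked at a thread dump, one option is to let the JVM report it. The following is only a minimal sketch, assuming a watchdog could run inside the Blazegraph JVM (or query it over JMX). The class `DeadlockProbe` and its method are hypothetical and are not part of Blazegraph. It relies on the standard `ThreadMXBean.findDeadlockedThreads()`, which covers both object monitors and ownable synchronizers such as the `ReentrantLock` seen in the dump above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Blazegraph: reports JVM-level deadlocks.
final class DeadlockProbe {

    /** Returns one line per deadlocked thread, or an empty list if there is no deadlock. */
    static List<String> describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked.
        long[] ids = threads.findDeadlockedThreads();
        List<String> lines = new ArrayList<>();
        if (ids == null) {
            return lines;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            lines.add("\"" + info.getThreadName() + "\" waiting for " + info.getLockName()
                    + ", held by \"" + info.getLockOwnerName() + "\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = describeDeadlocks();
        if (lines.isEmpty()) {
            System.out.println("no deadlock detected");
        } else {
            lines.forEach(System.out::println);
        }
    }
}
```

A periodic task could call `describeDeadlocks()` and log or export a metric when the list is non-empty, which would give alerting something to act on even when the HTTP port no longer answers.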

Event Timeline

dcausse created this task. Jan 10 2020, 7:56 PM
Restricted Application added a project: Wikidata. Jan 10 2020, 7:56 PM
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2020-01-16T16:31:44Z] <dcausse> depooling wdqs1007, blazegraph stuck (T242453)

Mentioned in SAL (#wikimedia-operations) [2020-01-16T17:05:42Z] <dcausse> restarting blazegraph@wdqs1007 (T242453)

The Icinga checks showed "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds." for Query Service HTTP Port and NaN for WDQS high update lag.

We should probably alert in case of timeouts.
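
To illustrate the idea, here is a rough sketch of a check wrapper that turns a timeout into CRITICAL instead of UNKNOWN. It is not the actual Icinga/NRPE configuration used for WDQS. The class `TimeoutAwareCheck`, the `probe` callable, and the timeout value are all assumptions made for the example. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios/Icinga plugin convention.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical check wrapper: a probe that does not answer in time is CRITICAL.
final class TimeoutAwareCheck {
    // Nagios/Icinga plugin exit codes.
    static final int OK = 0;
    static final int WARNING = 1;
    static final int CRITICAL = 2;
    static final int UNKNOWN = 3;

    /** Runs the probe and maps its outcome, including a timeout, to a plugin state. */
    static int run(Callable<Boolean> probe, long timeoutMillis) {
        // Daemon thread so a stuck probe cannot keep the check process alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "check-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Boolean> result = pool.submit(probe);
        try {
            Boolean healthy = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(healthy) ? OK : CRITICAL;
        } catch (TimeoutException e) {
            // The service did not answer in time: alert instead of reporting UNKNOWN.
            result.cancel(true);
            return CRITICAL;
        } catch (ExecutionException e) {
            // The probe itself failed; the state of the service is not known.
            return UNKNOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return UNKNOWN;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

For example, `TimeoutAwareCheck.run(() -> { Thread.sleep(5000); return true; }, 100)` yields CRITICAL, whereas a probe that throws an exception still yields UNKNOWN.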

Stack dumps from Blazegraph: P10185

Mentioned in SAL (#wikimedia-operations) [2020-01-18T09:00:29Z] <dcausse> repool wdqs1007 (T242453)

Mentioned in SAL (#wikimedia-operations) [2020-02-06T08:18:02Z] <dcausse> restarting blazegraph on wdqs1006: T242453

Mentioned in SAL (#wikimedia-operations) [2020-03-15T13:27:15Z] <dcausse> restarting blazegraph on wdqs1005 T242453

Mentioned in SAL (#wikimedia-operations) [2020-03-19T08:43:34Z] <dcausse> restarting blazegraph on wdqs1006 (T242453)

dcausse renamed this task from "wdqs1005 stopped to handle updates" to "Deadlock in blazegraph stopping all queries and updates". Thu, Mar 19, 8:44 AM
dcausse renamed this task from "Deadlock in blazegraph stopping all queries and updates" to "Deadlock in blazegraph blocking all queries and updates".