
Detect and alert and/or remediate Blazegraph deadlocks
Open, High, Public · 3 Estimated Story Points

Description

Apparently a deadlock inside blazegraph itself:

Found one Java-level deadlock:
=============================
"GASEngine4":
  waiting for ownable synchronizer 0x00007fcbf9dbc3c0, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "com.bigdata.journal.Journal.executorService1539347"
"com.bigdata.journal.Journal.executorService1539347":
  waiting to lock monitor 0x00007fc555798e18 (object 0x00007fcfda000320, a java.lang.Object),
  which is held by "GASEngine2"
"GASEngine2":
  waiting to lock monitor 0x00007fc57c22e358 (object 0x00007fcbf9b97710, a java.lang.Object),
  which is held by "com.bigdata.journal.Journal.executorService1539347"

full stack: P10117

The problem remained unseen by our monitoring, but started around 2020-01-10T15:44.
The machine stopped handling updates and queries, and lag stopped being reported as well.
Blazegraph was restarted around 19:44.

Root cause is likely related to the fact that the GAS engine spawns its own thread pool.

Possible workarounds:

  • disable the GAS engine service (user impact is high)
  • detect this deadlock state and restart the service (a rough detection sketch follows this list)
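
A minimal sketch of what the second option could look like from outside the JVM, exploiting how these incidents manifest (the service simply stops answering, and HTTP checks time out): probe the local SPARQL endpoint with a trivial query and a short timeout, and report CRITICAL after repeated failures so Icinga/NRPE can alert or trigger a restart. The endpoint URL, query, timeout and retry count below are assumptions, not production values.

#!/usr/bin/env python3
"""Probe the local Blazegraph SPARQL endpoint and report a Nagios-style
exit status. Endpoint URL, query, timeout and retry count are assumptions."""
import sys
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed local WDQS endpoint
QUERY = "ASK { ?s ?p ?o }"  # trivial query that a healthy instance answers instantly
TIMEOUT_S = 10              # mirrors the 10-second NRPE socket timeout seen later in this task
RETRIES = 3                 # require several consecutive failures before alerting

def probe_once() -> bool:
    """Return True if the endpoint answered the trivial query in time."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return resp.status == 200
    except Exception:
        return False

def main() -> int:
    for _ in range(RETRIES):
        if probe_once():
            print("OK: Blazegraph answered a trivial SPARQL query")
            return 0
    print(f"CRITICAL: no answer within {TIMEOUT_S}s after {RETRIES} attempts "
          "(service possibly deadlocked)")
    return 2  # Nagios/Icinga CRITICAL

if __name__ == "__main__":
    sys.exit(main())

Run from an NRPE check or a timer, the non-zero exit status covers the "detect and alert" half; whether the restart is automated or left to an operator is a separate decision.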

AC:

  • determine how to work around this problem, create a new ticket, and decline this one, as we won't be fixing this bug inside Blazegraph

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2020-01-16T16:31:44Z] <dcausse> depooling wdqs1007, blazegraph stuck (T242453)

Mentioned in SAL (#wikimedia-operations) [2020-01-16T17:05:42Z] <dcausse> restarting blazegraph@wdqs1007 (T242453)

The icinga check showed "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds." for Query Service HTTP Port, and NaN for WDQS high update lag.

We should probably alert in case of timeouts.

Stackdumps from blazegraph: P10185

Mentioned in SAL (#wikimedia-operations) [2020-02-06T08:18:02Z] <dcausse> restarting blazegraph on wdqs1006: T242453

Mentioned in SAL (#wikimedia-operations) [2020-03-15T13:27:15Z] <dcausse> restarting blazegraph on wdqs1005 T242453

Mentioned in SAL (#wikimedia-operations) [2020-03-19T08:43:34Z] <dcausse> restarting blazegraph on wdqs1006 (T242453)

dcausse renamed this task from "wdqs1005 stopped to handle updates" to "Deadlock in blazegraph stopping all queries and updates". Mar 19 2020, 8:44 AM
dcausse renamed this task from "Deadlock in blazegraph stopping all queries and updates" to "Deadlock in blazegraph blocking all queries and updates".

Mentioned in SAL (#wikimedia-operations) [2020-04-12T06:59:24Z] <dcausse> restarting blazegraph on wdqs1004 (T242453)

Mentioned in SAL (#wikimedia-operations) [2020-04-18T12:15:49Z] <addshore> depool wdqs1006 blazegraph stuck T242453

Mentioned in SAL (#wikimedia-operations) [2020-04-18T12:19:30Z] <addshore> restarting blazegraph on wdqs1006 blazegraph stuck T242453

wdqs1006 is now 8 hours lagged and has been depooled; I noticed it was not reporting on Grafana after I was poked about wdqs being very behind.
I'll keep checking back throughout the day and repool it later.
We should get T250140 (icinga: WDQS high update lag should alert when the service times out) done ASAP.

While it was out, the lag on the other eqiad instances shot up, but maxlag didn't kick in: since wdqs1006 wasn't reporting, the eqiad servers were effectively excluded from the maxlag calculation.
T238751: Only generate maxlag from pooled query service servers.

We actually depooled the whole of eqiad for wdqs for now, until they all catch up.

14:03 cdanis@cumin1001: conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=eqiad

We will then repool.
T250557 (allow non-roots to pool/depool certain DNS Discovery services) was also created.

Change 589873 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873

Change 589874 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.28] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Mentioned in SAL (#wikimedia-operations) [2020-04-18T22:50:04Z] <addshore> pool wdqs1006 blazegraph caught up T242453

This just happened again, so depooled and restarted 1006, and switched traffic over to codfw.
Seems to always be 1006?

I don't think so, it happened at least once to all nodes in eqiad (1004, 1005, 1006 and 1007).

Change 589874 abandoned by Addshore:
Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Mentioned in SAL (#wikimedia-operations) [2020-04-24T15:08:12Z] <addshore> depool and restart wdqs1006 to catch up with lag after deadlock T242453

For posterity, SAL records for when eqiad was depooled due to this issue:
https://sal.toolforge.org/sal/production?p=0&q=%22dnsdisc%3Dwdqs%22

Mentioned in SAL (#wikimedia-operations) [2020-04-25T10:23:02Z] <addshore> going to restart and probably depool for a short time wdqs1005 as it is in a deadlock T242453

Mentioned in SAL (#wikimedia-operations) [2020-04-27T10:22:59Z] <addshore> depool and restart wdqs1007 (deadlocks) T242453

Mentioned in SAL (#wikimedia-operations) [2020-04-28T08:21:33Z] <dcausse> restarting blazegraph on wdqs1007 (T242453)

dcausse triaged this task as High priority. Apr 28 2020, 9:44 AM

Raising to high; this issue might be hard to solve, as it sounds related to the Blazegraph design flaw of running with unbounded thread pools.
We might at least try to add some debugging code to isolate the request that's causing the deadlock and see if we find a pattern.
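
A possible low-effort version of that debugging step, under the assumption that the local Blazegraph status page at /bigdata/status can list in-flight queries via a showQueries parameter (both the path and the parameter are assumptions here): whenever a deadlock is suspected, snapshot the currently running queries alongside the thread dump, so recurring query shapes can be spotted afterwards.

#!/usr/bin/env python3
"""Capture the currently running queries when a deadlock is suspected.
The status URL, its query parameter and the output directory are
assumptions about the local Blazegraph deployment, not verified values."""
import datetime
import pathlib
import urllib.request

STATUS_URL = "http://localhost:9999/bigdata/status?showQueries=details"  # assumed
OUT_DIR = pathlib.Path("/var/log/wdqs-deadlock")  # assumed dump location

def snapshot_running_queries() -> pathlib.Path:
    """Save the status page (which lists in-flight queries) to a timestamped file."""
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_file = OUT_DIR / f"running-queries-{stamp}.html"
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        out_file.write_bytes(resp.read())
    return out_file

if __name__ == "__main__":
    print(f"Saved running-query snapshot to {snapshot_running_queries()}")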

Change 589873 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873

Mentioned in SAL (#wikimedia-operations) [2020-05-01T18:57:04Z] <gehel> restart blazegraph on wdqs1006 - T242453

Mentioned in SAL (#wikimedia-operations) [2020-05-10T10:27:20Z] <dcausse> restarting blazgraph on wdqs1004: T242453

There is a suspicion that the query below is related to the issue. More context in logstash.

SELECT DISTINCT ?government_governmental_jurisdiction_governing_officials ?government_governmental_jurisdiction_governing_officials ?government_government_position_held_office_holder_inverse ?government_government_position_held_appointed_by ?government_government_position_held_basic_title_inverse
WHERE {
	VALUES ?government_governmental_jurisdiction_governing_officials { wdt:P17 p:P17 wdt:P36 p:P36 wdt:P47 p:P47 wdt:P6 p:P6 wdt:P138 p:P138 wdt:P37 p:P37 wdt:P150 p:P150 wdt:P1313 p:P1313 wdt:P625 p:P625 wdt:P35 p:P35 } .
	?c ?government_governmental_jurisdiction_governing_officials ?k .
	VALUES ?government_governmental_jurisdiction_governing_officials { wdt:P17 p:P17 wdt:P36 p:P36 wdt:P47 p:P47 wdt:P6 p:P6 wdt:P138 p:P138 wdt:P37 p:P37 wdt:P150 p:P150 wdt:P1313 p:P1313 wdt:P625 p:P625 wdt:P35 p:P35 } .
	?c ?government_governmental_jurisdiction_governing_officials ?y .
	VALUES ?government_government_position_held_office_holder_inverse { wdt:P19 p:P19 wdt:P103 p:P103 wdt:P69 p:P69 wdt:P937 p:P937 wdt:P106 p:P106 wdt:P26 p:P26 wdt:P27 p:P27 wdt:P166 p:P166 wdt:P39 p:P39 wdt:P607 p:P607 wdt:P735 p:P735 wdt:P21 p:P21 wdt:P551 p:P551 wdt:P102 p:P102 wdt:P1412 p:P1412 wdt:P6886 p:P6886 wdt:P2163 p:P2163 } .
	wd:Q15029 ?government_government_position_held_office_holder_inverse ?y .
	VALUES ?government_government_position_held_appointed_by { wdt:P991 p:P991 wdt:P1308 p:P1308 wdt:P112 p:P112 wdt:P710 p:P710 wdt:P542 p:P542 wdt:P488 p:P488 wdt:P726 p:P726 wdt:P40 p:P40 wdt:P6 p:P6 wdt:P138 p:P138 wdt:P1365 p:P1365 wdt:P921 p:P921 wdt:P35 p:P35 wdt:P748 p:P748 wdt:P1366 p:P1366 wdt:P22 p:P22 wdt:P3373 p:P3373 wdt:P50 p:P50 wdt:P26 p:P26 } .
	?k ?government_government_position_held_appointed_by wd:Q19211 .
	VALUES ?government_government_position_held_basic_title_inverse { wdt:P19 p:P19 wdt:P88 p:P88 wdt:P1193 p:P1193 wdt:P3616 p:P3616 wdt:P205 p:P205 wdt:P106 p:P106 wdt:P6161 p:P6161 wdt:P1331 p:P1331 wdt:P279 p:P279 wdt:P677 p:P677 wdt:P6828 p:P6828 wdt:P131 p:P131 wdt:P1001 p:P1001 wdt:P2099 p:P2099 } .
	wd:Q30461 ?government_government_position_held_basic_title_inverse ?y .
	VALUES ?c { wd:Q148 } .
}  LIMIT 100

Mentioned in SAL (#wikimedia-operations) [2020-08-25T07:04:31Z] <dcausse> restartint blazegraph on wdqs1005 (T242453)

Mentioned in SAL (#wikimedia-operations) [2020-10-08T06:57:45Z] <dcausse> restart blazegraph on wdqs2002 (stuck) T242453

Mentioned in SAL (#wikimedia-operations) [2021-01-11T07:41:02Z] <dcausse> depooling & restarting blazegraph on wdqs1007 (T242453)

Mentioned in SAL (#wikimedia-operations) [2021-01-11T07:43:55Z] <dcausse> repool wdqs1007 (wrong machine) (T242453)

Mentioned in SAL (#wikimedia-operations) [2021-01-11T07:49:17Z] <dcausse> depooling & restarting blazegraph on wdqs2007 (T242453)

This specific deadlock does not seem to happen often enough to warrant more investigation.

Re-opening; this seems to happen more frequently.

bking renamed this task from "Deadlock in blazegraph blocking all queries and updates" to "Detect and alert and/or remediate Blazegraph deadlocks". Edited Mar 29 2022, 2:26 PM

Per conversation with dcausse, we could run jstack on a timer and grep the output for errors as shown above, then alert and/or remediate.
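
A sketch of that approach, assuming the Blazegraph JVM can be found with pgrep and that a restart would go through systemd (the process pattern and unit name below are assumptions): run jstack periodically, look for the "Found one Java-level deadlock" banner that the JVM prints for exactly the state shown in the task description, and alert or remediate when it appears.

#!/usr/bin/env python3
"""jstack-based deadlock check, as sketched above: run from a timer,
alert (and optionally restart) when the JVM reports a Java-level deadlock.
The pgrep pattern and systemd unit name are assumptions."""
import subprocess
import sys

BLAZEGRAPH_PATTERN = "blazegraph"          # assumed pgrep pattern for the JVM
SERVICE_UNIT = "wdqs-blazegraph.service"   # assumed systemd unit name
DEADLOCK_BANNER = "Found one Java-level deadlock"  # printed by jstack, see the dump above

def blazegraph_pid() -> str:
    """Resolve the Blazegraph JVM pid (assumes a single matching process)."""
    out = subprocess.run(["pgrep", "-f", BLAZEGRAPH_PATTERN],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]

def has_deadlock(pid: str) -> bool:
    """Run jstack against the pid and look for the JVM's deadlock report."""
    dump = subprocess.run(["jstack", pid],
                          capture_output=True, text=True, check=True)
    return DEADLOCK_BANNER in dump.stdout

def main() -> int:
    pid = blazegraph_pid()
    if not has_deadlock(pid):
        print(f"OK: jstack reports no Java-level deadlock for pid {pid}")
        return 0
    print(f"CRITICAL: jstack reports a Java-level deadlock in pid {pid}")
    # Remediation could go here, e.g. depool and restart (unit name is an assumption):
    # subprocess.run(["sudo", "systemctl", "restart", SERVICE_UNIT], check=True)
    return 2  # CRITICAL, so the timer/Icinga check can page or remediate

if __name__ == "__main__":
    sys.exit(main())

Wired to a systemd timer or an NRPE check, the non-zero exit status covers alerting; the restart line is deliberately left commented out until we decide whether remediation should be automatic.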

Mentioned in SAL (#wikimedia-operations) [2022-04-29T08:46:39Z] <dcausse> restarting blazegraph on wdqs1006 (deadlocked for 18hours, T242453)

Mentioned in SAL (#wikimedia-operations) [2022-12-05T08:05:06Z] <dcausse> restarting blazegraph on wdqs1004 (stuck with 2000+ threads, T242453)