Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead
Closed, Resolved · Public
Assigned To

Authored By

	• dcausse
	Oct 20 2021, 8:14 AM

Description

As a maintainer of a service running on top of the JVM, I want the JVM to quit rapidly if it enters a GC death spiral so that the service increases its availability.

The default heuristics used by the JVM to kill itself (-XX:+ExitOnOutOfMemoryError) are too conservative to make them useful for real production use cases.
jvmquake seems to circumvent these problems by allowing more flexible heuristics to detect when the JVM will be stuck in a death spiral, see article: https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70.

This approach might be useful for several services:

Blazegraph sometimes gets stuck in a death spiral, certainly triggered by a bad query
cloudelastic sometimes misbehaves because of the GC
more directly identifying old gc hell on search clusters

AC:

debian package exists for jvmquake
jvmquake is deployed on Blazegraph with puppet
jvmquake is configured in reporting only mode

Details

Subject	Repo	Branch	Lines +/-
team-search-platform: remove BlazegraphJvmQuakeWarnGC	operations/alerts	master	+0 -30
wdqs: activate jvmquake at 300:5	operations/puppet	production	+2 -4
wdqs: tune jvmquake settings (take 2)	operations/puppet	production	+2 -2
wdqs: tune jvmquake settings	operations/puppet	production	+4 -4
team-search-platform: add jvmquake alerting	operations/alerts	master	+30 -0
[wdqs] test jvmquake options on the public cluster	operations/puppet	production	+216 -52

Event Timeline

• dcausse created this task. Oct 20 2021, 8:14 AM

Restricted Application added a subscriber: Aklapper. Oct 20 2021, 8:14 AM

• dcausse updated the task description. Oct 20 2021, 8:18 AM

Maintenance_bot added a project: Wikidata. Oct 20 2021, 8:45 AM

• MPhamWMF moved this task from Incoming to Current work on the Wikidata-Query-Service board. Oct 25 2021, 3:18 PM

• MPhamWMF added a project: Discovery-Search (Current work).

Gehel updated the task description. Oct 25 2021, 3:53 PM

• MPhamWMF set the point value for this task to 5. Oct 25 2021, 3:54 PM

• MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Gehel updated the task description. Oct 25 2021, 3:54 PM

Gehel removed the point value for this task.

TJones set the point value for this task to 5. Oct 25 2021, 4:03 PM

TJones removed the point value for this task.

EBernhardson updated the task description. Feb 4 2022, 6:14 PM

• dcausse claimed this task. Feb 9 2022, 8:22 AM

• dcausse moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Gehel moved this task from In Progress to Waiting on the Discovery-Search (Current work) board. Feb 22 2022, 8:19 PM

Pushed https://gitlab.wikimedia.org/repos/search-platform/jvmquake/-/merge_requests/1 (up for review) to have a debian package that we could install on production machines.

Gehel moved this task from Waiting to In Progress on the Discovery-Search (Current work) board. Mar 7 2022, 4:26 PM

Mentioned in SAL (#wikimedia-operations) [2022-03-08T16:02:10Z] <inflatador> bking@deneb manually installed openjdk-11-jdk for T293862 . moritzm will add puppet patch for this

Mentioned in SAL (#wikimedia-operations) [2022-03-08T16:53:57Z] <inflatador> bking@deneb manually installed tox for T293862 . moritzm will add puppet patch for this

Small update:
Moritz built deb pkg
Manually installed on wdqs1010
Puppet patch for fleet-wide installation is planned.

Change 770978 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] [wdqs] add jvmquake options to wdqs1010 for testing

https://gerrit.wikimedia.org/r/770978

gerritbot added a project: Patch-For-Review. Mar 15 2022, 5:28 PM

Mentioned in SAL (#wikimedia-operations) [2022-03-16T09:36:05Z] <dcausse> T293862: manually restarted blazegraph on wdqs1010 with "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake"
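
For readers unfamiliar with the option string above, here is a minimal sketch that splits it into named fields. The field names (threshold, runtime weight, action) reflect my reading of the jvmquake README and are an assumption, as is the interpretation of `warn` as a deficit level in seconds and `touch` as a marker file path. `JvmquakeOptions` and `parse_agent_options` are hypothetical helpers, not part of jvmquake or puppet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JvmquakeOptions:
    # Names are assumptions based on the jvmquake README, not on this task.
    threshold_s: int          # GC deficit (seconds) at which the action fires
    runtime_weight: int       # how strongly running time pays the deficit back
    action: int               # 0 is assumed to mean "trigger an OOM"
    warn_s: Optional[int] = None      # assumed: deficit (seconds) that triggers the warning
    touch_path: Optional[str] = None  # file touched when the warning fires


def parse_agent_options(agentpath_arg: str) -> JvmquakeOptions:
    """Parse '-agentpath:<lib>=<threshold>,<weight>,<action>[,key=value...]'."""
    # Everything after the first '=' is the option list
    # (this assumes the library path itself contains no '=').
    _, _, options = agentpath_arg.partition("=")
    fields = options.split(",")
    positional = [f for f in fields if "=" not in f]
    keyword = dict(f.split("=", 1) for f in fields if "=" in f)
    # Raises ValueError unless exactly three positional values are present.
    threshold, weight, action = (int(v) for v in positional)
    return JvmquakeOptions(
        threshold_s=threshold,
        runtime_weight=weight,
        action=action,
        warn_s=int(keyword["warn"]) if "warn" in keyword else None,
        touch_path=keyword.get("touch"),
    )


if __name__ == "__main__":
    print(parse_agent_options(
        "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake"
    ))
```

With a kill threshold of 1000 seconds, the agent is effectively in reporting-only mode here: only the `warn` level is expected to be reached in practice.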

• dcausse moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Mar 21 2022, 3:10 PM

Change 770978 merged by Bking:

[operations/puppet@production] [wdqs] test jvmquake options on the public cluster

https://gerrit.wikimedia.org/r/770978

Maintenance_bot removed a project: Patch-For-Review. Mar 24 2022, 6:10 PM

Mentioned in SAL (#wikimedia-operations) [2022-03-24T21:11:20Z] <inflatador> bking@cumin1001 restarting blazegraph on wdqs[1003-1013].eqiad.wmnet for T293862

Change 773758 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] team-search-platform: add jvmquake alerting

https://gerrit.wikimedia.org/r/773758

gerritbot added a project: Patch-For-Review. Mar 25 2022, 11:08 AM

Change 773758 merged by jenkins-bot:

[operations/alerts@master] team-search-platform: add jvmquake alerting

https://gerrit.wikimedia.org/r/773758

Change 775254 had a related patch set uploaded (by Ryan Kemper; author: DCausse):

[operations/puppet@production] wdqs: tune jvmquake settings

https://gerrit.wikimedia.org/r/775254

Change 775254 merged by Ryan Kemper:

[operations/puppet@production] wdqs: tune jvmquake settings

https://gerrit.wikimedia.org/r/775254

With these settings we properly detected wdqs1006 going down for 30 minutes at 2022-04-01T12:30:00 (this was 2 minutes after the first blip in the graph).
Unfortunately there was a false positive on wdqs1012 at 2022-04-01T10:00:00, as this machine was unavailable for only 2 minutes.
Unsure if it's still too sensitive or if we can accept having a couple of false positives.

Actually wdqs2007, wdqs2004 and wdqs2003 also triggered jvmquake: GC activity increased, and wdqs2007 & wdqs2003 were unresponsive for a couple of minutes. For wdqs2004 there are no visible blips in the various graphs. I guess we should relax the settings a bit more.
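
To give some intuition for what relaxing the settings means, here is a rough sketch of the GC deficit model as I understand it from the jvmquake README: time spent in GC adds to a deficit, and time spent running pays it back, multiplied by the runtime weight. The exact update order and the clamping at zero are assumptions, and `simulate_deficit` is a hypothetical helper rather than jvmquake code.

```python
from typing import Iterable, Optional, Tuple


def simulate_deficit(
    intervals: Iterable[Tuple[float, float]],
    threshold_s: float,
    runtime_weight: float,
) -> Optional[int]:
    """Return the index of the first interval where the deficit exceeds the
    threshold, or None if it never does.

    intervals: (run_seconds, gc_seconds) pairs, each a stretch of normal
    running followed by a GC pause.
    """
    deficit = 0.0
    for i, (run_s, gc_s) in enumerate(intervals):
        # Running time pays the deficit back, scaled by the weight (assumed
        # to be clamped at zero so idle periods do not bank credit).
        deficit = max(0.0, deficit - run_s * runtime_weight)
        # Time paused in GC adds to the deficit.
        deficit += gc_s
        if deficit > threshold_s:
            return i
    return None


if __name__ == "__main__":
    # A JVM spending 80% of its time in GC: 1 s running, then 4 s paused.
    spiral = [(1.0, 4.0)] * 100
    print(simulate_deficit(spiral, threshold_s=30, runtime_weight=1))
    print(simulate_deficit(spiral, threshold_s=60, runtime_weight=5))
```

Under this model a weight of 1 starts accumulating deficit as soon as more than half of the time is spent in GC, while a weight of 5 tolerates up to roughly 5/6 of the time in GC, so raising the weight and the warn level both make the warning less sensitive.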

Change 776857 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: tune jvmquake settings (take 2)

https://gerrit.wikimedia.org/r/776857

Change 776857 merged by Bking:

[operations/puppet@production] wdqs: tune jvmquake settings (take 2)

https://gerrit.wikimedia.org/r/776857

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:31:13Z] <ryankemper> [WDQS] T293862 Need to do a rolling restart of wdqs public; going to just roll a full deploy since it's equal work

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:44:11Z] <ryankemper> T293862 Rolling restart of wdqs public is complete; new jvmquake settings have been uptaken on wdqs public hosts: -agentpath:/usr/lib/libjvmquake.so=1000,5,0,warn=60,touch=/tmp/wdqs_blazegraph_jvmquake_warn_gc

To check for presence of touched file:

ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' '[ -f "/tmp/wdqs_blazegraph_jvmquake_warn_gc" ] && echo yes || echo no'
11 hosts will be targeted:
wdqs[2001-2004,2007].codfw.wmnet,wdqs[1004-1007,1012-1013].eqiad.wmnet
Ok to proceed on 11 hosts? Enter the number of affected hosts to confirm or "q" to quit 11
===== NODE GROUP =====
(5) wdqs[2001-2002,2007].codfw.wmnet,wdqs[1004-1005].eqiad.wmnet
----- OUTPUT of '[ -f "/tmp/wdqs_...o yes || echo no' -----
no
===== NODE GROUP =====
(6) wdqs[2003-2004].codfw.wmnet,wdqs[1006-1007,1012-1013].eqiad.wmnet
----- OUTPUT of '[ -f "/tmp/wdqs_...o yes || echo no' -----
yes
================
PASS |███████████████████████████████████████████████████████| 100% (11/11) [00:00<00:00, 12.38hosts/s]
FAIL |                                                                |   0% (0/11) [00:00<?, ?hosts/s]
100.0% (11/11) success ratio (>= 100.0% threshold) for command: '[ -f "/tmp/wdqs_...o yes || echo no'.
100.0% (11/11) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
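
The same check can be done locally on a single host. The sketch below additionally reports the marker's modification time, which is only meaningful under the assumption that jvmquake updates the file's mtime whenever the warn level is crossed. `warn_marker_time` is a hypothetical helper name.

```python
import os
from datetime import datetime, timezone
from typing import Optional


def warn_marker_time(path: str) -> Optional[datetime]:
    """Return the marker file's mtime in UTC, or None if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return datetime.fromtimestamp(mtime, tz=timezone.utc)


if __name__ == "__main__":
    # Path taken from the agent options quoted in this task.
    marker = "/tmp/wdqs_blazegraph_jvmquake_warn_gc"
    when = warn_marker_time(marker)
    print("no" if when is None else f"yes (last touched {when.isoformat()})")
```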

To clear away current touched files:

ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"
11 hosts will be targeted:
wdqs[2001-2004,2007].codfw.wmnet,wdqs[1004-1007,1012-1013].eqiad.wmnet
Ok to proceed on 11 hosts? Enter the number of affected hosts to confirm or "q" to quit 11
===== NODE GROUP =====
(6) wdqs[2003-2004].codfw.wmnet,wdqs[1006-1007,1012-1013].eqiad.wmnet
----- OUTPUT of 'rm -fv '/tmp/wdq...vmquake_warn_gc'' -----
removed '/tmp/wdqs_blazegraph_jvmquake_warn_gc'
================
PASS |███████████████████████████████████████████████████████| 100% (11/11) [00:00<00:00, 12.55hosts/s]
FAIL |                                                                |   0% (0/11) [00:00<?, ?hosts/s]
100.0% (11/11) success ratio (>= 100.0% threshold) for command: 'rm -fv '/tmp/wdq...vmquake_warn_gc''.
100.0% (11/11) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:50:24Z] <ryankemper> T293862 Removed touched files so that it'll be easier to see when the new jvmquake threshold is crossed: ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"