Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead
Closed, Resolved (Public)

Description

As a maintainer of a service running on top of the JVM, I want the JVM to quit rapidly if it enters a GC death spiral, so that the service increases its availability.

The JVM's built-in mechanism for killing itself (-XX:+ExitOnOutOfMemoryError) is too conservative to be useful in real production use cases.
jvmquake seems to work around this by offering more flexible heuristics for detecting when the JVM is stuck in a death spiral; see the article: https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70.
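
As a rough illustration of the kind of heuristic involved, here is a minimal Python sketch of a "GC deficit" counter in the spirit of the linked article: time spent in GC adds to a deficit, time spent running application code repays it, and the agent acts once the deficit passes a threshold. The class and function names, the per-cycle accounting order and the sample numbers are all assumptions made for illustration; this is not jvmquake's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GcDeficit:
    """Toy model of a GC deficit counter (assumed semantics, not jvmquake's code)."""
    threshold_s: float      # deficit (seconds) above which the agent would act
    runtime_weight: float   # seconds of deficit repaid per second of running time
    deficit_s: float = 0.0

    def record(self, gc_s: float, run_s: float) -> bool:
        """Account for run_s seconds of application time, then gc_s seconds of GC.

        Returns True once the deficit exceeds the threshold.
        """
        # Running time repays the deficit, which never drops below zero.
        self.deficit_s = max(0.0, self.deficit_s - run_s * self.runtime_weight)
        # GC time adds to the deficit.
        self.deficit_s += gc_s
        return self.deficit_s > self.threshold_s


def seconds_until_trigger(threshold_s: float, runtime_weight: float,
                          gc_s: float, run_s: float,
                          max_cycles: int = 100_000) -> Optional[float]:
    """Elapsed seconds before the threshold is crossed, or None if it never is
    within max_cycles identical run/GC cycles."""
    model = GcDeficit(threshold_s, runtime_weight)
    elapsed = 0.0
    for _ in range(max_cycles):
        elapsed += run_s + gc_s
        if model.record(gc_s, run_s):
            return elapsed
    return None
```

In this model a larger runtime weight makes the check more tolerant: a hypothetical JVM spending 9 of every 10 seconds in GC takes longer to cross the same threshold with a weight of 5 than with a weight of 1, while a JVM spending 10% of its time in GC never accumulates any lasting deficit.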

This approach might be useful for several services:

  • Blazegraph sometimes gets stuck in a death spiral, certainly triggered by a bad query
  • cloudelastic sometimes misbehaves because of the GC
  • identifying old-GC hell on the search clusters more directly

AC:

  • debian package exists for jvmquake
  • jvmquake is deployed on Blazegraph with puppet
  • jvmquake is configured in reporting only mode

Event Timeline

Gehel removed the point value for this task.
TJones set the point value for this task to 5. (Oct 25 2021, 4:03 PM)
TJones removed the point value for this task.

Pushed https://gitlab.wikimedia.org/repos/search-platform/jvmquake/-/merge_requests/1 (up for review) to have a debian package that we could install on production machines.

Mentioned in SAL (#wikimedia-operations) [2022-03-08T16:02:10Z] <inflatador> bking@deneb manually installed openjdk-11-jdk for T293862 . moritzm will add puppet patch for this

Mentioned in SAL (#wikimedia-operations) [2022-03-08T16:53:57Z] <inflatador> bking@deneb manually installed tox for T293862 . moritzm will add puppet patch for this

Small update:
Moritz built deb pkg
Manually installed on wdqs1010
Puppet patch for fleet-wide installation is planned.

Change 770978 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] [wdqs] add jvmquake options to wdqs1010 for testing

https://gerrit.wikimedia.org/r/770978

Mentioned in SAL (#wikimedia-operations) [2022-03-16T09:36:05Z] <dcausse> T293862: manually restarted blazegraph on wdqs1010 with "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake"
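
For reference when comparing the settings used in this task, the following Python sketch splits such an -agentpath argument into the library path, the leading positional integers, and the key=value options (warn, touch). The helper name and dataclass are hypothetical. Upstream jvmquake documentation describes the three positional values as threshold (seconds), runtime weight and action; treat that reading as an assumption here, since the task itself does not spell it out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class JvmquakeOptions:
    # Leading comma-separated integers, kept in order (assumed to be
    # threshold, runtime weight and action).
    positional: List[int] = field(default_factory=list)
    # key=value options such as warn=... and touch=..., kept as strings.
    keywords: Dict[str, str] = field(default_factory=dict)


def parse_agentpath(arg: str) -> Tuple[str, JvmquakeOptions]:
    """Split '-agentpath:<library>=<options>' into (library, options)."""
    prefix = "-agentpath:"
    if not arg.startswith(prefix):
        raise ValueError(f"not an -agentpath argument: {arg!r}")
    # The first '=' separates the library path from the option string.
    library, _, raw_options = arg[len(prefix):].partition("=")
    options = JvmquakeOptions()
    for token in filter(None, raw_options.split(",")):
        key, sep, value = token.partition("=")
        if sep:
            options.keywords[key] = value
        else:
            options.positional.append(int(token))
    return library, options


if __name__ == "__main__":
    lib, opts = parse_agentpath(
        "-agentpath:/usr/lib/libjvmquake.so=1000,1,0,warn=30,touch=/tmp/jvmquake"
    )
    print(lib, opts.positional, opts.keywords)
```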

Change 770978 merged by Bking:

[operations/puppet@production] [wdqs] test jvmquake options on the public cluster

https://gerrit.wikimedia.org/r/770978

Mentioned in SAL (#wikimedia-operations) [2022-03-24T21:11:20Z] <inflatador> bking@cumin1001 restarting blazegraph on wdqs[1003-1013].eqiad.wmnet for T293862

Change 773758 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] team-search-platform: add jvmquake alerting

https://gerrit.wikimedia.org/r/773758

Change 773758 merged by jenkins-bot:

[operations/alerts@master] team-search-platform: add jvmquake alerting

https://gerrit.wikimedia.org/r/773758

Change 775254 had a related patch set uploaded (by Ryan Kemper; author: DCausse):

[operations/puppet@production] wdqs: tune jvmquake settings

https://gerrit.wikimedia.org/r/775254

Change 775254 merged by Ryan Kemper:

[operations/puppet@production] wdqs: tune jvmquake settings

https://gerrit.wikimedia.org/r/775254

With these settings we correctly detected wdqs1006 going down for 30 minutes at 2022-04-01T12:30:00 (the detection came 2 minutes after the first blip in the graph).
Unfortunately there was also a false positive on wdqs1012 at 2022-04-01T10:00:00, as this machine was unavailable for only 2 minutes.
Unsure whether it is still too sensitive or whether we can accept a couple of false positives.

Actually wdqs2007, wdqs2004 and wdqs2003 also triggered jvmquake: GC activity increased, and wdqs2007 & wdqs2003 were unresponsive for a couple of minutes. For wdqs2004 there are no visible blips in the various graphs. I guess we should relax the settings a bit more.

Change 776857 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: tune jvmquake settings (take 2)

https://gerrit.wikimedia.org/r/776857

Change 776857 merged by Bking:

[operations/puppet@production] wdqs: tune jvmquake settings (take 2)

https://gerrit.wikimedia.org/r/776857

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:31:13Z] <ryankemper> [WDQS] T293862 Need to do a rolling restart of wdqs public; going to just roll a full deploy since it's equal work

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:44:11Z] <ryankemper> T293862 Rolling restart of wdqs public is complete; new jvmquake settings have been uptaken on wdqs public hosts: -agentpath:/usr/lib/libjvmquake.so=1000,5,0,warn=60,touch=/tmp/wdqs_blazegraph_jvmquake_warn_gc

To check for presence of touched file:

ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' '[ -f "/tmp/wdqs_blazegraph_jvmquake_warn_gc" ] && echo yes || echo no'
11 hosts will be targeted:
wdqs[2001-2004,2007].codfw.wmnet,wdqs[1004-1007,1012-1013].eqiad.wmnet
Ok to proceed on 11 hosts? Enter the number of affected hosts to confirm or "q" to quit 11
===== NODE GROUP =====
(5) wdqs[2001-2002,2007].codfw.wmnet,wdqs[1004-1005].eqiad.wmnet
----- OUTPUT of '[ -f "/tmp/wdqs_...o yes || echo no' -----
no
===== NODE GROUP =====
(6) wdqs[2003-2004].codfw.wmnet,wdqs[1006-1007,1012-1013].eqiad.wmnet
----- OUTPUT of '[ -f "/tmp/wdqs_...o yes || echo no' -----
yes
================
PASS |███████████████████████████████████████████████████████| 100% (11/11) [00:00<00:00, 12.38hosts/s]
FAIL |                                                                |   0% (0/11) [00:00<?, ?hosts/s]
100.0% (11/11) success ratio (>= 100.0% threshold) for command: '[ -f "/tmp/wdqs_...o yes || echo no'.
100.0% (11/11) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

To clear away current touched files:

ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"
11 hosts will be targeted:
wdqs[2001-2004,2007].codfw.wmnet,wdqs[1004-1007,1012-1013].eqiad.wmnet
Ok to proceed on 11 hosts? Enter the number of affected hosts to confirm or "q" to quit 11
===== NODE GROUP =====
(6) wdqs[2003-2004].codfw.wmnet,wdqs[1006-1007,1012-1013].eqiad.wmnet
----- OUTPUT of 'rm -fv '/tmp/wdq...vmquake_warn_gc'' -----
removed '/tmp/wdqs_blazegraph_jvmquake_warn_gc'
================
PASS |███████████████████████████████████████████████████████| 100% (11/11) [00:00<00:00, 12.55hosts/s]
FAIL |                                                                |   0% (0/11) [00:00<?, ?hosts/s]
100.0% (11/11) success ratio (>= 100.0% threshold) for command: 'rm -fv '/tmp/wdq...vmquake_warn_gc''.
100.0% (11/11) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
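
The check and cleanup above can also be done on a single host without cumin. The sketch below is a minimal local equivalent in Python: it reports when the marker file was last touched (which the plain yes/no check does not show) and removes it in the manner of rm -f. The function names are hypothetical, and the assumption that the file's modification time reflects the most recent warning is mine, not something stated in this task.

```python
import os
from datetime import datetime, timezone
from typing import Optional

# Marker path taken from the jvmquake settings quoted in this task.
WARN_MARKER = "/tmp/wdqs_blazegraph_jvmquake_warn_gc"


def warn_marker_time(path: str = WARN_MARKER) -> Optional[datetime]:
    """Return the marker's last-modified time in UTC, or None if it is absent."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return datetime.fromtimestamp(mtime, tz=timezone.utc)


def clear_warn_marker(path: str = WARN_MARKER) -> bool:
    """Remove the marker if present. Returns True if a file was removed."""
    try:
        os.remove(path)
    except FileNotFoundError:
        return False
    return True


if __name__ == "__main__":
    touched = warn_marker_time()
    print("no" if touched is None else f"yes (last touched {touched.isoformat()})")
```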

Mentioned in SAL (#wikimedia-operations) [2022-04-07T17:50:24Z] <ryankemper> T293862 Removed touched files so that it'll be easier to see when the new jvmquake threshold is crossed: ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-public' "rm -fv '/tmp/wdqs_blazegraph_jvmquake_warn_gc'"

RKemper moved this task from Needs review to Waiting on the Discovery-Search (Current work) board.

Moving to Waiting while we see how the newest settings do

Oops, I just meant to move on workboard, not sure how I closed it as well

Change 779440 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: activate jvmquake at 300:5

https://gerrit.wikimedia.org/r/779440

Change 779831 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] team-search-platform: remove BlazegraphJvmQuakeWarnGC

https://gerrit.wikimedia.org/r/779831

Change 779440 merged by Bking:

[operations/puppet@production] wdqs: activate jvmquake at 300:5

https://gerrit.wikimedia.org/r/779440

Change 779831 merged by jenkins-bot:

[operations/alerts@master] team-search-platform: remove BlazegraphJvmQuakeWarnGC

https://gerrit.wikimedia.org/r/779831