
Improve monitoring of Gerrit's memory usage
Closed, Duplicate (Public)

Description

Follow-up from T270451 (Gerrit is OOMing) and all the other times Gerrit (or gerrit-replica) has OOMed and required a restart. Users report getting 502 errors when trying to clone/pull, and Gerrit becomes very sluggish when it is able to respond at all. Ideally, monitoring would detect when Gerrit is close to OOMing so it can be restarted before users actually experience errors.

Notes from IRC:

 16:30:05 <+thcipriani> legoktm: we have a monitoring of https://gerrit.wikimedia.org/r/plugins/healthcheck/Documentation/index.html but we start to hit parallel stop-the-world GC events degrading performance before that explodes
...
 16:34:07 <+thcipriani> this would probably be the metric to add alerting on: https://grafana.wikimedia.org/d/Bw2mQ3iWz/javamelody?viewPanel=14&orgId=1
 16:35:19 <+thcipriani> that's typically in the 10s of milliseconds, once it's above 100ms that typically is enough impact for people to notice slowness
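
A rough sketch of what an alert on that metric could look like, assuming the GC pause times are available as a series of millisecond samples. The 100 ms threshold comes from the notes above; the function name, the `min_consecutive` parameter, and the idea of requiring several consecutive samples are hypothetical and only there to avoid firing on a single spike. This is not how the existing monitoring is implemented.

```python
from typing import Sequence

# From the IRC notes: pauses are typically tens of milliseconds, and
# above 100 ms the impact is usually enough for people to notice.
GC_PAUSE_ALERT_MS = 100.0


def should_alert(
    pause_samples_ms: Sequence[float],
    threshold_ms: float = GC_PAUSE_ALERT_MS,
    min_consecutive: int = 3,  # hypothetical: not taken from the task
) -> bool:
    """Return True if the most recent `min_consecutive` samples all
    exceed `threshold_ms`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(pause_samples_ms) < min_consecutive:
        # Not enough data yet to call it a sustained problem.
        return False
    recent = pause_samples_ms[-min_consecutive:]
    return all(sample > threshold_ms for sample in recent)


if __name__ == "__main__":
    healthy = [12.0, 18.5, 25.0, 140.0, 20.0]   # one spike, then recovery
    degraded = [30.0, 95.0, 120.0, 180.0, 240.0]  # sustained long pauses
    print(should_alert(healthy))
    print(should_alert(degraded))
```

Requiring a few consecutive samples over the threshold is a design choice, not something stated in the task; a single long pause may or may not be worth paging on.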
