
gerrit: mod_qos allowlist and monitoring improvements
Closed, Resolved, Public

Description

In T406005 (train-presync failed due to git clone failing with gnutls_handshake() failure), we saw two issues:

  • some servers are not properly identified as VIP by httpd.
  • the mtail configuration for mod_qos is missing some QoS events.

Event Timeline

ABran-WMF triaged this task as High priority.
ABran-WMF added a subscriber: LSobanski.

Change #1192831 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: fix allowlist for mod_qos

https://gerrit.wikimedia.org/r/1192831

mtail was not the source of the observability issue. I've fixed the PromQL query that renders the QoS event rates, using this uptick from the logs as a reference.

Change #1192831 merged by Arnaudb:

[operations/puppet@production] gerrit: fix allowlist for mod_qos

https://gerrit.wikimedia.org/r/1192831

Change #1192839 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] Revert^2 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192839

Change #1192839 merged by Arnaudb:

[operations/puppet@production] Revert^2 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192839

Change #1192845 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] Revert^4 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192845

Change #1192845 merged by Arnaudb:

[operations/puppet@production] Revert^4 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192845

Change #1192854 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] Revert^6 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192854

Change #1192854 merged by Arnaudb:

[operations/puppet@production] Revert^6 "gerrit: fix allowlist for mod_qos"

https://gerrit.wikimedia.org/r/1192854

Change #1192882 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: toggle mod_qos log only

https://gerrit.wikimedia.org/r/1192882

Change #1192882 merged by Arnaudb:

[operations/puppet@production] gerrit: toggle mod_qos log only

https://gerrit.wikimedia.org/r/1192882

We had an outage yesterday related to the QoS limit. It was only reported on IRC/Slack. For the record:

16:25:38 <bearloga> Is anyone else experiencing Gerrit being so weird and not always loading today?
17:54 <aude> is it just me or is gerrit slow today and intermittently not loading? (and to lesser extent maybe phabricator too)
19:08:48 <James_F> Also specifically I'm getting "Plugin install error: https://gerrit.wikimedia.org/r/plugins/wm-motd/static/wm-motd.js load error from https://gerrit.wikimedia.org/r/plugins/wm-motd/static/wm-motd.js " errors.
19:42:33 <bearloga> My experience has been a mix of: Gerrit not loading at all, Gerrit loading after a while, Gerrit loading but blank and errors about a bunch of plugins

The cause was that the number of allowed concurrent connections was reduced from 25 to 20 at 9:00 UTC with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1193013
The limit was raised back at 19:45 UTC with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1193212
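
To illustrate why a drop from 25 to 20 can surface as intermittent load failures, here is a minimal Python sketch of a concurrent-connection limiter with an allowlist. It is a toy model, not mod_qos itself: the class and function names, the example IP, and the assumption that the limit is counted per client are all hypothetical and not taken from the Puppet changes above.

```python
from collections import Counter


class ConnectionLimiter:
    """Toy concurrent-connection limiter (illustration only, not mod_qos)."""

    def __init__(self, max_concurrent, allowlist=()):
        self.max_concurrent = max_concurrent
        self.allowlist = frozenset(allowlist)  # clients exempt from the limit
        self.active = Counter()                # open connections per client

    def try_open(self, client):
        """Return True if the connection is accepted, False if it is denied."""
        limited = client not in self.allowlist
        if limited and self.active[client] >= self.max_concurrent:
            return False
        self.active[client] += 1
        return True

    def close(self, client):
        """Release one connection slot for the client, if it holds any."""
        if self.active[client] > 0:
            self.active[client] -= 1


def accepted_out_of(limit, attempts, client="198.51.100.7", allowlist=()):
    """Count how many of `attempts` simultaneous connections are accepted."""
    limiter = ConnectionLimiter(limit, allowlist)
    return sum(limiter.try_open(client) for _ in range(attempts))


if __name__ == "__main__":
    print(accepted_out_of(25, 25))  # limit before the change
    print(accepted_out_of(20, 25))  # limit during the outage
    print(accepted_out_of(20, 25, allowlist=("198.51.100.7",)))  # allowlisted
```

In this model a client that opens 25 connections at once loses 5 of them under the lower limit, while an allowlisted client is unaffected. That matches the pattern in the reports above, where pages loaded but some plugin requests failed.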