Page MenuHomePhabricator

improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit)
Closed, ResolvedPublic

Description

We should investigate why icinga did not report high cpu/load including gerrit.wikimedia.org loading very slow.


We should add better monitoring that would have caught this.

This can be a new check_http check and/or installing the healthcheck plugin for gerrit.

Event Timeline

Paladox created this task.Jan 31 2019, 3:54 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 31 2019, 3:54 PM
Paladox triaged this task as High priority.Jan 31 2019, 3:54 PM
Paladox renamed this task from Investigate why icinga did not report high cpu/load to Investigate why icinga did not report high cpu/load for gerrit.Jan 31 2019, 3:56 PM

Please include context. Is this related to T215004?

Yup, it's related to that.

Dzahn added a comment.Feb 1 2019, 7:29 AM

The check_load plugin can be used for that. We do use it but only on other servers, API appservers, SWIFT and a passive check for Fundraising servers.

nrpe::monitor_service { 'cpu_load':
    description  => 'High CPU load on API appserver',
    nrpe_command => "/usr/lib/nagios/plugins/check_load -w ${warning} -c ${critical}",
    retries      => 1,
}
nrpe::monitor_service { 'load_average':
     description  => 'very high load average likely xfs',
     nrpe_command => '/usr/lib/nagios/plugins/check_load -w 80,80,80 -c 200,100,100',
 }

Thank you @Dzahn i guess we should do both?

Dzahn added a comment.Feb 3 2019, 12:40 AM

@Paladox I think it's just different descriptions for the same thing when looking at the nrpe_command line only. So just one new check.

hashar added a subscriber: hashar.Feb 4 2019, 12:07 PM

Icinga does not monitor Gerrit CPU usage / system load. We would need to add the check_load plugin mentioned above by @Dzahn.

CDanis added a subscriber: CDanis.Feb 4 2019, 5:57 PM

If the thing we want to monitor is "Gerrit is responding slowly / not at all", IMO that is the thing we should check. High CPU load is just one of the causes that could result in high latency.

Would it be difficult to use check_http, which already supports WARNING/CRITICAL on specific response durations, pointed at a few key Gerrit URLs? Maybe just the front page URL and the JSON API call for "recent changes"?

Dzahn added a comment.Feb 4 2019, 6:08 PM

Would it be difficult to use check_http, .. pointed at a few key Gerrit URLs?

We already check if HTTPS works as it's included in the cert expiry check (check_ssl). Afair we used check_http before that and replaced it.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cobalt&service=HTTPS

check_command => "check_ssl_on_host_port_letsencrypt!${tls_host}!${tls_host}!443",
define command{
    command_name    check_ssl_on_host_port
    command_line    $USER1$/check_ssl --warning 60 --critical 30 --cn $ARG1$ -H $ARG2$ -p $ARG3$
}

This does not check for specific content patterns though.

We also check if the Gerrit process is running already at:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cobalt&service=gerrit+process

Do you feel we should still add some check_http in addition and check for content patterns and return code? We could, yea.

Change 487901 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: add icinga https check for dashboard content

https://gerrit.wikimedia.org/r/487901

CDanis added a comment.Feb 4 2019, 6:42 PM

The timeouts on that check_ssl invocation make little sense to me -- a warning after 60 seconds, but critical after 30? Those seem backwards, and also too high by something like a factor of 4.

Do we know how slowly Gerrit was loading during the incident?

I think checking for content patterns makes sense -- really, even just a trivial one on the length of the response returned by this URL:
https://gerrit.wikimedia.org/r/changes/?n=25&O=81
(and also ensuring it was a 200 response, although I think check_ssl does this?)

If you're curious for some background on why symptom-based alerting is one of my personal hobbyhorses, you could read https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit# from a former coworker of mine.

Dzahn added a comment.EditedFeb 4 2019, 6:50 PM

a warning after 60 seconds, but critical after 30? Those seem backwards

Agreed, i bet it used to be once 60 and 90 and then somebody lowered 90 to 30 without adjusting the other.

Do we know how slowly Gerrit was loading during the incident?

@Paladox do you know?

I think checking for content patterns makes sense

I made one here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/487901

That uses our existing check command check_https_url_at_address_for_string which translates to:

check_http -H $ARG1$ -I $ARG1$ -S -u $ARG2$ -s $ARG3$

So on icinga1001 that is:

@icinga1001:/etc/icinga# /usr/lib/nagios/plugins/check_http -H gerrit.wikimedia.org -I gerrit.wikimedia.org -S -u /r/ -s "Wikimedia Code Review"

HTTP OK: HTTP/1.1 200 OK - 13953 bytes in 0.013 second response time |time=0.012935s;;;0.000000;10.000000 size=13953B;;;0

(and also ensuring it was a 200 response, although I think check_ssl does this?)

Not sure it does, but yes, that is included in the new one above.

If you're curious for some background on why symptom-based alerting is one of my personal hobbyhorses, you could read https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit# from a former coworker of mine.

Thanks

Im thinking we need the health check plugin.

That way we just need to check the http status code.

Dzahn added a comment.EditedFeb 4 2019, 7:03 PM

The timeouts on that check_ssl invocation make little sense to me

Actually i quoted the wrong check command. It uses check_ssl_on_host_port_letsencrypt and not check_ssl_on_host_port which means the actual command is:

check_ssl --warning 7 --critical 3 --cn $ARG1$ -H $ARG2$ -p $ARG3$

So as you say "too high by something like a factor of 4" and that is already the case and no specific fix needed for Gerrit.

Only fix needed if other services actually use the "check_ssl_on_host_port" command and if none uses it to delete that command.

Dzahn added a comment.Feb 4 2019, 7:13 PM

P.S. The --warning and --critical values are not backwards because they are "days until expiry".

Dzahn renamed this task from Investigate why icinga did not report high cpu/load for gerrit to improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).Feb 4 2019, 7:18 PM
Dzahn updated the task description. (Show Details)

Technically "investigate why it did not alert" has been resolved. But of course we also mean to do something about it for next time. Ticket renamed /edited accordingly.

So we just need a http check that checks the website (without checking if the ssl cert is valid or not as we have that check already) and also a load check to make sure the load doesn't go too high otherwise it could cause noticeable impact.

CDanis added a comment.Feb 4 2019, 7:57 PM

So we just need a http check that checks the website (without checking if the ssl cert is valid or not as we have that check already) and also a load check to make sure the load doesn't go too high otherwise it could cause noticeable impact.

Restating what I was trying to communicate earlier:

checking that the server responds with something that looks 'good' with reasonable latency should be sufficient.

That fully encapsulates what we actually care to be alerted about. A check on system load is neither sufficient nor necessary.

Dzahn added a comment.Feb 4 2019, 8:10 PM

Indeed, i would say merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/487901 should resolve this and as a stretch goal we can checkout the healthcheck plugin you mentioned. Maybe in a separate task.

Dzahn added a comment.EditedFeb 4 2019, 8:11 PM

healthcheck plugin you mentioned. Maybe in a separate task.

I see that already exists :) --> T214326

The extra checks we get from that are listed at T214326#4897801.

Already have T214326 for the health check plugin :)

Change 487901 merged by Dzahn:
[operations/puppet@production] gerrit: add icinga https check for actual JSON content

https://gerrit.wikimedia.org/r/487901

Change 488232 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/gerrit: move gerrit monitoring to icinga module

https://gerrit.wikimedia.org/r/488232

Change 488232 merged by Dzahn:
[operations/puppet@production] icinga/gerrit: move gerrit monitoring to icinga module

https://gerrit.wikimedia.org/r/488232

Dzahn added a comment.Feb 6 2019, 4:41 AM

new virtual "service" host: (so also ping check for gerrit.wikimedia.org as opposed to cobalt or gerrit2001):

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=gerrit.wikimedia.org

new service on that host but Icinga can't find the new check command yet (why? i can confirm it's there in commands.cfg and that is included in the main config)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit.wikimedia.org&service=Gerrit+JSON&scroll=110

Is the script on cobalt?

jijiki moved this task from Backlog to Incoming on the serviceops board.Feb 6 2019, 12:08 PM

Change 488636 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/gerrit: add double quotes around URL part in check command

https://gerrit.wikimedia.org/r/488636

Change 488636 merged by Dzahn:
[operations/puppet@production] icinga/gerrit: add double quotes around URL part in check command

https://gerrit.wikimedia.org/r/488636

Dzahn added a comment.Feb 7 2019, 12:54 AM

The new check "Gerrit JSON" works now:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit.wikimedia.org&service=Gerrit+JSON

It checks if virtual host "gerrit.wikimedia.org" (not the individual servers in each DC) actually returns some JSON (checks for a minimum size), once, from the icinga prod host.

Dzahn closed this task as Resolved.Feb 7 2019, 12:56 AM
Dzahn claimed this task.