
Add better monitoring for Analytics UIs
Open, High, Public

Description

We need better monitoring for Superset and Turnilo. The current checks are:

monitoring::service { 'superset':
    description   => 'superset',
    check_command => "check_tcp!${::superset::port}",
    require       => Class['::superset'],
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset',
}

monitoring::service { 'turnilo':
    description   => 'turnilo',
    check_command => "check_tcp!${port}",
    contact_group => $contact_group,
    notes_url     => 'https://wikitech.wikimedia.org/wiki/Analytics/Systems/Turnilo-Pivot',
}

These checks fire only when the daemon is down and its port is no longer reachable. They do not fire when the daemon is up but responding with errors, as happened this morning.
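To illustrate the gap, here is a minimal sketch of a check that looks at the HTTP status rather than just the open port. The function names and the status-to-severity mapping (4xx as WARNING, 5xx as CRITICAL) are assumptions for illustration, not what any patch on this task does; only the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are standard.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to a Nagios plugin exit code.
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
nagios_code_for_http_status() {
    case "$1" in
        2??|3??) return 0 ;;  # service answered normally (or redirected)
        4??)     return 1 ;;  # reachable, but the request was rejected
        5??)     return 2 ;;  # daemon is up but failing
        *)       return 3 ;;  # no usable status, e.g. 000 when curl got no response
    esac
}

# Hypothetical helper: fetch a URL and report in Nagios plugin style.
check_http_status() {
    url="$1"
    # curl prints 000 for %{http_code} when it gets no response at all,
    # so its own exit code is deliberately ignored here.
    status="$(curl --silent --output /dev/null --max-time 10 \
        --write-out '%{http_code}' "$url")" || true
    rc=0
    nagios_code_for_http_status "$status" || rc=$?
    echo "HTTP ${status} from ${url}"
    return "$rc"
}
```

A TCP check would report OK for a daemon that answers every request with HTTP 500, while `check_http_status` would return 2 (CRITICAL) in that case.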

Superset is tricky because of its auth scheme. Maybe we could have a test user and use something like https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Test_as_different_user_on_staging ? Turnilo should be easy.

Event Timeline

elukey triaged this task as High priority. Thu, Mar 18, 6:53 AM
elukey created this task.

Change 673556 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] turnilo: add monitoring for http

https://gerrit.wikimedia.org/r/673556

@elukey I made a patch for turnilo, should be straightforward as you said.

For Superset, if the monitoring checks come from the host itself, we should be able to use the user header, and can either create a new user or just use "admin".

Here's a curl command that checks if superset is running:

curl -L 'http://localhost:9080/login/' -H 'X-Remote-User: admin' -c cookiejar-$RANDOM

For Nagios to implement this, we'll have to ensure the following:

  • redirects are enabled (-L in curl) since /login will set a session cookie and redirect to /superset/welcome/
  • cookies are preserved between redirects (-c cookiejar-$RANDOM above, the $RANDOM is there to get a fresh cookie jar each time)

We'll also have to pass the X-Remote-User header. Apparently X-Forwarded-Proto is not needed here.

Alright, Nagios doesn't support this directly, but we can use a shell script, like modules/profile/files/eventstreams/check_eventstreams.sh.
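A rough sketch of what such a script could look like, wrapping the curl command above in a function that returns Nagios exit codes. This is not the content of the merged patch. The function name, the 10-second timeout, the "only HTTP 200 is OK" rule, and the use of mktemp instead of $RANDOM for the cookie jar are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a Superset login check for NRPE.
# Usage: check_superset_login http://localhost:9080/login/ admin
check_superset_login() {
    url="$1"
    user="$2"

    # Fresh cookie jar per run, so a stale session cannot mask a failure.
    jar="$(mktemp)" || { echo "UNKNOWN: could not create cookie jar"; return 3; }

    rc=0
    # --location follows the redirect from /login/; --cookie-jar turns on
    # curl's cookie engine so the session cookie survives the redirect.
    status="$(curl --silent --location \
        --cookie-jar "$jar" \
        --header "X-Remote-User: ${user}" \
        --output /dev/null --write-out '%{http_code}' \
        --max-time 10 "$url")" || rc=$?
    rm -f "$jar"

    if [ "$rc" -ne 0 ]; then
        echo "CRITICAL: curl exited with code ${rc}"
        return 2
    fi
    if [ "$status" = "200" ]; then
        echo "OK: Superset answered HTTP 200 after login"
        return 0
    fi
    echo "CRITICAL: Superset answered HTTP ${status}"
    return 2
}
```

The script file would end with something like `check_superset_login "$@"; exit $?` so that NRPE sees the exit code.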

Change 678044 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: check http server following redirects with curl

https://gerrit.wikimedia.org/r/678044

Change 678044 merged by Razzi:

[operations/puppet@production] superset: check http server following redirects with curl

https://gerrit.wikimedia.org/r/678044

Seeing this error when running puppet on an-tool1010:

Error: /Stage[main]/Profile::Superset/File[/usr/local/bin/check_superset_http]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/superset/check_superset_http.sh
Notice: /Stage[main]/Profile::Superset/Nrpe::Monitor_service[check_superset_http]/Nrpe::Check[check_check_superset_http]/File[/etc/nagios/nrpe.d/check_check_superset_http.cfg]: Dependency File[/usr/local/bin/check_superset_http] has failures: true
Warning: /Stage[main]/Profile::Superset/Nrpe::Monitor_service[check_superset_http]/Nrpe::Check[check_check_superset_http]/File[/etc/nagios/nrpe.d/check_check_superset_http.cfg]: Skipping because of failed dependencies
Warning: /Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]: Skipping because of failed dependencies

Change 678109 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: put puppet:// resource in files/

https://gerrit.wikimedia.org/r/678109

Change 678109 merged by Razzi:

[operations/puppet@production] superset: put puppet:// resource in files/

https://gerrit.wikimedia.org/r/678109

Fixed the puppet:/// resource error. Now getting curl exit code 47 (too many redirects).
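For debugging this kind of failure, a small sketch that turns common curl exit codes into readable messages and prints the chain of Location headers. Both function names are hypothetical, and the cap of 5 redirects is an arbitrary choice to fail fast (curl's default limit is 50).

```shell
#!/bin/sh
# Hypothetical helper: describe a few well-known curl exit codes.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "could not connect to host" ;;
        28) echo "operation timed out" ;;
        47) echo "too many redirects" ;;
        *)  echo "curl error $1" ;;
    esac
}

# Hypothetical helper: show each redirect target so a loop is easy to spot.
# Extra arguments (for example --header 'X-Remote-User: admin') are passed
# straight through to curl.
print_redirect_chain() {
    curl --silent --location --max-redirs 5 \
        --output /dev/null --dump-header - "$@" | grep -i '^location:'
}
```

If the same Location value keeps repeating in the output of `print_redirect_chain`, the server is most likely not accepting the session and is sending the client back to the login page.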

Ok, interesting: the error happens only on an-tool1010; on an-tool1005 it works fine. I'll roll the check back for now while I look into a command that works on both machines.

Change 678113 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: comment out check that isn't working as intended

https://gerrit.wikimedia.org/r/678113

Change 678113 merged by Razzi:

[operations/puppet@production] superset: comment out check that isn't working as intended

https://gerrit.wikimedia.org/r/678113

Change 678130 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: ensure http check absent until it is working

https://gerrit.wikimedia.org/r/678130

Change 678130 merged by Razzi:

[operations/puppet@production] superset: ensure http check absent until it is working

https://gerrit.wikimedia.org/r/678130

Change 678966 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] superset: use different user headers for staging and production

https://gerrit.wikimedia.org/r/678966

Change 678966 merged by Razzi:

[operations/puppet@production] superset: use different user headers for staging and production

https://gerrit.wikimedia.org/r/678966

I implemented a check that works on both staging and production, using the appropriate header for production (x-cas-uid rather than x-remote-user). I re-enabled alarms and everything looks good here.
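As an illustration of the per-environment header choice, here is a minimal sketch. It is not the merged Puppet change. The function names and the "staging"/"production" labels are assumptions, and the header names are written in mixed case only for readability, since HTTP header names are case-insensitive.

```shell
#!/bin/sh
# Hypothetical helper: pick the auth header name for an environment.
superset_auth_header_name() {
    case "$1" in
        production) echo "X-Cas-Uid" ;;
        staging)    echo "X-Remote-User" ;;
        *)          echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: build the full "Name: value" header for curl --header.
superset_auth_header() {
    name="$(superset_auth_header_name "$1")" || return 1
    echo "${name}: $2"
}
```

For example, `curl --header "$(superset_auth_header staging admin)" ...` would send `X-Remote-User: admin`.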