Get health check warnings when a web node goes down from the load balancer's perspective
Closed, ResolvedPublic

Description

According to the nginx documentation, nginx passively marks an upstream node as failed based on errors it sees while proxying requests to it. After the fail_timeout period, it starts sending requests to that node again. For our purposes, this means that unless both web nodes fail (and the request misses the cache), we won't receive any alerts from icinga.
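
For illustration, this passive failure handling is governed by the max_fails and fail_timeout parameters on each upstream server. A minimal sketch only; the node names and values here are placeholders, and the real configuration lives in the lb.nginx.erb template linked below:

upstream oresweb {
    # placeholder servers and thresholds; not the actual lb.nginx.erb contents
    server ores-web-03:8080 max_fails=3 fail_timeout=30s;
    server ores-web-04:8080 max_fails=3 fail_timeout=30s;
}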

Useful links:
nginx load balancing: http://nginx.org/en/docs/http/load_balancing.html
nginx lb config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/templates/ores/lb.nginx.erb
puppet lb config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/labs/ores/lb.pp
icinga ores config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/icinga/manifests/monitor/ores.pp
icinga ores worker check: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/nagios_common/files/check_commands/check_ores_workers

Event Timeline

Here's what I've found:

Icinga is currently performing two http checks and one host-level check.

The http checks are hitting ores.wmflabs.org and http://oresweb/scores/testwiki/reverted/${timestamp}/

Nginx uses a round-robin strategy by default and, upon failing a host, will try to bring it back gracefully after the fail_timeout period.

If we want to know when nginx fails a host, we'll have to monitor /var/log/nginx/error.log. The alternative is to have icinga make http requests directly against the web nodes (and probably keep the lb request as well).

From within labs, you can ping a web node directly. E.g. http://ores-web-03:8080

This can only work from within labs.

After speaking with @akosiaris, since the web nodes have private IP addresses in the labs environment, they can't be monitored directly by icinga.

He suggested monitoring the nginx status page. There are also modules for nginx like this that explicitly monitor upstream hosts.

@Halfak would having a path like /nginx_status present any problems to the other functionality? Would you want restrictions on the visibility of such a path?

+1 for something like /nginx_status
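
For reference, restricting the visibility of such a path could look roughly like this. This is a sketch only, assuming the stub status module that ships with the stock nginx package; the allowed range is just a placeholder:

location = /nginx_status {
    stub_status on;
    # hypothetical restriction: only internal ranges may see the counters
    allow 10.0.0.0/8;
    deny all;
}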

The stub status page only provides info like the following:

Active connections: 12
server accepts handled requests
155124 155124 222151
Reading: 0 Writing: 9 Waiting: 4

This isn't enough information to determine the health of the upstream nodes, and including additional modules specific to this task would require rolling a custom-compiled nginx (a bad idea).

@Halfak do you have any more specific information about where/when there's missing info? We could look into the logs, see what's happening, and modify the system to provide visibility into the scenario.

do you have any more specific information about where/when there's missing info?

Sorry, I'm confused. The scenario is that a web node goes down. The reason shouldn't matter.

Sorry, I'm confused. The scenario is that a web node goes down. The reason shouldn't matter.

As it's configured, Icinga doesn't seem to be a viable option for monitoring the health of individual web nodes. How are internal-to-labs services currently being monitored?

How are internal-to-labs services currently being monitored?

@yuvipanda or @Dzahn might be able to answer this question.

Depending on what you mean by 'internal-to-labs', the answer might be 'lol, we do not' to 'hope' to 'this graphite+shinken based complex thingy that is complicated'...

We can monitor something from prod Icinga if it has a public IP and we consider it a special "semi-prod" service or so. Ores is being monitored here:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ores.wmflabs.org

Depending on what you mean by 'internal-to-labs', the answer might be 'lol, we do not' to 'hope' to 'this graphite+shinken based complex thingy that is complicated'...

We want to know when an individual web node goes down from the perspective of the load balancer.

Currently, there is no clear way to do that.

@schana Any thoughts on what direction to take this?

@Halfak, to clarify, this task is referencing the labs instance, correct? How has the infrastructure changed for the production deployment?

Yes. This references the labs install.

Re. production, I don't know. @akosiaris?

Yes. This references the labs install.

Re. production, I don't know. @akosiaris?

pybal takes care of marking problematic nodes as failed and not routing requests to them. pybal constantly issues health monitoring requests and decides to pool/depool hosts accordingly. See https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/lvs/configuration.yaml;b78d1bcde9ef96c9511ee672bf3e93f8ea15db4f$946 for the config stanza. Alerting on LVS will only happen if no server is able to service the request; per-server alerting, however, happens normally. I doubt any of this is reusable in labs, though.

schana renamed this task from [spike] Find out if we can still get health check warnings after lb rebalance to [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective.Jun 23 2016, 8:32 AM
schana updated the task description. (Show Details)

Here's a plan that I think would solve the problem as stated:

  1. Add paths to the load balancer that allow specific web nodes to be hit directly
  2. Add a path that returns the current list of web nodes (see the sketch after this list)
  3. Modify check_ores_workers to fetch the list of web nodes, query each, and return something based on matching an expected string
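
A minimal sketch of what item 2 could look like on the nginx side; the path name and node list are placeholders (the real list would come from the realservers hiera data):

location = /web-nodes {
    # hypothetical endpoint returning the configured web nodes as plain text
    default_type text/plain;
    return 200 "ores-web-03:8080\nores-web-04:8080\n";
}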

Problems I have:

  1. check_ores_workers is currently named poorly and this use case doesn't help that
  2. I'm not sure how icinga would handle the resulting string mismatch (would it show up somewhere if "okay" != "ores-web-03:8080 down"?)
  3. check_ores_workers isn't checking based on the hiera config; it's performing an additional query that shouldn't be necessary

@Halfak, thoughts?

@schana check_ores_workers is a wrapper around the standard nagios plugin check_http; the line is currently check_http -f follow -H $host -I $host -u "http://oresweb/scores/testwiki/reverted/${timestamp}/"

We can let check_http check for a specific string on the page but we don't currently do that. So it just checks if the connection works and the OK is just the output of the check itself. We have these options:

check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
       [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-a auth]
       [-b proxy_auth] [-f <ok|warning|critcal|follow|sticky|stickyport>]
       [-e <expect>] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
       [-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
       [-A string] [-k string] [-S] [--sni] [-C <age>] [-T <content-type>]
       [-j method]

Change 296535 had a related patch set uploaded (by Nschaaf):
Check all ores web nodes

https://gerrit.wikimedia.org/r/296535

I've created a patch, but am unsure how to properly implement the following in an approved-by-ops way (that is, defining multiple things based off an array):

$realservers = hiera('role::labs::ores::lb::realservers')

define monitor_ores_labs_web_node ($realserver = $title) {
    $server_parts = split($realserver, ':')
    $server = $server_parts[0]
    monitoring::service { "ores_web_node_labs_${server}":
        description   => "ORES web node labs ${server}",
        check_command => "check_ores_workers!oresweb/${server}",
        host          => 'ores.wmflabs.org',
        contact_group => 'team-ores',
    }
}

monitor_ores_labs_web_node { $realservers: }

@schana I have amended the patch to fix the jenkins-bot downvote and in response to Yuvi's comments. That part is fixed now, but it can't find the server names in hiera yet (at least when I run it in the puppet compiler). Is that maybe just in labs hiera but needs to be added to production ./hieradata/ ?

Is that maybe just in labs hiera but needs to be added to production ./hieradata/ ?

I'm unfamiliar with how the hiera configuration works; all I know is I can view it here: https://wikitech.wikimedia.org/wiki/Hiera:Ores

@schana Yea, so the Hiera: namespace on wikitech is one of the two places where you can add Hiera data in Labs, but for it to be in production, it has to be added in ./hieradata/ in operations/puppet along with your change.

Change 296535 merged by Dzahn:
icinga: check all ores web nodes

https://gerrit.wikimedia.org/r/296535

So the full command we are actually running here, after unwrapping it, should be something like:

/usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/node/ores-web-03/scores/testwiki/reverted/1234"

from:

$pluginpath/check_http -f follow -H $host -I $host -u "http://${urlhost}/scores/testwiki/reverted/${timestamp}/"

and that gets us the 404

I think nginx needs "^~" before the node location to properly match.

@schana bingo!

root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/node/ores-web-03"
HTTP WARNING: HTTP/1.1 404 NOT FOUND - 414 bytes in 0.056 second response time |time=0.056263s;;;0.000000 size=414B;;;0
root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/~node/ores-web-03"
HTTP WARNING: HTTP/1.1 404 NOT FOUND - 414 bytes in 0.029 second response time |time=0.029213s;;;0.000000 size=414B;;;0
root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/^~node/ores-web-03"
HTTP OK: HTTP/1.1 200 OK - 2950 bytes in 0.044 second response time |time=0.043526s;;;0.000000 size=2950B;;;0

Change 297115 had a related patch set uploaded (by Dzahn):
ores: fix-up web node monitoring

https://gerrit.wikimedia.org/r/297115

Change 297115 merged by Dzahn:
ores: fix-up web node monitoring

https://gerrit.wikimedia.org/r/297115

@schana Thanks, you were exactly right about that extra ^~. It works now. Let me know if I closed this ticket too early and there was more involved.

@Dzahn, I meant for that to go in the nginx config location block, to tell nginx what type of matching to perform on the path.

location ^~ /node/blah {
  ...
}

To expand a bit further, the following should proxy a request like http://domain/foo/some/path to http://otherdomain/some/path

location ^~ /foo/ {
  proxy_pass http://otherdomain/;
}
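
Applied to this task, a per-node path could look roughly like the following. This is a sketch only; the node name is just an example, and the actual configuration is what landed via the Gerrit changes above:

location ^~ /node/ores-web-03/ {
    # requests for /node/ores-web-03/scores/... go straight to that node,
    # bypassing the upstream round-robin
    proxy_pass http://ores-web-03:8080/;
}
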
Halfak renamed this task from [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective to Get health check warnings when a web node goes down from the load balancer's perspective.Jul 5 2016, 2:45 PM

We might want to simulate some downtime to make sure that we get pings. Is there a better way?

We should bring back ores-web-04, repool, bring down -05 and wait for error notification.

Currently it looks like it just works when you ask ores.wmflabs.org for a URL like "http://oresweb/^~node/ores-web-03".

I don't know if it should or not; I can just confirm that it is only a 200 if you ask for "http://oresweb/%5E~node/ores-web-03%22" like the icinga check does.

Change 297599 had a related patch set uploaded (by Nschaaf):
Change path to proxy node requests

https://gerrit.wikimedia.org/r/297599

Change 297599 merged by Dzahn:
Change path to proxy node requests

https://gerrit.wikimedia.org/r/297599

Yes, and the checks also look good in Icinga. If you want to, we can do that test and break something to confirm it triggers the check.

Fun story, we actually had a minor amount of downtime on ores-web-03 during a puppet run today and it reported the issues and recovery just fine. So I think we can declare victory here.

Alternatively, we have ores-web-04 depooled right now. We could repool it, shut down -03 and let the warning go off.

Confirmed, I have seen that minor outage too.

And we have the history here to prove it.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-03

So I think we can call it resolved indeed.

@schana do you agree it's resolved?

Yes, until there's a better solution (be it healthcheck-specific URLs or whatever).

Ladsgroup subscribed.

Sorry for re-opening; #revision-scoring-as-a-service tasks stay open until the end of the weekly meeting, and then we close them all together (after the weekly update). Sorry for the confusion.