Get health check warnings when a web node goes down from the load balancer's perspective
Closed, ResolvedPublic

Assigned To

Authored By

	Halfak
	May 9 2016, 4:40 PM

Description

According to the nginx documentation, it will mark an upstream node as failed passively as requests come in. After the fail_timeout period, it will start trying to send requests again. For our purposes, this means that unless both web nodes fail (and the request doesn't hit the cache), we won't receive any alerts from icinga.
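To make the concern above concrete, here is a minimal sketch of passive failure marking. It is a toy model, not nginx's actual implementation; the class name `PassiveUpstream` is hypothetical, and the values `max_fails=1` and `fail_timeout=10` are assumed to match nginx's documented defaults.

```python
class PassiveUpstream:
    """Toy model of passive health checking (illustrative only, not nginx's code)."""

    def __init__(self, nodes, max_fails=1, fail_timeout=10.0):
        self.nodes = list(nodes)
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = {n: 0 for n in self.nodes}
        self.down_until = {n: 0.0 for n in self.nodes}
        self._next = 0

    def handle(self, now, is_up):
        """Try nodes round-robin; return the node that served the request, or None."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if now < self.down_until[node]:
                continue  # still marked as failed: skipped without being probed
            if is_up(node):
                self.fails[node] = 0
                return node
            # A real request failed against this node: count it passively.
            self.fails[node] += 1
            if self.fails[node] >= self.max_fails:
                self.down_until[node] = now + self.fail_timeout
                self.fails[node] = 0
        return None


lb = PassiveUpstream(["ores-web-03:8080", "ores-web-05:8080"])
alive = {"ores-web-03:8080": False, "ores-web-05:8080": True}
served = [lb.handle(t, alive.get) for t in range(3)]
print(served)
```

In this model every request is still answered by the surviving node, so a check that only goes through the load balancer keeps succeeding while one web node is down.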

Useful links:
nginx load balancing: http://nginx.org/en/docs/http/load_balancing.html
nginx lb config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/templates/ores/lb.nginx.erb
puppet lb config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/labs/ores/lb.pp
icinga ores config: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/icinga/manifests/monitor/ores.pp
icinga ores worker check: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/nagios_common/files/check_commands/check_ores_workers

Details

Subject	Repo	Branch	Lines +/-
Change path to proxy node requests	operations/puppet	production	+3 -3
ores: fix-up web node monitoring	operations/puppet	production	+1 -1
icinga: check all ores web nodes	operations/puppet	production	+20 -0


Related Objects

Mentioned In: rOPUPf86af8305808: Change path to proxy node requests
rOPUPc27a46888670: Change path to proxy node requests
rOPUPe149abbdddd9: ores: fix-up web node monitoring
rOPUP98e4f3c2bd38: ores: fix-up web node monitoring
rOPUPe28d921880ff: ores: fix-up web node monitoring
rOPUP3cfb53ee246b: ores: fix-up web node monitoring
rOPUP395a8cc3dcd5: icinga: check all ores web nodes
rOPUP94d04bdd1004: icinga: check all ores web nodes
rOPUPa1ef08ce6cde: icinga: check all ores web nodes
rOPUP74e051b40843: icinga: check all ores web nodes
rOPUPa020465cade7: Check all ores web nodes
rOPUPff2addf3579a: Check all ores web nodes
T138380: Reading teams would like a tag to identify spikes
rOPUP9fffd8152321: Check all ores web nodes
rOPUPa65ecce5a501: Check all ores web nodes
rOPUPff26aec08540: Check all ores web nodes
rOPUP02fc80173de2: Check all ores web nodes
rOPUP6a3065ff1378: Check all ores web nodes
rOPUP054dd9f54f93: Check all ores web nodes
rOPUP34bbd992f758: Check all ores web nodes
rOPUP761a547bed33: Check all ores web nodes
rOPUPe7f8f81818dd: Check all ores web nodes

Event Timeline


Restricted Application added subscribers: Zppix, Aklapper. · May 9 2016, 4:40 PM

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board. · May 9 2016, 4:56 PM

• schana claimed this task. · May 18 2016, 1:16 PM

• schana subscribed.

Here's what I've found:

Icinga is currently performing two http checks and one host-level check.

The http checks are hitting ores.wmflabs.org and http://oresweb/scores/testwiki/reverted/${timestamp}/

Nginx uses a round-robin strategy by default and, after marking a host as failed, will try to bring it back gracefully once the fail_timeout period has elapsed.

If we want to know when nginx fails a host, then we'll have to monitor /var/log/nginx/error.log. The alternative is to have icinga make http requests directly against the web nodes (and probably keeping the lb request as well).
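As a rough sketch of the log-monitoring option, the snippet below counts error-level log lines per upstream address. The sample line is invented for illustration and only assumed to resemble nginx's usual `upstream: "..."` error format; the function name `failed_upstreams` is hypothetical, and mapping addresses back to node names is left out.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches error-level lines that name the upstream nginx was talking to.
UPSTREAM_RE = re.compile(r'\[error\].*?upstream: "([^"]+)"')


def failed_upstreams(lines):
    """Count error-level log lines per upstream host:port."""
    counts = Counter()
    for line in lines:
        match = UPSTREAM_RE.search(line)
        if match:
            counts[urlsplit(match.group(1)).netloc] += 1
    return counts


# Invented sample lines (assumption: shaped like typical nginx error.log entries).
SAMPLE = [
    '2016/05/18 13:16:00 [error] 1234#0: *42 connect() failed (111: Connection refused) '
    'while connecting to upstream, client: 10.0.0.1, server: ores.wmflabs.org, '
    'request: "GET /scores/ HTTP/1.1", upstream: "http://10.0.0.3:8080/scores/", '
    'host: "ores.wmflabs.org"',
    '2016/05/18 13:16:01 [info] 1234#0: *43 client closed connection',
]
print(failed_upstreams(SAMPLE))
```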

• schana edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. · May 20 2016, 11:16 AM

• schana moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board.

From within labs, you can ping a web node directly. E.g. http://ores-web-03:8080

This can only work from within labs.

You can find the hosts of the web nodes in hiera https://wikitech.wikimedia.org/wiki/Hiera:Ores

After speaking with @akosiaris, since the web nodes have private IP addresses in the labs environment, they can't be monitored directly by icinga.

He suggested monitoring the nginx status page. There are also modules for nginx like this that explicitly monitor upstream hosts.

@Halfak would having a path like /nginx_status present any problems to the other functionality? Would you want restrictions on the visibility of such a path?

+1 for something like /nginx_status

The stub status page only provides info like the following:

Active connections: 12
server accepts handled requests
155124 155124 222151
Reading: 0 Writing: 9 Waiting: 4

This isn't enough information to determine the health of the upstream nodes, and including additional modules specific to this task would require rolling a custom-compiled nginx (a bad idea).
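For reference, a small sketch that parses the stub status output quoted above. The function name `parse_stub_status` is hypothetical. The parsed result contains only aggregate connection counters and no per-upstream fields, which is the limitation being described.

```python
import re

SAMPLE = """Active connections: 12
server accepts handled requests
 155124 155124 222151
Reading: 0 Writing: 9 Waiting: 4
"""


def parse_stub_status(text):
    """Parse stub status text into a dict of integer counters."""
    active = re.search(r"Active connections:[ \t]*(\d+)", text)
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:[ \t]*(\d+)[ \t]+Writing:[ \t]*(\d+)[ \t]+Waiting:[ \t]*(\d+)", text)
    if not (active and counters and states):
        raise ValueError("unrecognised stub status output")
    accepts, handled, requests = (int(v) for v in counters.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


status = parse_stub_status(SAMPLE)
print(sorted(status))
```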

@Halfak do you have any more specific information about where/when there's missing info? We could look into the logs, see what's happening, and modify the system to provide visibility into the scenario.

do you have any more specific information about where/when there's missing info?

Sorry, I'm confused. The scenario is that a web node goes down. The reason shouldn't matter.

In T134782#2349237, @Halfak wrote:

Sorry, I'm confused. The scenario is that a web node goes down. The reason shouldn't matter.

As it's configured, Icinga doesn't seem to be a viable option for monitoring the health of individual web nodes. How are internal-to-labs services currently being monitored?

How are internal-to-labs services currently being monitored?

@yuvipanda or @Dzahn might be able to answer this question.

Depending on what you mean by 'internal-to-labs', the answer might be 'lol, we do not' to 'hope' to 'this graphite+shinken based complex thingy that is complicated'...

We can monitor something from prod Icinga if it has a public IP and we consider it a special "semi-prod" service or so. Ores is being monitored here:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ores.wmflabs.org&nostatusheader
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ores.wmflabs.org

In T134782#2353416, @yuvipanda wrote:

Depending on what you mean by 'internal-to-labs', the answer might be 'lol, we do not' to 'hope' to 'this graphite+shinken based complex thingy that is complicated'...

We want to know when an individual web node goes down from the perspective of the load balancer.

Currently, there is no clear way to do that.

@schana Any thoughts on what direction to take this?

@Halfak, to clarify, this task is referencing the labs instance, correct? How has the infrastructure changed for the production deployment?

Yes. This references the labs install.

Re. production, I don't know. @akosiaris?

lb rebalance does

In T134782#2399533, @Halfak wrote:

Yes. This references the labs install.

Re. production, I don't know. @akosiaris?

pybal takes care of marking problematic nodes as failed and not routing requests to them. pybal constantly issues health monitoring requests and decides to pool/depool hosts accordingly. See https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/lvs/configuration.yaml;b78d1bcde9ef96c9511ee672bf3e93f8ea15db4f$946 for the config stanza. Alerting on LVS will only happen if no server is able to service the request. Per-server alerting, however, happens normally. I doubt any of this is reusable in labs, though.

• schana renamed this task from [spike] Find out if we can still get health check warnings after lb rebalance to [spike] Find out if we can get health check warnings when a web node goes down from the load balancer's perspective. · Jun 23 2016, 8:32 AM

• schana updated the task description. (Show Details)

Here's a plan that I think would solve the problem as stated:

Add paths to the load balancer that allow specific web nodes to be hit directly:
Add a path that returns the current list of web nodes
Modify check_ores_workers to fetch the list of web nodes, query each, and return something based on matching an expected string
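The plan above could be sketched roughly as follows. This is only an illustration under assumptions: the `/node/<host>/` path layout and both function names are hypothetical, and the HTTP fetching itself is left out so that the logic stays self-contained. The exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
def node_check_paths(realservers, prefix="/node/"):
    """Map each 'host:port' entry to a per-node path on the load balancer."""
    paths = {}
    for realserver in realservers:
        host = realserver.split(":")[0]
        paths[host] = f"{prefix}{host}/"
    return paths


def summarize(statuses):
    """Turn {node: HTTP status or None} into (exit_code, message)."""
    down = sorted(node for node, status in statuses.items() if status != 200)
    if down:
        return 2, "CRITICAL: " + ", ".join(down) + " down"
    return 0, f"OK: all {len(statuses)} web nodes responding"


paths = node_check_paths(["ores-web-03:8080", "ores-web-05:8080"])
result = summarize({"ores-web-03": 200, "ores-web-05": None})
print(paths, result)
```

Nagios-compatible plugins report state through their exit code, with the message serving as human-readable detail, so an aggregate check like this would not depend on comparing output strings.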

Problems I have:

check_ores_workers is currently named poorly and this use case doesn't help that
I'm not sure how icinga would handle the resulting string mismatch (would it show up somewhere if "okay" != "ores-web-03:8080 down"?)
check_ores_workers isn't checking based on the hiera config; it's performing an additional query that shouldn't be necessary

@Halfak, thoughts?

@schana check_ores_workers is a wrapper around the standard nagios plugin check_http, the line is currently check_http -f follow -H $host -I $host -u "http://oresweb/scores/testwiki/reverted/${timestamp}/"

We can let check_http check for a specific string on the page but we don't currently do that. So it just checks if the connection works and the OK is just the output of the check itself. We have these options:

check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
       [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-a auth]
       [-b proxy_auth] [-f <ok|warning|critcal|follow|sticky|stickyport>]
       [-e <expect>] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
       [-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
       [-A string] [-k string] [-S] [--sni] [-C <age>] [-T <content-type>]
       [-j method]
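As a small illustration of the string-matching option, the sketch below assembles a check_http command line that adds `-s` from the usage listing above. The helper name `check_http_argv` and the expected string are hypothetical; nothing here reflects what check_ores_workers actually does beyond the command quoted earlier.

```python
import shlex

PLUGIN = "/usr/lib/nagios/plugins/check_http"


def check_http_argv(host, uri, expect_string=None):
    """Build the argv for check_http, optionally requiring a string in the page body."""
    argv = [PLUGIN, "-f", "follow", "-H", host, "-I", host, "-u", uri]
    if expect_string is not None:
        argv += ["-s", expect_string]  # -s: string to expect in the content
    return argv


argv = check_http_argv(
    "ores.wmflabs.org",
    "http://oresweb/scores/testwiki/reverted/1234/",
    expect_string="reverted",  # hypothetical expected string
)
print(shlex.join(argv))
```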

Change 296535 had a related patch set uploaded (by Nschaaf):
Check all ores web nodes

https://gerrit.wikimedia.org/r/296535

gerritbot added a project: Patch-For-Review. · Jun 29 2016, 10:09 AM

• schana mentioned this in rOPUPe7f8f81818dd: Check all ores web nodes. · Jun 29 2016, 10:15 AM

• schana mentioned this in rOPUP761a547bed33: Check all ores web nodes. · Jun 29 2016, 10:33 AM

• schana mentioned this in rOPUP34bbd992f758: Check all ores web nodes. · Jun 29 2016, 10:37 AM

• schana mentioned this in rOPUP054dd9f54f93: Check all ores web nodes. · Jun 29 2016, 10:46 AM

• schana mentioned this in rOPUP6a3065ff1378: Check all ores web nodes.

I've created a patch, but am unsure how to properly implement the following in an approved-by-ops way (that is, defining multiple things based off an array):

$realservers = hiera('role::labs::ores::lb::realservers')

define monitor_ores_labs_web_node ($realserver = $title) {
    $server_parts = split($realserver, ':')
    $server = $server_parts[0]
    monitoring::service { "ores_web_node_labs_${server}":
        description   => "ORES web node labs ${server}",
        check_command => "check_ores_workers!oresweb/${server}",
        host          => 'ores.wmflabs.org',
        contact_group => 'team-ores',
    }
}

monitor_ores_labs_web_node { $realservers: }

Dzahn mentioned this in rOPUP02fc80173de2: Check all ores web nodes. · Jun 29 2016, 9:49 PM

Dzahn mentioned this in rOPUPff26aec08540: Check all ores web nodes. · Jun 29 2016, 9:54 PM

@schana I have amended the patch to fix the jenkins-bot downvote and in response to Yuvi's comments. That part is fixed now, but it can't find the server names in hiera yet (at least when I run it in puppet compiler). Is that maybe just in labs hiera but needs to be added to production ./hieradata/ ?

In T134782#2416381, @Dzahn wrote:

Is that maybe just in labs hiera but needs to be added to production ./hieradata/ ?

I'm unfamiliar with how the hiera configuration works; all I know is I can view it here: https://wikitech.wikimedia.org/wiki/Hiera:Ores

@schana, see https://github.com/wikimedia/operations-puppet/tree/production/hieradata

@schana Yea, so the Hiera: namespace on wikitech is (one of the 2) places where you can add Hiera data in Labs, but for it to be in production, it has to be added in ./hieradata/ in operations/puppet along with your change.

• schana mentioned this in rOPUPa65ecce5a501: Check all ores web nodes. · Jun 30 2016, 7:54 PM

• schana mentioned this in rOPUP9fffd8152321: Check all ores web nodes. · Jul 1 2016, 10:24 AM

Halfak moved this task from Backlog to Review on the Machine-Learning-Team (Active Tasks) board. · Jul 1 2016, 2:43 PM

Danny_B mentioned this in T138380: Reading teams would like a tag to identify spikes. · Jul 1 2016, 2:57 PM

akosiaris mentioned this in rOPUPff2addf3579a: Check all ores web nodes. · Jul 1 2016, 3:18 PM

akosiaris mentioned this in rOPUPa020465cade7: Check all ores web nodes. · Jul 1 2016, 3:22 PM

Dzahn mentioned this in rOPUP74e051b40843: icinga: check all ores web nodes. · Jul 1 2016, 3:30 PM

Dzahn mentioned this in rOPUPa1ef08ce6cde: icinga: check all ores web nodes. · Jul 1 2016, 5:13 PM

Dzahn mentioned this in rOPUP94d04bdd1004: icinga: check all ores web nodes. · Jul 1 2016, 5:21 PM

Change 296535 merged by Dzahn:
icinga: check all ores web nodes

https://gerrit.wikimedia.org/r/296535

Dzahn mentioned this in rOPUP395a8cc3dcd5: icinga: check all ores web nodes. · Jul 1 2016, 5:25 PM

merged and watched on neon. has been added to icinga

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ores+web

just ..

HTTP WARNING: HTTP/1.1 404 NOT FOUND

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-03
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-05

so the full command we are actually running here, after unwrapping it, should be like:

/usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/node/ores-web-03/scores/testwiki/reverted/1234"

from:

$pluginpath/check_http -f follow -H $host -I $host -u "http://${urlhost}/scores/testwiki/reverted/${timestamp}/"

and that gets us the 404

I think nginx needs "^~" before the node location to properly match.

@schana bingo!

root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/node/ores-web-03"
HTTP WARNING: HTTP/1.1 404 NOT FOUND - 414 bytes in 0.056 second response time |time=0.056263s;;;0.000000 size=414B;;;0
root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/~node/ores-web-03"
HTTP WARNING: HTTP/1.1 404 NOT FOUND - 414 bytes in 0.029 second response time |time=0.029213s;;;0.000000 size=414B;;;0
root@neon:~# /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "http://oresweb/^~node/ores-web-03"
HTTP OK: HTTP/1.1 200 OK - 2950 bytes in 0.044 second response time |time=0.043526s;;;0.000000 size=2950B;;;0

Change 297115 had a related patch set uploaded (by Dzahn):
ores: fix-up web node monitoring

https://gerrit.wikimedia.org/r/297115

Dzahn mentioned this in rOPUP3cfb53ee246b: ores: fix-up web node monitoring. · Jul 2 2016, 12:50 AM

Dzahn mentioned this in rOPUPe28d921880ff: ores: fix-up web node monitoring.

Dzahn mentioned this in rOPUP98e4f3c2bd38: ores: fix-up web node monitoring.

Change 297115 merged by Dzahn:
ores: fix-up web node monitoring

https://gerrit.wikimedia.org/r/297115

Dzahn mentioned this in rOPUPe149abbdddd9: ores: fix-up web node monitoring. · Jul 2 2016, 12:55 AM

fixed :)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-03

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-05

@schana Thanks, you were exactly right about that extra ^~. It works now. Let me know if i closed this ticket too early and there was more involved.

@Dzahn, I was meaning for that to go in the nginx config location block to identify what type of matching nginx should perform on the path.

location ^~ /node/blah {
  ...
}
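To show what the `^~` modifier changes, here is a simplified model of how nginx picks a location block: exact matches first, then the longest prefix, then regexes in config order unless the winning prefix carries `^~`. It is a sketch, not nginx's code. The function name `select_location` and the regex location `/scores/` are hypothetical and not taken from the actual lb.nginx.erb.

```python
import re


def select_location(path, locations):
    """locations: list of (modifier, pattern) in config order.

    Supported modifiers: "=" exact, "" prefix, "^~" prefix that skips regexes,
    "~" regex, "~*" case-insensitive regex.
    """
    for modifier, pattern in locations:
        if modifier == "=" and path == pattern:
            return (modifier, pattern)
    best = None
    for modifier, pattern in locations:
        if modifier in ("", "^~") and path.startswith(pattern):
            if best is None or len(pattern) > len(best[1]):
                best = (modifier, pattern)
    if best is not None and best[0] == "^~":
        return best  # longest prefix has ^~, so regex locations are not checked
    for modifier, pattern in locations:
        if modifier in ("~", "~*"):
            flags = re.IGNORECASE if modifier == "~*" else 0
            if re.search(pattern, path, flags):
                return (modifier, pattern)  # first matching regex wins
    return best


path = "/node/ores-web-03/scores/testwiki/reverted/1234"
with_modifier = select_location(path, [("", "/"), ("^~", "/node/"), ("~", r"/scores/")])
without_modifier = select_location(path, [("", "/"), ("", "/node/"), ("~", r"/scores/")])
print(with_modifier, without_modifier)
```

In this model, a plain prefix location for `/node/` loses to any regex location that also matches the path, while the `^~` form keeps the request in the node block.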

• schana reopened this task as Open. · Jul 5 2016, 1:49 PM

yuvipanda unsubscribed. · Jul 5 2016, 1:56 PM

To expand a bit further, the following should proxy a request like http://domain/foo/some/path to http://otherdomain/some/path

location ^~ /foo/ {
  proxy_pass http://otherdomain/;
}
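The path rewriting described above can be modelled in a few lines. This is a sketch of the documented proxy_pass behaviour for prefix locations, with a hypothetical function name `upstream_url`: when proxy_pass includes a URI part (here the trailing `/`), the matched location prefix is replaced by it; when it has no URI part, the request path is passed through unchanged.

```python
from urllib.parse import urlsplit


def upstream_url(request_path, location_prefix, proxy_pass):
    """Return the URL a prefix location would proxy to, or None if it does not match."""
    if not request_path.startswith(location_prefix):
        return None
    if urlsplit(proxy_pass).path:
        # proxy_pass has a URI part: it replaces the matched location prefix.
        return proxy_pass + request_path[len(location_prefix):]
    # No URI part: the original request path is forwarded as-is.
    return proxy_pass + request_path


rewritten = upstream_url("/foo/some/path", "/foo/", "http://otherdomain/")
passthrough = upstream_url("/foo/some/path", "/foo/", "http://otherdomain")
print(rewritten, passthrough)
```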

We might want to simulate some downtime to make sure that we get pings. Is there a better way?

Should this work? https://ores.wmflabs.org/node/ores-web-03/

We should bring back ores-web-04, repool, bring down -05 and wait for error notification.

Currently it looks like it just works when you ask ores.wmflabs.org for a URL like "http://oresweb/^~node/ores-web-03".

https://ores.wmflabs.org/%5E~node/ores-web-03/ returns a 404. Should it not?

I don't know if it should or not, i can just confirm that it is only a 200 if you ask for "http://oresweb/%5E~node/ores-web-03%22" like the icinga check does

Change 297599 had a related patch set uploaded (by Nschaaf):
Change path to proxy node requests

https://gerrit.wikimedia.org/r/297599

• schana mentioned this in rOPUPc27a46888670: Change path to proxy node requests. · Jul 6 2016, 2:45 PM

Change 297599 merged by Dzahn:
Change path to proxy node requests

https://gerrit.wikimedia.org/r/297599

Dzahn mentioned this in rOPUPf86af8305808: Change path to proxy node requests. · Jul 6 2016, 6:44 PM

https://ores.wmflabs.org/node/ores-web-03/ and https://ores.wmflabs.org/node/ores-web-05/ now work

Yes, and the checks also look good in Icinga. If you want to we can do that test and break something to confirm it triggers the check.

Fun story, we actually had a minor amount of downtime on ores-web-03 during a puppet run today and it reported the issues and recovery just fine. So I think we can declare victory here.

Alternatively, we have ores-web-04 depooled right now. We could repool it, shut down -03 and let the warning go off.

Confirmed, i have seen that minor outage too.

And we have history here to prove it.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ores.wmflabs.org&service=ORES+web+node+labs+ores-web-03

So i think we can call it resolved indeed.

Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. · Jul 6 2016, 10:56 PM

@schana do you agree it's resolved?

In T134782#2435583, @Dzahn wrote:

@schana do you agree it's resolved?

Yes, until there's a better solution (be it healthcheck-specific URLs or whatever).

akosiaris closed this task as Resolved. · Jul 7 2016, 9:41 AM

Sorry for re-opening. #revision-scoring-as-a-service tasks stay open until the end of the weekly meeting, and then we close them all together (after the weekly update). Sorry for the confusion.

Ladsgroup closed this task as Resolved. · Jul 11 2016, 7:42 PM
