Scap3 should support post-deploy checks
Closed, Resolved · Public

Description

Post-deploy, scap3 should be able to verify a successful deploy with a series of checks.

These checks should include:

  • Port check
  • HTTP request return checks
  • Logs/metrics checks
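The first two kinds of check could be sketched roughly like this (the function names and parameters here are illustrative, not scap's actual API):

```python
import socket
from urllib.request import urlopen
from urllib.error import URLError


def check_port(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False


def check_http(url, expected_status=200, timeout=5):
    """Return True if a GET on url returns the expected HTTP status code."""
    try:
        return urlopen(url, timeout=timeout).getcode() == expected_status
    except URLError:
        # Covers connection failures and non-2xx responses (HTTPError).
        return False
```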
thcipriani updated the task description.
thcipriani raised the priority of this task to Needs Triage.
thcipriani added a project: Deployment-Systems.
thcipriani added subscribers: mmodell, demon, dduvall and 4 others.
mmodell triaged this task as High priority.

For simple checks we can use this:

01:04 <_joe_> https://gerrit.wikimedia.org/r/#/c/231790/5/modules/service/templates/deployment_script.sh.erb at line 110 we do that exactly
01:04 <_joe_> (this is a smallish script to help deploys of rb/services while we don't get something serious)
01:06 <_joe_> where /usr/local/lib/nagios/plugins/service_checker is this https://github.com/wikimedia/operations-puppet/blob/production/modules/service/files/checker.py

Also, I've made a bit of progress on a graphite monitor class that can watch an error rate metric and raise an alert if the error rate crosses a threshold. I haven't worked out how to define the thresholds yet, but the code for reading values from graphite is probably useful. For background, see the discussion on T93428: Streamline our service development and deployment process (specifically, T93428#1556189).

# -*- coding: utf-8 -*-
"""
    scap.monitor
    ~~~~~~~~~~~~
    This module provides classes and utilities used by scap to monitor
    services, logs or metrics for anomalous events.

"""
import json
import urllib


class Monitor(object):
    def __init__(self):
        pass


class HttpMonitor(Monitor):
    def get(self, url, params=""):
        if isinstance(params, dict):
            params = urllib.urlencode(params)
        f = urllib.urlopen("%s?%s" % (url, params))
        data = f.read()
        f.close()
        return data


class ErrorRateMetric(HttpMonitor):
    def __init__(self, metric):
        self.metric = metric

    def check(self):
        params = {
            "from": "-30minutes",
            "target": self.metric,
            "format": "json"
        }
        data = self.get("http://graphite.wikimedia.org/render/", params)
        data = json.loads(data)
        maxVal = 0
        sumVal = 0
        lastVal = 0
        data = data[0].get("datapoints")
        for i in data:
            maxVal = max(maxVal, i[0])
            sumVal += i[0]
            lastVal = i[0]
        avgVal = sumVal / len(data)
        print("max: %s / average: %s last: %s" % (maxVal, avgVal, lastVal))
        return data


def main():
    mon = ErrorRateMetric("transformNull(restbase.v1_page_html_-title-_-revision--_tid-.GET.5xx.sample_rate,0)")
    data = mon.check()
    print(data)


if __name__ == '__main__':
    main()

mobrovac added a subscriber: Joe. Aug 20 2015, 10:25 AM

Great stuff @mmodell!

> I haven't worked out how to define the thresholds yet but the code for reading values from graphite is probably useful.

@Joe points to their nagios script in T93428#1556326 .

What could be useful is to gather some metrics/logs before the actual deploy (say, in a one-minute window or the like), do the deploy, and then gather data a while afterwards and compare the results. In a first iteration, even just presenting the results to the deployer and letting them decide whether to continue or roll back could be enough. This would probably not be too time-consuming if we apply it to the canary nodes only.

> What could be useful is to gather some metrics/logs before the actual deploy (say, in a one-minute window or the like), do the deploy, and then gather data a while afterwards and compare the results. In a first iteration, even just presenting the results to the deployer and letting them decide whether to continue or roll back could be enough. This would probably not be too time-consuming if we apply it to the canary nodes only.

That sounds about like what I had in mind. At least with graphite metrics we can quickly compare the time series data; we could even continuously monitor several metrics during a deploy and give immediate feedback if any of them appear to be headed in the wrong direction.

> we could even continuously monitor several metrics during a deploy and give immediate feedback if any of them appear to be headed in the wrong direction.

Yes, +1e3 to that. We'll need to devise a heuristic for the "headed in the wrong direction" part.

Note that the same could be done for log entries.

> Yes, +1e3 to that. We'll need to devise a heuristic for the "headed in the wrong direction" part.
>
> Note that the same could be done for log entries.

Yeah, I'm still trying to come up with the right heuristic, and I'm open to suggestions if you think of anything ;)
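One possible "headed in the wrong direction" heuristic, purely as a starting point (the ratio and absolute-floor thresholds below are illustrative assumptions, not a decided policy):

```python
def looks_anomalous(before, after, ratio=2.0, min_rate=1.0):
    """Compare mean error rates before and after a deploy.

    `before` and `after` are lists of numeric samples (e.g. 5xx rates
    pulled from graphite). The deploy is flagged only if the post-deploy
    mean is both above an absolute floor (to ignore noise near zero) and
    more than `ratio` times the pre-deploy mean.
    """
    if not after:
        return False
    pre = sum(before) / len(before) if before else 0.0
    post = sum(after) / len(after)
    return post >= min_rate and post > pre * ratio
```

The absolute floor matters: a jump from 0.01 to 0.05 errors/s is a 5x increase but almost certainly not worth a rollback.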

dduvall moved this task from Needs triage to Services MVP on the Scap board. Aug 21 2015, 5:09 PM
thcipriani set Security to None.
mmodell added a comment (edited). Aug 22 2015, 5:02 AM

@Joe: I like the idea of deferring to the existing icinga check-scripts. Do those scripts get installed globally, or should I do some puppeteering to make sure that the appropriate scripts get installed on deployment targets? Essentially scap3 will have a dependency on the check scripts, so I just want to figure out how to declare that dependency and be sure that it will be satisfied.

Seeking feedback to be sure I do it in a way that Operations will be happy with ;)

mmodell added a subscriber: bd808. Aug 22 2015, 6:37 AM

So, for querying logstash, we need LDAP credentials. Anyone have any suggestions about how we should handle that? I don't really want to force deployers to store their individual LDAP credentials in the clear, or force them to enter a username/password for every deploy. Can we set up a shared LDAP account that would grant access to logstash?

@bd808, do you have any suggestions?

bd808 added a comment (edited). Aug 22 2015, 8:02 PM

> So, for querying logstash, we need LDAP credentials.

HTTP Basic Auth is only needed for accessing via http://logstash.wikimedia.org. From inside the cluster you can talk directly to the backing Elasticsearch cluster on port 9200.

You can look at https://github.com/bd808/ggml to get an idea of how searching the log records directly could work.
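A direct Elasticsearch query from inside the cluster could look roughly like the sketch below. The `level` field name, `logstash-*` index pattern, and endpoint URL are assumptions about the logstash schema, not confirmed details:

```python
import json
from urllib.request import Request, urlopen


def error_query(minutes=5, level="ERROR"):
    """Build an Elasticsearch bool query matching recent error-level events.

    Field names ("level", "@timestamp") are assumed; adjust to the actual
    logstash mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": level}},
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}},
                ]
            }
        }
    }


def count_recent_errors(es_url="http://localhost:9200", minutes=5):
    """POST the query to the _count API and return the matching doc count.

    From inside the cluster this talks to port 9200 directly, so no LDAP
    credentials are needed.
    """
    req = Request(
        "%s/logstash-*/_count" % es_url,
        data=json.dumps(error_query(minutes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())["count"]
```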

@bd808: awesome, that's perfect, thanks!

mmodell moved this task from Services MVP to Experiments on the Scap board. Aug 29 2015, 6:38 AM

Change 238391 had a related patch set uploaded (by 20after4):
Beginnings of some scap3 documentation

https://gerrit.wikimedia.org/r/238391

dduvall added a comment (edited). Sep 30 2015, 4:07 PM

Moving IRC/email conversation here:

@mmodell wrote:

> So in puppet the monitoring is configured by directly specifying the commands, which are all in /usr/local/lib/nagios/plugins/

> Restbase simply does an HTTP request, using this command line:

> /usr/local/lib/nagios/plugins/service_checker -t 5 127.0.0.1

So does it make more sense to call the plugin scripts directly, or to parse the fully formed commands from /etc/nagios/nrpe.d/*.cfg? Doing the former seems more straightforward, but the latter would avoid the need to duplicate check specific configuration and map it to positional command arguments.
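For the "call the plugin scripts directly" option, the runner mostly just has to map nagios plugin exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) to a status. A minimal sketch (the `run_check` helper is hypothetical, not scap's API):

```python
import subprocess

# Standard nagios plugin exit-code convention.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(cmdline, timeout=30):
    """Run a nagios-style plugin command line and return (status, output).

    The exit code is mapped to the conventional status name; anything
    outside 0-3 is treated as UNKNOWN.
    """
    proc = subprocess.run(cmdline, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    status = NAGIOS_STATUS.get(proc.returncode, "UNKNOWN")
    return status, proc.stdout.strip()
```

For example, `run_check("/usr/local/lib/nagios/plugins/service_checker -t 5 127.0.0.1")` would yield `("OK", ...)` on a healthy service.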

@mmodell wrote:

> So maybe we should go with YAML for the checks, either embedded in the scap.cfg or otherwise.
> For example:

checks:
  - name: restbase_endpoint
    plugin: service_checker
    timeout: 5
    host: 127.0.0.1
    url: http://127.0.0.1:7231/en.wikipedia.org/v1
    where: target
    stage: check
  - name: restbase_anomaly
    plugin: check_graphite_anomaly
    where: host
    metric: reqstats.5xx
    stage: all
  - name: restbase_badwords
    where: host
    plugin: logstash_query
    query: some-logstash-query

I really like this syntax, and I think defining checks in a separate YAML file rather than crowding the cfg actually makes the most sense.

Since the current structure of scap.cfg is geared more toward specifying realm-specific options, maybe it makes sense to have an option for specifying which checks to enable per realm.

[global]
checks=restbase_endpoint,restbase_badwords

[wmfnet]
checks=restbase_endpoint,restbase_badwords,restbase_anomaly
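The per-realm option could be resolved with a small helper along these lines; the fall-back-to-[global] behavior here is an assumption, not decided semantics:

```python
from configparser import ConfigParser


def enabled_checks(cfg_text, realm):
    """Return the list of enabled check names for a realm.

    If the realm section doesn't define a `checks` option (or doesn't
    exist), fall back to the [global] section.
    """
    cfg = ConfigParser()
    cfg.read_string(cfg_text)
    section = realm if cfg.has_option(realm, "checks") else "global"
    return [name.strip() for name in cfg.get(section, "checks").split(",")]
```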

> So does it make more sense to call the plugin scripts directly, or to parse the fully formed commands from /etc/nagios/nrpe.d/*.cfg? Doing the former seems more straightforward, but the latter would avoid the need to duplicate check specific configuration and map it to positional command arguments.

+1 for the latter: just get the name out of the nagios config and let the user opt in to its execution.

@mmodell wrote:

> So maybe we should go with YAML for the checks, either embedded in the scap.cfg or otherwise.
> For example: P2120

Nice! How about making that a hash/dict to avoid name conflicts? Also, perhaps it'd be worth having check groups in there as well, so that users can reference a group of checks to execute. That would also improve readability, as checks could be grouped by stage/target/etc. (whichever way feels more natural to the user).

> I really like this syntax, and I think defining checks in a separate YAML file rather than crowding the cfg actually makes the most sense.
>
> Since the current structure of scap.cfg is geared more toward specifying realm-specific options, maybe it makes sense to have an option for specifying which checks to enable per realm.
>
> [global]
> checks=restbase_endpoint,restbase_badwords
>
> [wmfnet]
> checks=restbase_endpoint,restbase_badwords,restbase_anomaly

LGTM.

@dduvall: I'll write an nrpe config loader so that checks in nrpe.d/*.cfg will be available without being explicitly configured; you can just reference them by name.
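Such a loader mostly has to parse nrpe-style `command[name]=cmdline` lines. A rough sketch of that parsing step (the helper name is hypothetical):

```python
import re

# Matches nrpe definitions of the form: command[check_name]=/path/to/plugin args
NRPE_LINE = re.compile(r"^command\[([^\]]+)\]\s*=\s*(.+)$")


def parse_nrpe(text):
    """Parse nrpe `command[name]=cmdline` definitions into a dict.

    Maps each check name to its full command line; blank lines and
    `#` comments are skipped.
    """
    checks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = NRPE_LINE.match(line)
        if match:
            checks[match.group(1)] = match.group(2)
    return checks
```

A loader built on this could glob /etc/nagios/nrpe.d/*.cfg and merge the resulting dicts, making every defined check referenceable by name.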

@mmodell, perfect. I'll create a blocking task for it and get started on the basic config/hook framework.

dduvall claimed this task. Sep 30 2015, 6:20 PM
dduvall moved this task from Experiments to Services MVP on the Scap board.
dduvall moved this task from Services MVP to Done on the Scap board. Oct 9 2015, 6:02 PM
dduvall closed this task as Resolved.