
Set up monitoring automation for services
Closed, Resolved · Public

Description

Right now, the only way to monitor the health of any service is to manually write a new check that gets added to icinga via puppet, and we need to do that for each endpoint, for every service. We want a general, simple way of automating this process.

Ideally:

  • Developers should be able to programmatically expose what they want to be monitored and what the expected responses are
  • Said resources should be monitored explicitly and independently, without the need for any ops intervention or puppet modification

The way to do this is to have all services automatically expose a special endpoint, like /_monitor, with a simple JSON description of all relevant endpoints: which call we should make to monitor each of them, what payload to use, and what response to expect.

A possible format for the JSON could be:

{
  "/path/to/endpoint": {
    "method": "GET",
    "request": {
      "headers": {
        "header1": "value1",
        "headerN": "valueN"
      },
      "body": {}
    },
    "response": {
      "status": 200,
      "headers": {
        "header1": "value1",
        ...
      },
      "body": "/foo.*bar$/"
    }
  },
  ...
}

All of the fields are optional: by default, the monitoring system will check the URL with a GET request and expect a 200 response code.
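For example, an endpoint that should simply answer a plain GET with a 200 could then be described with an empty entry, relying entirely on those defaults:

{
  "/path/to/endpoint": {}
}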

Event Timeline

Joe created this task. Apr 2 2015, 10:27 AM
Joe raised the priority of this task to Needs Triage.
Joe updated the task description.
Joe added a subscriber: Joe.
Restricted Application added a subscriber: Aklapper. Apr 2 2015, 10:27 AM
mobrovac updated the task description. Apr 2 2015, 11:46 AM
mobrovac set Security to None.

Aklapper added a comment.

@Joe: Please assign a project to this task.

Joe added a comment. Apr 2 2015, 1:17 PM

@Aklapper I wanted Marko to be able to edit it before we officially made this public. Doing it now.

mobrovac moved this task from Backlog to In-progress on the Services board. Apr 2 2015, 1:31 PM
GWicke added a subscriber: GWicke. Edited Apr 2 2015, 3:20 PM

We could consider directly using the test/example request/response pairs from the swagger spec. An attribute could mark an end point as 'this should be monitored'. All we need to expose in that case is the spec itself, as for example from rest.wikimedia.org.

For the aggregate service we also have the typical request metrics on latency & response codes, along with service-specific ones that we should use for monitoring. The interesting part there is setting the right thresholds, ideally without having to touch puppet.

Andrew triaged this task as Normal priority. Apr 6 2015, 4:27 PM

mobrovac added a comment.

> We could consider directly using the test/example request/response pairs from the swagger spec. An attribute could mark an end point as 'this should be monitored'. All we need to expose in that case is the spec itself, as for example from rest.wikimedia.org.

Ideally, yes, I'm all for that approach. Sadly, however, at the moment service-template-node is not spec-driven. We should definitely change that, but for the time being, having /_monitor exposed should suffice. This is internal to service-template-node anyway, and the way it obtains the specification does not really affect the overall approach here.

> For the aggregate service we also have the typical request metrics on latency & response codes, along with service-specific ones that we should use for monitoring. The interesting part there is setting the right thresholds, ideally without having to touch puppet.

Oh yeah, that's going to be fun to determine for sure :)

Regarding the monitoring check responses, I think that for the first iteration we should make it as flexible as possible. For example,

response: {
  status: 200,
  headers: {
    header1: "val1",
    header2: "/val2$/"
  },
  body: {
    field1: "/.+/"
  }
}

would mean that the response headers should contain header1 and header2 and match their respective values, while the body must have a non-empty field1 field. In this context, I think we should allow other data to be received as well, i.e. at least header1 and header2 must be present; if there are others, we ignore them. Or perhaps issue a warning or notice or something? The same goes for the body.

Also, it'd be nice to have an automatic content-type check to the extent possible:

  • text/html should contain some valid HTML tags
  • application/json should have a valid, parse-able body
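For illustration, here is a rough Python sketch of the matching semantics described above (the function names are made up; this is not the actual implementation): values wrapped in slashes are treated as regular expressions, anything else as an exact match, and extra headers or body fields are simply ignored.

import json
import re

def value_matches(pattern, value):
    # Values wrapped in slashes (e.g. "/val2$/") are regexes; anything else
    # has to match exactly.
    if isinstance(pattern, str) and len(pattern) > 1 \
            and pattern.startswith('/') and pattern.endswith('/'):
        return re.search(pattern[1:-1], str(value)) is not None
    return pattern == value

def check_response(expected, status, headers, body):
    # `expected` is the response stanza from the spec; `body` is the parsed
    # JSON body of the actual response. Anything not listed in `expected`
    # is ignored, per the "at least these must be present" semantics above.
    if expected.get('status', 200) != status:
        return False
    for name, pattern in expected.get('headers', {}).items():
        if name not in headers or not value_matches(pattern, headers[name]):
            return False
    for field, pattern in expected.get('body', {}).items():
        if field not in body or not value_matches(pattern, body[field]):
            return False
    return True

def check_content_type(headers, raw_body):
    # Automatic sanity check based on the declared content type.
    ctype = headers.get('content-type', '')
    if ctype.startswith('application/json'):
        try:
            json.loads(raw_body)
        except ValueError:
            return False
    if ctype.startswith('text/html'):
        return '<html' in raw_body.lower()
    return True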
GWicke added a comment. Edited Apr 14 2015, 3:02 PM

@mobrovac, the service template does not need to be fully spec-driven to be able to expose a simple spec for monitoring purposes. A minimal spec can look like this:

swagger: 2.0
paths:
  /{foo}:
    get:
      x-amples:
      - request:
          uri: /Sometitle
          headers: ...
        response:
          status: 200
          headers: ...
          body: ....
        monitor: true

mobrovac added a comment.

@GWicke I am not sure we are talking about the same thing any more. This ticket is about the contract between the monitoring utility and the service, while I have a feeling you are talking about the contract between the service and its developers.

In the former case, exposing a full swagger spec to the monitoring utility looks like overkill to me, given that the spec mentioned here can be easily generated by the service from a full swagger spec. That also simplifies things in that it does not dictate the contract between the service and its developer(s) (generating a full swagger spec from a quasi-spec is much harder than the other way around).
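To illustrate that last point, a hypothetical sketch (the function name is illustrative, not an existing API) of how a service could boil a full swagger spec down to just the monitored request/response pairs it would serve from /_monitor:

def extract_monitoring_spec(swagger_spec):
    # Keep only the x-amples entries flagged for monitoring.
    monitored = {}
    for path, methods in swagger_spec.get('paths', {}).items():
        for method, desc in methods.items():
            examples = [e for e in desc.get('x-amples', [])
                        if e.get('monitor', True)]
            if examples:
                monitored.setdefault(path, {})[method] = examples
    return monitored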

GWicke added a comment. Edited Apr 14 2015, 3:28 PM

@mobrovac, I think we are talking about the same thing. An API spec documents (relevant properties of) an API. In this case we are interested primarily in end points that need to be monitored, along with request/response pairs. We have a way to express that in an API spec, and I think it's worth doing so without reinventing the wheel. The benefit is that we should be able to avoid unnecessary effort and complexity down the road. I think it's better to add five lines to a single monitoring utility to let it directly process specs than write custom extraction code in each service to produce the special monitoring spec you are envisioning here.

mobrovac added a comment.

After sleeping on it, I realised it's just a matter of format and we can go either way. I still think we should expose a proper endpoint for this (either specifically for monitoring or the complete spec itself), something like /_spec or /_monitor. Or maybe simply make the endpoint configurable? Also, if we go with the complete spec, I think endpoints should be monitored by default, unless otherwise specified.

@Joe thoughts?

Joe added a comment. Edited Apr 15 2015, 7:25 AM

Having an endpoint exposing this gives us a lot of flexibility/autodiscovery ability that does NOT depend on people using swagger/spec to define what we should monitor. After all, I think it's the app's duty to expose a usable API to its clients. I also agree with @mobrovac about this being a contract between the devs and the infrastructure they're working on.

But in the end, I just care about having that info exposed via an endpoint in a defined format, so using swagger is nice as well, although I prefer to do as little processing on the monitoring host as possible.

swagger: 2.0
info:
  version:
  title:
x-default-params:
  title: Foobar
paths:
  /page:
    get:
      tags: [a, b, c]
      x-monitor: false
  /page/html/{title}{/revision}:
    get:
      tags: [d, e, f]
      x-amples:
        - request:
            params:
              revision: 12345
          response:
            status: 200
            headers:
              etag: /.+/
              content-type: /text\/html/
            body: /<html>.+<body>/
        - request:
            headers:
              cache-control: no-cache
          response:
            status: 200
            headers:
              etag: /.+/
              content-type: /text\/html/
            body: /<html>.+<body>/
        - request:
            params:
              title: Barack Obama
            headers:
              cache-control: no-cache
              accept-encoding: gzip
          response:
            status: 200
            headers:
              etag: /.+/
              content-type: /text\/html/
              content-encoding: /gzip/

After receiving this specification from host:port/configurable_path/_monitor, the monitoring script would need to check the following paths:

  • host:port/configurable_path/page/html/Foobar/12345, because the default title is set to Foobar in x-default-params, while the revision is given in the request's params stanza
  • host:port/configurable_path/page/html/Foobar with a cache-control: no-cache header
  • host:port/configurable_path/page/html/Barack%20Obama (with no-cache and gzip headers); note here that the title parameter has been given in the request, so it supersedes the default one, and also that it needs to be URL-encoded

Would something like that work for you guys?
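For what it's worth, a minimal Python sketch of that expansion, assuming a hypothetical build_check_urls helper (this is not the actual monitoring script): x-default-params is merged with each example's params, values are URL-encoded, and optional path segments without a value are dropped.

import re
from urllib.parse import quote

def build_check_urls(spec, base_uri):
    # Expand every monitored x-amples entry into a concrete URL to check.
    defaults = spec.get('x-default-params', {})
    urls = []
    for path, methods in spec.get('paths', {}).items():
        for method, desc in methods.items():
            if desc.get('x-monitor') is False:
                continue
            for example in desc.get('x-amples', []):
                # Example params override the spec-wide defaults.
                params = {**defaults,
                          **example.get('request', {}).get('params', {})}
                url = path
                for name, value in params.items():
                    # Substitute {param} and optional {/param} segments,
                    # URL-encoding the value ("Barack Obama" -> "Barack%20Obama").
                    url = url.replace('{' + name + '}', quote(str(value)))
                    url = url.replace('{/' + name + '}', '/' + quote(str(value)))
                # Drop optional segments for which no value was supplied.
                url = re.sub(r'\{/[^}]+\}', '', url)
                urls.append(base_uri + url)
    return urls

Applied to the spec above with base_uri set to host:port/configurable_path, this yields exactly the three URLs listed; the headers stanza of each example would be sent verbatim with the request.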

Restricted Application added a subscriber: Matanya. Jul 6 2015, 12:49 PM

Change 223328 had a related patch set uploaded (by Giuseppe Lavagetto):
service::node: auto-monitoring of local endpoints [WiP]

https://gerrit.wikimedia.org/r/223328

Change 223328 merged by Giuseppe Lavagetto:
service::node: auto-monitoring of local endpoints

https://gerrit.wikimedia.org/r/223328

fgiunchedi assigned this task to Joe. Jul 20 2015, 10:51 AM
fgiunchedi added a subscriber: fgiunchedi.

Change 226527 had a related patch set uploaded (by Giuseppe Lavagetto):
graphoid: enable spec-based monitoring

https://gerrit.wikimedia.org/r/226527

Change 226527 merged by Giuseppe Lavagetto:
graphoid: enable spec-based monitoring

https://gerrit.wikimedia.org/r/226527

Joe lowered the priority of this task from Normal to Lowest. Mar 16 2016, 2:55 PM

GWicke added a comment.

This has been implemented with checker.py, and is used for pretty much all services, both per-node and for LVS endpoints (see T134551).

@Joe, is there anything left to be done here, or should we close this task?

Joe added a comment. Jul 28 2016, 10:06 AM

@GWicke nothing left now, as we've moved service-checker out into an external repository.

Joe closed this task as Resolved. Jul 28 2016, 10:06 AM