
Basic scap{2,3} canary deployment process & checks
Closed, ResolvedPublic2 Estimated Story Points

Description

After T109515, I would like to add the option to continuously monitor a configurable logstash query, and alert if error frequency crosses a threshold.

Another possibility would be the option to watch for specific messages and trigger an action if a specific (configurable) pattern is seen in the logs.

Event Timeline

mmodell claimed this task.
mmodell raised the priority of this task from to Medium.
mmodell updated the task description.
mmodell added projects: Scap, Deployments.

While this has been moved into the MediaWiki MVP column, I'd like to voice that services would benefit from this too; in fact, this has already been outlined in T103344: RESTBase deployment process.

mmodell lowered the priority of this task from Medium to Low. Mar 6 2016, 2:23 AM
mmodell moved this task from Needs Triage to Debt on the Scap (Scap3-MediaWiki-MVP) board.
mmodell changed the point value for this task from 1 to 2.

Following an IRC discussion after a deploy incident, I'm wondering what it would take to get a very minimal canary process going, which could be hooked up in scap 2 & expanded for scap 3.

Here is a first stab:

  1. Deploy the new code to a test host, like mw1017.
  2. Run a few basic test requests against that host (possibly using the service-tester tool).
  3. Deploy the new code to a canary host that sees normal traffic.
  4. Monitor logs from this host for a few minutes (logstash? HHVM local logs? graphite?).
  5. If this looks okay, proceed with the roll-out to all hosts.

The biggest question mark to me is about the log / metric checking in 4). Can we construct useful logstash queries for this? Are there other ways to check per-host HHVM logs / metrics for errors?

I think we should use a server that is receiving production traffic. That way we can just watch error rates and abort if there is a jump in 5xx errors. This seems like it would catch a wider variety of errors than some synthetic test requests.

Looking at network traffic in Kibana, querying logstash should be possible with something like this:

curl -d '{"facets":{"terms":{"terms":{"field":"_type","size":20,"order":"count","exclude":[]},"facet_filter":{"fquery":{"query":{"filtered":{"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1464831059852,"to":1464831359852}}},{"fquery":{"query":{"query_string":{"query":"host:(\"mw1017\")"}},"_cache":true}}]}}}}}}}},"size":0}' https://logstash.wikimedia.org/logstash-2016.06.02/_search

This request normally requires authentication. Are there ways to get around that from within the cluster? /cc @bd808

Edit: On IRC, @bd808 mentions that access from within the cluster is currently firewalled, but we could open it up for the deploy hosts. Alternatively, the normal HTTP auth method is also reasonably easy to use from scripts. We'd need to set up a user in that case. Another option for those with access is to test queries locally on logstash1001, port 9200.

Opening up the ferm rules to allow direct port 9200 traffic from tin, terbium, mira, and fluorine would be pretty easy. The HTTP Basic auth is backed by LDAP and requires group membership that would open up just about anything that we protect with LDAP auth.

Now we need to come up with a reasonable query, a filter, and a Python algorithm to surface relevant errors while suppressing logspam. Our hand-curated Kibana dashboards are already a good starting place. What I would really like to show in the terminal is a simple text-mode error-rate graph along with a scrolling window of log output (perhaps something like https://zachholman.com/spark/).
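
The spark-style graph itself is only a few lines of Python. A minimal sketch (the per-interval error counts are assumed to come from a logstash date-histogram query like the ones discussed in this task):

SPARK_CHARS = "▁▂▃▄▅▆▇█"

def sparkline(counts):
    """Render a list of per-interval error counts as a one-line bar graph."""
    if not counts:
        return ""
    lo, hi = min(counts), max(counts)
    span = float(hi - lo) or 1.0
    return "".join(
        SPARK_CHARS[int((c - lo) / span * (len(SPARK_CHARS) - 1))]
        for c in counts
    )

# e.g. error counts per 10s bucket around a deploy
print(sparkline([1, 1, 2, 1, 3, 8, 13, 12, 9, 4]))  # prints ▁▁▁▁▂▅█▇▅▂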

Meta comment:

  1. Deploy the new code to a test host, like mw1017.
  2. Run a few basic test requests against that host (possibly using the service-tester tool).
  3. Deploy the new code to a canary host that sees normal traffic.
  4. Monitor logs from this host for a few minutes (logstash? HHVM local logs? graphite?).
  5. If this looks okay, proceed with the roll-out to all hosts.

I tried to capture points 1 and 2 in T136839.

3 and 4 seem to be covered here.

Using T131120 as a tracking task for now.

The Kibana histogram query looks like a good candidate for detecting error frequency trends. Here is a sample query matching mediawiki messages from a specific host, and excluding warnings & info level messages, extracted from this Kibana dashboard:

curl -u<user>:<pass> -d '{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"10s"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"*"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":"now-10m","to":"now"}}},{"terms":{"_type":["mediawiki"]}},{"fquery":{"query":{"query_string":{"query":"host:(\"mw1167\")"}},"_cache":true}}],"must_not":[{"fquery":{"query":{"query_string":{"query":"level:(\"INFO\")"}},"_cache":true}},{"fquery":{"query":{"query_string":{"query":"level:(\"WARNING\")"}},"_cache":true}}]}}}}}}}},"size":20,"query":{"filtered":{"query":{"query_string":{"query":"type:scap AND (channel.raw:scap.announce OR message:\"Started sync_wikiversions\")"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":"now","to":"now"}}}]}}}},"sort":[{"@timestamp":{"order":"desc","ignore_unmapped":true}},{"@timestamp":{"order":"desc","ignore_unmapped":true}}]}' https://logstash.wikimedia.org/logstash-2016.06.02/_search

Result:

{  
   "took":253,
   "timed_out":false,
   "_shards":{  
      "total":1,
      "successful":1,
      "failed":0
   },
   "hits":{  
      "total":0,
      "max_score":null,
      "hits":[  

      ]
   },
   "facets":{  
      "0":{  
         "_type":"date_histogram",
         "entries":[  
            {  
               "time":1464886340000,
               "count":1
            },
            {  
               "time":1464886370000,
               "count":1
            },
            .....
            {  
               "time":1464887210000,
               "count":2
            }
         ]
      }
   }
}

An error check for the live-traffic canary could look like this (a rough sketch in code follows the list):

  • After the live-traffic canary deploy, wait for some time, let's say two minutes.
  • Select the last 5-10 minutes worth of data for the canary host.
  • Calculate the mean error rate before the deploy & after the deploy.
  • Make a decision based on a threshold. Example: Consider the deploy a potential failure if the error rate increased more than 2x.
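
As a rough sketch of that comparison (the entries format follows the date_histogram output above; the 2x threshold and the deploy timestamp handling are assumptions):

def mean_rate(entries):
    """Average error count per histogram bucket."""
    if not entries:
        return 0.0
    return sum(e["count"] for e in entries) / float(len(entries))

def canary_ok(entries, deploy_time_ms, threshold=2.0):
    """True if the post-deploy error rate stayed under threshold * pre-deploy rate.

    entries: list of {"time": <epoch ms>, "count": <errors>} buckets,
    as returned by the date_histogram query shown above.
    """
    before = [e for e in entries if e["time"] < deploy_time_ms]
    after = [e for e in entries if e["time"] >= deploy_time_ms]
    before_rate = mean_rate(before) or 1.0  # guard against a silent pre-deploy window
    return mean_rate(after) / before_rate < threshold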

@GWicke: Indeed, that is also something that can be turned into a nice text-mode display like ▁▂▃▅▂▇

@thcipriani, @mmodell & I talked about next steps on IRC, and identified three independent tasks:

  1. Create a script to run test requests for the MediaWiki service. This might be checker.py, which @mobrovac recently wrapped into a handy check-<service> tool. If we use checker.py, we'll need to create a very basic swagger spec with x-ample request/response pairs. I'll try to win @mobrovac over for this part.
  2. Create a script to check logstash for errors after a deploy to a canary host that sees client traffic. Draft script parameters:
    • host: Host name to check; example: mw1017.
    • delay: Time since deploy, in seconds. Used to calibrate error detection. Example: 120

      I'll look into this task.
  3. Add the ability to deploy to specific canary hosts & run a command afterwards in scap{2,3} (a rough sketch of this flow follows the list):
    a. Deploy to mw1017 and run a checker script right away (possibly check-mediawiki).
    b. If the return status from a) was 0, deploy to a second canary that receives client requests. Wait for delay seconds, then run the log check script from overall step 2).
    c. If the check script from b) returned 0, proceed with the deploy.

      @mmodell and @thcipriani made noises about tackling this one.
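
To make step 3 concrete, here is a minimal sketch of the intended sequence. Everything here is an assumption for illustration (the deploy() helper, host names, and the exact checker invocations); the real orchestration would live inside scap:

import subprocess
import sys
import time

TEST_HOST = "mw1017"       # synthetic-traffic test host
LIVE_CANARY = "mw1167"     # canary that receives client requests
CHECK_DELAY = 120          # seconds to wait before checking logs

def deploy(host):
    """Placeholder: sync the new code to a single host (scap would do this internally)."""
    return 0

def run(cmd):
    """Run an external check command and return its exit status."""
    return subprocess.call(cmd)

# a) deploy to the test host and smoke-test it with synthetic requests
if deploy(TEST_HOST) != 0 or run(["check-mediawiki", TEST_HOST]) != 0:
    sys.exit("smoke test on %s failed, aborting" % TEST_HOST)

# b) deploy to a live-traffic canary, wait, then check logstash error rates
if deploy(LIVE_CANARY) != 0:
    sys.exit("canary deploy to %s failed" % LIVE_CANARY)
time.sleep(CHECK_DELAY)
if run(["logstash_checker.py", "--host", LIVE_CANARY, "--delay", str(CHECK_DELAY)]) != 0:
    sys.exit("error rate on %s rose after the deploy, aborting" % LIVE_CANARY)

# c) both checks passed: continue with the full roll-out
print("canary checks passed, proceeding with roll-out")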

@GWicke just a couple of suggestions:

  1. If we want to check MediaWiki with service_checker (which I do agree with; I might have mentioned it on IRC repeatedly), we should document how service_checker works a bit better and ask the MediaWiki developers to evolve the spec file as MediaWiki evolves. At the moment, the number of people who know how to write a spec for the checker script is awfully limited. This can of course be done once we've done the first deploy.
  2. As canaries, I'd choose at least two hosts, i.e. one basic appserver and one API appserver that serve real traffic, so I'd avoid mw1017, for example. Also, I think we should avoid referencing a single physical host directly, though I'm honestly unsure what the best way to do that is. Having more than one canary in each pool would avoid having to change the config when a host is down for hardware failure or maintenance.
GWicke renamed this task from "Scap3 check to monitor logstash and detect changes in error frequency" to "Basic scap{2,3} canary deployment process & checks". Jun 2 2016, 11:51 PM

Change 292505 had a related patch set uploaded (by GWicke):
WIP: logstash_checker script for canary deploys

https://gerrit.wikimedia.org/r/292505

@Joe: Maybe it should just pick a host at random each time instead of being hard-coded?

@Joe, agreed on the need for more documentation. It'll still be nice to share the code, and the spec will probably come in handy for other applications like fuzzing.

Re 2), my initial idea was to start with a test-only host like mw1017, so that completely broken deploys (ex: syntax error) never become visible to users. Basically a smoke test with synthetic traffic only.

The second round would then go for canary hosts with real traffic. I agree that the host selection should ideally be fairly random. Perhaps we could draw it from the hosts enabled in LVS?

A quick update re https://gerrit.wikimedia.org/r/292505: basic check functionality is now in place. I think the query could use some improvement, possibly by moving to the newer Elasticsearch aggregations API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html); the 'facets' feature that Kibana is using there is deprecated.
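
For reference, a rough sketch of what the same histogram could look like using the aggregations API instead of facets. The index name, interval, field names, and credentials follow the examples above, but the exact mapping would need checking against our Elasticsearch version:

import json
import requests

# Aggregations-based replacement for the deprecated facets query.
query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"range": {"@timestamp": {"gte": "now-10m", "lte": "now"}}},
                {"query_string": {"query": 'host:("mw1167") AND type:mediawiki'}},
            ],
            "must_not": [
                {"query_string": {"query": 'level:("INFO") OR level:("WARNING")'}},
            ],
        }
    },
    "aggs": {
        "errors_over_time": {
            "date_histogram": {"field": "@timestamp", "interval": "10s"}
        }
    },
}

resp = requests.post(
    "https://logstash.wikimedia.org/logstash-2016.06.02/_search",
    auth=("<user>", "<pass>"),
    data=json.dumps(query),
)
for bucket in resp.json()["aggregations"]["errors_over_time"]["buckets"]:
    print("%s %s" % (bucket["key_as_string"], bucket["doc_count"]))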

In any case, I'll add some finishing touches tomorrow, at which point this should be ready for review.

Nice work @GWicke! This is a very cool thing and I'm happy to see it happening.

Example run of https://gerrit.wikimedia.org/r/#/c/292505/:

python logstash_checker.py --host mw1167 --auth 'GWicke:<pass>'
DEBUG: logstash response {u'hits': {u'hits': [], u'total': 603123301, u'max_score': 0.0}, u'_shards': {u'successful': 44, u'failed': 0, u'total': 44}, u'took': 604, u'facets': {u'0': {u'_type': u'date_histogram', u'entries': [{u'count': 1, u'time': 1465256330000}, {u'count': 1, u'time': 1465256370000}, {u'count': 1, u'time': 1465256430000}, {u'count': 1, u'time': 1465256440000}, {u'count': 1, u'time': 1465256460000}, (...), {u'count': 1, u'time': 1465259770000}, {u'count': 1, u'time': 1465259840000}]}}, u'timed_out': False}
OK ( 1.49462365591 -> 1.0 )

One question in terms of integration with Scap: would it be possible to have logstash_checker.py accept multiple hostnames?

We could run multiple instances of logstash_checker from scap in parallel without issue, but it would be easier to just do: logstash_checker.py --auth mwdeploy:<pass> --hosts mw1167 mw1099 mw1189

would it be possible to have logstash_checker.py accept multiple hostnames?

It depends on which semantics we are shooting for. Assuming we are going to deploy to an app server & one API server, I suspect we'll want to check each for a rise in the error rate separately, rather than averaging error rates across both. If this is the case, then there doesn't seem to be much of an advantage in adding this to the script, rather than running two instances in parallel.
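
If we keep the per-host semantics, running the instances in parallel from scap could look roughly like this (a sketch only; the checker arguments mirror the example above):

import subprocess

CANARIES = ["mw1167", "mw1099"]   # e.g. one appserver and one API appserver

# start one checker per canary host and wait for all of them
procs = {
    host: subprocess.Popen(
        ["logstash_checker.py", "--host", host, "--auth", "mwdeploy:<pass>"]
    )
    for host in CANARIES
}
failed = [host for host, proc in procs.items() if proc.wait() != 0]
if failed:
    raise SystemExit("canary check failed on: %s" % ", ".join(failed))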

This sounds reasonable to me.

This only requires a few tweaks; preliminary work is complete in {D248}. I'll update it with the assumption of running this script on an appserver and an API server. Scap has no concept of there being a difference between the two at this point.

Hopefully, we'll be able to get all this in the next release! :)

Change 294742 had a related patch set uploaded (by Thcipriani):
scap: make deployment aware of canary machines

https://gerrit.wikimedia.org/r/294742

Change 294742 merged by Filippo Giunchedi:
scap: make deployment aware of canary machines

https://gerrit.wikimedia.org/r/294742

So it seems that since the update to the new Kibana, we're no longer exposing the _search API in the same way. The only information I've found on it so far is a GitHub issue, https://github.com/elastic/kibana/issues/5791#issuecomment-168154419, which seems to pretty much match up with what I'm seeing since the upgrade: https://logstash.wikimedia.org/logstash-*/_search

From some testing in beta, it seems like we could still query elasticsearch directly, but I don't think we expose that anywhere.

@EBernhardson @bd808 is there any precedent and/or documentation for doing direct logstash queries?

The main change here was that the old version of kibana was a webapp that talked directly to elasticsearch. The new version is a nodejs app that users talk to, so there is no direct proxy to the elasticsearch cluster anymore.

The most obvious way to go about this would be to allow certain things in the cluster to talk to the logstash elasticsearch servers directly. This is currently blocked by ferm, but it seems plausible that we could open up a hole for the deployment servers?

Change 292505 merged by Filippo Giunchedi:
Logstash_checker script for canary deploys

https://gerrit.wikimedia.org/r/292505

greg subscribed.

I don't believe there is anything left to do here to complete this task as stated. "Basic" is in place. Improvements/refinements go elsewhere :)

There is also: T136839: Create a script to run test requests for the MediaWiki service