The action API is a critical service in our infrastructure; it should have solid performance and error metrics, along with alerts on those. As far as I am aware, we currently have neither.
It turns out that there are some Varnish backend metrics (example), but it is not entirely clear whether they record latencies or request rates. The sample_rate sub-metric correlates quite well with request rates for restbase, so I suspect the main values are actually latencies.
@ori also recently added timing information to Apache logs. However, those aren't currently aggregated and recorded in graphite, which makes it hard to correlate them with other data or to set up alerts. An advantage of using the Apache metrics is that they would also cover internal API requests from services like Parsoid or RESTBase.
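Since the timing data is already in the Apache logs, getting it into graphite could be as simple as tailing the logs and forwarding per-request timings to statsd. A rough sketch of that idea; the log field layout, the metric name, and the statsd address are all assumptions for illustration, not our actual config:

```python
import re
import socket

# Assumed local statsd endpoint (stock statsd UDP port).
STATSD_ADDR = ("127.0.0.1", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Assume a combined-style log line ending in: "<request>" <status> <bytes> <usec>
# where the last field is Apache's %D (request duration in microseconds).
LINE_RE = re.compile(r'"\s+(?P<status>\d{3})\s+\S+\s+(?P<usec>\d+)$')

def handle_line(line):
    """Parse one log line; emit a statsd timing and return the latency in ms."""
    m = LINE_RE.search(line)
    if not m:
        return None
    ms = int(m.group("usec")) / 1000.0
    # statsd timing wire format: <metric>:<value>|ms
    payload = "apache.api.request_time:%0.2f|ms" % ms
    sock.sendto(payload.encode("ascii"), STATSD_ADDR)
    return ms
```

In practice this would run as a small daemon per appserver (or be handled by an existing log pipeline), with statsd doing the aggregation into rates and percentiles before they land in graphite.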
I have set up a basic latency and request rate dashboard using the Varnish metrics at https://grafana.wikimedia.org/dashboard/db/api-summary.
I have not found any metrics reporting status codes, which would be especially useful for setting up alerts. Setting up those metrics for all Varnish backends would be great.
Is there stuff left to do here beyond what's present in the current dashboards? I mean, our metrics can always be "better", but this task seems to lack specifics for someone to actually do.
That's fair; that is what's in the title. I think I was thinking one thing and saying another above: I was thinking "I don't see a useful path forward here because of meta-issues", but saying "what's left to do here?", which is very different.
Let's back up a step:
I'm looking at https://grafana.wikimedia.org/dashboard/db/api-summary, which is built from some portion of the varnishstatsd output, basically what's linked as the work done toward this task's goal. It has the GET/POST request counts/rates and their latencies, broken down by backend and filtered to just the API-like backends.
Is Varnish even the right place to be logging these metrics? Technically, not all APIs (in the general sense) are exposed through the Traffic layer, only public ones. Further, even for public APIs like api.svc and restbase.svc, not all requests necessarily come through the public entrypoint. Surely there are cases (or should be? or will be?) where we use APIs internally, with fooservice making calls directly to bar.svc.eqiad.wmnet. Those are invisible to Varnish and still vitally important when tracing through performance and functional problems.
There's also the question of Varnish not really having the best information here for interpreting errors, or for deeper metrics like timing breakdowns. We can pass that information up to Varnish in headers, but that seems to work at cross purposes: we're now sending Varnish information the end-user doesn't need, just so Varnish can log information the service should already know about itself. We've done this already in the MediaWiki case with the Backend-Timing outputs, but I think that's probably a bad pattern to follow. In a total failure we lose the deeper information anyway; Varnish can't see through many classes of error/problem that just result in a "503" from its perspective, but internal logging/metrics in the service can.
I would argue that services should log their own request metrics to graphite (or prometheus in the near future). It may still be useful from the Traffic perspective that we log the data we have from varnishstatsd so that we can correlate public-facing anomalies with Varnish's perspective on the backend services, but that's a very different (if related) thing.
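To make the "services log their own request metrics" idea concrete, here is a minimal sketch of the kind of per-request accounting a service could keep and ship itself. The class and metric names are illustrative, not an existing library; a real setup would push these samples to statsd or expose them to prometheus rather than hold them in memory:

```python
from collections import defaultdict

class RequestMetrics:
    """Toy in-process request metrics: status counts plus latency samples."""

    def __init__(self):
        self.status_counts = defaultdict(int)  # e.g. {"200": 3, "503": 1}
        self.latencies_ms = []                 # raw samples; a real client would
                                               # forward these instead of storing

    def observe(self, status, latency_ms):
        """Record one completed request: its status code and latency."""
        self.status_counts[str(status)] += 1
        self.latencies_ms.append(latency_ms)

    def p99(self):
        """Nearest-rank p99 over the recorded samples (None if no data)."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        idx = max(0, int(round(0.99 * len(ordered))) - 1)
        return ordered[idx]
```

The point is that the service sees the real status codes and internal timing breakdown, exactly the information Varnish can't reliably reconstruct from the outside.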
What Alerts? Regardless of whether the service logs its own metrics or Varnish interprets and logs for it, what are the useful alerts on this? The best example we have today is the semi-intelligent 5xx rate alerts, but they tend to miss important spikes and false-positive on minor non-issues. That is probably the subject of some future work on having better AI-like stuff that can usefully detect notable deviations from the norm across all our metrics in general. I think anything we have today is going to just create more icinga spam in the IRC channel?
Maybe some decisions need to be made about these things, or discussions need to be had. But I still don't see an explicit task we're ready to accomplish here in this ticket that makes sense.
> I think anything we have today is going to just create more icinga spam in the IRC channel?
While I agree that we need better root cause filtering, targeting & prioritization of alerts, I do not think that should be a reason not to have at least basic alerts on a major production API.
> Is Varnish even the right place to be logging these metrics?
As you note, it certainly is not capturing all API requests. However, the data is readily available & covers a very large portion of API traffic. I think it would make for a good start.
Using grafana's new & spiffy alert feature, I set up a simple alert for the RESTBase backend request latency using the Varnish data, and also set up an alert for action API latency as seen by RESTBase. Once T162765 is resolved, the services team will get notifications for those alerts. We can use the same mechanism or native icinga alerts for the action API as seen by Varnish. Those alerts should be sent to the operations channel, the same way other service alerts are.
We have now had quite a few instances of icinga alerting only on services that indirectly use the action API. This caused us to repeatedly start investigating those services, only to find out that the action API had had a brief outage. Alerting directly on the action API would surface such issues more quickly, and would often avoid the detour through other services.
FTR, this is the graph with the alert I mentioned: https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1
The alert is defined in https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1&edit&tab=alert, triggering when the average p99 latency over 10 minutes is above 1500ms. This value has generally worked well, with no false triggers in the last few months, while correctly highlighting https://wikitech.wikimedia.org/wiki/Incident_documentation/20170917-MediawikiApi. That said, a more conservative threshold of 2000ms and perhaps a slightly longer time window could reduce the chance of false alerts even further, while still catching major issues.
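The alert condition described above reduces to a simple rule: fire when the mean of the p99 latency readings over the evaluation window exceeds the threshold. A sketch of that logic (the 1500ms threshold and 10-minute window come from the dashboard config; the function itself is illustrative, not Grafana's implementation, and Grafana handles the "no data" case via its own setting):

```python
def alert_fires(p99_samples_ms, threshold_ms=1500):
    """p99_samples_ms: one p99 latency reading per scrape over the
    evaluation window (e.g. 10 minutes). Returns True if the alert
    condition 'avg(p99) > threshold' holds."""
    if not p99_samples_ms:
        return False  # "no data" is handled separately in Grafana
    return sum(p99_samples_ms) / len(p99_samples_ms) > threshold_ms
```

Averaging over the window is what makes a single slow scrape insufficient to trigger, which is why lengthening the window (as suggested above) trades a bit of detection latency for fewer false alerts.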