Create WDQS uptime SLO
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a user and a maintainer of WDQS, I want an expectation of service availability so that I know when issues can/should be resolved.

The WDQS uptime SLO will be based on running a set of non-cached representative test queries periodically on the WDQS cluster, and comparing the time it takes for the queries to run against the baseline expectation; if this test time is over a TBD threshold, WDQS will be considered down, and require maintenance. This should approximate actual service availability for users. The tests will be non-cached and run against the entire cluster rather than per host.

Example -- test queries are run hourly and should not take more than 200% of the baseline time to complete (<200ms if the baseline is 100ms). The goal is to maintain this uptime 95% of the time.
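
For illustration, a minimal sketch of what such a check could look like; the endpoint, test query, baseline, and threshold below are placeholders rather than the actual implementation:

```
import time
import requests  # any HTTP client would do; requests is just an example

# Placeholder values: the real query set, baselines, and threshold are still TBD.
WDQS_ENDPOINT = "https://query.example.org/sparql"  # hypothetical endpoint
TEST_QUERIES = {
    # query name -> (SPARQL text, baseline runtime in seconds)
    "simple_probe": ("SELECT * WHERE { ?s ?p ?o } LIMIT 1", 0.100),
}
THRESHOLD_FACTOR = 2.0  # "no more than 200% of baseline"

def wdqs_is_up() -> bool:
    """Run each test query and compare wall-clock time against 200% of its baseline."""
    for name, (query, baseline) in TEST_QUERIES.items():
        start = time.monotonic()
        resp = requests.get(WDQS_ENDPOINT,
                            params={"query": query, "format": "json"},
                            timeout=60)
        elapsed = time.monotonic() - start
        if resp.status_code != 200 or elapsed > THRESHOLD_FACTOR * baseline:
            # Any failing or over-threshold query marks the service as down.
            return False
    return True
```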

Sub-tasks:

  • Decided on SLO => trafficserver_backend metrics
  • WDQS SLO comms have been sent out
  • Implemented trafficserver metrics to see SLO performance

AC:

Event Timeline

Aisha has written some Jupyter notebooks to pull together a random selection from groupings of queries by time-to-completion and query structure (which operators are used, basically).

On the ops side of things we'll need to decide whether we want to just run a single simple query on every host or run a whole set of queries. Right now, with Aisha's data, we have the option of choosing either way.

Some quick pros/cons of two possible approaches to getting the SLI metrics: approach #1 is to run a query (or set of queries) per-DC at a certain frequency; approach #2 is to run a query on each host at a certain frequency.

* Approach #1: Hit wdqs.svc.{codfw,eqiad}.wmnet

Pros
  • Routing through pybal so we automatically ignore depooled hosts
  • Covers a broader class of failures than simply running queries on each host
  • Maps a bit better to the actual user experience (ie if 10% of hosts are down)
Cons
  • Adds some complexity in terms of understanding how routing works (ex: do we have to worry about geoDNS [ie that we might end up unintentionally always routing to the same host] or is that [geoDNS] "higher up" in the stack and therefore not relevant?)

* Approach #2: Just run a simple query on each host in the fleet

Pros
  • Easy to reason about
  • Constantly testing each host individually, so we have host-level granularity
Cons
  • For generating the SLI itself, we'd need to filter out, for each host, the time ranges in which it's depooled
  • Not as pure a gauge of user experience as hitting wdqs.svc.{codfw,eqiad}.wmnet

Personally I lean a bit towards #1 because it intuitively seems to measure the user experience better, but I do have significant gaps in my understanding of our network / request routing stack, so there are perhaps more unknowns with that approach. I'll need to do some sanity checking and circle back here with more clarity.
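
To make approach #1 a bit more concrete, here's a rough sketch of what a per-DC probe could look like (only the service hostnames come from above; the scheme, path, and probe query are assumptions):

```
import requests

# Per-DC probe sketch for approach #1. The service hostnames are the ones above;
# the scheme, path, and probe query are assumptions.
DC_ENDPOINTS = [
    "http://wdqs.svc.eqiad.wmnet/sparql",
    "http://wdqs.svc.codfw.wmnet/sparql",
]
PROBE_QUERY = "ASK { ?s ?p ?o }"  # cheap, illustrative probe query

def probe_dc(endpoint: str, timeout: float = 10.0) -> bool:
    """Return True if the per-DC LVS endpoint answers the probe query successfully."""
    try:
        resp = requests.get(endpoint,
                            params={"query": PROBE_QUERY, "format": "json"},
                            timeout=timeout)
        return resp.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for endpoint in DC_ENDPOINTS:
        print(endpoint, "up" if probe_dc(endpoint) else "down")
```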

Intro (some context for traffic team)

Search team is working on creating an SLI to measure uptime of WDQS. We want our SLI to map as well as possible to the actual user experience, so to that end we're trying to come up with a way to hit WDQS endpoints externally or semi-externally. Ideally the solution wouldn't need to care about the pool/depool state of the underlying hosts (translation: if a host is depooled, the request won't ultimately route to it).

The primary idea we have is to send automated requests to each datacenter, for example by querying wdqs.svc.{codfw,eqiad}.wmnet. However I/we have some knowledge gaps that make it a bit murky to get a clear idea of what exactly that solution looks like.

Do you all have any thoughts on the best way to do this?

To that end it'd help to sanity check a few assumptions I'm working off of:

Assumptions

(A1) By hitting wdqs.svc.{codfw,eqiad}.wmnet, we're bypassing any geoDNS logic, since that happens higher up in the stack.

(A2) Requests that hit wdqs.svc.{codfw,eqiad}.wmnet round robin to an underlying pooled host in the fleet (https://config-master.wikimedia.org/pybal/eqiad/wdqs for example)
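
As a quick way to sanity-check (A2) on our end, something like the following could list what pybal currently considers pooled, using the config-master URL above (the exact file format is an assumption here, so the parsing is deliberately loose):

```
import requests

# Sanity-check helper for (A2): list what pybal currently considers pooled for wdqs
# in a given DC, using the config-master URL linked above. The exact file format is
# an assumption, hence the loose parsing.
POOL_URL = "https://config-master.wikimedia.org/pybal/{dc}/wdqs"

def pooled_hosts(dc: str) -> list:
    text = requests.get(POOL_URL.format(dc=dc), timeout=10).text
    hosts = []
    for line in text.splitlines():
        # Assumed: one host entry per line, with an enabled/true flag when pooled.
        if "enabled" in line.lower() and "true" in line.lower():
            for token in line.replace("'", " ").replace('"', " ").replace(",", " ").split():
                if token.endswith(".wmnet"):
                    hosts.append(token)
                    break
    return hosts

if __name__ == "__main__":
    print(pooled_hosts("eqiad"))
```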

Beyond the sanity check of those assumptions, I have a few further questions:

Questions

(Q1) Is there precedent for this pattern already, ie is there perhaps an existing service that uses a similar approach?

(Q2) If there isn't already precedent for this, what are your initial thoughts on the best way to do this? For example could we just literally send a request to wdqs.svc.{codfw,eqiad}.wmnet every X minutes from an alerting host, or is there a better way?

Gehel and I met with bblack today.

Some highlights:

  • Best to use real user traffic if possible, rather than artificial. However this might be difficult for our use case (given that we consider some subset of queries invalid/failing)
  • If going the artificial traffic route, it makes more sense to do local queries on each host (accounting for pool/depool status ofc, and sketched below) rather than running queries at the DC level. This way we have a better separation of concerns.
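
A rough sketch of what that per-host local probe could look like (the local endpoint and probe query are assumptions; depooled hosts would be filtered out afterwards when computing the SLI, e.g. via the config-master pool state above):

```
import time
import requests

# Per-host local probe sketch: each wdqs host queries its own nginx/blazegraph and
# records success + latency. The local endpoint and probe query are assumptions.
LOCAL_ENDPOINT = "http://localhost/sparql"
PROBE_QUERY = "ASK { ?s ?p ?o }"

def local_probe(timeout: float = 10.0):
    """Return (success, latency_seconds) for a single local probe query."""
    start = time.monotonic()
    try:
        resp = requests.get(LOCAL_ENDPOINT,
                            params={"query": PROBE_QUERY, "format": "json"},
                            timeout=timeout)
        return resp.ok, time.monotonic() - start
    except requests.RequestException:
        return False, time.monotonic() - start
```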

With respect to the SLO itself, our goal is an SLO that captures the promise we make about service availability: namely, that WDQS is available on a best-effort basis. In practice, this means that if an issue arises outside of "business hours", it's acceptable to wait until "business hours" to resolve it. For example, in the most extreme case, if the service were to have an outage on a Friday night, we wouldn't be paging anyone to work that night or over the weekend, but come Monday we'd focus our efforts on restoring availability as soon as possible. This specific scenario - a multi-day full outage - would of course be quite rare (on the order of a few times a year at most, but generally much less).

Thus our uptime % goal should reflect the above reality. I think a good starting point would be 95% uptime. This means that the service could be down for 18.25 days out of a year. With that number we could have basically one full weekend outage a quarter and be within our threshold.
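
For reference, the error-budget arithmetic behind those numbers:

```
# Error-budget arithmetic behind a 95% uptime target.
target = 0.95
days_per_year = 365
allowed_downtime_days = (1 - target) * days_per_year
print(round(allowed_downtime_days, 2))      # 18.25 days per year
print(round(allowed_downtime_days / 4, 2))  # ~4.56 days per quarter, so a 2-day
                                            # weekend outage each quarter fits well
                                            # within the budget
```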

Note that any % chosen is to some extent arbitrary. For example if WDQS were down during business hours and we weren't doing anything to try to fix it, but were still above 95% uptime, we'd be within our technical SLO but not actually meeting our best-effort claim. Conversely, if we were experiencing frequent weekend outages but were always getting things operational by the time the normal workweek commenced, we could fall below our SLO's threshold while still actually meeting our own expectations for the service. But this 95% number seems like a reasonable initial target to convey our intent with this service. To be clear, in practice, at least based on current performance, I'd expect our uptime to be well over that 95% minimum bar, but the point is that the 95% threshold lets us be explicit about what kind of error budget we're allowing for.

Change 841582 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] [wip] query_service: try installing nginx w extras

https://gerrit.wikimedia.org/r/841582

Change 841582 merged by Ryan Kemper:

[operations/puppet@production] wdqs-test: try installing nginx w extras

https://gerrit.wikimedia.org/r/841582

Change 841518 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "wdqs-test: try installing nginx w extras"

https://gerrit.wikimedia.org/r/841518

Change 841518 merged by Ryan Kemper:

[operations/puppet@production] Revert "wdqs-test: try installing nginx w extras"

https://gerrit.wikimedia.org/r/841518

The current approach we're working towards is recording the nginx response codes for requests. That will give us insight into the number of failures we're seeing.

At a high level, these are the various response codes we expect for different scenarios:

  • User throttled => 429 ("Too Many Requests - Please retry in %s seconds.")
  • User banned => 403 ("You have been banned until %s, please respect throttling and retry-after headers.")
  • Successful request => 2xx
  • Failed request => 5xx

  • One common failure mode is a specific wdqs host's blazegraph instance being deadlocked. In this case, nginx will never hear back from blazegraph, and will issue some sort of 5xx code (not yet sure which exact code)
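
Roughly, the bucketing for SLI purposes would look something like this (a sketch only; how 403/429 ultimately count toward the SLI is a separate decision):

```
# Sketch of how the response codes above could be bucketed when computing the SLI.
def classify(status: int) -> str:
    if 200 <= status < 300:
        return "success"
    if status == 429:
        return "throttled"  # "Too Many Requests - Please retry in %s seconds."
    if status == 403:
        return "banned"     # user banned for ignoring throttling / retry-after
    if 500 <= status < 600:
        return "failure"    # e.g. nginx giving up on a deadlocked blazegraph
    return "other"
```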

With respect to recording nginx request responses:

Getting direct logs: One idea is to add /var/log/nginx/access.log to RollingFileAppender in modules/query_service/templates/logback.xml.erb (https://github.com/wikimedia/puppet/blob/6e3c52f30166f88c5021c11ebd5f6aa411118854/modules/query_service/templates/logback.xml.erb#L29)

Example log line:
```
(REDACTED IP; EXAMPLE FORMAT xx.xx.x.xx) - - [13/Oct/2022:16:28:14 +0000] "GET /sparql?format=json&query=REDACTED_QUERY_STRING HTTP/1.1" 200 97 "-" "REDACTED_USER_AGENT"
```

Pros: We're ingesting the full log line, not just Prometheus metrics. This would make it easier to correlate, say, a spike in 5xx responses with the log lines corresponding to the actual requests.

Cons: We don't directly get time-series metrics for this. We'd probably want to separately ingest corresponding time series metrics so we can actually see this in Grafana. Kibana has the ability to visualize by parsing log lines, but this is computationally expensive, so we probably want to directly export nginx request metrics to Prometheus.

Getting metrics: I'm a bit hazy on the best way to do this. There's hopefully a pretty straightforward way. I'll see if o11y has any thoughts on this.
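
One rough option for the metrics side (a sketch only, assuming the Prometheus Python client; an existing nginx exporter, or the trafficserver_backend metrics mentioned in the sub-tasks, may well be the better path) would be to tail the access log and export per-status counters:

```
import re
import time
from prometheus_client import Counter, start_http_server  # assumed available

# Matches the access-log example above: ... "GET /sparql?... HTTP/1.1" 200 97 ...
LOG_RE = re.compile(r'"\S+ \S+ HTTP/[\d.]+" (?P<status>\d{3}) ')

# Hypothetical metric name; the label carries the HTTP status code.
REQUESTS = Counter("wdqs_nginx_requests_total",
                   "WDQS nginx requests by HTTP status code",
                   ["status"])

def follow(path):
    """Yield lines appended to the log file (a very naive tail -f)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

if __name__ == "__main__":
    start_http_server(9100)  # exporter port is an arbitrary placeholder
    for line in follow("/var/log/nginx/access.log"):
        m = LOG_RE.search(line)
        if m:
            REQUESTS.labels(status=m.group("status")).inc()
```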

A few comments on the current dashboard:

  • a very quick look at Turnilo: the graphs look different enough that I'd like to know why there are discrepancies
  • as discussed, we should define the service as "working" not only when returning HTTP/200, but also when requests are throttled (429) or banned (403)
  • we probably need to dig a bit more into other response codes and the dips we see in the graph to understand what they are and if they are problematic (and thus refine our definition of a "working" service)

Just following up here: dashboard was updated to accept any of 200, 403, or 429 as successful as far as our SLI is concerned. Working on updating our SLO documentation accordingly.
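
For reference, the resulting SLI is just the ratio of 200/403/429 responses to all responses, e.g.:

```
# SLI sketch matching the updated dashboard: 200, 403 and 429 all count as the
# service "working"; everything else counts against availability.
GOOD_STATUSES = {"200", "403", "429"}

def sli(requests_by_status):
    """Fraction of requests considered successful for SLI purposes."""
    total = sum(requests_by_status.values())
    good = sum(count for status, count in requests_by_status.items()
               if status in GOOD_STATUSES)
    return good / total if total else 1.0

# Example: sli({"200": 9500, "429": 300, "403": 50, "503": 150}) == 0.985
```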

With https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938, we now have the grizzly dashboard where we want it. That was the last blocker for closing out this ticket, so this should be all done.