Create WDQS uptime SLO
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a user and a maintainer of WDQS, I want an expectation of service availability so that I know when issues can/should be resolved.

The WDQS uptime SLO will be based on running a set of non-cached representative test queries periodically on the WDQS cluster, and comparing the time it takes for the queries to run against the baseline expectation; if this test time is over a TBD threshold, WDQS will be considered down, and require maintenance. This should approximate actual service availability for users. The tests will be non-cached and run against the entire cluster rather than per host.

Example -- test queries are run hourly and should not take more than 200% of the baseline time to complete (<200ms if the baseline is 100ms). The goal is to maintain this uptime 95% of the time.
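
For illustration, a minimal sketch of what such a check could look like; the endpoint, test query, baseline, and threshold below are placeholders rather than the actual implementation:

```
import time
import requests  # any HTTP client would do; requests is just an example

# Placeholder values: the real query set, baselines, and threshold are still TBD.
WDQS_ENDPOINT = "https://query.example.org/sparql"  # hypothetical endpoint
TEST_QUERIES = {
    # query name -> (SPARQL text, baseline runtime in seconds)
    "simple_probe": ("SELECT * WHERE { ?s ?p ?o } LIMIT 1", 0.100),
}
THRESHOLD_FACTOR = 2.0  # "no more than 200% of baseline"

def wdqs_is_up() -> bool:
    """Run each test query and compare wall-clock time against 200% of its baseline."""
    for name, (query, baseline) in TEST_QUERIES.items():
        start = time.monotonic()
        resp = requests.get(WDQS_ENDPOINT,
                            params={"query": query, "format": "json"},
                            timeout=60)
        elapsed = time.monotonic() - start
        if resp.status_code != 200 or elapsed > THRESHOLD_FACTOR * baseline:
            # Any failing or over-threshold query marks the service as down.
            return False
    return True
```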

Sub-tasks:

  • Decided on SLO => trafficserver_backend metrics
  • WDQS SLO comms have been sent out
  • Implemented trafficserver metrics to see SLO performance

AC:

Event Timeline

Aisha has written some Jupyter notebooks to pull together a random selection from groupings of queries by time-to-completion and query structure (which operators are used, basically).

On the ops side of things we'll need to decide whether we want to just run a single simple query on every host or run a whole set of queries. Right now, with Aisha's data, we have the option of choosing either way.

Some quick pros/cons of two possible approaches to getting the SLI metrics: approach #1 is to run a query (or set of queries) per-DC at a certain frequency; approach #2 is to run a query on each host at a certain frequency.

* Approach #1: Hit wdqs.svc.{codfw,eqiad}.wmnet

Pros
  • Routing through pybal so we automatically ignore depooled hosts
  • Covers a broader class of failures than simply running queries on each host
  • Maps a bit better to the actual user experience (ie if 10% of hosts are down)
Cons
  • Adds some complexity in terms of understanding how routing works (ex: do we have to worry about geoDNS [ie that we might end up unintentionally always routing to the same host] or is that [geoDNS] "higher up" in the stack and therefore not relevant?)

* Approach #2: Just run a simple query on each host in the fleet

Pros
  • Easy to reason about
  • Constantly testing each host individually, so we have host-level granularity
Cons
  • For generating the SLI itself, we'd need to filter out, for each host, the time ranges in which it's depooled
  • Not as pure a gauge of user experience as hitting wdqs.svc.{codfw,eqiad}.wmnet

Personally I lean a bit towards #1 because it intuitively seems to measure the user experience better, but I do have significant gaps in my understanding of our network / request routing stack, so there are perhaps more unknowns with that approach. I'll need to do some sanity checking and circle back here with more clarity.
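
To make approach #1 a bit more concrete, here's a rough sketch of what a per-DC probe could look like (only the service hostnames come from above; the scheme, path, and probe query are assumptions):

```
import requests

# Per-DC probe sketch for approach #1. The service hostnames are the ones above;
# the scheme, path, and probe query are assumptions.
DC_ENDPOINTS = [
    "http://wdqs.svc.eqiad.wmnet/sparql",
    "http://wdqs.svc.codfw.wmnet/sparql",
]
PROBE_QUERY = "ASK { ?s ?p ?o }"  # cheap, illustrative probe query

def probe_dc(endpoint: str, timeout: float = 10.0) -> bool:
    """Return True if the per-DC LVS endpoint answers the probe query successfully."""
    try:
        resp = requests.get(endpoint,
                            params={"query": PROBE_QUERY, "format": "json"},
                            timeout=timeout)
        return resp.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for endpoint in DC_ENDPOINTS:
        print(endpoint, "up" if probe_dc(endpoint) else "down")
```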

Intro (some context for traffic team)

Search team is working on creating an SLI to measure uptime of WDQS. We want our SLI to map as well as possible to the actual user experience, so to that end we're trying to come up with a way to hit WDQS endpoints externally or semi-externally. Ideally the solution wouldn't need to care about the pool/depool state of the underlying hosts (translation: if a host is depooled, the request won't ultimately route to it).

The primary idea we have is to send automated requests to each datacenter, for example by querying wdqs.svc.{codfw,eqiad}.wmnet. However I/we have some knowledge gaps that make it a bit murky to get a clear idea of what exactly that solution looks like.

Do you all have any thoughts on the best way to do this?

To that end it'd help to sanity check a few assumptions I'm working off of:

Assumptions

(A1) By hitting wdqs.svc.{codfw,eqiad}.wmnet, we're bypassing any geoDNS logic, since that happens higher up in the stack.

(A2) Requests that hit wdqs.svc.{codfw,eqiad}.wmnet round robin to an underlying pooled host in the fleet (https://config-master.wikimedia.org/pybal/eqiad/wdqs for example)
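
As a quick way to sanity-check (A2) on our end, something like the following could list what pybal currently considers pooled, using the config-master URL above (the exact file format is an assumption here, so the parsing is deliberately loose):

```
import requests

# Sanity-check helper for (A2): list what pybal currently considers pooled for wdqs
# in a given DC, using the config-master URL linked above. The exact file format is
# an assumption, hence the loose parsing.
POOL_URL = "https://config-master.wikimedia.org/pybal/{dc}/wdqs"

def pooled_hosts(dc: str) -> list:
    text = requests.get(POOL_URL.format(dc=dc), timeout=10).text
    hosts = []
    for line in text.splitlines():
        # Assumed: one host entry per line, with an enabled/true flag when pooled.
        if "enabled" in line.lower() and "true" in line.lower():
            for token in line.replace("'", " ").replace('"', " ").replace(",", " ").split():
                if token.endswith(".wmnet"):
                    hosts.append(token)
                    break
    return hosts

if __name__ == "__main__":
    print(pooled_hosts("eqiad"))
```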

Beyond the sanity check of those assumptions, I have a few further questions:

Questions

(Q1) Is there precedent for this pattern already, ie is there perhaps an existing service that uses a similar approach?

(Q2) If there isn't already precedent for this, what are your initial thoughts on the best way to do this? For example could we just literally send a request to wdqs.svc.{codfw,eqiad}.wmnet every X minutes from an alerting host, or is there a better way?

Gehel and I met with bblack today.

Some highlights:

  • Best to use real user traffic if possible, rather than artificial. However this might be difficult for our use case (given that we consider some subset of queries invalid/failing)
  • If going the artificial traffic route, it makes more sense to do local queries on each host (accounting for pool/depool status ofc, and sketched below) rather than running queries at the DC level. This way we have a better separation of concerns.
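
A rough sketch of what that per-host local probe could look like (the local endpoint and probe query are assumptions; depooled hosts would be filtered out afterwards when computing the SLI, e.g. via the config-master pool state above):

```
import time
import requests

# Per-host local probe sketch: each wdqs host queries its own nginx/blazegraph and
# records success + latency. The local endpoint and probe query are assumptions.
LOCAL_ENDPOINT = "http://localhost/sparql"
PROBE_QUERY = "ASK { ?s ?p ?o }"

def local_probe(timeout: float = 10.0):
    """Return (success, latency_seconds) for a single local probe query."""
    start = time.monotonic()
    try:
        resp = requests.get(LOCAL_ENDPOINT,
                            params={"query": PROBE_QUERY, "format": "json"},
                            timeout=timeout)
        return resp.ok, time.monotonic() - start
    except requests.RequestException:
        return False, time.monotonic() - start
```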

With respect to the SLO itself, our goal is an SLO that captures the promise we make about service availability: namely, that WDQS is available on a best-effort basis. In practice, this means that if an issue arises outside of "business hours", it's acceptable to wait until "business hours" to resolve it. For example, in the most extreme case, if the service were to have an outage on a Friday night, we wouldn't be paging anyone to work that night or over the weekend, but come Monday we'd focus our efforts on restoring availability as soon as possible. This specific scenario - a multi-day full outage - would of course be quite rare (on the order of a few times a year at most, but generally much less).

Thus our uptime % goal should reflect the above reality. I think a good starting point would be 95% uptime. This means that the service could be down for 18.25 days out of a year. With that number we could have basically one full weekend outage a quarter and be within our threshold.
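
For reference, the error-budget arithmetic behind those numbers:

```
# Error-budget arithmetic behind a 95% uptime target.
target = 0.95
days_per_year = 365
allowed_downtime_days = (1 - target) * days_per_year
print(round(allowed_downtime_days, 2))      # 18.25 days per year
print(round(allowed_downtime_days / 4, 2))  # ~4.56 days per quarter, so a 2-day
                                            # weekend outage each quarter fits well
                                            # within the budget
```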

Note that any % chosen is to some extent arbitrary. For example if WDQS were down during business hours and we weren't doing anything to try to fix it, but were still above 95% uptime, we'd be within our technical SLO but not actually meeting our best-effort claim. Conversely, if we were experiencing frequent weekend outages but were always getting things operational by the time the normal workweek commenced, we could fall below our SLO's threshold while still actually meeting our own expectations for the service. But this 95% number seems like a reasonable initial target to convey our intent with this service. To be clear, in practice, at least based on current performance, I'd expect our uptime to be well over that 95% minimum bar, but the point is that the 95% threshold lets us be explicit about what kind of error budget we're allowing for.

Change 841582 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] [wip] query_service: try installing nginx w extras

https://gerrit.wikimedia.org/r/841582

Change 841582 merged by Ryan Kemper:

[operations/puppet@production] wdqs-test: try installing nginx w extras

https://gerrit.wikimedia.org/r/841582

Change 841518 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "wdqs-test: try installing nginx w extras"

https://gerrit.wikimedia.org/r/841518

Change 841518 merged by Ryan Kemper:

[operations/puppet@production] Revert "wdqs-test: try installing nginx w extras"

https://gerrit.wikimedia.org/r/841518

The current approach we're working towards is recording the nginx response codes for requests. That will give us insight into the number of failures we're seeing.

At a high level, these are the various response codes we expect for different scenarios:

  • User throttled => 429 ("Too Many Requests - Please retry in %s seconds.")
  • User banned => 403 ("You have been banned until %s, please respect throttling and retry-after headers.")
  • Successful request => 2xx
  • Failed request => 5xx

  • One common failure mode is a specific wdqs host's blazegraph instance being deadlocked. In this case, nginx will never hear back from blazegraph, and will issue some sort of 5xx code (not yet sure which exact code)
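
Roughly, the bucketing for SLI purposes would look something like this (a sketch only; how 403/429 ultimately count toward the SLI is a separate decision):

```
# Sketch of how the response codes above could be bucketed when computing the SLI.
def classify(status: int) -> str:
    if 200 <= status < 300:
        return "success"
    if status == 429:
        return "throttled"  # "Too Many Requests - Please retry in %s seconds."
    if status == 403:
        return "banned"     # user banned for ignoring throttling / retry-after
    if 500 <= status < 600:
        return "failure"    # e.g. nginx giving up on a deadlocked blazegraph
    return "other"
```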

With respect to recording nginx request responses:

Getting direct logs: One idea is to add /var/log/nginx/access.log to RollingFileAppender in modules/query_service/templates/logback.xml.erb (https://github.com/wikimedia/puppet/blob/6e3c52f30166f88c5021c11ebd5f6aa411118854/modules/query_service/templates/logback.xml.erb#L29)

Example log line:
```
(REDACTED IP; EXAMPLE FORMAT xx.xx.x.xx) - - [13/Oct/2022:16:28:14 +0000] "GET /sparql?format=json&query=REDACTED_QUERY_STRING HTTP/1.1" 200 97 "-" "REDACTED_USER_AGENT"
```

Pros: We're ingesting the full log line, not just Prometheus metrics. This would make it easier to correlate, say, a spike in 5xx responses with the log lines corresponding to the actual requests.

Cons: We don't directly get time-series metrics for this. We'd probably want to separately ingest corresponding time series metrics so we can actually see this in Grafana. Kibana has the ability to visualize by parsing log lines, but this is computationally expensive, so we probably want to directly export nginx request metrics to Prometheus.

Getting metrics: I'm a bit hazy on the best way to do this. There's hopefully a pretty straightforward way. I'll see if o11y has any thoughts on this.
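
One rough option for the metrics side (a sketch only, assuming the Prometheus Python client; an existing nginx exporter, or the trafficserver_backend metrics mentioned in the sub-tasks, may well be the better path) would be to tail the access log and export per-status counters:

```
import re
import time
from prometheus_client import Counter, start_http_server  # assumed available

# Matches the access-log example above: ... "GET /sparql?... HTTP/1.1" 200 97 ...
LOG_RE = re.compile(r'"\S+ \S+ HTTP/[\d.]+" (?P<status>\d{3}) ')

# Hypothetical metric name; the label carries the HTTP status code.
REQUESTS = Counter("wdqs_nginx_requests_total",
                   "WDQS nginx requests by HTTP status code",
                   ["status"])

def follow(path):
    """Yield lines appended to the log file (a very naive tail -f)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

if __name__ == "__main__":
    start_http_server(9100)  # exporter port is an arbitrary placeholder
    for line in follow("/var/log/nginx/access.log"):
        m = LOG_RE.search(line)
        if m:
            REQUESTS.labels(status=m.group("status")).inc()
```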

A few comments on the current dashboard:

  • a very quick look at Turnilo: the graphs look different enough that I'd like to know why there are discrepancies
  • as discussed, we should define the service as "working" not only when returning HTTP/200, but also when requests are throttled (429) or banned (403)
  • we probably need to dig a bit more into other response codes and the dips we see in the graph to understand what they are and if they are problematic (and thus refine our definition of a "working" service)

Just following up here: dashboard was updated to accept any of 200, 403, or 429 as successful as far as our SLI is concerned. Working on updating our SLO documentation accordingly.
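
For reference, the resulting SLI is just the ratio of 200/403/429 responses to all responses, e.g.:

```
# SLI sketch matching the updated dashboard: 200, 403 and 429 all count as the
# service "working"; everything else counts against availability.
GOOD_STATUSES = {"200", "403", "429"}

def sli(requests_by_status):
    """Fraction of requests considered successful for SLI purposes."""
    total = sum(requests_by_status.values())
    good = sum(count for status, count in requests_by_status.items()
               if status in GOOD_STATUSES)
    return good / total if total else 1.0

# Example: sli({"200": 9500, "429": 300, "403": 50, "503": 150}) == 0.985
```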

With https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938, we now have the grizzly dashboard where we want it. That was the last blocker for closing out this ticket, so this should be all done.