
Define an SLO for Wikidata Query Service public endpoint and communicate it
Closed, Declined · Public

Description

As noted in a number of other places, a public SPARQL endpoint is fragile by nature. We have had a number of problems leading to slowdowns, timeouts or loss of service. In quite a few cases, those incidents paged our SRE team, sometimes at odd hours, but without giving them the means to do anything significant to restore the service.

It is already understood by most of our users and by the people directly supporting WDQS that this public endpoint is not expected to have the same service level as most of our public endpoints. We should clarify that situation, communicate it better and adapt our alerting to reflect the expected service level objectives.

  • define the expected SLO (a rough sketch with placeholder values follows this list)
    • acceptable response time
    • acceptable update lag
    • acceptable duration of downtime
  • communicate this to users (on wiki? on the query.wikidata.org page?)
  • adapt paging to the level defined
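
As a purely illustrative sketch of what such a definition could eventually look like, the structure below records the three dimensions above. Every number in it is a placeholder, not a value proposed or decided in this task:

```python
# Hypothetical SLO record for the public WDQS endpoint.
# All thresholds below are placeholders pending the discussion in this task.
WDQS_PUBLIC_SLO = {
    # Fraction of trivial probe queries (e.g. ASK { ?x ?y ?z })
    # that must succeed within the response-time budget.
    "availability_target": 0.99,             # placeholder
    # Acceptable response time for the probe query.
    "response_time_p95_seconds": 30,         # placeholder
    # Acceptable update lag before alerting/paging.
    "max_update_lag_minutes": 60,            # placeholder
    # Acceptable duration of downtime.
    "max_downtime_minutes_per_month": 240,   # placeholder
}
```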

Note that this concerns the public WDQS endpoint. The internal endpoint, which is more controlled, is expected to be highly available, with predictable performance.

Event Timeline


Is there a place where I can look at the current thresholds for this and other services? Then I could probably give some more meaningful input.

You can have a look at the historical values we have for

I don't think we have SLOs defined for other services. For most services, the approach is "the service should always be available". We can discuss whether that is a good idea in general, but for the public WDQS endpoint it is not appropriate, or at least does not reflect reality.

@Lydia_Pintscher if this does not answer your question, it is probably because I did not understand it correctly. Ping me for a chat if need be.

I think response times and the number of timeouts are not good metrics for this type of thing - people run a lot of heavy queries, for which it's OK to time out, and query response time depends largely on the type of queries (and bots) running. We could look at p95 or p50, but we would be measuring factors that are largely outside of our control. Lag is mostly under our control (or should be), except for rare cases like a massive change dump. So I think we do need to think about how exactly we define what we want to achieve.

I think response times and the number of timeouts are not good metrics for this type of thing

To echo what @Smalyshev is saying, yes, I agree that we don't have good measures of either the reliability or the performance of WDQS. And that is somewhat related to the fact that we don't have a good definition of what those should be. We could measure the response time of a standard query and see how it evolves over time. I know that the query we use to check availability from LVS (ASK { ?x ?y ?z }) does time out from time to time.
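
For illustration only, a minimal external probe along those lines could run the same trivial ASK query and record how long it takes. This is a sketch, not the actual LVS check; the endpoint URL, timeout and User-Agent are assumptions:

```python
import time
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public endpoint (assumed)
PROBE_QUERY = "ASK { ?x ?y ?z }"                     # same trivial query as the LVS check

def probe_wdqs(timeout=10.0):
    """Run the trivial ASK query once and return (success, latency_in_seconds)."""
    start = time.monotonic()
    try:
        response = requests.get(
            WDQS_ENDPOINT,
            params={"query": PROBE_QUERY, "format": "json"},
            headers={"User-Agent": "wdqs-slo-probe-example/0.1"},
            timeout=timeout,
        )
        ok = response.status_code == 200 and response.json().get("boolean") is True
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start

if __name__ == "__main__":
    ok, latency = probe_wdqs()
    print(f"probe ok={ok} latency={latency:.3f}s")
```

Run periodically, the success ratio and latency distribution of such a probe would give us a baseline to hang an SLO on.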

We could look at p95 or p50, but we would be measuring factors that are largely outside of our control. Lag is mostly under our control (or should be), except for rare cases like a massive change dump.

Yes and no. We could solve those issues by throwing loads of hardware at the problem. Or by changing the architecture completely. Or by requiring authentication on the service and imposing drastic usage limits. All of those are probably things we don't want to do.

What I would like, even if we don't have precise metrics, is for us to agree on a statement that the WDQS public endpoint is not expected to have high availability / stability guarantees.

(ASK { ?x ?y ?z }) does time out from time to time.

This is definitely the kind of thing that should not be happening, but I wonder how we can build a metric around it beyond "this should never happen".

We could solve those issues by throwing loads of hardware at the problem.

I guess. But before that, we should define what the issues are :) I.e. do we want to get p95 or p50 into some range? Which range? Do we even care if p95 is 0.5s or 10s or 30s? If p99 is 40s, is that good or bad? Right now I am not sure I know how to answer these questions.
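
To make the question concrete, here is a small sketch of how those percentiles would be computed from a sample of observed response times; the sample values below are made up for illustration:

```python
import numpy as np

# Hypothetical response times in seconds, e.g. collected by a periodic probe
# or extracted from request logs. These values are made up for illustration.
response_times = np.array([0.2, 0.4, 0.4, 0.9, 1.5, 2.0, 3.5, 8.0, 25.0, 58.0])

p50, p95, p99 = np.percentile(response_times, [50, 95, 99])
print(f"p50={p50:.1f}s p95={p95:.1f}s p99={p99:.1f}s")

# An SLO statement would then take the form:
#   "p95 of probe-query latency stays below X seconds over a 30-day window"
# where X is exactly the number this discussion has not yet agreed on.
```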

WDQS public endpoint is not expected to have high availability / stability guarantees.

Well, this sounds a bit like giving up on availability (even if it's not the intention), so I think we want to have something. Let's think/brainstorm on what this something could be and how we could measure it.

Smalyshev triaged this task as Medium priority. Jul 12 2018, 6:03 PM

Is anything happening on this, or has it been shelved for now?

Mentioned in SAL (#wikimedia-operations) [2018-10-09T12:54:29Z] <gehel> silencing wdqs-public lag alerts (service still functional, and SLO unclear) - T199228

Coming back to this discussion, I'll try to make my point clearer:

The WDQS public endpoint is by nature more fragile than most of our other services. The update lag is a good example of a problem we don't seem to be able to get under control on the public endpoint. The consequence is that we are starting to ignore those lag alerts. And we don't see any major consequences of this lag on the public cluster, which is an indication that our current alerting threshold does not match the reality of what is needed by the clients of that service. I might be wrong here and maybe this lag is an important issue; if it is, we need to address it with a high priority.

WDQS public endpoint is not expected to have high availability / stability guarantees.

Well, this sounds a bit like giving up on availability (even if it's not the intention), so I think we want to have something. Let's think/brainstorm on what this something could be and how we could measure it.

Yes, it is giving up on at least some level of availability. Or matching expectations with the reality of that availability. Or taking a strong product decision that this public SPARQL endpoint is expected to have higher availability than it currently has, and acting on that decision.

I'm not sure how much we should formalize an SLO on this endpoint, but expecting the same level of service as from other endpoints does not match reality and never will unless we take drastic action. This influences how we should react to failures on this endpoint, so it should be defined in some way.

I think update lag is not the biggest issue. Endpoint availability and response times are more important for most users, at least short-term. If there's a lag spike that goes away, most users won't even notice (persistent lag is different of course). If, however, the user's queries time out, that is different.

The problem that I see here is how to quantify what we want. We probably can reasonably promise endpoint availability, as in "can run trivial queries" (even that I am not sure how to quantify). However, when we get to the "interesting" queries, the variety is so large that I am not sure how to express any guarantees in concrete terms. Maybe p95/p99? But that can be influenced by any random bot...

Do you have any SLOs in mind that we could look at to get an impression of what that should look like?

I think update lag is not the biggest issue. Endpoint availability and response times are more important for most users, at least short-term. If there's a lag spike that goes away, most users won't even notice (persistent lag is different of course). If, however, the user's queries time out, that is different.

If update lag is not a big issue for our users, then we should make it clear. We should increase the threshold on the Icinga alert and make our alerting match the reality of the expectations. That would already be a big step forward from my point of view. I (or @Mathew.onipe) would not have to try to depool servers to help them catch up, or feel bad for doing nothing when an alert is raised.

The problem that I see here is how to quantify what we want. We probably can reasonably promise endpoint availability, as in "can run trivial queries" (even that I am not sure how to quantify). However, when we get to the "interesting" queries, the variety is so large that I am not sure how to express any guarantees in concrete terms. Maybe p95/p99? But that can be influenced by any random bot...

Do you have any SLOs in mind that we could look at to get an impression of what that should look like?

I don't have anything that would directly apply to the WDQS use case. One of the limits is: "if you provide a crappy / expensive / wrong query, we're not going to give you good results in a good time frame". And WDQS is by nature definitely more sensitive to this than other services that we expose.

My concerns come from an operational standpoint. Do I (or someone else) have to wake up during the night if I receive an alert from this service?

If update lag is not a big issue for our users, then we should make it clear.

A more precise statement would be: a temporary update lag is not the highest-priority issue. I.e. getting hour-old data for a short time is not going to matter for most users, while e.g. not getting any response at all would be a problem for everybody.

Note that long-term lag (e.g. one lasting for hours) is still an issue, so if a server is not catching up quickly, it still needs to be depooled, etc.
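
As a side note, the update lag itself can be estimated from the outside with a standard query against the endpoint. A rough sketch, assuming the usual schema:dateModified triple on <http://www.wikidata.org> is exposed (the endpoint URL and timeout are assumptions):

```python
from datetime import datetime, timezone
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public endpoint (assumed)
# schema:dateModified on <http://www.wikidata.org> reflects the last change
# the updater has written to the triple store.
LAG_QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?dateModified WHERE { <http://www.wikidata.org> schema:dateModified ?dateModified }
"""

def wdqs_update_lag(timeout=10.0):
    """Return the approximate updater lag in seconds, or None on failure."""
    try:
        response = requests.get(
            WDQS_ENDPOINT,
            params={"query": LAG_QUERY, "format": "json"},
            headers={"User-Agent": "wdqs-lag-probe-example/0.1"},
            timeout=timeout,
        )
        value = response.json()["results"]["bindings"][0]["dateModified"]["value"]
    except (requests.RequestException, ValueError, KeyError, IndexError):
        return None
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - last_update).total_seconds()

if __name__ == "__main__":
    print(f"estimated update lag: {wdqs_update_lag()} seconds")
```

An alerting threshold (Icinga or otherwise) would then simply compare that number against whatever value we agree on.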

Change 470819 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: raise alerting threshold on updater lag for public cluster

https://gerrit.wikimedia.org/r/470819

Sorry for having neglected this. About the update lag: it is true that for most users a few minutes, or in the worst case even hours, don't make a huge difference. It does however matter quite a bit for editors and for tools that use the query service to operate on live data when adding new data or doing data maintenance. Adding @Magnus and @Pintoch so they can maybe add a few more details on whether it would impact their tools, so we can get a better understanding of the current situation.
That being said, if there are very frequent alerts that we mostly ignore, then they are kind of useless.

Change 470819 merged by Gehel:
[operations/puppet@production] wdqs: raise alerting threshold on updater lag for public cluster

https://gerrit.wikimedia.org/r/470819

Thanks for the ping Lydia! Off the top of my head, the only uses of SPARQL in the tools I maintain are in the openrefine-wikidata interface:

  • queries to retrieve the list of subclasses of a given class - lag is not critical at all for this as the ontology is assumed to be stable. (These results are cached on my side for 24 hours, for any root class.)
  • queries to retrieve items by external identifiers or sitelinks - lag can be more of an issue for this but I would not consider it critical. (These results are not cached.)

What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.

What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.

I'd be most interested in how well this is going at the moment! The open-to-all nature and widely varying cost of requests on WDQS make it really hard to guarantee the same kind of service level as you might expect from other services. But we don't have a good measurement of the current quality of service, so your subjective observations are very valuable!

@Gehel my service has been quite unstable for some time, but I haven't found the time yet to find out exactly where the problem is coming from - it could be SPARQL, the Wikidata API, redis or the webservice itself. I will add a few more metrics to understand what is going on and report back here.

To support what Smalyshev said: occasional temporary update lag may not be such a high-priority issue; but prolonged or repeated update lag would rapidly become one.

A common workflow for editors making manual edits to fix problems on Wikidata is to use WDQS to generate a list of anomalies; then to investigate and manually fix the first 'n' of those anomalies; then to re-run the WDQS query to get an updated list of the anomalies remaining, with luck now excluding those involving the 'n' items that have been edited.

This requires WDQS to be reasonably up to date most of the time. A lag of 5 minutes isn't such a problem. An occasional longer lag, if clearly signposted (as the WDQS GUI does), also isn't such a problem -- if the server is having a slow moment, one can go away and work on something else for a while.

But if update lags become sustained or repeated, then this breaks the workflow and does become a problem.

However, what is even worse for this workflow is if edits get missed -- i.e. leading to anomalies that should be reported getting missed, or anomalies that should have been cleared failing to disappear. Accurate eventual synchronisation remains the #1 priority.

Thanks for the feedback!

This requires WDQS to be reasonably up to date most of the time. A lag of 5 minutes isn't such a problem. An occasional longer lag, if clearly signposted (as the WDQS GUI does), also isn't such a problem -- if the server is having a slow moment, one can go away and work on something else for a while.

In the current situation, we are doing far worse than only occasional lag higher than 5 minutes (see graph of the last 7 days). Being able to support this kind of SLO would require a significant amount of work and rearchitecture.

I'm not saying we should not do it, just that we're not there at the moment. And that if we put a strong constraint on updater lag, we should review the way we manage this public endpoint.

However, what is even worse for this workflow is if edits get missed -- i.e. leading to anomalies that should be reported getting missed, or anomalies that should have been cleared failing to disappear. Accurate eventual synchronisation remains the #1 priority.

@Smalyshev is doing great work in tracking the synchronisation issues! I don't think we have a good measurement of the miss rate (having that measure is probably almost as hard as fixing the actual issues).

I note that if we manage to formally define an SLO, missed edits should be part of it.

In many cases, especially bot/background tasks (e.g. Listeria), a lag of hours is not critical. This is also true for many interactive tools, where the user gets some items matching certain criteria.

However, two situations I see as problematic:

  • "Instant gratification" - You see a problem via one of my tools, go fix it, reload to see how nicely it's fixed now - except it isn't. Frustrating, but tolerable.
  • Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks, for a large number of scientific publications, whether they already exist on Wikidata. If not, it creates them. Now, if people have accidentally listed the same paper twice, or if two different batch jobs check/create the same paper, the first creation is sometimes "invisible" due to SPARQL lag, and a duplicate item is created. This has apparently happened a lot in the last few days.
Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks, for a large number of scientific publications, whether they already exist on Wikidata. If not, it creates them. Now, if people have accidentally listed the same paper twice, or if two different batch jobs check/create the same paper, the first creation is sometimes "invisible" due to SPARQL lag, and a duplicate item is created. This has apparently happened a lot in the last few days.

Interesting. I'm probably lacking context, so apologies if I'm completely wrong here.

It sounds very much like an asynchronous process (WDQS updates) is being used as part of an operation that should be transactional (creating an item only if it does not exist). Even if we somewhat improve the lag, it still seems like the wrong tool for the job. Sadly, I don't have a proposal for a better tool for that job.

The search interface can also be used for that thanks to the haswbstatement keyword. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.
Retrieving items by identifiers is quite crucial in many tools so it would be useful to have a solid interface for that instead of relying on SPARQL (which feels indeed like using a sledgehammer to crack a nut).
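
For illustration, a lookup by external identifier through the search keyword mentioned above might look roughly like the sketch below; the property (P356, DOI) and value are made-up examples and the exact API parameters should be double-checked:

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"

def items_with_statement(prop, value, limit=10):
    """Return item IDs whose search index matches the given statement,
    via the CirrusSearch haswbstatement keyword (index lag still applies)."""
    response = requests.get(
        WD_API,
        params={
            "action": "query",
            "list": "search",
            "srsearch": f"haswbstatement:{prop}={value}",
            "srlimit": limit,
            "format": "json",
        },
        headers={"User-Agent": "haswbstatement-lookup-example/0.1"},
        timeout=10,
    )
    return [hit["title"] for hit in response.json()["query"]["search"]]

# Example: look up items by DOI (P356); the DOI below is a placeholder.
print(items_with_statement("P356", "10.1000/EXAMPLE"))
```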

The search interface can also be used for that thanks to the haswbstatement keyword. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.

The lag on elasticsearch is typically lower than what we have on WDQS. But that's still an asynchronous process, with lag expected to climb occasionally.

Retrieving items by identifiers is quite crucial in many tools so it would be useful to have a solid interface for that instead of relying on SPARQL (which feels indeed like using a sledgehammer to crack a nut).

Again, I'm missing context / knowledge, but accessing an item by identifier sounds like an operation that should be exposed by a Wikidata API directly, not by a secondary datastore like WDQS/Blazegraph or Search/Elasticsearch.

To follow up on @Jheald and @Lydia_Pintscher's comments here - for a lot of use-cases, I agree that a lag measured in a few hours isn't much of a problem, because the underlying data is fairly static. Most property values don't change minute-to-minute or even month-to-month, most ontologies and hierarchies are reasonably stable, and so on. Queries which are "tell me something interesting about the underlying data" will tend to return reasonably good results whatever the update lag, assuming that it's not something that changes frequently or has been recently worked on.

But maintenance work often presumes a much quicker response rate, and people have built workflows around that expectation - it has been, historically, pretty reliable at maintaining an update lag of a minute or so. Maybe we've just got soft and lazy because it's been so good :-)

To give a practical example, I have a regular data-cleaning workflow which uses a series of related queries to look at a class of people, with one set of queries identifying if the group is complete, another set bringing up various bits of metadata so it can be manually checked, and a third set looking for inconsistencies between them. Corrections at one stage can feed into another, and cleaning a set of data often means running the reports two or three times to check everything fits together accurately after changes have been made. Once I'm confident it's all complete and comprehensive, I mark it as validated and move on to the next one.

In this sort of situation, a few minutes of lag is no problem at all, but once it's a few hours it becomes very challenging to complete a single batch in one session. If that then means having to do it over several days, it leads to extra work as well as an increased chance of mistakes through human error (e.g. losing track of which bits I've done).

I don't want to say this is the biggest problem there is, or anything like that, and I would completely agree that ultimately accuracy of the results and the system being reliably up are very much priorities #1 and #2, with lag behind them at #3. However, increasing lag does still have noticeable impacts on maintenance work, and doing that work becomes more difficult once lag gets sufficiently long.

There's another SLO which I think ought to be relevant: ensuring that the data in the WDQS nodes accurately reflects the data upstream of the service, or at least that the data is consistent between query nodes.

We've noticed a few instances of nodes becoming out of sync (see e.g. T203646 and T199916) and as far as I can tell these events aren't actively monitored. It would be great for the WDQS SLO to say "nodes will accurately reflect the data within X minutes/hours" (as opposed to "the update lag will be at most X minutes/hours"), and for each node to be monitored to spot when data has gotten lost on the way.

ensuring that the data in the WDQS nodes accurately reflects the data upstream of the service, or at least that the data is consistent between query nodes

I am not sure how you would propose ensuring that. Given a database of almost 7 billion triples, it is not feasible to compare two copies of it or to verify the whole database against Wikidata.

as far as I can tell these events aren't actively monitored

How would you propose to actively monitor these?

for each node to be monitored to spot when data has gotten lost on the way

If we had a process to know in advance which data has gotten lost, we could use the same process to recover the data. The whole problem is that we don't know when data is lost - this happens when the existing data synchronization process does not work for some reason.
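
One cheap, partial check in that direction (it catches a stuck or diverging updater, not silently missed edits, which is exactly the hard part described above) would be to compare the last-update timestamp reported by each node. The hostnames and URLs below are assumptions for illustration only:

```python
import requests

# Hypothetical per-node SPARQL endpoints; real hostnames, ports and paths differ.
NODES = {
    "wdqs-node-1": "http://wdqs-node-1.example.internal/sparql",
    "wdqs-node-2": "http://wdqs-node-2.example.internal/sparql",
}

LAG_QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?dateModified WHERE { <http://www.wikidata.org> schema:dateModified ?dateModified }
"""

def last_update(endpoint):
    """Return the dateModified value reported by one node, or None on failure."""
    try:
        response = requests.get(
            endpoint, params={"query": LAG_QUERY, "format": "json"}, timeout=10
        )
        return response.json()["results"]["bindings"][0]["dateModified"]["value"]
    except (requests.RequestException, ValueError, KeyError, IndexError):
        return None

if __name__ == "__main__":
    for name, endpoint in NODES.items():
        print(name, last_update(endpoint))
    # A monitoring check could alert when the spread between nodes, or the lag
    # of any single node, exceeds whatever threshold the SLO eventually defines.
```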

Smalyshev changed the task status from Open to Stalled. Mar 20 2019, 12:10 AM

I'm trying to figure out what the volume of queries on WDQS may be:

If I get https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=18&fullscreen&orgId=1
correctly, it's < 50 queries a second on each of some 12 servers?

Personally, I don't think some lag is problematic, as long as one is aware of it and, an hour later, the lag hasn't simply grown by another hour.

@Smalyshev: Maybe do it like jsub on Toollabs: give an option to specify the expected runtime? Based on this, the load balancer in front of the different SPARQL services could assign you a node.

We are in the process of significantly changing the architecture of WDQS. We will address a better definition of what the services are supposed to provide as part of redefining those services. There is already a lot more awareness of the limitations of the current WDQS, which was one of the secondary goals of this ticket.