As noted in a number of other places, a public SPARQL endpoint is fragile by nature. We have had a number of problems leading to slowdowns, timeouts, or loss of service. In quite a few cases, those incidents paged our SRE team, sometimes at odd hours, without giving them the means to do anything significant to restore the service.
Most of our users, and the people directly supporting WDQS, already understand that this public endpoint is not expected to have the same service level as most of our public endpoints. We should clarify that situation, communicate it better, and adapt our alerting to reflect the expected service level objectives (SLO).
- define the expected SLO
  - acceptable response time
  - acceptable update lag
  - acceptable duration of downtime
- communicate this to users (on wiki? on the query.wikidata.org page?)
- adapt paging to the level defined
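
To make the discussion concrete, the three SLO dimensions listed above and the "adapt paging" step could be sketched roughly as follows. This is only an illustration: the names `Slo`, `Observation`, `violations`, and `should_page` are hypothetical, and the threshold numbers are placeholders, not proposed or agreed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO targets for the public endpoint (placeholder fields)."""
    max_response_time_s: float  # acceptable response time
    max_update_lag_s: float     # acceptable update lag
    max_downtime_s: float       # acceptable duration of downtime


@dataclass(frozen=True)
class Observation:
    """Measured values for the same three dimensions."""
    response_time_s: float
    update_lag_s: float
    downtime_s: float


def violations(slo: Slo, obs: Observation) -> list[str]:
    """Return the names of the SLO dimensions that the observation exceeds."""
    exceeded = []
    if obs.response_time_s > slo.max_response_time_s:
        exceeded.append("response_time")
    if obs.update_lag_s > slo.max_update_lag_s:
        exceeded.append("update_lag")
    if obs.downtime_s > slo.max_downtime_s:
        exceeded.append("downtime")
    return exceeded


def should_page(slo: Slo, obs: Observation) -> bool:
    """Page only when at least one agreed objective is actually breached."""
    return bool(violations(slo, obs))


# Placeholder numbers for illustration only; the real values are what this task must define.
PUBLIC_SLO = Slo(max_response_time_s=60.0, max_update_lag_s=3600.0, max_downtime_s=1800.0)
```

With such a definition, an alert would page only when `should_page(PUBLIC_SLO, obs)` is true, so a lag or slowdown inside the agreed bounds would not wake anyone up.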
Note that this concerns only the public WDQS endpoint. The internal endpoint, which is more controlled, is expected to be highly available, with predictable performance.