
Automatically depool wdqs servers that are "lagged"
Open, Medium · Public · 3 Estimated Story Points

Description

As a wdqs user I would like servers that are lagged to be depooled so that I don't experience stale results.

We (wdqs maintainers) often have to depool wdqs servers manually because they are heavily lagged. This has several drawbacks:

  • it relies on manual intervention
  • the operator who depooled the server in the first place must remember to repool it once the lag is back to acceptable values

AC:

  • a server should be automatically depooled if its lag reaches a certain threshold (re-use the same threshold used by icinga?)
  • a server should be automatically repooled when its lag is back to normal values
  • do not automatically depool more servers than we can afford to lose while still serving current traffic
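
The acceptance criteria above could be sketched roughly as follows. This is only an illustration, not an existing tool: the `Server` type, the `plan_pool_changes` function, and the parameters `depool_above`, `repool_below`, and `min_pooled` are all hypothetical names, and the threshold values in the usage example are made up rather than taken from icinga. The sketch assumes a lower repool threshold than the depool threshold so that a server hovering around the limit does not flap.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str           # hypothetical host identifier
    lag_seconds: float  # current update lag reported for this server
    pooled: bool        # whether the server currently receives traffic


def plan_pool_changes(servers, depool_above, repool_below, min_pooled):
    """Return (to_depool, to_repool) as lists of server names.

    All parameters are assumptions made for this sketch:
    - depool_above: lag (seconds) at or above which a pooled server is a depool candidate
    - repool_below: lag (seconds) at or below which a depooled server is repooled
    - min_pooled:   number of servers that must stay pooled to serve traffic
    """
    # Servers whose lag has recovered go back into the pool.
    to_repool = [
        s.name for s in servers
        if not s.pooled and s.lag_seconds <= repool_below
    ]

    # Count what would be pooled once the recovered servers are back.
    pooled_count = sum(1 for s in servers if s.pooled) + len(to_repool)

    # Depool the most lagged servers first, but never go below min_pooled.
    candidates = sorted(
        (s for s in servers if s.pooled and s.lag_seconds >= depool_above),
        key=lambda s: s.lag_seconds,
        reverse=True,
    )
    budget = max(0, pooled_count - min_pooled)
    to_depool = [s.name for s in candidates[:budget]]

    return to_depool, to_repool


# Usage with made-up hosts and thresholds.
servers = [
    Server("wdqs-a", 10, True),
    Server("wdqs-b", 5000, True),
    Server("wdqs-c", 9000, True),
    Server("wdqs-d", 20, False),
]
to_depool, to_repool = plan_pool_changes(
    servers, depool_above=3600, repool_below=600, min_pooled=3
)
```

One limitation of this sketch: it repools any depooled server whose lag is low, so a real implementation would need to distinguish servers depooled for lag from servers depooled by an operator for other reasons.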

Event Timeline

Restricted Application added a subscriber: Aklapper.
CBogen triaged this task as High priority. Jan 4 2021, 4:28 PM
CBogen moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.
Gehel lowered the priority of this task from High to Medium. Jun 10 2021, 2:55 PM
Gehel raised the priority of this task from Medium to High. Aug 26 2021, 1:08 PM

Is this still a high priority now that we've deployed the Streaming Updater?

Yeah, over the last few weeks we had a few cases where one wdqs server was lagging significantly behind, which led to very high maxlag and brought all bot activity on Wikidata to a halt until the server was depooled/restarted.

Gehel subscribed.

Moving this ticket to the current work board following the incident review of T336134.

Per today's triage meeting: we have a liveness probe in our nginx config, and we need to replace it with something smarter.

Gehel lowered the priority of this task from High to Medium. Dec 6 2023, 1:28 PM
Gehel moved this task from Misc to Toil / Automation on the Data-Platform-SRE board.