
Provide the data on the availability of certain services of Wikibase Cloud
Closed, Resolved | Public | 8 Estimated Story Points

Description

As a Wikibase Cloud EM, I want to know the daily availability (as a percentage) of defined Wikibase Cloud services, so that I can learn whether a good enough level of service is being provided to Wikibase Cloud users.

A/C

  • Regularly check that an Item page can be accessed
  • Regularly check that Item data can be accessed using the API (wbgetentities)
  • After this is deployed to production/staging:
    • Document how the wikis are set up.
    • Document how to view the graphs.
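For illustration only, the two checks in the A/C could be sketched in Python as below. This is not the implementation from the linked PRs (the storytime notes mention uptime checks being done via Terraform); the host name, the item ID, the "Item:" namespace in the page URL and all function names are assumptions made for the example.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical values: the real dedicated wiki and item are not named in this task.
WIKI_HOST = "example.wikibase.cloud"
ITEM_ID = "Q1"


def item_page_url(host: str, item_id: str) -> str:
    # Assumption: items are served from the "Item:" namespace.
    return f"https://{host}/wiki/Item:{item_id}"


def api_url(host: str, item_id: str) -> str:
    query = urlencode({"action": "wbgetentities", "ids": item_id, "format": "json"})
    return f"https://{host}/w/api.php?{query}"


def api_body_is_healthy(body: str, item_id: str) -> bool:
    """True if a wbgetentities response contains the item and it is not missing."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    if not isinstance(data, dict) or data.get("success") != 1:
        return False
    entities = data.get("entities")
    if not isinstance(entities, dict):
        return False
    entity = entities.get(item_id)
    return isinstance(entity, dict) and "missing" not in entity


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Return (status, body); any network or HTTP error counts as a failed check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return 0, ""


def run_checks(host: str, item_id: str, fetch_fn=fetch) -> dict[str, bool]:
    """Run both checks once and report which ones passed."""
    page_status, _ = fetch_fn(item_page_url(host, item_id))
    api_status, api_body = fetch_fn(api_url(host, item_id))
    return {
        "item_page": page_status == 200,
        "wbgetentities": api_status == 200 and api_body_is_healthy(api_body, item_id),
    }
```

Calling `run_checks(WIKI_HOST, ITEM_ID)` performs one round of both checks. Passing a different `fetch_fn` allows the logic to be exercised without network access.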

On Wikibase.dev, Kubernetes uptime checks run every 60 seconds. The same frequency would be adequate on Wikibase.cloud as well.
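As a rough illustration of what a daily availability percentage means at a 60-second check interval, the arithmetic can be sketched as follows. It assumes availability is simply the share of successful checks per day from a single location; the monitoring UI may compute it differently (for example across several regions), and the function names are made up for this example.

```python
CHECK_INTERVAL_SECONDS = 60  # frequency mentioned in the task description
SECONDS_PER_DAY = 24 * 60 * 60


def expected_checks_per_day(interval_seconds: int = CHECK_INTERVAL_SECONDS) -> int:
    """Number of checks one location performs in a day at the given interval."""
    return SECONDS_PER_DAY // interval_seconds


def daily_availability(successful_checks: int, total_checks: int) -> float:
    """Availability as a percentage of successful checks (assumed definition)."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


total = expected_checks_per_day()           # 1440 checks per day
availability = daily_availability(total - 3, total)
print(f"{availability:.2f}%")               # prints "99.79%" for 3 failed checks
```

With 1440 checks per day, each failed check lowers the daily figure by roughly 0.07 percentage points.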

Note:
"Good enough level of service" is an internal, informal target set by the WMDE Engineering team; checking against it is not in the scope of this task.

Notes from storytime:

  • Uptime checks are not currently implemented on WB Cloud, just on wikibase.dev
  • Could potentially hook up to GC metrics platform (storage and presentation of data). This needs investigation as part of this story.
  • Which wiki are we checking? E.g. is there a dedicated test wiki that we create, or do we randomly select a user's wiki? T: on wikibase.dev the pattern is that we point to a dedicated wiki set up by the dev team (coffeebase). If we eventually add additional things to monitor (e.g. editing), it makes sense to use a dedicated wiki. LM: all for a simple solution.
  • Where is the data presented? Dashboard, email, report, etc. Wikibase.dev is not hooked up to anything to store/present the data. LM: the most important thing is that we have the data. Presentation/alerts/etc. could be handled later if sufficiently complex.
  • Apply as well to Wikibase.dev? T: keeping both as similar as possible would be beneficial. LM: it's not a requirement, but it could be decided as the right way forward in the implementation. T: uptime checks are being done via Terraform, so there could potentially be a benefit in aligning there.

Event Timeline

toan updated the task description.
toan subscribed.

Putting this up for initial review. This is not a production deployment yet; I am rather asking whether it looks sane and for permission to try it out on staging :)

https://github.com/wmde/wbaas-deploy/pull/229

TODO: We should really document how to set these wikis up and how we maintain them.

So after some testing, this seems to work fine for all regions. The new checks also worked fine and we added one improvement to the PR.

In some chats with @WMDE-leszek we talked about the presentation of this and whether the UI provided by Google was sufficient, and it seemed pretty good. So no additional hooking up to Google metrics or anything like that should be required. However, once this has been deployed "everywhere", we should look at documenting the process for maintaining the wikis and how one can find the actual metrics and view the graphs.

To review:

deployment to staging: https://github.com/wmde/wbaas-deploy/pull/229

production deployment: https://github.com/wmde/wbaas-deploy/pull/232

toan updated the task description.
toan updated the task description.

Deployed to staging and production.

Tarrow claimed this task.