If possible, create a health check endpoint for our services
Open, Medium · Public · Spike

Description

It would be helpful to have a health check endpoint for our services. We already have a health check endpoint live somewhere; can we update it to reflect our system's 'health' accurately?

  • Find the existing endpoint and where it retrieves its data from
  • Find out whether adjusting it (or creating a new one if necessary) is possible, and how
  • If yes, do it (create a follow-up task)

Note: We could get started with a super simple user-facing page with the same script we use to check our services' responsiveness during our deployment process.

Another note: changing the existing /_info endpoint to suit our needs, or creating a new one, will require some time, communication, and input from SRE.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Oct 31 2025, 7:29 AM

It may be possible for us to set up a user-facing page that behaves as a type of health check; we could reuse the scripts from our deploy process that ping our services to ensure they are alive and responsive. Starting off, the page would be sparse, albeit usable. What do you think about wiring this up now and beautifying it next year (in the next fiscal quarter or two)?
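For illustration, a minimal sketch of what such a sparse page could be built on, assuming the deploy-time check boils down to "request each service's URL and see whether it answers". The service names, the `example.invalid` URLs, and the function names `http_probe`, `check_services`, and `render_status_page` are all hypothetical and are not taken from our actual deploy scripts.

```python
import http.client
import urllib.request
from typing import Callable, Dict

# Hypothetical service list; the real one would come from the deploy scripts.
SERVICES: Dict[str, str] = {
    "orchestrator": "http://orchestrator.example.invalid/_info",
    "evaluator": "http://evaluator.example.invalid/_info",
}


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers URLError/HTTPError, connection errors, timeouts,
        # and malformed HTTP responses.
        return False


def check_services(
    services: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, bool]:
    """Probe every service and map its name to True (responsive) or False."""
    return {name: probe(url) for name, url in services.items()}


def render_status_page(results: Dict[str, bool]) -> str:
    """Render a sparse plain-text status page from the probe results."""
    overall = "all services responsive" if all(results.values()) else "degraded"
    lines = [f"{name}: {'up' if ok else 'DOWN'}" for name, ok in sorted(results.items())]
    return "\n".join([f"Status: {overall}", *lines])
```

Passing the probe in as a parameter means the page logic can be exercised without touching the network, and the same function could later be pointed at a richer check than a bare HTTP request.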

Product Q for Laura (can't find her account yet) and cc'ing @DSantamaria

@ecarg, is the current _info endpoint available to our users? If so, how?

No, the _info endpoint is the health-check for k8s (and doesn't check much health!).

@Jdforrester-WMF @ecarg, Then why do we need a user-facing page for doing a health check in this case?

It's not clear to me what the purpose of this task is. Are we trying to give people visibility? (Which people?) Are we looking for alerting / monitoring? Are we trying to get the system to heal itself?

This task was an action item from one of our team meetings following the Py Evaluator outage; IIRC it was @DSantamaria's proposal to set up a user-facing system availability endpoint.

@ecarg @Jdforrester-WMF If I did propose that, I am now unsure about the goal of having a user-facing endpoint. My main goal is observability. If this can be done with an internal endpoint similar to the _info one, that would definitely work for me. The idea is to have observability and alerting that trigger when one of our services stops working, rather than being alerted by the community, as happened in the team's last outages.
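To make the alerting idea concrete, here is a rough sketch of the logic: count consecutive failed checks per service and notify the team once a threshold is crossed, then again on recovery. The class name `AlertTracker`, the `notify` callback, and the threshold of 3 are assumptions for illustration; in practice this would probably be expressed as a rule in whatever monitoring stack SRE already runs rather than as custom code.

```python
from typing import Callable, Dict


class AlertTracker:
    """Fire an alert after `threshold` consecutive failed checks per service."""

    def __init__(self, notify: Callable[[str], None], threshold: int = 3) -> None:
        self.notify = notify          # hypothetical hook: chat, email, pager...
        self.threshold = threshold
        self.failures: Dict[str, int] = {}

    def record(self, service: str, healthy: bool) -> None:
        """Record one check result and notify on state changes only."""
        previous = self.failures.get(service, 0)
        if healthy:
            # Announce recovery only if an alert was raised earlier.
            if previous >= self.threshold:
                self.notify(f"{service} recovered")
            self.failures[service] = 0
            return
        self.failures[service] = previous + 1
        # Alert exactly once, at the moment the threshold is reached.
        if self.failures[service] == self.threshold:
            self.notify(f"{service} failed {self.threshold} consecutive checks")
```

Requiring several consecutive failures avoids alerting on a single slow response, and notifying only on state changes keeps a long outage from producing a message on every check.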