The troubleshooting process around T370304 was needlessly painful.
Something as "simple" as telling which k8s pods were talking to the s4 master required some really gross hacks. This made it very hard to determine which MediaWiki deployment (mw-jobrunner vs mw-api-ext, basically) triggered each outage.
It also took several outage cycles before we had a set of debugging tools that any SRE could invoke manually or that ran automatically. (For example, see the edit history on P67012.)
This task is to discuss improvements we can make in the near term (roughly on the order of hours to weeks).
I have a lot of ideas and suggestions, but I'll file them as comments or subtasks rather than in this task description.