
Update/organize train deployment and related policy documentation
Open, Medium, Public

Description

In general, deployment documentation right now is a mess.

Several large pages are redundant with one another and slightly out of sync, navigation is difficult, and important details of policy are hard to find. There's also not really a single clear entry point for new deployers.

We should consolidate a number of pages under a more coherent structure, make sure everything actually reflects current practice, and improve the navigation aids. This applies to the procedural train docs as well as to descriptions of how deployments are structured overall and how backports are to be conducted.

Structural improvements and onboarding

We want more people to be confident deploying backports, and aware of how the train process affects them. To that end:

  • There should probably be an overall /Deployments portal, replacing the current calendar location
  • All the deployment docs should actually live under /Deployments
  • Calendar should probably move to /Deployments/Calendar
    • Projects that reference /Deployments will need updating:
      • Jouncebot parses Deployments
      • Do a codesearch for other stuff, ask around
  • There should be a /Deployments/Training entrypoint for new folks
  • We should establish a clear training process.
    • Open to anyone who:
      • Is in NDA / WMF / WMDE LDAP groups.
      • Has shell access.
      • Has received log triage training. (Details here could be worked out, but knowing how to deal with logs needs to be part of knowing how to deploy.)
    • Put this on the staff calendar, and offer invites: "Message me your email associated with your LDAP and I'll add you to the invite."
      • Trainer will check that people meet requirements.

Policy change tweaks

  • Holding the train
    • Mention client errors and the 1k limit in a 12-hour period before it's a UBN
    • Client errors < 100 / hour
    • Specific error budget - 2 or more times in a version?
    • Define "new" with regard to errors
  • Heterogeneous deployment/Train_deploys
    • Mention client error dashboard
    • Client errors < 100 / hour
    • Define "new" with regard to errors
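
For illustration only, here is a rough sketch of how the client error thresholds listed above might be checked against a series of hourly counts. Everything in it beyond the numbers themselves is an assumption: the function and constant names are hypothetical, the "1k limit" is read as a total across the last 12 hours, and both limits are treated as breached once reached. It does not reflect any existing tooling or the client error dashboard.

```python
from typing import Sequence

# Numbers taken from the task description; the names are hypothetical.
HOURLY_LIMIT = 100    # "Client errors < 100 / hour"
WINDOW_HOURS = 12     # "12 hour period"
WINDOW_LIMIT = 1000   # "1k limit"


def evaluate_client_errors(hourly_counts: Sequence[int]) -> dict:
    """Summarize hourly client error counts (oldest first, newest last).

    Assumption: the 1k limit applies to the sum over the most recent
    12 hours, and a limit counts as breached when it is reached.
    """
    recent = list(hourly_counts)[-WINDOW_HOURS:]
    window_total = sum(recent)
    hours_over = sum(1 for count in recent if count >= HOURLY_LIMIT)
    return {
        "window_total": window_total,
        "hours_over_hourly_limit": hours_over,
        "exceeds_window_limit": window_total >= WINDOW_LIMIT,
    }


if __name__ == "__main__":
    # Twelve quiet hours followed by one noisy hour.
    print(evaluate_client_errors([20] * 12 + [150]))
```

Whether a breach should hold the train, and how "new" errors are separated from existing ones, is exactly what the policy pages still need to define, so the sketch only reports the numbers and makes no decision.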

Event Timeline

cc: @thcipriani, @dancy if there are specifics I'm forgetting here.

LGTM! I'll mention what I voiced in our meeting: "new" is the term I struggle with.

brennen updated the task description.
brennen updated the task description.
brennen moved this task from Backlog to Doing on the User-brennen board.
brennen renamed this task from Update train policy documentation to Update/organize train deployment and related policy documentation. Feb 5 2021, 9:31 PM
brennen updated the task description.

Unlicking this cookie for the moment, as my good intentions got mugged by reality.