
Train Blocker UBN - T249045 - Retrospective
Closed, Resolved · Public

Description

T249045 caused a train blocker and a UBN ("Unbreak Now!") issue.

It was volunteer-created code, reviewed and merged by CPT; it had tests, and it was caught, actioned, and reverted.

Objective(s):

  • Review/discuss what led up to the escape of this issue.
  • Review/discuss the response.
  • Identify what was learned from this issue and the response and identify prospective improvements (if any).

Outcomes

  • Develop a set of action items
  • Assign owners for actions
  • Set up an accountability check-in to ensure actions are taken.

Event Timeline

Verbatim Meeting Notes:

2020/04/07
Issue Task: T249045
Retrospective Task: https://phabricator.wikimedia.org/T249645
Attendees: Will, Daniel, Petr, Greg, Tyler, James, Mandy
Facilitator: JR
Scribe: Will

Agenda

Approach:

  • What went well
    • Was spotted immediately on test train roll-out
    • Got rolled back
    • Cause was identified swiftly.
    • Fix was written and deployed quickly
    • This change was itself a step towards addressing fragile code.
  • What needs improvement
    • This wasn’t caught in CI
      • Are CentralAuth tests enabled?
      • Why don’t we enable all deployed extensions in gate-and-submit?
        • It would be catastrophically slow (and there are conflicts between extensions' test suites)
          • Can we have a background job just crunching those tests continuously on current master? We could then automate further and see which changes were introduced between runs (see the sketch after this list).
        • See T204252 (Have dependencies of gated extensions in the gate) for expanding the gate.
        • See also https://phabricator.wikimedia.org/T238492 for fixing the conflicts.
      • This wasn't noticed in Beta Cluster.
        • Did no-one look at the log?
          • Maybe add a step to the train process?
          • Bot to sync the mediawiki-new-errors dashboard from prod to the beta cluster logstash?
        • Maybe it didn't show up in the log?
        • See https://phabricator.wikimedia.org/T248683 for running API integration tests against beta
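
A minimal sketch of the "background job crunching those tests" idea above, in Python. The checkout path, the test command, and the polling interval are assumptions for illustration; this is not an existing Wikimedia CI job.

#!/usr/bin/env python3
"""Hypothetical background runner: keep re-testing current master."""
import subprocess
import time

REPO = "/srv/mediawiki-core"        # assumed checkout with the gated extensions
TEST_CMD = ["composer", "phpunit"]  # assumed entry point for the test suites
POLL_SECONDS = 300

def git(*args: str) -> str:
    """Run a git command in the checkout and return its trimmed output."""
    return subprocess.run(["git", "-C", REPO, *args],
                          capture_output=True, text=True, check=True).stdout.strip()

last_tested = None
last_green = None
while True:
    git("fetch", "origin", "master")
    head = git("rev-parse", "origin/master")
    if head != last_tested:
        git("checkout", head)
        ok = subprocess.run(TEST_CMD, cwd=REPO).returncode == 0
        if ok:
            last_green = head
        elif last_green:
            # Report which changes were introduced since the last green run.
            print("Tests broke somewhere in:")
            print(git("log", "--oneline", f"{last_green}..{head}"))
        last_tested = head
    time.sleep(POLL_SECONDS)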

Next steps

Objective(s):

  • Review/discuss what led up to the escape of this issue.
  • Review/discuss the response.
  • Identify what was learned from this issue and the response and identify prospective improvements (if any).

Outcomes

  • Develop a set of action items
  • Assign owners for actions
  • Set up an accountability check-in to ensure actions are taken.

Notes

  • Context
    • We used to do these, but we stopped because the action items became unmanageable, so follow-up lagged or was unsustained.
    • This kind of activity is really helpful in building confidence, particularly in the current climate where we are very focused on stability.
  • Train
    • Demonstrates that graduated deployment of the train is a reasonable approach
    • This is the first time we have caught things at the test-wiki stage
    • Significant benefit of having a layered rollout, giving us space to catch issues
  • Surprised we didn’t notice this on the beta cluster, given that this is what the staging environment is for.
    • Why did we not see this in the beta cluster? Is the environment not capable? Was it not tested?
      • JF: We don't look at the fatal log for the beta cluster; it's tedious and the beta cluster is awkward
      • DK: Did it show?
      • JF: It should have but we don’t know
      • DK: This might have only happened where the user did not use English
      • JF: It is true that beta load is radically different from real load, but issues like this have occurred before, and we have ways to try to cause them to happen, such as having en-gb set
      • JR: How has this shown up before?
      • JF: Lang is both very visible and has the potential to error
      • PP: Action: If we don't triage the beta cluster logstash, we should consider reviewing it and improving it
        • GG: There are caveats to reviewing the beta cluster's logstash. Sometimes, by design, the beta cluster differs from prod (different software versions), so there tends to be noise which may or may not be easy to differentiate; it may be non-trivial to map across.
      • DK: Selenium tests should have failed on beta; do they log in, or are they all anonymous?
      • JF: There should be some that are logged in
      • JR: Not sure
      • DK: If they only use anonymous users, that might explain it. Are the tests noisy and therefore ignored, or what are they like?
      • GG: There are some that are logged in, but they aren’t tied to pre-commit or blocking merging and can end up being passed over.
      • TC: Selenium dashboard
      • JR: There is a desire to increase testing but also keep cycle time reasonable. If there are different tests we can run, we should look at that. But equally, we should look at the root cause of the issues. Are the areas of the code fragile? The solution here is less about tests and more about refactoring.
      • JF: This change was going in the right direction, but by its nature it touched multiple interrelated pieces. The area itself was risky even if the change itself was not necessarily so.
      • DK: Action: To reduce fragility we should focus on the weakest areas of our code.
      • DK: This initialisation phase is hard to reason about because we use a lot of lazy initialisation. This is intended to make startup faster. There is a trade-off here, meaning we will not be able to get away from this entirely. Lazy initialisation is a broader area of concern.
      • TC: Were you worried about this patch going live? Is there a process that covers the circumstance where you are worried about a patch?
      • DK: Usually I would speak to developers on my team. I was not worried about this patch because it was mainly moving code between classes and adding simple DI. The issue was not in the service object but in the wiring code. To me, this seemed completely safe. The chain of events that led to this is complicated. Brad, as the author, might have spotted it.
      • JR: What I'm hearing is that this code is known to be problematic. Even if the patch itself was not worrisome, it was associated with code that was known to be problematic. To Tyler's point, is there a process to highlight that issue? Is there a way to raise awareness of this generally?
      • DK: If you know something specific might go wrong, there is a process of writing to wikitech-l or writing informally to other teams that might be impacted. Do we have any code that doesn't touch sketchy code?
      • PP: This was one of 3 UBNs that week. We would have to flag almost everything because almost everything can cause something like this.
      • DK: Everything was equally likely to cause this and every train is equally at risk.
      • TC: We have between 3 and 800 patches. It is hard to say whether something is especially risky. There are some patches that people do worry about. Having a developer-set risk level for a patch would help in setting a risk level for a train. Institutional knowledge and informal process are how we determine risk level today. Having a standardised process would be something RelEng would really like to work toward. What is the level of difficulty in getting to this?
      • GG: Action: "Scaptrap", a tagging system for devs to self-identify particularly problematic patches and highlight them for others. We could provide a means to toggle a bit to say that they are concerned.
      • JF: Generally, for complex changes, people will highlight that there may be issues for pieces such as schema changes. The issue we're seeing is a change that looks fine to the dev and tests fine for them, but fails in an edge case or a case they didn't consider (multi-DC, etc.). It would be useful as a train deployer to have flagged patches. I would likely have merged this and not flagged it. We should not focus too hard on this element.
      • DK: We would previously add an ALL CAPS line to the commit message calling out things to check before/after for changes - smoke tests?
      • DK: We could measure the number of lines changed and highlight large changes as a rough guide to areas that could be problematic. It would be very rough, and obviously one-liners can fail too, but it might be a start (see the sketch after these notes).
      • JR: Yeah, having some kind of automatic risk profile for code changes would be desirable. Asking devs to know the total scope is unrealistic. Providing an input point for them is good and useful, but it should form part of a fuller, overarching data analysis. We have a lot of data available to us; it's about working out how we use it.
      • DK: We don't run integration tests on beta.
      • JF: Action: Currently the API integration tests break on extensions, so they will need fixing. Credential storage in Jenkins isn't 'secure', so we can't easily do this for production, only Beta.
      • DK: We need a test suite that runs for vanilla MW
      • JR: Categories
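
A rough sketch of DK's line-count idea from the notes, in Python. The threshold and the risk labels are illustrative assumptions rather than an agreed process; it simply sums added and deleted lines for a patch and flags large ones.

#!/usr/bin/env python3
"""Rough diff-size risk flag for a patch (illustrative sketch only)."""
import subprocess
import sys

THRESHOLD = 400  # lines changed; an arbitrary cut-off for "large" patches

def lines_changed(repo: str, base: str, head: str) -> int:
    """Sum added and deleted lines between two revisions, skipping binaries."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--numstat", f"{base}...{head}"],
        capture_output=True, text=True, check=True).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files are reported as "-"
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    repo, base, head = sys.argv[1:4]
    n = lines_changed(repo, base, head)
    print(f"{n} lines changed; rough risk level: {'high' if n > THRESHOLD else 'normal'}")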


Actions

  • Will and JR to discuss and flesh out next steps and action items
    • Will send out for review in the next few days.

DK: We need a test suite that runs for vanilla MW

We have a test suite that runs for vanilla MW. We run it in CI. But that doesn't cover extensions.
We need one that is tailored for, and runs against, the Wikimedia environment, see T248683.
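
For illustration, a minimal sketch in Python of the kind of check T248683 is about: a test that talks to a running wiki rather than a unit-test harness. The beta cluster endpoint URL is an assumption for this example; the real work is the API integration test framework tracked in that task.

"""Minimal smoke checks against a live MediaWiki Action API endpoint."""
import requests

API = "https://en.wikipedia.beta.wmflabs.org/w/api.php"  # assumed beta endpoint

def test_siteinfo_responds():
    """The wiki answers a basic meta=siteinfo query."""
    r = requests.get(API, params={"action": "query", "meta": "siteinfo",
                                  "format": "json"}, timeout=30)
    r.raise_for_status()
    assert "general" in r.json().get("query", {})

def test_login_token_available():
    """A login token can be fetched; a logged-in test path would start here,
    which anonymous-only Selenium runs (discussed above) never exercise."""
    r = requests.get(API, params={"action": "query", "meta": "tokens",
                                  "type": "login", "format": "json"}, timeout=30)
    r.raise_for_status()
    assert r.json()["query"]["tokens"]["logintoken"]

if __name__ == "__main__":
    test_siteinfo_responds()
    test_login_token_available()
    print("beta API smoke checks passed")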

DK: Action: To reduce fragility we should focus on the weakest areas of our code.

...but these are by nature the most risky changes.

DK: We would previously add an ALL CAPS line to the commit message calling out things to check before/after for changes - smoke tests?

Like:

DEPLOY: watch for write load on foobar table

Or:

DEPLOY: post about this on the village pump

Are there actionable things left in this task? If not, should it be closed?

@Jrbranaa: Are there actionable things left in this task? If not, should it be closed?

I don't believe there are any additional actions to be taken.