
Reduce latency of new Scap releases
Closed, Resolved, Public

Description

Current Scap release process

  • Release-Engineering-Team:
    • Update version info in scap/version.py and debian/changelog
    • Add change notes to debian/changelog
    • Merge a commit with those changes
    • Tag that commit
    • File a Phabricator task requesting deployment of the new release, tagging SRE
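The first step touches two files whose versions must agree, so a mismatch is an easy way to restart the cycle. A minimal sketch of a pre-merge consistency check follows. It assumes scap/version.py contains a line of the form `__version__ = "X.Y.Z"` (an assumption; the file's layout is not shown in this task) and that debian/changelog follows the standard Debian changelog format, where the first line looks like `scap (X.Y.Z-1) unstable; urgency=medium`. The function names are hypothetical and are not part of Scap.

```python
import re


def version_from_module(text):
    """Extract the version from a `__version__ = "X.Y.Z"` line (assumed layout)."""
    m = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', text, re.MULTILINE)
    if m is None:
        raise ValueError("no __version__ assignment found")
    return m.group(1)


def version_from_changelog(text):
    """Extract the upstream version from the first Debian changelog entry."""
    if not text.strip():
        raise ValueError("changelog is empty")
    first_line = text.lstrip().splitlines()[0]
    # Header format: "<package> (<version>) <distribution>; urgency=<level>"
    m = re.match(r'^\S+ \(([^)]+)\)', first_line)
    if m is None:
        raise ValueError("first changelog line is not a package header")
    full = m.group(1)
    # Drop an optional epoch ("1:") and the trailing Debian revision ("-1").
    full = full.split(":", 1)[-1]
    return full.rsplit("-", 1)[0] if "-" in full else full


def versions_agree(module_text, changelog_text):
    """True when both files name the same upstream version."""
    return version_from_module(module_text) == version_from_changelog(changelog_text)
```

Run against the contents of the two files before merging, `versions_agree` returning False would flag a bump that was applied to only one of them. The version strings used to exercise it are made up for illustration.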

At this point we usually have to wait several days for the new release to be fully deployed. If there is a problem with the release, the cycle restarts. (Note: This describes non-emergency releases. When something is badly broken, we can usually get an SRE to rescue us with same-day service.)

This ticket is a request to discuss and select ways to drastically reduce the lead time for deploying new Scap code.

Tasks from the last couple of years, for reference:

Event Timeline

@Legoktm is working on a cookbook to speed up packaging of scap: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/727605

The rollout process has to stay as it is though (upgrade on canaries first, and roll out to all hosts after 1-2 days)

> The rollout process has to stay as it is though (upgrade on canaries first, and roll out to all hosts after 1-2 days)

For the initial version I'm just tackling the packaging part, but after that we could add the debdeploy step for the canaries and even some test syncs from the deployment server.

I'm curious what the goal of the 1-2 day delay is. In theory if we can release scap faster, then we can also fix potential regressions faster so catching them at the canary stage is less important. Obviously that depends on what kind of regressions we're expecting, which is what I'm curious about - which code paths or uses of scap need 1-2 days to test/find issues with, and is there a way we could test those things during the scap release/deploy process?

@Legoktm, suppose we debdeploy scap everywhere, and then for whatever reason we need to push change Y fast due to issue X. If scap fails everywhere because of a bug we missed, we have a problem: we first need to downgrade scap, and then rerun it. In my opinion, we should keep letting scap sit on the canaries for 1 day, which saves us from a scenario like this. To my knowledge, scap's test coverage is rather low (I admit I have not read the scap code for quite some time). If this is still the case, it gives us one more reason to be a little more careful with its rollout.
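The canary-first policy discussed above can be written down as a simple gate. This is only an illustrative sketch of the policy, not how debdeploy or the cookbook behaves; the function name, the `canary_failures` counter, and the 1-day default soak period are assumptions taken from the discussion.

```python
from datetime import datetime, timedelta


def fleet_rollout_allowed(canary_upgraded_at, now,
                          soak=timedelta(days=1), canary_failures=0):
    """Return True once the canaries have soaked long enough without failures.

    canary_upgraded_at: when the new scap package landed on the canaries.
    now:                the current time (passed in so the check is testable).
    soak:               minimum time the release must sit on the canaries.
    canary_failures:    failed scap runs observed on canaries (hypothetical input).
    """
    if canary_failures > 0:
        # Any failure on the canaries blocks the fleet-wide upgrade.
        return False
    return now - canary_upgraded_at >= soak
```

For example, with canaries upgraded at `datetime(2021, 11, 8, 10, 0)`, the gate stays closed at `datetime(2021, 11, 8, 22, 0)` and opens at `datetime(2021, 11, 9, 10, 0)`. Shortening `soak` is the knob the thread is debating: a faster release pipeline makes a shorter soak less risky, while low test coverage argues for keeping it.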

colewhite triaged this task as Medium priority. Nov 8 2021, 10:35 PM
dancy claimed this task.

Resolved via T303559