To avoid downtime like this
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ladsgroup | T138251 [Epic] Complete outstanding tasks for ORES extension deployments
Resolved | | Ladsgroup | T138234 Document safe method to deploy ORES in prod
Event Timeline
Just looking over the documentation on that page, and I wanted to make some comments.
Modify /srv/deployment/ores/deploy/scap/ores and /srv/deployment/ores/deploy/scap/ores-worker and remove all nodes except scb2001.codfw.wmnet.
Scap3 could handle this without any manual steps.
The way to make this happen in Scap3 in production would be to add a new server_group that points at a dsh target file that contains only scb2001.codfw.wmnet. So scap.cfg would be modified to add these lines:
    server_groups: canary, worker, default
    dsh_targets: ores
    worker_dsh_targets: ores-worker
    canary_dsh_targets: ores-canary
With the configuration above, you would need to add a scap/ores-canary file that contains scb2001.codfw.wmnet.
This change would deploy to scb2001 first, then to the machines listed in scap/ores-worker, and then to the machines in scap/ores, without having to perform any manual steps.
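To make the ordering concrete, here is a minimal Python sketch of how the group list maps to dsh target files. It is only an illustration, not Scap3's actual implementation: the `[global]` section name, the `deploy_order` helper, and the rule that the `default` group uses the unprefixed `dsh_targets` key are assumptions made for this example.

```python
import configparser

# Hypothetical scap.cfg fragment mirroring the lines above.
# The [global] section name is an assumption for this sketch.
SCAP_CFG = """
[global]
server_groups: canary, worker, default
dsh_targets: ores
worker_dsh_targets: ores-worker
canary_dsh_targets: ores-canary
"""


def deploy_order(cfg_text, section="global"):
    """Return (group, dsh_file) pairs in the order the groups are listed."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    cfg = parser[section]
    groups = [g.strip() for g in cfg["server_groups"].split(",")]
    order = []
    for group in groups:
        # Assumed rule: "default" reads dsh_targets, others <group>_dsh_targets.
        key = "dsh_targets" if group == "default" else f"{group}_dsh_targets"
        order.append((group, cfg[key]))
    return order


print(deploy_order(SCAP_CFG))
```

Under these assumptions the canary file comes first, followed by the worker file and then the default one, which matches the order described above.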
Run a deploy by commanding "scap deploy -v". Once it's done log into scb2001.codfw.wmnet and check the service internally by commanding "curl 0.0.0.0:8081/v2/scores/testwiki/67687".
By splitting scb2001 into its own canary group, you could define a check in scap/checks.yaml that will only run on that machine as part of the canary deployment:
    checks:
      canary_checks:
        type: command
        stage: restart_service
        group: canary
        command: curl 0.0.0.0:8081/v2/scores/testwiki/67687
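A plain curl against that URL only tells you the connection succeeded, so the check could instead call a small script that also looks at the response. Below is a rough Python sketch of such a script; the `canary_ok` helper and its `fetch` parameter are hypothetical names, and it assumes the endpoint answers with HTTP 200 and a JSON body when the service is healthy.

```python
import json
import urllib.request

# URL taken from the curl command above.
CANARY_URL = "http://0.0.0.0:8081/v2/scores/testwiki/67687"


def canary_ok(url=CANARY_URL, fetch=None, timeout=10):
    """Return True if the endpoint answers 200 with a parseable JSON body."""
    if fetch is None:
        # Default fetcher: a real HTTP request returning (status, body bytes).
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.status, resp.read()
    try:
        status, body = fetch(url)
        json.loads(body)  # assumption: a healthy response is valid JSON
    except (OSError, ValueError):
        # Connection failures and HTTP errors are OSError subclasses;
        # malformed JSON raises a ValueError subclass.
        return False
    return status == 200
```

A wrapper that exits non-zero when `canary_ok()` returns False could then be used as the check command. The `fetch` parameter is only there so the logic can be exercised without a running service.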
Hey, thank you for the detailed explanation, which taught me a lot. I'll set up a canary node ASAP. I already made T139825: Set up "canary" node for ORES in prod.
WRT checks: I think we should have some standard checks regardless, but people still need to manually test the changes they make, and what needs testing varies. For example, in one case we might be introducing new languages. The person needs to log into the canary node and check whether the new language behaves as expected (like mw1017 in MediaWiki deployments). Thanks again :)