
Document safe method to deploy ORES in prod
Closed, Resolved, Public


To avoid downtime like this

Event Timeline

Just looking over the documentation on that page, and I wanted to make some comments.

Modify /srv/deployment/ores/deploy/scap/ores and /srv/deployment/ores/deploy/scap/ores-worker and remove all nodes except scb2001.codfw.wmnet.

Scap3 could handle this without any manual steps.

The way to make this happen in Scap3 in production would be to add a new server_group that points at a dsh target file that contains only scb2001.codfw.wmnet. So scap.cfg would be modified to add these lines:

    server_groups: canary, worker, default
    dsh_targets: ores
    worker_dsh_targets: ores-worker
    canary_dsh_targets: ores-canary

With the configuration above, you would need to add a scap/ores-canary file that contains scb2001.codfw.wmnet.

This change would deploy to scb2001 first, then to the machines listed in scap/ores-worker, and then to the machines in scap/ores, without having to perform any manual steps.
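To make the ordering concrete, here is a small Python sketch that models how the group list maps to target files. It is only an illustration of the behaviour described above, not Scap3's actual implementation. The function names and the worker and default host names are hypothetical; only scb2001.codfw.wmnet and the config keys come from this discussion.

```python
def targets_key(group):
    """Return the config key that names the dsh target file for a group.

    Mirrors the naming in the snippet above: the "default" group uses
    dsh_targets, any other group uses <group>_dsh_targets.
    """
    return "dsh_targets" if group == "default" else f"{group}_dsh_targets"


def deployment_order(config, target_files):
    """Return [(group, hosts)] in the order the groups are listed."""
    groups = [name.strip() for name in config["server_groups"].split(",")]
    order = []
    for group in groups:
        file_name = config[targets_key(group)]
        order.append((group, list(target_files[file_name])))
    return order


# The four settings proposed above, as a plain dict.
config = {
    "server_groups": "canary, worker, default",
    "dsh_targets": "ores",
    "worker_dsh_targets": "ores-worker",
    "canary_dsh_targets": "ores-canary",
}

# Stand-in for the files under scap/. Hosts other than the canary are
# made-up placeholders (assumption), not real production nodes.
target_files = {
    "ores-canary": ["scb2001.codfw.wmnet"],
    "ores-worker": ["worker-a.example", "worker-b.example"],
    "ores": ["web-a.example", "web-b.example"],
}

for group, hosts in deployment_order(config, target_files):
    print(group, hosts)
```

The canary group comes first only because it is listed first in server_groups, so the order of that list is what controls the rollout sequence in this model.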

Run a deploy by commanding "scap deploy -v". Once it's done log into scb2001.codfw.wmnet and check the service internally by commanding "curl".

By splitting scb2001 into its own canary group, you could define a check in scap/checks.yaml that will only run on that machine as part of the canary deployment:

    type: command
    stage: restart_service
    group: canary
    command: curl
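If a bare curl turns out to be too coarse, the check command could instead call a small script. The following is a minimal Python sketch of such a probe, assuming the service answers plain HTTP on the canary host. The function name is hypothetical, and the URL to probe is deliberately left to the caller because the thread does not specify one.

```python
import urllib.request


def service_is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to url answers with an HTTP 2xx status.

    opener is injectable so the function can be exercised without a
    network; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts, and 4xx/5xx responses.
        return False
```

A wrapper script would call service_is_up with the service's local URL and exit non-zero when it returns False, which is the usual way for a command to signal failure to whatever runs it.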

Hey, thank you for the detailed explanation, which taught me a lot. I will set up a canary node ASAP; I have already filed T139825: Set up "canary" node for ORES in prod.
WRT checks: I think we should have some standard checks regardless, but people still need to manually test the changes they make, and what needs testing varies. For example, in one case we might be introducing new languages. The person needs to log into the canary node and check whether the new language behaves as expected (like mw1017 in MediaWiki deployments). Thanks again :)