Adapt scap's testing strategy to mw-on-k8s
Closed, Resolved | Public

Description

scap does swagger checks on bare metal canaries. To replicate that for mw-on-k8s, we need to find a way to route requests to the canary releases directly while also keeping them as part of the normal traffic path.
Currently, canaries use routed_via: main

However, the larger number of replicas used for canaries in mw-on-k8s compared to bare metal means swagger checks would end up testing only one or two of the many canary pods. We've therefore decided to use httpbb for more in-depth testing of mwdebug before proceeding to canary deployment, and to rely on logstash error-rate detection for canary testing.
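
To put a rough number on "only one or two of the many canary pods", here is a minimal sketch. It assumes each check request is routed to a canary pod uniformly at random, which is a simplification of real load balancing; the function name and the pod counts are hypothetical and not taken from scap or the deployment charts.

```python
def expected_distinct_pods(num_pods: int, num_requests: int) -> float:
    """Expected number of distinct pods reached by num_requests requests,
    assuming each request lands on one of num_pods pods uniformly at random."""
    if num_pods <= 0:
        raise ValueError("num_pods must be positive")
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    # Probability that one particular pod is missed by every request.
    miss_probability = (1 - 1 / num_pods) ** num_requests
    return num_pods * (1 - miss_probability)


def expected_coverage(num_pods: int, num_requests: int) -> float:
    """Expected fraction of the pods that receive at least one request."""
    return expected_distinct_pods(num_pods, num_requests) / num_pods
```

With a hypothetical 20 canary pods and 2 check requests, expected_distinct_pods(20, 2) is about 1.95, i.e. under 10% of the pods are exercised, whereas the same two requests against 2 bare-metal canaries would be expected to reach 1.5 of them.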

Details

Title | Reference | Author | Source Branch | Dest Branch
exp/files/php/scap.cfg: Set testservers_check_cmd_* for deploy container | repos/releng/train-dev!48 | dancy | main-I3c22e6a1f9cd1195ac0c612b5d7573c010202779 | main
Dockerfile.deploy: Add httpbb | repos/releng/train-dev!47 | dancy | main-I67368cd5d0f00406bf4c35f62c1b1f423eb0493b | main
scap sync-world: Add support for testserver checks | repos/releng/scap!230 | dancy | master-I617844a5319d104e8d9a8303dcf29f8ea34c05b7 | master

Event Timeline

Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.

We've talked this over. Swagger checks made sense when there were just a few canaries on bare-metal servers, but they would now hit a random pod from ~3% of each deployment. That makes them of little value compared to the actual traffic sent to the canaries immediately on deployment, whose validity is checked through the logstash error rate.

I agree the checks don't provide much value in their current format—we know very quickly if something is wrong via the logstash checks on those canaries.

Also, as I understand it, the swagger checks cover only a subset of what httpbb checks.

But a swagger/httpbb check would be useful if it first ran on a server without any production traffic. Then a deploy that immediately breaks all wikis would never serve any production traffic before we know that something is wrong.

Running httpbb against an mwdebug server before rolling to canaries would provide some additional safety to deploys that the swagger checks in their current format do not provide.
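
The ordering argued for here (check a target with no production traffic first, then canaries, then everything else) can be sketched as a gated rollout. This is an illustration only, not scap's implementation: the Stage type, run_staged_deploy, and the stage names are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A stage is (name, deploy action, health check run after the deploy).
Stage = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_staged_deploy(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Deploy stages in order and stop at the first stage whose check fails.

    Returns the names of the stages that passed, plus the name of the stage
    that failed (None if every stage passed). Stages after a failure are
    never deployed, so they never serve the broken code.
    """
    completed: List[str] = []
    for name, deploy, check in stages:
        deploy()
        if not check():
            return completed, name
        completed.append(name)
    return completed, None
```

In this picture the first stage would be mwdebug with an httpbb run as its check, and the second would be the canaries with the logstash error-rate check. A deploy that breaks all wikis fails at the first stage, before any production traffic reaches it.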

Agreed, scap could run the httpbb appserver test suite on mwdebug.discovery.wmnet:4444 before proceeding instead of the swagger check. This would add about 15s.

so httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 from the deploy server after deploying to mwdebug should work?

@dancy what do you think of ^ ?

Looks straightforward. I tested that command on the deploy server.

Yep, that's the check we also do hourly on all non-jobrunner releases.
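
For illustration, a deploy tool could wrap the command above roughly as follows. This is a sketch under stated assumptions, not the code in scap: the helper names and the 120-second timeout are hypothetical, and it assumes httpbb exits with a non-zero status when any test fails. Only the path, host, and flags come from the command quoted in this thread.

```python
import glob
import subprocess
from typing import List


def build_httpbb_command(test_glob: str, host: str, https_port: int) -> List[str]:
    """Build the argument list for the httpbb invocation discussed above."""
    # subprocess does not go through a shell here, so expand the glob ourselves.
    test_files = sorted(glob.glob(test_glob))
    return ["httpbb", *test_files, f"--hosts={host}", f"--https_port={https_port}"]


def run_testserver_check(command: List[str], timeout_s: float = 120.0) -> bool:
    """Return True only if the command runs to completion and exits with 0.

    Assumption: httpbb signals test failures through a non-zero exit status.
    A missing binary or a timeout is treated as a failed check.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    cmd = build_httpbb_command(
        "/srv/deployment/httpbb-tests/appserver/*", "mwdebug.discovery.wmnet", 4444
    )
    print("check passed" if run_testserver_check(cmd) else "check failed")
```

Treating a missing binary or a timeout as a failure keeps the gate conservative: the deploy only proceeds when the check positively succeeds.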

Clement_Goubert renamed this task from "Find a way to address canary releases directly" to "Adapt scap's testing strategy to mw-on-k8s". Feb 26 2024, 9:27 AM
Clement_Goubert changed the task status from Open to In Progress.
Clement_Goubert updated the task description.

Change 1007956 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Set testservers_check_cmd to httpbb in production

https://gerrit.wikimedia.org/r/1007956

@Clement_Goubert We have some questions:

  1. Does mwdebug.discovery.wmnet resolve to a random bare-metal/k8s target?
  2. Do you anticipate a case where you'd want to check only k8s testservers (e.g., when using scap sync-world --k8s-only), or only bare metal testservers? If so, what's the best way to achieve this?

  1. Does mwdebug.discovery.wmnet resolve to a random bare-metal/k8s target?

It's an active/active discovery record, so it resolves to the closest mw-debug deployment, k8s only.
Edit: Actually, it is active/active, but only pooled in the primary datacentre.

  2. Do you anticipate a case where you'd want to check only k8s testservers (e.g., when using scap sync-world --k8s-only), or only bare metal testservers? If so, what's the best way to achieve this?

I don't really anticipate a case where I would want to check bare-metal specifically, but someone may. Testing only mwdebug.discovery.wmnet when doing --k8s-only would be a good idea, and would test only mw-on-k8s.
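
One way to picture that selection, as a sketch only: the mapping, the function name, and the bare-metal hostname below are hypothetical placeholders, and the real behaviour is driven by the testservers_check_cmd_* settings, whose exact names and values are not shown in this task.

```python
from typing import Dict, List

# Hypothetical mapping from target kind to the hosts a check would hit.
# Only mwdebug.discovery.wmnet comes from this task; the bare-metal entry
# is a placeholder.
CHECK_TARGETS: Dict[str, List[str]] = {
    "k8s": ["mwdebug.discovery.wmnet"],
    "baremetal": ["baremetal-testserver.example"],
}


def select_check_targets(
    k8s_only: bool, targets: Dict[str, List[str]] = CHECK_TARGETS
) -> List[str]:
    """Return the hosts to check: k8s only for a --k8s-only style run,
    otherwise both the k8s and the bare-metal testservers."""
    kinds = ["k8s"] if k8s_only else ["k8s", "baremetal"]
    return [host for kind in kinds for host in targets[kind]]
```

With k8s_only=True this returns just mwdebug.discovery.wmnet, which matches the suggestion that a --k8s-only run should test only mw-on-k8s.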

dancy opened https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/48

exp/files/php/scap.cfg: Set testservers_check_cmd_* for deploy container

dancy merged https://gitlab.wikimedia.org/repos/releng/train-dev/-/merge_requests/48

exp/files/php/scap.cfg: Set testservers_check_cmd_* for deploy container

Change 1007956 merged by Clément Goubert:

[operations/puppet@production] scap.cfg.erb: Set testservers_check_cmd_* in production

https://gerrit.wikimedia.org/r/1007956

Mentioned in SAL (#wikimedia-operations) [2024-03-07T17:14:41Z] <dancy@deploy2002> Started scap: testing T358117

Mentioned in SAL (#wikimedia-operations) [2024-03-07T17:25:56Z] <dancy@deploy2002> Finished scap: testing T358117 (duration: 11m 15s)

Changes deployed and tested. Resolving this task.