The automated scap process currently does roughly the following, from a validation perspective:
1. Deployment server
  - Linting. This catches syntax errors for all code.
  - Pre-promote check. For the subset of code involved in setting up multiversion, wmf-config and MW core, this also catches run-time errors if they affect all wikis and/or enwiki (per T121597, implemented as echo 1 | mwscript eval.php --wiki enwiki).
2. Canary servers (a percentage of live traffic)
  - Logstash check. After 30 seconds of live traffic, consult Logstash and assert that overall error/exception levels are not significantly higher than before.
  - Swagger checks. A small set of synthetic HTTP requests with expected response codes and bodies (spec.yaml).
  - If step 2 fails we abort here, but leave that percentage of prod broken, as Scap is currently unable to auto-rollback. (Is there a task for this?)
3. Sync to the remaining prod servers.
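To make the Logstash check concrete, here is a minimal sketch of a "not significantly higher than before" comparison. It is not Scap's actual logic: the function name, the `max_ratio` threshold and the `floor` value are all hypothetical and chosen only for illustration.

```python
def error_rate_ok(before: float, after: float,
                  max_ratio: float = 2.0, floor: float = 1.0) -> bool:
    """Return True if the error rate after deploy is not significantly
    higher than the rate before it.

    `max_ratio` and `floor` are illustrative assumptions. `floor` keeps a
    near-zero baseline from turning a handful of errors into a failure.
    """
    baseline = max(before, floor)
    return after <= baseline * max_ratio


if __name__ == "__main__":
    # Hypothetical errors-per-second readings from before and after the sync.
    print(error_rate_ok(before=10.0, after=15.0))
    print(error_rate_ok(before=10.0, after=25.0))
```

A check of this kind depends on live traffic reaching the canaries, which is why it can only run after code is already serving real users.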
The Swagger checks can catch a good number of problems and don't depend on live traffic. However, we only run them after we have passed the point of no return, when real traffic is already exposed to the new code.
Run the Swagger checks earlier.
- Either on the deployment server. This requires it to become a full appserver; right now it is a maintenance host only, which means mwscript works but localhost:80 is not MW. I recall something about Apache conflicting with the Apache of the Scap/Git sync proxies, but don't recall the details.
- Or, on a depooled canary server.
- Or, on the mwdebug host we use for staging. That might play well with T239373.
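Whichever host is chosen, the checks themselves only need a target that serves MediaWiki over HTTP. The sketch below shows the general shape of such a runner, assuming a base URL that is not receiving live traffic. The names `Check`, `http_fetch` and `run_checks` are hypothetical, the checks are hard-coded here rather than read from spec.yaml, and this is not how Scap implements them.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    """One synthetic request with its expected response (illustrative)."""
    path: str
    host_header: str
    expected_status: int
    body_contains: Optional[str] = None


# A fetcher takes (url, host_header) and returns (status, body).
Fetch = Callable[[str, str], Tuple[int, str]]


def http_fetch(url: str, host_header: str) -> Tuple[int, str]:
    """Fetch `url` with an explicit Host header.

    Note: urllib follows redirects, so this sketch cannot assert on 3xx
    responses. Connection errors are not caught and will propagate.
    """
    req = urllib.request.Request(url, headers={"Host": host_header})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status and a body we can check.
        return err.code, err.read().decode("utf-8", errors="replace")


def run_checks(base_url: str, checks: List[Check],
               fetch: Fetch = http_fetch) -> List[str]:
    """Run every check against `base_url` and return failure messages."""
    failures = []
    for check in checks:
        status, body = fetch(base_url.rstrip("/") + check.path,
                             check.host_header)
        if status != check.expected_status:
            failures.append(f"{check.path}: expected status "
                            f"{check.expected_status}, got {status}")
        elif (check.body_contains is not None
              and check.body_contains not in body):
            failures.append(f"{check.path}: body does not contain "
                            f"{check.body_contains!r}")
    return failures
```

With a runner like this, a non-empty result from `run_checks("http://localhost:80", checks)` on the chosen host could abort the deployment before any pooled server is touched. The `fetch` parameter is injectable so the logic can be exercised without a network.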