Run Swagger checks in Scap before exposing to prod MW traffic
Closed, Resolved (Public)

Description

Background

The automated scap process currently does roughly the following, from a validation perspective:

  1. Deployment server
    • Linting. This catches syntax errors for all code.
    • Pre-promote check. For the subset of code involved with setting up multiversion, wmf-config and MW core, it also catches run-time errors if they affect all wikis and/or enwiki (per T121597, implemented as echo 1 | mwscript eval.php --wiki enwiki).
  2. Canary servers (percentage of live traffic).
    • Logstash check. After 30 seconds of live traffic, consult Logstash and assert that overall error/exception levels are not significantly higher than before.
    • Swagger checks. A small set of synthetic HTTP requests with expected response codes and bodies (spec.yaml).
  3. If step 2 fails, we abort here but leave the canary share of production traffic broken, because Scap is currently unable to auto-rollback (T317405).
  4. Sync to the remaining prod servers.
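
The Logstash check in step 2 amounts to comparing an error rate before and after the canary sync. A minimal sketch in Python, assuming the two rates have already been fetched; the function name, the `threshold` factor, the `floor` value and their defaults are hypothetical and not taken from Scap:

```python
def error_rate_regressed(before: float, after: float,
                         threshold: float = 2.0, floor: float = 1.0) -> bool:
    """Return True if the error rate after the sync is significantly higher.

    `before` and `after` are error counts per unit of time.
    `threshold` is the allowed multiplicative increase (assumed value).
    `floor` keeps a near-zero baseline from flagging ordinary noise (assumed value).
    """
    baseline = max(before, floor)
    return after > baseline * threshold
```

With these illustrative defaults, a jump from 10 to 25 errors counts as a regression, while a jump from 10 to 15 does not.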

Problem statement

The Swagger checks can catch a good number of problems and don't depend on live traffic. However, we only run them after the point of no return, when real traffic is already exposed to the new code.
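
To illustrate why such checks do not need live traffic, here is a rough Python sketch of a synthetic check runner. It is not the spec.yaml format or Scap's implementation: the `Check` fields, the `run_checks` name and the injected `fetch` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    path: str                    # request path to probe
    expected_status: int         # expected HTTP status code
    expected_body_contains: str  # substring the response body must include


# fetch(path) -> (status_code, body). It is injected, so the target can be
# any host (deploy server, depooled canary, mwdebug) or a fake in tests.
Fetch = Callable[[str], Tuple[int, str]]


def run_checks(fetch: Fetch, checks: List[Check]) -> List[str]:
    """Run every check and return human-readable failure messages."""
    failures = []
    for check in checks:
        status, body = fetch(check.path)
        if status != check.expected_status:
            failures.append(
                f"{check.path}: expected status {check.expected_status}, got {status}"
            )
        elif check.expected_body_contains not in body:
            failures.append(
                f"{check.path}: body does not contain {check.expected_body_contains!r}"
            )
    return failures
```

An empty result means every synthetic request behaved as expected, so the checks can gate a deployment without any user traffic reaching the new code.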

Proposal

Run the Swagger checks earlier.

Either on the deployment server. This requires it to become a full appserver: right now it is a maintenance host only, which means mwscript works but localhost:80 is not MW. I recall something about Apache conflicting with the Apache of the Scap/Git sync proxies, but don't recall the details.

Or, on a depooled canary server.

Or, on the mwdebug we use for staging. That might play well with T239373.

Event Timeline

thcipriani triaged this task as Medium priority. Dec 18 2019, 1:59 PM
thcipriani moved this task from Needs triage to Services improvements on the Scap board.
dancy claimed this task.
dancy subscribed.

@Krinkle I think the work on T358117 satisfies the goals stated here so I'm resolving this ticket.

Thanks @dancy. Does the below match your understanding?

  • Swagger checks have been removed in favour of httpbb checks.
  • The httpbb checks include (among many more awesome checks) also requests similar to or "better" than the ones we have in the Swagger spec.
  • The new httpbb step has been added to Scap at a point in time before any real traffic is exposed to the deployed code, thus fulfilling (in spirit) the idea behind this task.
  • The Swagger spec is now unused, and can be removed from the repository.

Thanks @dancy. Does the below match your understanding?

  • Swagger checks have been removed in favour of httpbb checks.

Confirmed.

  • The httpbb checks include (among many more awesome checks) also requests similar to or "better" than the ones we have in the Swagger spec.

httpbb performs 128 different checks. You can view them in /srv/deployment/httpbb-tests/appserver on the deploy server.

  • The new httpbb step has been added to Scap at a point in time before any real traffic is exposed to the deployed code, thus fulfilling (in spirit) the idea behind this task.

Confirmed. The httpbb checks are performed against bare metal and k8s testservers. This happens before canaries are synced.

If the checks fail for some reason, the operator will be offered the choice to retry the checks, continue deployment anyway, or exit scap.
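
The retry / continue / exit choice described above can be sketched as a small loop. This is an illustration in Python, not Scap's actual code: `gate_on_checks`, the `run_checks` callable and the prompt wording are hypothetical.

```python
from typing import Callable


def gate_on_checks(run_checks: Callable[[], bool],
                   ask: Callable[[str], str]) -> bool:
    """Return True to proceed with the deployment, False to exit.

    `run_checks` returns True when all checks pass.
    `ask` shows a prompt and returns the operator's answer.
    """
    while True:
        if run_checks():
            return True
        choice = ask(
            "Checks failed. [r]etry, [c]ontinue anyway, or [e]xit? "
        ).strip().lower()
        if choice == "c":
            return True   # operator accepts the risk and continues
        if choice == "e":
            return False  # operator aborts the deployment
        # Any other answer, including "r", re-runs the checks.
```

In an interactive tool, `ask` could simply be Python's built-in `input`. Passing it in as a parameter keeps the loop testable without a terminal.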

  • The Swagger spec is now unused, and can be removed from the repository.

It's unused by scap now. I can't speak about other potential users of it.

  • The Swagger spec is now unused, and can be removed from the repository.

It's unused by scap now. I can't speak about other potential users of it.

It was made for scap, and I'm unaware of anyone else ever having used it. My strong intuition is that it's safe to remove now.

But I'd bet we have a way to check requests for that endpoint over the last 90 days or so and confirm the only regular requests are from our deploy host.

Oddly, I don't see any results for the path /spec.yaml on the logstash apache2 accesslog dashboard. @Krinkle, do you know of any other magic to confirm the only regular requests for that are made from the deploy host so we can remove it?

I'd use kafkacat to consume webrequest_text from a stats host.

The following would query the last 1 billion requests and then exit. At ~8M/minute this searches roughly the last 2 hours.

krinkle@stats1007$ kafkacat -C -b kafka-jumbo1007.eqiad.wmnet -t webrequest_text -o '-1000000000' -e | fgrep '"/spec.yaml'
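
To see which clients are making the matching requests, the output could be tallied per client instead of being read by eye. A rough Python sketch, assuming each message is one JSON object per line and that the path and client address are in fields named `uri_path` and `ip`; those field names and both function names are assumptions for this example and should be checked against the webrequest schema:

```python
import json
from collections import Counter
from typing import Iterable


def tally_clients(lines: Iterable[str], path: str = "/spec.yaml") -> Counter:
    """Count requests for exactly `path`, grouped by client address."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        # "uri_path" and "ip" are assumed field names, see the note above.
        if isinstance(record, dict) and record.get("uri_path") == path:
            counts[record.get("ip", "unknown")] += 1
    return counts


def window_minutes(messages: int, rate_per_minute: float) -> float:
    """Estimate how many minutes of traffic a number of messages covers."""
    return messages / rate_per_minute
```

Calling `tally_clients(sys.stdin)` in a small script fed by the pipeline above would show whether anything other than the deploy host requests the file. `window_minutes(1_000_000_000, 8_000_000)` gives 125 minutes, which is where the "roughly 2 hours" estimate comes from.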

For a slower but definitive search across e.g. the last two months, you could use Hive (same table, same host). The webrequest_text page on Wikitech has example Hive SQL queries.