
Run httpbb periodically
Open, Medium, Public

Description

In a perfect world, we'd use httpbb to validate every change to the Apache config, so all the tests would always pass in prod. In practice, sometimes we miss changes -- in part because the tests also depend on MediaWiki behavior and other things. When that happens, the tests fail in prod, which makes them less useful for validating future changes. In order for httpbb to be a useful tool, the baseline state of the tests should always be passing.

In order to catch unexpected changes, we should automatically run httpbb periodically (say, once per hour) and generate a nonpaging alert if anything fails.

Open questions on the implementation:

  • Which tests should we run? I'm inclined to start with appserver/* to get everything working, then expand from there to other directories like doc/ and releases/ if we think there's value.
  • What individual hosts should we test? "All tests on each appserver" is probably more work than we need to do. We probably don't want to pick a random host every time (the behavior should be consistent, but if it isn't, we don't want that to translate into test flakiness), so maybe we just choose the mwdebug hosts or something.
  • Where should we store the mapping between target hosts and test files? We could put it in hiera, or in a METADATA: block at the top of each YAML test file, or something else. The mapping might vary by the site where httpbb is running -- httpbb is much faster when it stays within a data center, so we might choose to have cumin1001 test an eqiad host and cumin2001 test a codfw host.
  • Do we implement this as a custom Icinga check, or just a systemd timer with Icinga monitoring? Right now the full suite of appserver tests takes 15 seconds to run in-datacenter, but that might get longer as we add more tests, so we might have to think about Icinga check timeouts if we go that route.

Event Timeline

RLazarus triaged this task as Medium priority. Aug 18 2021, 9:18 PM

Two cents re: metrics/alerting: we have the Prometheus Pushgateway available, which seems like a good fit (more info: https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway))

Thanks for the pointer! I think if we wanted to track metrics from each run, like request latency or number of passing assertions or something, Pushgateway would be the tool for the job -- but I think we don't have anything to export apart from the binary "pass/fail" result, so the exit code (plus the detailed test-failure messages on stdout) is probably all we need. If we want timeseries data for anything later on, I'll know how to store it.

> What individual hosts should we test? "All tests on each appserver" is probably more work than we need to do. We probably don't want to pick a random host every time (the behavior should be consistent, but if it isn't, we don't want that to translate into test flakiness), so maybe we just choose the mwdebug hosts or something.

I think we shouldn't use mwdebug alone, since those hosts are meant for testing changes. We should test either mwdebug _and_ another host, or just another host. That way, people using mwdebug for testing don't have to worry about triggering alerts, and we can see whether new changes break things, as compared with untouched production hosts.

I would suggest we use one of the canary hosts. Maybe we can automatically pick the pooled canary host with the lowest number in its name, i.e. the oldest canary server in production.
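
A rough sketch of that selection logic, assuming we already have the list of pooled canary hostnames in hand (fetching it, e.g. from conftool, isn't shown, and the function name is made up):

```python
import re

def oldest_canary(canaries):
    """Pick the canary whose hostname contains the lowest number, i.e. the
    oldest one. `canaries` is assumed to be the currently pooled canaries."""
    def host_number(hostname):
        match = re.search(r'\d+', hostname)  # mw1261.eqiad.wmnet -> 1261
        return int(match.group()) if match else float('inf')
    return min(canaries, key=host_number)

print(oldest_canary(['mw1276.eqiad.wmnet', 'mw1261.eqiad.wmnet']))
# -> mw1261.eqiad.wmnet
```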

Sounds reasonable! I'll probably hardcode a canary host at first, then we can look at choosing one automatically.

Change 714136 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] httpbb: Add hourly test runs via systemd timers.

https://gerrit.wikimedia.org/r/714136

Change 714137 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] hieradata: Run httpbb hourly from cumin2001 against a codfw appserver

https://gerrit.wikimedia.org/r/714137

Change 714136 merged by RLazarus:

[operations/puppet@production] httpbb: Add hourly test runs via systemd timers.

https://gerrit.wikimedia.org/r/714136

Change 714137 merged by RLazarus:

[operations/puppet@production] hieradata: Run httpbb hourly from cumin2001 against a codfw appserver

https://gerrit.wikimedia.org/r/714137

Change 714642 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] httpbb: Wrap the systemd ExecCommand in "sh -c" so the wildcard works.

https://gerrit.wikimedia.org/r/714642

Change 714642 merged by RLazarus:

[operations/puppet@production] httpbb: Wrap the systemd ExecCommand in "sh -c" so the wildcard works.

https://gerrit.wikimedia.org/r/714642

Change 714646 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver.

https://gerrit.wikimedia.org/r/714646

Change 714646 merged by RLazarus:

[operations/puppet@production] hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver.

https://gerrit.wikimedia.org/r/714646

Hourly appserver tests are running on both cumin1001 (to mw1414) and cumin2001 (to mw2271). Weirdly, the tests time out in eqiad about half the time:

Aug 25 20:53:28 cumin1001 systemd[1]: Started Run httpbb appserver/ tests hourly on mw1414.eqiad.wmnet.
Aug 25 20:53:53 cumin1001 sh[22067]: Sending to mw1414.eqiad.wmnet...
Aug 25 20:53:53 cumin1001 sh[22067]: PASS: 108 requests sent to mw1414.eqiad.wmnet. All assertions passed.
Aug 25 20:53:53 cumin1001 systemd[1]: httpbb_hourly_appserver.service: Succeeded.

Aug 25 21:53:28 cumin1001 systemd[1]: Started Run httpbb appserver/ tests hourly on mw1414.eqiad.wmnet.
Aug 25 21:54:01 cumin1001 sh[30031]: Sending to mw1414.eqiad.wmnet...
Aug 25 21:54:01 cumin1001 sh[30031]: https://meta.wikimedia.org/wiki/List_of_Wikipedias (/srv/deployment/httpbb-tests/appserver/test_main.yaml:155)
Aug 25 21:54:01 cumin1001 sh[30031]:     ERROR: HTTPSConnectionPool(host='mw1414.eqiad.wmnet', port=443): Read timed out. (read timeout=10)
Aug 25 21:54:01 cumin1001 sh[30031]: ===
Aug 25 21:54:01 cumin1001 sh[30031]: ERRORS: 108 requests attempted to mw1414.eqiad.wmnet. Errors connecting to 1 host.
Aug 25 21:54:01 cumin1001 systemd[1]: httpbb_hourly_appserver.service: Main process exited, code=exited, status=1/FAILURE

I haven't been able to reproduce the slowness, either by running httpbb manually or by repeatedly curling that URL.

That test isn't doing anything special: it just fetches https://meta.wikimedia.org/wiki/List_of_Wikipedias and checks for a 200 response that contains the string "List of Wikipedias". There's definitely no reason the request should take ten seconds. (httpbb gives up on a host when it gets a connection error, so later requests might also have timed out had they been sent. We should probably exclude timeouts from that give-up behavior.)
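
For reference, the failing check boils down to something like this (a sketch using python-requests, not httpbb's actual code; httpbb addresses the backend host directly, which is omitted here):

```python
import requests

# Rough equivalent of the test at test_main.yaml:155, per the description
# above, with the same 10-second read timeout seen in the log.
resp = requests.get('https://meta.wikimedia.org/wiki/List_of_Wikipedias',
                    timeout=10)
assert resp.status_code == 200
assert 'List of Wikipedias' in resp.text
```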

But this is eqiad, currently the passive DC, so the appserver is barely getting any other traffic. See mw1414's application server dashboard -- each httpbb run is quite visible as a request spike.

I think maybe what we're seeing here is just cold opcache/APCu -- that would make sense given there's no traffic, and it would also explain why subsequent runs are fast. (It doesn't explain why subsequent runs are sometimes slow again -- but maybe that happens when a deployment in between invalidates the cache.)

If that's the case, there are some options.

  • For appserver tests, we could run httpbb in the active data center only. We could do that by checking confd in the systemd timer, in the style of mw-cli-wrapper. Or we could just point all the test traffic at appservers-ro.discovery.wmnet, with the downside that we wouldn't be able to use the same host consistently anymore.
  • We could retry on timeouts, which would self-recover from the error in cases where we're just warming a cold cache. We could do that within the httpbb code, for individual requests -- or we could just retry the entire httpbb run when it fails due to a timeout.
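
For the per-request variant of that second option, the retry could look something like this (a sketch; httpbb's internals may differ):

```python
import requests

def get_with_retries(url, timeout=10, attempts=3):
    """Retry timed-out requests, so a one-off cold-cache stall doesn't
    fail the whole run."""
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            if attempt == attempts - 1:
                raise  # still timing out after warming the cache; report it
            # The timed-out request has likely warmed the cache; try again.
```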

Change 715094 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] Revert "hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver."

https://gerrit.wikimedia.org/r/715094

Change 715094 merged by RLazarus:

[operations/puppet@production] hieradata: Remove hourly httpbb run on cumin1001.

https://gerrit.wikimedia.org/r/715094

Just some unsorted thoughts:

  • Can we set the timeout to 120s (the MW request timeout) to see how long the request is actually taking, and whether cold caches are a reasonable thing to blame? e.g. if it's reliably taking 12s or something.
  • Do (low) timeouts matter for httpbb's purpose? Like, we're mostly testing to see that we get a 200 OK, not timing.
  • https://meta.wikimedia.org/wiki/List_of_Wikipedias is not a small page. That said, the ParserCache entry says "Real time usage: 1.731 seconds". Even if it was 6x slower on cold caches, after the first successful request it should be served out of eqiad's ParserCache (unless the codfw=>eqiad replication is messing with it??).

Another way I'd like to improve this is to deal with Puppet skew on the two hosts.

Right now, if a patch changes both appserver behavior (say, Apache config) and the httpbb tests, then the tests might or might not pass, depending on the order in which Puppet runs on cumin1001 and mw1418 (the current source and destination of the test traffic) and whether the timer happens to fire in between. When it does fail, it self-resolves on the next run an hour later, but it would be nice to avoid that noise.

Some possible approaches:

  • Run httpbb on the target host (sending traffic to itself), so the config and tests are updated simultaneously. But testing from an external host is a slightly more realistic test, and installing httpbb on the target hosts would have to wait for T299705.
  • Adjust the alert so that it only fires after two consecutive httpbb failures, maybe in combination with running httpbb every 30 minutes instead of 60.
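
The second approach's alert gating could be a small wrapper that persists a failure count between timer firings, something like this (a sketch; the state-file path and threshold are invented):

```python
import pathlib
import subprocess
import sys

STATE = pathlib.Path('/var/lib/httpbb/consecutive_failures')  # invented path

def run_gated(httpbb_cmd, threshold=2):
    """Run httpbb, but only exit nonzero (failing the systemd unit, and
    hence alerting) after `threshold` consecutive failures."""
    failures = int(STATE.read_text()) if STATE.exists() else 0
    result = subprocess.run(httpbb_cmd)
    failures = 0 if result.returncode == 0 else failures + 1
    STATE.write_text(str(failures))
    sys.exit(1 if failures >= threshold else 0)
```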
RLazarus claimed this task.