In a perfect world, we'd use httpbb to validate every change to the Apache config, so all the tests would always pass in prod. In practice, we sometimes miss changes -- in part because the tests also depend on MediaWiki behavior and other things. When that happens, the tests fail in prod, which makes them less useful for validating future changes: a failure caused by a new change is hard to tell apart from one that was already there. For httpbb to be a useful tool, the baseline state of the tests should always be passing.
In order to catch unexpected changes, we should automatically run httpbb periodically (say, once per hour) and generate a nonpaging alert if anything fails.
Open questions on the implementation:
- Which tests should we run? I'm inclined to start with appserver/* at first to get everything working, then expand to other directories like doc/ and releases/ from there if we think there's value.
- Which individual hosts should we test? Running all tests on every appserver is probably more work than we need to do. We probably don't want to pick a random host every time either: the behavior should be consistent across hosts, but if it isn't, we don't want that inconsistency to show up as test flakiness. So maybe we just choose the mwdebug hosts or something.
- Where should we store the mapping between target hosts and test files? We could put it in hiera, or in a METADATA: block at the top of each yaml test file, or something else. The mapping might vary by the site where httpbb is running -- httpbb is much faster when it stays within a data center, so we might choose to have cumin1001 test an eqiad host and cumin2001 test a codfw host.
- Do we implement this as a custom Icinga check, or just a systemd timer with Icinga monitoring? Right now the full suite of appserver tests takes 15 seconds to run in-datacenter, but that might get longer as we add more tests, so we might have to think about Icinga check timeouts if we go that route.
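To make the options above concrete, here is a rough Python sketch of a wrapper that either an Icinga check or a systemd timer could call. It is not a proposal for the final design. Everything specific in it is an assumption for illustration: the `TARGETS` mapping (which could just as well live in hiera), the mwdebug hostnames, the `--hosts` flag, the assumption that httpbb exits nonzero when any test fails, and the 60-second timeout. The return values follow the usual Nagios/Icinga plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import glob
import subprocess
from typing import Callable, List, Sequence, Tuple

# Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Hypothetical site -> target mapping; in practice this might come from hiera.
# Keeping the target in the same data center as the runner keeps httpbb fast.
TARGETS = {
    "eqiad": "mwdebug1001.eqiad.wmnet",
    "codfw": "mwdebug2001.codfw.wmnet",
}

Runner = Callable[[Sequence[str], float], Tuple[int, str]]


def build_command(site: str, test_files: Sequence[str]) -> List[str]:
    """Build the httpbb invocation for one site (the --hosts flag is assumed)."""
    return ["httpbb", *test_files, "--hosts", TARGETS[site]]


def _run(command: Sequence[str], timeout: float) -> Tuple[int, str]:
    """Default runner: execute the command and return (exit code, output)."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr


def run_check(command: Sequence[str], runner: Runner = _run,
              timeout: float = 60.0) -> Tuple[int, str]:
    """Run httpbb once and translate the result into a plugin status."""
    try:
        returncode, output = runner(command, timeout)
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"UNKNOWN: httpbb did not finish within {timeout:.0f}s"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: could not run httpbb: {exc}"
    if returncode == 0:
        return OK, "OK: all httpbb tests passed"
    # Surface the last line of output as a short summary for the alert.
    lines = output.strip().splitlines()
    summary = lines[-1] if lines else "no output"
    return CRITICAL, f"CRITICAL: httpbb exited {returncode}: {summary}"


def main(site: str, test_glob: str) -> int:
    """Expand the test glob, run the check, print the status line."""
    files = sorted(glob.glob(test_glob))
    status, message = run_check(build_command(site, files))
    print(message)
    return status
```

An entry point would call something like `sys.exit(main("eqiad", "<test dir>/appserver/*.yaml"))`. Treating a timeout as UNKNOWN rather than CRITICAL is one possible choice: it keeps a slow run from looking like a real test failure, which matters if the suite grows past the check timeout. Whether a CRITICAL pages anyone is a matter of Icinga configuration and is outside this sketch.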