The api-gateway chart defines tests for its Lua code and chart templates, which can be run by calling make test. These tests should also run in CI.
It would be nice to use a generalized approach that other charts can adopt as well, in a self-service kind of way. A good way to do this would be to have the Rakefile that is already used in CI find and run chart-specific tests. However, different charts may need different tooling, frameworks, or libraries for testing, all of which would have to be included in the chart test image. The tests for the api-gateway chart need Lua (and the busted test runner), Python, and helm.
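The discovery step described above could look roughly like the following. This is a minimal sketch in Python for illustration only; the real logic would live in the Ruby Rakefile. It assumes that a chart opts in by shipping a Makefile with a test target directly in its chart directory. The function names and the directory layout are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Matches a Makefile rule whose target is exactly "test" (e.g. "test:" or
# "test: deps"), but not a variable assignment such as "test := foo".
TEST_TARGET = re.compile(r"^test\s*:(?!=)", re.MULTILINE)


def charts_with_tests(charts_root):
    """Return chart directories whose Makefile defines a `test` target.

    Assumption: each chart lives in its own directory directly below
    charts_root and opts in by providing such a Makefile.
    """
    found = []
    for makefile in sorted(Path(charts_root).glob("*/Makefile")):
        if TEST_TARGET.search(makefile.read_text()):
            found.append(makefile.parent)
    return found


def run_chart_tests(charts_root, runner=subprocess.run):
    """Run `make test` in every opted-in chart.

    Returns the names of the charts whose tests exited with a non-zero
    status. The runner parameter exists so the command can be stubbed out.
    """
    failed = []
    for chart_dir in charts_with_tests(charts_root):
        result = runner(["make", "test"], cwd=chart_dir)
        if result.returncode != 0:
            failed.append(chart_dir.name)
    return failed
```

A CI job would call run_chart_tests on the charts directory and fail the build if the returned list is non-empty. Charts without a test target are skipped, which keeps the mechanism self-service. The tools each chart's make test needs would still have to be present in the test image.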
The idea is not to run tests against actual deployments in CI. That kind of testing should be done as part of the deployment process, ideally via helm test.
Impact of not doing this (for api-gateway):
- Frequency: About once per month, a change is likely to cause unforeseen breakage that a CI test could have prevented. This happens even when tests are run manually during development, because of errors introduced by rebasing.
- Severity: Varies by kind of error introduced. E.g. a fatal error in the Lua code that implements the rate limit classification would effectively disable all rate limits. An error in how the configuration for the ratelimit service is generated may cause all clients to be hit by a very low limit. This should be caught by system tests on the staging cluster during deployment, but whether that would happen depends on the exact failure mode (and on the system tests actually being run manually during deployment, see T405578).
- Cleanup effort: Assuming the problem is spotted by observing anomalies in the metrics after deployment to a production cluster, cleanup involves creating and deploying a revert patch. That takes about 15 minutes if all goes well.
- Workaround effort: Always remember to run the tests manually during development and after every rebase. Make sure that people who work on the code only occasionally, especially in emergencies, are aware of this. A single run of the tests takes seconds, but it adds up over time, and the risk of forgetting it grows.
- Resolution effort: Depends on the implementation strategy. In the simplest form (only support minimal dependencies and bake them into the image once) it's about a week of work.
- Sloppiness: Accidentally throttling all API clients to 10 requests per minute would be embarrassing.