api-gateway: run make test in CI
Closed, Resolved (Public)

Description

The api-gateway chart defines tests for Lua and chart templates that can be run by calling make test. They should run in CI.

It would be nice to use a generalized approach that other charts can use as well, in a self-service kind of way. A good way to do this would be to have the Rakefile that is already used in CI find and run chart-specific tests. However, different charts may need different tooling/frameworks/libraries for testing, which would have to be included in the chart test image. The tests for the api-gateway chart need Lua (and the busted test runner), Python, and helm.
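A minimal sketch of the "find and run chart-specific tests" idea, written in Python for illustration only (the actual hook would live in the Ruby Rakefile). It assumes a hypothetical layout in which each chart that opts in has a `Makefile` with a `test` target directly in its chart directory; the function names and the layout are assumptions, not taken from the repository.

```python
import subprocess
from pathlib import Path


def charts_with_tests(charts_root: Path) -> list[Path]:
    """Return chart directories whose Makefile declares a 'test' target."""
    found = []
    for makefile in sorted(charts_root.glob("*/Makefile")):
        # Cheap textual check: a line that begins with "test:".
        # Multi-target rules such as "lint test:" are not detected.
        lines = makefile.read_text().splitlines()
        if any(line.startswith("test:") for line in lines):
            found.append(makefile.parent)
    return found


def run_chart_tests(charts_root: Path) -> dict[str, bool]:
    """Run `make test` in every opted-in chart; map chart name to pass/fail."""
    results = {}
    for chart_dir in charts_with_tests(charts_root):
        proc = subprocess.run(["make", "-C", str(chart_dir), "test"])
        results[chart_dir.name] = proc.returncode == 0
    return results
```

Charts without a `Makefile` are skipped, so chart authors opt in simply by adding one. The dependency problem remains: `make` and whatever the target invokes must exist in the CI image.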

The idea is not to run tests against actual deployments in CI. That kind of testing should be done as part of the deployment process, ideally via helm test.

Impact of not doing this (for api-gateway):

  • Frequency: About once per month, a change is likely to cause unforeseen breakage that a CI test would have caught. This happens even when tests are run manually during development, because of errors introduced by rebasing.
  • Severity: Varies by kind of error introduced. E.g. a fatal error in the Lua code that implements the rate limit classification would effectively disable all rate limits. An error in how the configuration for the ratelimit service is generated may cause all clients to be hit by a very low limit. This should be caught by system tests on the staging cluster during deployment, but whether that would happen depends on the exact failure mode (and on the system tests actually being run manually during deployment, see T405578).
  • Clean up effort: Assuming the problem is spotted by observing anomalies in the metrics after deployment to a production cluster, cleanup involves creating and deploying a revert patch. That takes about 15 minutes if all goes well.
  • Workaround effort: Always remember to run the tests manually during development and after every rebase. Make sure that people who only work on the code occasionally, especially in emergencies, are aware of this. A single run of the tests takes seconds, but it adds up over time, and the risk of not doing it grows.
  • Resolution effort: Depends on the implementation strategy. In the simplest form (only support minimal dependencies and bake them into the image once) it's about a week of work.
  • Sloppiness: Accidentally throttling all API clients to 10 requests per minute would be more than slightly embarrassing.

Event Timeline

Change #1282962 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] Move make files to standard location

https://gerrit.wikimedia.org/r/1282962

Change #1282965 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] EXPERIMENT: Define ci_test target in Rakefile

https://gerrit.wikimedia.org/r/1282965

I have been thinking that we could extend the Rakefile we are already using in production to check helm charts so that it also runs chart-specific tests. I have been discussing this with @Joe and @Blake on Slack. They both seem to like the idea, but the main problem is the dependencies that these tests introduce.

For instance, the make test command for the api-gateway chart needs Python 3 (and the unittest package) as well as Lua (and the busted test runner). And of course, we'd need GNU Make itself to run these. Where does it end? What if different charts need different versions of Lua (which are famously incompatible)? The charts define containerized applications. Testing bits and pieces of them in CI seems like a good idea, but the dependencies may vary.
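One way to make the dependency question explicit is a preflight check that reports which tools a chart's tests need but the CI image lacks. The sketch below is an assumption-laden illustration: the `REQUIRED_TOOLS` table is hypothetical, and the executable names (`lua`, `busted`, `helm`) may differ in the actual image.

```python
import shutil

# Hypothetical per-chart declaration of required executables.
REQUIRED_TOOLS = {
    "api-gateway": ["make", "python3", "lua", "busted", "helm"],
}


def missing_tools(chart: str, which=shutil.which) -> list[str]:
    """Return the executables required by `chart` that are not on PATH."""
    required = REQUIRED_TOOLS.get(chart, [])
    return [tool for tool in required if which(tool) is None]
```

Failing early with a list such as `["busted"]` gives a clearer error than a test run that dies halfway through `make test`. It does not solve the problem of charts needing conflicting versions of the same tool.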

So, how about containerizing the tests? That runs into problems because the Rakefile is already running in a docker container. We'd need a docker-in-docker setup.

Then there is helm test, which allows us to define tests in a helm chart and run them in a kubernetes pod. This is well suited for verifying a deployment or running system tests on the staging cluster (T424825), but it doesn't seem like a good fit for CI.

Perhaps it would be possible to list a bunch of docker containers somewhere (in the repo!) that CI should run? @hashar what do you think?

@SLong-WMF hi, Halley suggested I ping you on this ticket, because you might have ideas on how to approach this. I am trying to figure out how to best implement per-chart CI tests on the deployment-charts repo. The main challenge is that the nature of the services is very heterogeneous, so there may be a potentially large number of tools and dependencies needed to run the tests. On the other hand, right now there are only two or three charts that have such tests.

Hey Daniel! It's a little unclear to me what exactly is being tested here. I'm familiar with helm charts and orchestration tooling, are you looking for help figuring out what should be the test running architecture that runs existing tests (the foundation has a huge amount of existing unit, integration and UI layer tests) or are you looking at spinning up a new set of automated tests to validate some new functionality (or functionality that is currently uncovered)?

As a reference point my team has a hypothesis (ST5.2.1) in Q1 to define and write automated monitoring tests (ie tests covering critical user journeys that run as part of release trains).

Are you trying to test mediawiki, the deployment mechanisms themselves, validate dependencies, or make sure services/APIs stand up and are responsive after a deployment?

> Hey Daniel! It's a little unclear to me what exactly is being tested here. I'm familiar with helm charts and orchestration tooling, are you looking for help figuring out what should be the test running architecture that runs existing tests (the foundation has a huge amount of existing unit, integration and UI layer tests) or are you looking at spinning up a new set of automated tests to validate some new functionality (or functionality that is currently uncovered)?

I am looking for a way to do pre-merge unit testing on helm charts in gerrit CI. We already use Rakefile to run helm lint for each chart there, and I want to do a little bit more. I have existing stand-alone tests for Lua code and chart templates for the api-gateway chart that run without deploying the service. I currently run them manually before I push to gerrit. I want gerrit to do that for me. And I was thinking that it would be nice to have a mechanism that would work for any chart, not just the one I'm working on right now.
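As an illustration of a stand-alone chart template test that needs no deployment, the sketch below renders a chart locally with `helm template` and asserts on the output using `unittest`. The chart path, the release name `test-release`, and the expectation that a ConfigMap is rendered are all hypothetical placeholders, not details of the actual api-gateway tests.

```python
import shutil
import subprocess
import unittest


def helm_template_cmd(chart_dir, values_files=()):
    """Build the `helm template` command line that renders a chart locally."""
    cmd = ["helm", "template", "test-release", chart_dir]
    for path in values_files:
        cmd += ["--values", path]
    return cmd


def render_chart(chart_dir, values_files=()):
    """Render the chart to a YAML string; raises if rendering fails."""
    cmd = helm_template_cmd(chart_dir, values_files)
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


@unittest.skipUnless(shutil.which("helm"), "helm is not installed")
class RenderTest(unittest.TestCase):
    def test_renders_a_configmap(self):
        # Hypothetical chart path and expectation, for illustration only.
        manifest = render_chart("charts/api-gateway")
        self.assertIn("kind: ConfigMap", manifest)
```

Because rendering happens entirely on the local filesystem, a test like this can run pre-merge in CI as long as the image ships helm and Python.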

> As a reference point my team has a hypothesis (ST5.2.1) in Q1 to define and write automated monitoring tests (ie tests covering critical user journeys that run as part of release trains).

> Are you trying to test mediawiki, the deployment mechanisms themselves, validate dependencies, or make sure services/APIs stand up and are responsive after a deployment?

This ticket is about pre-merge tests in gerrit CI, but I am also very interested in testing service behavior as part of the deployment pipeline, see T424825: rest-gateway: run system tests via helm test. I'm in the process of dockerizing the system tests for the rest gateway service for this purpose.

@daniel and I exchanged on that topic.

operations/deployment-charts has a helm-lint CI job which invokes the image releng/helm-linter. That image already has a few Debian packages:

helm311 helm317 rake envoyproxy helmfile helm-diff ruby-git istioctl python3-minimal python3-yaml kubeconform python3-pip wmf-certificates

Its entrypoint invokes rake.

An easy path is to add Lua/Make to the image and have the deployment-charts Rakefile invoke the suite. That is quite easy to add and we would pair on it next week.

There is a caveat though: the releng/helm-linter image is based on Debian Bullseye, so it might have a different version of Lua than the one used by the gateway/envoy image :-\
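A guard against this kind of version drift could compare the interpreter banner reported in CI (for example from `lua -v`) with the one reported in the production image. The sketch below only parses and compares banner strings; how the two banners are obtained, and which interpreter each image actually ships, are left open and not taken from this task.

```python
import re


def parse_lua_version(banner: str):
    """Extract (interpreter name, version) from a Lua or LuaJIT version banner."""
    match = re.match(r"(LuaJIT|Lua)\s+(\d+(?:\.\d+)*)", banner.strip())
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    return match.group(1), match.group(2)


def same_runtime(ci_banner: str, prod_banner: str) -> bool:
    """True if both banners name the same interpreter and major.minor version."""
    ci_name, ci_version = parse_lua_version(ci_banner)
    prod_name, prod_version = parse_lua_version(prod_banner)
    same_series = ci_version.split(".")[:2] == prod_version.split(".")[:2]
    return ci_name == prod_name and same_series
```

Comparing only major.minor is a design choice for this sketch: patch releases within a series are treated as compatible, while a mismatch in series or interpreter would flag the CI results as potentially unrepresentative.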


An alternative would be to move the Envoy Lua script into the api-gateway source code repository and have the tests run there. The image would have to inherit from the Envoy base image. A test image can be defined via Blubber, and patches sent to api-gateway would then run the tests with the same Envoy/Lua version that is used in production. It seems more robust, but I don't know whether the Lua files need to be in the deployment-charts repo or not.

> An easy path is to add Lua/Make to the image and have the deployment-charts Rakefile invoke the suite. That is quite easy to add and we would pair on it next week.

The main concern with this was that it could easily "blow up" if/when people start adding more tests that require additional dependencies - and there is no automatic way to satisfy them. But I'd be okay with biting that bullet and dealing with it when the time comes.

> > An easy path is to add Lua/Make to the image and have the deployment-charts Rakefile invoke the suite. That is quite easy to add and we would pair on it next week.

> The main concern with this was that it could easily "blow up" if/when people start adding more tests that require additional dependencies - and there is no automatic way to satisfy them. But I'd be okay with biting that bullet and dealing with it when the time comes.

My assumption is that most charts are trivial and don't need CI tests beyond basic helm chart validation. For more complex charts, I'd still hope that "low level dependencies" will be enough (e.g. a Lua interpreter, Python, Make, etc).

Change #1282962 merged by jenkins-bot:

[operations/deployment-charts@master] Move Makefiles to standard location

https://gerrit.wikimedia.org/r/1282962

Change #1294211 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[integration/config@master] dockerfiles: helm-linter: add dependencies for api-gateway tests

https://gerrit.wikimedia.org/r/1294211

Change #1294227 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: [helm-lint] Update to helm-linter:0.8.0 for api-gateway tests

https://gerrit.wikimedia.org/r/1294227

Change #1294211 merged by jenkins-bot:

[integration/config@master] dockerfiles: helm-linter: add dependencies for api-gateway tests

https://gerrit.wikimedia.org/r/1294211

Mentioned in SAL (#wikimedia-releng) [2026-05-27T09:49:08Z] <hashar> Updated helm-lint Jenkins job to use releng/helm-linter:0.8.0 image # T424824

Change #1294227 merged by jenkins-bot:

[integration/config@master] jjb: [helm-lint] Update to helm-linter:0.8.0 for api-gateway tests

https://gerrit.wikimedia.org/r/1294227

daniel triaged this task as High priority. Mon, Jun 1, 9:39 PM

Change #1282965 merged by jenkins-bot:

[operations/deployment-charts@master] Rakefile: Run chart specific tests

https://gerrit.wikimedia.org/r/1282965