Page MenuHomePhabricator

Kask functional testing with Cassandra via the Deployment Pipeline
Open, MediumPublic

Description

In talking with @Eevans yesterday it sounds like it is desirable to have a running Cassandra instance for some portion of Kask tests run via the pipeline. The Deployment Pipeline has a step that can be used for integration tests (using helm test) that step deploys kubernetes resources defined in a helm chart (or in the templates/test directory of the chart) and runs helm test against those resources.

There are several ways to achieve the desired outcome of deploying an ephemeral Cassandra instance during the "test deployment" phase of the pipeline -- either having the Cassandra instance be a part of the base chart, or a resource defined for test explicitly. This will help inform best practice as more services have similar needs.

Pieces missing from kask/charts to run integration tests in the deployment pipeline:

  • WMF Cassandra image
  • Kask chart update
    • Update values.yaml
  • Kask .pipeline/config.yaml update

Event Timeline

thcipriani triaged this task as Medium priority.May 21 2019, 4:47 PM
thcipriani created this task.

It seems that the cassandra subchart already exists for cask (via https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/509102/ ); however, to use that in the pipeline we would have to override some values at deployment time.

It seems that the cassandra subchart already exists for cask (via https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/509102/ );

This was created with minikube/local development in mind, not really the pipeline. The image is from dockerhub and the helm chart from the incubator helm charts repo. Thus it won't even work in our infrastructure (and we haven't really vetted them much). Put differently, it was a solution to scratch a particular itch.

That being said, the subchart pattern (also used for minikube/local_dev by eventgate) is probably a valid way to solve the problems mentioned in the task.

however, to use that in the pipeline we would have to override some values at deployment time.

Yes. That one is going to be overall required even in other cases. e.g. the staging part of the pipeline is unable (on purpose) to reach to production and internet access (e.g. for citoid/cxserver) needs to go via the proxies which are configurable via helm values

Eevans renamed this task from Kask integration testing with Cassandra via the Deployment Pipeline to Kask functional testing with Cassandra via the Deployment Pipeline.May 28 2019, 1:15 PM
WDoranWMF removed a project: Platform Engineering.
jeena added a subscriber: jeena.Jun 15 2020, 7:40 PM

I created a helm test and got the integration and functional tests running in minikube. Do we have DNS within the ci cluster? Given an appropriate Cassandra subchart, we could run the kask-cassandra integration tests using helm test.

In order to run a helm test, it needs to have an image to run, so we'd need to publish the test image and be able to clean it up from the registry afterwards.

I created a helm test and got the integration and functional tests running in minikube. Do we have DNS within the ci cluster? Given an appropriate Cassandra subchart, we could run the kask-cassandra integration tests using helm test.

In order to run a helm test, it needs to have an image to run, so we'd need to publish the test image and be able to clean it up from the registry afterwards.

PipelineLib could clean up an image from the registry, provided there's an API to remove images from the registry.

jeena added a comment.Jun 15 2020, 7:58 PM

PipelineLib could clean up an image from the registry, provided there's an API to remove images from the registry.

There is this: https://docs.docker.com/registry/spec/api/#deleting-an-image

I created a helm test and got the integration and functional tests running in minikube. Do we have DNS within the ci cluster?

Yes we now do.

Given an appropriate Cassandra subchart, we could run the kask-cassandra integration tests using helm test.

And an appropriate image. Which is the issue here in fact. The subchart added in https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/509102/ is centered around an image that is not in WMF's control. Given we currently have no automateable way to make assertions on the provenance of the image, the amount of maintenance it is regularly seeing, the security updates it sports etc, it's best we don't use it in CI as is. That would add an attack vector to our CI infrastructure (and from there to our production infrastructure) that we don't currently have.

In order to run a helm test, it needs to have an image to run, so we'd need to publish the test image and be able to clean it up from the registry afterwards.

If pushing to the test image to the registry is a requirement, indeed it should be cleaned up afterwards.https://docs.docker.com/registry/spec/api/#deleting-an-image should do it. It's a bit weird as an API as you will need to pass the digest that you can get from the Docker-Content-Digest: when doing GET/HEAD on the image tag.

However, up to now we did not have a need to publish the test images to the registry but still run integration tests via helm test (IIRC mathoid runs them). What has changed?

thcipriani updated the task description. (Show Details)Jun 16 2020, 6:37 PM
jeena added a comment.EditedJun 16 2020, 7:06 PM

However, up to now we did not have a need to publish the test images to the registry but still run integration tests via helm test (IIRC mathoid runs them). What has changed?

I might have misunderstood that the team wanted to run the functional and integration tests that are in the kask repo written in go.

I just took a closer look at service-checker and I think it would be able to replicate the go integration tests, but only the get and post, not the delete ones, so if that's sufficient, then you are right, we wouldn't need to publish any additional images or make changes to the kask chart (newly added to task checklist).

@Eevans can you confirm what tests you'd like to run?

However, up to now we did not have a need to publish the test images to the registry but still run integration tests via helm test (IIRC mathoid runs them). What has changed?

I might have misunderstood that the team wanted to run the functional and integration tests that are in the kask repo written in go.

I just took a closer look at service-checker and I think it would be able to replicate the go integration tests, but only the get and post, not the delete ones, so if that's sufficient, then you are right, we wouldn't need to publish any additional images or make changes to the kask chart (newly added to task checklist).

OK, so there is a long version, and a short version. I'll relate the long version for posterity sake, feel free to jump down to TL;DR if you don't care to read it.

Long version:

I am not a fan of the idea behind service-checker and the Open API extension it relies on. What it provides is integration testing; The x-amples section of the spec is a DSL for writing tests, and service-checker is the test runner. It is a very limited form of integration testing, and one entirely of our own making. Some of the issues I see with this:

These tests should be thought of as site config (think: cases where a call might alter the state of a production system), and the specification describes the interface. So we have configuration (x-amples) embedded inside what would otherwise be treated as code (the OpenAPI spec). This mixing of concerns is a real problem (and it has created problems for the systems I have experience with here). We wouldn't otherwise allow this, I wish we didn't do so here.

By requiring this form of testing, we're making OpenAPI a hard requirement for (REST) services (whether it makes sense for that service or not). Even still, it doesn't provide testing for other types of services, or web applications like MediaWiki.

Since it is so limited, good coverage means writing additional integration tests using something more capable (duplication), and then figuring out how/where to run them. Since that requires additional effort though, there is a really good chance we don't do that, and simply live with these very basic tests (see TL;DR below).

There was some effort made recently to standardize how integration testing was done for all web applications at the WMF. The idea, as I understood it at the time, was to create some common abstractions on top of Supertest, and generally standardize layout and execution. We had some discussions at the time about utilizing that work to create the means of defining a minimal subset of tests for production monitoring, with discovery, as a replacement for the x-amples section. However, that work has concluded, the result is very MediaWiki-specific, and there doesn't appear to be any additional time to work on this in the foreseeable future.


Short version (TL;DR):

@Eevans can you confirm what tests you'd like to run?

We might as well stick to the service-checker-based tests.

jeena added a comment.Jul 1 2020, 10:12 PM

We've published a cassandra image to docker-registry.wikimedia.org/releng/cassandra311:0.0.1
I've tested this on minikube with kask using the cassandra chart specified in the requirements and everything seems fine, but let me know if you run into any problems.

After talking with @thcipriani about what work was needed, we had added creating a WMF Cassandra chart to the description of this task. I was thinking to fork the incubator chart. Does anyone have thoughts on this?

jeena updated the task description. (Show Details)Jul 1 2020, 10:15 PM

We've published a cassandra image to docker-registry.wikimedia.org/releng/cassandra311:0.0.1
I've tested this on minikube with kask using the cassandra chart specified in the requirements and everything seems fine, but let me know if you run into any problems.

This is awesome. thanks so much for this!

After talking with @thcipriani about what work was needed, we had added creating a WMF Cassandra chart to the description of this task. I was thinking to fork the incubator chart. Does anyone have thoughts on this?

If the kask cassandra subchart already works, do we really need another cassandra chart? At least at this point in time?

jeena added a comment.Jul 2 2020, 10:24 PM

If the kask cassandra subchart already works, do we really need another cassandra chart? At least at this point in time?

Great, using the existing one sounds good to me!

jeena updated the task description. (Show Details)Jul 2 2020, 10:25 PM

Change 609894 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[operations/deployment-charts@master] Kask: Use Releng Cassandra Image

https://gerrit.wikimedia.org/r/609894

Change 609894 merged by jenkins-bot:
[operations/deployment-charts@master] Kask: Use Releng Cassandra Image

https://gerrit.wikimedia.org/r/609894

Change 616174 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[mediawiki/services/kask@master] Add pipeline config

https://gerrit.wikimedia.org/r/616174

Change 616177 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[integration/config@master] Convert kask ci to use pipelinelib

https://gerrit.wikimedia.org/r/616177

Change 616174 merged by jenkins-bot:
[mediawiki/services/kask@master] Add pipeline config

https://gerrit.wikimedia.org/r/616174

Change 616177 merged by jenkins-bot:
[integration/config@master] Convert kask ci to use pipelinelib

https://gerrit.wikimedia.org/r/616177

jeena added a comment.EditedJul 31 2020, 10:36 PM

We attempted to run the tests using CI, but ran into errors deploying cassandra to k8s (on the ci cluster):

Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b00951ef2bf603555124f2baeb44ef27bb01d9163d5ca9597ee294a80812d73f" network for pod "mediawiki-services-kask-9otw3rmv-cassandra-0": NetworkPlugin cni failed to set up pod "mediawiki-services-kask-9otw3rmv-cassandra-0_ci" network: failed to get IPv6 addresses for host side of the veth pair

I did some searching and noticed some there were some fixes to calico cni for this error: https://github.com/projectcalico/cni-plugin/pull/380, https://github.com/projectcalico/cni-plugin/pull/730, but I'm not certain whether this is the same problem we are having.

Readiness and liveness probes subsequently fail with the following:

rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:110: decoding init error from pipe caused \\\"read parent: connection reset by peer\\\"\"\n"

So it seems like the sandbox error might not be the actual problem.

@akosiaris are you able to help with this?

jeena added a comment.Aug 1 2020, 1:25 AM

So I re-tried installing the chart on the ci cluster today and got some different errors.

First, I tried running the job, and got the following errors:

Warning  Unhealthy  6m27s                  kubelet, kubestage1002.eqiad.wmnet  Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:110: decoding init error from pipe caused \\\"read parent: connection reset by peer\\\"\"\n"
 Warning  Unhealthy  5m27s                  kubelet, kubestage1002.eqiad.wmnet  Liveness probe failed: rpc error: code = 2 desc = containerd: container not found
 Warning  Unhealthy  4m27s                  kubelet, kubestage1002.eqiad.wmnet  Readiness probe failed: rpc error: code = 2 desc = containerd: container not found
 Warning  Unhealthy  4m26s (x2 over 4m26s)  kubelet, kubestage1002.eqiad.wmnet  Liveness probe errored: rpc error: code = Unknown desc = container not running (836603f302bbb04e4eec335c4a28cb150ce01e666bec0b63deb656b5ba397566)
 Warning  Unhealthy  4m26s                  kubelet, kubestage1002.eqiad.wmnet  Readiness probe errored: rpc error: code = Unknown desc = container not running (836603f302bbb04e4eec335c4a28cb150ce01e666bec0b63deb656b5ba397566)
 Warning  Failed     41s (x3 over 3m11s)    kubelet, kubestage1002.eqiad.wmnet  Error: failed to start container "mediawiki-services-kask-oe8aoyga-cassandra": Error response from daemon: shim error: context deadline exceeded

I thought that might be related to the timeout so I tried to install manually with the following command and a generous timeout:

helm --tiller-namespace=ci install --namespace=ci --set docker.registry=docker-registry.wikimedia.org,docker.pull_policy=IfNotPresent,main_app.image=wikimedia/mediawiki-services-kask,main_app.version=2020-08-01-003138-candidate,subcharts.cassandra=true,cassandra.persistence.enabled=false -n mediawiki-services-kask-05397tbr --debug --timeout 600 https://releases.wikimedia.org/charts/kask-0.0.11.tgz

There were still errors:

Warning  FailedCreatePodSandBox  26s (x13 over 15m)  kubelet, kubestage1002.eqiad.wmnet  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "mediawiki-services-kask-05397tbr-cassandra-0": Error response from daemon: shim error: context deadline exceeded
 Normal   SandboxChanged          25s (x15 over 15m)  kubelet, kubestage1002.eqiad.wmnet  Pod sandbox changed, it will be killed and re-created.

We attempted to run the tests using CI, but ran into errors deploying cassandra to k8s (on the ci cluster):

Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b00951ef2bf603555124f2baeb44ef27bb01d9163d5ca9597ee294a80812d73f" network for pod "mediawiki-services-kask-9otw3rmv-cassandra-0": NetworkPlugin cni failed to set up pod "mediawiki-services-kask-9otw3rmv-cassandra-0_ci" network: failed to get IPv6 addresses for host side of the veth pair

I did some searching and noticed some there were some fixes to calico cni for this error: https://github.com/projectcalico/cni-plugin/pull/380, https://github.com/projectcalico/cni-plugin/pull/730, but I'm not certain whether this is the same problem we are having.

Readiness and liveness probes subsequently fail with the following:

rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:110: decoding init error from pipe caused \\\"read parent: connection reset by peer\\\"\"\n"

So it seems like the sandbox error might not be the actual problem.

@akosiaris are you able to help with this?

Yeah, that's pretty standard and due to our old calico version. It doesn't cause a problem in itself currently, but it would be nice to fix and it's actually fixed in future versions.

So I re-tried installing the chart on the ci cluster today and got some different errors.

Warning  Failed     41s (x3 over 3m11s)    kubelet, kubestage1002.eqiad.wmnet  Error: failed to start container "mediawiki-services-kask-oe8aoyga-cassandra": Error response from daemon: shim error: context deadline exceeded

This is actually the problem I think. Dockerd failing to start the container. logs on kubestage1002 are full of lines like

Aug 01 20:35:49 kubestage1002 dockerd[36910]: time="2020-08-01T20:35:49.526726587Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"

Looks like we finally got bit by our old docker version. This seems to be https://github.com/moby/moby/issues/29369 and it looks like it's transient (albeit with a very long time to fix itself). I no longer see those on kubestage1002 but I am willing to bet they are going to show up again. Aside from upgrading docker I don't see a very easy path out of this, but targeting a new docker version will take a while.

jeena added a comment.Aug 5 2020, 4:15 AM

Hmm, I tried to deploy again but still couldn't. I would be happy to help with upgrading docker in any way I am able!

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board.Aug 17 2020, 11:46 PM