
Run integration tests against device-analytics staging
Closed, Resolved, Public

Description

Background:

  • device-analytics is deployed in k8s "staging" namespace
  • It can be accessed via service url from within the same network
  • The main Jenkins server can
    • schedule and run post-merge jobs
    • access k8s "staging"
  • The deployment pipeline supports running helm test on post-merge (see the sketch after this list)
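
What "running helm test on post-merge" boils down to is roughly the following (a minimal sketch; the release name and namespace match this task, but the exact invocation the pipeline uses is an assumption):

# Run the chart's test hook pods against the staging release;
# helm test exits non-zero if any test pod fails
helm test device-analytics -n staging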

Task:

  • Create a container for @EChukwukere-WMF 's test suite that can be used via the deployment pipeline
  • Update .pipeline/config.yaml to test the resulting device-analytics container on post-merge (example: Mathoid's pipeline config)
  • Update the helm chart to include tests that invoke the container (example: mathoid chart)
  • Run the container via the pipeline
    • If there is a test failure, the container should exit with a non-zero exit status, causing the post-merge test to fail (see the sketch below)
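
The exit-status contract in that last point is the key piece: helm marks a test hook pod as failed when its container exits non-zero, and the post-merge stage fails with it. A rough sketch of checking that contract locally, with a hypothetical image name and test command:

# Hypothetical image and test runner, for illustration only
docker run --rm device-analytics-tests:latest npm test
echo $?   # non-zero whenever any test fails, which is what fails the helm test hook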

Details

Other Assignee
EChukwukere-WMF

Event Timeline

Hi @hnowlan, regarding device-analytics: "It can be accessed via service url from within the same network". Is there a documented way to do this access, first by a human and then by the automated tests? cc @EChukwukere-WMF

The service can be accessed within the production network via commands similar to curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229. For now this should work for manual tests

Thanks Hugh.

Notes from sync with @EChukwukere-WMF

  • Atieno to transfer the test repos from GitLab to Gerrit
  • Will we be breaking up the tests from one AQS tests folder into individual test repos in Gerrit?
  • Another option is to have the monorepo and then skip all tests except the service specific one

Some notes from the API Gateway & Thumbor SRE Check-In

  • The steps on this task describe running @EChukwukere-WMF 's test suite as part of helm test
  • It will be mostly trivial to run the test suite manually against staging
  • But it's more work to build it into the pipeline via helm test
  • There's consensus that running the test suite manually is a pre-requisite for deployment
  • There's less consensus about whether the helm test stage is needed—it sounds like it's suboptimal

Open questions:

  • Is running the test suite as part of the pipeline a pre-requisite for deployment? (leaning no, I think)
  • During the meeting, we mentioned that the proposal on this ticket (i.e., using helm test via the pipeline) is suboptimal
    • Should we change the scope of this ticket to only cover running the test suite manually (probably)?
    • Should we try for the helm test implementation as described in this ticket? (@BPirkle @SGupta-WMF @Atieno ?)
    • Are there other ways to achieve the same result in less-sub-optimal ways (I heard @hnowlan used the word "ganeti VM" at some point 😄)?

My thoughts:

  • I consider running Emeka's test suite manually as a prereq for deployment. Sounds like that is relatively straightforward, hooray
  • I would support changing the scope of this ticket to running the test suite manually, provided we create another ticket to consider the possibility of running it in automated fashion
  • the helm test stage is suboptimal compared to an ideal world where the suite runs pre-merge
  • if post-merge helm test prevents us from deploying code that fails the test suite, that sounds better than not running it, even if we would prefer to run it earlier
  • I don't know what a "ganeti VM" is, but if it helps us, I'll take two plus a side of fries, please :)

The things I can recall that make the helm test approach suboptimal:

  1. it involves accessing the production databases from staging
  2. any failures are only caught post-merge, which means that faulty code has already been merged to the main branch

I can live with #2 if that's our only way to run these tests automatically. I can see why we'd disallow #1 as a matter of policy. I understand it would be challenging to create staging datastores that mirror production. I wonder if there's something less ambitious we could do that would offer sufficiently robust data for the test suite without mirroring the full production dataset.

None of this discussion covers the in-service integration tests. I don't think it is necessary to run those manually in staging before deployment, as we run them locally all the time already. However, if it isn't much trouble, it'd be welcome. Ideally, we'd also have a way to run those in automated fashion just like Emeka's test suite. They face essentially the same challenges, so maybe the same solution would work for both.

Sounds good, I can work on implementing the manual tests against staging. @hnowlan please remind me and link me to the platform you said we can upload the test suite to. Thanks!

What I feel:

  • Running the test suite is definitely a pre-requisite for any deployment.
  • The helm test post-merge seems like a good approach to me only if we mandate that developers run the test suite locally first, or if we have a dev or Ganeti VM.
  • Otherwise, there are risks associated with running the test suite on staging post-merge: frequent code reverts, and incorrect code propagating in the repo if it is not reverted in time.

The linked subtask T333550 tracks getting my production shell access set up.

So it sounds like we have alignment on a few things and have next steps.

  1. Manual test suite is a pre-req for deployment.
  2. Let's keep the scope of this ticket to manual testing and create a new one for automated testing.

2b. Who is responsible for creating the automated testing ticket? Is that @Atieno or @EChukwukere-WMF?

Open question:
We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

Who is responsible for creating the automated testing ticket? Is that @Atieno or @EChukwukere-WMF?

I'd guess @Atieno, as the actual work for that is more related to the deployment pipeline than to the testing system itself, and Atieno has been acting as our deployment engineer recently (thank you, Atieno!)

We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

We don't own that part of our infrastructure, and the changes we'd need are probably a significant amount of work.

The CI pipeline has limited support for what we'd like to do - that work belongs to other folks. I expect their focus is (rightfully) on GitLab rather than gerrit, as switching to GitLab is an organizational goal. Admittedly I don't know how dependent that part of the CI pipeline is on gerrit vs GitLab. My impression is just that they have multiple priorities and our testing use case is not the top of their list. And that's just for the when-it-runs part. The what-it-hits part faces the complication of either hitting a production datastore from the staging cluster (which SRE understandably has pushed back on) or having some other datastore to hit (either real or mocked), which sounds nontrivial. @thcipriani , @hnowlan, did I misrepresent that? I'd be very happy to be wrong.

I do wonder if we should be more actively surfacing this use case for those folks, offering to collaborate with them on possible improvements, making a case how this would benefit more than just AQS 2.0, etc. Maybe having a partner and advocate would help advance that work, or at least get it onto someone's roadmap.

Sounds good, I can work on implementing the manual tests against staging. @hnowlan please remind me and link me to the platform you said we can upload the test suite to. Thanks!

As mentioned, this URL pattern can be used: https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229. The service itself runs at https://staging.svc.eqiad.wmnet:4972 and the URL paths are direct access to the service (as opposed to the RESTBase URLs).

So it sounds like we have alignment on a few things and have next steps.

  1. Manual test suite is a pre-req for deployment.
  2. Let's keep the scope of this ticket to manual testing and create a new one for automated testing.

+1

We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

We don't own that part of our infrastructure, and the changes we'd need are probably a significant amount of work.

The CI pipeline has limited support for what we'd like to do - that work belongs to other folks. I expect their focus is (rightfully) on GitLab rather than gerrit, as switching to GitLab is an organizational goal. Admittedly I don't know how dependent that part of the CI pipeline is on gerrit vs GitLab. My impression is just that they have multiple priorities and our testing use case is not the top of their list. And that's just for the when-it-runs part. The what-it-hits part faces the complication of either hitting a production datastore from the staging cluster (which SRE understandably has pushed back on) or having some other datastore to hit (either real or mocked), which sounds nontrivial. @thcipriani , @hnowlan, did I misrepresent that? I'd be very happy to be wrong.

To the best of my knowledge (with some guessing), creating a distinct AQS-dev cluster in a VM specifically for testing AQS, with a single-node Cassandra cluster, would not be super hard. I did something similar when trying to migrate AQS to new hardware for Analytics. We could do it ourselves, or we could make an ask of the Data Persistence team (but as far as I know their workload is quite significant and there would be a wait on getting this in place). Data Engineering might also like this or have opinions. There is a little bit of ownership ambiguity here (as you would expect when I've mentioned two other teams in a single paragraph) that might need some consideration. I wouldn't want to spin this up and then just walk away, although I expect it to be somewhat minimal stress to support.
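
For scale, the single-node setup that idea implies is roughly this (a local sketch assuming Docker; the container name, image tag, and keyspace are illustrative, and a real AQS-dev instance would presumably be a Ganeti VM provisioned via Puppet rather than a container):

# Throwaway single-node Cassandra for an AQS-dev-style test datastore
docker run -d --name aqs-dev-cassandra -p 9042:9042 cassandra:4.1
# Once it reports ready (startup takes a minute or so), create a small test keyspace
# and load a representative slice of AQS data into it:
docker exec aqs-dev-cassandra cqlsh -e "CREATE KEYSPACE IF NOT EXISTS aqs_dev WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"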

I do wonder if we should be more actively surfacing this use case for those folks, offering to collaborate with them on possible improvements, making a case how this would benefit more than just AQS 2.0, etc. Maybe having a partner and advocate would help advance that work, or at least get it onto someone's roadmap.

Yeah I agree here - running other containers as a step within CI seems like something that is a matter of "when" and not "if" as far as the system is concerned imo.

As mentioned this URL pattern can be used https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229.

ssh mwmaint1002.eqiad.wmnet

curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229

{
 "items": [
  {
   "project": "en.wikipedia",
   "access-site": "all-sites",
   "granularity": "daily",
   "timestamp": "20160202",
   "devices": 64928144,
   "offset": 8306542,
   "underestimate": 56621602
  },
<snip>

Confirmed, that's fun to see!

I can now run the test suite via an ssh tunnel and also with Bill's test proxy.
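
For anyone else who needs it, the tunnel approach looks roughly like this (a sketch, assuming SSH access to mwmaint1002 is already configured; the hostname and port come from this thread, the local forwarding details are my own):

# Forward the staging service port through mwmaint1002
ssh -N -L 4972:staging.svc.eqiad.wmnet:4972 mwmaint1002.eqiad.wmnet &
# The TLS certificate is for staging.svc.eqiad.wmnet, so resolve that name to the
# tunnel instead of pointing the test suite at localhost directly:
curl --resolve staging.svc.eqiad.wmnet:4972:127.0.0.1 https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229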