
Run integration tests against device-analytics staging
Closed, Resolved, Public

Description

Background:

  • device-analytics is deployed in k8s "staging" namespace
  • It can be accessed via service url from within the same network
  • The main Jenkins server can
    • schedule and run post-merge jobs
    • access k8s "staging"
  • The deployment pipeline supports running helm test on post-merge (see the sketch after this list)
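
What "running helm test on post-merge" boils down to is roughly the following (a minimal sketch; the release name and namespace match this task, but the exact invocation the pipeline uses is an assumption):

# Run the chart's test hook pods against the staging release;
# helm test exits non-zero if any test pod fails
helm test device-analytics -n staging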

Task:

  • Create a container for @EChukwukere-WMF 's test suite that can be used via the deployment pipeline
  • Update .pipeline/config.yaml to test the resulting device-analytics container on post-merge (example: Mathoid's pipeline config)
  • Update the helm chart to include tests that invoke the container (example: mathoid chart)
  • Run the container via the pipeline
    • If there is a test failure, the container should exit with a non-zero exit status, causing the post-merge test to fail (see the sketch below)
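
The exit-status contract in that last point is the key piece: helm marks a test hook pod as failed when its container exits non-zero, and the post-merge stage fails with it. A rough sketch of checking that contract locally, with a hypothetical image name and test command:

# Hypothetical image and test runner, for illustration only
docker run --rm device-analytics-tests:latest npm test
echo $?   # non-zero whenever any test fails, which is what fails the helm test hook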

Details

Other Assignee
EChukwukere-WMF

Event Timeline

Hi @hnowlan, regarding device-analytics: "It can be accessed via service url from within the same network". Is there a documented way to do this access, first by a human and then by the automated tests? cc @EChukwukere-WMF

The service can be accessed within the production network via commands similar to curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229. For now this should work for manual tests

Thanks Hugh.

Notes from sync with @EChukwukere-WMF

  • Atieno to transfer the test repos from GitLab to Gerrit
  • Will we be breaking up the tests from one AQS tests folder into individual test repos in Gerrit?
  • Another option is to have the monorepo and then skip all tests except the service specific one

Some notes from the API Gateway & Thumbor SRE Check-In

  • The steps on this task describe running @EChukwukere-WMF 's test suite as part of helm test
  • It will be mostly trivial to run the test suite manually against staging
  • But it's more work to build it into the pipeline via helm test
  • There's consensus that running the test suite manually is a pre-requisite for deployment
  • There's less consensus about whether the helm test stage is needed—it sounds like it's suboptimal

Open questions:

  • Is running the test suite as part of the pipeline a pre-requisite for deployment? (leaning no, I think)
  • During the meeting, we mentioned that the proposal on this ticket (i.e., using helm test via the pipeline) is suboptimal
    • Should we change the scope of this ticket to only cover running the test suite manually (probably)?
    • Should we try for the helm test implementation as described in this ticket? (@BPirkle @SGupta-WMF @Atieno ?)
    • Are there other ways to achieve the same result in less-sub-optimal ways (I heard @hnowlan used the word "ganeti VM" at some point 😄)?

My thoughts:

  • I consider running Emeka's test suite manually as a prereq for deployment. Sounds like that is relatively straightforward, hooray
  • I would support changing the scope of this ticket to running the test suite manually, provided we create another ticket to consider the possibility of running it in automated fashion
  • the helm test stage is suboptimal compared to an ideal world where the suite runs pre-merge
  • if post-merge helm test prevents us from deploying code that fails the test suite, that sounds better than not running it, even if we would prefer to run it earlier
  • I don't know what a "ganeti VM" is, but if it helps us, I'll take two plus a side of fries, please :)

The things I can recall that make the helm test approach suboptimal:

  1. it involves accessing the production databases from staging
  2. any failures are only caught post-merge, which means that faulty code has already been merged to the main branch

I can live with #2 if that's our only way to run these tests automatically. I can see why we'd disallow #1 as a matter of policy. I understand it would be challenging to create staging datastores that mirror production. I wonder if there's something less ambitious we could do that would offer sufficiently robust data for the test suite without mirroring the full production dataset.

None of this discussion covers the in-service integration tests. I don't think it is necessary to run those manually in staging before deployment, as we run them locally all the time already. However, if it isn't much trouble, it'd be welcome. Ideally, we'd also have a way to run those in automated fashion just like Emeka's test suite. They face essentially the same challenges, so maybe the same solution would work for both.

Sounds good, I can work on implementing the manual tests against staging. @hnowlan please remind me and link me to the platform you said we can upload the test suite to. Thanks!

What I feel:

  • Running the test suite is definitely a pre-requisite for any deployment.
  • The helm test post-merge seems like a good approach to me only if we mandate that developers run the test suite locally first, or if we have a dev or Ganeti VM.
  • Otherwise, there are risks associated with running the test suite on staging post-merge: frequent code reverts, and incorrect code propagating in the repo if it is not reverted in time.

The linked subtask T333550 tracks getting my production shell access set up.

So it sounds like we have alignment on a few things and have next steps.

  1. Manual test suite is a pre-req for deployment.
  2. Let's keep the scope of this ticket to manual testing and create a new one for automated testing.

2b. Who is responsible for creating the automated testing ticket? Is that @Atieno or @EChukwukere-WMF?

Open question:
We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

Who is responsible for creating the automated testing ticket? Is that @Atieno or @EChukwukere-WMF?

I'd guess @Atieno, as the actual work for that is more related to the deployment pipeline than to the testing system itself, and Atieno has been acting as our deployment engineer recently (thank you, Atieno!)

We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

We don't own that part of our infrastructure, and the changes we'd need are probably a significant amount of work.

The CI pipeline has limited support for what we'd like to do - that work belongs to other folks. I expect their focus is (rightfully) on GitLab rather than gerrit, as switching to GitLab is an organizational goal. Admittedly I don't know how dependent that part of the CI pipeline is on gerrit vs GitLab. My impression is just that they have multiple priorities and our testing use case is not the top of their list. And that's just for the when-it-runs part. The what-it-hits part faces the complication of either hitting a production datastore from the staging cluster (which SRE understandably has pushed back on) or having some other datastore to hit (either real or mocked), which sounds nontrivial. @thcipriani , @hnowlan, did I misrepresent that? I'd be very happy to be wrong.

I do wonder if we should be more actively surfacing this use case for those folks, offering to collaborate with them on possible improvements, making a case how this would benefit more than just AQS 2.0, etc. Maybe having a partner and advocate would help advance that work, or at least get it onto someone's roadmap.

Sounds good, I can work on implementing the manual tests against staging. @hnowlan please remind me and link me to the platform you said we can upload the test suite to. Thanks!

As mentioned, this URL pattern can be used: https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229. The service itself runs at https://staging.svc.eqiad.wmnet:4972 and the URL paths are direct access to the service (as opposed to the RESTBase URLs).

So it sounds like we have alignment on a few things and have next steps.

  1. Manual test suite is a pre-req for deployment.
  2. Let's keep the scope of this ticket to manual testing and create a new one for automated testing.

+1

We need to decide on an optimal automated testing solution. Is there any reason why we can't focus on the optimal solution instead of trying to make the sub-optimal work?

We don't own that part of our infrastructure, and the changes we'd need are probably a significant amount of work.

The CI pipeline has limited support for what we'd like to do - that work belongs to other folks. I expect their focus is (rightfully) on GitLab rather than gerrit, as switching to GitLab is an organizational goal. Admittedly I don't know how dependent that part of the CI pipeline is on gerrit vs GitLab. My impression is just that they have multiple priorities and our testing use case is not the top of their list. And that's just for the when-it-runs part. The what-it-hits part faces the complication of either hitting a production datastore from the staging cluster (which SRE understandably has pushed back on) or having some other datastore to hit (either real or mocked), which sounds nontrivial. @thcipriani , @hnowlan, did I misrepresent that? I'd be very happy to be wrong.

To the best of my knowledge (with some guessing), creating a distinct AQS-dev cluster in a VM specifically for testing AQS, with a single-node Cassandra cluster, would not be super hard. I did something similar when trying to migrate AQS to new hardware for Analytics. We could do it ourselves, or we could make an ask of the Data Persistence team (but as far as I know their workload is quite significant and there would be a wait on getting this in place). Data Engineering might also like this or have opinions. There is a little bit of ownership ambiguity here (as you would expect when I've mentioned two other teams in a single paragraph) that might need some consideration. I wouldn't want to spin this up and then just walk away, although I expect it to be somewhat minimal stress to support.
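
For scale, the single-node setup that idea implies is roughly this (a local sketch assuming Docker; the container name, image tag, and keyspace are illustrative, and a real AQS-dev instance would presumably be a Ganeti VM provisioned via Puppet rather than a container):

# Throwaway single-node Cassandra for an AQS-dev-style test datastore
docker run -d --name aqs-dev-cassandra -p 9042:9042 cassandra:4.1
# Once it reports ready (startup takes a minute or so), create a small test keyspace
# and load a representative slice of AQS data into it:
docker exec aqs-dev-cassandra cqlsh -e "CREATE KEYSPACE IF NOT EXISTS aqs_dev WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"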

I do wonder if we should be more actively surfacing this use case for those folks, offering to collaborate with them on possible improvements, making a case how this would benefit more than just AQS 2.0, etc. Maybe having a partner and advocate would help advance that work, or at least get it onto someone's roadmap.

Yeah I agree here - running other containers as a step within CI seems like something that is a matter of "when" and not "if" as far as the system is concerned imo.

As mentioned this URL pattern can be used https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229.

ssh mwmaint1002.eqiad.wmnet

curl https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229

{
 "items": [
  {
   "project": "en.wikipedia",
   "access-site": "all-sites",
   "granularity": "daily",
   "timestamp": "20160202",
   "devices": 64928144,
   "offset": 8306542,
   "underestimate": 56621602
  },
<snip>

Confirmed, that's fun to see!

I can now run the test suite via an ssh tunnel and also with Bill's test proxy.
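
For anyone else who needs it, the tunnel approach looks roughly like this (a sketch, assuming SSH access to mwmaint1002 is already configured; the hostname and port come from this thread, the local forwarding details are my own):

# Forward the staging service port through mwmaint1002
ssh -N -L 4972:staging.svc.eqiad.wmnet:4972 mwmaint1002.eqiad.wmnet &
# The TLS certificate is for staging.svc.eqiad.wmnet, so resolve that name to the
# tunnel instead of pointing the test suite at localhost directly:
curl --resolve staging.svc.eqiad.wmnet:4972:127.0.0.1 https://staging.svc.eqiad.wmnet:4972/metrics/unique-devices/en.wikipedia.org/all-sites/daily/20160201/20160229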