
AQS 2.0: Revisit in-service testing approach
Closed, Resolved · Public

Description

This task is to discuss and determine the best approach for in-service (implemented within the service itself) testing for AQS 2.0. A secondary (but perhaps more important in the long run) goal is to establish testing patterns for future similar services. This is likely to spawn related implementation tasks.

Background

AQS 2.0 provides public-facing access to certain WMF metrics data via what is effectively a federated API implemented in six Go services. It replaces the original AQS, which was implemented via a fork of RESTBase and which provided access to the same data via an API of virtually identical contract.

There is a lot of "new" involved with AQS 2.0. This includes:

  1. Go is not in widespread use at WMF at this time (Kask is the primary production example of current usage).
  2. deployment of services to our production k8s environment is still somewhat formative.
  3. while AQS 2.0 may not be the pinnacle of a microservices approach, it is at least more in that direction than its predecessor.
  4. we're also trying to build it on top of our relatively untried Go service scaffolding and related servicelib.
  5. we're trying out WMF's GitLab installation.
  6. we're collecting common utility functions in a separate Go package housed in its own repository.
  7. we are using Docker-based "testing" environments to stand in for production datastores during development and local testing.
  8. in addition to tests implemented within the service, we have an external testing framework.

A result of all this is that we are encountering a larger-than-average number of questions and surprises as the project proceeds. The most recent of these regards in-service testing, especially as it applies to our CI systems.

Context (testing specific)

Here is a bit more context specifically around testing (points are numbered for ease of reference in case someone wants to mention them in a followup comment):

  1. we have both local/in-service testing, implemented in Go within the service itself, as well as a separate testing framework that performs integration testing by invoking the services externally. This task is really only about the first of those: the in-service tests. The external tests are covered in tasks like T317790: <Spike> AQS 2.0 Testing Plan.
  2. our current testing approach, as outlined about a year and a half ago (yikes!) in T288160: Development and test environments for AQS 2.0 Cassandra services, is to spin up Docker/Cassandra containers with canned data. The original task has a preference for hosted dev/staging environments, possibly ones that we could spin up in CI. But it anticipates that may not be available, and suggests local Docker Compose environments as an alternative. That's what we have done so far.
  3. while we've been calling these "unit tests", it was acknowledged early on that they were more akin to integration tests.
  4. because we have chosen to push as much common functionality as possible to the common "aqsassist" package, much of the unit testing responsibility sits there, rather than in the individual services. (And aqsassist does have actual unit tests.)
  5. it is correct from a theoretical perspective that these are not strictly unit tests. However, there is existing art for this sort of testing in the current production AQS code that we're replacing, and it seems to have served that codebase well. In fact, that's where we got the list of in-service tests to implement: T317720: AQS 2.0: Unique Devices: Implement Unit Tests; T299735: AQS 2.0: Pageviews: Implement Integration Tests; T317722: AQS 2.0: Mediarequests: Implement Unit Tests; T317725: AQS 2.0: Geo Analytics: Implement Unit Tests; T316849: Audit tests for Druid-based endpoints. (These tasks include services in various stages of completion, so not all of this is implemented yet, and some services don't even exist.)
  6. the current tests have been very helpful to developers during implementation
  7. because the goal of AQS 2.0 is compatibility, tests that execute the handlers at a relatively high level during CI would be useful in preventing regressions

Bigger picture

Assuming that AQS 2.0 is successful, it may establish patterns for similar Go-based services that use similar development, testing, and deployment practices. So it is worth exploring our options to find the best long-term solution, as it may benefit future work.

Issue

Now, on to the problem we're currently facing. Per discussion elsewhere (please correct me if I am misrepresenting this), our current challenge is:

  1. the current tests implemented within the AQS services require local Docker testing containers, which do not fit into our image building pipelines
  2. the current tests are more akin to integration tests than unit tests, in that they depend on an external data source

We could quickly satisfy #1 and have the pipeline succeed by "cheating" and having "make test" simply return "true", or by implementing a small number of unit tests that don't provide much code coverage but also don't require testing containers. Both of those options would be irresponsible, because the pipeline would then do little to protect us against regressions. This situation is therefore a deployment blocker.

Solutioning Thoughts

With all that in mind, here are some thoughts related to solutions:

  • the current tests have proven very valuable during development and code review, and tests of similar scope have served the existing production service well. Eliminating them would therefore be inadvisable. But we could also add additional tests that are actual unit tests.
  • could/should we arrange the service code differently to be more unit testable?
  • could/should we have multiple levels or suites of tests within the service? For comparison, MediaWiki has both "unit" and "integration" tests within its codebase, all executed via the PHPUnit framework.
  • is the current inability to spin up testing containers a fundamental limitation of our CI system, or simply its current status? Is it possible this could change?
  • could the current standalone testing containers be refactored such that our current system could spin them up?
  • is there a possibility of having hosted testing containers/environments?
  • is there a possibility of testing against production data sources? The author of this task has a high level of discomfort with this, but ssh'ing into prod is recommended for the druid-based endpoints in the current production AQS. Doing this via CI seems at least slightly less evil. Then again, maybe doing it via CI might actually be *more* evil, given that it opens the possibility of anyone on the planet executing random queries against production datastores via a malicious push.
  • does using an external data source compromise our testing in ways we are inherently uncomfortable with? Would mocking a data source be better or worse?
  • does the anticipated move from gerrit to GitLab change anything for any of this?

Regarding mocking a data source vs using an external data source: mocking is probably more theoretically correct, and it is easy to say we should just do that. However, there are arguments against it:

  • it makes our testing less representative of production, in that our mocking system will never accurately behave like the "real thing"
  • it creates a lot more code to maintain - mocking two separate databases isn't trivial
  • it raises the bar for the skillsets required. Mocking a system requires the coder to sufficiently understand both the thing being mocked, and the technology being used to do the mocking. An external system could be created by an individual familiar with the datastore (and containers for many datastore types, including the ones we use, are publicly available), thereby requiring the service developer to understand the datastore only from the outside-in perspective of creating queries and processing the response, not from the inside-out perspective of processing queries and creating responses.
  • data in an external environment can be ingested into the testing container independently of modifications to the codebase. A mocked system with no external dependencies requires a change to the codebase to update testing data. (However, this could also be considered an advantage of mocking, as using an external environment means the data being tested against can change without warning)
  • Mocked data accompanies the codebase on its entire journey, increasing the size of repos (and possibly even deployment images, depending on how we approach things).
  • We already have containers. Mocking would be an additional development effort. (This is lower priority, as we would rather do things correctly even if it means more work. But it isn't completely ignorable either.)

Task Responsibility

The author is assigning this task to themselves, as driver. However, the purpose of this task is discussion and consensus, not for one person to make decisions. Please jump in with anything I've forgotten, correct any mistakes or misrepresentations I've made, etc. There's no way I typed that many words and got it all right...

Also, I'm subscribing a bunch of people who I think would be interested, but please add anyone I forgot.

Acceptance Criteria

This task is complete when a testing solution has been agreed on and any additional tasks necessary to realize that solution have been created.

Edit: fixed a bunch of typos and poor wording choices.

Event Timeline

Integration vs. unit tests:
Just to preface my thoughts: I have no interest in getting rid of our integration tests. I think we absolutely should have integration tests and more of our systems should have them. Also to be clear, everything that follows (with the exception of one or two things that block us from getting images built) is my opinion. I'm not doing day to day development on the project and so I don't want to impose great burdens or constraints, but I also have a duty of care as the one shepherding it towards production.

However, I think it is a fundamental antipattern that our tests default to/our only testing approach is doing an integration test, and for the prior art I think the AQS 1.0 testing approach is also not the best approach. I even tried to improve this process in https://gerrit.wikimedia.org/r/c/analytics/aqs/+/679295 as part of the Cassandra 3 migration, without much success. The AQS testing approach is an improvement on the previous testing approach of using a deployment of AQS in WMCS which was prone to lagging behind in version and consistency.

These tests test overall functionality and that expected inputs lead to expected results, which is great, but they are far from unit tests by the classical definition - breaking down the service into units, be they functions, objects or whole sections. We're not doing that. But I strongly believe we can and should have both. As we are keeping our options open on augmenting and expanding the service, having the tests in place will help us avoid regressions and side-effects of changes to our approaches on a smaller scale than just our expected API behaviours.

Unit test implementation:

As regards reorganising the code - this is entirely up to the developers. I am not making any hard demands here as regards any of this stuff (although there are things that will block us from prod), just stating opinions.

If we decide to implement unit testing, it's up to the team to decide the extent to which the unit tests are implemented. We don't have to test *all* functionality, or somehow replicate the degree of testing we get from the integration tests. The two should complement each other rather than compete with or replace each other.

Druid/testing against prod:
I am strongly against the idea of tests querying production data sources in general, and moreso against the idea of it happening in an automated fashion. It would be really nice if we could come up with a replacement for this as I didn't know this was convention and it makes me itchy. This seems like a prime candidate for mocking.

On mocking in general:
This is where we get towards fundamental differences between unit and integration tests. An integration test will use the real service, and a unit test will mock whatever is required to get the job done. I agree that there is a significant up-front cost in mocking, but unfortunately I also think this is the debt we're paying off by not including testing as part of our development plan. I don't mean this as a criticism, it's just where we currently are!

Mocking might be less representative, but that's not really what it's trying to do, imo. We should be making a best effort, but the ultimate goal is not to mimic production data so much as the expected behaviours of a system (like Cassandra). If we have a significant volume of mock data in the service (which I don't think we will), we can strip it out before deployment - if need be we can completely remove all test code before building our final image.

CI approach:
I have asked Releng for feedback on whether running a second container is an option within our CI system. I have no real knowledge of whether it's possible, but I haven't seen it yet. If it is an option, we will need to optimise our approach to a considerable degree - having one repo for tests and code would be ideal (though that isn't entirely mandatory - we will need to do some creative Makefile work), a single command or a small number of commands for running the containers, and the ability to start, run, and then tear down the testing environment automatically, cleanly, and without human interaction.

Thanks for the thoughts! Some snippage for brevity in the following quotes.

everything that follows (with the exception of one or two things that block us from getting images built) is my opinion. I'm not doing day to day development on the project and so I don't want to impose great burdens or constraints, but I also have a duty of care as the one shepherding it towards production.

This is very much appreciated! Developers often try to wear both the "drive this project to completion" hat and the "don't make bad decisions in the name of expediency" hat, and it is quite hard to wear both those at the same time. The perspective of someone not involved with day-to-day coding is very valuable.

However, I think it is a fundamental antipattern that our tests default to/our only testing approach is doing an integration test, and for the prior art I think the AQS 1.0 testing approach is also not the best approach.

You're probably right. I don't want to discount the unit tests in aqsassist, where much of the functionality resides, but your point still stands. Among other things, the aqsassist tests are no help in making sure that we're using that package's functionality correctly in the services.

I even tried to improve this process in https://gerrit.wikimedia.org/r/c/analytics/aqs/+/679295 as part of the Cassandra 3 migration, without much success.

I had not seen that before. Cool. :) Sorry to hear you didn't have much success with it.

These tests test overall functionality and that expected inputs lead to expected results, which is great, but they are far from unit tests by the classical definition - breaking down the service into units, be they functions, objects or whole sections. We're not doing that. But I strongly believe we can and should have both. As we are keeping our options open on augmenting and expanding the service, having the tests in place will help us avoid regressions and side-effects of changes to our approaches on a smaller scale than just our expected API behaviours.

No arguments to any of that.

If we decide to implement unit testing, it's up to the team to decide the extent to which the unit tests are implemented. We don't have to test *all* functionality, or somehow replicate the degree of testing we get from the integration tests. The two should complement each other rather than compete with or replace each other.

This is a good point, and we'll want to find that balance. See below for some formative thoughts to start that conversation.

Druid/testing against prod:
I am strongly against the idea of tests querying production data sources in general, and moreso against the idea of it happening in an automated fashion. It would be really nice if we could come up with a replacement for this as I didn't know this was convention and it makes me itchy.

I agree completely. I mostly put that possibility in the task description so someone could go on record and say it was a terrible idea. :)

On mocking in general:
This is where we get towards fundamental differences between unit and integration tests. An integration test will use the real service, and a unit test will mock whatever is required to get the job done. I agree that there is a significant up-front cost in mocking, but unfortunately I also think this is the debt we're paying off by not including testing as part of our development plan. I don't mean this as a criticism, it's just where we currently are!

That's fair.

This is a place where @SGupta-WMF's work separating the code into layers may provide benefit. Without turning this into an implementation ticket, each endpoint (excluding boilerplate like healthz and metrics) has three layers:

  1. handler, in a xxx_handler.go file, which invokes the logic layer
  2. logic, in a xxx_logic.go file, which invokes the datastore
  3. entity, in a xxx file, which contains no functionality and doesn't invoke anything

The handler layer could be unit tested by mocking the logic layer. No datastore mocking required.
The logic layer would require mocking the datastore in some fashion, but because this layer is mostly concerned with the structure of the response, rather than the actual values, we might get sufficient value from straightforward mocking.
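
To make that concrete, here is a minimal sketch of what a handler-level unit test could look like with the logic layer faked out. All names here (PageviewsLogic, PageviewItem, Handler, and the package name) are hypothetical placeholders for illustration, not the actual service code, and the real signatures will differ:

```go
package pageviews

// Condensed sketch: in practice the handler would live in xxx_handler.go and the
// test in xxx_handler_test.go. All names here are illustrative placeholders.

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

// PageviewsLogic is the seam between the handler and logic layers.
// The real logic implementation satisfies it; tests supply a fake.
type PageviewsLogic interface {
	PageviewsPerArticle(project, article string) ([]PageviewItem, error)
}

type PageviewItem struct {
	Project string `json:"project"`
	Article string `json:"article"`
	Views   int    `json:"views"`
}

// Handler depends only on the interface, so it can be exercised with no datastore.
type Handler struct {
	Logic PageviewsLogic
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	items, err := h.Logic.PageviewsPerArticle(
		r.URL.Query().Get("project"), r.URL.Query().Get("article"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	json.NewEncoder(w).Encode(map[string][]PageviewItem{"items": items})
}

// fakeLogic is the test double standing in for the logic layer.
type fakeLogic struct{}

func (fakeLogic) PageviewsPerArticle(project, article string) ([]PageviewItem, error) {
	return []PageviewItem{{Project: project, Article: article, Views: 42}}, nil
}

func TestHandlerReturnsExpectedSchema(t *testing.T) {
	h := &Handler{Logic: fakeLogic{}}
	req := httptest.NewRequest(http.MethodGet, "/?project=en.wikipedia&article=Foo", nil)
	rec := httptest.NewRecorder()

	h.ServeHTTP(rec, req)

	if rec.Code != http.StatusOK {
		t.Fatalf("expected 200, got %d", rec.Code)
	}
	var body map[string][]PageviewItem
	if err := json.NewDecoder(rec.Body).Decode(&body); err != nil {
		t.Fatalf("response was not valid JSON: %v", err)
	}
	if len(body["items"]) != 1 {
		t.Fatalf("expected exactly one item, got %d", len(body["items"]))
	}
}
```

A test along these lines exercises parameter handling and response shape with a plain "go test" run and no containers, which is exactly the property the CI pipeline needs.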

Mocking might be less representative, but that's not really what it's trying to do, imo. We should be making a best effort, but the ultimate goal is not to mimic production data so much as the expected behaviours of a system (like Cassandra). If we have a significant volume of mock data in the service (which I don't think we will), we can strip it out before deployment - if need be we can completely remove all test code before building our final image.

Maybe this is where we can start defining the line between the "unit" and "integration" tests. Would it provide sufficient value if the unit tests focus on making sure the service responds as desired given data of the proper schema (and in the negative case returns expected errors when given data of an unexpected schema)? That would leave the integration tests to focus on data values rather than data structure.

CI approach:
I have asked Releng for feedback on whether running a second container is an option within our CI system.

Thank you. I'm very curious what is possible.

If it is an option, we will need to optimise our approach to a considerable degree - having one repo for tests and code would be ideal (though that isn't entirely mandatory - we will need to do some creative Makefile work), a single command or a small number of commands for running the containers, and the ability to start, run, and then tear down the testing environment automatically, cleanly, and without human interaction.

For sure. I've long been of the opinion that, even if our current testing environments could be used in CI, their current form was unsuitable. That's part of why I've deferred moving them to better repos - I didn't think we knew enough yet to properly understand what we really wanted, and I didn't want to move them twice.

Summary of the synchronous meeting we had today. Many of the task subscribers were in attendance:

  • We do see value in separate in-service unit testing vs in-service integration testing
  • The unit tests should execute without an actual data store via mocking (and therefore should not depend on the current testing environments)
  • The integration tests can rely on an external data store (i.e. some variation of the current testing environments)
  • While there would be benefits to a hosted dev/staging data store to test against, that also presents challenges and risks. For example, if it is down, that affects all developers. There may be possibilities there, but we do not plan to pursue them in the short term.
  • The existing layers seem helpful for isolating pieces of the code to unit test via mocking
  • We may also want an additional layer between the logic and the database to more conveniently allow mocking of that interaction
  • It seems reasonable, at least for now, for the portion of unit testing that requires database mocking to focus on data structures/schema (which we can hopefully achieve with minimal mocking), while integration testing can do more with actual data values
  • The CI pipeline stages that run on normal code push (i.e. everything that happens before +2) are not suitable for spinning up testing containers and therefore unsuitable for integration tests
  • There exists a post-merge pipeline stage where we could spin up testing containers and run integration tests
  • Some of this is gerrit-specific. While GitLab may eventually change some things, we can explore that when GitLab is closer to ready.
  • We did not talk much about the external integration testing suite, other than to say we could discuss it as a separate topic

The implementation plan is therefore:

  • create handler unit tests (this can be done right now, with no prerequisites)
  • create logic unit tests (this may require introduction of a database abstraction layer; see the sketch after this list)
  • adjust the makefile to allow execution of unit tests, integration tests, or both
  • make any necessary changes to the testing environments to make them suitable for use in the CI post-merge step (details TBD)
  • set up the CI pipeline to execute integration tests as a post-merge step
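
For the database abstraction layer item, here is a rough sketch of what that seam might look like. The names and signatures are hypothetical and the real interface would depend on how the Cassandra/Druid access code is actually structured; this is only meant to show the shape of the idea:

```go
package pageviews

// Sketch only: names and signatures are illustrative, not the services' actual API.

import "context"

// PageviewsStore is a narrow seam between the logic layer and the datastore.
// The production implementation would wrap the real Cassandra (or Druid) client;
// logic unit tests substitute an in-memory fake and never need a testing container.
type PageviewsStore interface {
	PageviewsPerArticle(ctx context.Context, project, article, start, end string) ([]PageviewRow, error)
}

type PageviewRow struct {
	Timestamp string
	Views     int
}

// Logic depends on the interface rather than on a driver session directly.
type Logic struct {
	Store PageviewsStore
}

// fakeStore returns canned rows (or a canned error) for logic-layer unit tests.
type fakeStore struct {
	rows []PageviewRow
	err  error
}

func (f fakeStore) PageviewsPerArticle(ctx context.Context, project, article, start, end string) ([]PageviewRow, error) {
	return f.rows, f.err
}
```

With a seam like this in place, the logic-layer unit tests can focus on the structure of the response built from canned rows, leaving the post-merge integration tests (and the external suite) to exercise real data values.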

Please correct anything I got wrong, or add anything I missed.

I'm assuming that the most CI-friendly way to arrange the makefile is for "make test" to execute only unit tests. Executing integration tests (or both unit and integration) would require additional params or make targets. Am I assuming correctly?
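
For what it's worth, one common Go convention for that split (offered as a sketch of one possible arrangement, not a decision) is to guard the container-dependent tests with a build tag, so that the plain "go test ./..." behind "make test" runs only the unit tests, while a separate make target supplies the tag after starting the testing environment. The file and target names below are hypothetical:

```go
//go:build integration

// Sketch of a hypothetical xxx_integration_test.go. This file compiles and runs
// only when the "integration" tag is supplied, e.g.:
//
//	go test -tags=integration ./...
//
// which a separate make target (say, "make integration-test") could invoke after
// bringing up the Docker/Cassandra testing environment. A plain `go test ./...`
// from "make test" skips it entirely.

package pageviews

import "testing"

func TestPageviewsAgainstTestingEnvironment(t *testing.T) {
	// ...query the local testing container and compare against canned data...
	t.Skip("placeholder body for the sketch")
}
```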

I'll create implementation tasks and run them by the group, especially @SGupta-WMF , who has done some Go mocking before.

Relevant comment by @thcipriani in Slack:

  • This is the .pipeline/config.yaml magic to run [[ URL | helm test ]] post merge: mathoid:.pipeline/config.yaml
    • You need deploy: {test: true} as well as deploy: {chart: {name: <whatever>}}
  • That uses charts from operations/deployment-charts repo
    • Here's mathoid's service-checker example: mathoid-chart:templates/tests