Goal
The purpose of this task is to find a way or pattern to fetch or generate data for populating test environments, for use not only by AQS 2.0 but by any future project. We cannot count on production access to get this data.
Current status
- We currently have two dockerized test environments for AQS 2.0 (Cassandra and Druid) that let us test our development work. We populate them with data fetched directly from production (from Cassandra and Druid), but we have been told this is not best practice, and we fully agree. Access to production data should be restricted, and the data sometimes includes PII, so we should not rely on this approach.
- Our plan B could be to generate mock data with purpose-built scripts but, with this approach, we would have to address new challenges:
  - Scripts could become too complex and add significant workload. We have created a sample script for one specific AQS 2.0 use case and it was really simple, but we expect to encounter more complex use cases in this and future projects. For that sample we could compare the generated data with production data we fetched in the past, but we will not be able to do so for every use case.
  - Regardless of script complexity, how do we know the generated mock data is good enough? Depending on the specific dataset we have to generate, it could be difficult to determine whether the data is well formed and realistic. Should we count on Data Engineering support?
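To make the plan B discussion more concrete, here is a minimal sketch of what a generation script plus a basic sanity check might look like. It is not the sample script mentioned above, and it does not reflect the real AQS 2.0 schema: the field names (`project`, `access`, `timestamp`, `views`), the value lists, and the helper names `generate_rows` and `validate` are all hypothetical and chosen only for illustration. The check compares generated rows against a small reference sample (field names and value range), which is one simple way the "good enough" question could be approached.

```python
import random
from datetime import date, timedelta

# Hypothetical schema and values, for illustration only;
# these are NOT the real AQS 2.0 tables or dimensions.
PROJECTS = ["en.wikipedia", "de.wikipedia", "es.wikipedia"]
ACCESS_METHODS = ["desktop", "mobile-web", "mobile-app"]


def generate_rows(n_days, start=date(2023, 1, 1), seed=42):
    """Generate one row per (day, project, access method) with random view counts."""
    rng = random.Random(seed)  # fixed seed, so the test data is reproducible
    rows = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        for project in PROJECTS:
            for access in ACCESS_METHODS:
                rows.append({
                    "project": project,
                    "access": access,
                    "timestamp": day.strftime("%Y%m%d00"),
                    "views": rng.randint(0, 1_000_000),
                })
    return rows


def validate(rows, reference):
    """Compare generated rows against a reference sample; return a list of problems."""
    problems = []
    expected_fields = set(reference[0])
    ref_views = [r["views"] for r in reference]
    low, high = min(ref_views), max(ref_views)
    for i, row in enumerate(rows):
        if set(row) != expected_fields:
            problems.append(f"row {i}: fields differ from reference")
        elif not low <= row["views"] <= high:
            problems.append(f"row {i}: views {row['views']} outside [{low}, {high}]")
    return problems
```

Because the generator is seeded, two runs with the same arguments produce identical rows, so test expectations stay stable across rebuilds of the environment. A range check like this only catches gross errors; whether the distribution of values is realistic enough is exactly the open question where Data Engineering input may be needed.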
TODO
- Identification of new ways to fetch/generate data to populate test environments
- Definition of a "standard" org-wide process/pattern to create/generate/fetch data for test environments for any current and future project
- Documentation for the new process/pattern
- Modification of the current test environments
Documentation/References
- AQS 2.0 Cassandra test environment: https://gitlab.wikimedia.org/frankie/aqs-docker-test-env
- AQS 2.0 Druid test environment: https://github.com/bpirkle/aqs-docker-druid-test-env
- Summary of what we have explored so far about creating test environments with mock and synthetic data: https://docs.google.com/document/d/1qgqJA_jK0KADVx3dEen3sD4zYGcAQ6qQTcYOXGBfDSo