
Define a procedure/pattern to populate test environments
Open, Needs Triage, Public

Description

Goal

The purpose of this task is to find a way/pattern to fetch or generate data for populating test environments, for use not only in AQS 2.0 but in any future project. We cannot count on production access to get it.

Current status
  • At this moment we have created two dockerized test environments (Cassandra and Druid) for AQS 2.0 so that we can test our developments. We are populating these environments with data fetched directly from production (from Cassandra and Druid), but we have been told that's not best practice (and we fully agree). Access to production data should be restricted, and the data sometimes includes PII, so we shouldn't rely on this approach.
  • Our plan B could be to generate mock data by creating specific scripts but, with this approach, we'd have to address new challenges:
    • Scripts could be too complex and could add significant workload. We have created a sample script for a specific AQS 2.0 use case and it was really simple, but we expect to encounter more complex use cases in this and future projects. In this case we could compare the generated data with some production data we fetched in the past, but we won't be able to do that for every use case.
    • Regardless of script complexity, how do we know the generated mock data is good enough? Depending on the specific dataset we have to generate, it could be difficult to determine whether the data is well generated. Should we count on Data Engineering support?
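One possible way to approach the "good enough" question without production access is to write the expectations down as explicit checks that a data owner can review. The following is only a minimal sketch: the function name, the column names (project, views) and the thresholds are all assumptions for illustration, not something taken from the AQS 2.0 repositories.

```python
def check_dataset(rows, min_rows, min_distinct, ranges):
    """Return a list of human-readable problems found in generated rows.

    rows: list of dicts, one per generated record.
    min_distinct: {column: minimum number of distinct values expected}.
    ranges: {column: (low, high)} inclusive bounds for numeric columns.
    All column names and thresholds are illustrative assumptions.
    """
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, expected at least {min_rows}")
    for column, needed in min_distinct.items():
        distinct = len({row[column] for row in rows})
        if distinct < needed:
            problems.append(
                f"{column}: {distinct} distinct values, expected at least {needed}"
            )
    for column, (low, high) in ranges.items():
        bad = sum(1 for row in rows if not low <= row[column] <= high)
        if bad:
            problems.append(f"{column}: {bad} values outside [{low}, {high}]")
    return problems


# Hypothetical generated rows, including one deliberately invalid value.
SAMPLE_ROWS = [
    {"project": "en.wikipedia", "views": 120},
    {"project": "es.wikipedia", "views": 35},
    {"project": "en.wikipedia", "views": -4},
]

problems = check_dataset(
    SAMPLE_ROWS,
    min_rows=3,
    min_distinct={"project": 2},
    ranges={"views": (0, 10**9)},
)
```

Checks like these don't prove the data resembles production, but they give whoever owns the real dataset something concrete to confirm or correct.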
TODO
  • Identification of new ways to fetch/generate data to populate test environments
  • Definition of a "standard" org-wide process/pattern to create/generate/fetch data for test environments for any current and future project
  • Documentation for the new process/pattern
  • Modification of the current test environments
Documentation/References

Event Timeline

Sfaci added a subscriber: BPirkle.

Hi @Sfaci, can you please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown)? That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about it when watching a related project tag. Thanks!

BPirkle added a subscriber: Eevans.

I'm tagging this as AQS 2.0, because that's what we're working on as we consider it. However, per the task description, this has much broader considerations. I'm also tagging several teams for visibility, in case anyone has thoughts, opinions, knows of existing art, or just wants to follow along. I'm casting a very wide net, and I doubt all of you have interest or capacity. So feel free to untag yourself if you like. Alternatively, feel free to add anyone you think should be tagged.

A few thoughts, links, etc. follow in no particular order.

So far, we're only actually using the Cassandra env. The Druid one will be necessary for a service we haven't started development on yet (but will soon). The Druid container is a lot rougher, and the testing data it includes is really just for proof of concept.

One challenge we face is automated testing in CI. We do not currently have the ability to spin up testing containers on every push, nor do we have dev datastores to hit from CI. This means that any tests that run before code review occurs must mock the data being tested against from within the service itself. This is fine and appropriate for tightly focused unit tests. However, for a service that exists primarily to serve large amounts of data from a datastore, it would also be comforting to do integration tests that confirm the service performs its main reason for existence. In theory, we have the ability to spin up test containers post-merge, but before the image is actually deployed to serve production traffic. Running automated integration tests at this stage would be a lot better than not running them at all. However, it is an awkward spot because it only happens after code is merged, so any errors found will block other deployments, buggy code will potentially be pulled down by other developers, etc.

Another challenge we face is complexity. Our current approach requires people to start the testing env (make startup) then execute a second command (make bootstrap) to import the data. This has repeatedly confused people, especially people such as QA that don't interact as closely with the code as the developers do. We're exploring creating/publishing images with data already imported as part of the Druid container, but that's really formative.

FWIW, yes we know that our current repo locations are terrible. They're just placeholders until we can decide on a longer-term approach.

It is possible that we don't do separate testing environments at all. We could just mock all this directly from within the services. That has a lot of advantages, mostly by eliminating all the problems that come with depending on an external environment. However, it has some drawbacks too. For example, we've found the separation of concerns that comes with having an external environment helpful, and it seems like this approach could allow people with different skillsets to contribute to our testing systems, whereas mocking from within the service restricts contribution to people who know how to do that.
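To make the in-service option concrete, here is a rough sketch of what mocking from within a service can look like: the service logic depends on a small store interface, and tests swap in an in-memory fake. Every name here (PageviewStore, InMemoryPageviewStore, total_views) is hypothetical and the example is in Python purely for illustration; it is not AQS 2.0 code.

```python
from typing import Protocol


class PageviewStore(Protocol):
    """Hypothetical read interface the service depends on."""

    def daily_views(self, project: str, article: str) -> list[tuple[str, int]]: ...


class InMemoryPageviewStore:
    """Test double that serves canned rows instead of querying a datastore."""

    def __init__(self, rows):
        # rows: {(project, article): [(date, views), ...]}
        self._rows = rows

    def daily_views(self, project, article):
        return list(self._rows.get((project, article), []))


def total_views(store: PageviewStore, project: str, article: str) -> int:
    """Example service logic: it only sees the interface, not the backend."""
    return sum(views for _date, views in store.daily_views(project, article))


# Canned data defined next to the test, with no external environment needed.
fake = InMemoryPageviewStore(
    {("en.wikipedia", "Banana"): [("20230101", 10), ("20230102", 15)]}
)
```

This keeps tests fast and CI-friendly, but it never exercises the real query path against Cassandra or Druid, and the canned data lives in code that only people familiar with the service can comfortably edit.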

It is possible this is at least partially solved by people outside WMF. We'd rather not create a custom system that we have to maintain if there is something already out there. Suggestions welcome.

We're unsure how best to apply lessons learned from anything we're doing to other teams. But we don't want to miss an opportunity to help others, establish useful patterns etc. Again, suggestions welcome.

T263489: AQS 2.0 Main AQS 2.0 task, mostly for context
T288160: Development and test environments for AQS 2.0 Cassandra services Epic task where we got started on all this
T328969: AQS 2.0: Revisit in-service testing approach This is when we realized we had sufficient CI challenges to revamp our approach
T334130: Access to AQS keyspaces for cassandra (There's a relevant conversation between myself and @Eevans in the comments)
T332172: AQS 2.0: Page Analytics: add testing data for legacy endpoint to Cassandra testing env An example of where we had to go back for additional testing data
T316849: Audit tests for Druid-based endpoints Inventory of the tests in current production AQS (there are almost 200) that we want to match
T317790: <Spike> AQS 2.0 Testing Plan General testing context

akosiaris subscribed.

I am moving this one to serviceops-radar as we are interested to see how this pans out, but I am not sure how we can currently contribute to this. Feel free to undo. I'd love to chat if you want to.

I would like to lay out here the different approaches I have in mind for generating sample data to populate a test environment, to get feedback from anyone who has other ideas, has something already running somewhere, or just has something to say about it. Please feel free to comment, criticize, or propose alternatives. I'm sure there are a lot of things I don't know, and some of the ideas here may turn out to be meaningless.

A. Fetch data from production

We know this is not the best idea to populate a test environment. I am just adding it here because, even so, it’s an existing way and because it’s the way we are populating our test environment for cassandra right now (there is no PII).

B. Mock up the data creating our own scripts for every specific dataset we need

We have been exploring this approach and it sometimes seems quite feasible. Although the purpose is to find a generic pattern, we are currently focused on the AQS datasets and, in this case, it was really easy to generate mock data for a specific dataset. But I think we can assume not all datasets will be so easy to mock up: writing the script is the easiest part and, depending on the dataset, we will need to do extra work to be able to guarantee the data is well formed. There are some caveats if we decide to go this way:

  • How can we know that the generated data has enough quality (variety, size, specific values for specific fields, constraints, ...) if we can’t access the production environment? Can we assume that, at least, the schema and sample data are documented somewhere? For this specific case, I couldn’t find that information.
  • Creating these scripts could sometimes be too difficult and add extra workload to the project.

But, as an advantage, we could reuse a lot of the work done here (scripting code, patterns to generate data, ...) and even the part related to creating the environment (Dockerfile/docker-compose). It may be harder at the beginning, or any time we need to create something for the first time, but we could focus on unifying efforts around a single “framework” to mock up data for any service with some kind of parametrization. We would probably need someone from the data team (or someone who “owns” the production data) to confirm that the data has the proper quality when we cannot confirm it ourselves, but that seems reasonable.
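As a rough illustration of what "some kind of parametrization" could mean, the sketch below describes a dataset as a mapping from column name to a small generator function, so the same code can produce different datasets by swapping the spec. It uses only the Python standard library. The column names, value lists and timestamp format are assumptions made up for the example, not the real AQS schemas.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical dataset description: column name -> function(rng, row_index) -> value.
PAGEVIEWS_SPEC = {
    "project": lambda rng, i: rng.choice(
        ["en.wikipedia", "es.wikipedia", "de.wikipedia"]
    ),
    "article": lambda rng, i: f"Article_{rng.randint(1, 50)}",
    "timestamp": lambda rng, i: (
        date(2023, 1, 1) + timedelta(days=i % 31)
    ).strftime("%Y%m%d00"),
    "views": lambda rng, i: rng.randint(0, 10_000),
}


def generate_rows(spec, count, seed=0):
    """Build `count` rows from a column spec; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [
        {name: make(rng, i) for name, make in spec.items()} for i in range(count)
    ]


def to_csv(rows):
    """Serialize rows to CSV text, e.g. for an import step to load."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_rows(PAGEVIEWS_SPEC, count=100, seed=42)
csv_text = to_csv(rows)
```

Seeding the generator means the same spec always yields the same file, which makes test failures reproducible and lets the generated CSVs be regenerated rather than stored.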

C. Using data dumps

I have seen that data dumps exist for specific databases or datasets and that they are generated periodically. But it seems dumps are not generated for every dataset or database (e.g.: I didn’t find any with the same schemas as the cassandra tables we are using for the AQS project). This approach is pretty similar to the first one but, since data dumps already exist, I wondered whether there is something here we could take advantage of. I have some questions about data dumps that I hope somebody can answer:

  • Could I assume there is a data dump somewhere for any existing dataset?
  • In case we couldn't assume that, could we request them, or could an automatic task be created to generate them in the same way as the existing ones?

And some comments about the environment itself

Regardless of the way we get the test data, we are assuming that we’ll create a docker image with the appropriate database engine (Cassandra, Druid, ...) and the mock data as a test/stage environment for any specific project. The main purpose of this task is to find the best way to generate or get mock data, but it could be interesting to discuss different points of view about this topic as well.

@Htriedman perhaps some of the data privacy techniques could be used here to generate test data from the dumps?

Hi all! This is a really interesting problem, and I think that there are definitely some data privacy techniques that seem like they could be useful here — primarily, differentially-private data synthesis.

This is a very open area of research, and there are a bunch of different ways of generating synthetic private datasets that look and behave relatively similarly to the real thing. As an example, this open source library implements 7 techniques for private data synthesis.

We don't have an established process for this (we've never done it at WMF before), and I have a couple of questions before I feel like we can move forward:

  • These datasets are not intended to be public, right? The idea is to have various test datasets that can be used as an internal test backend of AQS2.0. The actual datasets will be publicly accessible through AQS2.0 after the testing is successful, and this data will not be accessible outside of a test instance, and only be public if there's a security flaw or something. Am I understanding the idea correctly?
  • From a columnar perspective, how large will these datasets need to be? The computational resources required to generate good synthetic data scales nonlinearly with the number of columns. Are we talking about datasets with 40 columns, or 4?
  • How often would this process need to be done? Each computation is expensive and takes a long time, so if this is going to be a once-a-week thing this process might not be feasible. Once a month/quarter might be more doable.

Hope this helps!

Hi @Htriedman! Thank you very much for providing this interesting library. I have to take a look at it because I was thinking about fake data and what you bring here is a better approach and something new for me.

I'll try to answer your questions:

  • These datasets are not intended to be public, right? The idea is to have various test datasets that can be used as an internal test backend of AQS2.0. The actual datasets will be publicly accessible through AQS2.0 after the testing is successful, and this data will not be accessible outside of a test instance, and only be public if there's a security flaw or something. Am I understanding the idea correctly?

To be honest, I hadn’t thought about it because I thought that was not a problem. Currently we have a (public) docker environment with some data in csv files (fetching data from production) to create a docker image to use as a test/stage environment. As you mentioned, data will be publicly accessible through AQS 2.0 services but I didn’t think about whether this data should be publicly available as a dataset. Currently that data is public because everything we need to create this environment is available in an open GitLab repository. So far I thought we could do something similar with fake/synthetic data. Do you have something in mind regarding making these datasets public (or not)?

  • From a columnar perspective, how large will these datasets need to be? The computational resources required to generate good synthetic data scales nonlinearly with the number of columns. Are we talking about datasets with 40 columns, or 4?

Currently, with AQS 2.0, we are talking about 10 columns. But who knows whether, in the future with new projects, we could need to create larger datasets. Anyway, as I mention below, we won't need to run the process frequently, so data size shouldn't be a big issue.

  • How often would this process need to be done? Each computation is expensive and takes a long time, so if this is going to be a once-a-week thing this process might not be feasible. Once a month/quarter might be more doable.

I think the process wouldn't need to be done frequently. Once you have created good datasets for your test environment, I guess you can enjoy them for a long time. If your data requirements don’t change, you wouldn't need to re-generate your data. Besides that, data size would be less critical

From a columnar perspective, how large will these datasets need to be? The computational resources required to generate good synthetic data scales nonlinearly with the number of columns. Are we talking about datasets with 40 columns, or 4?

Currently, with AQS 2.0, we are talking about 10 columns. But who knows whether, in the future with new projects, we could need to create larger datasets. Anyway, as I mention below, we won't need to run the process frequently, so data size shouldn't be a big issue.

Great, I think this should be doable then.

How often would this process need to be done? Each computation is expensive and takes a long time, so if this is going to be a once-a-week thing this process might not be feasible. Once a month/quarter might be more doable.

I think the process wouldn't need to be done frequently. Once you have created good datasets for your test environment, I guess you can enjoy them for a long time. If your data requirements don’t change, you wouldn't need to re-generate your data. Besides that, data size would be less critical

Also good to know.

These datasets are not intended to be public, right? The idea is to have various test datasets that can be used as an internal test backend of AQS2.0. The actual datasets will be publicly accessible through AQS2.0 after the testing is successful, and this data will not be accessible outside of a test instance, and only be public if there's a security flaw or something. Am I understanding the idea correctly?

To be honest, I hadn’t thought about it because I thought that was not a problem. Currently we have a (public) docker environment with some data in csv files (fetching data from production) to create a docker image to use as a test/stage environment. As you mentioned, data will be publicly accessible through AQS 2.0 services but I didn’t think about whether this data should be publicly available as a dataset. Currently that data is public because everything we need to create this environment is available in an open GitLab repository. So far I thought we could do something similar with fake/synthetic data. Do you have something in mind regarding making these datasets public (or not)?

I don't think I'm following what you mean here. My question is this — if the docker environment is public, the gitlab repository is public, the current test data csvs are public, and the production data is public, why do we need synthetic or fake data at all? It seems like we could just take a large representative sample of production data and use that for testing. What am I missing here?

[ ... ]

I don't think I'm following what you mean here. My question is this — if the docker environment is public, the gitlab repository is public, the current test data csvs are public, and the production data is public, why do we need synthetic or fake data at all? It seems like we could just take a large representative sample of production data and use that for testing. What am I missing here?

In the context of tests for current AQS datasets, you can do that. We know that none of that existing data contains any PII; in fact, the datastore houses nothing more than a materialized representation of what the API is expected to return. In the context of where to get data for tests generally (see the ticket description), that assumption is eventually going to break down. When it breaks down, we run the danger of inadvertently exposing sensitive info if folks are in the habit of querying test data from production databases (and it's not always going to be immediately obvious). If we do adopt this as best practice, I think it needs to come with sufficient rigor to safeguard against this (which will undoubtedly make it less Simple/Easy™).

A static sample can also be limited in its utility; generated test data will often produce better results by more closely approximating all the permutations you'd see in a larger set.
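As a small illustration of the permutations point, generated data can enumerate every combination of dimension values, which a sample may only cover by luck. This is just a sketch: the dimension names and values below are hypothetical stand-ins, and a real list would come from whatever parameters the endpoint under test accepts.

```python
from itertools import product

# Hypothetical dimension values, made up for this example.
DIMENSIONS = {
    "project": ["en.wikipedia", "commons.wikimedia"],
    "access": ["all-access", "desktop", "mobile-web"],
    "agent": ["all-agents", "user", "spider"],
}


def all_combinations(dimensions):
    """Yield one dict per combination of dimension values."""
    names = list(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))


# 2 projects x 3 access types x 3 agent types = 18 combinations.
combos = list(all_combinations(DIMENSIONS))
```

Each combination can then be given generated values, so tests hit every branch of the dimension space instead of only the combinations that happened to be sampled.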

TL;DR Generating data would be more complicated in the nearer term, but as a standardized practice, it completely sidesteps any current/future concerns of PII leaks, and will likely result in better quality tests.

@Eevans Thanks so much for the clarification! This rationale makes a great deal of sense to me, and I can focus on trying to provide your team with a simple and repeatable script that can do this across a variety of underlying data sources.

For me to move forward on this (from an experimentation perspective), it would be a great help if someone could point me in the direction of an easy way of doing the following:

  1. query a representative sample of underlying data we want to synthesize
  2. wrangle that underlying data into a python pandas dataframe

Would that be doable?

Thank you @Eevans for clarifying that!

@Htriedman We are discussing generic patterns/procedures to mock up or synthesize data not only for AQS 2.0 but for any other project. We mention this project because we discovered the need while working on it, but the idea is to create something we can use for any future project where we need to mock up data. If I have understood correctly, the library you mentioned would use source data to create synthetic data, so we could reuse your script to synthesize data from other data sources, which sounds really interesting and useful. And it doesn't matter if we start by experimenting with the AQS 2.0 datasets, right?

Would it be useful for you if I provided you some data extracted from AQS 2.0 project to experiment with it? I have some datasets as CSV files that I think you can use to move forward.

I didn't realize that these CSV files are available at one of the repositories I mentioned in the ticket description (https://gitlab.wikimedia.org/frankie/aqs-docker-test-env). They are inside the test_data.d folder. You can take a look at them

if I have understood correctly, the library you mentioned would use source data to create synthetic data, so we could reuse your script to synthesize data from other data sources

This is correct!

these CSV files are available at one of the repositories I mentioned in the ticket description

Taking a look now, I'll probably get back to you about this sometime in the next week or two.

Thanks @Htriedman!
Don't hesitate to reach out to me if you need something

Did some basic experimentation on this front here: https://gitlab.wikimedia.org/htriedman/synth-data/

As a proof-of-concept, it should work ok on WMF stat machines with source data ≤5,000 rows (at least for non-JSONified outputs). Feel free to play around with the notebook in this repo and generate data from the existing public csvs to your liking — it's a relatively simple API.

For the moment, though, please do not use this in production to create synthetic datasets of private or sensitive data. This repo is meant to be a PoC for us to experiment with, and I would want to be more hands-on and careful for a real deployment of synthetic sensitive data.

How fast @Htriedman!
Ok, I'll start playing with this first approach and I'll let you know how it's working for our datasets. I understand it's a PoC, so we won't use it for production purposes.
Thanks!

@Htriedman I have some questions about the project you created:

  • Which python version are you using? Currently I'm using 3.11 but I get some errors while installing requirements. I have tried also with 3.10 and it seems it works better but, at the end, there are some errors while compiling some libraries and installation fails
  • In the requirements.txt file you provided to install dependencies there are some absolute paths (e.g: @ file:///tmp/build/80754af9/anyio_1644481697509/work/dist) that I guess I should remove, right?
  • I have some issues with some specific versions for some packages. Do I have to keep in mind something before installing these requirements? For example with anaconda-client: (ERROR: Could not find a version that satisfies the requirement anaconda-client==1.7.2 (from versions: 1.1.1, 1.2.2))

Thanks!

@Sfaci I ran this on stat1006 using the conda-created-stacked, conda-activate-stacked, and conda-deactivate-stacked built-in scripts. Are you using stat machines and conda?

If so, I believe the python version should be 3.7.13, and (I think) the requirements.txt absolute paths should be hard-coded into your conda environment. Let me know if this is helpful!

I'm not using stat machines. I didn't realize I could use them. I don't know why, but I thought they could only be used for specific purposes, so I was working on my laptop.
I'll try! Thank you very much!

Hi @Htriedman, first of all, thank you for your help to run your experiment!

I'm now able to run it, so I can generate synthetic data from a real sample dataset, and it seems to work fine (at least from my beginner's point of view on synthetic data), so it could be really useful for creating test environments. I have even tried it with different datasets. So, I have some new questions:

  1. What do you consider our next steps would be with this approach (using sample data as the initial source)? I ask that because you mention we shouldn’t use it yet in production environment with private or sensitive data so I guess we need to work more on it (to anonymize, for example). It’s not the case at this moment but it’s something we should explore for the future
  2. On the other hand, I think we should consider other cases in which we won’t have any sample data because, maybe, we don’t have access to that data. What could we do in these cases? Is there any way to generate synthetic data from some constraints, parameters or similar? In this case, would we be just talking about mocking data?
  3. Regarding the cases in which datasets contain JSON as a row value, I have seen that the generated synthetic dataset doesn't keep this structure and assigns a simple value to that row (if we have a filesJSON row, the generated row is file and only contains a filename). Maybe it is not finished?

Thanks!

  1. What do you consider our next steps would be with this approach (using sample data as the initial source)? I ask that because you mention we shouldn’t use it yet in production environment with private or sensitive data so I guess we need to work more on it (to anonymize, for example). It’s not the case at this moment but it’s something we should explore for the future

Next step would be to reach out to me whenever your team needs to synthesize private or sensitive data. I'll take charge of private data synthesis, at least initially.

  2. On the other hand, I think we should consider other cases in which we won’t have any sample data because, maybe, we don’t have access to that data. What could we do in these cases? Is there any way to generate synthetic data from some constraints, parameters or similar? In this case, would we be just talking about mocking data?

I think in this case, you'd be better off just synthetically creating a dataset using a python script or something. The methodology I've laid out here only works when you have access to the source data.

  3. Regarding the cases in which datasets contain JSON as a row value, I have seen that the generated synthetic dataset doesn't keep this structure and assigns a simple value to that row (if we have a filesJSON row, the generated row is file and only contains a filename). Maybe it is not finished?

This data generator can only handle numerical, categorical, or ordinal data — JSON data does not fall into any of those data types. However, it is possible to unpack the JSON data into numerical, categorical, and ordinal rows, and that's what I've done at the bottom of the jupyter notebook (see the section entitled "case when row of the csv is a JSON object").
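For readers without the notebook at hand, the unpacking idea can be sketched roughly as follows. This is not the notebook's code: the helper name, the "referer" column and the keys inside the JSON are assumptions, and only the filesJSON column name comes from the discussion above. The sketch turns each top-level JSON key into its own flat column, which a generator limited to numerical, categorical, or ordinal data could then consume.

```python
import json


def flatten_json_column(rows, column, prefix=None):
    """Replace a JSON-string column with one flat column per top-level key.

    Nested values are left JSON-encoded; only top-level keys are unpacked.
    """
    prefix = prefix or column
    flattened = []
    for row in rows:
        # Copy every column except the JSON one.
        out = {key: value for key, value in row.items() if key != column}
        payload = json.loads(row[column])
        for key, value in payload.items():
            if isinstance(value, (dict, list)):
                value = json.dumps(value, sort_keys=True)
            out[f"{prefix}_{key}"] = value
        flattened.append(out)
    return flattened


# Hypothetical input: the key names inside the JSON are assumptions.
SOURCE_ROWS = [
    {"referer": "en.wikipedia", "filesJSON": '{"file": "Example.jpg", "requests": 12}'},
]
flat_rows = flatten_json_column(SOURCE_ROWS, "filesJSON", prefix="files")
```

After synthesis, the flat columns would need to be packed back into a JSON value if the test environment expects the original shape.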

Just to summarize and collect what we have discussed and learned so far:

  • We have discovered a new way to generate data, synthetic data, that sounds really interesting. A sample script to generate synthetic data from an existing dataset is available here. We could count on @Htriedman to remove any sensitive information from the generated one:
    • The main advantage is that we can create a new (synthetic) dataset from a real one without having to customize the script for every different case.
    • The main caveat is that we'd need some initial data (real data) to generate the synthetic one from it and, at this moment, that's something we're trying to avoid.
  • We have also explored a bit more how to generate mock data (often also called synthetic data) using libraries such as mimesis to generate realistic data. Some samples using that library are available here.
    • The main advantage is that we could generate data from scratch. No production access or existing real data would be needed. We wouldn't even need to deal with privacy concerns.
    • The main caveat is related to the fact that we would need to create one script for each different case. Maybe we could create a generic script valid to generate different types of datasets. I assume we would have to modify it often to adapt it to any new circumstance not yet considered.

That being said, I think mock libraries (e.g.: mimesis) are something we should keep exploring. We wouldn't need any existing data and, regarding the caveat I mentioned, we have to consider that these libraries come with a lot of predefined data ready to load and use. I think that could minimize the effort needed to create a generic and useful script as a pattern for generating sample data.

Any other ideas I have missed? Any other comment? Any other point of view?

I would be strongly in favor of using mock data over synthetic data, at least for the moment. We should only have an explicit preference for synthetic data if there's a real need for the underlying statistical distribution of the fake data to mirror that of the real data. If it's just for performance testing, that shouldn't be necessary.

Jrbranaa subscribed.

Added Catalyst tag for test environment need/requirement awareness.