Query additional sample data for AQS testing
Closed, Resolved · Public

Description

From T343273#9316165:

I'm sorry to bother you again. I'm here to ask for new data: we need some new pageviews data to debug a bug we have in production (it seems to be related to specific dates). An endpoint is failing for a specific date range, and we need to add a couple of years to the script so it fetches that data.

The script is https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs-docker-cassandra-test-env/-/blob/more-data-top-pageviews/scripts/local_group_default_T_top_pageviews.sh?ref_type=heads. It's in a new branch I created for this change; I just added 2021 and 2022 to the year range so it fetches data for those dates.

Thank you very much!


See also: T343273: Query AQS sample data for integration testing

Event Timeline

Eevans triaged this task as High priority. (Nov 9 2023, 3:35 PM)
Eevans updated the task description.

The file is quite large, so I've placed it at https://people.wikimedia.org/~eevans/local_group_default_T_top_pageviews.csv for download. Let me know when you have it so that I can clean it up.

As a matter of policy and/or best practice —and with apologies to those who've already heard me make this point elsewhere— I don't think we should be querying data from production databases to populate test environments like this. For one, I don't believe that a (static, randomly selected) subset of data will produce the best results. Even contrived, hand-crafted values have the potential to better test our code, since the developer creating the tests can deliberately choose values that both meet and exceed each constraint. More importantly, though, it risks potential exposure of sensitive data. I understand that this shouldn't be a concern for this particular dataset, but it will be true of many, and it's something we cannot be too careful about.

What I propose is this:

If we deem this to be a necessary practice, and something we intend to continue, then we should follow the precedent established by Wiki Replicas and create a pipeline to sanitize and replicate data (or subsets thereof) from production systems to contributor-accessible systems for development and testing. Personally, I think the effort required to do this would be substantial, and the time better spent elsewhere, especially given that (as stated above) the results are expected to be poorer than for curated or pseudo-randomly generated faux data.

Thoughts?

First of all, I want to say that I appreciate your support (helping us with the data and trying to improve all this), and I totally agree with you. We know this is not a best practice and that, even without any privacy concerns around this data, the right approach is not to fetch it just to populate our local test environment. In fact, that isn't really our purpose: we take the opportunity to use it to populate the local test environment, but what we are really trying to do by fetching this data is to understand how it's structured and which edge cases we can expect to find. In fact, most of the time we are just reacting to an unexpected situation. For example:

  • With the last sample data you provided, we found that the top_pageviews dataset stores article information in two different fields (articles and articlesJSON): for some rows the data is in the articles field and, for others, it is in articlesJSON. That was really unexpected and something we didn't know; data from 2022-01 to 2022-10 seems to be the affected range. We got an "unexpected end of JSON input" error when requesting that date range, because we were looking for the data in the wrong field. The fix is an if statement that looks for the value in the first field and, if it isn't there, falls back to the other (a sketch of this fallback follows this list). I guess this is due to some error when ingesting data into Cassandra, or something similar, but it's something I would never have imagined without looking at the real dataset.
  • A few days ago I reached out to you to ask for data from the mediarequest_top_files dataset. In doing so, we learned how file paths are stored in Cassandra: some punctuation marks are URL-decoded and others are stored as they are (or vice versa, or something similar; I don't remember exactly). That was something we didn't know, and I don't know if there would have been any other way to learn it. I mean something like this: Angkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm
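
To make the first bullet concrete, here is a minimal Go sketch of the described fallback. The row and field names (rawRow, Articles, ArticlesJSON) and the article shape are assumptions for illustration, not the actual AQS schema or code:

```
package main

import (
	"encoding/json"
	"fmt"
)

// rawRow is a hypothetical projection of a top_pageviews row; the real
// Cassandra schema may differ. Depending on the month, the article list
// arrives in either the articles or the articlesJSON column.
type rawRow struct {
	Articles     string // populated for most rows
	ArticlesJSON string // populated instead for, e.g., 2022-01 through 2022-10
}

type article struct {
	Article string `json:"article"`
	Views   int    `json:"views"`
	Rank    int    `json:"rank"`
}

// decodeArticles tries the primary field first and falls back to the other.
// Unmarshalling an empty string is what yields Go's
// "unexpected end of JSON input" error, so the fallback avoids it.
func decodeArticles(r rawRow) ([]article, error) {
	payload := r.Articles
	if payload == "" {
		payload = r.ArticlesJSON
	}
	var out []article
	if err := json.Unmarshal([]byte(payload), &out); err != nil {
		return nil, fmt.Errorf("decoding articles: %w", err)
	}
	return out, nil
}

func main() {
	// A row where only the secondary field is populated still decodes.
	row := rawRow{ArticlesJSON: `[{"article":"Main_Page","views":1000,"rank":1}]`}
	fmt.Println(decodeArticles(row))
}
```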

I also want to say that we took to heart your earlier comments about looking for other ways to populate our local test environment. After talking with you about this for the first time a while ago, we started a task to explore other ways to create sample data for this (and other) projects: T334851: Define a procedure/pattern to populate test environments. I explored a couple of good libraries for creating mock data and found that they work really well; for some of the simple datasets we use (unique_devices, for example), it was even easier to do it that way than to fetch the data from production, and we could create any specific case we wanted. With just 40 lines of code I could create that dataset for unique-devices (https://gitlab.wikimedia.org/sfaci/mock-data-generator). And thanks to Hal Triedman we learned how to work with synthetic data to create more realistic datasets (https://gitlab.wikimedia.org/htriedman/synth-data/) (maybe even more sophisticated than what we need). Of course, I think that approach is the best we can afford right now (the best would be a real test environment), but it doesn't cover our need to know exactly how the datasets are structured, with all the particular cases like the ones I described above.
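
As a taste of what the mock-data approach looks like, here is a minimal Go sketch that emits CQL INSERT statements for a unique_devices-style table. The keyspace, table, and column names are assumptions for illustration; the real schema may differ:

```
package main

import (
	"fmt"
	"math/rand"
)

// Emits CQL INSERTs for a unique_devices-style table. The keyspace,
// table, and column names below are guesses for illustration only.
func main() {
	projects := []string{"en.wikipedia", "es.wikipedia", "de.wikipedia"}
	for _, project := range projects {
		for month := 1; month <= 12; month++ {
			// Random counts are fine for bulk rows; edge cases (zeros,
			// huge values, odd projects) can be appended by hand.
			devices := rand.Intn(1_000_000)
			fmt.Printf("INSERT INTO \"local_group_default_T_unique_devices\".data "+
				"(project, granularity, timestamp, devices) "+
				"VALUES ('%s', 'monthly', '2022%02d01', %d);\n",
				project, month, devices)
		}
	}
}
```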

Having said that, as I mentioned before, I totally agree with you that we have to create mock/synthetic data to populate our test environments, for all the reasons you've explained. So, is there any way to get all the details about how the data is stored/structured, and to be informed about specific wrong/unusual cases like the ones I described above? If so, I think it would be perfectly feasible to create our own mock/synthetic data to populate the local test environment. We may need some extra effort at the beginning to create all the needed scripts but, in the end, we can save a lot of time for a lot of people, avoid surprises, and even avoid bothering people.

In addition to (or instead of) what I said above, could it make sense to have read-only access to the Cassandra cluster (without the purpose of fetching that data, or using it directly, to populate our local test environment)? We could use everything we learn about the data to improve our mock/synthetic data generator scripts and get ahead of errors/surprises related to the data.

Thank you both for following up. Outside of the schema, does any documentation exist for how the datasets are actually structured, @Milimetric?

Is this debt that we have to pay down, or is it inherent in the underlying data? Could this be a use case for DataHub, at least in terms of documenting how to work with the datasets? Or should we be reconsidering the datasets themselves (this seems to be mostly about pageviews) and cleaning them up?

... We take the opportunity to use it to populate the local test environment, but what we are really trying to do by fetching this data is to understand how it's structured and which edge cases we can expect to find. In fact, most of the time we are just reacting to an unexpected situation. For example:

  • With the last sample data you provided, we found that the top_pageviews dataset stores article information in two different fields (articles and articlesJSON): for some rows the data is in the articles field and, for others, it is in articlesJSON. That was really unexpected and something we didn't know; data from 2022-01 to 2022-10 seems to be the affected range. We got an "unexpected end of JSON input" error when requesting that date range, because we were looking for the data in the wrong field. The fix is an if statement that looks for the value in the first field and, if it isn't there, falls back to the other. I guess this is due to some error when ingesting data into Cassandra, or something similar, but it's something I would never have imagined without looking at the real dataset.
  • A few days ago I reached out to you to ask for data from the mediarequest_top_files dataset. In doing so, we learned how file paths are stored in Cassandra: some punctuation marks are URL-decoded and others are stored as they are (or vice versa, or something similar; I don't remember exactly). That was something we didn't know, and I don't know if there would have been any other way to learn it. I mean something like this: Angkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm

In both of these cases, wouldn't it be better to use the code (legacy AQS and/or analytics) to suss out the contract, rather than using queries to reverse-engineer it on a reactive basis?

[ ... ]

In addition to (or instead of) what I said above, could it make sense to have read-only access to the Cassandra cluster (without the purpose of fetching that data, or using it directly, to populate our local test environment)? We could use everything we learn about the data to improve our mock/synthetic data generator scripts and get ahead of errors/surprises related to the data.

That might be a possibility. I assume that would require everything that is already required for obtaining access to the analytics network, plus an additional process for request, approval, creation (on-boarding), and removal (off-boarding). We have no such process right now.

... We take the opportunity to use it to populate the local test environment, but what we are really trying to do by fetching this data is to understand how it's structured and which edge cases we can expect to find. In fact, most of the time we are just reacting to an unexpected situation. For example:

  • With the last sample data you provided, we found that the top_pageviews dataset stores article information in two different fields (articles and articlesJSON). [ ... ]
  • A few days ago I reached out to you to ask for data from the mediarequest_top_files dataset. In doing so, we learned how file paths are stored in Cassandra. [ ... ]

In both of these cases, wouldn't it be better to use the code (legacy AQS and/or analytics) to suss out the contract, rather than using queries to reverse-engineer it on a reactive basis?

Sure! In these specific cases we could use the code to check some specific details. Even so, sometimes we didn't expect these issues and didn't notice them (our fault, of course), and at the same time we needed the data just to confirm what we were seeing in the code. But I meant it in general: the common situation in which you have to build a new project using new or existing datasets you don't have access to. How do we deal with situations like that? It's not common to have the code from the previous version and only have to rewrite it using a different tech stack.

In addition to (or instead of) what I said above, could it make sense to have read-only access to the Cassandra cluster? [ ... ]

That might be a possibility. I assume that would require everything that is already required for obtaining access to the analytics network, plus an additional process for request, approval, creation (on-boarding), and removal (off-boarding). We have no such process right now.

How can we move forward with that? Is there someone specific who has to create it?

... We take the opportunity to use it to populate the local test environment, but what we are really trying to do by fetching this data is to understand how it's structured and which edge cases we can expect to find. In fact, most of the time we are just reacting to an unexpected situation. For example:

  • With the last sample data you provided, we found that the top_pageviews dataset stores article information in two different fields (articles and articlesJSON). [ ... ]
  • A few days ago I reached out to you to ask for data from the mediarequest_top_files dataset. In doing so, we learned how file paths are stored in Cassandra. [ ... ]

In both of these cases, wouldn't it be better to use the code (legacy AQS and/or analytics) to suss out the contract, rather than using queries to reverse-engineer it on a reactive basis?

Sure! In these specific cases we could use the code to check some specific details. Even so, sometimes we didn't expect these issues and didn't notice them (our fault, of course), and at the same time we needed the data just to confirm what we were seeing in the code. But I meant it in general: the common situation in which you have to build a new project using new or existing datasets you don't have access to. How do we deal with situations like that? It's not common to have the code from the previous version and only have to rewrite it using a different tech stack.

Let's try a different tack:

What would you be proposing if the dataset in question did contain PII? How would you propose to solve the problems you've articulated here?

I'm not trying to be obtuse; I realize that this dataset doesn't contain PII, but it cannot be true that there is no other way. Ultimately, what I want to get at is: a) does this use case warrant an exception, and if so, b) why? Exceptions are Bad™ and should be avoided, so why should we make one here —and perhaps more importantly— what are the criteria (i.e. what would we use in subsequent requests to decide whether to do so again)?

In addition to (or instead of) what I said above, could it make sense to have read-only access to the Cassandra cluster? [ ... ]

That might be a possibility. I assume that would require everything that is already required for obtaining access to the analytics network, plus an additional process for request, approval, creation (on-boarding), and removal (off-boarding). We have no such process right now.

How can we move forward with that? Is there someone specific who has to create it?

My feelings on this are similar to what I've written above. I'm more interested in determining whether this is something generally useful/needed, or a solution to the particular problem(s) in front of us. If it's the former, I'd like us to stake out the who, what, and why, and do this right; if it's the latter —if it's something exceptional that we're doing just for this— then I'd like us to be clear about why we need a one-off, the scope, the timeline, etc.

Let's try a different tack:

What would you be proposing if the dataset in question did contain PII? How would you propose to solve the problems you've articulated here?

I'm not trying to be obtuse; I realize that this dataset doesn't contain PII, but it cannot be true that there is no other way. Ultimately, what I want to get at is: a) does this use case warrant an exception, and if so, b) why? Exceptions are Bad™ and should be avoided, so why should we make one here —and perhaps more importantly— what are the criteria (i.e. what would we use in subsequent requests to decide whether to do so again)?

In addition to (or instead of) what I said above, could it make sense to have read-only access to the Cassandra cluster? [ ... ]

I'm sorry, I think I explained myself poorly. I wasn't proposing an exception based on cases like the ones I described before; I was just trying to explain why we did it. I totally agree that it's not the right way.

To avoid exceptions like this, I think we need good, extensive documentation; that's why I asked about it a few comments ago, and I think that's the definitive solution. I'm pretty new here, and maybe there is some documentation or process I don't know about; sometimes I'm not even sure what I can and can't ask for. But I think that, with the right documentation, we could address mock/synthetic data generation without problems, using libraries like the ones I mentioned before. I guess we would need the dataset model description (datasets, fields, data types, ...) and information about exceptions or known unexpected issues we could run into in the production environment (like the cases I've mentioned). That way we could even mock those cases, anticipating anything by generating accurate data to populate our local test environments (a sketch of what such a descriptor could look like follows below).

Is that something we could address?
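
For concreteness, here is a minimal Go sketch of what such a machine-readable dataset descriptor could look like: schema plus documented quirks, so a generator can mock both the happy path and the known edge cases. Every name here is hypothetical; it is not an existing tool or schema:

```
package main

import "fmt"

// FieldSpec and DatasetSpec sketch a machine-readable descriptor of the
// kind proposed above. All names are hypothetical; no such tool exists yet.
type FieldSpec struct {
	Name string
	Type string // e.g. "text", "int", "timestamp"
}

type DatasetSpec struct {
	Name        string
	Fields      []FieldSpec
	KnownQuirks []string // the "wrong/unusual cases" discussed in this thread
}

func main() {
	spec := DatasetSpec{
		Name: "top_pageviews",
		Fields: []FieldSpec{
			{Name: "articles", Type: "text"},
			{Name: "articlesJSON", Type: "text"},
		},
		KnownQuirks: []string{
			"rows from 2022-01 to 2022-10 populate articlesJSON instead of articles",
			"file paths may be only partially URL-encoded (cf. mediarequest_top_files)",
		},
	}
	fmt.Printf("%+v\n", spec)
}
```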

The original scope of this ticket was a very specific request to retrieve data, and that request has been met, so I'll close this ticket now.

I'm not sure where we landed on the wider discussion of sourcing test data, or whether there is more work to follow up on. If so, please feel free to open a ticket for that.

It is worth noting that there is various work in the pipeline to make the cassandra-dev cluster more of a de facto staging environment (e.g. T328778, T355920, and T355730). While I would hate to see anyone (ab)using that to copy data off for use in tests, it may go some way toward providing a view into (subsets of) the production databases when necessary.