Query AQS sample data for integration testing
Closed, Resolved · Public

Description

From @Sfaci via Slack:

As Will mentioned to you some days ago, we are working on improving the data we use to populate our Cassandra test environment. We have modified the scripts we use to fetch that data, and we need your help to launch them and send us the output (a CSV file for each script). Of course, we are aware that this is not the right pattern, and we are already working on better ways to do this in the future.

They are in https://gitlab.wikimedia.org/frankie/aqs-docker-test-env/-/tree/data-scripts-update. There are 11 scripts, and each one generates a CSV file as output with the same name but a different extension. No parameters or redirections are needed to execute them; just launch them one by one. They have been checked and should work fine, but let me know if something is wrong and I'll check it. Each script just launches cqlsh to run its query, so you need to configure credentials in the cqlshrc file beforehand. I'm sure you already know that, since you taught us to do it that way; I mention it just to confirm how the scripts are built.
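For context, each script presumably follows a pattern like the sketch below: run a CQL export through cqlsh (which picks up credentials from the cqlshrc file) and write the rows to a CSV named after the script. The host name, keyspace, and table here are placeholders, not the real scripts' contents:

```
#!/usr/bin/env bash
# Minimal sketch of one data-fetch script (all names are assumptions).
# cqlsh reads its credentials from ~/.cassandra/cqlshrc, so the script can be
# launched with no parameters or output redirections.

# COPY ... TO exports the table's rows as CSV; WITH HEADER adds column names.
cqlsh cassandra.example.org -e \
  "COPY \"local_group_default_T_top_pageviews\".data TO 'local_group_default_T_top_pageviews.csv' WITH HEADER = TRUE;"
```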

Event Timeline

Eevans triaged this task as Medium priority.

The scripts have all been run, and the CSV files placed at: https://people.wikimedia.org/~eevans/T343273/

@Sfaci, let me know when you've retrieved these so I can clean up this directory.

From @Sfaci via Slack:

Hi Eric! You can remove the data. I have downloaded all the files except the biggest one (surprisingly huge: 4.5 GB). I can't download it, and in any case we have to modify something in the script that generates it, because we don't want to fetch that amount of data. I will ping you again to relaunch that script (and maybe some others). Sorry for that; hopefully it will be just one more time. Anyway, you can remove all this data for now. Thank you very much!

Hi Eric,
Could you launch the following scripts again? We have modified them to reduce the amount of data they fetch:

  • local_group_default_T_mediarequest_top_files.sh
  • local_group_default_T_top_pageviews.sh

Both have been updated in the repo under the data-scripts-update branch (https://gitlab.wikimedia.org/frankie/aqs-docker-test-env/-/tree/data-scripts-update).

Thank you very much!

Ok, see:

https://people.wikimedia.org/~eevans/T343273/local_group_default_T_mediarequest_top_files.csv and
https://people.wikimedia.org/~eevans/T343273/local_group_default_T_top_pageviews.csv

Hi Eric,

We have modified the previous scripts again to fix an issue we found:

  • local_group_default_T_mediarequest_top_files.sh
  • local_group_default_T_top_pageviews.sh

Both have been updated in the repo under the data-scripts-update branch (https://gitlab.wikimedia.org/frankie/aqs-docker-test-env/-/tree/data-scripts-update).

Could you launch them again?

Thank you very much!

The updated files have been placed at: https://people.wikimedia.org/~eevans/T343273/

Thank you very much, Eric. I have downloaded them; you can remove them if you need to.

Hi @Eevans!

Could you run just the local_group_default_T_pageviews_per_article_flat.sh script? A new version is available in this branch, in the scripts folder.

Thank you very much!

This one is relatively small so I've attached it here:

Thank you very much, @Eevans!! It's just what we needed.

Hi @Eevans! Me again, asking for some more sample data from Cassandra.

We need to add new data about file paths to our sample mediarequest_per_file dataset to test some edge cases. I have modified the script, and we need you to run it again to fetch the data. The script is at https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs-docker-cassandra-test-env/-/blob/adding-files-media-per-file/scripts/local_group_default_T_mediarequest_per_file.sh
(Just in case: keep in mind that this is a new location; we have moved the repo out of the personal one.)
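For illustration, a per-file fetch could look roughly like the following sketch; the loop, the column name, and the host are assumptions, not the actual script's contents:

```
#!/usr/bin/env bash
# Hypothetical per-file fetch loop (the column name file_path and the query
# shape are assumptions; the real script is
# local_group_default_T_mediarequest_per_file.sh in the branch linked above).
files=(
  '/wikipedia/commons/1/1c/Manhattan_Bridge_Construction_1909.jpg'
  '/wikipedia/commons/b/bd/Titan_(moon).ogg'
)
for f in "${files[@]}"; do
  # Emit one SELECT per file; cqlsh executes them all in a single session.
  echo "SELECT * FROM \"local_group_default_T_mediarequest_per_file\".data WHERE file_path = '$f' ALLOW FILTERING;"
done | cqlsh cassandra.example.org > local_group_default_T_mediarequest_per_file.csv
```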

Thank you very much!!

Eevans added a subscriber: KOfori.

Attached here:

Thanks @Eevans!

But there is something wrong in the script I prepared. Can you please take a look at it to see why we are not getting data for all the items? I have looked at the last CSV, and there is information about only some of the files I wanted. The purpose of the latest version of the script is to get data about the following files:

'/wikipedia/commons/1/1c/Manhattan_Bridge_Construction_1909.jpg'
'/wikipedia/commons/6/60/The_Earth_4K_Extended_Edition.webm'
'/wikipedia/commons/b/bd/Titan_(moon).ogg'
'/wikipedia/commons/7/7e/NPS_craters-of-the-moon-map.pdf'
'/wikipedia/commons/4/47/Catedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg'
'/wikipedia/commons/0/0e/Angkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm'
'/wikipedia/commons/e/ef/AB_Tacksfabriken_och_konf-fabr._AB_Viking._Trollh%C3%A4ttan_FiBs_serien_-_Nordiska_museet_-_NMAx.0001508.tif'

But I'm doing something wrong: the script only gets data for the files with easy names, where there are no punctuation marks (Manhattan Bridge*, The Earth*, and NPS_craters*). It seems to be related to how these characters are stored in the Cassandra dataset, but I'm not sure.

Thank you very much!

Actually, it was my bad: when following the link to the file, I didn't notice that it was from a branch. That said, it failed to run because printf was attempting to process %C (one of the URL-encoded characters) as a format specifier, so I changed it to echo.
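The pitfall reduces to a minimal example like this (the path is one of those listed above; the query shape is illustrative):

```
f='/wikipedia/commons/4/47/Catedral_de_la_Encarnaci%C3%B3n%2C_M%C3%A1laga%2C_Espa%C3%B1a%2C_2023-05-19%2C_DD_37-39_HDR.jpg'

# Broken: the path lands inside printf's format string, so printf tries to
# interpret %C3, %2C, ... as format specifiers and errors out.
printf "SELECT * FROM data WHERE file = '$f';\n"

# Safe: echo emits the string verbatim. Alternatively, keep the data out of
# the format string by passing it as an argument to %s.
echo "SELECT * FROM data WHERE file = '$f';"
printf "SELECT * FROM data WHERE file = '%s';\n" "$f"
```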

Good news! I was really worried about not understanding how the data is encoded xD
The CSV file seems to contain all the data we wanted to fetch. I'll take note of the change you made so we can apply it to the script ourselves if we need to run it again in the future.
Thank you very much!!

Hi again @Eevans!
I'm sorry to bother you again; I'm here to ask for new data. We need some new pageviews data to debug a bug we have in production (it seems to be related to some specific dates). An endpoint is failing for a specific date range, and we need to add a couple of years to the script to fetch that data.

The script is https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs-docker-cassandra-test-env/-/blob/more-data-top-pageviews/scripts/local_group_default_T_top_pageviews.sh?ref_type=heads. It's in a new branch I have created for this change. I just added 2021 and 2022 to the year range to fetch data for those dates.
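Presumably that means extending the script's year loop along these lines (the query shape and column names are assumptions; only the two added years come from this comment):

```
# Hypothetical sketch of the year-range change: append rows for the two new
# years to the CSV the existing years already produced.
for year in 2021 2022; do
  echo "SELECT * FROM \"local_group_default_T_top_pageviews\".data WHERE year = '$year' ALLOW FILTERING;"
done | cqlsh cassandra.example.org >> local_group_default_T_top_pageviews.csv
```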

Thank you very much!

Hi @Sfaci,

I've been asked to open a new ticket for each of these requests, so I've created T350882 for this one (where I will follow up momentarily).