
Editor Analytics: Get new data for AQS Druid test environment
Closed, Resolved (Public)

Description

The Druid test environment was already set up in T317803: AQS 2.0: Extract production testing data for Druid-based endpoints. What we need now is new data so we can test the editors and edits endpoints, because the data in the current environment does not have enough variety (for example, we need data from different dates).

This is related to T336382: Edit Analytics: Create Druid test data environment (and perhaps the two tickets should be merged into one), because both services (editors and edit) will query the same Druid dataset.

Event Timeline

JArguello-WMF renamed this task from Create Druid test data environment to Editor Analytics: Create Druid test data environment. May 10 2023, 2:30 PM
Sfaci renamed this task from Editor Analytics: Create Druid test data environment to Editor Analytics: Get new data for AQS Druid test environment. Jun 5 2023, 9:22 AM
Sfaci updated the task description.
Sfaci updated the task description.

So far we have fetched new data from production Druid to populate the aqs-druid-test-environment. This data covers all the events that occurred during April 2023 for the 'ab.wikipedia' project (around 6,000 rows).

Sfaci triaged this task as Medium priority.
Sfaci edited projects, added AQS2.0 (Sprint 10); removed AQS2.0.
Sfaci moved this task from Next Up to In Progress on the AQS2.0 (Sprint 10) board.

@BPirkle I think the next step here is to bring the new dataset into the current druid-docker-test-env project. We could either replace the existing dataset or add the new one as another option to ingest when you create the image (by manually modifying the ingest.sh script, for example).

The required JSON and CSV data files are ready to use. I have been using them with the environment during development of the editors service.

Should we consider this ticket and T336382: Edit Analytics: Create Druid test data environment as the same one?

Sfaci updated the task description.
Sfaci updated the task description.

To improve data variety (so we can properly test searches across dates with monthly granularity, and across different projects at once using the 'all-projects' keyword available in the edit service), I have added more data to the sample dataset. We now have all the data for the 'ab.wikipedia' and 'zu.wikipedia' projects for three entire months (03, 04 and 05).

Regarding size, all this data comes to just under 3 MB (around 23k rows).
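As a rough illustration of the coverage described above, here is a minimal Python sketch that checks whether a CSV sample has rows for every expected project and month. The column names `project` and `event_timestamp`, the ISO timestamp format, the year 2023 for all three months, and the sample rows themselves are assumptions made for the example; they are not taken from the real data files.

```python
import csv
import io
from collections import defaultdict


def coverage_by_project(csv_text, project_col="project", ts_col="event_timestamp"):
    """Map each project to the set of 'YYYY-MM' months that appear in its rows.

    The column names are hypothetical; adjust them to the real file layout.
    """
    months = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # ISO timestamps start with 'YYYY-MM', so the first 7 characters are the month.
        months[row[project_col]].add(row[ts_col][:7])
    return dict(months)


def missing_coverage(coverage, projects, months):
    """Return the sorted (project, month) pairs that have no rows at all."""
    return sorted(
        (p, m) for p in projects for m in months if m not in coverage.get(p, set())
    )


# Made-up rows, only to exercise the check. zu.wikipedia deliberately lacks April.
SAMPLE = """project,event_timestamp
ab.wikipedia,2023-03-02T10:00:00Z
ab.wikipedia,2023-04-15T08:30:00Z
ab.wikipedia,2023-05-20T12:00:00Z
zu.wikipedia,2023-03-11T09:00:00Z
zu.wikipedia,2023-05-01T00:00:00Z
"""

EXPECTED_PROJECTS = ["ab.wikipedia", "zu.wikipedia"]
EXPECTED_MONTHS = ["2023-03", "2023-04", "2023-05"]  # year assumed to be 2023

gaps = missing_coverage(coverage_by_project(SAMPLE), EXPECTED_PROJECTS, EXPECTED_MONTHS)
print(gaps)  # [('zu.wikipedia', '2023-04')]
```

An empty list would mean every project has at least one row in each month, which is the property needed to test monthly granularity and 'all-projects' queries.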

I think we could modify aqs-docker-druid-test-env to include this dataset, replacing the existing one, which only has some data for a single day.

Should we consider this ticket and T336382: Edit Analytics: Create Druid test data environment as the same one?

I think so.

I think we could modify aqs-docker-druid-test-env to include this dataset by replacing the existing one that only has some data for the same day

I agree.

This is in "Ready for Code Review", but I'm not sure what to review. Are we still going to use https://github.com/bpirkle/aqs-docker-druid-test-env for now? I know that's not a great URL/repo, and we can change it whenever you're ready. But since you're in the middle of coding, and we're about to do the team transition, maybe it makes sense to do that repo move after July 1? If we're keeping the current repo for now, do you want to open a pull request with the new dataset? If you've done something different already and there's something at a different location for me to review, let me know.

I put this in "Ready for Code Review" just to flag that we needed to review this task and decide what to do with the dataset, so we can update the environment and move the task forward.

I'll open a pull request with the new dataset and an improvement I have made: removing the explicit 10-minute wait, which does not seem to be necessary.
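The comment above only says the fixed wait was removed. If some readiness gate turns out to be needed after all, one common alternative to a fixed sleep is to poll a readiness check with a timeout. The following Python sketch shows that pattern in isolation; `wait_until` and `fake_check` are hypothetical names, and the real check (for example, asking the Druid service whether the ingested datasource is queryable) is deliberately not shown because the thread does not describe it.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    Returns True as soon as check() succeeds, False on timeout.
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in for a real readiness probe: it succeeds on the third call.
attempts = {"n": 0}


def fake_check():
    attempts["n"] += 1
    return attempts["n"] >= 3


ready = wait_until(fake_check, timeout_s=10, interval_s=0, sleep=lambda s: None)
print(ready, attempts["n"])  # True 3
```

With this shape the environment proceeds as soon as the check passes, and the old 10-minute figure becomes only an upper bound rather than a fixed cost.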

I am still using the same repo but, anyway, I think you are right and it's probably a good idea to move it to the right place. Maybe we can do that after this pull request, because I think the environment will then be finished and ready to use.