Page MenuHomePhabricator

[IMPLEMENTATION] Back end: Create a maintenance script to aggregate and delete participant questions
Closed, ResolvedPublic

Description

See the decisions in T328032#8936008.

Acceptance criteria

  • An individual participant's answer must be aggregated after 90 days from first answer
  • All remaining unaggregated answers must be aggregated when the event ends
  • All answers must be deleted after aggregation, including those to non-PII questions
  • After aggregation and deletion, the relevant records in the ce_participants table should be updated with the aggregation timestamp (ce_participants.cep_aggregation_timestamp)
  • The script should be documented in the extension's manual.

Related Objects

Event Timeline

Daimona renamed this task from [IMPLEMENTATION] Back end: JOB to aggregate the data after the event is over or the first answer has more than X days (or the created_at has more than X days). to [IMPLEMENTATION] Back end: Create a job to aggregate and delete participant questions.Jun 20 2023, 11:20 PM
Daimona updated the task description. (Show Details)
Daimona updated the task description. (Show Details)
Daimona renamed this task from [IMPLEMENTATION] Back end: Create a job to aggregate and delete participant questions to [IMPLEMENTATION] Back end: Create a maintenance script to aggregate and delete participant questions.Jun 28 2023, 10:49 PM
Daimona updated the task description. (Show Details)

Implementation note: the queries done by this script should be fast enough. First, it should process rows in batches (let's say of 500 rows), using the primary key. It should scan the ce_question_answers table because it's the smallest (because data is deleted continuously). It should first gather a list of IDs for questions that are not PII (which is known by the application). It could do something like:

SELECT /* ... */ FROM ce_question_answers where ceqa_id >= X and ceqa_id < Y and ceqa_question_id in ( /* ... */ );
SELECT /* ... */ FROM ce_participants where cep_event_id in ( /* list from previous query */ ) and cep_user_id in ( /* list from previous query */ ) -- or a similar query using existing indexes; note that this particular query doesn't work as the would need to be filtered by tuples. Extract those whose first answer is older than 90 days. This could be cached for subsequent queries.
SELECT /* ... */ FROM campaign_events where event_id in ( /* list from first query */ ) -- exctract closed events. Unless we do this differently, on-demand. This could be cached for subsequent queries.
-- Filter the list of answers according to the 2 previous queries
-- Aggregate the data
DELETE FROM ce_question_answers where ceqa_id in ( /* ... */ )

Discussed todya with @cmelo and @MHorsey-WMF. Decisions:

  • We will add a flag to the campaign_events table to store whether all answers have already been aggregated for that event. This way, we will only need to check that flag, instead of scanning all the answers to see if there's anything left to aggregate.
  • We will make sure that concurrent aggregation jobs won't cause issues
  • We will always display the "stats" tab for events that have ended, but if the data hasn't been aggregated yet we will display a message like "Participant answers are being aggregated, check back later".
  • The script will be run more often, e.g., every 3 hours, to minimize the waits

Decision from today's design meeting: we will use the approach outlined in T336858#8988179.

Change 940422 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Introduce a maintenance script for aggregating participant answers

https://gerrit.wikimedia.org/r/940422

Change 940422 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Introduce a maintenance script for aggregating participant answers

https://gerrit.wikimedia.org/r/940422

This was done, and the script has been documented. Two things to note for QA:

  • Aggregates are not exposed anywhere in the UI or API. The only way to see them for now is by querying the ce_question_aggregation table manually.
  • The period of 90 days is not configurable and it may make testing inconvenient. You can change this locally by updating the value of AggregateParticipantAnswers::TTL_SEC to the time, in seconds, after which answers should be aggregated.
Acceptance criteria
  • An individual participant's answer must be aggregated after 90 days from first answer
  • All remaining unaggregated answers must be aggregated when the event ends

@Daimona I am able to test that the script runs correctly by running it manually, but these two AC points can't be completely tested right now it seems, correct? I did change the value of AggregateParticipantAnswers::TTL_SEC to 1 to test and everything is working as I expected it to.


✅ All answers must be deleted after aggregation, including those to non-PII questions

All answers in ce_question_answers are deleted after running the script
Screenshot 2023-08-29 at 1.49.48 AM.png (548×2 px, 339 KB)

✅ After aggregation and deletion, the relevant records in the ce_participants table should be updated with the aggregation timestamp (ce_participants.cep_aggregation_timestamp)

the users for the cep_event_id have a cep_aggregation_timestamp
Screenshot 2023-08-29 at 1.48.00 AM.png (1×2 px, 1 MB)

✅ The script should be documented in the extension's manual.

  • Script is documented and correctly runs
  • An individual participant's answer must be aggregated after 90 days from first answer
  • All remaining unaggregated answers must be aggregated when the event ends

@Daimona I am able to test that the script runs correctly by running it manually, but these two AC points can't be completely tested right now it seems, correct? I did change the value of AggregateParticipantAnswers::TTL_SEC to 1 to test and everything is working as I expected it to.

I think for the first one, the only thing you could test is that answers are aggregated after X time from first answer, where X is whatever you set as the TTL to replace the 90 days; and you already tested this.

For the second one, what you could do is:

  • Create an event, register, answer questions
  • Set the event end date to the past
  • Restore the TTL to its original value of 90 days, so that the answers you just provided cannot be deleted due to that
  • Run the script

And then verify that the participant's answers have been aggregated, not because of the 90 days (which haven't passed), but because the event has ended.

I think for the first one, the only thing you could test is that answers are aggregated after X time from first answer, where X is whatever you set as the TTL to replace the 90 days; and you already tested this.

For the second one, what you could do is:

  • Create an event, register, answer questions
  • Set the event end date to the past
  • Restore the TTL to its original value of 90 days, so that the answers you just provided cannot be deleted due to that
  • Run the script

And then verify that the participant's answers have been aggregated, not because of the 90 days (which haven't passed), but because the event has ended.

✅ All remaining unaggregated answers must be aggregated when the event ends

  • This is working as expected. Data is correctly deleted from ce_question_answers and aggregated into ce_question_aggregation with a cep_aggregation_timestamp added to the users in ce_participants when the script is run after the event ends. Moving to product sign off.
ifried subscribed.

As there is nothing user-facing to test here and Vaughn was able to confirm that the aggregation logic is working as expected in the back-end, I'm marking this ticket as Done.