[IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions
Closed, ResolvedPublic5 Estimated Story Points
Actions

Assigned To

Authored By

	cmelo
	Apr 27 2023, 5:36 PM

Description

Acceptance criteria

Tables will be created to match the following schema, plus appropriate indexes:

ce_question_answers
{
    ceqa_id: bigint, autoincrement,
    ceqa_event_id: integer,
    ceqa_user_id: integer,
    ceqa_question_id: integer,
    ceqa_answer_option: integer | nullable,
    ceqa_answer_text: string | nullable
}

ce_question_aggregation
{
    ceqag_id: bigint | autoincrement,
    ceqag_event_id: integer,
    ceqag_question_id: integer,
    ceqag_answer_option: integer,
    ceqag_answers_amount: integer
}

ce_event_questions
{
    ceeq_id: bigint | autoincrement,
    ceeq_event_id: integer,
    ceeq_question_id: integer
}

When this task is complete, a task for DBAs to create tables in production (ref: T336365)

Note that other schema changes for the "participant questions" epic are tracked elsewhere:

Adding the timestamps of when the organizers agreed to the clickwrap agreement to the ce_organizers table: T337768.
Storing when a participant first answered questions, and whether their answers have already been aggregated: T339982.

External dependencies

DBA availability

Other dependencies

Details

	Subject	Repo	Branch	Lines +/-
	Create new DB tables for participant questions	mediawiki/extensions/CampaignEvents	master	+279 -0

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Daimona	T341392 Mark the CampaignEvents extension as stable
Resolved	cmelo	T321822 [EPIC] Participant Questions - MVP
Resolved	Daimona	T339997 Create the tables for participant questions in beta
Resolved	Daimona	T340000 Create the tables for participant questions in prod
Resolved	Daimona	T335526 [IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions
Resolved	Daimona	T328032 [SPIKE] How will we handle storage of PII data?

Event Timeline

cmelo created this task.Apr 27 2023, 5:36 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 27 2023, 5:36 PM

We still need to decide what indexes we need, here are the possible indexes I could think of, please fell free to add your thoughts about them and make suggestions, I don't mean we need them all, it is just for us to start thinking about them, I am also going to ask help of a DBA as well on this.
ce_question_answers

ceqa_event_id
- Maybe useful when deleting PII data by event (we will remove the PII data after the event ends)

ceqa_event_id and ceqa_user_id
- This would be used when getting information about participants to display on the participant table. Now we only show username and registration data, in case a participant answer the affiliate question for example, this information should also be available there. I think this index could useful when join with the ce_participants table
ceqa_event_id, ceqa_question_id, ceqa_answer_option
- This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

ce_question_aggregation

ceqa_event_id
- Get all aggregate data per event
ceqag_question_id and ceqag_qustion_option_id
- This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like:

[
   question_id = > 1,
   options => [...]
   is_pii => true
]

When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

In T335526#8811550, @cmelo wrote:

We still need to decide what indexes we need, here are the possible indexes I could think of [...]
ce_question_answers

ceqa_event_id

This would be redundant with the one below ((ceqa_event_id, ceqa_user_id)).

ceqa_event_id, ceqa_question_id, ceqa_answer_option

This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

Did we make a final decision on aggregation, particularly re anonymization? As you pointed out, knowing that in advance would help.

ce_question_aggregation

ceqag_question_id and ceqag_qustion_option_id

This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I'm wondering if that is the case. You could have the same participant join multiple events, which would skew the data unless you have a way to filter out duplicates. And the only way to filter out duplicates is to group by event, I think.

In T335526#8811564, @cmelo wrote:

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like: [...]
When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

Being PII or not is a property that does not belong to the answer (which is what the table stores), but to the question. I think this should be stored wherever the question definition is stored (i.e., in the code). Putting it in this table would be an optimization, as you mentioned. Since we only expect a small set of questions, I think that would be a premature optimization, and maybe something to consider for later if we think it's necessary. At least that's what I can think of off the top of my head.

ldelench_wmf awarded a token.Apr 28 2023, 8:55 PM

cmelo updated the task description. (Show Details)May 4 2023, 12:43 PM

We will add the table ce_event_questions to store the questions of the event, for the first iteration of this feature organizers will not be able to add or remove questions, but since we will implement this in the future, we will add this table now and populate it with the 5 questions we have.

cmelo updated the task description. (Show Details)May 4 2023, 1:59 PM

cmelo renamed this task from [IMPLEMENTATION] - Create the DB tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation.May 5 2023, 12:02 AM

cmelo renamed this task from [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation.May 5 2023, 12:10 AM

MHorsey-WMF updated the task description. (Show Details)May 23 2023, 1:10 PM

Daimona edited projects, added Campaign-Tools (Campaign-Tools-Current-Sprint); removed Campaign-Tools.May 23 2023, 1:13 PM

Daimona moved this task from Upcoming / refining 💡 to Ready for development on the Campaign-Tools (Campaign-Tools-Current-Sprint) board.

Daimona renamed this task from [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions.May 24 2023, 9:39 AM

Daimona updated the task description. (Show Details)May 26 2023, 11:11 AM

Daimona updated the task description. (Show Details)May 30 2023, 12:59 PM

Daimona updated the task description. (Show Details)May 30 2023, 4:53 PM

Daimona updated the task description. (Show Details)May 30 2023, 5:09 PM

Daimona updated the task description. (Show Details)May 30 2023, 5:30 PM

I went through all the open tasks and wrote down a draft list of queries on these tables:

ce_event_questions

select ceeq_question_id from ce_event_questions where ceeq_event_id = X -- edit registration form, register for event or edit your info (to determine form fields to show)
insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ) -- enable/edit registration (add new questions)
select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)
delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y -- edit your registration info (grab current answers)
select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y -- edit your registration info (to know what to remove by PK)
delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ) -- edit your registration info (remove answers); also when deleting data after aggregation
insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y -- edit your registration info (insert/update answers)
select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X -- used when aggregating data. NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)
select ceqa_id from ce_question_answers where ceqa_event_id = X -- delete data after aggregation (to know what to remove by PK). NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)

ce_question_aggregation

select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X -- show aggregates for an event
insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ( ... ) -- inserting aggregates. NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)

Note that T328032 has an open question about when to delete the data (whether it should be based on the event dates, or the answer timestamp). Depending on the decision, we may need a new column and different queries. Also, one of the decisions in T328032 is that data will be aggregated all at once when the event ends, but if this decision is reversed and we choose to aggregate data progressively, then we will also need different queries.

Daimona added a subtask: T328032: [SPIKE] How will we handle storage of PII data?.May 30 2023, 5:46 PM

Daimona moved this task from Ready for development to Blocked ⛔ on the Campaign-Tools (Campaign-Tools-Current-Sprint) board.

Iflorez subscribed.Jun 7 2023, 6:22 PM

MHorsey-WMF updated the task description. (Show Details)Jun 15 2023, 1:32 PM

MHorsey-WMF set the point value for this task to 5.

MHorsey-WMF moved this task from Blocked ⛔ to Ready for development on the Campaign-Tools (Campaign-Tools-Current-Sprint) board.Jun 15 2023, 1:59 PM

Daimona mentioned this in T328032: [SPIKE] How will we handle storage of PII data?.Jun 15 2023, 5:22 PM

Daimona claimed this task.Jun 19 2023, 2:11 PM

Daimona moved this task from Ready for development to Development In Progress 💻 on the Campaign-Tools (Campaign-Tools-Current-Sprint) board.

In light of the new decisions recorded in T328032#8936008, the schema above can be considered final. We will need to store additional information, but that's going to be added to existing tables with a schema change. Therefore, here's the updated list of queries:

ce_event_questions

select ceeq_question_id from ce_event_questions where ceeq_event_id = X: edit registration form, register for event or edit your registration info (to determine what form fields to show)
insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ): enable/edit registration (add new questions). For the MVP, we will always an only insert 5 tuples of values per event
select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)
delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (grab current answers to show them in the form)
select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (to know what to remove by PK)
delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ): edit your registration info (remove answers). Also used when deleting data after aggregation.
insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y: edit your registration info (insert/update answers)
select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y FOR UPDATE: used when aggregating data for a specific user. The PK is needed for a subsequent DELETE, in an atomic transaction. Note that the value of Y could be obtained with a join on the ce_participants table, but then that table would also be locked, and I'm not sure if this is a good idea.
select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X FOR UPDATE: used when aggregating the remaining data because an event has ended. The PK is needed for a subsequent DELETE, in an atomic transaction.

ce_question_aggregation

select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X: show aggregates for an event
insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ... on duplicate key update ceqag_answers_amount = ceqag_answers_amount + X,: inserting/updating aggregates.

In T335526#8947228, @Daimona wrote:

In light of the new decisions recorded in T328032#8936008, the schema above can be considered final. We will need to store additional information, but that's going to be added to existing tables with a schema change. Therefore, here's the updated list of queries:

ce_event_questions

select ceeq_question_id from ce_event_questions where ceeq_event_id = X: edit registration form, register for event or edit your registration info (to determine what form fields to show)

insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ): enable/edit registration (add new questions). For the MVP, we will always an only insert 5 tuples of values per event

select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)

delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (grab current answers to show them in the form)

select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (to know what to remove by PK)

delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ): edit your registration info (remove answers). Also used when deleting data after aggregation.

insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y: edit your registration info (insert/update answers)

select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y FOR UPDATE: used when aggregating data for a specific user. The PK is needed for a subsequent DELETE, in an atomic transaction. Note that the value of Y could be obtained with a join on the ce_participants table, but then that table would also be locked, and I'm not sure if this is a good idea.

select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X FOR UPDATE: used when aggregating the remaining data because an event has ended. The PK is needed for a subsequent DELETE, in an atomic transaction.

ce_question_aggregation

select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X: show aggregates for an event

insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ... on duplicate key update ceqag_answers_amount = ceqag_answers_amount + X,: inserting/updating aggregates.

LGTM

Given the queries listed in T335526#8947228, I'm proposing the following indexes:

ce_event_questions

(ceeq_event_id, ceeq_question_id) UNIQUE: uniqueness constraint, used by queries that filter by event (and we never filter by question)

ce_question_answers

(ceqa_event_id, ceqa_user_id, ceqa_question_id) UNIQUE: needed by pretty much all queries, as they filter by event; some queries also filter by user and would use this index. The question ID is here only for the uniqueness.