Page MenuHomePhabricator

[IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions
Closed, ResolvedPublic5 Estimated Story Points

Description

Acceptance criteria

  • Tables will be created to match the following schema, plus appropriate indexes:
ce_question_answers
{
    ceqa_id: bigint, autoincrement,
    ceqa_event_id: integer,
    ceqa_user_id: integer,
    ceqa_question_id: integer,
    ceqa_answer_option: integer | nullable,
    ceqa_answer_text: string | nullable
}

ce_question_aggregation
{
    ceqag_id: bigint | autoincrement,
    ceqag_event_id: integer,
    ceqag_question_id: integer,
    ceqag_answer_option: integer,
    ceqag_answers_amount: integer
}

ce_event_questions
{
    ceeq_id: bigint | autoincrement,
    ceeq_event_id: integer,
    ceeq_question_id: integer
}
  • When this task is complete, a task for DBAs to create tables in production (ref: T336365)

Note that other schema changes for the "participant questions" epic are tracked elsewhere:

  • Adding the timestamps of when the organizers agreed to the clickwrap agreement to the ce_organizers table: T337768.
  • Storing when a participant first answered questions, and whether their answers have already been aggregated: T339982.
External dependencies
  • DBA availability
Other dependencies

Event Timeline

We still need to decide what indexes we need, here are the possible indexes I could think of, please fell free to add your thoughts about them and make suggestions, I don't mean we need them all, it is just for us to start thinking about them, I am also going to ask help of a DBA as well on this.
ce_question_answers

  • ceqa_event_id
    • Maybe useful when deleting PII data by event (we will remove the PII data after the event ends)
  • ceqa_event_id and ceqa_user_id
    • This would be used when getting information about participants to display on the participant table. Now we only show username and registration data, in case a participant answer the affiliate question for example, this information should also be available there. I think this index could useful when join with the ce_participants table
  • ceqa_event_id, ceqa_question_id, ceqa_answer_option
    • This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

ce_question_aggregation

  • ceqa_event_id
    • Get all aggregate data per event
  • ceqag_question_id and ceqag_qustion_option_id
    • This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like:

[
   question_id = > 1,
   options => [...]
   is_pii => true
]

When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

We still need to decide what indexes we need, here are the possible indexes I could think of [...]
ce_question_answers

  • ceqa_event_id

This would be redundant with the one below ((ceqa_event_id, ceqa_user_id)).

  • ceqa_event_id, ceqa_question_id, ceqa_answer_option
    • This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

Did we make a final decision on aggregation, particularly re anonymization? As you pointed out, knowing that in advance would help.

ce_question_aggregation

  • ceqag_question_id and ceqag_qustion_option_id
    • This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I'm wondering if that is the case. You could have the same participant join multiple events, which would skew the data unless you have a way to filter out duplicates. And the only way to filter out duplicates is to group by event, I think.

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like: [...]
When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

Being PII or not is a property that does not belong to the answer (which is what the table stores), but to the question. I think this should be stored wherever the question definition is stored (i.e., in the code). Putting it in this table would be an optimization, as you mentioned. Since we only expect a small set of questions, I think that would be a premature optimization, and maybe something to consider for later if we think it's necessary. At least that's what I can think of off the top of my head.

We will add the table ce_event_questions to store the questions of the event, for the first iteration of this feature organizers will not be able to add or remove questions, but since we will implement this in the future, we will add this table now and populate it with the 5 questions we have.

cmelo renamed this task from [IMPLEMENTATION] - Create the DB tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation.May 5 2023, 12:02 AM
cmelo renamed this task from [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation.May 5 2023, 12:10 AM
Daimona renamed this task from [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions.May 24 2023, 9:39 AM

I went through all the open tasks and wrote down a draft list of queries on these tables:

ce_event_questions

  • select ceeq_question_id from ce_event_questions where ceeq_event_id = X -- edit registration form, register for event or edit your info (to determine form fields to show)
  • insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ) -- enable/edit registration (add new questions)
  • select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)
  • delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

  • select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y -- edit your registration info (grab current answers)
  • select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y -- edit your registration info (to know what to remove by PK)
  • delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ) -- edit your registration info (remove answers); also when deleting data after aggregation
  • insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y -- edit your registration info (insert/update answers)
  • select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X -- used when aggregating data. NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)
  • select ceqa_id from ce_question_answers where ceqa_event_id = X -- delete data after aggregation (to know what to remove by PK). NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)

ce_question_aggregation

  • select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X -- show aggregates for an event
  • insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ( ... ) -- inserting aggregates. NOTE: this assumes that answers will be aggregated all at once, and not as they are inserted (todo T328032)

Note that T328032 has an open question about when to delete the data (whether it should be based on the event dates, or the answer timestamp). Depending on the decision, we may need a new column and different queries. Also, one of the decisions in T328032 is that data will be aggregated all at once when the event ends, but if this decision is reversed and we choose to aggregate data progressively, then we will also need different queries.

MHorsey-WMF set the point value for this task to 5.

In light of the new decisions recorded in T328032#8936008, the schema above can be considered final. We will need to store additional information, but that's going to be added to existing tables with a schema change. Therefore, here's the updated list of queries:

ce_event_questions

  • select ceeq_question_id from ce_event_questions where ceeq_event_id = X: edit registration form, register for event or edit your registration info (to determine what form fields to show)
  • insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ): enable/edit registration (add new questions). For the MVP, we will always an only insert 5 tuples of values per event
  • select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)
  • delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

  • select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (grab current answers to show them in the form)
  • select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (to know what to remove by PK)
  • delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ): edit your registration info (remove answers). Also used when deleting data after aggregation.
  • insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y: edit your registration info (insert/update answers)
  • select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y FOR UPDATE: used when aggregating data for a specific user. The PK is needed for a subsequent DELETE, in an atomic transaction. Note that the value of Y could be obtained with a join on the ce_participants table, but then that table would also be locked, and I'm not sure if this is a good idea.
  • select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X FOR UPDATE: used when aggregating the remaining data because an event has ended. The PK is needed for a subsequent DELETE, in an atomic transaction.

ce_question_aggregation

  • select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X: show aggregates for an event
  • insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ... on duplicate key update ceqag_answers_amount = ceqag_answers_amount + X,: inserting/updating aggregates.

In light of the new decisions recorded in T328032#8936008, the schema above can be considered final. We will need to store additional information, but that's going to be added to existing tables with a schema change. Therefore, here's the updated list of queries:

ce_event_questions

  • select ceeq_question_id from ce_event_questions where ceeq_event_id = X: edit registration form, register for event or edit your registration info (to determine what form fields to show)
  • insert ignore into ceeq_question_id (ceeq_event_id, ceeq_question_id) values ( X, A1 ), ..., ( X, An ): enable/edit registration (add new questions). For the MVP, we will always an only insert 5 tuples of values per event
  • select ceeq_id from ce_event_questions where ceeq_event_id = X -- NOT MVP: edit registration (to know what to delete by PK)
  • delete from ce_event_questions where ceeq_id in ( X1, ..., Xn ) -- NOT MVP: edit registration (delete removed questions)

ce_question_answers

  • select ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (grab current answers to show them in the form)
  • select ceqa_id from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y: edit your registration info (to know what to remove by PK)
  • delete from ce_question_answers where ceqa_id in ( X1, ..., Xn ): edit your registration info (remove answers). Also used when deleting data after aggregation.
  • insert into ce_question_answers ( ceqa_event_id, ceqa_user_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text ) values ... on duplicate key update ceqa_answer_option = X, ceqa_answer_text = Y: edit your registration info (insert/update answers)
  • select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X and ceqa_user_id = Y FOR UPDATE: used when aggregating data for a specific user. The PK is needed for a subsequent DELETE, in an atomic transaction. Note that the value of Y could be obtained with a join on the ce_participants table, but then that table would also be locked, and I'm not sure if this is a good idea.
  • select ceqa_id, ceqa_question_id, ceqa_answer_option, ceqa_answer_text from ce_question_answers where ceqa_event_id = X FOR UPDATE: used when aggregating the remaining data because an event has ended. The PK is needed for a subsequent DELETE, in an atomic transaction.

ce_question_aggregation

  • select ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount from ce_question_aggregation where ceqag_event_id = X: show aggregates for an event
  • insert into ce_question_aggregation ( ceqag_event_id, ceqag_question_id, ceqag_question_option_id, ceqag_answers_amount ) values ... on duplicate key update ceqag_answers_amount = ceqag_answers_amount + X,: inserting/updating aggregates.

LGTM

Given the queries listed in T335526#8947228, I'm proposing the following indexes:

ce_event_questions

  • (ceeq_event_id, ceeq_question_id) UNIQUE: uniqueness constraint, used by queries that filter by event (and we never filter by question)

ce_question_answers

  • (ceqa_event_id, ceqa_user_id, ceqa_question_id) UNIQUE: needed by pretty much all queries, as they filter by event; some queries also filter by user and would use this index. The question ID is here only for the uniqueness.

ce_question_aggregation

  • (ceqag_event_id, ceqag_question_id) UNIQUE: needed for the select filtering by event, + question ID for the uniqueness

These would be sufficient for all the queries above, AFAICS.

Change 931980 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Create new DB tables for participant questions

https://gerrit.wikimedia.org/r/931980

Change 931980 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Create new DB tables for participant questions

https://gerrit.wikimedia.org/r/931980

Daimona updated the task description. (Show Details)

Marking this as resolved. This will be tested in T339997 (beta) and T340000 (prod).