Page MenuHomePhabricator

[IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions
Open, Needs TriagePublic

Description

Acceptance criteria:

Tables will be created to match the following schema

ce_question_answers
{
    ceqa_id: bigint, autoincrement,
    ceqa_user_id: integer,
    ceqa_event_id: integer,
    ceqa_question_id: integer,
    ceqa_answer_option: integer | nullable,
    ceqa_answer_text: string | nullable
}

ce_question_aggregation
{
    ceqag_id: bigint | autoincrement,
    ceqag_question_id: integer,
    ceqag_event_id: integer,
    ceqag_qustion_option_id: integer,
    ceqag_answers_amount: integer
}

ce_event_questions
{
    ceeq_id: bigint | autoincrement,
    ceeq_event_id: integer,
    ceeq_question_id: integer
}

We will not ask organizers to add/remove questions, but since we will add this feature later, we will save the question of the events on ce_event_questions

TBD: indexes
TBD: DBA task like T336365

Event Timeline

We still need to decide what indexes we need, here are the possible indexes I could think of, please fell free to add your thoughts about them and make suggestions, I don't mean we need them all, it is just for us to start thinking about them, I am also going to ask help of a DBA as well on this.
ce_question_answers

  • ceqa_event_id
    • Maybe useful when deleting PII data by event (we will remove the PII data after the event ends)
  • ceqa_event_id and ceqa_user_id
    • This would be used when getting information about participants to display on the participant table. Now we only show username and registration data, in case a participant answer the affiliate question for example, this information should also be available there. I think this index could useful when join with the ce_participants table
  • ceqa_event_id, ceqa_question_id, ceqa_answer_option
    • This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

ce_question_aggregation

  • ceqa_event_id
    • Get all aggregate data per event
  • ceqag_question_id and ceqag_qustion_option_id
    • This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like:

[
   question_id = > 1,
   options => [...]
   is_pii => true
]

When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

We still need to decide what indexes we need, here are the possible indexes I could think of [...]
ce_question_answers

  • ceqa_event_id

This would be redundant with the one below ((ceqa_event_id, ceqa_user_id)).

  • ceqa_event_id, ceqa_question_id, ceqa_answer_option
    • This would be to quickly find the rows that match the specified event ID, question ID, and answer option, and count them without having to perform a full table scan. But this is useful only in case we want to aggregate the data after the event is over, but since we are thinking to save/update the aggregated data in real time, I think this is not needed, also this would be just for a query to count the number of rows, so I really think it is not necessary.

Did we make a final decision on aggregation, particularly re anonymization? As you pointed out, knowing that in advance would help.

ce_question_aggregation

  • ceqag_question_id and ceqag_qustion_option_id
    • This could be useful to generate some reports, like aggregate all the data during a period of time by question.

I'm wondering if that is the case. You could have the same participant join multiple events, which would skew the data unless you have a way to filter out duplicates. And the only way to filter out duplicates is to group by event, I think.

I also would like your thoughts on the question below: @Daimona, @MHorsey-WMF
Should we add a new column (is_pii) if the question is PII on ce_question_answers, or should we rely on the question ID to decide if the answer on ce_question_answers is PII or not?
If we decide to rely on the question ID, which is not in the DB, but in the code because we will store the question in an array like: [...]
When deleting the PII answers, we would need to make a "where in"query passing all question IDs where is_pii = true if we use this approach

Being PII or not is a property that does not belong to the answer (which is what the table stores), but to the question. I think this should be stored wherever the question definition is stored (i.e., in the code). Putting it in this table would be an optimization, as you mentioned. Since we only expect a small set of questions, I think that would be a premature optimization, and maybe something to consider for later if we think it's necessary. At least that's what I can think of off the top of my head.

We will add the table ce_event_questions to store the questions of the event, for the first iteration of this feature organizers will not be able to add or remove questions, but since we will implement this in the future, we will add this table now and populate it with the 5 questions we have.

cmelo renamed this task from [IMPLEMENTATION] - Create the DB tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation.Fri, May 5, 12:02 AM
cmelo renamed this task from [IMPLEMENTATION] - DB create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation.Fri, May 5, 12:10 AM
Daimona renamed this task from [IMPLEMENTATION] - DB: create the tables: ce_question_answers and ce_question_aggregation to [IMPLEMENTATION] - DB: create the tables: ce_question_answers, ce_question_aggregation and ce_event_questions.Wed, May 24, 9:39 AM