
Create new database table for event goals
Open, In Progress, Needs Triage, Public

Description

Acceptance Criteria:

  • Define the table structure
  • Implement a schema change for it
  • Create the table in production and beta

Checklist

  • Design the schema
  • Make schema change patch for it
  • Get initial reviews (+1s)
  • Tag this task with DBA
  • Request DBA signoff
  • Get approval
Deployment
  • Merge the schema patch
    • This will automatically create the new table in beta
  • Wait until the patch reaches production
  • Schedule a dedicated deployment window, note the time here
    • [[ diff_URL_on_wikitech | date and time ]]
  • Create the table everywhere:
    • mwscript sql.php --wiki=testwiki --cluster=extension1 /srv/mediawiki/php-1.46.0-wmf.XX/extensions/CampaignEvents/db_patches/mysql/patch-add-NAME.sql
    • mwscript sql.php --wiki=test2wiki --cluster=extension1 /srv/mediawiki/php-1.46.0-wmf.XX/extensions/CampaignEvents/db_patches/mysql/patch-add-NAME.sql
    • mwscript sql.php --wiki=officewiki --cluster=extension1 /srv/mediawiki/php-1.46.0-wmf.XX/extensions/CampaignEvents/db_patches/mysql/patch-add-NAME.sql
    • mwscript sql.php --wiki=metawiki --cluster=extension1 --wikidb wikishared /srv/mediawiki/php-1.46.0-wmf.XX/extensions/CampaignEvents/db_patches/mysql/patch-add-NAME.sql

Details

Event Timeline

ifried renamed this task from [placeholder] Database work to store goal to Create new database table. Jan 8 2026, 11:10 PM
Daimona renamed this task from Create new database table to Create new database table for event goals. Tue, Jan 13, 10:36 PM
Daimona updated the task description. (Show Details)
Daimona subscribed.
Proposed table schema
CREATE TABLE ce_event_goals (
  ceeg_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, -- Surrogate primary key
  ceeg_event BIGINT UNSIGNED NOT NULL, -- References campaign_events.event_id
  ceeg_goals BLOB NOT NULL, -- JSON blob with event goals, see below
  INDEX ceeg_event (ceeg_event),
  PRIMARY KEY(ceeg_id)
);
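As a quick, purely illustrative sanity check of this shape (not the actual MediaWiki schema patch), the table can be mocked up in SQLite, with the MySQL-specific BIGINT UNSIGNED AUTO_INCREMENT column adapted to SQLite's INTEGER PRIMARY KEY:

```python
import sqlite3

# Illustration only: SQLite stand-in for the proposed table. Types are
# adapted (SQLite has no BIGINT UNSIGNED AUTO_INCREMENT).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ce_event_goals (
        ceeg_id INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate primary key
        ceeg_event INTEGER NOT NULL,               -- references campaign_events.event_id
        ceeg_goals BLOB NOT NULL                   -- JSON blob with event goals
    )
""")
# The secondary index supports per-event lookups.
conn.execute("CREATE INDEX ceeg_event ON ce_event_goals (ceeg_event)")
conn.execute(
    "INSERT INTO ce_event_goals (ceeg_event, ceeg_goals) VALUES (?, ?)",
    (42, '{"metric": "references", "target": 100}'),
)
rows = conn.execute(
    "SELECT ceeg_goals FROM ce_event_goals WHERE ceeg_event = ?", (42,)
).fetchall()
```

The index on ceeg_event is what makes "all goals for event X" cheap, which is the only access pattern the task describes.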
Proposed JSON structure

goal_unit substructure:

{
  "metric": "X",
  "target": Y
}

Where X is a metric name (a string defined somewhere in the application) and Y is an integer, but might also be a different data type in future as we introduce more metrics (e.g., a float, or perhaps some kind of ranked enum/score).

In future, for multi-metric (composite) goals, we could use the following composite goal structure:

{
  "operator": Op,
  "metrics": (goal_unit|composite_goal)[]
}

Where Op is one of "AND" and "OR"; we could think about enforcing conjunctive normal form, as we probably wouldn't need any more depth than this. However, the specs for this go beyond the scope of this task. The flexibility of storing JSON allows us to postpone these decisions until we actually need them.

Goals themselves (i.e., what goes into the table) can be either a goal_unit, or a composite_goal. For the time being, all goals will be single-metric goals (restriction imposed by the application).
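Purely as an illustration (a hypothetical helper, not part of the extension), the two accepted shapes could be checked recursively in a few lines, assuming the "metrics" key used in the examples and integer targets as described above:

```python
def is_valid_goal(goal):
    """Accept either a single goal_unit or a composite_goal (recursive)."""
    if not isinstance(goal, dict):
        return False
    if "metric" in goal:
        # goal_unit: a metric name plus an integer target (per the spec above)
        return isinstance(goal.get("metric"), str) and isinstance(goal.get("target"), int)
    if "operator" in goal:
        # composite_goal: AND/OR over a list of nested goals
        return (
            goal["operator"] in ("AND", "OR")
            and isinstance(goal.get("metrics"), list)
            and all(is_valid_goal(g) for g in goal["metrics"])
        )
    return False
```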

Example JSON

Simple goal for the initial version:

{
  "metric": "references",
  "target": 100
}

Potential multi-metric goal:

{
  "operator": "AND",
  "metrics": [
    {
      "metric": "edits",
      "target": 5000
    },
    {
      "metric": "articles",
      "target": 200
    },
    {
      "operator": "OR",
      "metrics": [
        {
          "metric": "references",
          "target": 1234567
        },
        {
          "metric": "links",
          "target": 987
        }
      ]
    }
  ]
}

Open questions
  • Would the above blob structure work for reporting? As I think about this more, we could perhaps switch back to structured storage (not JSON) if we impose a few restrictions from the get-go: no non-integer metrics, and composite goals can only be a list of units joined with AND.
  • Not related to the schema, but to make sure I understand: for delta-style metrics (bytes and links), the target would be compared against the algebraic sum, right? So if, say, the target is 1000 bytes and the overall delta is +2000/-3000, would the progress towards this goal be 0 (actually -1000, but hopefully we don't want negative progress)?
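To make the second question concrete, a hypothetical clamped computation (names invented for this sketch; nothing here is decided) would behave like this:

```python
def delta_progress(added, removed, target):
    # Compare the target against the algebraic sum of the deltas,
    # clamping at zero so there is no negative progress.
    net = added - removed  # e.g. +2000 added, 3000 removed -> -1000
    return max(0, net) / target
```

With a target of 1000 and an overall delta of +2000/-3000, this returns 0.0, matching the scenario in the question.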

Thanks @Daimona !!!

ceeg_event BIGINT UNSIGNED NOT NULL, -- References campaign_events.event_id

Shouldn't it be ceeg_event_id, as in other tables: ce_event_wikis -> ceew_event_id, ce_event_topics -> ceet_event_id?

Would the above blob structure work for reporting?

It does, but it would be nice to name the metric with the same name as the column that represents the goal number, something like:

{
  "operator": "AND",
  "metrics": [
    {
      "metric": "cec_bytes_delta",
      "target": "20"
    },
    {
      "metric": "cec_links_delta",
      "target": "12"
    },
    {
      "operator": "OR",
      "metrics": [
        {
          "metric": "cec_edit_flags__1",
          "target": "12"
        },
        {
          "metric": "cec_edit_flags__0",
          "target": "18"
        }
      ]
    }
  ]
}

This is what I used to test with 2000 events with goals on Superset, by faking some data. Naming the metric after the column makes the query look a little better to me, but other names would also work. It also helps to know where the metric is in the DB just by looking at the name.

Not related to the schema, but to make sure I understand: for delta-style metrics (bytes and links), the target would be compared against the algebraic sum, right? So if, say, the target is 1000 bytes and the overall delta is +2000/-3000, would the progress towards this goal be 0 (actually -1000, but hopefully we don't want negative progress)?

Good question. One way to go could be to split these into separate goal types, like below; this way the organizer could decide how they want it to be calculated. It's just one idea to think about (cc: @ifried):

  • Goal: Number of bytes added
  • Goal: Number of bytes removed
  • Goal: Number of bytes added - Bytes removed

Another point: I think it would be better to store the JSON blob already supporting multi-metric goals. I strongly suspect people will want them, and if we start storing the blob with support for only single-metric goals, we will later need code that understands both the simple and the multi-metric structure (and a Superset query that reads two different JSON structures for reports). We could have the JSON structure ready to support multi-metric goals and, for now, just allow one.
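A tiny sketch of this idea (hypothetical helper name, not the extension's code): always persist the composite envelope, wrapping a single-metric goal in a one-element AND, so readers only ever parse one structure:

```python
def wrap_goal(metric, target):
    # Store even a single goal in the multi-metric envelope, so consumers
    # (application code, Superset queries) only ever see one structure.
    return {"operator": "AND", "metrics": [{"metric": metric, "target": target}]}
```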

Not related to the schema, but to make sure I understand: for delta-style metrics (bytes and links), the target would be compared against the algebraic sum, right? So if, say, the target is 1000 bytes and the overall delta is +2000/-3000, would the progress towards this goal be 0 (actually -1000, but hopefully we don't want negative progress)?

Good question. One way to go could be to split these into separate goal types, like below; this way the organizer could decide how they want it to be calculated. It's just one idea to think about (cc: @ifried):

  • Goal: Number of bytes added
  • Goal: Number of bytes removed
  • Goal: Number of bytes added - Bytes removed

Good question @Daimona, and thanks for sharing this suggestion, @cmelo! I'm going to ask around on this topic to see what people think, but my current instinct is that organizers are most likely to be interested in bytes added (not the algebraic sum or bytes removed). So perhaps we could start by offering just "Number of bytes added" as a goal, and later see whether we allow the other 2 options listed by Claudio (i.e., Number of bytes removed & Number of bytes added - Bytes removed).

Decisions made about the proposals:

This is the final DB structure; the only change is that the column ceeg_event will be named ceeg_event_id.

CREATE TABLE ce_event_goals (
  ceeg_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, -- Surrogate primary key
  ceeg_event_id BIGINT UNSIGNED NOT NULL, -- References campaign_events.event_id
  ceeg_goals BLOB NOT NULL, -- JSON blob with event goals, see below
  INDEX ceeg_event_id (ceeg_event_id),
  PRIMARY KEY(ceeg_id)
);

We will use the multi-metric goal JSON structure:

{
  "operator": "AND",
  "metrics": [
    {
      "metric": "edits",
      "target": "5000"
    }
  ]
}
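For illustration only (a hypothetical helper, not the extension's code), reading a blob in this agreed format back and unwrapping the single allowed goal could look like:

```python
import json

def read_single_goal(blob):
    # While only one goal per event is allowed, unwrap the single unit
    # from the agreed envelope; raise if the blob has a different shape.
    data = json.loads(blob)
    if data.get("operator") != "AND" or len(data.get("metrics", [])) != 1:
        raise ValueError("expected exactly one goal in an AND envelope")
    return data["metrics"][0]
```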

cc: @Daimona @MHorsey-WMF

Change #1234456 had a related patch set uploaded (by Mhorsey; author: Mhorsey):

[mediawiki/extensions/CampaignEvents@master] Add new db schema for event goals

https://gerrit.wikimedia.org/r/1234456

MHorsey-WMF updated the task description. (Show Details)

Hi DBAs. Tagging to request approval for this schema change, per https://wikitech.wikimedia.org/wiki/Schema_changes. When approved and merged, I will file a separate task to get it applied in production.

Hi, no concern from my side, but do you want to make the index unique so there would be only one row per event? It seems nicer, but it's not a big deal. If that's the case, you can also simply make the event the PK; no need to set up a dedicated auto_increment field just as a counter.

Hi, no concern from my side, but do you want to make the index unique so there would be only one row per event? It seems nicer, but it's not a big deal.

The uniqueness constraint would hold in the very initial version, but we have plans to allow multiple goals per event in a future version.

If that's the case, you can also simply make the event the PK; no need to set up a dedicated auto_increment field just as a counter.

Generally speaking, I'm curious about this. I seem to recall a recommendation to always have a surrogate primary key (at least for maintenance reasons; I don't remember if there was more to it). Is that actually not the case?

Hi, no concern from my side, but do you want to make the index unique so there would be only one row per event? It seems nicer, but it's not a big deal.

The uniqueness constraint would hold in the very initial version, but we have plans to allow multiple goals per event in a future version.

If that's the case, you can also simply make the event the PK; no need to set up a dedicated auto_increment field just as a counter.

Generally speaking, I'm curious about this. I seem to recall a recommendation to always have a surrogate primary key (at least for maintenance reasons; I don't remember if there was more to it). Is that actually not the case?

There are pros and cons, and it depends on what the unique field is. As a rule of thumb: if the unique field is an integer, it's okay to use it as the PK (e.g. in the redirect table, the page_id is the PK), but if it's a string, it's better to have a dedicated auto_increment PK. For example, the image table uses img_name as its PK, and that has caused us a lot of headaches, since the names of images can change. That's why in the new design we have a dedicated file_id. Another point to consider is that we use the PK to enumerate through the table for various reasons (comparing the rows of one replica with another, ...). Also, if you would otherwise need a composite PK, I'd rather have a dedicated auto_increment one instead, unless the churn is high and/or the table is too big (the links tables have composite PKs).

So, tl;dr: it depends.
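To illustrate the enumeration point above (a generic sketch, not WMF tooling): maintenance scripts typically walk a table in batches keyed on an integer PK, which is straightforward when there is an auto_increment surrogate key:

```python
def iter_pk_batches(fetch_batch, batch_size=500):
    # Generic keyset pagination over an integer PK: each query continues
    # after the last PK seen, so the whole table can be walked in fixed-size
    # batches without OFFSET. fetch_batch(after, n) must return up to n rows
    # with pk > after, ordered by pk, with the PK as the first column.
    last_pk = 0
    while True:
        batch = fetch_batch(last_pk, batch_size)
        if not batch:
            return
        yield batch
        last_pk = batch[-1][0]
```

With a composite or string PK, the "continue after the last key seen" step becomes noticeably more awkward, which is part of why a surrogate integer key is convenient for this kind of maintenance work.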