Page MenuHomePhabricator

Normalize change tag schema
Closed, ResolvedPublic20 Estimated Story Points

Description

Problem

Change tags are used more and more, and the current schema doesn't scale. On English Wikipedia, the wiki with the most edits, we have 40M rows in the change_tag table and it takes 12 seconds to load Special:Tags. On Wikidata, there are fewer edits but tagging is used a lot more (because so many edits are tagged with OAuth consumer IDs), so there are 184M rows in the change_tag table and loading Special:Tags takes 42 seconds (!).

The current schema is as follows:

-- A table to track tags for revisions, logs and recent changes.
CREATE TABLE /*_*/change_tag (
  ct_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- RCID for the change
  ct_rc_id int NULL,
  -- LOGID for the change
  ct_log_id int unsigned NULL,
  -- REVID for the change
  ct_rev_id int unsigned NULL,
  -- Tag applied
  ct_tag varchar(255) NOT NULL,
  -- Parameters for the tag, presently unused
  ct_params blob NULL
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/change_tag_rc_tag ON /*_*/change_tag (ct_rc_id,ct_tag);
CREATE UNIQUE INDEX /*i*/change_tag_log_tag ON /*_*/change_tag (ct_log_id,ct_tag);
CREATE UNIQUE INDEX /*i*/change_tag_rev_tag ON /*_*/change_tag (ct_rev_id,ct_tag);
-- Covering index, so we can pull all the info only out of the index.
CREATE INDEX /*i*/change_tag_tag_id ON /*_*/change_tag (ct_tag,ct_rc_id,ct_rev_id,ct_log_id);

CREATE TABLE /*_*/valid_tag (
  vt_tag varchar(255) NOT NULL PRIMARY KEY
) /*$wgDBTableOptions*/;

Problems with it are:

  • Getting the usage statistics for Special:Tags requires a query like SELECT ct_tag, COUNT(*) AS hitcount FROM change_tag GROUP BY ct_tag ORDER BY hitcount DESC, which requires scanning the entire table. This is responsible for almost all of the long load times for Special:Tags.
  • Getting all tags for a given revision/log entry/RC entry requires a GROUP_CONCAT. There is a tag_summary table to serve as a rollup for this, but for some reason we stopped using it (at Sean Pringle's instruction, IIRC).
  • Tags are stored as strings, rather than being normalized to integers. This means the full string value of some tags is stored millions of times, and the table is much larger than it needs to be.

Proposed schema

In January 2017, @Cenarium submitted a Gerrit change that creates a rollup table for tag counts. In November/December 2017, I took over this patch, and in late December @Ladsgroup suggested normalizing the tag names. Combining these ideas is how I got to this proposal; it's mostly their ideas rather than mine.

-- Table defining tag names for IDs. Also stores hit counts to avoid expensive queries on change_tag
CREATE TABLE /*_*/change_tag_def (
    -- Numerical ID of the tag (ct_tag_id refers to this)
    ctd_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -- Symbolic name of the tag (what would previously be put in ct_tag)
    ctd_name varchar(255) NOT NULL,
    -- Whether this tag was defined manually by a privileged user using Special:Tags
    ctd_user_defined tinyint(1) NOT NULL,
    -- Number of times this tag was used
    ctd_count bigint unsigned NOT NULL default 0,
    -- Last time this tag was added to something
    ctd_timestamp varbinary(14) NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/ctd_name ON /*_*/change_tag_def (ctd_name);
CREATE INDEX /*i*/ctd_count ON /*_*/change_tag_def (ctd_count);

ALTER TABLE /*_*/change_tags ADD
    -- Tag ID (foreign key to change_tag_def.ctd_id)
    -- Default is for migration and is removed after
    ct_tag_id int unsigned NOT NULL DEFAULT 0;

-- Moved into ctd_user_defined
DROP TABLE /*_*/valid_tag;

With this schema we could get the list of tags and their usage counts directly from the change_tag_def table, without any expensive queries. The tag table would be populated once, then kept up to date by incrementing counts when tags are added. The change_tag table would refer to tags by ID (ct_tag_id, which foreign-keys into ctd_id) rather than by name (we'd remove ct_tag).

Migration

Doing this migration is tricky, because we want to replace ct_tag with ct_tag_id, and there are indexes that use ct_tag. I think it would have to be done as follows:

  1. Create the change_tag_def table and add the ct_tag_id field to change_tag (but don't remove ct_tag yet and don't change any indexes yet).
  2. Set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH. This will cause the change_tag_def table and the ct_tag_id field to be written to when an edit is tagged, but not yet read from.
  3. Run the migration script. This will run the Special:Tags query (in ChangeTags::tagUsageStatistics()) and use it to populate the change_tag_def table. It will also populate ct_tag_id for every row in the change_tag table.
  4. Add new indexes using ct_tag_id instead of ct_tag, including unique indexes on (ct_{rc,log,rev}_id, ct_tag_id).
  5. Convert the old indexes that use ct_tag from unique to non-unique, and set a default value (empty string) for ct_tag.
  6. Set $wgChangeTagsSchemaMigrationStage to MIGRATION_NEW. This will cause the change_tag_def table and ct_tag_id to be read from, and ct_tag to no longer be written to.
  7. Remove the ct_tag field (and the indexes that reference it), and remove the default on ct_tag_id.

Implementation sketch: https://gerrit.wikimedia.org/r/#/c/405375

Open questions

  • Should rows be removed from the change_tag_def table when ctd_count reaches zero? Cenarium's original code does this, and it makes sense for a rollup table, but for an ID mapping table I'm concerned that it hurts ID stability. I don't directly see how that would be a problem, though.
    • @Anomie gave feedback on this and my proposed answer is: we should only delete zero-count rows if the tag is not "defined" in software or in the valid_tag table.
    • Consensus is to delete rows with ctd_count=0 if ctd_user_defined=0, but keep them if ctd_user_defined=1.
  • Do we need the ctd_timestamp field, or should we remove it?
    • @Anomie dug into the comments and found that @Cenarium's motivation for adding this field was so that tags that are no longer being used to tag new changes would be easy to identify. I'm interested to hear if people think that use case is worth it. I personally am leaning towards "not worth it".
    • @daniel points out this can be computed periodically with a join against the revision table if we need to look at it somewhere
  • Is ctd_defined a good name? The concept it expresses is "tag defined through an admin adding it via the web UI, as opposed to code declaring it or it just being added to things without a definition". The jargon in the code for this is an "explicitly defined tag" (e.g. ChangeTags::listExplicitlyDefinedTags()).
    • Changed to ctd_user_defined as suggested by @daniel
  • Is tag an OK name for this DB table? Should we use a different name? The name as Cenarium proposed it was change_tag_statistics, but since the table as I propose it here defines the ID->name relationship for tags, I didn't think that was a good name anymore.
    • Per @TTO's suggestion I've changed it to change_tag_def. Do people think that's a good name?

Breakdown (WIP)

Still missing the more fine-grained index tweaking (not making it unique); see “migration” above.

Related Objects

StatusSubtypeAssignedTask
ResolvedCenarium
DeclinedNone
OpenNone
OpenNone
ResolvedCenarium
ResolvedNone
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
OpenNone
ResolvedLadsgroup
ResolvedPRODUCTION ERRORLadsgroup
ResolvedMarostegui
Resolved Bstorm
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedMarostegui
ResolvedMarostegui
ResolvedTrizek-WMF
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedMarostegui
ResolvedLadsgroup

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 430943 merged by jenkins-bot:
[mediawiki/core@master] Introduce change_tag_def table

https://gerrit.wikimedia.org/r/430943

I'm doing this :)

Let's make that official then, by assigning the task to you :)

I also had an in-person conversation with @Ladsgroup where he said he intended to use the same technique that site_stats uses to manage incrementing fields (store a value in memcached, use its increment primitive, and periodically write the value to the DB) for the ctd_count field.

How does that technique avoid the possibility that memcached loses the updated count before it can be written to the DB?

How does that technique avoid the possibility that memcached loses the updated count before it can be written to the DB?

I guess that would simply be considered an acceptable loss.

I also had an in-person conversation with @Ladsgroup where he said he intended to use the same technique that site_stats uses to manage incrementing fields (store a value in memcached, use its increment primitive, and periodically write the value to the DB) for the ctd_count field.

I don't think this is necessary. We don't actually use that site_stats memcached update code in production. The code was merged to core on May 11, 2012, and then a non-functional pilot deployment was done on May 17, 2012. Then nothing has been touched since then. $wgSiteStatsAsyncFactor is false on most wikis, and 1 on a few test wikis, but 1 apparently means the same thing as false. site_stats is presumably hotter than change_tag_def will be, so I don't think refactoring Aaron's complex SiteStatsUpdate code should be a dependency for this task.

My recommendation is that the change_tag_def update be done in a separate transaction, preferably using autocommit mode, to minimise the time the lock is held.

Hey, one kind inquiry: was this announced anywhere other than the wikidata-tech mailing list? This doesn't appear to strictly be about wikidata. See also T199234. I am struggling to keep up with schema changes that affect my tools.

Hey, one kind inquiry: was this announced anywhere other than the wikidata-tech mailing list? This doesn't appear to strictly be about wikidata. See also T199234. I am struggling to keep up with schema changes that affect my tools.

We recently announced it in wikitech-l

It was also highlighted in Scrum of Scrums for inter-team communication at least a couple of times.

Change 446366 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] labs: Add change_tag_def to labs replicas

https://gerrit.wikimedia.org/r/446366

Change 446366 merged by Jcrespo:
[operations/puppet@production] labs: Add change_tag_def to labs replicas

https://gerrit.wikimedia.org/r/446366

Getting all tags for a given revision/log entry/RC entry requires a GROUP_CONCAT. There is a tag_summary table to serve as a rollup for this, but for some reason we stopped using it (at Sean Pringle's instruction, IIRC).

@Catrope, @Ladsgroup, what is the plan for the tag_summary table? Are we planning to keep it around in its current form? I do actually use it for analytics purposes, but I can switch to use change_tag if there's a month or so to migrate after the new schema is in place.

@Neil_P._Quinn_WMF I don't know for what type of analysis you need the table but did you try using the recently-introduced and fancy change_tag_def table and joining it with tag_summary?

@Neil_P._Quinn_WMF I don't know for what type of analysis you need the table but did you try using the recently-introduced and fancy change_tag_def table and joining it with tag_summary?

Well, right now, that join doesn't make sense, because the ts_tags field already contains a comma separated list of tag names (e.g. mobile edit,mobile web edit,possible libel or vandalism). My question is whether this will change in the future :)

I have to count edits for specific interfaces, which sometimes requires looking at multiple tags (e.g. mobile visual edits are those tagged with mobile web edit and visual edit). So I use regular expressions over ts_tags, since that saves me the step of grouping and concatenating the various rows from change_tag.

I just looked at the tag_summary in depth and it's another beast that I'm not going to touch (probably it's better just to ditch the whole thing after we properly normalize change_tag, what do you think @Catrope ?) I would suggest you to use change_tag_id in change_tag table instead, give that it's a number instead of string, querying and grouping it would be faster.

(probably it's better just to ditch the whole thing after we properly normalize change_tag, what do you think @Catrope ?) I would suggest you to use change_tag_id in change_tag table instead, give that it's a number instead of string, querying and grouping it would be faster.

Sounds fine to me! I just ask that you wait a month after adding change_tag_id before dropping tag_summary, so I can properly migrate. I'm following this task so I'll see what you decide.

Addshore triaged this task as Medium priority.Oct 9 2018, 1:46 PM

Change 467957 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Set migration stage for change tag to read new

https://gerrit.wikimedia.org/r/467957

Change 467957 merged by jenkins-bot:
[mediawiki/core@master] Set migration stage for change tag to read new

https://gerrit.wikimedia.org/r/467957

Ladsgroup removed a project: Patch-For-Review.

Almost a year now.

Everything is done 🥳 🥳🥳🥳 (We need to apply the schema change in prod and the ticket is on track: T210713: Drop change_tag.ct_tag column in production)

Just look at all of the closed subtasks.

Change 405375 abandoned by Ladsgroup:
[mediawiki/core@master] [WIP] Change tag schema normalization

Reason:
Already done. Took the liberty of abandoning it.

https://gerrit.wikimedia.org/r/405375