Problem
Change tags are used more and more, and the current schema doesn't scale. On English Wikipedia, the wiki with the most edits, we have 40M rows in the change_tag table and it takes 12 seconds to load Special:Tags. On Wikidata, there are fewer edits but tagging is used a lot more (because so many edits are tagged with OAuth consumer IDs), so there are 184M rows in the change_tag table and loading Special:Tags takes 42 seconds (!).
The current schema is as follows:
-- A table to track tags for revisions, logs and recent changes. CREATE TABLE /*_*/change_tag ( ct_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, -- RCID for the change ct_rc_id int NULL, -- LOGID for the change ct_log_id int unsigned NULL, -- REVID for the change ct_rev_id int unsigned NULL, -- Tag applied ct_tag varchar(255) NOT NULL, -- Parameters for the tag, presently unused ct_params blob NULL ) /*$wgDBTableOptions*/; CREATE UNIQUE INDEX /*i*/change_tag_rc_tag ON /*_*/change_tag (ct_rc_id,ct_tag); CREATE UNIQUE INDEX /*i*/change_tag_log_tag ON /*_*/change_tag (ct_log_id,ct_tag); CREATE UNIQUE INDEX /*i*/change_tag_rev_tag ON /*_*/change_tag (ct_rev_id,ct_tag); -- Covering index, so we can pull all the info only out of the index. CREATE INDEX /*i*/change_tag_tag_id ON /*_*/change_tag (ct_tag,ct_rc_id,ct_rev_id,ct_log_id); CREATE TABLE /*_*/valid_tag ( vt_tag varchar(255) NOT NULL PRIMARY KEY ) /*$wgDBTableOptions*/;
Problems with it are:
- Getting the usage statistics for Special:Tags requires a query like SELECT ct_tag, COUNT(*) AS hitcount FROM change_tag GROUP BY ct_tag ORDER BY hitcount DESC, which requires scanning the entire table. This is responsible for almost all of the long load times for Special:Tags.
- Getting all tags for a given revision/log entry/RC entry requires a GROUP_CONCAT. There is a tag_summary table to serve as a rollup for this, but for some reason we stopped using it (at Sean Pringle's instruction, IIRC).
- Tags are stored as strings, rather than being normalized to integers. This means the full string value of some tags is stored millions of times, and the table is much larger than it needs to be.
Proposed schema
In January 2017, @Cenarium submitted a Gerrit change that creates a rollup table for tag counts. In November/December 2017, I took over this patch, and in late December @Ladsgroup suggested normalizing the tag names. Combining these ideas is how I got to this proposal; it's mostly their ideas rather than mine.
-- Table defining tag names for IDs. Also stores hit counts to avoid expensive queries on change_tag CREATE TABLE /*_*/change_tag_def ( -- Numerical ID of the tag (ct_tag_id refers to this) ctd_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, -- Symbolic name of the tag (what would previously be put in ct_tag) ctd_name varchar(255) NOT NULL, -- Whether this tag was defined manually by a privileged user using Special:Tags ctd_user_defined tinyint(1) NOT NULL, -- Number of times this tag was used ctd_count bigint unsigned NOT NULL default 0, -- Last time this tag was added to something ctd_timestamp varbinary(14) NULL ) /*$wgDBTableOptions*/; CREATE UNIQUE INDEX /*i*/ctd_name ON /*_*/change_tag_def (ctd_name); CREATE INDEX /*i*/ctd_count ON /*_*/change_tag_def (ctd_count); ALTER TABLE /*_*/change_tags ADD -- Tag ID (foreign key to change_tag_def.ctd_id) -- Default is for migration and is removed after ct_tag_id int unsigned NOT NULL DEFAULT 0; -- Moved into ctd_user_defined DROP TABLE /*_*/valid_tag;
With this schema we could get the list of tags and their usage counts directly from the change_tag_def table, without any expensive queries. The tag table would be populated once, then kept up to date by incrementing counts when tags are added. The change_tag table would refer to tags by ID (ct_tag_id, which foreign-keys into ctd_id) rather than by name (we'd remove ct_tag).
Migration
Doing this migration is tricky, because we want to replace ct_tag with ct_tag_id, and there are indexes that use ct_tag. I think it would have to be done as follows:
- Create the change_tag_def table and add the ct_tag_id field to change_tag (but don't remove ct_tag yet and don't change any indexes yet).
- Set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH. This will cause the change_tag_def table and the ct_tag_id field to be written to when an edit is tagged, but not yet read from.
- Run the migration script. This will run the Special:Tags query (in ChangeTags::tagUsageStatistics()) and use it to populate the change_tag_def table. It will also populate ct_tag_id for every row in the change_tag table.
- Add new indexes using ct_tag_id instead of ct_tag, including unique indexes on (ct_{rc,log,rev}_id, ct_tag_id).
- Convert the old indexes that use ct_tag from unique to non-unique, and set a default value (empty string) for ct_tag.
- Set $wgChangeTagsSchemaMigrationStage to MIGRATION_NEW. This will cause the change_tag_def table and ct_tag_id to be read from, and ct_tag to no longer be written to.
- Remove the ct_tag field (and the indexes that reference it), and remove the default on ct_tag_id.
Implementation sketch: https://gerrit.wikimedia.org/r/#/c/405375
Open questions
Should rows be removed from the change_tag_def table when ctd_count reaches zero? Cenarium's original code does this, and it makes sense for a rollup table, but for an ID mapping table I'm concerned that it hurts ID stability. I don't directly see how that would be a problem, though.- @Anomie gave feedback on this and my proposed answer is: we should only delete zero-count rows if the tag is not "defined" in software or in the valid_tag table.
- Consensus is to delete rows with ctd_count=0 if ctd_user_defined=0, but keep them if ctd_user_defined=1.
Do we need the ctd_timestamp field, or should we remove it?- @Anomie dug into the comments and found that @Cenarium's motivation for adding this field was so that tags that are no longer being used to tag new changes would be easy to identify. I'm interested to hear if people think that use case is worth it. I personally am leaning towards "not worth it".
- @daniel points out this can be computed periodically with a join against the revision table if we need to look at it somewhere
Is ctd_defined a good name? The concept it expresses is "tag defined through an admin adding it via the web UI, as opposed to code declaring it or it just being added to things without a definition". The jargon in the code for this is an "explicitly defined tag" (e.g. ChangeTags::listExplicitlyDefinedTags()).- Changed to ctd_user_defined as suggested by @daniel
Is tag an OK name for this DB table? Should we use a different name? The name as Cenarium proposed it was change_tag_statistics, but since the table as I propose it here defines the ID->name relationship for tags, I didn't think that was a good name anymore.- Per @TTO's suggestion I've changed it to change_tag_def. Do people think that's a good name?
Breakdown (WIP)
- T193867: Create the change_tag_def table and add the ct_tag_id field to change_tag
- T193868: Add code to write to change_tag_def table, T193871: Add maintenance script to populate change_tag_def, T193874: Add new indexes to change_tag table using ct_tag_id instead of ct_tag (no fixed order)
- T194165: Start writing to change_tag_def in production
- T194162: Add code to read from change_tag_def instead of change_tag.ct_tag
- T194164: Start reading from change_tag_def in production
- T194163: Drop change_tag.ct_tag column
Still missing the more fine-grained index tweaking (not making it unique); see “migration” above.