Normalize change tag schema
Closed, Resolved · Public · 20 Estimated Story Points
Assigned To

Authored By

	Catrope
	Jan 19 2018, 10:53 PM

Description

Problem

Change tags are used more and more, and the current schema doesn't scale. On English Wikipedia, the wiki with the most edits, we have 40M rows in the change_tag table and it takes 12 seconds to load Special:Tags. On Wikidata, there are fewer edits but tagging is used a lot more (because so many edits are tagged with OAuth consumer IDs), so there are 184M rows in the change_tag table and loading Special:Tags takes 42 seconds (!).

The current schema is as follows:

-- A table to track tags for revisions, logs and recent changes.
CREATE TABLE /*_*/change_tag (
  ct_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- RCID for the change
  ct_rc_id int NULL,
  -- LOGID for the change
  ct_log_id int unsigned NULL,
  -- REVID for the change
  ct_rev_id int unsigned NULL,
  -- Tag applied
  ct_tag varchar(255) NOT NULL,
  -- Parameters for the tag, presently unused
  ct_params blob NULL
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/change_tag_rc_tag ON /*_*/change_tag (ct_rc_id,ct_tag);
CREATE UNIQUE INDEX /*i*/change_tag_log_tag ON /*_*/change_tag (ct_log_id,ct_tag);
CREATE UNIQUE INDEX /*i*/change_tag_rev_tag ON /*_*/change_tag (ct_rev_id,ct_tag);
-- Covering index, so we can pull all the info only out of the index.
CREATE INDEX /*i*/change_tag_tag_id ON /*_*/change_tag (ct_tag,ct_rc_id,ct_rev_id,ct_log_id);

CREATE TABLE /*_*/valid_tag (
  vt_tag varchar(255) NOT NULL PRIMARY KEY
) /*$wgDBTableOptions*/;

Problems with it are:

- Getting the usage statistics for Special:Tags requires a query like SELECT ct_tag, COUNT(*) AS hitcount FROM change_tag GROUP BY ct_tag ORDER BY hitcount DESC, which requires scanning the entire table. This is responsible for almost all of the long load times for Special:Tags.
- Getting all tags for a given revision/log entry/RC entry requires a GROUP_CONCAT. There is a tag_summary table to serve as a rollup for this, but for some reason we stopped using it (at Sean Pringle's instruction, IIRC).
- Tags are stored as strings, rather than being normalized to integers. This means the full string value of some tags is stored millions of times, and the table is much larger than it needs to be.
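To make the first two problems concrete, here is a minimal sketch of both query shapes against a toy copy of the current table. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MariaDB; the function names and sample rows are hypothetical, and the extra ct_tag tie-breaker in the ORDER BY is only there to make the toy output deterministic.

```python
import sqlite3

# Toy stand-in for the current change_tag table (SQLite instead of MariaDB).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE change_tag ("
    " ct_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " ct_rc_id INTEGER, ct_log_id INTEGER, ct_rev_id INTEGER,"
    " ct_tag TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO change_tag (ct_rev_id, ct_tag) VALUES (?, ?)",
    [(101, "mobile edit"), (102, "mobile edit"), (102, "visual edit")],
)


def tag_usage_statistics(db):
    """Usage counts per tag; this has to aggregate over every row."""
    return db.execute(
        "SELECT ct_tag, COUNT(*) AS hitcount FROM change_tag"
        " GROUP BY ct_tag ORDER BY hitcount DESC, ct_tag"
    ).fetchall()


def tags_for_revision(db, rev_id):
    """All tags of one revision, rolled up with GROUP_CONCAT."""
    (tags,) = db.execute(
        "SELECT GROUP_CONCAT(ct_tag, ',') FROM change_tag WHERE ct_rev_id = ?",
        (rev_id,),
    ).fetchone()
    return tags


print(tag_usage_statistics(conn))
print(tags_for_revision(conn, 102))
```

On three rows both queries are instant; the point is only that the first one touches every row of change_tag, which is what hurts at 40M or 184M rows.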

Proposed schema

In January 2017, @Cenarium submitted a Gerrit change that creates a rollup table for tag counts. In November/December 2017, I took over this patch, and in late December @Ladsgroup suggested normalizing the tag names. Combining these ideas is how I got to this proposal; it's mostly their ideas rather than mine.

-- Table defining tag names for IDs. Also stores hit counts to avoid expensive queries on change_tag
CREATE TABLE /*_*/change_tag_def (
    -- Numerical ID of the tag (ct_tag_id refers to this)
    ctd_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -- Symbolic name of the tag (what would previously be put in ct_tag)
    ctd_name varchar(255) NOT NULL,
    -- Whether this tag was defined manually by a privileged user using Special:Tags
    ctd_user_defined tinyint(1) NOT NULL,
    -- Number of times this tag was used
    ctd_count bigint unsigned NOT NULL default 0,
    -- Last time this tag was added to something
    ctd_timestamp varbinary(14) NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/ctd_name ON /*_*/change_tag_def (ctd_name);
CREATE INDEX /*i*/ctd_count ON /*_*/change_tag_def (ctd_count);

ALTER TABLE /*_*/change_tag ADD
    -- Tag ID (foreign key to change_tag_def.ctd_id)
    -- Default is for migration and is removed after
    ct_tag_id int unsigned NOT NULL DEFAULT 0;

-- Moved into ctd_user_defined
DROP TABLE /*_*/valid_tag;

With this schema we could get the list of tags and their usage counts directly from the change_tag_def table, without any expensive queries. The change_tag_def table would be populated once, then kept up to date by incrementing counts when tags are added. The change_tag table would refer to tags by ID (ct_tag_id, which foreign-keys into ctd_id) rather than by name (we'd remove ct_tag).
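A rough sketch of how the proposed schema might behave, again assuming SQLite through Python's sqlite3 module rather than MariaDB. The helpers make_db, add_tag and tag_usage are hypothetical names for illustration and are not MediaWiki functions; only the table and column names come from the proposal above, and the table is reduced to revision tags for brevity.

```python
import sqlite3


def make_db():
    """Create a cut-down version of the proposed tables (revision tags only)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE change_tag_def (
            ctd_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ctd_name TEXT NOT NULL UNIQUE,
            ctd_user_defined INTEGER NOT NULL DEFAULT 0,
            ctd_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE change_tag (
            ct_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ct_rev_id INTEGER,
            ct_tag_id INTEGER NOT NULL REFERENCES change_tag_def (ctd_id),
            UNIQUE (ct_rev_id, ct_tag_id)
        );
    """)
    return db


def add_tag(db, rev_id, name):
    """Tag a revision by ID and keep ctd_count in step. Returns True if added."""
    with db:  # one transaction: commit on success, roll back on error
        db.execute(
            "INSERT OR IGNORE INTO change_tag_def (ctd_name) VALUES (?)", (name,)
        )
        (tag_id,) = db.execute(
            "SELECT ctd_id FROM change_tag_def WHERE ctd_name = ?", (name,)
        ).fetchone()
        already = db.execute(
            "SELECT 1 FROM change_tag WHERE ct_rev_id = ? AND ct_tag_id = ?",
            (rev_id, tag_id),
        ).fetchone()
        if already:
            return False
        db.execute(
            "INSERT INTO change_tag (ct_rev_id, ct_tag_id) VALUES (?, ?)",
            (rev_id, tag_id),
        )
        db.execute(
            "UPDATE change_tag_def SET ctd_count = ctd_count + 1 WHERE ctd_id = ?",
            (tag_id,),
        )
        return True


def tag_usage(db):
    """Special:Tags-style listing read straight from the small definition table."""
    return db.execute(
        "SELECT ctd_name, ctd_count FROM change_tag_def"
        " ORDER BY ctd_count DESC, ctd_name"
    ).fetchall()
```

The listing query now reads one row per tag instead of one row per tagged change, which is the whole point of the rollup.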

Migration

Doing this migration is tricky, because we want to replace ct_tag with ct_tag_id, and there are indexes that use ct_tag. I think it would have to be done as follows:

1. Create the change_tag_def table and add the ct_tag_id field to change_tag (but don't remove ct_tag yet and don't change any indexes yet).
2. Set $wgChangeTagsSchemaMigrationStage to MIGRATION_WRITE_BOTH. This will cause the change_tag_def table and the ct_tag_id field to be written to when an edit is tagged, but not yet read from.
3. Run the migration script. This will run the Special:Tags query (in ChangeTags::tagUsageStatistics()) and use it to populate the change_tag_def table. It will also populate ct_tag_id for every row in the change_tag table.
4. Add new indexes using ct_tag_id instead of ct_tag, including unique indexes on (ct_{rc,log,rev}_id, ct_tag_id).
5. Convert the old indexes that use ct_tag from unique to non-unique, and set a default value (empty string) for ct_tag.
6. Set $wgChangeTagsSchemaMigrationStage to MIGRATION_NEW. This will cause the change_tag_def table and ct_tag_id to be read from, and ct_tag to no longer be written to.
7. Remove the ct_tag field (and the indexes that reference it), and remove the default on ct_tag_id.
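The backfill in the "run the migration script" step could look roughly like the sketch below. This is not the actual maintenance script: it assumes SQLite via Python's sqlite3 module, runs as a single transaction rather than in batches (which a real script on a table this size would presumably need), and make_legacy_db and populate_new_schema are hypothetical names.

```python
import sqlite3


def make_legacy_db():
    """change_tag mid-migration: ct_tag still filled, ct_tag_id defaulting to 0."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE change_tag (
            ct_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ct_rev_id INTEGER,
            ct_tag TEXT NOT NULL DEFAULT '',
            ct_tag_id INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE change_tag_def (
            ctd_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ctd_name TEXT NOT NULL UNIQUE,
            ctd_user_defined INTEGER NOT NULL DEFAULT 0,
            ctd_count INTEGER NOT NULL DEFAULT 0
        );
        INSERT INTO change_tag (ct_rev_id, ct_tag) VALUES
            (101, 'mobile edit'), (102, 'mobile edit'), (102, 'visual edit');
    """)
    return db


def populate_new_schema(db):
    """Backfill change_tag_def and ct_tag_id from the legacy ct_tag strings."""
    with db:
        # Define every tag name seen so far; names already written during the
        # write-both stage are left alone thanks to OR IGNORE.
        db.execute(
            "INSERT OR IGNORE INTO change_tag_def (ctd_name)"
            " SELECT DISTINCT ct_tag FROM change_tag"
        )
        # Recompute hit counts from scratch (the expensive query, run once).
        db.execute(
            "UPDATE change_tag_def SET ctd_count = ("
            " SELECT COUNT(*) FROM change_tag"
            " WHERE change_tag.ct_tag = change_tag_def.ctd_name)"
        )
        # Point every not-yet-migrated row at its numeric tag ID.
        db.execute(
            "UPDATE change_tag SET ct_tag_id = ("
            " SELECT ctd_id FROM change_tag_def"
            " WHERE change_tag_def.ctd_name = change_tag.ct_tag)"
            " WHERE ct_tag_id = 0"
        )
```

Because the last statement only touches rows where ct_tag_id is still 0, the sketch can be re-run safely after an interruption.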

Implementation sketch: https://gerrit.wikimedia.org/r/#/c/405375

Open questions

Should rows be removed from the change_tag_def table when ctd_count reaches zero? Cenarium's original code does this, and it makes sense for a rollup table, but for an ID mapping table I'm concerned that it hurts ID stability. I don't directly see how that would be a problem, though.
- @Anomie gave feedback on this and my proposed answer is: we should only delete zero-count rows if the tag is not "defined" in software or in the valid_tag table.
- Consensus is to delete rows with ctd_count=0 if ctd_user_defined=0, but keep them if ctd_user_defined=1.
~~Do we need the ctd_timestamp field, or should we remove it?~~
- @Anomie dug into the comments and found that @Cenarium's motivation for adding this field was so that tags that are no longer being used to tag new changes would be easy to identify. I'm interested to hear if people think that use case is worth it. I personally am leaning towards "not worth it".
- @daniel points out this can be computed periodically with a join against the revision table if we need to look at it somewhere
Is ctd_defined a good name? The concept it expresses is "tag defined through an admin adding it via the web UI, as opposed to code declaring it or it just being added to things without a definition". The jargon in the code for this is an "explicitly defined tag" (e.g. ChangeTags::listExplicitlyDefinedTags()).
- Changed to ctd_user_defined as suggested by @daniel
Is tag an OK name for this DB table? Should we use a different name? The name as Cenarium proposed it was change_tag_statistics, but since the table as I propose it here defines the ID->name relationship for tags, I didn't think that was a good name anymore.
- Per @TTO's suggestion I've changed it to change_tag_def. Do people think that's a good name?
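The consensus on zero-count rows (delete when ctd_count=0 and ctd_user_defined=0, keep when ctd_user_defined=1) can be sketched as follows. This assumes SQLite via Python's sqlite3 module; make_db and remove_tag are hypothetical helper names, not the code in the Gerrit change.

```python
import sqlite3


def make_db():
    """Two tags with one use each: one user-defined, one not."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE change_tag_def (
            ctd_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ctd_name TEXT NOT NULL UNIQUE,
            ctd_user_defined INTEGER NOT NULL DEFAULT 0,
            ctd_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE change_tag (
            ct_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ct_rev_id INTEGER,
            ct_tag_id INTEGER NOT NULL
        );
        INSERT INTO change_tag_def (ctd_id, ctd_name, ctd_user_defined, ctd_count)
            VALUES (1, 'needs review', 1, 1), (2, 'some-tool-tag', 0, 1);
        INSERT INTO change_tag (ct_rev_id, ct_tag_id) VALUES (101, 1), (101, 2);
    """)
    return db


def remove_tag(db, rev_id, name):
    """Untag a revision; drop the definition only if unused and not user-defined."""
    with db:
        row = db.execute(
            "SELECT ctd_id FROM change_tag_def WHERE ctd_name = ?", (name,)
        ).fetchone()
        if row is None:
            return
        tag_id = row[0]
        removed = db.execute(
            "DELETE FROM change_tag WHERE ct_rev_id = ? AND ct_tag_id = ?",
            (rev_id, tag_id),
        ).rowcount
        if removed:
            db.execute(
                "UPDATE change_tag_def SET ctd_count = ctd_count - ?"
                " WHERE ctd_id = ?",
                (removed, tag_id),
            )
        # The rule from the discussion: zero-count rows survive only if the
        # tag was explicitly defined by a user.
        db.execute(
            "DELETE FROM change_tag_def"
            " WHERE ctd_id = ? AND ctd_count = 0 AND ctd_user_defined = 0",
            (tag_id,),
        )
```

After removing both tags from revision 101 in this toy database, only the user-defined row remains, with a count of 0.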

Breakdown (WIP)

Still missing the more fine-grained index tweaking (not making it unique); see “migration” above.

Details

Subject	Repo	Branch	Lines +/-
[WIP] Change tag schema normalization	mediawiki/core	master	+276 -227
Set migration stage for change tag to read new	mediawiki/core	master	+18 -28
labs: Add change_tag_def to labs replicas	operations/puppet	production	+1 -0
Introduce change_tag_def table	mediawiki/core	master	+159 -0

Related Objects

Status	Subtype	Assigned	Task
Resolved		Cenarium	T89553 Recent changes patrolling limited to tagged changes
Declined		None	T91425 ProblemChanges should only filter for tags indicating a 'problem'
Open		None	T134440 Add configuration settings for tags to allow more complex use cases (tracking)
Open		None	T95319 Add 'problem' status for tags
Resolved		Cenarium	T73236 Automatically tag edits that make a redirect, or convert a redirected page to a normal page, or move a page across namespaces, etc.
Resolved		None	T105189 [Bug] Wikidata action=query&list=tags should not take 15-25 seconds to respond
Resolved		Ladsgroup	T91535 Performance issues with tags
Resolved		Ladsgroup	T185355 Normalize change tag schema
Resolved		Ladsgroup	T164167 change_tag table needs redesign
Resolved		Ladsgroup	T20672 Change Tags structure to use numeric IDs instead of text
Open		None	T91312 Evaluate refactoring of change_tags to only associate with rev id
Resolved		Ladsgroup	T193867 Create the change_tag_def table and add the ct_tag_id field to change_tag
Resolved	PRODUCTION ERROR	Ladsgroup	T194302 Schema change for new change_tag_def
Resolved		• Marostegui	T195193 Schema change for ct_tag_id field to change_tag
Resolved		• Bstorm	T199818 Replicate ct_tag_id to labs
Resolved		Ladsgroup	T193868 Add code to write to change_tag_def table
Resolved		Ladsgroup	T193874 Add new indexes to change_tag table using ct_tag_id instead of ct_tag
Resolved		Ladsgroup	T194162 Add code to read from change_tag_def instead of change_tag.ct_tag
Resolved		Ladsgroup	T194163 Drop change_tag.ct_tag column
Resolved		• Marostegui	T210713 Drop change_tag.ct_tag column in production
Resolved		• Marostegui	T234800 Switchover s1 primary database master db1067 -> db1083 - 14th Nov 06:00 - 06:30 UTC
Resolved		Trizek-WMF	T234801 Community Relations support needed for a read-only window for s1 (enwiki)
Resolved		Ladsgroup	T194164 Start reading from change_tag_def in production
Resolved		Ladsgroup	T196671 Start reading from change_tag_def in beta cluster
Resolved		Ladsgroup	T194165 Start writing to change_tag_def in production
Resolved		Ladsgroup	T199334 Temporarily add config and use it to use change_tag_def table instead of change_tag table for Special:Tags
Resolved		Ladsgroup	T200064 Update documentation on mediawiki.org: Manual:Change_tag_table
Resolved		Ladsgroup	T193873 Run maintenance script to populate change_tag_def on WMF production (all wikis)
Resolved		Ladsgroup	T193871 Add maintenance script to populate change_tag_def
Resolved		Ladsgroup	T208846 Start reading from change_tag_def on wikidatawiki
Resolved		• Marostegui	T203709 Schema change for adding indexes of ct_tag_id
Resolved		Ladsgroup	T211896 Query trying to use the wrong index (change_tag_rev_tag) on change_tag

Event Timeline

There are a very large number of changes, so older changes are hidden.

• RazShuty moved this task from Incoming to Ordered on the Wikidata-Ministry-Of-Magic-Tech-Debt board. May 9 2018, 5:44 PM

Change 430943 merged by jenkins-bot:
[mediawiki/core@master] Introduce change_tag_def table

https://gerrit.wikimedia.org/r/430943

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)). May 9 2018, 7:00 PM

Krinkle moved this task from Untriaged to In progress on the TechCom-RFC (TechCom-RFC-Closed) board. May 9 2018, 8:12 PM

In T185355#4181858, @Ladsgroup wrote:

I'm doing this :)

Let's make that official then, by assigning the task to you :)

I also had an in-person conversation with @Ladsgroup where he said he intended to use the same technique that site_stats uses to manage incrementing fields (store a value in memcached, use its increment primitive, and periodically write the value to the DB) for the ctd_count field.

Restricted Application added a project: User-Ladsgroup. May 18 2018, 11:09 AM

How does that technique avoid the possibility that memcached loses the updated count before it can be written to the DB?

In T185355#4213389, @Anomie wrote:

How does that technique avoid the possibility that memcached loses the updated count before it can be written to the DB?

I guess that would simply be considered an acceptable loss.

In T185355#4213247, @Catrope wrote:

I also had an in-person conversation with @Ladsgroup where he said he intended to use the same technique that site_stats uses to manage incrementing fields (store a value in memcached, use its increment primitive, and periodically write the value to the DB) for the ctd_count field.

I don't think this is necessary. We don't actually use that site_stats memcached update code in production. The code was merged to core on May 11, 2012, and then a non-functional pilot deployment was done on May 17, 2012. Nothing has been touched since. $wgSiteStatsAsyncFactor is false on most wikis, and 1 on a few test wikis, but 1 apparently means the same thing as false. site_stats is presumably hotter than change_tag_def will be, so I don't think refactoring Aaron's complex SiteStatsUpdate code should be a dependency for this task.

My recommendation is that the change_tag_def update be done in a separate transaction, preferably using autocommit mode, to minimise the time the lock is held.
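As an illustration of that recommendation only, the sketch below keeps the counter update out of the main tagging transaction and runs it as a single autocommitted statement. It assumes SQLite via Python's sqlite3 module, where isolation_level=None selects autocommit mode; SQLite's locking differs from MariaDB's row locks, and open_db, tag_revision and bump_count are hypothetical names.

```python
import sqlite3


def open_db():
    # isolation_level=None: the sqlite3 module never opens a transaction
    # implicitly, so every statement autocommits unless we issue BEGIN.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.executescript("""
        CREATE TABLE change_tag_def (
            ctd_id INTEGER PRIMARY KEY,
            ctd_name TEXT NOT NULL UNIQUE,
            ctd_count INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE change_tag (
            ct_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ct_rev_id INTEGER,
            ct_tag_id INTEGER NOT NULL
        );
        INSERT INTO change_tag_def (ctd_id, ctd_name) VALUES (1, 'mobile edit');
    """)
    return db


def bump_count(db, tag_id):
    """Single autocommitted statement: the lock on the hot counter row is
    held only for the duration of this one UPDATE."""
    db.execute(
        "UPDATE change_tag_def SET ctd_count = ctd_count + 1 WHERE ctd_id = ?",
        (tag_id,),
    )


def tag_revision(db, rev_id, tag_id):
    """Main (possibly long) transaction first, counter update afterwards."""
    db.execute("BEGIN")
    try:
        db.execute(
            "INSERT INTO change_tag (ct_rev_id, ct_tag_id) VALUES (?, ?)",
            (rev_id, tag_id),
        )
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
    bump_count(db, tag_id)
```

The trade-off is that the count can briefly lag behind change_tag, or drift if the process dies between the two steps, in exchange for not holding a lock on the counter row for the length of the main transaction.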

matej_suchanek added a parent task: T91535: Performance issues with tags. Jun 2 2018, 7:00 AM

Anomie mentioned this in T193690: RFC: How should we fix the undeletion system?. Jun 4 2018, 7:13 PM

Lucas_Werkmeister_WMDE added a project: Wikidata-Campsite. Jun 5 2018, 1:12 PM

Addshore moved this task from Incoming to In Progress on the Wikidata-Campsite board. Jun 5 2018, 1:12 PM

Lucas_Werkmeister_WMDE closed subtask T193867: Create the change_tag_def table and add the ct_tag_id field to change_tag as Resolved. Jun 5 2018, 1:21 PM

jcrespo mentioned this in T191581: Nullify ct_rc_id and ts_rc_id when recent changes entries are being deleted. Jun 15 2018, 12:15 PM

Lydia_Pintscher closed subtask T193871: Add maintenance script to populate change_tag_def as Resolved. Jun 20 2018, 3:09 PM

Lydia_Pintscher closed subtask T193868: Add code to write to change_tag_def table as Resolved.

mpopov subscribed. Jun 25 2018, 11:34 PM

• Vvjjkkii reopened subtask T193868: Add code to write to change_tag_def table as Open. Jul 1 2018, 1:11 AM

• Vvjjkkii reopened subtask T193867: Create the change_tag_def table and add the ct_tag_id field to change_tag as Open.

• Vvjjkkii reopened subtask T193871: Add maintenance script to populate change_tag_def as Open.

CommunityTechBot closed subtask T193871: Add maintenance script to populate change_tag_def as Resolved. Jul 2 2018, 4:28 PM

CommunityTechBot closed subtask T193868: Add code to write to change_tag_def table as Resolved.

CommunityTechBot closed subtask T193867: Create the change_tag_def table and add the ct_tag_id field to change_tag as Resolved.

Liuxinyu970226 subscribed. Jul 3 2018, 3:13 AM

MusikAnimal mentioned this in T199234: Find a better way to notify tool maintainers of schema and API changes. Jul 10 2018, 3:13 PM

Hey, one kind inquiry: was this announced anywhere other than the wikidata-tech mailing list? This doesn't appear to strictly be about wikidata. See also T199234. I am struggling to keep up with schema changes that affect my tools.

In T185355#4412231, @MusikAnimal wrote:

Hey, one kind inquiry: was this announced anywhere other than the wikidata-tech mailing list? This doesn't appear to strictly be about wikidata. See also T199234. I am struggling to keep up with schema changes that affect my tools.

We recently announced it on wikitech-l.

It was also highlighted in Scrum of Scrums for inter-team communication at least a couple of times.

Change 446366 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] labs: Add change_tag_def to labs replicas

https://gerrit.wikimedia.org/r/446366

Change 446366 merged by Jcrespo:
[operations/puppet@production] labs: Add change_tag_def to labs replicas

https://gerrit.wikimedia.org/r/446366

Ladsgroup mentioned this in T164167: change_tag table needs redesign. Jul 25 2018, 5:03 PM

Ladsgroup closed subtask T164167: change_tag table needs redesign as Resolved.

Ladsgroup mentioned this in T20672: Change Tags structure to use numeric IDs instead of text.

Ladsgroup closed subtask T20672: Change Tags structure to use numeric IDs instead of text as Resolved.

Ladsgroup closed subtask T194165: Start writing to change_tag_def in production as Resolved.

Ladsgroup mentioned this in T91535: Performance issues with tags. Jul 25 2018, 5:07 PM

Ladsgroup added a subtask: T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis). Jul 25 2018, 5:25 PM

Lydia_Pintscher closed subtask T199334: Temporarily add config and use it to use change_tag_def table instead of change_tag table for Special:Tags as Resolved. Jul 26 2018, 3:37 PM

Getting all tags for a given revision/log entry/RC entry requires a GROUP_CONCAT. There is a tag_summary table to serve as a rollup for this, but for some reason we stopped using it (at Sean Pringle's instruction, IIRC).

@Catrope, @Ladsgroup, what is the plan for the tag_summary table? Are we planning to keep it around in its current form? I do actually use it for analytics purposes, but I can switch to use change_tag if there's a month or so to migrate after the new schema is in place.

@Neil_P._Quinn_WMF I don't know for what type of analysis you need the table but did you try using the recently-introduced and fancy change_tag_def table and joining it with tag_summary?

In T185355#4453730, @Ladsgroup wrote:

@Neil_P._Quinn_WMF I don't know for what type of analysis you need the table but did you try using the recently-introduced and fancy change_tag_def table and joining it with tag_summary?

Well, right now, that join doesn't make sense, because the ts_tags field already contains a comma separated list of tag names (e.g. mobile edit,mobile web edit,possible libel or vandalism). My question is whether this will change in the future :)

I have to count edits for specific interfaces, which sometimes requires looking at multiple tags (e.g. mobile visual edits are those tagged with mobile web edit and visual edit). So I use regular expressions over ts_tags, since that saves me the step of grouping and concatenating the various rows from change_tag.

I just looked at the tag_summary table in depth and it's another beast that I'm not going to touch (probably it's better just to ditch the whole thing after we properly normalize change_tag, what do you think @Catrope?). I would suggest using ct_tag_id in the change_tag table instead; given that it's a number rather than a string, querying and grouping by it would be faster.

In T185355#4456115, @Ladsgroup wrote:

(probably it's better just to ditch the whole thing after we properly normalize change_tag, what do you think @Catrope?) I would suggest using ct_tag_id in the change_tag table instead; given that it's a number rather than a string, querying and grouping by it would be faster.

Sounds fine to me! I just ask that you wait a month after adding ct_tag_id before dropping tag_summary, so I can properly migrate. I'm following this task so I'll see what you decide.
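For the analytics use case discussed above (counting edits that carry several tags at once, such as mobile web edit plus visual edit), a query over the normalized tables might look like this sketch. It assumes SQLite via Python's sqlite3 module and that change_tag already has ct_tag_id populated; make_db and revisions_with_all_tags are hypothetical names and the sample rows are invented.

```python
import sqlite3


def make_db():
    """Tiny normalized dataset: revision 102 has both tags of interest."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE change_tag_def (
            ctd_id INTEGER PRIMARY KEY,
            ctd_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE change_tag (
            ct_id INTEGER PRIMARY KEY AUTOINCREMENT,
            ct_rev_id INTEGER,
            ct_tag_id INTEGER NOT NULL
        );
        INSERT INTO change_tag_def (ctd_id, ctd_name) VALUES
            (1, 'mobile edit'), (2, 'mobile web edit'), (3, 'visual edit');
        INSERT INTO change_tag (ct_rev_id, ct_tag_id) VALUES
            (101, 1), (101, 2),
            (102, 1), (102, 2), (102, 3),
            (103, 3);
    """)
    return db


def revisions_with_all_tags(db, names):
    """Revision IDs that carry every tag in `names` (join + HAVING, no regex)."""
    names = sorted(set(names))
    placeholders = ",".join("?" for _ in names)
    sql = (
        "SELECT ct_rev_id FROM change_tag"
        " JOIN change_tag_def ON ctd_id = ct_tag_id"
        f" WHERE ctd_name IN ({placeholders})"
        " GROUP BY ct_rev_id"
        " HAVING COUNT(DISTINCT ct_tag_id) = ?"
        " ORDER BY ct_rev_id"
    )
    return [row[0] for row in db.execute(sql, (*names, len(names)))]
```

This replaces the regular expression over ts_tags with a grouped count of matching tag IDs per revision; whether it is actually faster on production data would need measuring.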

Krinkle moved this task from Untriaged to Schema changes on the MediaWiki-libs-Rdbms board. Jul 28 2018, 8:56 PM

nshahquinn-wmf mentioned this in T201062: Load change tags into the Analytics Data Lake on a daily basis. Aug 2 2018, 6:12 PM

matej_suchanek mentioned this in T202195: Should AbuseFilter identify change tags by id (and not name)?. Aug 18 2018, 2:48 PM

Addshore added a project: Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)). Aug 23 2018, 9:27 AM

Addshore moved this task from To Do (prioritised from top to bottom) to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Aug 23 2018, 9:27 AM

Ladsgroup closed subtask T193873: Run maintenance script to populate change_tag_def on WMF production (all wikis) as Resolved. Sep 6 2018, 9:04 PM

Ladsgroup closed subtask T194162: Add code to read from change_tag_def instead of change_tag.ct_tag as Resolved. Sep 7 2018, 9:20 AM

Lydia_Pintscher closed subtask T193874: Add new indexes to change_tag table using ct_tag_id instead of ct_tag as Resolved. Sep 16 2018, 10:38 AM

jcrespo mentioned this in T205904: Key 'change_tag_rev_tag' doesn't exist in table 'change_tag'. Oct 1 2018, 5:40 PM

nshahquinn-wmf mentioned this in T205940: Add change tag tables to monthly mediawiki_history sqoop. Oct 1 2018, 11:52 PM

nshahquinn-wmf mentioned this in T161149: Provide edit tags in the Data Lake edit data. Oct 8 2018, 7:49 PM

Addshore triaged this task as Medium priority. Oct 9 2018, 1:46 PM

Restricted Application added a project: Wikidata. Oct 9 2018, 1:46 PM

Addshore moved this task from incoming to in progress on the Wikidata board. Oct 11 2018, 8:41 AM

Change 467957 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Set migration stage for change tag to read new

https://gerrit.wikimedia.org/r/467957

Addshore mentioned this in T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out. Oct 19 2018, 10:34 AM

Addshore removed a project: Wikidata-Campsite. Nov 1 2018, 11:18 AM

Addshore closed subtask T194164: Start reading from change_tag_def in production as Resolved. Nov 6 2018, 12:13 PM

Addshore added a subtask: T208846: Start reading from change_tag_def on wikidatawiki. Nov 6 2018, 12:16 PM

Addshore changed the status of subtask T208846: Start reading from change_tag_def on wikidatawiki from Open to Stalled.

Ladsgroup changed the status of subtask T208846: Start reading from change_tag_def on wikidatawiki from Stalled to Open. Nov 14 2018, 10:25 AM

Addshore closed subtask T208846: Start reading from change_tag_def on wikidatawiki as Resolved. Nov 14 2018, 4:05 PM

Change 467957 merged by jenkins-bot:
[mediawiki/core@master] Set migration stage for change tag to read new

https://gerrit.wikimedia.org/r/467957

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.6; 2018-11-27). Nov 14 2018, 5:00 PM

Ladsgroup mentioned this in T209525: Migrate tag_summary usage to change_tag and drop the table. Nov 14 2018, 7:22 PM

Addshore closed subtask T200064: Update documentation on mediawiki.org: Manual:Change_tag_table as Resolved. Nov 15 2018, 1:06 PM

Ladsgroup mentioned this in T136687: Database error when filtering page log. Nov 28 2018, 8:13 AM

Ladsgroup closed subtask T194163: Drop change_tag.ct_tag column as Resolved. Nov 29 2018, 3:11 PM

Almost a year now.

Everything is done 🥳 🥳🥳🥳 (We need to apply the schema change in prod and the ticket is on track: T210713: Drop change_tag.ct_tag column in production)

Just look at all of the closed subtasks.

Ladsgroup moved this task from Doing to Done on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Nov 29 2018, 3:49 PM

nshahquinn-wmf awarded a token. Nov 29 2018, 9:56 PM

nshahquinn-wmf mentioned this in T210807: Update editor_month generation to use change_tag table rather than tag_summary. Nov 29 2018, 11:06 PM

TTO awarded a token. Nov 30 2018, 12:03 AM

Liuxinyu970226 unsubscribed. Dec 2 2018, 7:22 AM

• Marostegui added a subtask: T211896: Query trying to use the wrong index (change_tag_rev_tag) on change_tag. Dec 13 2018, 4:20 PM

• Marostegui mentioned this in T211896: Query trying to use the wrong index (change_tag_rev_tag) on change_tag.

Ladsgroup closed subtask T211896: Query trying to use the wrong index (change_tag_rev_tag) on change_tag as Resolved. Dec 13 2018, 6:09 PM

Bawolff mentioned this in T211849: A particular edit not showing on watchlist. Dec 16 2018, 1:39 AM

• Marostegui mentioned this in T124214: Allow filtering based on tag on Special:NewFiles. Jan 18 2019, 11:29 AM

Ladsgroup mentioned this in T89217: Should be possible to rename or merge change tags. Jan 21 2019, 10:13 AM