
change_tag table needs redesign
Closed, Resolved · Public

Description

One of the top slow queries (P5295):

SELECT /* ChangeTags::tagUsageStatistics */ ct_tag, count(*) AS `hitcount` FROM `change_tag` GROUP BY ct_tag ORDER BY hitcount DESC

Looking at some rows in the table:

MariaDB [wikidatawiki_p]> select * from change_tag LIMIT 20;
+-----------+-----------+-----------+---------------+-----------+
| ct_rc_id  | ct_log_id | ct_rev_id | ct_tag        | ct_params |
+-----------+-----------+-----------+---------------+-----------+
| 180404170 |      NULL | 179938677 | OAuth CID: 93 | NULL      |
| 180404172 |      NULL | 179938679 | HHVM          | NULL      |
| 180404173 |      NULL | 179938680 | HHVM          | NULL      |
| 180404174 |      NULL | 179938681 | HHVM          | NULL      |
| 180404175 |      NULL | 179938682 | HHVM          | NULL      |
| 180404175 |      NULL | 179938682 | OAuth CID: 93 | NULL      |
| 180404176 |      NULL | 179938683 | HHVM          | NULL      |
| 180404177 |      NULL | 179938684 | HHVM          | NULL      |
| 180404178 |      NULL | 179938685 | HHVM          | NULL      |
| 180404179 |      NULL | 179938686 | HHVM          | NULL      |
| 180404180 |      NULL | 179938687 | HHVM          | NULL      |
| 180404180 |      NULL | 179938687 | OAuth CID: 93 | NULL      |
| 180404181 |      NULL | 179938688 | HHVM          | NULL      |
| 180404182 |      NULL | 179938689 | HHVM          | NULL      |
| 180404183 |      NULL | 179938690 | HHVM          | NULL      |
| 180404184 |      NULL | 179938691 | HHVM          | NULL      |
| 180404184 |      NULL | 179938691 | OAuth CID: 93 | NULL      |
| 180404185 |      NULL | 179938692 | HHVM          | NULL      |
| 180404186 |      NULL | 179938693 | HHVM          | NULL      |
| 180404187 |      NULL | 179938694 | HHVM          | NULL      |
+-----------+-----------+-----------+---------------+-----------+
20 rows in set (0.00 sec)

The ct_tag column is horrible in so many ways.

  • It uses a varchar tag name instead of a foreign key into another table (or at least a hash) for faster lookups
  • Because it is a varchar, the name of a tag cannot be changed easily (you might need to rewrite tons of rows to do that)
  • Getting hit-count statistics is effectively impossible on big tables.

I suggest ct_tag gets moved to another table and only a foreign key stays here.
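For illustration, a normalized layout could look roughly like this; the table and column names (change_tag_def, ctd_*) are placeholders, not a finalized schema:

-- Illustrative sketch only: tag names move to their own small table
CREATE TABLE change_tag_def (
  -- surrogate key that change_tag would reference via an integer ct_tag_id column
  ctd_id int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- the tag name that is currently repeated in every change_tag row
  ctd_name varbinary(255) NOT NULL,
  UNIQUE KEY ctd_name (ctd_name)
);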

Event Timeline

It uses a varchar tag name instead of a foreign key into another table (or at least a hash) for faster lookups

I assume the reason it (and tag_summary) were written using varchar tag names as identifiers was to avoid having to make an additional join whenever you query ct_tag or ts_tag. We don't really care about the speed of lookups on ct_tag in general. On the other hand, whether this actually matters from a performance perspective is something @jcrespo would know more about.

Because it is a varchar, the name of a tag cannot be changed easily (you might need to rewrite tons of rows to do that)

This shouldn't ever be needed though. Display names for tags can be changed in the UI.

Getting hit-count statistics is effectively impossible on big tables.

https://gerrit.wikimedia.org/r/334337/ tries to improve this situation by creating a tag_statistics table, which is essentially a pivot of change_tag (just as tag_summary is an aggregate of change_tag).
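For context, a rough sketch of what such a pivot could hold; the column names here are guesses and the actual patch may differ:

-- one row per tag, with a precomputed hit count
CREATE TABLE tag_statistics (
  tgs_tag varbinary(255) NOT NULL PRIMARY KEY,
  tgs_hitcount int unsigned NOT NULL DEFAULT 0
);

-- populated (and periodically refreshed) from change_tag
INSERT INTO tag_statistics (tgs_tag, tgs_hitcount)
  SELECT ct_tag, COUNT(*) FROM change_tag GROUP BY ct_tag;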

I suggest ct_tag gets moved to another table and only a foreign key stays here.

Do you suggest to change tag_summary as well?

I assume the reason it (and tag_summary) were written using varchar tag names as identifiers was to avoid having to make an additional join whenever you query ct_tag or ts_tag.

JOIN is cheap if the proper index is in place; what is expensive is slow lookups like this.
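To make the comparison concrete, assuming the hypothetical change_tag_def table sketched in the description and an indexed integer ct_tag_id column on change_tag, the lookup becomes a point read on a tiny table plus an equality join on a 4-byte column:

-- today: the varchar tag name is compared on every candidate row
SELECT ct_rev_id FROM change_tag WHERE ct_tag = 'HHVM';

-- normalized (hypothetical schema): look up the id once, then join on the integer key
SELECT ct_rev_id
FROM change_tag
JOIN change_tag_def ON ct_tag_id = ctd_id
WHERE ctd_name = 'HHVM';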

This shouldn't ever be needed though. Display names for tags can be changed in the UI.

Okay, but it is still not a good idea; it would cause storage issues.

Getting hit-count statistics is effectively impossible on big tables.

https://gerrit.wikimedia.org/r/334337/ tries to improve this situation by creating a tag_statistics table, which is essentially a pivot of change_tag (just as tag_summary is an aggregate of change_tag).

I don't think that would be a good idea, for several reasons, but @jcrespo knows better.

I suggest ct_tag gets moved to another table and only a foreign key stays here.

Do you suggest to change tag_summary as well?

Yeah.

JOIN is cheap if the proper index is in place; what is expensive is slow lookups like this.

I kind of agree with that statement in general. There have been many instances where denormalization has clearly been used to reduce development time, but on the persistence layer normalization produces smaller storage, and while disk is "cheap", memory and execution time when going over millions of records are not. A JOIN will be faster because fewer bytes are read from disk: a regular identifier is 32 bits, while a text identifier can be anywhere between 8 and, say, 255*4 bytes (which gets multiplied further because secondary indexes store the PK internally). The integer identifier is much more easily cacheable (it will stay in memory), so in practice only one table is touched on disk, and it is much smaller.
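If we want to put numbers on that, one way (assuming InnoDB with persistent statistics; the database name enwiki is only an example) is to compare the on-disk size of the change_tag indexes before and after such a change:

-- approximate on-disk size, in bytes, of each index of change_tag
SELECT index_name, stat_value * @@innodb_page_size AS size_bytes
FROM mysql.innodb_index_stats
WHERE database_name = 'enwiki'
  AND table_name = 'change_tag'
  AND stat_name = 'size';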

I have not had the time to look at this issue in particular so I cannot say exactly what to do or not to do, but my default position is Ladsgroup's, and the opposite view should be justified, not the other way around.

This shouldn't ever be needed though. Display names for tags can be changed in the UI.

But the table receives schema changes (and a schema change is basically rewriting the full table), and maybe in the future a bug will be found and we will have to delete X tags, or modify them in some way. There are millions of reasons to normalize, and only a few not to. For example, revision comments cannot be changed after an edit, but they are scheduled to be separated into a new table in the next revision schema change. This will reduce a 400GB+ table to a 200GB table + a 20GB table.

I know that normalization takes some extra lines of code and is slightly more complex, but consider that on the storage layer we do not deal with one or a few objects at a time in memory (the normal live-data workflow); we store and maintain millions of records, and it makes sense to optimize at serialization time as much as possible. In the end it also impacts you, because a schema change on a large table can take months, while on a small table it takes a single day, so it is investing time and effort now to make further optimizations much easier.

Normalization also sometimes makes summary tables unnecessary, e.g. something like tag_usages {tag_id (PK), revision_id (PK)} + tags {tag_id (PK), tag_name, uses_count}, and we can still denormalize if needed with tag_serialized {revision_id (PK), all_tags_json}.
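Spelled out as DDL, that sketch would be roughly the following (illustrative only, not a reviewed schema):

CREATE TABLE tags (
  tag_id int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  tag_name varbinary(255) NOT NULL,
  uses_count int unsigned NOT NULL DEFAULT 0,
  UNIQUE KEY tag_name (tag_name)
);

CREATE TABLE tag_usages (
  tag_id int unsigned NOT NULL,
  revision_id int unsigned NOT NULL,
  PRIMARY KEY (tag_id, revision_id)
);

-- optional denormalized copy, only if a per-revision blob is still wanted
CREATE TABLE tag_serialized (
  revision_id int unsigned NOT NULL PRIMARY KEY,
  all_tags_json blob NOT NULL
);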

Again, as a reminder, I have not looked at the current structure, so these are very general statements, but everything that Ladsgroup said made sense and I wanted to clarify those points.

Getting hit-count statistics is effectively impossible on big tables.

https://gerrit.wikimedia.org/r/334337/ tries to improve this situation by creating a tag_statistics table, which is essentially a pivot of change_tag (just as tag_summary is an aggregate of change_tag).

What's confusing to me is that tag_summary is no longer used as an aggregate; instead a GROUP_CONCAT() is used directly. I was very puzzled by this decision, but it was made at the encouragement of the previous DBA. See T55577: IndexPager::buildQueryInfo (LogPager) query needs tuning. @jcrespo , do you have any comment on that (tag_summary vs GROUP_CONCAT()) and/or the pivot table proposal (tag_statistics vs COUNT())?
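For reference, the two shapes being compared are roughly the following (simplified; the revision ID is taken from the sample rows above):

-- read the precomputed aggregate from tag_summary
SELECT ts_tags FROM tag_summary WHERE ts_rev_id = 179938679;

-- aggregate change_tag directly at query time with GROUP_CONCAT()
SELECT GROUP_CONCAT(ct_tag SEPARATOR ',')
FROM change_tag
WHERE ct_rev_id = 179938679;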

I suggest ct_tag gets moved to another table and only a foreign key stays here.

Do you suggest to change tag_summary as well?

Moreover, what do you (@Ladsgroup) suggest we do with ct_params?

Finally, re changing tag names: I agree with @TTO that that should never happen and is a non-issue, but I agree with @Ladsgroup that there are other reasons to want to move the tag name out to its own table. I have no strong opinion on whether we should or shouldn't do that, so I'll defer to @jcrespo (who appears to say that he's in favor of it?).

Some additional context: part of the reason @Ladsgroup and I started talking about this is because we have ideas that would involve using tags more (or at least I do; I forget what the impetus was from his or Daniel's side), and we feel like we should have tag infrastructure that scales properly before we throw a lot more data at it.

My idea (or rather, Daniel Kinzler's suggestion on top of my idea) was to (ab)use change tags for tracking whether a revision was reverted and possibly which revision it was reverted by. I originally wanted to add an additional table for this, but Daniel pointed out that this would fit within ct_tag + ct_params. However, that would probably about double the number of revision tags (see numbers below), so we felt like we should clean things up in this area before doing that.
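As a sketch of what that would look like under the current schema (the tag name and the ct_params format here are hypothetical, nothing has been decided):

-- mark revision 179938680 as reverted, recording the reverting revision in ct_params
INSERT INTO change_tag (ct_rev_id, ct_tag, ct_params)
VALUES (179938680, 'reverted', '{"revertedBy": 179938999}');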

Numbers:

mysql:research@s3-analytics-slave [enwiki]> select count(*) from change_tag where ct_rev_id between 783000000 and 783999999;
+----------+
| count(*) |
+----------+
|   215510 |
+----------+
1 row in set (0.15 sec)

mysql:research@s3-analytics-slave [enwiki]> select count(distinct ct_rev_id) from change_tag where ct_rev_id between 783000000 and 783999999;
+---------------------------+
| count(distinct ct_rev_id) |
+---------------------------+
|                    127325 |
+---------------------------+
1 row in set (0.52 sec)

So based on a sample of 1M consecutive recent edits, 12.7% of edits are tagged with something, and the average tagged edit has about 1.7 tags. Finding data on reverts is hard (at least for me; maybe others have it), but I'd be surprised if the revert rate was lower than 10%, and I could easily believe it being as high as 21%. A 21% revert rate at one tag per revert would roughly double the volume of tag rows, and a 42% revert rate (I doubt it's that high) would triple it.

What's confusing to me is that tag_summary is no longer used as an aggregate; instead a GROUP_CONCAT() is used directly. I was very puzzled by this decision, but it was made at the encouragement of the previous DBA. See T55577: IndexPager::buildQueryInfo (LogPager) query needs tuning. @jcrespo , do you have any comment on that (tag_summary vs GROUP_CONCAT()) and/or the pivot table proposal (tag_statistics vs COUNT())?

There are two types of aggregation we are talking about: one is based on rev_id (or log_id), which ends up as tag_summary, and the other is based on the tags themselves, which ends up as tag_statistics. What I think is that if we remake the change_tag table with proper normalization and indexing, none of these other tables would be needed.
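To illustrate, under a normalized change_tag (an integer ct_tag_id plus the hypothetical change_tag_def lookup table sketched earlier), both aggregations become ordinary indexed queries:

-- per-revision tag list (what tag_summary provides today)
SELECT ctd_name
FROM change_tag
JOIN change_tag_def ON ct_tag_id = ctd_id
WHERE ct_rev_id = 179938679;

-- per-tag hit counts (what tag_statistics would provide)
SELECT ctd_name, COUNT(*) AS hitcount
FROM change_tag
JOIN change_tag_def ON ct_tag_id = ctd_id
GROUP BY ctd_name
ORDER BY hitcount DESC;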

I suggest ct_tag gets moved to another table and only a foreign key stays here.

Do you suggest to change tag_summary as well?

Moreover, what do you (@Ladsgroup) suggest we do with ct_params?

I haven't thought about it fully. What do you suggest? I don't see the need to fix it at this point, but it is something we should consider eventually.

Some additional context: part of the reason @Ladsgroup and I started talking about this is because we have ideas that would involve using tags more (or at least I do; I forget what the impetus was from his or Daniel's side),

Our point of view is that Special:Tags is extremely slow given the huge number of tags in the Wikidata tables, which makes aggregation and statistics impossible, or at least far too resource-consuming. I also have storage concerns.

Since the schema of the change_tag table is due to be adjusted, what do you think of tweaking it to avoid the current situation with 3 UNIQUE indexes and nullable columns? I was thinking of something along the lines of:

-- A table to track tags for revisions, logs and recent changes.
CREATE TABLE change_tag (
  -- foreign key to revision.rev_id, recentchanges.rc_id or logging.log_id
  ct_change_id int unsigned NOT NULL,
  -- identifies type of the tagged change (log entry, revision, recent change, etc.)
  -- this could be also a tinyint identified by numeric named constants on the app level
  ct_change_type enum('log', 'revision', 'rc') NOT NULL,
  -- foreign key to change_tag_def.ctd_id
  ct_tag_id int NOT NULL,

  PRIMARY KEY (ct_change_id, ct_change_type, ct_tag_id)
);

This would simplify inserting new data into the change_tag table and querying it, while keeping the migration process fairly simple. INSERT IGNORE would no longer be necessary to write data, and we could drop two of the three unique indexes. The use of INSERT IGNORE here is particularly problematic; several times we have had replication-breaking errors due to duplicate entries.
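As an example of the write path (the tag id value is made up), today's INSERT IGNORE silently drops duplicates, whereas with the composite primary key above a plain insert would be enough and a duplicate would fail loudly:

-- current schema
INSERT IGNORE INTO change_tag (ct_rc_id, ct_rev_id, ct_tag)
VALUES (180404172, 179938679, 'HHVM');

-- proposed schema: plain insert; a duplicate violates the primary key instead of being skipped
INSERT INTO change_tag (ct_change_id, ct_change_type, ct_tag_id)
VALUES (179938679, 'revision', 7);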

Pings for awareness: @Ladsgroup and @Catrope 🙂

@TK-999 your proposal assumes that change_tag will have either an RC ID, a log ID, or a revision ID. This is not the case! change_tag rows typically have an RC ID and a revision or log ID, sometimes all three. There are also change_tag rows that have a revision ID but no RC ID, but I'm not sure how that is useful.

The idea is that what is tagged is an action related to a page, which is permanently recorded in the revision table or the log table, and which also has a temporary RC entry.

Ohh, that is true, I'll have to rethink it. Still, it might make sense to consider a different schema in the long run to avoid the previously described issues...

Ladsgroup claimed this task.

This is being done as part of T185355.