Refactor comment storage in the database and abstract access in MediaWiki
Closed, ResolvedPublic

Description

This is the task to implement T153333: RFC: How should we store longer revision comments?, now that the request has been closed. The text below is my current plan for the code I need to write.

Database

We'll create a new comment table that looks like this:

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/comment_text ON comment (comment_text(100));

Comments stored in this table will be deduplicated. Replication to Labs will need to hide rows that aren't referred to by any of the tables listed below, or that are only referred to by the tables below in rows with revision-deleted comments. I'll provide more details on that in the schema change subtask, once I get to that point.

The end state will be to make the following changes to existing tables to refer to this new table instead of storing the comment directly:

  • revision.rev_commentrevision.rev_comment_id
  • archive.ar_commentarchive.ar_comment_id
  • ipblocks.ipb_reasonipblocks.ipb_reason_id
  • image.img_descriptionimage.img_description_id
  • oldimage.oi_descriptionoldimage.oi_description_id
  • filearchive.fa_deleted_reasonfilearchive.fa_deleted_reason_id
  • filearchive.fa_descriptionfilearchive.fa_description_id
  • recentchanges.rc_commentrecentchanges.rc_comment_id
  • logging.log_commentlogging.log_comment_id
  • protected_titles.pt_reasonprotected_titles.pt_reason_id
  • cu_changes.cuc_commentcu_changes.cuc_comment_id (from CheckUser)
  • cu_log.cul_reasoncu_log.cul_reason_id (from CheckUser)
    • These two should be ignored when determining visibility in the Labs replication
  • global_block_whitelist.gbw_reasonglobal_block_whitelist.gbw_reason_id (from GlobalBlocking)
  • globalblocks.gb_reasonglobalblocks.gb_reason_id (from GlobalBlocking)
  • flow_revision.rev_contentflow_revision.rev_content_id (from StructuredDiscussions)

However, since schema changes on large busy tables are a lot of work that we don't want to be blocked on for months, the revision and image tables will use a temporary auxiliary table that will later be merged into the main tables:

CREATE TABLE /*_*/revision_comment_temp (
  revcomment_rev bigint unsigned NOT NULL,     -- → revision.rev_id
  revcomment_comment bigint unsigned NOT NULL, -- → comment.comment_id
  PRIMARY KEY (revcomment_rev, revcomment_comment)
) /*$wgDBTableOptions*/;

CREATE TABLE /*_*/image_comment_temp (
  imgcomment_image varchar(255) binary NOT NULL, -- → image.img_name, ugh
  imgcomment_comment bigint unsigned NOT NULL,   -- → comment.comment_id
  PRIMARY KEY (imgcomment_image, imgcomment_comment)
) /*$wgDBTableOptions*/;

MediaWiki code

There will be a temporary feature flag with four states:

  • Read and write the old columns only.
  • Write both the old and new columns. Read from new preferentially, falling back to old.
  • Write only the new columns. Read from new preferentially, falling back to old.
  • Read and write the new columns only.

There will be a class that manages the feature flag and such. Methods it'll probably need:

  • getFields( $key ): Pass e.g. rev_comment, returns the fields that should be selected. The comment might have to be lazy-loaded later.
  • getJoin( $key ): Pass e.g. rev_comment, returns the tables to join, join conditions, and fields that should be selected.
  • insert( $dbw, $key, $comment ): Inserts the comment into the comment table, if applicable and necessary, and returns a mapping of field → value to be included in the insert into the main table.
    • I'll want a version of this that takes some LogEntry-like class (or maybe just use LogEntry?) and generates the comment_text and comment_data from it.
  • getComment( $key, $row ): Returns the comment string (or message?). May need to make a DB query if getFields() was used instead of getJoin().

Then I'll need to search the code for references to the old columns and patch them to use this new class.

Related Objects

Anomie created this task.May 31 2017, 9:05 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 31 2017, 9:05 PM
jcrespo added a comment.EditedJun 1 2017, 7:00 AM

I love this and will support it to the best of my ability.

I have one question and one suggestion: The question is the meaning of comment_data (I do not see it anywhere). The suggestion is to announce "large changes" on labs list (or wikitech) so people are aware of relatively impacting changes, stressing why they are needed (e.g. to allow for longer comments in this case, and improve performance).

To keep the Phab task shorter, I left out the comments from the SQL. The actual patch will have comments explaining the purpose. But for now: the initial intended meaning[1] of comment_data is to hold a JSON blob that can be used to localize automatically-generated comments, much like how if you visit Special:Log on a wiki it will show the log messages in the language you selected in your preferences. As far as I know the primary driver of this is Wikibase, which already does this sort of thing somehow or other.

That's a good suggestion. If I forget, remind me to do that once I've written the code and put the patch into Gerrit. I created T166798 since an announcement would probably want to take that into account.

[1]: I say "initial intended meaning" because I've already heard people considering sticking other data in there.

That is cool, I just asked because I hadn't seen it yet on previous references and I think it was more or less a "new" but not completely obvious field to me when I read it. Thank you for the explanation.

Change 357599 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

Anomie added a comment.Jun 7 2017, 1:56 PM

@jcrespo: It'd be nice to prevent duplicate rows in these temporary tables at the database level as well as the application level, which means a unique index on the revcomment_rev column. Which do you think would be best?

  1. Primary key on (revcomment_rev, revcomment_comment), unique index on (revcomment_rev).
  2. Primary key on (revcomment_rev), covering non-unique index on (revcomment_rev, revcomment_comment).
  3. Primary key on (revcomment_rev), no covering index.

When I test these on a small local database (around 4000 rows), EXPLAIN on #1 uses a 'ref' on PRIMARY with "Using index" while the other two both prefer an 'eq_ref' on PRIMARY with no "Using index".

I will answer based on certain assumptions, please tell me if those are wrong:

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks. Basically, I expect queries to be:

SELECT ... FROM revision JOIN revision_comment_temp ON revision.rev_id = revcomment.revcomment_rev JOIN comment on revcomment.revcomment_rev = comment.comment_id WHERE /* filter on revision */

n-n relationship tables are normally setup with a covering PK in the order they are most frequently access, that would be (revcomment_rev, revcomment_comment), that would allow them to be unique, clustered by the main access, and using a covering index. If the other access is needed, create an additional index on (revcomment_comment) or on (revcomment_comment, revcomment_rev), but for this second one, I would wait for it to be necesary, for queries like:

SELECT ... FROM comment JOIN comment on revcomment.revcomment_rev = comment.comment_id JOIN revision ON revision.rev_id = revcomment.revcomment_rev WHERE /* filter on comment */

Where it is expected for the join to go in the opposite direction. I would assume that is not currently the case because I think there is currently no index on comment, but please tell me otherwise, as that will change my suggestion. I strongly suggest to not add such a possibility on a first step (it is ok, if it is for maintenance/migration purposes).

Anomie added a comment.Jun 7 2017, 2:45 PM

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks.

For production usage, all accesses will be from revision to comment.

In addition to potential maintenance or analytics tasks, Labs replica filtering will probably need to make accesses from comment to revision to determine whether a comment row should be visible or not. These sorts of things would also likely want indexes on columns like ipblocks.ipb_reason_id that aren't using these temporary tables.

n-n relationship tables

This isn't an n-n relation, it's n-1. Each revision has only one comment, it's an error if there are multiple rows with the same revcomment_rev.

jcrespo added a comment.EditedJun 7 2017, 3:13 PM

n-n or 1-n, same things apply. I meant "link table/relationship table".

We shouldn't change mediawiki based on labs needs (although we can facilitate it). Labs could have its own extra indexes if needed.

Anomie added a comment.Jun 7 2017, 4:19 PM

n-n or 1-n, same things apply.

Except my question is "what's the best way to add a unique constraint on (revcomment_rev)?" and that hasn't been answered.

Well, on top of the previous thing, if additionally one of the column needs a unique constraint, there is only one way to do it, make it unique. That is completely independent from the performance or best practices of the tables, it is just a contraint.

Anomie added a comment.Jun 7 2017, 4:41 PM

Ok, thanks.

-jem- added a subscriber: -jem-.Jun 8 2017, 10:58 AM
Blahma added a subscriber: Blahma.Jun 8 2017, 12:41 PM

Change 357892 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] WIP: Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

Tbayer added a subscriber: Tbayer.Jun 12 2017, 10:08 AM

Change 357599 merged by jenkins-bot:
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

I wrote about my concerns about the deduplication scheme at T153333#3491632 , since that's where most of the discussion on that topic was. Apologies for the late review, but seeing the code has made the problems clearer to me.

Change 357892 merged by jenkins-bot:
[mediawiki/core@master] Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

Change 374856 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374857 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374858 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374859 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374860 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

Change 374861 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374862 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

Change 374863 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374864 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374865 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374862 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

Change 374857 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374865 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374859 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374864 merged by jenkins-bot:
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374861 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374863 merged by jenkins-bot:
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374856 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374858 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374860 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

Jdforrester-WMF added a comment.EditedSep 5 2017, 8:35 PM

That's the last open patch for this merged – does that mean this is Resolved? Obviously there's still the migration of content in Wikimedia production, but I guess that's for T166733 not this ticket?

Anomie added a comment.Sep 5 2017, 8:39 PM

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

Change 376286 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

Change 376286 merged by Chad:
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

Anomie added a comment.Sep 6 2017, 7:26 PM

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

Thanks! I also managed to find it in error.log on mwlog1001 this morning. It turns out there were three different traces: one in Flow fixed by https://gerrit.wikimedia.org/r/#/c/374861/, and two in MobileFrontend that are being tracked at T175161.

Anomie closed this task as Resolved.Sep 7 2017, 1:35 PM

Ok, let's mark this resolved. The only remaining known issue is tracked in T175161.

For my own convenience (and in case someone asks), WMF deployment is coordinated on T174569.

Change 377779 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

Change 377779 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

Change 404506 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506

Change 404506 merged by jenkins-bot:
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506