Page MenuHomePhabricator

comment and actor view challenges for Cloud Services
Open, MediumPublic

Description

Following the comment and actor schema changes in Mediawiki, several iterations of changes have happened to the view layer of the wiki replicas consumed in Toolforge and Cloud VPS to keep up, with varying results. The problem now is that the actor and comment views are terribly slow compared to the underlying tables because of the large number of subqueries required to appropriately sanitize them.

This task is meant to surface and focus work on fixing the problem for the wiki replicas service in general vs the more Analytics-focused T209031.

Event Timeline

Bstorm created this task.

One likely solution was mentioned by @Anomie here T210693#4801836. I'm interested in hearing more about that and to see what the DBA group thinks about the possibility of that, since we are quite sure that materializing tables won't solve things.

I also had good conversations at All Hands with @Milimetric about doing sanitation in a totally different way that may provide some very good services in the future, but I don't think that there is a straightforward path at this time to using such a service (using hadoop, etc) to fix the current system, since cloud users depend on mysql services that are in sync with production (no matter how lagged...it needs to at least be a consistent lag) for now.

To get it all in one place, here's that proposal again. It relies on the fact that https://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html says the slave can have extra columns as long as they come after all the normal columns and have a default value (and that we're not using the "different data types" thing also described at that page.

  1. Add an extra column to the actor and comment tables on the WMCS copies of the tables (on sanitarium?), something like
ALTER TABLE actor ADD COLUMN wmcs_is_visible TINYINT NOT NULL DEFAULT 0;
ALTER TABLE comment ADD COLUMN wmcs_is_visible TINYINT NOT NULL DEFAULT 0;

We'll also want to create copies of all the indexes on actor and comment with wmcs_is_visible prefixed.

  1. Create a bunch of functions and triggers. Note $LOG_TYPE_LIST$ is supposed to be replaced with the list of types from allowed_logtypes in maintain-views.yaml.
    1delimiter //
    2CREATE FUNCTION wmcsCommentShow(id BIGINT) RETURNS INT NOT DETERMINISTIC READS SQL DATA
    3BEGIN
    4 RETURN COALESCE(
    5 ( SELECT 1 FROM image WHERE img_description_id = id LIMIT 1 ) OR
    6 ( SELECT 1 FROM filearchive WHERE fa_deleted_reason_id = id LIMIT 1 ) OR
    7 ( SELECT 1 FROM filearchive WHERE fa_description_id = id AND fa_deleted&2 = 0 LIMIT 1 ) OR
    8 ( SELECT 1 FROM ipblocks WHERE ipb_reason_id = id and ipb_deleted = 0 LIMIT 1 ) OR
    9 ( SELECT 1 FROM oldimage WHERE oi_description_id = id AND oi_deleted&2 = 0 LIMIT 1 ) OR
    10 ( SELECT 1 FROM protected_titles WHERE pt_reason_id = id LIMIT 1 ) OR
    11 ( SELECT 1 FROM recentchanges WHERE rc_comment_id = id AND rc_deleted&2 = 0 LIMIT 1 ) OR
    12 ( SELECT 1 FROM revision JOIN revision_comment_temp ON(revcomment_rev = rev_id) WHERE revcomment_comment_id = id AND rev_deleted&2 = 0 LIMIT 1 ) OR
    13 ( SELECT 1 FROM logging WHERE log_comment_id = id AND log_deleted&2 = 0 AND log_type IN ($LOG_TYPE_LIST$) LIMIT 1 )
    14 , 0);
    15END //
    16delimiter ;
    17
    18delimiter //
    19CREATE PROCEDURE wmcsCommentOnInsert(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    20BEGIN
    21 IF NOT del THEN
    22 UPDATE comment SET wmcs_is_visible = 1 WHERE comment_id = id;
    23 END IF;
    24END //
    25delimiter ;
    26
    27delimiter //
    28CREATE PROCEDURE wmcsCommentOnDelete(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    29BEGIN
    30 IF NOT del AND NOT wmcsCommentShow(id) THEN
    31 UPDATE comment SET wmcs_is_visible = 0 WHERE comment_id = id;
    32 END IF;
    33END //
    34delimiter ;
    35
    36delimiter //
    37CREATE PROCEDURE wmcsCommentOnUpdate(oldId BIGINT, oldDel TINYINT, newId BIGINT, newDel TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    38BEGIN
    39 IF oldId != newId OR (NOT oldDel) != (NOT newDel) THEN
    40 CALL wmcsCommentOnDelete(oldId, oldDel);
    41 CALL wmcsCommentOnInsert(newId, newDel);
    42 END IF;
    43END //
    44delimiter ;
    45
    46CREATE TRIGGER image_wmcsCommentOnInsert AFTER INSERT ON image
    47 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.img_description_id, 0 );
    48CREATE TRIGGER image_wmcsCommentOnUpdate AFTER UPDATE ON image
    49 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.img_description_id, 0, NEW.img_description_id, 0 );
    50CREATE TRIGGER image_wmcsCommentOnDelete AFTER DELETE ON image
    51 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.img_description_id, 0 );
    52
    53CREATE TRIGGER filearchive_wmcsCommentOnInsert_reason AFTER INSERT ON filearchive
    54 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.fa_deleted_reason_id, 0 );
    55CREATE TRIGGER filearchive_wmcsCommentOnUpdate_reason AFTER UPDATE ON filearchive
    56 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.fa_deleted_reason_id, 0, NEW.fa_deleted_reason_id, 0 );
    57CREATE TRIGGER filearchive_wmcsCommentOnDelete_reason AFTER DELETE ON filearchive
    58 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.fa_deleted_reason_id, 0 );
    59CREATE TRIGGER filearchive_wmcsCommentOnInsert_description AFTER INSERT ON filearchive
    60 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.fa_description_id, NEW.fa_deleted&2 );
    61CREATE TRIGGER filearchive_wmcsCommentOnUpdate_description AFTER UPDATE ON filearchive
    62 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.fa_description_id, OLD.fa_deleted&2, NEW.fa_description_id, NEW.fa_deleted&2 );
    63CREATE TRIGGER filearchive_wmcsCommentOnDelete_description AFTER DELETE ON filearchive
    64 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.fa_description_id, OLD.fa_deleted&2 );
    65
    66CREATE TRIGGER ipblocks_wmcsCommentOnInsert AFTER INSERT ON ipblocks
    67 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.ipb_reason_id, NEW.ipb_deleted );
    68CREATE TRIGGER ipblocks_wmcsCommentOnUpdate AFTER UPDATE ON ipblocks
    69 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.ipb_reason_id, OLD.ipb_deleted, NEW.ipb_reason_id, NEW.ipb_deleted );
    70CREATE TRIGGER ipblocks_wmcsCommentOnDelete AFTER DELETE ON ipblocks
    71 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.ipb_reason_id, OLD.ipb_deleted );
    72
    73CREATE TRIGGER oldimage_wmcsCommentOnInsert AFTER INSERT ON oldimage
    74 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.oi_description_id, NEW.oi_deleted&2 );
    75CREATE TRIGGER oldimage_wmcsCommentOnUpdate AFTER UPDATE ON oldimage
    76 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.oi_description_id, OLD.oi_deleted&2, NEW.oi_description_id, NEW.oi_deleted&2 );
    77CREATE TRIGGER oldimage_wmcsCommentOnDelete AFTER DELETE ON oldimage
    78 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.oi_description_id, OLD.oi_deleted&2 );
    79
    80CREATE TRIGGER protected_titles_wmcsCommentOnInsert AFTER INSERT ON protected_titles
    81 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.pt_reason_id, 0 );
    82CREATE TRIGGER protected_titles_wmcsCommentOnUpdate AFTER UPDATE ON protected_titles
    83 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.pt_reason_id, 0, NEW.pt_reason_id, 0 );
    84CREATE TRIGGER protected_titles_wmcsCommentOnDelete AFTER DELETE ON protected_titles
    85 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.pt_reason_id, 0 );
    86
    87CREATE TRIGGER recentchanges_wmcsCommentOnInsert AFTER INSERT ON recentchanges
    88 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.rc_comment_id, NEW.rc_deleted&2 );
    89CREATE TRIGGER recentchanges_wmcsCommentOnUpdate AFTER UPDATE ON recentchanges
    90 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.rc_comment_id, OLD.rc_deleted&2, NEW.rc_comment_id, NEW.rc_deleted&2 );
    91CREATE TRIGGER recentchanges_wmcsCommentOnDelete AFTER DELETE ON recentchanges
    92 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.rc_comment_id, OLD.rc_deleted&2 );
    93
    94CREATE TRIGGER revision_comment_temp_wmcsCommentOnInsert AFTER INSERT ON revision_comment_temp
    95 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=NEW.revcomment_rev) );
    96CREATE TRIGGER revision_comment_temp_wmcsCommentOnUpdate AFTER UPDATE ON revision_comment_temp
    97 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=OLD.revcomment_rev), NEW.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=NEW.revcomment_rev) );
    98CREATE TRIGGER revision_comment_temp_wmcsCommentOnDelete AFTER DELETE ON revision_comment_temp
    99 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=OLD.revcomment_rev) );
    100CREATE TRIGGER revision_wmcsCommentOnInsert AFTER INSERT ON revision
    101 FOR EACH ROW CALL wmcsCommentOnInsert( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=NEW.rev_id), NEW.rev_deleted&2 );
    102CREATE TRIGGER revision_wmcsCommentOnUpdate AFTER UPDATE ON revision
    103 FOR EACH ROW CALL wmcsCommentOnUpdate( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=OLD.rev_id), OLD.rev_deleted&2, (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=NEW.rev_id), NEW.rev_deleted&2 );
    104CREATE TRIGGER revision_wmcsCommentOnDelete AFTER DELETE ON revision
    105 FOR EACH ROW CALL wmcsCommentOnDelete( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=OLD.rev_id), OLD.rev_deleted&2 );
    106
    107CREATE TRIGGER logging_wmcsCommentOnInsert AFTER INSERT ON logging
    108 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.log_comment_id, NOT ( NEW.log_deleted&2 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    109CREATE TRIGGER logging_wmcsCommentOnUpdate AFTER UPDATE ON logging
    110 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.log_comment_id, NOT ( OLD.log_deleted&2 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ), NEW.log_comment_id, NOT ( NEW.log_deleted&2 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    111CREATE TRIGGER logging_wmcsCommentOnDelete AFTER DELETE ON logging
    112 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.log_comment_id, NOT ( OLD.log_deleted&2 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ) );
    1delimiter //
    2CREATE FUNCTION wmcsActorShow(id BIGINT) RETURNS INT NOT DETERMINISTIC READS SQL DATA
    3BEGIN
    4 RETURN COALESCE(
    5 ( SELECT 1 FROM actor JOIN user ON(actor_user=user_id) WHERE actor_id = id LIMIT 1 ) OR
    6 ( SELECT 1 FROM archive WHERE ar_actor = id AND ar_deleted&4 = 0 LIMIT 1 ) OR
    7 ( SELECT 1 FROM ipblocks WHERE ipb_by_actor = id and ipb_deleted = 0 LIMIT 1 ) OR
    8 ( SELECT 1 FROM image WHERE img_actor = id LIMIT 1 ) OR
    9 ( SELECT 1 FROM oldimage WHERE oi_actor = id AND oi_deleted&4 = 0 LIMIT 1 ) OR
    10 ( SELECT 1 FROM filearchive WHERE fa_actor = id AND fa_deleted&4 = 0 LIMIT 1 ) OR
    11 ( SELECT 1 FROM recentchanges WHERE rc_actor = id AND rc_deleted&4 = 0 LIMIT 1 ) OR
    12 ( SELECT 1 FROM logging WHERE log_actor = id AND log_deleted&4 = 0 AND log_type IN ($LOG_TYPE_LIST$) LIMIT 1 ) OR
    13 ( SELECT 1 FROM revision JOIN revision_actor_temp ON(revactor_rev = rev_id) WHERE revactor_actor = id AND rev_deleted&4 = 0 LIMIT 1 )
    14 , 0);
    15END //
    16delimiter ;
    17
    18delimiter //
    19CREATE PROCEDURE wmcsActorOnInsert(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    20BEGIN
    21 IF NOT del THEN
    22 UPDATE actor SET wmcs_is_visible = 1 WHERE actor_id = id;
    23 END IF;
    24END //
    25delimiter ;
    26
    27delimiter //
    28CREATE PROCEDURE wmcsActorOnDelete(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    29BEGIN
    30 IF NOT del AND NOT wmcsActorShow(id) THEN
    31 UPDATE actor SET wmcs_is_visible = 0 WHERE actor_id = id;
    32 END IF;
    33END //
    34delimiter ;
    35
    36delimiter //
    37CREATE PROCEDURE wmcsActorOnUpdate(oldId BIGINT, oldDel TINYINT, newId BIGINT, newDel TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    38BEGIN
    39 IF oldId != newId OR (NOT oldDel) != (NOT newDel) THEN
    40 CALL wmcsActorOnDelete(oldId, oldDel);
    41 CALL wmcsActorOnInsert(newId, newDel);
    42 END IF;
    43END //
    44delimiter ;
    45
    46CREATE TRIGGER archive_wmcsActorOnInsert AFTER INSERT ON archive
    47 FOR EACH ROW CALL wmcsActorOnInsert( NEW.ar_actor, NEW.ar_deleted&4 );
    48CREATE TRIGGER archive_wmcsActorOnUpdate AFTER UPDATE ON archive
    49 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.ar_actor, OLD.ar_deleted&4, NEW.ar_actor, NEW.ar_deleted&4 );
    50CREATE TRIGGER archive_wmcsActorOnDelete AFTER DELETE ON archive
    51 FOR EACH ROW CALL wmcsActorOnDelete( OLD.ar_actor, OLD.ar_deleted&4 );
    52
    53CREATE TRIGGER ipblocks_wmcsActorOnInsert AFTER INSERT ON ipblocks
    54 FOR EACH ROW CALL wmcsActorOnInsert( NEW.ipb_by_actor, NEW.ipb_deleted );
    55CREATE TRIGGER ipblocks_wmcsActorOnUpdate AFTER UPDATE ON ipblocks
    56 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.ipb_by_actor, OLD.ipb_deleted, NEW.ipb_by_actor, NEW.ipb_deleted );
    57CREATE TRIGGER ipblocks_wmcsActorOnDelete AFTER DELETE ON ipblocks
    58 FOR EACH ROW CALL wmcsActorOnDelete( OLD.ipb_by_actor, OLD.ipb_deleted );
    59
    60CREATE TRIGGER image_wmcsActorOnInsert AFTER INSERT ON image
    61 FOR EACH ROW CALL wmcsActorOnInsert( NEW.img_actor, 0 );
    62CREATE TRIGGER image_wmcsActorOnUpdate AFTER UPDATE ON image
    63 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.img_actor, 0, NEW.img_actor, 0 );
    64CREATE TRIGGER image_wmcsActorOnDelete AFTER DELETE ON image
    65 FOR EACH ROW CALL wmcsActorOnDelete( OLD.img_actor, 0 );
    66
    67CREATE TRIGGER oldimage_wmcsActorOnInsert AFTER INSERT ON oldimage
    68 FOR EACH ROW CALL wmcsActorOnInsert( NEW.oi_actor, NEW.oi_deleted&4 );
    69CREATE TRIGGER oldimage_wmcsActorOnUpdate AFTER UPDATE ON oldimage
    70 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.oi_actor, OLD.oi_deleted&4, NEW.oi_actor, NEW.oi_deleted&4 );
    71CREATE TRIGGER oldimage_wmcsActorOnDelete AFTER DELETE ON oldimage
    72 FOR EACH ROW CALL wmcsActorOnDelete( OLD.oi_actor, OLD.oi_deleted&4 );
    73
    74CREATE TRIGGER filearchive_wmcsActorOnInsert AFTER INSERT ON filearchive
    75 FOR EACH ROW CALL wmcsActorOnInsert( NEW.fa_actor, NEW.fa_deleted&4 );
    76CREATE TRIGGER filearchive_wmcsActorOnUpdate AFTER UPDATE ON filearchive
    77 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.fa_actor, OLD.fa_deleted&4, NEW.fa_actor, NEW.fa_deleted&4 );
    78CREATE TRIGGER filearchive_wmcsActorOnDelete AFTER DELETE ON filearchive
    79 FOR EACH ROW CALL wmcsActorOnDelete( OLD.fa_actor, OLD.fa_deleted&4 );
    80
    81CREATE TRIGGER recentchanges_wmcsActorOnInsert AFTER INSERT ON recentchanges
    82 FOR EACH ROW CALL wmcsActorOnInsert( NEW.rc_actor, NEW.rc_deleted&4 );
    83CREATE TRIGGER recentchanges_wmcsActorOnUpdate AFTER UPDATE ON recentchanges
    84 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.rc_actor, OLD.rc_deleted&4, NEW.rc_actor, NEW.rc_deleted&4 );
    85CREATE TRIGGER recentchanges_wmcsActorOnDelete AFTER DELETE ON recentchanges
    86 FOR EACH ROW CALL wmcsActorOnDelete( OLD.rc_actor, OLD.rc_deleted&4 );
    87
    88CREATE TRIGGER logging_wmcsActorOnInsert AFTER INSERT ON logging
    89 FOR EACH ROW CALL wmcsActorOnInsert( NEW.log_actor, NOT ( NEW.log_deleted&4 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    90CREATE TRIGGER logging_wmcsActorOnUpdate AFTER UPDATE ON logging
    91 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.log_actor, NOT ( OLD.log_deleted&4 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ), NEW.log_actor, NOT ( NEW.log_deleted&4 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    92CREATE TRIGGER logging_wmcsActorOnDelete AFTER DELETE ON logging
    93 FOR EACH ROW CALL wmcsActorOnDelete( OLD.log_actor, NOT ( OLD.log_deleted&4 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ) );
    94
    95CREATE TRIGGER revision_actor_temp_wmcsActorOnInsert AFTER INSERT ON revision_actor_temp
    96 FOR EACH ROW CALL wmcsActorOnInsert( NEW.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=NEW.revactor_rev) );
    97CREATE TRIGGER revision_actor_temp_wmcsActorOnUpdate AFTER UPDATE ON revision_actor_temp
    98 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=OLD.revactor_rev), NEW.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=NEW.revactor_rev) );
    99CREATE TRIGGER revision_actor_temp_wmcsActorOnDelete AFTER DELETE ON revision_actor_temp
    100 FOR EACH ROW CALL wmcsActorOnDelete( OLD.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=OLD.revactor_rev) );
    101CREATE TRIGGER revision_wmcsActorOnInsert AFTER INSERT ON revision
    102 FOR EACH ROW CALL wmcsActorOnInsert( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=NEW.rev_id), NEW.rev_deleted&4 );
    103CREATE TRIGGER revision_wmcsActorOnUpdate AFTER UPDATE ON revision
    104 FOR EACH ROW CALL wmcsActorOnUpdate( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=OLD.rev_id), OLD.rev_deleted&4, (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=NEW.rev_id), NEW.rev_deleted&4 );
    105CREATE TRIGGER revision_wmcsActorOnDelete AFTER DELETE ON revision
    106 FOR EACH ROW CALL wmcsActorOnDelete( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=OLD.rev_id), OLD.rev_deleted&4 );
  1. Populate wmcs_is_visible for existing rows. That might be as simple as UPDATE comment SET wmcs_is_visible = wmcsCommentShow( comment_id ) and the same for actor, or we could unroll it with a bunch of queries like UPDATE comment JOIN recentchanges ON(rc_comment_id = commend_id) SET wmcs_is_visible = 1 WHERE wmcs_is_visible = 0 AND rc_deleted&2 = 0, and/or we could run it through in batches.
  1. Update the views, instead of all the subqueries they can be just WHERE wmcs_is_visible.

Open questions:

  • Does the replication actually work like the docs say it does?
  • Are these triggers fast enough?
  • Are these triggers straightforward enough to be maintained?
  • The triggers seem to work for me in some quick local testing (MariaDB 10.3.12-MariaDB-2). Do they work at scale with all the corner cases MediaWiki might throw at them?

To get it all in one place, here's that proposal again. It relies on the fact that https://dev.mysql.com/doc/refman/5.5/en/replication-features-differing-tables.html says the slave can have extra columns as long as they come after all the normal columns and have a default value (and that we're not using the "different data types" thing also described at that page.

  1. Add an extra column to the actor and comment tables on the WMCS copies of the tables (on sanitarium?), something like
ALTER TABLE actor ADD COLUMN wmcs_is_visible TINYINT NOT NULL DEFAULT 0;
ALTER TABLE comment ADD COLUMN wmcs_is_visible TINYINT NOT NULL DEFAULT 0;

We'll also want to create copies of all the indexes on actor and comment with wmcs_is_visible prefixed.

  1. Create a bunch of functions and triggers. Note $LOG_TYPE_LIST$ is supposed to be replaced with the list of types from allowed_logtypes in maintain-views.yaml.
    1delimiter //
    2CREATE FUNCTION wmcsCommentShow(id BIGINT) RETURNS INT NOT DETERMINISTIC READS SQL DATA
    3BEGIN
    4 RETURN COALESCE(
    5 ( SELECT 1 FROM image WHERE img_description_id = id LIMIT 1 ) OR
    6 ( SELECT 1 FROM filearchive WHERE fa_deleted_reason_id = id LIMIT 1 ) OR
    7 ( SELECT 1 FROM filearchive WHERE fa_description_id = id AND fa_deleted&2 = 0 LIMIT 1 ) OR
    8 ( SELECT 1 FROM ipblocks WHERE ipb_reason_id = id and ipb_deleted = 0 LIMIT 1 ) OR
    9 ( SELECT 1 FROM oldimage WHERE oi_description_id = id AND oi_deleted&2 = 0 LIMIT 1 ) OR
    10 ( SELECT 1 FROM protected_titles WHERE pt_reason_id = id LIMIT 1 ) OR
    11 ( SELECT 1 FROM recentchanges WHERE rc_comment_id = id AND rc_deleted&2 = 0 LIMIT 1 ) OR
    12 ( SELECT 1 FROM revision JOIN revision_comment_temp ON(revcomment_rev = rev_id) WHERE revcomment_comment_id = id AND rev_deleted&2 = 0 LIMIT 1 ) OR
    13 ( SELECT 1 FROM logging WHERE log_comment_id = id AND log_deleted&2 = 0 AND log_type IN ($LOG_TYPE_LIST$) LIMIT 1 )
    14 , 0);
    15END //
    16delimiter ;
    17
    18delimiter //
    19CREATE PROCEDURE wmcsCommentOnInsert(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    20BEGIN
    21 IF NOT del THEN
    22 UPDATE comment SET wmcs_is_visible = 1 WHERE comment_id = id;
    23 END IF;
    24END //
    25delimiter ;
    26
    27delimiter //
    28CREATE PROCEDURE wmcsCommentOnDelete(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    29BEGIN
    30 IF NOT del AND NOT wmcsCommentShow(id) THEN
    31 UPDATE comment SET wmcs_is_visible = 0 WHERE comment_id = id;
    32 END IF;
    33END //
    34delimiter ;
    35
    36delimiter //
    37CREATE PROCEDURE wmcsCommentOnUpdate(oldId BIGINT, oldDel TINYINT, newId BIGINT, newDel TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    38BEGIN
    39 IF oldId != newId OR (NOT oldDel) != (NOT newDel) THEN
    40 CALL wmcsCommentOnDelete(oldId, oldDel);
    41 CALL wmcsCommentOnInsert(newId, newDel);
    42 END IF;
    43END //
    44delimiter ;
    45
    46CREATE TRIGGER image_wmcsCommentOnInsert AFTER INSERT ON image
    47 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.img_description_id, 0 );
    48CREATE TRIGGER image_wmcsCommentOnUpdate AFTER UPDATE ON image
    49 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.img_description_id, 0, NEW.img_description_id, 0 );
    50CREATE TRIGGER image_wmcsCommentOnDelete AFTER DELETE ON image
    51 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.img_description_id, 0 );
    52
    53CREATE TRIGGER filearchive_wmcsCommentOnInsert_reason AFTER INSERT ON filearchive
    54 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.fa_deleted_reason_id, 0 );
    55CREATE TRIGGER filearchive_wmcsCommentOnUpdate_reason AFTER UPDATE ON filearchive
    56 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.fa_deleted_reason_id, 0, NEW.fa_deleted_reason_id, 0 );
    57CREATE TRIGGER filearchive_wmcsCommentOnDelete_reason AFTER DELETE ON filearchive
    58 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.fa_deleted_reason_id, 0 );
    59CREATE TRIGGER filearchive_wmcsCommentOnInsert_description AFTER INSERT ON filearchive
    60 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.fa_description_id, NEW.fa_deleted&2 );
    61CREATE TRIGGER filearchive_wmcsCommentOnUpdate_description AFTER UPDATE ON filearchive
    62 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.fa_description_id, OLD.fa_deleted&2, NEW.fa_description_id, NEW.fa_deleted&2 );
    63CREATE TRIGGER filearchive_wmcsCommentOnDelete_description AFTER DELETE ON filearchive
    64 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.fa_description_id, OLD.fa_deleted&2 );
    65
    66CREATE TRIGGER ipblocks_wmcsCommentOnInsert AFTER INSERT ON ipblocks
    67 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.ipb_reason_id, NEW.ipb_deleted );
    68CREATE TRIGGER ipblocks_wmcsCommentOnUpdate AFTER UPDATE ON ipblocks
    69 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.ipb_reason_id, OLD.ipb_deleted, NEW.ipb_reason_id, NEW.ipb_deleted );
    70CREATE TRIGGER ipblocks_wmcsCommentOnDelete AFTER DELETE ON ipblocks
    71 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.ipb_reason_id, OLD.ipb_deleted );
    72
    73CREATE TRIGGER oldimage_wmcsCommentOnInsert AFTER INSERT ON oldimage
    74 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.oi_description_id, NEW.oi_deleted&2 );
    75CREATE TRIGGER oldimage_wmcsCommentOnUpdate AFTER UPDATE ON oldimage
    76 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.oi_description_id, OLD.oi_deleted&2, NEW.oi_description_id, NEW.oi_deleted&2 );
    77CREATE TRIGGER oldimage_wmcsCommentOnDelete AFTER DELETE ON oldimage
    78 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.oi_description_id, OLD.oi_deleted&2 );
    79
    80CREATE TRIGGER protected_titles_wmcsCommentOnInsert AFTER INSERT ON protected_titles
    81 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.pt_reason_id, 0 );
    82CREATE TRIGGER protected_titles_wmcsCommentOnUpdate AFTER UPDATE ON protected_titles
    83 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.pt_reason_id, 0, NEW.pt_reason_id, 0 );
    84CREATE TRIGGER protected_titles_wmcsCommentOnDelete AFTER DELETE ON protected_titles
    85 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.pt_reason_id, 0 );
    86
    87CREATE TRIGGER recentchanges_wmcsCommentOnInsert AFTER INSERT ON recentchanges
    88 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.rc_comment_id, NEW.rc_deleted&2 );
    89CREATE TRIGGER recentchanges_wmcsCommentOnUpdate AFTER UPDATE ON recentchanges
    90 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.rc_comment_id, OLD.rc_deleted&2, NEW.rc_comment_id, NEW.rc_deleted&2 );
    91CREATE TRIGGER recentchanges_wmcsCommentOnDelete AFTER DELETE ON recentchanges
    92 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.rc_comment_id, OLD.rc_deleted&2 );
    93
    94CREATE TRIGGER revision_comment_temp_wmcsCommentOnInsert AFTER INSERT ON revision_comment_temp
    95 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=NEW.revcomment_rev) );
    96CREATE TRIGGER revision_comment_temp_wmcsCommentOnUpdate AFTER UPDATE ON revision_comment_temp
    97 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=OLD.revcomment_rev), NEW.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=NEW.revcomment_rev) );
    98CREATE TRIGGER revision_comment_temp_wmcsCommentOnDelete AFTER DELETE ON revision_comment_temp
    99 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.revcomment_comment_id, (SELECT rev_deleted&2 FROM revision WHERE rev_id=OLD.revcomment_rev) );
    100CREATE TRIGGER revision_wmcsCommentOnInsert AFTER INSERT ON revision
    101 FOR EACH ROW CALL wmcsCommentOnInsert( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=NEW.rev_id), NEW.rev_deleted&2 );
    102CREATE TRIGGER revision_wmcsCommentOnUpdate AFTER UPDATE ON revision
    103 FOR EACH ROW CALL wmcsCommentOnUpdate( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=OLD.rev_id), OLD.rev_deleted&2, (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=NEW.rev_id), NEW.rev_deleted&2 );
    104CREATE TRIGGER revision_wmcsCommentOnDelete AFTER DELETE ON revision
    105 FOR EACH ROW CALL wmcsCommentOnDelete( (SELECT revcomment_comment_id FROM revision_comment_temp WHERE revcomment_rev=OLD.rev_id), OLD.rev_deleted&2 );
    106
    107CREATE TRIGGER logging_wmcsCommentOnInsert AFTER INSERT ON logging
    108 FOR EACH ROW CALL wmcsCommentOnInsert( NEW.log_comment_id, NOT ( NEW.log_deleted&2 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    109CREATE TRIGGER logging_wmcsCommentOnUpdate AFTER UPDATE ON logging
    110 FOR EACH ROW CALL wmcsCommentOnUpdate( OLD.log_comment_id, NOT ( OLD.log_deleted&2 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ), NEW.log_comment_id, NOT ( NEW.log_deleted&2 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    111CREATE TRIGGER logging_wmcsCommentOnDelete AFTER DELETE ON logging
    112 FOR EACH ROW CALL wmcsCommentOnDelete( OLD.log_comment_id, NOT ( OLD.log_deleted&2 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ) );
    1delimiter //
    2CREATE FUNCTION wmcsActorShow(id BIGINT) RETURNS INT NOT DETERMINISTIC READS SQL DATA
    3BEGIN
    4 RETURN COALESCE(
    5 ( SELECT 1 FROM actor JOIN user ON(actor_user=user_id) WHERE actor_id = id LIMIT 1 ) OR
    6 ( SELECT 1 FROM archive WHERE ar_actor = id AND ar_deleted&4 = 0 LIMIT 1 ) OR
    7 ( SELECT 1 FROM ipblocks WHERE ipb_by_actor = id and ipb_deleted = 0 LIMIT 1 ) OR
    8 ( SELECT 1 FROM image WHERE img_actor = id LIMIT 1 ) OR
    9 ( SELECT 1 FROM oldimage WHERE oi_actor = id AND oi_deleted&4 = 0 LIMIT 1 ) OR
    10 ( SELECT 1 FROM filearchive WHERE fa_actor = id AND fa_deleted&4 = 0 LIMIT 1 ) OR
    11 ( SELECT 1 FROM recentchanges WHERE rc_actor = id AND rc_deleted&4 = 0 LIMIT 1 ) OR
    12 ( SELECT 1 FROM logging WHERE log_actor = id AND log_deleted&4 = 0 AND log_type IN ($LOG_TYPE_LIST$) LIMIT 1 ) OR
    13 ( SELECT 1 FROM revision JOIN revision_actor_temp ON(revactor_rev = rev_id) WHERE revactor_actor = id AND rev_deleted&4 = 0 LIMIT 1 )
    14 , 0);
    15END //
    16delimiter ;
    17
    18delimiter //
    19CREATE PROCEDURE wmcsActorOnInsert(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    20BEGIN
    21 IF NOT del THEN
    22 UPDATE actor SET wmcs_is_visible = 1 WHERE actor_id = id;
    23 END IF;
    24END //
    25delimiter ;
    26
    27delimiter //
    28CREATE PROCEDURE wmcsActorOnDelete(id BIGINT, del TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    29BEGIN
    30 IF NOT del AND NOT wmcsActorShow(id) THEN
    31 UPDATE actor SET wmcs_is_visible = 0 WHERE actor_id = id;
    32 END IF;
    33END //
    34delimiter ;
    35
    36delimiter //
    37CREATE PROCEDURE wmcsActorOnUpdate(oldId BIGINT, oldDel TINYINT, newId BIGINT, newDel TINYINT) NOT DETERMINISTIC MODIFIES SQL DATA
    38BEGIN
    39 IF oldId != newId OR (NOT oldDel) != (NOT newDel) THEN
    40 CALL wmcsActorOnDelete(oldId, oldDel);
    41 CALL wmcsActorOnInsert(newId, newDel);
    42 END IF;
    43END //
    44delimiter ;
    45
    46CREATE TRIGGER archive_wmcsActorOnInsert AFTER INSERT ON archive
    47 FOR EACH ROW CALL wmcsActorOnInsert( NEW.ar_actor, NEW.ar_deleted&4 );
    48CREATE TRIGGER archive_wmcsActorOnUpdate AFTER UPDATE ON archive
    49 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.ar_actor, OLD.ar_deleted&4, NEW.ar_actor, NEW.ar_deleted&4 );
    50CREATE TRIGGER archive_wmcsActorOnDelete AFTER DELETE ON archive
    51 FOR EACH ROW CALL wmcsActorOnDelete( OLD.ar_actor, OLD.ar_deleted&4 );
    52
    53CREATE TRIGGER ipblocks_wmcsActorOnInsert AFTER INSERT ON ipblocks
    54 FOR EACH ROW CALL wmcsActorOnInsert( NEW.ipb_by_actor, NEW.ipb_deleted );
    55CREATE TRIGGER ipblocks_wmcsActorOnUpdate AFTER UPDATE ON ipblocks
    56 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.ipb_by_actor, OLD.ipb_deleted, NEW.ipb_by_actor, NEW.ipb_deleted );
    57CREATE TRIGGER ipblocks_wmcsActorOnDelete AFTER DELETE ON ipblocks
    58 FOR EACH ROW CALL wmcsActorOnDelete( OLD.ipb_by_actor, OLD.ipb_deleted );
    59
    60CREATE TRIGGER image_wmcsActorOnInsert AFTER INSERT ON image
    61 FOR EACH ROW CALL wmcsActorOnInsert( NEW.img_actor, 0 );
    62CREATE TRIGGER image_wmcsActorOnUpdate AFTER UPDATE ON image
    63 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.img_actor, 0, NEW.img_actor, 0 );
    64CREATE TRIGGER image_wmcsActorOnDelete AFTER DELETE ON image
    65 FOR EACH ROW CALL wmcsActorOnDelete( OLD.img_actor, 0 );
    66
    67CREATE TRIGGER oldimage_wmcsActorOnInsert AFTER INSERT ON oldimage
    68 FOR EACH ROW CALL wmcsActorOnInsert( NEW.oi_actor, NEW.oi_deleted&4 );
    69CREATE TRIGGER oldimage_wmcsActorOnUpdate AFTER UPDATE ON oldimage
    70 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.oi_actor, OLD.oi_deleted&4, NEW.oi_actor, NEW.oi_deleted&4 );
    71CREATE TRIGGER oldimage_wmcsActorOnDelete AFTER DELETE ON oldimage
    72 FOR EACH ROW CALL wmcsActorOnDelete( OLD.oi_actor, OLD.oi_deleted&4 );
    73
    74CREATE TRIGGER filearchive_wmcsActorOnInsert AFTER INSERT ON filearchive
    75 FOR EACH ROW CALL wmcsActorOnInsert( NEW.fa_actor, NEW.fa_deleted&4 );
    76CREATE TRIGGER filearchive_wmcsActorOnUpdate AFTER UPDATE ON filearchive
    77 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.fa_actor, OLD.fa_deleted&4, NEW.fa_actor, NEW.fa_deleted&4 );
    78CREATE TRIGGER filearchive_wmcsActorOnDelete AFTER DELETE ON filearchive
    79 FOR EACH ROW CALL wmcsActorOnDelete( OLD.fa_actor, OLD.fa_deleted&4 );
    80
    81CREATE TRIGGER recentchanges_wmcsActorOnInsert AFTER INSERT ON recentchanges
    82 FOR EACH ROW CALL wmcsActorOnInsert( NEW.rc_actor, NEW.rc_deleted&4 );
    83CREATE TRIGGER recentchanges_wmcsActorOnUpdate AFTER UPDATE ON recentchanges
    84 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.rc_actor, OLD.rc_deleted&4, NEW.rc_actor, NEW.rc_deleted&4 );
    85CREATE TRIGGER recentchanges_wmcsActorOnDelete AFTER DELETE ON recentchanges
    86 FOR EACH ROW CALL wmcsActorOnDelete( OLD.rc_actor, OLD.rc_deleted&4 );
    87
    88CREATE TRIGGER logging_wmcsActorOnInsert AFTER INSERT ON logging
    89 FOR EACH ROW CALL wmcsActorOnInsert( NEW.log_actor, NOT ( NEW.log_deleted&4 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    90CREATE TRIGGER logging_wmcsActorOnUpdate AFTER UPDATE ON logging
    91 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.log_actor, NOT ( OLD.log_deleted&4 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ), NEW.log_actor, NOT ( NEW.log_deleted&4 = 0 AND NEW.log_type IN ($LOG_TYPE_LIST$) ) );
    92CREATE TRIGGER logging_wmcsActorOnDelete AFTER DELETE ON logging
    93 FOR EACH ROW CALL wmcsActorOnDelete( OLD.log_actor, NOT ( OLD.log_deleted&4 = 0 AND OLD.log_type IN ($LOG_TYPE_LIST$) ) );
    94
    95CREATE TRIGGER revision_actor_temp_wmcsActorOnInsert AFTER INSERT ON revision_actor_temp
    96 FOR EACH ROW CALL wmcsActorOnInsert( NEW.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=NEW.revactor_rev) );
    97CREATE TRIGGER revision_actor_temp_wmcsActorOnUpdate AFTER UPDATE ON revision_actor_temp
    98 FOR EACH ROW CALL wmcsActorOnUpdate( OLD.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=OLD.revactor_rev), NEW.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=NEW.revactor_rev) );
    99CREATE TRIGGER revision_actor_temp_wmcsActorOnDelete AFTER DELETE ON revision_actor_temp
    100 FOR EACH ROW CALL wmcsActorOnDelete( OLD.revactor_actor, (SELECT rev_deleted&4 FROM revision WHERE rev_id=OLD.revactor_rev) );
    101CREATE TRIGGER revision_wmcsActorOnInsert AFTER INSERT ON revision
    102 FOR EACH ROW CALL wmcsActorOnInsert( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=NEW.rev_id), NEW.rev_deleted&4 );
    103CREATE TRIGGER revision_wmcsActorOnUpdate AFTER UPDATE ON revision
    104 FOR EACH ROW CALL wmcsActorOnUpdate( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=OLD.rev_id), OLD.rev_deleted&4, (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=NEW.rev_id), NEW.rev_deleted&4 );
    105CREATE TRIGGER revision_wmcsActorOnDelete AFTER DELETE ON revision
    106 FOR EACH ROW CALL wmcsActorOnDelete( (SELECT revactor_actor FROM revision_actor_temp WHERE revactor_rev=OLD.rev_id), OLD.rev_deleted&4 );
  1. Populate wmcs_is_visible for existing rows. That might be as simple as UPDATE comment SET wmcs_is_visible = wmcsCommentShow( comment_id ) and the same for actor, or we could unroll it with a bunch of queries like UPDATE comment JOIN recentchanges ON(rc_comment_id = commend_id) SET wmcs_is_visible = 1 WHERE wmcs_is_visible = 0 AND rc_deleted&2 = 0, and/or we could run it through in batches.
  1. Update the views, instead of all the subqueries they can be just WHERE wmcs_is_visible.

Open questions:

  • Does the replication actually work like the docs say it does?

Not sure, we have seen replication breakages on sanitarium hosts (row based replication) when doing schema changes that add/drop columns on its master.
And we are running ALL_NON_LOSSY on sanitariums and labs

  • Are these triggers fast enough?

We haven't seen contention with the current triggers, but they are far simplier and smaller. I am specially worried for s3, and the massive amount of wikis it has.
Also, not sure if they should go on sanitarium or on the replicas, I don't want to mess up with more triggers on the sanitarium hosts as they are, as of today, a key piece on the sanitization process.

  • Are these triggers straightforward enough to be maintained?

They are complex for sure. If in the end we want to go for this solution, I think those should be maintained by WMCS. We do maintain the sanitization ones that live on sanitariums.

Further, I am not sure if they should be placed on the sanitarium hosts or on the replicas themselves (again - the question about the replication breakage we've seen before is still to be answered). I could try to grab one of the new hosts to do some testing (T211613)

  • The triggers seem to work for me in some quick local testing (MariaDB 10.3.12-MariaDB-2). Do they work at scale with all the corner cases MediaWiki might throw at them?

We are using 10.1, it would need testing - I can set up a host from (T211613) and test them there.

Does the replication actually work like the docs say it does?

Yes it does, it works in STATEMENT of course, as as long as the INSERT/UPDATE/DELETE can work on both hosts, it will go through. Those hosts are using ROW, which means in theory it should work because ROW works as long as it is a final column (I tested it on db1111->db1112). Normally triggers would not run on ROW, but we have https://mariadb.com/kb/en/library/running-triggers-on-the-slave-for-row-based-events/

However, this can lead to issues- tables on saniutarium/labs replicas are either the exact same tables, or do not exist and are filtered out. However, existing but having a different structure could be dangerous (one does not only insert, updates or deletes could break replication -for maintenance, even if normally they would not run on production, alters would be more dangerous and all sort of problems. I would be more confident if we had the same structure on production, even if it had different values.

There is also a performance issue, you want to have a view with WHERE wmcs_is_visible, but that would assume people want to just select the whole table, when only a single row may be needed, leading to potentially problematic query plans.

There is a lot of potential issues with that proposal, even if the basic premise works.

@Anomie: would it be much worse to implement this in mediawiki directly? Maybe a redacted column that starts out false and becomes true if rev_delete, log_delete, etc. are set.

I didn't know that, thanks for pointing it out.

I would be more confident if we had the same structure on production, even if it had different values.

The issue there is that production has no need of this, since all accesses to the tables are mediated by code that is already doing the joins and checks.

We could put these fields into tables.sql, but for every wiki anywhere except these replicas the field would be useless and ignored. Or you could add it just for Wikimedia production, so only we would have the useless field, and deal with the schema difference from tables.sql.

There is also a performance issue, you want to have a view with WHERE wmcs_is_visible, but that would assume people want to just select the whole table, when only a single row may be needed, leading to potentially problematic query plans.

We already have WHERE clauses in several of the views:

  • abuse_filter_log has afl_deleted=0
  • archive_userindex has (ar_deleted&4)=0
  • content has exists( select 1 from slots LEFT JOIN archive ON(ar_rev_id = slot_revision_id) LEFT JOIN revision ON(rev_id = slot_revision_id) WHERE content_id = slot_content_id AND (ar_deleted&1=0 OR rev_deleted&1=0))
  • filearchive_userindex has (fa_deleted&4)=0
  • global_preferences has gp_property in ( 'disablemail', 'fancysig', 'gender', 'nickname' )
  • globaluser has gu_hidden=''
  • ip_changes has a join with revision and ipc_rev_id = rev_id AND (rev_deleted & 4) = 0
  • ipblocks, ipblocks_compat, and ipblocks_userindex all have ipb_deleted=0
  • logging, logging_compat, and logging_logindex have log_type IN (... 50 strings...)
  • logging_userindex has (log_deleted&4)=0 and log_type IN (... 50 strings...)
  • oldimage_userindex has (oi_deleted&4)=0
  • recentchanges_userindex has (rc_deleted&4)=0
  • revision_userindex has (rev_deleted&4)=0
  • revision_actor_temp has a join with revision and revactor_rev = rev_id AND (rev_deleted & 4) = 0
  • user_properties has up_property in ( 'disablemail', 'fancysig', 'gender', 'nickname' )
  • user_properties_anon has a join with user and user_id=up_user and up_property like pw_property

The idea here is to bring actor and comment more in line with the above, versus their current 9 ORed subqueries.

@Anomie: would it be much worse to implement this in mediawiki directly? Maybe a redacted column that starts out false and becomes true if rev_delete, log_delete, etc. are set.

We don't seem to use triggers in MediaWiki itself. Updating it from PHP would be possible, but potential issues include:

  • Keeping MediaWiki's idea of what is "redacted" in sync with ToolForge's idea.
  • Finding all the code that does relevant database changes to properly update the flag. Including extensions and maintenance scripts, and any other corner cases like expired rows being deleted.
  • The fact that MediaWiki itself has no use case for this. So any bugs would only be noticed in the WMCS replicas.
    • Further, MediaWiki would have no use other than this for the extra indexes needed on various xx_comment_id columns.

We already have WHERE clauses in several of the views:

Yes, and they already have bad performance 0:-). I wasn't listing that as a reason we absolutely cannot do that, just points that are discouraging (I belive users like @Milimetric were affected by that issue. Different structure is my top concern. As a DBA, I give advice but offload the final decision to mediawiki core team and/or cloud. I "reserve" the veto powers only if it breaks replication.

I do veto the use of triggers on production, however, ATM, for many issues - from data consistency, to compatibility, to restricting the usage of online schema change tools.

I agree with anomie that labsdb needs should not influence production code decisions.

Yes, and they already have bad performance 0:-). I wasn't listing that as a reason we absolutely cannot do that, just points that are discouraging (I belive users like @Milimetric were affected by that issue. Different structure is my top concern. As a DBA, I give advice but offload the final decision to mediawiki core team and/or cloud. I "reserve" the veto powers only if it breaks replication.

On where clauses, from the perspective of our users, with an appropriate index (typically), adding a where clause is thought of as an improvement to performance over the ifs, cases and joins in the views by some users (thus the _userindex views' reason for being). The performance of the subquery-heavy tables is bad enough that just about any method is better.

The bare views are likely the only ones that achieve anything like prod performance, but we need better than we have on the comment and actor ones.

I will also add that replication lag is something I think people can accept if it is consistent lag. If the replicas are predictably a certain number of secs/min behind current, for instance, my suspicion at the moment is that it would serve most use cases. I'm sure any significant lag (in seconds) is concerning to some users, but I'm trying to think of making things better for the largest number. That's ignoring possible churn created beyond what the replicas can handle, of course.

I should mention that another method we could use here is based on that T208690. It sounds like a lot of work, but it may ultimately be worth trying.

I should mention that another method we could use here is based on that T208690. It sounds like a lot of work, but it may ultimately be worth trying.

As written, I don't think revision_commentindex as requested in T208690: Create revision_commentindex on Wiki Replicas in Cloud VPS would actually help this task. We'd need revision_compat_commentindex, or maybe "comment_revisionindex" that only had the revision-table subquery instead of all 9, to make a difference here.

Yup, that's what I mean. I'd have to have a separate view for each one of those cases. It sounds horrible and might create other problems.

As comment view is inherently with bad performance, I don't think having separate view for each of the cases is that horrible.

Pretty soon I'll be sending an email to wikitech-l and the cloud list telling people that, if they're not already using the compat views, they'll want to look into changing their code to start using actor columns rather than user id/name columns (or to start using the compat views). I'll be pointing to this task as the place for planning to make actor not be so slow.

Is there anything people here need from me towards firming up that plan before I send that email?

Sorry I didn't answer before, was pinged through SOS, I personally do not have further comment or huge issues with the proposed plan (I mentioned before disadvantages which you are aware, but there is certainly no perfect solution, so it is the best we can do at the moment).

As a side note, I got some questions about the new structure (MCR + actor + comment) on labs recently, I may prepare some slides/presentation for increased awareness of the mw upcoming underlying changes and would be glad if you could review them (cannot promise however, when).

I'd be happy to review the slides. Send me an email with the information whenever you're ready.

As for this task, the main open question I see is which of these would we do?

  1. Add the "is_visible" fields to actor and comment in tables.sql, alter Wikimedia production to match, and add triggers somewhere in the WMCS replica pipeline to populate the fields.
    1. Pro: No schema difference between tables.sql and production.
    2. Pro: Little added risk of replication failure at the production/WMCS boundary.
    3. Con: Unused columns present (and taking disk space) in production and on all third-party wikis, potentially confusing or annoying developers with its unused-ness.
  2. Add the "is_visible" fields to actor and comment in Wikimedia production, and add triggers somewhere in the WMCS replica pipeline to populate them.
    1. Pro: No unused columns on third-party wikis, and no developer confusion/annoyance following from that.
    2. Pro: Little added risk of replication failure at the production/WMCS boundary.
    3. Con: This would be a schema difference between tables.sql and Wikimedia production, getting in the way of T104459 and T57455.
    4. Con: Unused columns present (and taking disk space) in production (but likely not much developer confusion, and DBAs would probably the only ones who might be annoyed).
  3. Add the "is_visible" fields to actor and comment in the WMCS replica pipeline (probably at the same place as the triggers, but not necessarily).
    1. Pro: No unused columns in production or on third-party wikis, and no developer confusion/annoyance following from that.
    2. Pro: No schema difference between tables.sql and production.
    3. Con: Increased risk of replication failure at the production/WMCS boundary.

Which is the least bad? I don't know enough to choose between 2 and 3, or just how bad they might be compared to 1. If you think 1 is the least bad, we should probably run it by TechCom too.

If we settle on option 1, I can write the patch to make it happen and you (DBAs and WMCS) can take things from there once it's merged. If we settle on 2 or 3, it'd be up to you (DBAs and WMCS) to decide when and how to move forward on it.

I would prefer option 1) simply because we would still be consistent between tables.sql, production and wikireplicas.

I would prefer option 3, I do not like triggers, but if we are going to have them, let's have them on the only place there are already there (sanitarium). Also, it is the safest because if we forget to deploy the changes, the view creation will fail and we will notice.

I would prefer option 3, I do not like triggers, but if we are going to have them, let's have them on the only place there are already there (sanitarium). Also, it is the safest because if we forget to deploy the changes, the view creation will fail and we will notice.

I would understand that option 1 would include the triggers on the sanitarium hosts, no? Maybe I am not getting it right?

would include the triggers ?

Yes, but if the views reference the new columns and the changes are not deployed in 3, they will fail. In the others, they will silently allow the wrong permissions to the hosts because columns are created automatically (even without the triggers). I prefer the fail options for privacy reasons.

I would prefer no triggers or extra columns, but if we are going to have them, I would like to break only cloud, and break them fully. After all, schema changes tend to be run with replication from sanitariums.

This is assuming it will all work.

Basically, I don't like any of the 3, but that one is the one that doesn't touch production, plus we already have the infrastructure in place to maintain it:

  • Adding a columns and triggers to:

puppet: modules/role/files/mariadb/redact_sanitarium.sh

  • Adding extra checks to:

puppet: modules/role/files/mariadb/check_private_data.py

I have caught up with Jaime on IRC to try to get all the context and to make sure we were on the same page. I would like to give everyone more context.
This goes back on how we create new wikis.
The usual procedure is (https://wikitech.wikimedia.org/wiki/Add_a_wiki):

The concern here is that the above process is manual and requires human intervention, and thus, it can fail.
The failure could be that someone misses a step on https://wikitech.wikimedia.org/wiki/Add_a_wiki:

  1. someone creates a wiki without telling the DBAs
  2. Cloud team gets a ticket to place the views on the new wiki, and as the sanitization hasn't been done, the new columns (and the whole wiki) would be exposed.

Just to be clear, this is a _current_ problem, so it won't be introduced by option #1 specifically.

Option #3 minimizes the risk of exposing the two new columns as those would need to be created specifically on the sanitarium hosts (most likely), however, neither option #3 (nor option #1) fixes the current issue with exposing new wikis (not only for these tables) once they get created.

cloud-services-team are the views created only manually or you have cronjobs or procedures that run the drop+creation somewhere?
In option 1) this would mean that there is a potential race condition on where new columns are exposed before they are sanitized

The concern here is that the above process is manual and requires human intervention, and thus, it can fail.
The failure could be that someone misses a step on https://wikitech.wikimedia.org/wiki/Add_a_wiki:

  1. someone creates a wiki without telling the DBAs
  2. Cloud team gets a ticket to place the views on the new wiki, and as the sanitization hasn't been done, the new columns (and the whole wiki) would be exposed.

Is it naive of me to think this could be solved separately? How about a list of wikis that's maintained by the DBAs and used by the Cloud team to create views? It would involve one more manual step but it seems like it would calm a lot of fears of race conditions and free up the decision here from unnecessary entanglements.

cloud-services-team are the views created only manually or you have cronjobs or procedures that run the drop+creation somewhere?
In option 1) this would mean that there is a potential race condition on where new columns are exposed before they are sanitized

Apologies for missing this. I got tied up in other things. We only run view creation (and custom index creation) and/or changes manually. I also like the potential added benefit of #3 that it would make the view creation run fail on something that shouldn't be created yet.

There is currently an edge case where you can run against "all-databases" in the maintain-views script after a new database is created/replicating but isn't handed-off yet (if it shows up in the dblists). Normally, nobody runs against all-databases except to do a major change to the views as a whole--never to add a wiki, which limits that, but the case exists. However, it would also be missing the DNS relied on for access and missing the meta_p record used to notify tools that a new database is present.

A MariaDB bug currently causes it to fail without a specific grant and/or manual db creation being added, so the script currently dies anyway instead of creating the views if this was done by mistake. So there's not much likelihood of that all happening right now, but an additional barrier to such a mistake outside of the usual manual care and software bugs sounds like a great side effect.

I'll be pointing to this task as the place for planning to make actor not be so slow.

Thank you. Quick question: are we currently expected to be using the actor table for anything at the moment? Is it expected to complete a query like select * from actor where actor_name = 'Nemo bis'; in less than a minute on enwiki? If so, should specific performance reports or requests (such as adding an index to this or that column) be made here?

I'll be pointing to this task as the place for planning to make actor not be so slow.

Thank you. Quick question: are we currently expected to be using the actor table for anything at the moment? Is it expected to complete a query like select * from actor where actor_name = 'Nemo bis'; in less than a minute on enwiki? If so, should specific performance reports or requests (such as adding an index to this or that column) be made here?

The problem with queries against the actor 'table' on the Wiki Replicas is that the view that is exposed to users must join against 8 tables to ensure that no rows which are suppressed are shown. There is more information on this and alternate views which have been created to only join on one table at a time (which makes using them much faster) is documented at https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#The_actor_table_seems_really_slow--so_does_comment although we really should find a better place to put that documentation on wikitech.

Thank you. I understand that issue but I still don't understand what's the recommended way to perform simple queries over what used to be the names and comments.

Right now what I'm doing for enwiki is: I find the usernames I'm interested in; I check their contributions on the web, take the rev_id of the last; query the revision table for the rev_actor associated to that rev_id; hardcode the actor ID in subsequent queries. Same for the comments: I first make a list of relevant comments, or at least identify a range of comment_id to narrow down the query, and then hardcode that in my queries against the revision tables. Joining any variant of the revision table with any variant of the actor or comment table takes hours for any kind of query I've tried so far (not so surprisingly perhaps, given the comment table alone is over 300M rows on the enwiki replica).

I understand that the queries can be made a bit faster if I use some variants of the actor/comment tables where I'm sure the comment or username I'm looking for is present, for instance:

MariaDB [enwiki_p]> select * from actor_recentchanges where actor_name = 'Nemo bis';
+----------+------------+------------+
| actor_id | actor_user | actor_name |
+----------+------------+------------+
|     5310 |    2584239 | Nemo bis   |
+----------+------------+------------+
1 row in set (0.06 sec)

(This query used to take more than a minute when I wrote, if performed against the actor table. Now it seems to take less than 0.1 seconds for any username I try. Go figure.)

However I've not yet managed to find suitable ways to join the tables and make some query against revisions and usernames/comments. I'll try again with more combinations in the future.

I've not yet managed to find suitable ways to join the tables and make some query against revisions and usernames/comments.

Could you give more details of what you are trying, with detailed filter conditions? It doesn't need to be correct sql, just a detailed explanation of a query with example results. Or an SQL query that is slow for you and maybe we can suggest alternatives or add additional features.

One comment I can suggest is that the main use of wikireplicas is being able to do queries that are large/strange/cannot be done on production. Querying single usernames, with its revisions and comments is something that is likely to be done through the API in a much faster way, due to not having to implement crude public filtering on db layer, plus the caching and extra resources available there. The other advantage of the API is that it is stable, unlike the db internal format which is always in constant optimization (specially to optimize for size) and increasing complexity. E.g.: Get latest 500 rc changes for user Nemo_bis, with comments.

https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=recentchanges&rcuser=Nemo%20bis&rcprop=title%7Cids%7Csizes%7Cflags%7Cuser%7Ccomment&rclimit=500

(this query is the same with the old and the same db structure)

But probably I am not understanding well your needs, sorry. You can in fact mix api queries and SQL ones relatively easy, too. Please note I am not telling you not to use the wikireplicas, just that catering performance to every possible usage is difficult because of the above, and because sometimes we don't know what those needs are for thousands of users. MusikAnimal had an usage problem, filed a task about it and we were mostly able to do something about it (I think), please do the same! :-D

Bstorm lowered the priority of this task from High to Medium.Feb 25 2020, 5:19 PM

Dropping priority because we have a workaround (the subviews)

I've not yet managed to find suitable ways to join the tables and make some query against revisions and usernames/comments.

Could you give more details of what you are trying, with detailed filter conditions?

Sure, thank you for the answer. I thought I had provided an example but maybe it was on another task. An example is https://quarry.wmflabs.org/query/42113 which I consider the most basic kind of query. The only way I managed to get a result in less than 30 min was:

SELECT rev_id
FROM revision_userindex
WHERE rev_actor = 177
AND rev_id > 850000000
AND rev_comment_id IN ( SELECT comment_id
                         FROM comment_revision
                         WHERE comment_id > 284004931
                         AND comment_text LIKE "Bluelink%" );

This is supposed to be a JOIN on three tables (revision, actor/actor_revision, comment/revision_comment), but I had instead to

  1. hardcode rev_actor because any join on the actor table was prohibitively slow;
  2. manually find the actor ID from the revision table (by searching for a specific rev_id), because even selecting a single row from the actor table with a WHERE condition proved unbearably slow (on some other day or some other wiki it appeared to work; might have been a temporary hiccup);
  3. manually restrict the range of rev_id, which appears to reduce query time (presumably by reducing the number of rows scanned);
  4. using an IN instead of a join because the JOIN did not manage to complete in time, while the subquery + WHERE sometimes does;
  5. hardcode a range of comment_id that I found manually;
  6. and of course, use the sub-tables comment_revision and friends (I don't know if revision_userindex was necessary or helpful here).

Right now that fails too, presumably because we're now at revision 940M and counting and the query cannot stand going through almost 100M rows. I do manage on the command line with revision_compat:

MariaDB [enwiki_p]> SELECT count(rev_id) FROM revision_compat WHERE rev_actor = 177 AND rev_id > 900000000 AND rev_comment LIKE "Bluelink%";

+---------------+
| count(rev_id) |
+---------------+
|        284123 |
+---------------+
1 row in set (2 hours 42 min 21.21 sec)

(I'm now testing if I manage to complete the same query with rev_id > 850M which is my actual goal.)

If revision_compat is currently the only way to successfully perform a query against the revision table with a condition on the comment text, speeding up the queries against the comment table (with an index otherwise) is going to be a blocker for the removal of the revision_compat table.

T346293: ipblocks schema redesign for multiblocks propose to add a is_deleted field to actor (and one user may have multiple rows for different hidden status), and something similar may also happen in comment (though there is currently not a task for this).