Page MenuHomePhabricator

Investigate table schema / DB abstraction layer for section/comment watching/unwatching
Open, Needs TriagePublic

Description

Notes

  • Create table with user/title/sectionname
  • TBD: Expiration date of entries
  • Build an actual table schema based on this and the watchlist table (or Echo?)
  • Create interfaces (services, records?)
    • Watch this section [1]
    • Unwatch this section [1]
    • Check if I'm watching this section [1]
    • List all users watching this section
  1. Should also have a public-facing API at some point

Could create a special page to list all watched threads. Maybe not required for v1?

Event Timeline

Esanders created this task.Oct 7 2020, 2:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 7 2020, 2:00 PM

Having never had to really interact with table structure in mediawiki before, here's what I've worked out:

Mediawiki extension SQL tables

  1. Create a SQL patch file containing all your CREATE TABLE / INDEX statements
  2. In extension.json hook LoadExtensionSchemaUpdates( DatabaseUpdater $updater )
  3. That hook should call $updater->addExtensionTable( 'your_table_name', $pathToYourSQLFile );
  4. Make sure that update.php is run, which should create the table from that SQL
    • I think this is part of the regular train deploy, but might need to double-check
    • later tagging in of DBA to review the schema once we're ready would confirm that

If you're doing something complicated, $updater->getDB()->getType() can be used to provide database-specific SQL patch files. Core code is supposed to support MySQL, Postgres, and SQLite, but it looks like extensions might get some leeway. The production servers are running MySQL / MariaDB.

In the future, for any changes, be sure to follow the production schema update guide for added scrutiny. There are a lot of functions in DatabaseUpdater which can be used to describe a change you're going to make to your schema, which will run an appropriate SQL patch file if that change looks like it's not yet made.

DiscussionTools specific notification schema

This is a loosely sketched schema:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,
	-- Reference to the specific section being subscribed to
	sub_page_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp binary(14) not null,
	PRIMARY KEY (sub_page_id, sub_page_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_page_id, sub_page_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page_id ON /*_*/discussiontools_subscription (sub_page_id);

Thoughts:

  • I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.
  • Schema has a creation date, but should probably have an index on that if we want to list the users in that order. It feels a bit awkward to wind up with a table with an index that encompasses every column, though.
  • Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.
  • If expiring, do we want to clean up the table of those entries, or leave them around?

Strongly related is the Echo schema.

I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.

For comment IDs we should be able to calculate a limit of 2 * ( max username length + timestamp length + spacers )

For heading IDs, there might be a limit somewhere on the length of the heading ID hash attribute (possibly because it would generate too-long URLs otherwise)?

Schema has a creation date, but should probably have an index on that if we want to list the users in that order.

I don't know what the best practice is here, but if we don't have a use case, maybe we should not add the index yet.

Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.

I think for version 1 we can ignore expiration.

If expiring, do we want to clean up the table of those entries, or leave them around?

We should see what is happening with watchlist expiration

	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,

I think this would be better as a page_namespace + page_title pair. page_id can change when a page is deleted and restored. Also, if the page is archived by moving it (https://en.wikipedia.org/wiki/Help:Archiving_a_talk_page/Other_procedures), the page_id will now refer to the archive, which is probably undesirable.

	-- Key to user_id
	sub_user int unsigned not null,

Might be better to make this a key to actor_id, in case we ever implement notifications for logged-out users?

@matmarex So, my logic on both of those fields was to look at the echo_event table and just copy how they were doing it. However, I think you're right that our case differs. Instead we should be copying watchlist:

CREATE TABLE /*_*/watchlist (
  wl_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- Key to user.user_id
  wl_user int unsigned NOT NULL,

  -- Key to page_namespace/page_title
  -- Note that users may watch pages which do not exist yet,
  -- or existed in the past but have been deleted.
  wl_namespace int NOT NULL default 0,
  wl_title varchar(255) binary NOT NULL default '',

  -- Timestamp used to send notification e-mails and show "updated since last visit" markers on
  -- history and recent changes / watchlist. Set to NULL when the user visits the latest revision
  -- of the page, which means that they should be sent an e-mail on the next change.
  wl_notificationtimestamp varbinary(14)

) /*$wgDBTableOptions*/;

@DLynch This looks good. Let's start on the "Create interfaces" part of this with the schema you and Bartosz described. We can tweak it as we go if necessary.

DLynch added a comment.EditedOct 30 2020, 3:32 PM

The combined patch file:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Key to page_namespace/page_title
	-- Note that users may watch pages which do not exist yet,
	-- or existed in the past but have been deleted.
	sub_namespace int NOT NULL default 0,
	sub_title varchar(255) binary NOT NULL default '',	-- Reference to the specific section being subscribed to
	sub_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp varbinary(14) not null,
	PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_namespace, sub_title, sub_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page ON /*_*/discussiontools_subscription (sub_namespace, sub_title);

I sort of feel that PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user) might be a bit much. That said, we'll definitely need that as an index at least, so it's maybe reasonable. Can let DBA comment on whether an auto-increment sub_id would be better later. 🤷🏻‍♂️

I think an auto-incrementer would be better, and is what watchlist is using. It has benefits for paging and possibly sharding.