
DBA review: conversation subscriptions
Open, Medium, Public

Description

This task is about the work involved with DBA reviewing/approving the new table(s) we'll have implemented for storing conversation subscriptions in T264885.

Background

For additional context, see the Background and Details sections of T260372.

Table structure

See comment: T263817#7030301

Done

  • Tag DBA once the Table structure section above has been defined, per the guidance @Marostegui shared in T260372#6482475
  • The usual DBA questions have been answered
  • DBA communicates what – if any – changes need to be made to the new table(s) prior to them being deployed to production.
  • Table is created

Event Timeline

Work on this has been happening in T264885.


Work on this has been happening in T264885.

Noted. As discussed in standup today, we're going to repurpose this ticket for the DBA review.

ppelberg renamed this task from "Draft thread subscription table structure" to "DBA review: conversation subscriptions". Feb 9 2021, 6:31 PM
ppelberg updated the task description.

Info for DBA review:

Note that the code is already merged after we reviewed it internally, but the functionality is disabled using $wgDiscussionTools_topicsubscription = 'unavailable'.


Responses to questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables#Preparation:

  • Should this table be replicated to wiki replicas (does it not contain private data)?
    • No – it contains private-ish data, like the watchlist.
  • Size of the table (number of rows expected).
    • Potentially unbounded, depending on whether people use the new feature; probably smaller than watchlist (~242M rows on enwiki).
    • We could limit the size if it is needed (e.g. maximum number of entries per user, or automatically expiring old entries).
  • Expected growth per year (number of rows).
    • We don't know yet. Hypothetically, if every one of ~30,000 users who edited talk pages last month subscribed to 3 new threads per day, we'd end up somewhere around 32M rows per year (on enwiki). This is probably a gross overestimate though.
    • We should be able to get a better idea once topic subscriptions are deployed on smaller wikis or as a beta feature.
  • Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
    • Writes: Occur when a logged-in user subscribes or unsubscribes. To stick to the previous estimate, ~32M/year is ~24/minute (on enwiki).
    • Reads: Occur when a logged-in user views a discussion page, or when any user saves a comment on a discussion page. I don't know what the numbers for this would be (I couldn't find any page view statistics that would be separated by namespace and also limited to logged-in users only), but hopefully this gives you some idea.
  • Examples of queries that will be using the table.
    • Fetching a user's subscriptions on a given page [after T281000]
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForUser
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_user = 1 AND sub_item IN ('h-Matma_Rex-2020-01-22T23:19:00.000Z','h-Matma_Rex-2020-01-25T08:31:00.000Z','h-Matma_Rex-2021-03-24T16:48:00.000Z')
    • Adding a subscription
      • SubscriptionStore::addSubscriptionForUser
      • INSERT INTO `discussiontools_subscription` (sub_user,sub_namespace,sub_title,sub_section,sub_item,sub_state,sub_created) VALUES (1,1,'Scratch','a 2','h-Matma_Rex-2021-03-06T18:17:00.000Z',1,'20210423190208') ON DUPLICATE KEY UPDATE sub_state = 1
    • Removing a subscription
      • SubscriptionStore::removeSubscriptionForUser
      • UPDATE `discussiontools_subscription` SET sub_state = 0 WHERE sub_user = 1 AND sub_item = 'h-Matma_Rex-2021-03-24T16:40:00.000Z'
    • Fetching all users subscribed to a given topic (before sending them notifications)
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForTopic
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_item = 'h-Matma_Rex-2020-11-04T22:02:00.000Z' AND sub_state = 1
    • Updating the "last touched" timestamp of a subscription (after sending a notification)
      • SubscriptionStore::updateSubscriptionTimestamp
      • UPDATE `discussiontools_subscription` SET sub_notified = '20210423190239' WHERE sub_item = 'h-Matma_Rex-2020-11-04T22:02:00.000Z'
  • The release plan for the feature (are there specific wikis you'd like to test first etc).
    • First to be released as a beta feature on select wikis (see T274188; we haven't determined which ones yet, but probably some moderately sized ones – we previously worked with arwiki, cswiki, huwiki).
    • Assuming everything goes well, then to be released as a beta feature everywhere, then enabled by default on select wikis, and then everywhere.
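For reference, the queries above imply a table shape roughly like the following. This is a sketch reconstructed from the column names in the queries, not the actual schema from T263817#7030301; the sub_id surrogate key, the types and lengths, and the index definitions are all assumptions.

```sql
-- Hypothetical reconstruction of discussiontools_subscription,
-- inferred from the queries above; types, lengths and keys are guesses.
CREATE TABLE discussiontools_subscription (
  sub_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, -- assumed surrogate key
  sub_item VARBINARY(255) NOT NULL,     -- thread item ID, e.g. 'h-Matma_Rex-2020-11-04T22:02:00.000Z'
  sub_namespace INT NOT NULL,           -- namespace of the subscribed page
  sub_title VARBINARY(255) NOT NULL,    -- title of the subscribed page
  sub_section VARBINARY(255) NOT NULL,  -- section heading text
  sub_state INT UNSIGNED NOT NULL,      -- 1 = subscribed, 0 = unsubscribed
  sub_user INT UNSIGNED NOT NULL,       -- local user ID
  sub_created BINARY(14) NOT NULL,      -- MediaWiki timestamp, e.g. '20210423190208'
  sub_notified BINARY(14) DEFAULT NULL, -- timestamp of the last notification sent
  PRIMARY KEY (sub_id),
  UNIQUE KEY sub_item_user (sub_item, sub_user), -- implied by ON DUPLICATE KEY UPDATE
  KEY sub_user (sub_user)                        -- for per-user lookups
);
```

The unique key on (sub_item, sub_user) is implied by the INSERT … ON DUPLICATE KEY UPDATE query, and the key on sub_user by the per-user fetch; everything else is guesswork.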
matmarex updated the task description.
matmarex added a subscriber: matmarex.
LSobanski triaged this task as Medium priority. Mon, Apr 26, 9:13 AM
LSobanski moved this task from Triage to Pending comment on the DBA board.

Info for DBA review:

Note that the code is already merged after we reviewed it internally, but the functionality is disabled using $wgDiscussionTools_topicsubscription = 'unavailable'.

Let's make the VARCHAR columns VARBINARY instead.
Can you also move the PK right under the last column on the .sql patch? Normally the PK is the first item after the column definition. It is easier to read it that way.


Responses to questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables#Preparation:

  • Should this table be replicated to wiki replicas (does it not contain private data)?
    • No – it contains private-ish data, like the watchlist.

Thanks, could you send a patch to https://github.com/wikimedia/puppet/blob/production/manifests/realm.pp#L174 in puppet with the table added there? Once done, add me as a reviewer so I can merge it.

  • Size of the table (number of rows expected).
    • Potentially unbounded, depending on whether people use the new feature; probably smaller than watchlist (~242M rows on enwiki).
    • We could limit the size if it is needed (e.g. maximum number of entries per user, or automatically expiring old entries).

We should limit it; the watchlist table is almost 50 GB (compressed) on enwiki, so it is reaching a considerable size.
If I were to choose, I would prefer to limit the number of entries rather than expiring them (to avoid fragmentation as much as we can).
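A minimal sketch of the kind of cap being suggested, assuming a check-then-insert flow; the column names come from the queries in the task description, and the flow is an illustration, not the actual SubscriptionStore code.

```sql
-- Hypothetical per-user subscription cap. First count the user's
-- existing rows (active and inactive, since unsubscribing only
-- flips sub_state to 0 rather than deleting the row)...
SELECT COUNT(*) AS n
FROM discussiontools_subscription
WHERE sub_user = 1;

-- ...then run the usual insert only if n is below the configured
-- maximum; otherwise refuse the subscription in application code.
INSERT INTO discussiontools_subscription
  (sub_user, sub_namespace, sub_title, sub_section, sub_item, sub_state, sub_created)
VALUES (1, 1, 'Scratch', 'a 2', 'h-Matma_Rex-2021-03-06T18:17:00.000Z', 1, '20210423190208')
ON DUPLICATE KEY UPDATE sub_state = 1;
```

Since the two statements race, the count and insert would need to run in one transaction, or the limit treated as a soft cap that tolerates slight overshoot.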

  • Expected growth per year (number of rows).
    • We don't know yet. Hypothetically, if every one of ~30,000 users who edited talk pages last month subscribed to 3 new threads per day, we'd end up somewhere around 32M rows per year (on enwiki). This is probably a gross overestimate though.
    • We should be able to get a better idea once topic subscriptions are deployed on smaller wikis or as a beta feature.

Thanks - let's wait for this. Although if we are going to limit it, it should be ok.

  • Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
    • Writes: Occur when a logged-in user subscribes or unsubscribes. To stick to the previous estimate, ~32M/year is ~24/minute (on enwiki).

That should be fine.

  • Reads: Occur when a logged-in user views a discussion page, or when any user saves a comment on a discussion page. I don't know what the numbers for this would be (I couldn't find any page view statistics that would be separated by namespace and also limited to logged-in users only), but hopefully this gives you some idea.

I don't think we have such a metric.

  • Examples of queries that will be using the table.
    • Fetching a user's subscriptions on a given page [after T281000]
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForUser
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_user = 1 AND sub_item IN ('h-Matma_Rex-2020-01-22T23:19:00.000Z','h-Matma_Rex-2020-01-25T08:31:00.000Z','h-Matma_Rex-2021-03-24T16:48:00.000Z')
    • Adding a subscription
      • SubscriptionStore::addSubscriptionForUser
      • INSERT INTO `discussiontools_subscription` (sub_user,sub_namespace,sub_title,sub_section,sub_item,sub_state,sub_created) VALUES (1,1,'Scratch','a 2','h-Matma_Rex-2021-03-06T18:17:00.000Z',1,'20210423190208') ON DUPLICATE KEY UPDATE sub_state = 1
    • Removing a subscription
      • SubscriptionStore::removeSubscriptionForUser
      • UPDATE `discussiontools_subscription` SET sub_state = 0 WHERE sub_user = 1 AND sub_item = 'h-Matma_Rex-2021-03-24T16:40:00.000Z'
    • Fetching all users subscribed to a given topic (before sending them notifications)
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForTopic
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_item = 'h-Matma_Rex-2020-11-04T22:02:00.000Z' AND sub_state = 1
    • Updating the "last touched" timestamp of a subscription (after sending a notification)
      • SubscriptionStore::updateSubscriptionTimestamp
      • UPDATE `discussiontools_subscription` SET sub_notified = '20210423190239' WHERE sub_item = 'h-Matma_Rex-2020-11-04T22:02:00.000Z'

The writes should be fine. Once we've got some data in the table, let's analyze the SELECTs. We've seen issues with WHERE + IN on big tables.

  • The release plan for the feature (are there specific wikis you'd like to test first etc).
    • First to be released as a beta feature on select wikis (see T274188; we haven't determined which ones yet, but probably some moderately sized ones – we previously worked with arwiki, cswiki, huwiki).
    • Assuming everything goes well, then to be released as a beta feature everywhere, then enabled by default on select wikis, and then everywhere.

I would like to see it released to dewiki after those three wikis.

Change 683067 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Change DB column types from "string" to "binary" (VARCHAR to VARBINARY in MySQL)

https://gerrit.wikimedia.org/r/683067

Change 683070 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/puppet@production] realm.pp: Add discussiontools_subscription to private tables

https://gerrit.wikimedia.org/r/683070

Change 683071 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Limit number of topic subscriptions per user

https://gerrit.wikimedia.org/r/683071

Thank you for taking a look! Replies below:

Let's make the VARCHAR columns VARBINARY instead.

Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/683067

Can you also move the PK right under the last column on the .sql patch? Normally the PK is the first item after the column definition. It is easier to read it that way.

The .sql files are generated from the .json file, using the generateSchemaSql.php maintenance script, and as far as I can tell, there is no way to change the order. I tried reordering things in the JSON file and it made no difference.

Thanks, could you send a patch to https://github.com/wikimedia/puppet/blob/production/manifests/realm.pp#L174 in puppet with the table added there? Once done, add me as a reviewer so I can merge it.

Done: https://gerrit.wikimedia.org/r/c/operations/puppet/+/683070

We should limit it; the watchlist table is almost 50 GB (compressed) on enwiki, so it is reaching a considerable size.
If I were to choose, I would prefer to limit the number of entries rather than expiring them (to avoid fragmentation as much as we can).

Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/683071

We've talked about this in standup yesterday and engineering meeting today. To summarize:

  • We'd prefer not to have a limit, but it seems reasonable to add one to prevent abuse. The patch proposes 5000 entries per user; we can adjust this up or down.
  • We definitely don't like automatic expiration.
  • We're also adding logging for when a user approaches the limit, so we can later decide to remove the limit or change the mechanism if we're reasonably sure that's safe, or prioritize a better interface for managing your subscriptions (T273342).
  • Examples of queries that will be using the table.
    • Fetching a user's subscriptions on a given page [after T281000]
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForUser
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_user = 1 AND sub_item IN ('h-Matma_Rex-2020-01-22T23:19:00.000Z','h-Matma_Rex-2020-01-25T08:31:00.000Z','h-Matma_Rex-2021-03-24T16:48:00.000Z')
    • (…)

The writes should be fine. Once we've got some data in the table, let's analyze the SELECTs. We've seen issues with WHERE + IN on big tables.

I'd like to hear more about this. Would it be better to do multiple queries with sub_item = '...'? (We're doing that right now, but I noticed this when preparing that comment, and thought that a single query would be better.) Or should we batch them when the list of items is too large? (Or just give up in that case?)
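To make the alternatives being asked about concrete (the query shapes use the example item IDs from above, and the batch size is an arbitrary illustration, not a DBA recommendation):

```sql
-- Option 1: a single query with IN (the current proposal); the
-- optimizer may choose a poor plan when the list gets very large.
SELECT sub_item, sub_state
FROM discussiontools_subscription
WHERE sub_user = 1
  AND sub_item IN ('h-Matma_Rex-2020-01-22T23:19:00.000Z',
                   'h-Matma_Rex-2020-01-25T08:31:00.000Z',
                   'h-Matma_Rex-2021-03-24T16:48:00.000Z');

-- Option 2: one point lookup per item; each is a cheap indexed
-- read, but it costs one round trip per item.
SELECT sub_item, sub_state
FROM discussiontools_subscription
WHERE sub_user = 1
  AND sub_item = 'h-Matma_Rex-2020-01-22T23:19:00.000Z';

-- Option 3: a middle ground – split the item list into fixed-size
-- chunks (say, 100 items) and issue one IN query per chunk.
```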

The .sql files are generated from the .json file, using the generateSchemaSql.php maintenance script, and as far as I can tell, there is no way to change the order. I tried reordering things in the JSON file and it made no difference.

I did some brief code-reading, and I think this winds up being at the mercy of Doctrine's Platform classes. Specifically, MySqlPlatform's _getCreateTableSQL. It's not in a particularly overridable spot for MWMySqlPlatform to meddle with, so I think you're stuck with the order as-generated.

Thank you for taking a look! Replies below:

Let's make the VARCHAR columns VARBINARY instead.

Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/683067

Thanks

Can you also move the PK right under the last column on the .sql patch? Normally the PK is the first item after the column definition. It is easier to read it that way.

The .sql files are generated from the .json file, using the generateSchemaSql.php maintenance script, and as far as I can tell, there is no way to change the order. I tried reordering things in the JSON file and it made no difference.

Ah ok, not a problem

Thanks, could you send a patch to https://github.com/wikimedia/puppet/blob/production/manifests/realm.pp#L174 in puppet with the table added there? Once done, add me as a reviewer so I can merge it.

Done: https://gerrit.wikimedia.org/r/c/operations/puppet/+/683070

Thanks, once we are fully happy about this, I will merge it.
Please keep in mind that after the merge, I will need to restart mysql on the sanitarium hosts (those doing the table filtering) so please do not create the table until I have done so. I will make that clear on this task once that is done and the table can be created.

We should limit it; the watchlist table is almost 50 GB (compressed) on enwiki, so it is reaching a considerable size.
If I were to choose, I would prefer to limit the number of entries rather than expiring them (to avoid fragmentation as much as we can).

Done: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/683071

We've talked about this in standup yesterday and engineering meeting today. To summarize:

  • We'd prefer not to have a limit, but it seems reasonable to add one to prevent abuse. The patch proposes 5000 entries per user; we can adjust this up or down.

Thanks, I did ask about the limit as your previous comment mentioned it was a possibility. Given the issues we've had with the watchlist table, I do prefer to have a hard limit on this.

  • We definitely don't like automatic expiration.

+1.

  • We're also adding logging for when a user approaches the limit, so we can later decide to remove the limit or change the mechanism if we're reasonably sure that's safe, or prioritize a better interface for managing your subscriptions (T273342).

Sounds good.

  • Examples of queries that will be using the table.
    • Fetching a user's subscriptions on a given page [after T281000]
      • SubscriptionStore::fetchSubscriptions / getSubscriptionItemsForUser
      • SELECT sub_user,sub_item,sub_namespace,sub_title,sub_section,sub_state,sub_created,sub_notified FROM `discussiontools_subscription` WHERE sub_user = 1 AND sub_item IN ('h-Matma_Rex-2020-01-22T23:19:00.000Z','h-Matma_Rex-2020-01-25T08:31:00.000Z','h-Matma_Rex-2021-03-24T16:48:00.000Z')
    • (…)

The writes should be fine. Once we've got some data in the table, let's analyze the SELECTs. We've seen issues with WHERE + IN on big tables.

I'd like to hear more about this. Would it be better to do multiple queries with sub_item = '...'? (We're doing that right now, but I noticed this when preparing that comment, and thought that a single query would be better.) Or should we batch them when the list of items is too large? (Or just give up in that case?)

For now we can leave it if you are not expecting a huge list of conditions there. We've seen the optimizer behaving badly when given a huge list of items in the IN clause on big tables: T267216, T268457.
Unfortunately it is impossible to know in advance whether this will be the case, as it is entirely up to the optimizer.

Change 683070 merged by Marostegui:

[operations/puppet@production] realm.pp: Add discussiontools_subscription to private tables

https://gerrit.wikimedia.org/r/683070

Mentioned in SAL (#wikimedia-operations) [2021-05-04T05:07:58Z] <marostegui> Restart sanitarium hosts to pick up new filters T263817

I have restarted sanitarium hosts to filter the tables on the wiki replicas.
The tables can now be created in production. From what I can see, we are still pending some code tweaks to decide on the way to limit/expire things on the table (T263817#7040637)

Change 683067 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Change DB column types from "string" to "binary" (VARCHAR to VARBINARY in MySQL)

https://gerrit.wikimedia.org/r/683067