
Create required table for new Watchlist Expiry feature
Closed, Resolved (Public)

Description

The Community Tech team has started working on a new project called Watchlist Expiry. We need a new DB table to store the expiration timestamp and watchlist item ID.

  • Predicted usage:
    • We might see an increase in usage of the Watchlist due to this new feature.
    • This new table will never be bigger than the existing watchlist table.
    • We will periodically purge the watchlist and watchlist_expiry tables -- removing from both tables all items whose expiry timestamp is before the current time. https://phabricator.wikimedia.org/T235005#5714344
      • Purging will happen about twice a day to cover different timezones.
      • It will help purge the watchlist table.
  • Limitations
    • The minimum expiration period allowed is 1 day. We will guarantee expiration within a day of the chosen date, without guaranteeing the exact hour.
    • To start, we will enable expiration with a preset of day/week/month, but later the product will allow choosing a custom date.
    • We can set a maximum limit for expiration that makes sense (6 months or a year).
  • Rollout plan
    • Rollout to wikis will be incremental. For a tentative plan, see {T240094#5807341}
  • Queries:
    • Create:
CREATE TABLE `watchlist_expiry` (
  `we_item` int unsigned NOT NULL PRIMARY KEY,
  `we_expiry` binary(14) NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX `we_expiry` ON `watchlist_expiry` (`we_expiry`);
  • Purge expired items:
DELETE  `watchlist`, `watchlist_expiry` FROM `watchlist_expiry` JOIN `watchlist` ON (`we_item`=`wl_id`) WHERE `we_expiry` < NOW();
  • Purge expired items explain:
EXPLAIN DELETE  `watchlist`, `watchlist_expiry` FROM `watchlist_expiry` JOIN `watchlist` ON (`we_item`=`wl_id`) WHERE `we_expiry` < NOW();
+------+-------------+------------------+--------+-------------------+---------+---------+-------------------------------+------+-------------+
| id   | select_type | table            | type   | possible_keys     | key     | key_len | ref                           | rows | Extra       |
+------+-------------+------------------+--------+-------------------+---------+---------+-------------------------------+------+-------------+
|    1 | SIMPLE      | watchlist_expiry | ALL    | PRIMARY,we_expiry | NULL    | NULL    | NULL                          |    1 | Using where |
|    1 | SIMPLE      | watchlist        | eq_ref | PRIMARY           | PRIMARY | 4       | wiki.watchlist_expiry.we_item |    1 |             |
+------+-------------+------------------+--------+-------------------+---------+---------+-------------------------------+------+-------------+
2 rows in set (0.00 sec)

Event Timeline


Why a new table instead of a new field on the existing table?


Why a new table instead of a new field on the existing table?

Most (or at least a large number) of watchlist items will not expire, so there would be lots of NULLs in the table. Better to have a separate table with a foreign key that only contains the required rows.


I think there are also performance gains to be had from a smaller overall table when doing large bulk deletes. My thinking on that might be outdated.

Our hope was that having this data in a separate table would allow us to tune the automated deletions that will happen as part of expiration. If we do this in the existing watchlist table, we risk causing table locks that could significantly affect app performance.

I had a few random thoughts about this:

  • The DELETEs need to ensure the relevant rows from both tables are deleted.
  • We might need some understanding of the transaction scenario here. I imagine we'll be doing the delete in a single query. That means that if the query fails, none of the appropriate expiring items will be deleted. How should we recover? Try again? Loop through each one and DELETE individually? One of these might be better or worse for performance.
  • This table is guaranteed to be smaller than the existing watchlist table.
  • On smaller wikis, I suppose there's the possibility that this table might be completely empty from time to time. That chance exists on a larger wiki too, but the probability approaches zero pretty quickly.

Regarding the plans for release: We will want to release this feature, once it's complete, in a cautious manner. To do this, we'll first release to a subset of wikis, focusing on a few smaller wikis. This way, we can collect data and feedback in a purposeful manner, before we release the changes to larger wikis.

Regarding the question of what percentage of users will watch a page temporarily: We cannot determine this number in advance. This feature presents a new way of watching pages and a new way of even thinking about 'watching,' from the user perspective. For this reason, current data on watched pages may be misleading or unhelpful in the context of this project. However, we can say that fewer people will watch pages temporarily than watch pages in general. This is because some users will want to watch some pages permanently (i.e., with no expiration date).

For more accurate and useful numbers, I propose that we first release this feature on a small subset of wikis that have a lower number of users and pages. We can track the percentage of pages that are watched temporarily, which we can then share with the Core Platform Team. From there, a determination can be made if the numbers are manageable in the context of larger wikis, or if changes would need to be made. Does this sound appropriate?

Also, @Mooeypoo, should I write this information into the actual ticket, or is it okay if it remains in the comments?

Now that we have things organized and settled, I'm tagging DBA for assistance. Please let us know if there's anything we need to further expand on, explain, or help with. Your help is super appreciated, thank you!

I think there are also performance gains to be had from a smaller overall table when doing large bulk deletes. My thinking on that might be outdated.

This is still valid. Smaller tables are easier to deal with from a maintenance point of view - they are less likely to show weird and unexpected optimizer query plans.

Our hope was that having this data in a separate table would allow us to tune the automated deletions that will happen as part of expiration. If we do this in the existing watchlist table, we risk causing table locks that could significantly affect app performance.

How are you going to determine the size of the DELETE batches? Deleting is an expensive operation in MySQL, so ideally we'd prefer frequent but smaller transactions rather than fewer but bigger ones.
I also assume the normal "wait for replication" protections will be used in the scripts.

  • We might need some understanding of the transaction scenario here. I imagine we'll be doing the delete in a single query. That means that if the query fails, none of the appropriate expiring items will be deleted. How should we recover? Try again? Loop through each one and DELETE individually? One of these might be better or worse for performance.

As I said above, the smaller, the better for performance and probably for retries as well. We'd rather have 1000 smaller but fast queries than 2 huge ones.

  • This table is guaranteed to be smaller than the existing watchlist table.

Can you provide some figures? The watchlist table is quite big (214141033 rows, 28GB on disk), so it is relatively easy to stay under those figures. What are the expected size and growth?

Regarding the plans for release: We will want to release this feature, once it's complete, in a cautious manner. To do this, we'll first release to a subset of wikis, focusing on a few smaller wikis. This way, we can collect data and feedback in a purposeful manner, before we release the changes to larger wikis.

+1

Regarding the question of what percentage of users will watch a page temporarily: We cannot determine this number in advance. This feature presents a new way of watching pages and a new way of even thinking about 'watching,' from the user perspective. For this reason, current data on watched pages may be misleading or unhelpful in the context of this project. However, we can say that fewer people will watch pages temporarily than watch pages in general. This is because some users will want to watch some pages permanently (i.e., with no expiration date).

In regards to the issue with the large/small DELETE batches, how would the application handle it if a user watches 1M pages and decides that the expiration date will be the same for all of those pages? If we attempted to do that sort of purge/expiration all at once, I can see a problem there.

Will this table receive reads? Can you provide some examples of those?

Thank you for the quick response, @Marostegui.

Our hope was that having this data in a separate table would allow us to tune the automated deletions that will happen as part of expiration. If we do this in the existing watchlist table, we risk causing table locks that could significantly affect app performance.

How are you going to determine the size of the DELETE batches? Deleting is an expensive operation in MySQL, so ideally we'd prefer frequent but smaller transactions rather than fewer but bigger ones.
I also assume the normal "wait for replication" protections will be used in the scripts.

We are flexible; we initially thought the bottleneck would be the operation itself, so we offered to do it in specific "slots" (say, twice a day) -- but if it makes more sense to do deletes more often, there was a suggestion to run the purge every X edits (similar to what is done right now with RecentChanges) which would mean a much much smaller batch each time. We can also potentially make this configurable, so we can make sure it is reasonably set per the size of the wiki (and the table).

We will have some mitigations on the UI, so the actual purge can be somewhat flexible; we could do it more or less often as needed.

Does this mitigate the concern, or is there anything further we can consider?

  • We might need some understanding of the transaction scenario here. I imagine we'll be doing the delete in a single query. That means that if the query fails, none of the appropriate expiring items will be deleted. How should we recover? Try again? Loop through each one and DELETE individually? One of these might be better or worse for performance.

As I said above, the smaller, the better for performance and probably for retries as well. We'd rather have 1000 smaller but fast queries than 2 huge ones.

Depending on the amount of purges, we might not necessarily need to retry; for example, if we run the purge on enwiki for every 1000 edits, then if one of those fails, it is probably okay to wait for the next one to run rather than retry, though that is something that you are in a better position to advise on.

The point here, though, is that as long as we can run this purge once or twice a day, we meet the need to make sure items expire at the resolution of a "day", which is what users expect. How we do it within that window is completely flexible.

  • This table is guaranteed to be smaller than the existing watchlist table.

Can you provide some figures? The watchlist table is quite big (214141033 rows, 28GB on disk), so it is relatively easy to stay under those figures. What are the expected size and growth?

Unfortunately, that really depends on usage, and we're really wary of guessing. I am leaving this one open for a little bit while I discuss how we might estimate this better; please stay tuned :)

What we *could* do, though, is decide (on a product level) to limit how many items a user is allowed to set to expire, which would limit the size of the table; that said, this also depends on the number of users using it and the expectations from the product.

I will be getting back to this question after further discussion with the team and our product manager and try to come up with a better estimate.

Regarding the question of what percentage of users will watch a page temporarily: We cannot determine this number in advance. This feature presents a new way of watching pages and a new way of even thinking about 'watching,' from the user perspective. For this reason, current data on watched pages may be misleading or unhelpful in the context of this project. However, we can say that fewer people will watch pages temporarily than watch pages in general. This is because some users will want to watch some pages permanently (i.e., with no expiration date).

In regards to the issue with the large/small DELETE batches, how would the application handle it if a user watches 1M pages and decides that the expiration date will be the same for all of those pages? If we attempted to do that sort of purge/expiration all at once, I can see a problem there.

I think that we will have to set some sort of limit on expiring items no matter what, so a million expiring items sounds like something we can severely limit ahead of time -- but I am not sure what number would match users' expectations. Following from the previous question, I'll get back to this ticket with a clearer answer once we talk about user expectations within the team.

Will this table receive reads? Can you provide some examples of those?

Yes, there are a couple of types of reads we will need to do, but we can examine and see whether we need to change this behavior in case they are problematic:

  • Every page for logged-in users: When a logged-in user views an article, we need to know whether the article exists at all in the expiry table, so we can show the user an indication that it is temporarily watched (a partial star instead of a full star). We might show the expiration date within a tooltip on the star.
  • Edit Summary: When a logged-in user edits a page they are watching, there's already a checkbox to unwatch the page -- that will also show whether the page is temporarily watched, and will allow the user to change that expiration (or stop watching altogether).
  • The edit watchlist page: Items with an expiry will be marked and their expiration date shown.

I'm pinging @aezell here to check that I have not missed any of the occurrences, but those are the instances we've discussed the most with the product owner.

We are flexible; we initially thought the bottleneck would be the operation itself, so we offered to do it in specific "slots" (say, twice a day) -- but if it makes more sense to do deletes more often, there was a suggestion to run the purge every X edits (similar to what is done right now with RecentChanges) which would mean a much much smaller batch each time. We can also potentially make this configurable, so we can make sure it is reasonably set per the size of the wiki (and the table).

We will have some mitigations on the UI, so the actual purge can be somewhat flexible; we could do it more or less often as needed.

Does this mitigate the concern, or is there anything further we can consider?

Doing it every X edits sounds good to me if the result will be smaller (but more frequent) deletes. Let's indeed make sure it is configurable, so we can adjust it in case we see further issues when this is released.

Depending on the amount of purges, we might not necessarily need to retry; for example, if we run the purge on enwiki for every 1000 edits, then if one of those fails, it is probably okay to wait for the next one to run rather than retry, though that is something that you are in a better position to advise on.

The point here, though, is that as long as we can run this purge once or twice a day, we meet the need to make sure items expire at the resolution of a "day", which is what users expect. How we do it within that window is completely flexible.

Not sure I get this, if we move to a model where we'd be purging based on the number of edits, we should be good now?
Say we purge every 1000 edits, if that operation fails, we'd be purging once it reaches 2000?

I think at least one retry is always good; servers can be overloaded, there could be connection issues, contention issues, etc., so at least one retry should be attempted, I believe.

  • This table is guaranteed to be smaller than the existing watchlist table.

Can you provide some figures? The watchlist table is quite big (214141033 rows, 28GB on disk), so it is relatively easy to stay under those figures. What are the expected size and growth?

Unfortunately, that really depends on usage, and we're really wary of guessing. I am leaving this one open for a little bit while I discuss how we might estimate this better; please stay tuned :)

What we *could* do, though, is decide (on a product level) to limit how many items a user is allowed to set to expire, which would limit the size of the table; that said, this also depends on the number of users using it and the expectations from the product.

I will be getting back to this question after further discussion with the team and our product manager and try to come up with a better estimate.

Sounds good - thanks.
I like the idea of setting a limit on the user side, to prevent possible abuses or things like that.

In regards to the issue with the large/small DELETE batches, how would the application handle it if a user watches 1M pages and decides that the expiration date will be the same for all of those pages? If we attempted to do that sort of purge/expiration all at once, I can see a problem there.

I think that we will have to set some sort of limit on expiring items no matter what, so a million expiring items sounds like something we can severely limit ahead of time -- but I am not sure what number would match users' expectations. Following from the previous question, I'll get back to this ticket with a clearer answer once we talk about user expectations within the team.

Thanks!

Will this table receive reads? Can you provide some examples of those?

Yes, there are a couple of types of reads we will need to do, but we can examine and see whether we need to change this behavior in case they are problematic:

  • Every page for logged-in users: When a logged-in user views an article, we need to know whether the article exists at all in the expiry table, so we can show the user an indication that it is temporarily watched (a partial star instead of a full star). We might show the expiration date within a tooltip on the star.

So for logged-in users, basically one on every visited page. It would be interesting to see the query indeed.

  • Edit Summary: When a logged-in user edits a page they are watching, there's already a checkbox to unwatch the page -- that will also show whether the page is temporarily watched, and will allow the user to change that expiration (or stop watching altogether).
  • The edit watchlist page: Items with an expiry will be marked and their expiration date shown.

I'm pinging @aezell here to check that I have not missed any of the occurrences, but those are the instances we've discussed the most with the product owner.

Can this feature be easily disabled if needed?
Thank you!

Here are the modified queries; note the EXPLAIN results are on my local dev wiki, hence not representative of production:

Every page for logged-in users: When a logged-in user views an article, we need to know whether the article exists at all in the expiry table, so we can show the user an indication that it is temporarily watched (a partial star instead of a full star). We might show the expiration date within a tooltip on the star.

The query would be changed to:

SELECT wl_id,wl_notificationtimestamp,we_expiry FROM `watchlist` LEFT JOIN `watchlist_expiry` ON ((we_item = wl_id)) WHERE wl_user = 3 AND wl_namespace = 0 AND wl_title = 'Main_Page' LIMIT 1

(replace wl_user, wl_namespace and wl_title accordingly)

Explain results:

+----+-------------+------------------+-------+-------------------------------------------------------+---------+---------+-------------------+------+----------------------+
| id | select_type | table            | type  | possible_keys                                         | key     | key_len | ref               | rows | Extra                |
+----+-------------+------------------+-------+-------------------------------------------------------+---------+---------+-------------------+------+----------------------+
|  1 | SIMPLE      | watchlist        | const | wl_user,namespace_title,wl_user_notificationtimest... | wl_user | 265     | const,const,const |    1 |                      |
|  1 | SIMPLE      | watchlist_expiry | const | PRIMARY                                                | PRIMARY | 4       | const             |    0 | Unique row not found |
+----+-------------+------------------+-------+-------------------------------------------------------+---------+---------+-------------------+------+----------------------+

The query also appears to be cached; I'm not sure for how long.

Edit Summary: When a logged-in user edits a page they are watching, there's already a checkbox to unwatch the page -- that will also show whether the page is temporarily watched, and will allow the user to change that expiration (or stop watching altogether).

This uses the same query as above, which is run anyway because we show the watch/unwatch link (i.e. the star icon) when editing a page, too.

The edit watchlist page: Items with an expiry will be marked and their expiration date shown.

The query is changed to:

SELECT wl_namespace,wl_title,wl_notificationtimestamp,we_expiry FROM `watchlist` LEFT JOIN `watchlist_expiry` ON ((we_item = wl_id)) WHERE wl_user = 3 ORDER BY wl_namespace ASC,wl_title ASC

Explain results:

+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+
| id | select_type | table            | type   | possible_keys                         | key     | key_len | ref                     | rows | Extra                       |
+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+
|  1 | SIMPLE      | watchlist        | ALL    | wl_user,wl_user_notificationtimestamp | NULL    | NULL    | NULL                    |   14 | Using where; Using filesort |
|  1 | SIMPLE      | watchlist_expiry | eq_ref | PRIMARY                               | PRIMARY | 4       | default.watchlist.wl_id |    1 |                             |
+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+

The filesort looks bad, but this appears to be the same query plan that is currently used in MediaWiki.

Not sure I get this, if we move to a model where we'd be purging based on the number of edits, we should be good now?
Say we purge every 1000 edits, if that operation fails, we'd be purging once it reaches 2000?

I think at least one retry is always good; servers can be overloaded, there could be connection issues, contention issues, etc., so at least one retry should be attempted, I believe.

Yes, I think I may have misunderstood the concern you've raised. I was thinking that if the problem is too many retries, then we can limit those because from the product perspective, we are flexible in waiting for the next time the "purge" happens, if needed.

I will be getting back to this question after further discussion with the team and our product manager and try to come up with a better estimate.

Sounds good - thanks.
I like the idea of setting a limit on the user side, to prevent possible abuses or things like that.

So, I'm getting back to you on this :)

We are going to put two limitations in place that will effectively limit the size of the table:

  • Limit on how many expiring items a user can have at any given moment. We are aiming towards being generous, but not as generous as the watchlist (which seems unwieldy). Our thought was around 10,000-20,000 pages per user. We don't think too many users will get to that number, but there are certainly several "super" users who request a lot of pages to be temporarily watched. Is this reasonable, or should we consider another limit? (A sketch of such a check follows this list.)
  • Limit the expiration time. This would mean that rows are temporary in that table, and that the entries "rotate out" of that table (and in effect cause the watchlist table to shrink too, we hope significantly). The limit we want to put on the system is 6 months, maximum.
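
As a rough illustration of the first limitation above (a sketch only, not a committed design; wl_user = 3 is a placeholder as in the earlier example queries, and the 10,000 figure is just the example limit mentioned above):

SELECT COUNT(*) AS expiring_items
  FROM watchlist
  JOIN watchlist_expiry ON (we_item = wl_id)
  WHERE wl_user = 3;
-- If expiring_items has already reached the configured per-user limit
-- (e.g. 10,000), the application would refuse to add another expiring watch.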

Note: We can make those limits configurable per-wiki, so they can potentially be changed or adjusted per-wiki, depending on the wiki size.

Given these limitations, not only would users be unable to add millions of rows to this table, but we are also hopeful that it will help reduce the size of the watchlist table itself, by encouraging users to put their watchlist items on an expiry that will eventually remove them from the watchlist table.

We are planning to do some more statistical work to try to estimate this, but I am not sure how well we can predict user behavior, in terms of how quickly people rush to put their watchlist items on expiry. On one hand, that might be great because it will help reduce the size of the watchlist table, but on the other, it will make the expiry table bigger.

We're hoping our limits are sensible to try and get the best of both worlds...

Does this give enough of an estimate of the potential size of the table?

Will this table receive reads? Can you provide some examples of those?

Yes, there are a couple of types of reads we will need to do, but we can examine and see whether we need to change this behavior in case they are problematic:

  • Every page for logged-in users: When a logged-in user views an article, we need to know whether the article exists at all in the expiry table, so we can show the user an indication that it is temporarily watched (a partial star instead of a full star). We might show the expiration date within a tooltip on the star.

So for logged-in users, basically one on every visited page. It would be interesting to see the query indeed.

@MusikAnimal has added the queries in the previous reply.

There's one that I forgot to mention, which is when the user reads the Watchlist itself. MW already queries the watchlist table at that point; we will join with the expiry table, filter by expiry (so we don't display expired items, just in case) and display a note if an entry belongs to a page that will expire soon. A rough sketch of that join follows.
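
For illustration only (not the final query -- the real Watchlist view involves more tables and conditions; wl_user = 3 and '$NOW' are placeholders, with '$NOW' standing for the application-supplied current timestamp in the same binary(14) format as we_expiry):

SELECT wl_namespace, wl_title, wl_notificationtimestamp, we_expiry
  FROM watchlist
  LEFT JOIN watchlist_expiry ON (we_item = wl_id)
  WHERE wl_user = 3
  AND (we_expiry IS NULL OR we_expiry >= '$NOW');
-- we_expiry IS NULL keeps permanently watched pages; the >= '$NOW' part hides
-- rows that have expired but not yet been purged.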

Can this feature be easily disabled if needed?

Yes! (though we really hope we don't have to :) -- we will be putting the entire feature behind a feature flag, to be able to enable it incrementally (and in a slow, controlled manner) on wikis, per the release strategy that @ifried laid out above.

Usually, this kind of flag is temporary and is supposed to be removed once the feature is enabled everywhere, but I think with this feature it can also remain as a permanent configuration option for wikis that either don't need this or should not be using this.

We will be able to toggle that config variable off and disable the feature per-wiki or overall, if needed.

Here are the modified queries; note the EXPLAIN results are on my local dev wiki, hence not representative of production:

The query is changed to:

SELECT wl_namespace,wl_title,wl_notificationtimestamp,we_expiry FROM `watchlist` LEFT JOIN `watchlist_expiry` ON ((we_item = wl_id)) WHERE wl_user = 3 ORDER BY wl_namespace ASC,wl_title ASC

Explain results:

+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+
| id | select_type | table            | type   | possible_keys                         | key     | key_len | ref                     | rows | Extra                       |
+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+
|  1 | SIMPLE      | watchlist        | ALL    | wl_user,wl_user_notificationtimestamp | NULL    | NULL    | NULL                    |   14 | Using where; Using filesort |
|  1 | SIMPLE      | watchlist_expiry | eq_ref | PRIMARY                               | PRIMARY | 4       | default.watchlist.wl_id |    1 |                             |
+----+-------------+------------------+--------+---------------------------------------+---------+---------+-------------------------+------+-----------------------------+

The filesort looks bad, but this appears to be the same query plan that is currently used in MediaWiki.

The current query seems to be using the wl_user index in production from what I can see on enwiki, so that's good.

Not sure I get this, if we move to a model where we'd be purging based on the number of edits, we should be good now?
Say we purge every 1000 edits, if that operation fails, we'd be purging once it reaches 2000?

I think at least one retry is always good; servers can be overloaded, there could be connection issues, contention issues, etc., so at least one retry should be attempted, I believe.

Yes, I think I may have misunderstood the concern you've raised. I was thinking that if the problem is too many retries, then we can limit those because from the product perspective, we are flexible in waiting for the next time the "purge" happens, if needed.

Sounds good to me. Too many retries are also a problem, so limiting them is a good plus.
To be clear, I think we have to retry that transaction and then just wait for the next purge time.

I will be getting back to this question after further discussion with the team and our product manager and try to come up with a better estimate.

Sounds good - thanks.
I like the idea of setting a limit on the user side, to prevent possible abuses or things like that.

So, I'm getting back to you on this :)

We are going to put two limitations in place that will effectively limit the size of the table:

  • Limit on how many expiring items a user can have at any given moment. We are aiming towards being generous, but not as generous as the watchlist (which seems unwieldy). Our thought was around 10,000-20,000 pages per user. We don't think too many users will get to that number, but there are certainly several "super" users who request a lot of pages to be temporarily watched. Is this reasonable, or should we consider another limit?

That looks good, along with making sure we don't expire all the items at once, as discussed earlier in the thread.

  • Limit the expiration time. This would mean that rows are temporary in that table, and that the entries "rotate out" of that table (and in effect cause the watchlist table to shrink too, we hope significantly). The limit we want to put on the system is 6 months, maximum.

Once those rows have been deleted from the table, at some point we'd need to optimize the table from a MySQL point of view, as otherwise we will not see the disk space claimed back at the OS level. But that's fine. It will be interesting to see how much space we are actually able to reclaim.
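
For reference, reclaiming that space would be something along these lines (a per-host DBA maintenance operation, not something the application itself would run):

OPTIMIZE TABLE watchlist_expiry;
-- on InnoDB this maps to a table rebuild; an equivalent explicit form is:
ALTER TABLE watchlist_expiry ENGINE=InnoDB;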

Note: We can make those limits configurable per-wiki, so they can potentially be changed or adjusted per-wiki, depending on the wiki size.

That's a great plus, thanks for doing it. As we have so many wikis, that makes lots of sense, as big ones will definitely have a different behaviour and will probably need a bit of tuning, especially those with big tables. Big tables can lead to weird query plans (as we've seen in the past). Small wikis are not really a concern, as the table tends to be small enough to fit in memory and not cause issues.

Given these limitations, not only would users be unable to add millions of rows to this table, but we are also hopeful that it will help reduce the size of the watchlist table itself, by encouraging users to put their watchlist items on an expiry that will eventually remove them from the watchlist table.

We are planning to do some more statistical work to try to estimate this, but I am not sure how well we can predict user behavior, in terms of how quickly people rush to put their watchlist items on expiry. On one hand, that might be great because it will help reduce the size of the watchlist table, but on the other, it will make the expiry table bigger.

We're hoping our limits are sensible to try and get the best of both worlds...

Does this give enough of an estimate of the potential size of the table?

Yeah, that helps. Once you've done more stats, let me know so we can estimate better, once we have some approximate number of items (rows) per user (an average, of course).
Shrinking the watchlist table is always good; I'd rather have two medium tables than 1 huge and 1 small :)

Will this table receive reads? Can you provide some examples of those?

Yes, there are a couple of types of reads we will need to do, but we can examine and see whether we need to change this behavior in case they are problematic:

  • Every page for logged-in users: When a logged-in user views an article, we need to know whether the article exists at all in the expiry table, so we can show the user an indication that it is temporarily watched (a partial star instead of a full star). We might show the expiration date within a tooltip on the star.

So for logged-in users, basically one on every visited page. It would be interesting to see the query indeed.

@MusikAnimal has added the queries in the previous reply.

There's one that I forgot to mention, which is when the user reads the Watchlist itself. MW already queries the watchlist table at that point; we will join with the expiry table, filter by expiry (so we don't display expired items, just in case) and display a note if an entry belongs to a page that will expire soon.

Yep, saw them, and they look good :)

Can this feature be easily disabled if needed?

Yes! (though we really hope we don't have to :) -- we will be putting the entire feature behind a feature flag, to be able to enable it incrementally (and in a slow, controlled manner) on wikis, per the release strategy that @ifried laid out above.

Usually, this kind of flag is temporary and is supposed to be removed once the feature is enabled everywhere, but I think with this feature it can also remain as a permanent configuration option for wikis that either don't need this or should not be using this.

We will be able to toggle that config variable off and disable the feature per-wiki or overall, if needed.

Hopefully we won't need to, but it is good to know that it can be turned off at any time if we feel we have to :-)
Just a good safety net!

  • Purge expired items:
DELETE  `watchlist`, `watchlist_expiry` FROM `watchlist_expiry` JOIN `watchlist` ON (`we_item`=`wl_id`) WHERE `we_expiry` < NOW();

Note that MediaWiki's database abstraction layer cannot generate a multi-table delete like this, and I'd probably oppose adding that functionality since multi-table delete seems MySQL-specific and is likely to go against patterns we encourage. Also, the use of NOW() in there might not be replication-safe (or does replication account for that?).

As noted in previous comments, unlimited bulk deletes like this are something we discourage for performance reasons. The usual pattern would be to do something like this in a job:

  1. Begin transaction
  2. Query a batch of wl_ids to delete, perhaps SELECT we_item FROM watchlist_expiry WHERE we_expiry < '$TS' ORDER BY we_expiry, we_item LIMIT $N.
  3. Delete the batch in two queries, DELETE FROM watchlist WHERE wl_id IN (...) and DELETE FROM watchlist_expiry WHERE we_item IN (...).
  4. Commit transaction and wait for replication
  5. Repeat if necessary (i.e. if $N rows were actually fetched), probably by rescheduling the job.

This pattern directly and strictly limits the sizes of the batches, even in cases where wiki users manage to somehow have millions of entries that all expire at the same time.
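
Putting those steps together, a minimal sketch of one such batch in plain SQL (a sketch only, not the MediaWiki implementation: the batch size of 500 is illustrative, '$TS' is the cutoff timestamp from step 2 above, and a temporary table stands in for the list of IDs the job would normally hold in memory):

BEGIN;
-- step 2: grab one bounded batch of expired items
CREATE TEMPORARY TABLE tmp_expired_ids
  SELECT we_item FROM watchlist_expiry
  WHERE we_expiry < '$TS'
  ORDER BY we_expiry, we_item
  LIMIT 500;
-- step 3: delete the batch from both tables
DELETE FROM watchlist WHERE wl_id IN (SELECT we_item FROM tmp_expired_ids);
DELETE FROM watchlist_expiry WHERE we_item IN (SELECT we_item FROM tmp_expired_ids);
DROP TEMPORARY TABLE tmp_expired_ids;
-- step 4: commit, then wait for replication; repeat (step 5) while the batch was full
COMMIT;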

Why a new table instead of a new field on the existing table?

Most (or at least a large number) of watchlist items will not expire, so there would be lots of NULLs in the table. Better to have a separate table with a foreign key that only contains the required rows.

If I'm reading https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html correctly, if we're using InnoDB with anything except the "REDUNDANT" row format then adding a nullable column to watchlist wouldn't take up any additional space in rows where that field is NULL (since the table already has one NULL column). When typed as BINARY(14), it would add 14 bytes to rows where it's not NULL.
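
For comparison only (this is the alternative being discussed, not the chosen design), such a change would look roughly like the following, where wl_expiry is a purely hypothetical column name:

ALTER TABLE watchlist
  ADD COLUMN wl_expiry binary(14) DEFAULT NULL;  -- hypothetical nullable column
-- With the COMPACT, DYNAMIC or COMPRESSED row formats, rows where wl_expiry
-- stays NULL would not grow; rows with a value would grow by 14 bytes.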

But there may be other concerns besides raw storage usage; I haven't closely read the rest of the discussion.

Thanks for the feedback y'all. I like the approach that @Anomie detailed above. It makes sense and seems much safer.

As for the discussion of new table vs. new field, we've covered that previously and we are moving forward with a new table. I don't think we want to reopen that part of the discussion.

Is there anything else we need to discuss before getting this change on the calendar?

Is there anything else we need to discuss before getting this change on the calendar?

Yes: is this table safe to be replicated to labs? Does it need filtering on any of the columns, or does the whole table need to be filtered?
As a reminder, if there is anything that can be replicated to labs, you'd need to file a task for cloud-services-team in order to get a view created for it.

@Marostegui Yes, it can be replicated as long as the watchlist table is replicated. This table only has value with the watchlist table data.

I'll make a note to create that task. Thanks for the reminder.

Watchlist isn't replicated/exposed in a view

So I guess replicating this table has little to no value without the associated watchlist table

The watchlist table is not replicated, so indeed we don't need to replicate watchlist_expiry.

Thanks for clarifying that, y'all.

From what I can see, watchlist is indeed private, so I am going to create the patch for the watchlist_expiry table to be fully filtered as well.
This requires restarting MySQL on the sanitarium hosts before the table is created.

Change 570792 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] realm.pp: Add watchlist_expiry to private tables

https://gerrit.wikimedia.org/r/570792

Can I get a quick check on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570792/ ? - Once this is merged I will take care of restarting MySQL on sanitarium hosts and once done the table can be scheduled for creation.

Change 570792 merged by Marostegui:
[operations/puppet@production] realm.pp: Add watchlist_expiry to private tables

https://gerrit.wikimedia.org/r/570792

Mentioned in SAL (#wikimedia-operations) [2020-02-07T15:11:54Z] <marostegui> Restart all instances on db2094 and db2095 to pick up a new replication filter - T240094

Mentioned in SAL (#wikimedia-operations) [2020-02-07T15:18:42Z] <marostegui> Restart all instances on db1124 and db1125 to pick up a new replication filter - T240094

Replication filter in place:

[15:16:49] marostegui@cumin1001:~$ sudo cumin 'db209[4-5].codfw.wmnet,db112[4-5].eqiad.wmnet' 'cat /etc/my.cnf | grep "watchlist_"'
4 hosts will be targeted:
db[2094-2095].codfw.wmnet,db[1124-1125].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(4) db[2094-2095].codfw.wmnet,db[1124-1125].eqiad.wmnet
----- OUTPUT of 'cat /etc/my.cnf | grep "watchlist_"' -----
replicate-wild-ignore-table = %.watchlist_expiry

MySQL on all the sanitarium hosts has been restarted to pick up the change.
The table can be created whenever it is needed.

Change 570943 had a related patch set uploaded (by MusikAnimal; owner: MusikAnimal):
[mediawiki/core@master] Create new table for watchlist expiry

https://gerrit.wikimedia.org/r/570943

Change 570943 merged by jenkins-bot:
[mediawiki/core@master] Create new table for watchlist expiry

https://gerrit.wikimedia.org/r/570943

Change 571334 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Remove SQLite watchlist_expiry patch

https://gerrit.wikimedia.org/r/571334

Change 571334 merged by jenkins-bot:
[mediawiki/core@master] Remove SQLite watchlist_expiry patch

https://gerrit.wikimedia.org/r/571334

Change 571352 had a related patch set uploaded (by MusikAnimal; owner: MusikAnimal):
[mediawiki/core@master] Add ALTER SEQUENCE for Postgres patch-watchlist_expiry.sql

https://gerrit.wikimedia.org/r/571352

Change 571352 merged by jenkins-bot:
[mediawiki/core@master] Add ALTER SEQUENCE for Postgres patch-watchlist_expiry.sql

https://gerrit.wikimedia.org/r/571352

Note that MediaWiki's database abstraction layer cannot generate a multi-table delete like this, and I'd probably oppose adding that functionality since multi-table delete seems MySQL-specific and is likely to go against patterns we encourage. Also, the use of NOW() in there might not be replication-safe (or does replication account for that?).

Can IDatabase::deleteJoin() be used for the multi-table deletion?

Note that MediaWiki's database abstraction layer cannot generate a multi-table delete like this, and I'd probably oppose adding that functionality since multi-table delete seems MySQL-specific and is likely to go against patterns we encourage. Also, the use of NOW() in there might not be replication-safe (or does replication account for that?).

Can IDatabase::deleteJoin() be used for the multi-table deletion?

No. IDatabase::deleteJoin() is for deleting from just one table based on a join with a second. Your multi-table deletion is deleting from both tables.
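
To spell the difference out in plain SQL (sketches only; the exact SQL the abstraction layer generates may differ, and wl_user = 3 is just an example value):

-- Single-table delete driven by a join with a second table
-- (roughly the territory IDatabase::deleteJoin() covers):
DELETE FROM watchlist_expiry
  WHERE we_item IN (SELECT wl_id FROM watchlist WHERE wl_user = 3);

-- MySQL-specific multi-table delete from the task description,
-- removing matching rows from both tables in one statement:
DELETE watchlist, watchlist_expiry
  FROM watchlist_expiry JOIN watchlist ON (we_item = wl_id)
  WHERE we_expiry < NOW();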

The table has been created. Thank you for all the advice and guidance and help @Marostegui! And thanks, @Anomie, for the input and guidance on the best ways to approach the purging.

Follow-up work can be tracked in the Expiring-Watchlist-Items project and on our board.