Page MenuHomePhabricator

Create a purge routine for `watchlist_expiry` [medium]
Closed, ResolvedPublic

Description

Followup to creating the watchlist_expiry table in T240094: Create required table for new Watchlist Expiry feature, the purging mechanism should be added.

Considerations for this are following the comment https://phabricator.wikimedia.org/T240094#5852474 by @Anomie

Acceptance criteria

  • Create a purging operation that starts a job with batches of deletes (from the comment above, the process includes:)
    • Begin transaction
    • Query a batch of wl_ids to delete (with a set limit)
    • Delete the batch in two queries, DELETE FROM watchlist WHERE wl_id IN (...) and DELETE FROM watchlist_expiry WHERE we_item IN (...).
    • Commit transaction and wait for replication
    • Repeat if necessary (i.e. if $N rows were actually fetched), probably by rescheduling the job.
  • The purging operation should be done every X edits, where X is a configuration variable that has a default but can be set separately per wiki.
  • The operation must be contained behind a feature flag -- the same feature flag will be used for the entire product going forward.

Event Timeline

Restricted Application added a project: TCB-Team. · View Herald TranscriptFeb 10 2020, 10:58 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Reedy added a subscriber: Reedy.Feb 10 2020, 11:11 PM

The purging operation should be done every X edits, where X is a configuration variable that has a default but can be set separately per wiki.

I would suggest adding a maintenance script (as a thin wrapper around whatever code you write) that can be cronjob'd/similar too like we do in other situations

https://github.com/wikimedia/mediawiki/blob/master/maintenance/purgeExpiredUserrights.php and https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/mediawiki/manifests/maintenance/purge_expired_userrights.pp

Note: This process is already implemented for the RecentChanges table, which is being purged every few edits. see: https://gerrit.wikimedia.org/g/mediawiki/core/+/799eeb583a6747ad96888ec0fef8f76e45079129/includes/Storage/DerivedPageDataUpdater.php#1493

It looks like the method is creating a purge job through a random 1:10-chance routine to simulate a more-or-less every-10-edits run, to avoid having to store a count somewhere. We should note that while this is technically working, it isn't quite the same as running the purge on actually every 10 edits, but it looks like this is the strategy used.

We can create the configuration variable to control the pace of this purging operation by using the upper bound of the random numbers, like mt_rand( 0, $wgWatchlistExpiryPurgeRate ) or something like that.

The purging operation should be done every X edits, where X is a configuration variable that has a default but can be set separately per wiki.

I would suggest adding a maintenance script (as a thin wrapper around whatever code you write) that can be cronjob'd/similar too like we do in other situations

https://github.com/wikimedia/mediawiki/blob/master/maintenance/purgeExpiredUserrights.php and https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/mediawiki/manifests/maintenance/purge_expired_userrights.pp

That's a good idea as a backup, especially perhaps to wikis that have a lot less traffic.

Mooeypoo updated the task description. (Show Details)Feb 11 2020, 10:46 PM
ifried renamed this task from Create a purge routine for `watchlist_expiry` to Create a purge routine for `watchlist_expiry` [medium].Feb 12 2020, 12:28 AM
ifried moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.
ifried moved this task from Estimated to Kanban-Q3-2019-20 on the Community-Tech board.
Restricted Application edited projects, added Community-Tech; removed Community-Tech (Kanban-Q3-2019-20). · View Herald TranscriptFeb 12 2020, 12:29 AM

I would suggest adding a maintenance script (as a thin wrapper around whatever code you write) that can be cronjob'd/similar too like we do in other situations

There's an existing cleanupWatchlist.php maintenance script that "removes broken, unparseable titles in the watchlist table". Would it make sense to use this to also remove expired items? Either with a new parameter (e.g. --delete-expired) or just as part of its standard operation? To me, it sounds like a reasonable fit, and not adding yet another maintenance script would be nice.

aezell added a subscriber: aezell.Feb 13 2020, 4:17 PM

I think I like that idea. Any downsides we can foresee?

I think it's slight more work than adding a new maintenance script, because the existing one is a subclass of TableCleanup. I'll experiment and see how it goes.

Change 572147 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/core@master] Purge expired watchlist items

https://gerrit.wikimedia.org/r/572147

@ifried is the $wgWatchlistExpiry feature flag going to be temporary, or a permanent part of MediaWiki core?

There's an existing cleanupWatchlist.php maintenance script

How often does it run? I suppose it doesn't need to be often if we're also purging the watchlists after every N edits.

cleanupWatchlist.php doesn't look like it's currently run as a cronjob for Wikimedia wikis. And I've looked into it more and think the existing script is too tied into being a full table cleanup script: it goes through every row in the watchlist table. We could add a new flag to it --no-invalid, to not check for invalid entries, but that'll mean reimplementing lots what's already done by the TableCleanup class.

I'll stick with the separate script approach (purgeExpiredWatchlistItems.php), although cleanupWatchlist will need a small update to make sure it's also removing the matching records from the expiry table.

When we purge we also want to get rid of orphaned watchlist_expiry rows. This will be done in the maintenance script and also in the job, but it should only be done in the job if the maintenance script isn't used. The problem is that we don't have any way to know if the maintenance script is being run. One way would be to add a new config var e.g. $wgWatchlistExpiryCheckOrphans but that feels like overkill... I wonder if it's okay to accept that there will sometimes be orphaned expiry rows?

This is somewhat similar to the fact that the cleanupWatchlist.php maintenance script can be used to get rid of invalid titles from the watchlist table, and that if you don't run that script you might be left with those titles (but they wouldn't have any impact on anything).

Change 572147 merged by jenkins-bot:
[mediawiki/core@master] Purge expired watchlist items

https://gerrit.wikimedia.org/r/572147

And when do we want to update the release notes with all this? Maybe we should start a patch for that?

Change 583513 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/core@master] Add release notes for recently-merged watchlist expiry work

https://gerrit.wikimedia.org/r/583513

And when do we want to update the release notes with all this? Maybe we should start a patch for that?

Good thinking.

Change 583513 merged by jenkins-bot:
[mediawiki/core@master] Add release notes for recently-merged watchlist expiry work

https://gerrit.wikimedia.org/r/583513

dom_walden added a subscriber: dom_walden.
  • Create a purging operation that starts a job with batches of deletes (from the comment above, the process includes:)

We have three different places watchlist expiries can be removed:

  1. Automatically every N edits
  2. By running maintenance/cleanupWatchlist.php (existing script which we have modified in this task)
  3. By running maintenance/purgeExpiredWatchlistItems.php (new script added in this task)

The latter two require access to the server the MW instance is running on.

I have tested 1. on betawiki and 2. and 3. on my local wiki, and see they all remove expired items from the watchlist table and their respective expiries from the watchlist_expiry table.

You can test 1. on betawiki if you have access to the database (because expired items do not appear in places like Special:EditWatchlist):

  1. Watch article "Foobar" for one minute with the API (e.g. https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=watch&format=json&expiry=1%20minute&titles=foobar)
  2. Wait 1 minute
  3. Edit a random page over and over again until the "Foobar" item is removed from the database

...

  • Repeat if necessary (i.e. if $N rows were actually fetched), probably by rescheduling the job.

The first 100 watchlist items (plus expiries) are removed. Any items over 100 are added to the job queue and deleted in due course. I don't know much about the mechanism of this, but I have seen it happen on betawiki for 300 items.

  • The purging operation should be done every X edits, where X is a configuration variable that has a default but can be set separately per wiki.

$wgWatchlistPurgeRate which by default is 0.1, so happens on average every 10th edit.

  • The operation must be contained behind a feature flag -- the same feature flag will be used for the entire product going forward.

$wgWatchlistExpiry I believe is the feature flag which controls this.

ifried closed this task as Resolved.Thu, May 28, 3:46 PM
ifried moved this task from Product sign-off to Done on the Community-Tech (Kanban-2019-20-Q4) board.

The work has been merged. I'm marking it as Done.