Page MenuHomePhabricator

Purge unused watchlist rows
Open, MediumPublic

Description

See T252812: Investigate watchlist sizes (limiting or handling large ones properly) for context. This task documents ideas for ways to purge the watchlist. The end result may for instance be a maintenance script that DBAs can run occasionally.

Problem

According to the DBAs, the watchlist table has grown to be one of the most problematic tables. The extreme size can cause the query optimizer to malfunction.

Unlike tables like revision and logging, the watchlist table doesn't need to be static. It serves only to convenience the end user in monitoring pages. If the end user isn't using their watchlist, the corresponding rows in the table are needlessly occupying space and in the long run affect the health and performance of the table for all other users.

In addition, we have for years provided ways to automatically watch pages via preferences -- which are even on by default -- inviting scalability issues. Expiring-Watchlist-Items for the first time provides means to automatically unwatch, which is likely to help, but not eliminate the long-term issue.

Idea

(not a concrete proposal)

Create a maintenance script to purge the many millions of unused rows in the watchlist.

Here are some ideas for the criteria; some or all may not be safe assumptions and will surely need community input:

  • Bots: On WMF wikis, the "watch pages and files I create" preference is turned on by default. Consequently, bots have created millions of rows in the watchlist table when seemingly they are not actually be making use of the watchlist. Some examples:
    • commonswiki: 14% (~22 million rows) of the watchlist table are owed to bots, which appear to be bots that upload files and simply have the default preference set to watch them
    • wikidatawiki: 1.1% (~91 million rows) -- for instance bots that automatically create items after articles on Wikipedia are created
    • enwiki: 4.7% (~10.3 million rows) -- take counter-vandalism bots for instance; they create User talk pages when issuing warnings
    • mgwiktionary: 99.8% (~13.2 million rows) -- a single bot mass-created nearly every entry on the wiki
  • Foundation-banned users: They are explicitly not welcomed to return to our projects, so presumably their watchlist could safely be cleared
  • Community-banned/blocked users: After some grace period (say, 5 years) it may be acceptable to clear their watchlists
  • Retired users: Some accounts have left the project ages ago with no intention to return, but their rows in the watchlist table remain. This (sadly) may include deceased users, many of which were quite prolific.
  • Deleted pages: Users typically don't think to unwatch pages after they've been deleted. There are legitimate reasons to continue watching these titles (say to be notified when they're recreated), but the ratio of when they are recreated versus never return might be worth investigating. Perhaps some sort of grace period is reasonable -- say if the page was deleted 5 years ago, the purging routine could automatically unwatch them.
  • Humans using (semi-)automation: Similar issue to with bots. Some non-bot accounts use automation to mass-create pages, unknowingly watching each and every one of them. It will be difficult to identify such accounts, but for instance if a human is watching millions of pages, it's probably safe to assume they aren't getting much out of Special:Watchlist due to how slow it is, if it even loads at all.

Event Timeline

Other ideas... make use of Expiring-Watchlist-Items automatically. Say, when a page is deleted, MediaWiki can set the watchlist expiry to 5 years. We could send a push notification to the user, if we wanted, but regardless if they happen to check Special:EditWatchlist they would see the page listed as having an expiry and could change it if so desired. Same for indefinitely blocked users; MediaWiki could set a 5-year expiry on all of their watched items. This system, and the expiry duration, will of course need broad input from the community.

Also, I think the "add pages and files I created to my watchlist" preference should be turned OFF for bots, forcing the operator to re-enable it if desired. This seems to be among the easier and least controversial safeguards we could explore. I have created a task for that at T258108: Ignore auto-watchlist preferences for bots

  • Deleted pages: Users typically don't think to unwatch pages after they've been deleted. There are legitimate reasons to continue watching these titles (say to be notified when they're recreated), but the ratio of when they are recreated versus never return might be worth investigating. Perhaps some sort of grace period is reasonable -- say if the page was deleted 5 years ago, the purging routine could automatically unwatch them.

I'm not a fan of auto removing watched deleted pages for active (non-bot) users. If I want something removed from my watchlist, deleted or otherwise, I will remove it or set an expiry myself.

Marostegui triaged this task as Medium priority.Jul 16 2020, 4:58 AM
Marostegui edited projects, added User-Marostegui; removed DBA.

Most of these seem reasonable. The last 3 are questionable though...

  • I would change Retired users to Deceased users. Retired users quite often return, but deceased users don't.
  • Watching deleted pages is a fairly common use-case for admins (especially on Commons) so I would skip this one.
  • I would only remove items for humans using (semi-)automation if they explicitly agreed to this. In many of these cases, the creator is the only person watching the page and thus the person most likely to detect vandalism.

I watch a lot of deleted pages that are recreated by LTAs - please do not auto remove deleted pages

Regarding wikidata, 1.1% seems too small to me. We should double check. Also, regarding size of watchlist and improvements to it. I highly recommend normalizing the title part with the title table that's going to be introduced with normalizing links table (T222224: RFC: Normalize MediaWiki link tables)

Wikidata is definitely more than 1%. Just the top user in number of watchlist rows (a bot) is responsible for 30% of all rows of that table. Looking at numbers, my rough estimate is around >95%. A bot doesn't need a watchlist, humans need a watchlist.

Another thing I want to point out (sorry for posting comments back to back) is that watchlist itself is not big and not a scalability concern. For example in enwiki, it's the eighth biggest table with taking basically 3% of total space of the database. For wikidata it's the 18th biggest table. While I agree we should delete bots (and I already cleaned mine), I don't see the benefits of very small space freed in the database (maybe 0.01% at most) outweighing the cost of deleting, and maintaining list of wmf-banned users or deceased users

daniel subscribed.

Moving this to "feature request review". The question is mainly under what conditions we want to remove watchlist entries. While the motivation is technical (free up capacity), the question of criteria is not. We can't engineer a solution if we don't know the requirements.

I would suggest to set up automatically the expiring-watchlist not only for deleted pages, but also for pages which are moved (the watchlist adds the new title, but retains the old one): there can be a preference, enabled by default, that says that a redirect which is a result of a page move will automatically be un-watched after (e.g.) 1 year.

One of ideas is to introduce a "gravestone" table and moving any watchlists of inactive accounts there (and it is reversible). Banned accounts may still have their watchlist cleared.

I never use watchlist, so feel free to purge watchlists from any of my accounts.

Deleted pages: Users typically don't think to unwatch pages after they've been deleted.

Who says so? It would be very user-hostile to change this behaviour without telling users. Deletion is often an important moment in the lifecycle of a page, and in many wikis the surrounding events are very time-sensitive: for instance you might have only a short window of time to act when a page is recreated.

enwiki is now fully cleaned. I'll fix other wikis once I get some numbers on biggest watchlist tables across the fleet.

Test on wikidata:

-rw-rw---- 1 mysql mysql 11G Apr 28 12:22 watchlist.ibd
optimize
-rw-rw---- 1 mysql mysql 2.8G Apr 28 14:11 watchlist.ibd