Page MenuHomePhabricator

Add field watchlist.wl_page_id for watchlist join optimisation
Open, Needs TriagePublic

Description

At T406843#11285677, @Ladsgroup noted that 32% of the watchlist table on enwiki refers to non-existent pages. He proposes to add a wl_page_id field, which is zero for non-existent pages, and to join on it for Special:Watchlist queries. You would expect this to reduce the number of scanned rows by 32%.

This turns out to be more complicated than expected, due to the way rc_cur_id is used on categorization changes.

Current situation

Page deletion removes related rows from the recentchanges table, in LinksDeletionUpdate. However page moves do not update rc_namespace/rc_title. In enwiki there's about 26k rows with rc_namespace/rc_title not linking to anything in the page table, apparently due to the page moves with $createRedirect=false. A move followed by a deletion of the redirect would also not delete these rows since the deletion is done by rc_cur_id.

Watchlist rows are duplicated on page move, so any page move with redirects disabled creates a watchlist entry for a non-existent page.

Categorization changes put the category title in rc_namespace/rc_title and the categorized page in rc_cur_id. This is done so that those recentchanges rows will be deleted when the categorized page is deleted. So that's a spanner in the works. The field would have to be split.

From the user's perspective, watchlist rows for non-existent pages are mainly created by two features:

  • The "watch this page" checkbox on page deletion, which defaults to the watch status of the target.
  • The "Watch source page and target page" checkbox on page move, which defaults to the watch status of the source.

Both have an expiry selector which defaults to permanent. A simple way to reduce the growth of watchlist items for non-existent pages would be to have these expiry selectors default to some finite period.

New design

We could take the opportunity to clean up the archaic name of rc_cur_id. Add rc_page_id and rc_source_page_id, which will be the same except on categorization changes. rc_cur_id would be deprecated and removed. Deletions would be done by rc_source_page_id and watchlist queries would be done by rc_page_id.

I think _id suffixes on foreign key fields are generally redundant. We could have rc_page and let it be implied that it's a link to the primary key. But rc_page_id is more consistent.

I considered updating wl_namespace/wl_title on page move, and creating new watchlist rows for the redirect, instead of creating a new row for the move target. Retaining the wl_id <=> page_id association would allow watchlist expiry and the proposed watchlist labels to automatically move to the target page. However, this would make it harder to deal with existing watchlist rows for the target title. Currently WatchedItemStore::duplicateEntry is a simple multi-row REPLACE. Instead there would need to be a loop over all watching users. This is done in the main transaction. Avoiding deadlocks would be challenging. It's probably simpler to set wl_page_id=0 on the old rows and to insert new rows with REPLACE. There is already updateExpiriesAfterMove() to duplicate the watchlist expiries.

REPLACE conflict detection is currently done by the unique index on (wl_user, wl_namespace, wl_title) which is also needed for Special:EditWatchlist paging. I think this index needs to be retained.

There would need to be a new index on (wl_user, wl_page_id), for watched page highlighting on Special:RecentChanges, and for Special:Watchlist on large watchlists.

There should be an index on (rc_page_id, rc_timestamp), not just the ID alone as at present. This would avoid poor performance when there are many changes for a given watched page, and the query has an rc_timestamp range condition.

There will still be a few things that use the (rc_namespace, rc_title, rc_timestamp) index, like the "links from" option in RecentChangesLinked, which uses it to join on linktarget without a page join.

Plan

  • Do the schema changes.
  • Update WatchedItemStore::duplicateAllAssociatedEntries() to update and insert the new fields.
  • Update addWatchBatchForUser(). Despite the comment on addWatch(), it's typically called with a TitleValue, so the page_id is not supplied by the caller.
  • Run a migration script to populate the new fields.
  • Update WatchlistJoin. After T407087 is complete, this will be the only thing in core joining watchlist with recentchanges. There appear to be no extensions doing this join manually.
  • Some minor queries would potentially benefit from using the new field, like SpecialUnwatchedPages.