Page MenuHomePhabricator

Create watchlist labels database tables
Closed, ResolvedPublic

Description

Cross-join option

CREATE TABLE watchlist_label (
    -- unique id for row
    wll_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    -- user_id for label creator
    wll_user INT UNSIGNED NOT NULL,
    -- label text
    wll_name VARBINARY(255) NOT NULL,

    PRIMARY KEY(wll_id),
    INDEX wll_user_name (wll_user, wll_name)
);

CREATE TABLE watchlist_label_member (
    -- Link to wll_id
    wlm_label INT UNSIGNED NOT NULL,
    -- Link to wl_id
    wlm_item INT UNSIGNED NOT NULL,
    PRIMARY KEY(wlm_label, wlm_item),
    INDEX (wlm_item)
);

Sample queries

Display subquery

SELECT GROUP_CONCAT(wll_name SEPARATOR '|')
  FROM watchlist
  JOIN watchlist_label_member ON wlm_item=wl_id
  JOIN watchlist label ON wll_id=wlm_label
  WHERE wl_user=@user AND wl_namespace=rc_namespace AND wl_title=rc_title

Filter by several watchlist labels

SELECT DISTINCT ...
  FROM recentchanges
  JOIN watchlist ON wl_namespace=rc_namespace AND wl_title=rc_title AND wl_user=@user
  JOIN watchlist_label_member ON wlm_item=wl_id AND wlm_label IN (...)
  GROUP BY rc_timestamp, rc_id;

With denormalized namespace/title

CREATE TABLE `watchlist_def` (
    wld_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    -- A static enum, with 0 for private watchlist, 1 for shared list
    wld_type TINYINT UNSIGNED NOT NULL,
    -- A link to user_id, or zero for a public/shared list
    wld_owner INT UNSIGNED NOT NULL,
    -- The list/label name, or an empty string for the main watchlist union
    wld_name VARBINARY(255) NOT NULL,
    -- The number of members
    wld_count INT NOT NULL,

    PRIMARY KEY (wld_id),
    UNIQUE KEY (wld_type, wld_owner, wld_name)
);

CREATE TABLE `watchlist_member` (
    wlm_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    -- Link to watchlist_def.wld_id
    wlm_list INT UNSIGNED NOT NULL,
    -- Denormalized page_namespace
    wlm_namespace INT NOT NULL,
    -- Denormalized page_title
    wlm_title VARBINARY(255) NOT NULL,
    -- Earliest unseen revision timestamp
    wlm_notification_timestamp BINARY(14) DEFAULT NULL,

    -- For emulated joins, e.g. ActivityUpdateJob
    PRIMARY KEY (wlm_id),
    -- For WatchedItemStore::duplicateEntry(), CLSP filtering
    UNIQUE INDEX (wlm_namespace, wlm_title, wlm_list),
    -- For Special:EditWatchlist sorting, could also be used for CLSP filtering
    INDEX (wlm_list, wlm_namespace, wlm_title),
    -- For Special:EditWatchlist sorting by age
    INDEX (wlm_list, wlm_id),
    -- ApiQueryWatchlistRaw changed filter
    INDEX (wlm_list, wlm_notification_timestamp)
);

Sample queries

Watchlist default filter

SET @wld_id = (SELECT wld_id FROM watchlist_def WHERE wld_type=0 AND wld_owner=@user AND wld_name='');
SELECT ... 
  FROM recentchanges 
  JOIN watchlist_label_member ON rc_namespace=pt_namespace AND rc_title=pt_title 
  WHERE wlm_list=@wld_id;

Display subquery:

SELECT GROUP_CONCAT(wld_name SEPARATOR '|') 
  FROM watchlist_member
  JOIN watchlist_def ON wld_id=wlm_list AND wld_type=0 AND wld_owner=@user 
  WHERE wlm_namespace=rc_namespace AND wlm_title=rc_title

Filter by several watchlist labels:

SELECT DISTINCT ...
  FROM recentchanges
  JOIN watchlist_member ON wlm_namespace=rc_namespace AND wlm_title=rc_title AND wlm_list IN (...)
  GROUP BY rc_timestamp, rc_id;

Event Timeline

Consider these implementation options:

  • A cross-join table as proposed by TheresNoTime, N:N linking wl_id to label_id.
  • A new table like the watchlist table, but with a label ID instead of wl_user identifying the list. When labelling, remove the item from the default watchlist and add it to the labelled watchlist. Adding a second label to an item is done by copying the row.
  • As above, but when labelling, copy the row into the new table instead of moving it.
Cross-joinMoveCopy
enwiki DB size impact assuming 5% of rows are labeled0.1 GB0.05 GB1.8 GB
Special:Watchlist default filter, small watchlist.FastSlower, temporary table needed to union the lists.Fast
Special:Watchlist default filter, large watchlist.No changeMinor degradationNo change
ChangesListSpecialPage highlight/displayGroup concat like change_tags. Minor degradation.Degradation can be mitigated by migrating default watchlist to the new tableAs for move
ChangesListSpecialPage, filter by label, small label, small watchlistMinor degradationFastFast
ChangesListSpecialPage, filter by label, small label, large watchlistPotentially very slow depending on optimizer join orderFastFast
ChangesListSpecialPage, filter by label, large label, large watchlistConstant factor degradation due to extra joinFastFast
Special:EditWatchlist displaySimple, can use group concat like ts_tagsUnion, more complex, slowerSimple group concat
Adding a label to an itemSimple INSERTMore complex DELETE/INSERT, potential lock contentionSimple INSERT
Removing a label from an itemSimple DELETEMore complex DELETE/INSERTSimple DELETE
Public watchlists T9467Precluded, would be a separate system and yet another join on ChangesListSpecialPageDoableDoable

Thanks for tagging me. I will review the schema and queries ASAP. Note that I'm on clinic duty this week and might be somewhat slow

I'm leaning towards cross-join option granted the performance optimizations get implemented. I still hope we had a way to limit the number of pages a user can watch (we could build a probabilistic counter with hyperloglog or count min sketch)

I'm leaning towards cross-join option granted the performance optimizations get implemented. I still hope we had a way to limit the number of pages a user can watch (we could build a probabilistic counter with hyperloglog or count min sketch)

I like the idea of a generic page_tag table, to keep the join count down, and to solve the "filter small label from a large watchlist" problem, and to provide a path towards implementing shared watchlists.

If the space is your main concern, we could take the namespace/title fields out of it. My main problem with the cross-join table is that it links to wl_id rather than page_id. That binds it to the current conception of a watchlist and requires an extra join during filtering.

I have a couple of issues that are not too big but we need to think about them:

  • page_tag is generic, my problem is that it's too generic. For example, what would stop ML products to start tagging articles with topics (Article "Moon" has tag of "Astronomy") and then it turn into a kitchen sink that user_properties currently is.
  • Conceptually, it's too similar to page_properties table (I know page_properties requires key and value but many properties actually basically don't rely on the value. e.g. Disambiguator tags).
  • Last part that you touched on too is determining a page based on ns + title. I don't have concerns on the space, it's small enough. But my problem is data modelling. A page must be determined using its page_id. If you want to add support for deleted pages or not-yet created pages, that means you're talking about linktargets. The problem with joining with linktarget is that we will be moving that table to a dedicated cluster in commons (alongside the rest of links tables) so you wouldn't be able to join in commons.
    • It always bothered me we do this with watchlist table. I know we want to keep the ability that users could watch a deleted page (to make sure it's not recreated) but that can be resolved with two tables: watchlist + watchlist_links (or whatever you name it). The number of non-existent pages in watchlist table is quite small and we can get away with not normalizing them. It would also mean joins in Special:Watchlist would be faster (plus giving the ability to the users to flush watchlist of non-existent page). I know it's a bit unrelated here. Just wanted to mention it.

That being said, I like that it enables shared watchlists. So we can think about it more and see what we can do. We probably can keep the owner (or allow write owners vs read owners).

Actually this is interesting:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from watchlist left join page on wl_namespace = page_namespace and wl_title = page_title where page_id is null limit 5;
+----------+
| count(*) |
+----------+
| 98586435 |
+----------+
1 row in set (9 min 15.046 sec)

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> select count(*) from watchlist;
+-----------+
| count(*)  |
+-----------+
| 307374061 |
+-----------+
1 row in set (3 min 35.321 sec)

32% of watchlist table are for non-existent pages. I wonder if we do a less radical approach and add wl_page_id column being 0 for non-existent ones and in join with rc table, we use the page id instead which would automatically reduce the size of the watchlist of the users. I don't have any benchmarking to back it up though.

32% of watchlist table are for non-existent pages. I wonder if we do a less radical approach and add wl_page_id column being 0 for non-existent ones and in join with rc table, we use the page id instead which would automatically reduce the size of the watchlist of the users. I don't have any benchmarking to back it up though.

I created T407719 for that.

How about the copy option but with more specific names? I'll edit the task description along those lines.

After researching rc_cur_id for T407719, I don't know if I want to use it for this anymore. There's an annoying detail. We could simply duplicate the watchlist table, except that instead of wl_user there would be a link to a new table.

Sounds good to me for now. I have had this radical idea of watchlist_changes table in which every rc_id gets multipled by each watcher of that page (and gets purged alongside the related rc row) but that requires bringing some order into the mess of how many people are watching pages and rc table

Mentioned in SAL (#wikimedia-operations) [2025-10-21T23:45:02Z] <TimStarling> on db2202 creating table watchlist_member_T406843 for T406843 performance investigation

I added wld_count for limiting the size of a watchlist. The page move REPLACE query is a potentially tricky detail.

For parity between watchlist and watchlist_member:

  • Added wlm_notification_timestamp. I think I remember someone complaining that it doesn't belong with watchlist, but I can't find that comment right now. It could be split out to a separate table, but I know Ladsgroup is keen to minimise the number of tables.
  • Added wlm_id by analogy with wl_id.

So watchlist_member has 5 fields now, same as watchlist. The idea would be to migrate the existing watchlist data to the new table, to reduce the number of joins in ChangesListSpecialPage queries, and to allow wld_count to limit the size of the main watchlist.

Summary of differences between watchlist and proposed watchlist_member

  • Made the namespace/title/list index be unique instead of list/namespace/title. This is so that both the SELECT FOR UPDATE and the REPLACE in WatchedItemStore::duplicateEntry() use the same index. Probably faster and less likely to deadlock, although I don't see any deadlocks in the logs for the current schema.
  • Necessarily per the above, I added wlm_list to the end of the namespace/title index. Filtering queries which match all three fields will have the option of using this index instead of the list/namespace/title index. Unclear whether this will have any impact on performance. If the list/namespace/title index is relegated to Special:EditWatchlist only, then it may go cold and save some buffer pool space. But if select queries randomly use both indexes, then that could put more pressure on buffer pool space.
  • Added wlm_list/wlm_id index for Special:EditWatchlist sorting like T209773.

I created the proposed table on db2202 as a copy of the enwiki watchlist table. I was surprised to find that the file on disk was almost twice the size of the original. So then I created it again but this time with ROW_FORMAT=COMPRESSED like on the existing watchlist table. Then it was only 0.7% bigger.

For a Special:Watchlist style query, with watchlist first, the old schema uses the wl_user/wl_notificationtimestamp index whereas the new schema uses the wlm_list/wlm_id index, which was supposed to be for Special:EditWatchlist age sorting. Evidently it only needs the user/list ID and either chooses arbitrarily between the options, or chooses the index with the shortest entries.

For a Special:RecentChanges style query with a left join, no label feature, the old schema uses wl_user/wl_namespace/wl_title whereas the new schema uses the wlm_namespace/wlm_title/wlm_list index.

Daniel points out that watchlist expiry is awkward if you want an expiry time to apply to all labels on a page but we_item links to wlm_id.

Just one other thing to consider here - we have a wish to allow people to add a note when they're adding something to a watchlist. No need for it to be searchable or anything, just a note that gets displayed as part of the row in Special:Watchlist

Multiple watchlists alone might meet the need behind the wish, but in case it doesn't - I'd hope we can do the table design in such a way as to not make this impossible

@tstarling just want to make sure I understand - the main argument for "copy" instead of "cross-join" is to build the label into the indices to make filtering faster, is that correct?

Daniel points out that watchlist expiry is awkward if you want an expiry time to apply to all labels on a page but we_item links to wlm_id.

I don't really understand this. Are you envisioning a user labelling a bunch of watched pages and saying "expiry all pages from my watchlist with label X at time Y"? I don't see why we'd need this, but maybe I have the wrong end of the stick ... but either way, would migrating the expiry data into the new table solve this?

@tstarling just want to make sure I understand - the main argument for "copy" instead of "cross-join" is to build the label into the indices to make filtering faster, is that correct?

Yes, the idea of putting namespace/title in the label table is to make filtering faster. Although I really should verify that.

Are you envisioning a user labelling a bunch of watched pages and saying "expiry all pages from my watchlist with label X at time Y"?

The question is whether expiry is associated with the main watchlist entry or with a label. So does a watchlist entry expire or does a label expire? In terms of user experience we probably want the main watchlist entry to expire, and all associated labels should be deleted when that happens. But treating labels as essentially separate watchlists naturally leads to expiries being associated with labels. Then it's extra work to preserve the desired user experience.

Has there been any discussion of being able to expire labels? That seems like a new feature that no one's asking for. Similarly with shared watchlists. I just wonder if we'd be preparing for future work that might never come.

Multiple watchlists alone might meet the need behind the wish, but in case it doesn't - I'd hope we can do the table design in such a way as to not make this impossible

That'd be adding an extra column to the watchlist_label_member table (or watchlist_member in the denormalized schema), you mean? There's also a wish about assigning a colour to a label, which would be a similar column.

The list/label name, or an empty string for the main watchlist union

SET @wld_id = (SELECT wld_id FROM watchlist_def WHERE wld_type=0 AND wld_owner=@user AND wld_name='');
SELECT ...
FROM recentchanges
JOIN watchlist_label_member ON rc_namespace=pt_namespace AND rc_title=pt_title
WHERE wlm_list=@wld_id;

Regarding this filtering query: would we need to be joining on an empty string to find non-labelled items? I'd been thinking that for that case we'd do a left join and where wdl_id is null. That way, the members table only gets data as items are labelled and doesn't need to be filled for entire watchlists for people who never label anything.

Also, and probably it doesn't make much performance difference, but we could have the filter value provided by ID already, e.g. Special:Watchlist?labels=3,78,213 and so the initial query the ID wouldn't be needed. The labels themselves aren't necessarily friendly for including in URLs and we aren't going to have a 'slug' form for people to set.

Has there been any discussion of being able to expire labels? That seems like a new feature that no one's asking for.

No, nobody wants that, that's my point. Ideally the data model would follow from the UX.

Similarly with shared watchlists. I just wonder if we'd be preparing for future work that might never come.

Well, that was requested in T9467 and T23223.

Regarding this filtering query: would we need to be joining on an empty string to find non-labelled items? I'd been thinking that for that case we'd do a left join and where wdl_id is null. That way, the members table only gets data as items are labelled and doesn't need to be filled for entire watchlists for people who never label anything.

My idea is to drop the watchlist table. Both labelled and unlabelled items would be in watchlist_member with wld_name=''. Labelled items would additionally have a row with non-empty wld_name. It's really multiple watchlists rather than labelling, and the legacy watchlist sits alongside the new watchlists.

But like I say, I'll check my assumptions and see if this is really worth doing.

Also, and probably it doesn't make much performance difference, but we could have the filter value provided by ID already, e.g. Special:Watchlist?labels=3,78,213 and so the initial query the ID wouldn't be needed. The labels themselves aren't necessarily friendly for including in URLs and we aren't going to have a 'slug' form for people to set.

I don't think that would make a difference.

My idea is to drop the watchlist table. Both labelled and unlabelled items would be in watchlist_member with wld_name=''. Labelled items would additionally have a row with non-empty wld_name. It's really multiple watchlists rather than labelling, and the legacy watchlist sits alongside the new watchlists.

Oh, that makes sense. So basically it's adding a wlm_list column to watchlist?

Oh, that makes sense. So basically it's adding a wlm_list column to watchlist?

I would say I'm replacing wl_user with wlm_list, adding an extra layer of indirection between lists and users. So each user can have many lists.

tstarling updated the task description. (Show Details)

On db2202 I created the cross-join table watchlist_label_member_T406843 using the definition from the task description. Then I populated it with some rows as follows:

MariaDB [enwiki]> select count(*) from watchlist where wl_user = 19410507;
+----------+
| count(*) |
+----------+
|   612314 |
+----------+
1 row in set (0.295 sec)

MariaDB [enwiki]> insert into watchlist_label_member_T406843 (wlm_label,wlm_item) select 1, wl_id from watchlist where wl_user=19410507 and rand()<0.01;
Query OK, 6009 rows affected (0.218 sec)
Records: 6009  Duplicates: 0  Warnings: 0

MariaDB [enwiki]> select count(*) from recentchanges join watchlist on wl_namespace=rc_namespace and wl_title=rc_title and wl_user=19410507 join watchlist_label_member_T406843 on wlm_label=1 and wlm_item=wl_id;
+----------+
| count(*) |
+----------+
|     5122 |
+----------+
1 row in set (0.075 sec)

Then I tested the filter query:

MariaDB [enwiki]> analyze select rc_id from recentchanges join watchlist on wl_namespace=rc_namespace and wl_title=rc_title and wl_user=19410507 join watchlist_label_member_T406843 on wlm_label=1 and wlm_item=wl_id order by rc_timestamp desc limit 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: watchlist_label_member_T406843
         type: ref
possible_keys: PRIMARY,wlm_item
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 3004
       r_rows: 6009.00
     filtered: 100.00
   r_filtered: 100.00
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: watchlist
         type: eq_ref
possible_keys: PRIMARY,wl_user,wl_user_notificationtimestamp,wl_namespace_title
          key: PRIMARY
      key_len: 4
          ref: enwiki.watchlist_label_member_T406843.wlm_item
         rows: 1
       r_rows: 1.00
     filtered: 0.40
   r_filtered: 100.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: ref
possible_keys: rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp,rc_name_source_patrolled_timestamp
          key: rc_namespace_title_timestamp
      key_len: 261
          ref: enwiki.watchlist.wl_namespace,enwiki.watchlist.wl_title
         rows: 3
       r_rows: 0.85
     filtered: 100.00
   r_filtered: 100.00
        Extra: Using index
3 rows in set (0.075 sec)

So I guess that's fine? Not sure how I imagined this going wrong to be honest. If the timestamp range is short enough then it will put recentchanges first, and that could be a performance problem if the statistics are wrong or if the limit is small compared to the number of labelled changes in the whole recentchanges table.

MariaDB [enwiki]> select max(rc_timestamp) from recentchanges;
+-------------------+
| max(rc_timestamp) |
+-------------------+
| 20251027004619    |
+-------------------+
1 row in set (0.001 sec)

MariaDB [enwiki]> explain select rc_id from recentchanges join watchlist on wl_namespace=rc_namespace and wl_title=rc_title and wl_user=19410507 join watchlist_label_member_T406843 on wlm_label=1 and wlm_item=wl_id where rc_timestamp >= '20251027004600' order by rc_timestamp desc\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: range
possible_keys: rc_timestamp,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp,rc_name_source_patrolled_timestamp
          key: rc_timestamp
      key_len: 14
          ref: NULL
         rows: 347
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: watchlist
         type: eq_ref
possible_keys: PRIMARY,wl_user,wl_user_notificationtimestamp,wl_namespace_title
          key: wl_user
      key_len: 265
          ref: const,enwiki.recentchanges.rc_namespace,enwiki.recentchanges.rc_title
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: watchlist_label_member_T406843
         type: eq_ref
possible_keys: PRIMARY,wlm_item
          key: PRIMARY
      key_len: 8
          ref: const,enwiki.watchlist.wl_id
         rows: 1
        Extra: Using index
3 rows in set (0.001 sec)

Let me know if you have any ideas for more realistic tests along these lines.

Expiry handling is definitely a reason to link to wl_id rather than rc_namespace/rc_title.

My only remaining concerns are join proliferation and the fact that we are not enabling shared watchlists. But shared watchlists could be done another way.

Change #1198725 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add watchlist label tables

https://gerrit.wikimedia.org/r/1198725

The suggested schema looks good to me. For now, let's not add main group as a label so the most common usecase won't end up with two more tables to join.

Yes if they're 2-byte characters. If they're 4 byte characters it'll allow 63. This is perhaps not very equitable between writing scripts, but we have to have some sort of limit.

Change #1198725 merged by jenkins-bot:

[mediawiki/core@master] Add watchlist label tables

https://gerrit.wikimedia.org/r/1198725

Please don't forget to catalog the tables.

Change #1201842 had a related patch set uploaded (by Samwilson; author: Samwilson):

[operations/puppet@production] mediawiki tables-catalog: Add watchlist labels tables

https://gerrit.wikimedia.org/r/1201842

Change #1201842 merged by Ladsgroup:

[operations/puppet@production] mediawiki tables-catalog: Add watchlist labels tables

https://gerrit.wikimedia.org/r/1201842

Restarted sanitariums hosts to make sure the tables don't get replicated to the cloud. You are good to go.

So this is done? Can we close it?

As we want to create these tables for testwiki initially, and then all wikis after, I intend to run

  • mwscript sql.php --wiki=testwiki (/path/to)/sql/mysql/patch-watchlist_label.sql

at some point today, and then

  • foreachwiki sql.php (/path/to)/sql/mysql/patch-watchlist_label.sql

at a future date

Mentioned in SAL (#wikimedia-operations) [2026-01-21T14:50:01Z] <TheresNoTime> [samtar@deploy2002 ~]$ mwscript sql.php --wiki=testwiki /srv/mediawiki/php-1.46.0-wmf.12/sql/mysql/patch-watchlist_label.sql for T406843

wikiadmin2023@10.192.11.10(testwiki)> show tables like "watchlist_label%";
+---------------------------------------+
| Tables_in_testwiki (watchlist_label%) |
+---------------------------------------+
| watchlist_label                       |
| watchlist_label_member                |
+---------------------------------------+
2 rows in set (0.001 sec)

Mentioned in SAL (#wikimedia-operations) [2026-01-21T15:33:35Z] <TheresNoTime> started [samtar@deploy2002 ~]$ foreachwiki sql.php /srv/mediawiki/php-1.46.0-wmf.12/sql/mysql/patch-watchlist_label.sql for T406843

Mentioned in SAL (#wikimedia-operations) [2026-01-21T15:38:40Z] <TheresNoTime> foreachwiki ... completed for T406843

@TheresNoTime is this done for all wikis now? Can we close the ticket?