Page MenuHomePhabricator

[Timebox - 8 hours] Investigation: Watchlist Expiry
Closed, ResolvedPublic

Description

Value proposition:

As an editor who frequently patrols pages, I want to have the option to be able to watch a page temporarily so that I can keep an eye on some pages briefly without having them on my permanent list of pages to watch.

Open questions:

  • What is the status of what WMDE started working on; did they run into notable challenges we should know about before stopping?
  • What database columns are in the watchlist table (and/or watchlist properties) that can be used to store and read a timestamp? Are they available or do they need to be added? What are the options for this? do we need to change schema, add a separate table, or work with what we have?
  • What populates the Watchlist entries? Is it a single place (on load?) or multiple places? If we want to add an operation to validate watched pages, how many places do we need to add it to?
  • If we keep expired pages for a month after expiration, can we control what populates the watchlist table during that month? Is that done on load of the watchlist? Can we stop populating revisions of those expired pages even if the user did not visit the page?
  • Is there an API that is impacted by this? Do we have an API for watchlist, and/or that touches the watchlist? Do we need to include any operation there as well?
  • Notifications: For flow (or any other place that notifies you for watched pages) do we need another operation? Can we easily implement the expiration for those as well or do they require a different operation, seeing as the Event (in Echo) is created on the spot (after saving Flow reply) can we take into account expiration of the watched page/topic?

Helpful links:

Event Timeline

Niharika created this task.
Niharika renamed this task from Investigation: Watchlist Expiry to [Timebox - 8 hours] Investigation: Watchlist Expiry.May 28 2019, 11:21 PM

What is the status of what WMDE started working on; did they run into notable challenges we should know about before stopping?

They've done some WatchedItem etc refactoring, added wl_id to the schema, and ran the RFC (T124752) past ArchCom, though its outcome hasn't been implemented.

What database columns are in the watchlist table (and/or watchlist properties) that can be used to store and read a timestamp? Are they available or do they need to be added? What are the options for this? do we need to change schema, add a separate table, or work with what we have?

CREATE TABLE /*_*/watchlist (
  wl_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- Key to user.user_id
  wl_user int unsigned NOT NULL,

  -- Key to page_namespace/page_title
  -- Note that users may watch pages which do not exist yet,
  -- or existed in the past but have been deleted.
  wl_namespace int NOT NULL default 0,
  wl_title varchar(255) binary NOT NULL default '',

  -- Timestamp used to send notification e-mails and show "updated since last visit" markers on
  -- history and recent changes / watchlist. Set to NULL when the user visits the latest revision
  -- of the page, which means that they should be sent an e-mail on the next change.
  wl_notificationtimestamp varbinary(14)
)

So nothing in the schema can be used for our purposes yet. The aforementioned RFC discussed adding a new column to the watchlist, creating a generic table for watchlist item properties and creating a specialized table just for expiry timestamps. The RFC was 3 years ago, so all its outcome (which ended with a preference for a more generic solution) might be out of touch with today's reality - for example, space usage on DB servers has become more of a problem since then, causing lots of work on normalization instead of having another column. My personal thoughts:

  • Modifying the main watchlist table sounds silly as most rows will presumably still be without expiry, making the column just store NULLs most of the time.
  • Generic watchlist properties are a bit of a can of worms: the only other use for it anybody can think of is for multiple watchlists, however without any development work on it planned and no requirements know for sure, it might as well be red herring.
  • My personal preference would be a dedicated table, as a minimal solution that would work best and not plagued by having to support some mythical future feature, something like that:
CREATE TABLE watchlist_expiry (
    -- Key to wl_id
    we_entry int unsigned NOT NULL PRIMARY KEY,
    -- Expiry timestamp
    we_expiry varbinary(14),
    -- For batch expiry
    KEY(we_expiry)
)

What populates the Watchlist entries? Is it a single place (on load?) or multiple places?

Watchlist internally keeps a list of pages each user is watching, when the user requests Special:Watchlist, their watchlist is joined against recentchanges to produce a list of pages recently modified.

If we want to add an operation to validate watched pages, how many places do we need to add it to?

In the core:

  • API watch, watchlist, watchlistraw and setnotificationtimestamp using the new WMDE-developed services
  • API feedwatchlist calling the above internally (AAAAAAA!)
  • Specials: Watchlist, RecentChanges, RecentChangesLinked, UnwatchedPages querying directly
  • Some less prominent places like jobs
  • MysqlUpdater doing upgrades

Besides core, there are the following extensions, all querying the table directly:

  • Echo for notifying watching users of page changes
  • FlaggedRevs in a few places
  • Flow for notifying users watching a thread
  • LiquidThreads, same as above
  • MobileFrontend has its own version of watchlist, yay

If we keep expired pages for a month after expiration, can we control what populates the watchlist table during that month? Is that done on load of the watchlist? Can we stop populating revisions of those expired pages even if the user did not visit the page?

Is there an API that is impacted by this? Do we have an API for watchlist, and/or that touches the watchlist? Do we need to include any operation there as well?

I explained above how the watchlist works. We either need to update lots of places to take expiry into account or document it as "a watchlist item is not expired until it has been purged by a cronjob, even if technically it's past its time"

Notifications: For flow (or any other place that notifies you for watched pages) do we need another operation? Can we easily implement the expiration for those as well or do they require a different operation, seeing as the Event (in Echo) is created on the spot (after saving Flow reply) can we take into account expiration of the watched page/topic?

Flow is not different from any other places, see above.

The approach @MaxSem outlines above seems to boil down to our trying to cause the least amount of impact on other services. I agree that having a separate watchlist_expiry makes the most sense. That allows us to have a cron which cleans the watchlist table on some schedule. All the rest of the code could remain the same. I agree with Max that we could by convention say that the expiration is "best effort" and that it might be an hour or two off. That depends on how often we think we could have the cron run.

In summary, this is what I understand:

  1. We need a new DB table to store the expiration timestamp and watchlist item ID.
  2. We need a cron job that queries that new table and decides which watchlist items to remove/delete.
  3. This approach requires the creation of a new cron job but does not require modification of much (if any) of the existing watchlist code.

Should the expiration itself should be logged? If so, what should be contained in that log line?

We discussed the investigation today in the engineering meeting and I was asked to share a guideline for what the product ought to do. The following is a very rough idea of what the user workflow should be and what the product should do ideally. The things I consider more important than others have a star next to them.

On Special:Watchlist:
  • User can see which pages are being temporarily watched (with an icon perhaps) ✴️
  • User can see how long before the pages expire ✴️
  • User can modify the expiry for pages which are being temporarily watched ✴️
  • Pages which expire "soon" (for some definition of soon - maybe 2-3 days) have a prominent indicator ✴️
  • User can see pages which expired in the past X days (say, a month)
  • If a user adds a page to their watchlist for three days but only comes back to check the watchlist on Day 5, they should still be able to see edits to that page made over the first three days.
    • Note that we discussed this and we all agreed that this is not feasible as the watchlist table goes by page and not revisions. Keeping this in the list for record-keeping purposes.
Setting an expiry:
  • This can be done from three places -
    • From Special:Watchlist (noted above) ✴️
    • From the star icon on the page (could be a dropdown icon next to the star that opens a date/time selection menu), without altering the default behavior of clicking the star ✴️
    • From the edit summary menu (for all editors) ✴️

Here are a couple of mocks that WMDE made for the UI. We may not necessarily follow these but they are good to get a sense of what this could look like -

Clicking the starFrom the editor
image.png (1×1 px, 228 KB)
image.png (1×1 px, 214 KB)
  • Note that the user should also be able to modify a previously-set expiry from each of these places too. ✴️
Other things:
  • Regular watched pages have a blue-filled star on the page. For expiring pages, we could use a different color and also present information about when the page expires from their watchlist. ✴️
  • We don't need to be very granular about the expiry. Restricting to a date is fine.
  • The API should also provide information about page expiry, if set.

Note: I'll update this list if I come across other things I missed!

From the engineering meeting, we hashed out next steps for this:

  • Adding another table for storing the expiry
  • Purge on write (every time user adds/removes a page to the watchlist)
  • Add cron job for secondary cleaning
  • On Special:Watchlist and Special:EditWatchlist: Filter out the rows that are expired
  • On some reads, detect that there are expired items and send an API request to purge them <-- We need to work this out
  • NOTE: If the cron job never runs and a user doesn't edit their watchlist, items will never expire
  • Fix emails sent about the watchlist in core
  • Go over instances where core notifies user about watchlist items

@Niharika and @ifried, please note that this has some product implications, notably that we probably can't promise when, exactly, an expired item literally vanishes from the watchlist, which means other tools that read the watchlist (and there are many) may show that page for a little longer than they should, while we clean up. We can't completely trust on secondary cleaning with the cron job, so in case it didn't work, the expired page may stick around until the next run (every 24h)

Your input on this is welcome :)

@Mooeypoo Okay. I have some questions about this. I am sure Ilana does too. Let's have a meeting to iron out the details. I'll set something up.

  • Adding another table for storing the expiry

We need to make sure the indexes for this are setup correctly so we can filter with WHERE by the expiration timestamp.

We have launched a project page to begin collecting community feedback on Watchlist Expiry. For this reason, I'm marking this work as 'Done.' We may proceed with further investigations once we collect feedback from users, but this particular investigation is complete.