Page MenuHomePhabricator

Store information on page watcher count in the Recent Changes table
Open, Stalled, MediumPublic3 Estimated Story Points

Description

Author: mk45654

Description:
Background: Vandalism is a problem on all pages, but it is a much larger problem on pages that, on average, have fewer or zero people watching them. One solution to this problem is Special:UnwatchedPages. However this page is only accessible to administrators, is updated infrequently, and is therefore very difficult (impossible) to clean out.

This problem could be addressed by allowing users to filter Special:RecentChanges to edits made to pages with no page watchers. This would necessitate storing this information in the Recent Changes table.

This would allow a "recent changes in unwatched articles" page. This page would catch vandalism that would usually have gone unnoticed for long periods of time, and encourage editors to take these pages into their watch-lists. It would eliminate the need for Special:UnwatchedPages. Vandalism of pages on this list would not be a problem, as such edits would be bumped to the top of the list and be twice as likely to be caught, and substantially more likely than at recentchanges due to lower volume.

See T380719: [SPIKE] How could we enable editors to filter RecentChanges for pages with no watchers? [16H] for the suggested technical approach.


See Also:

Details

Reference
bz18790
Related Changes in Gerrit:

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:37 PM
bzimport set Reference to bz18790.
bzimport added a subscriber: Unknown Object (MLST).

happy.melon.wiki wrote:

Per bug11181#c11, no booleans --> bitfields (I only recently found that comment from Brion, using rc_type was originally my idea at enwiki VPR). The performance increase from having an rc_watched column is offset by having to *populate* an rc_watched column... which is harder on the servers, querying once for many pages for the Special:RecentChanges/unwatched, or querying many times for one page every time an edit is saved to the recentchanges table?

Doing the query on page save would likely be less load and if a schema change was done, it would be the only way; populating it on-demand would just be kind of odd and hard to do. If only 1 in 5 edits is to an unwatched page, you'd have to load the data for at least 250 to be able to hopefully generate a list of 50.

Counting only active editors would be possible, but it might add too much to the query to be usable.

mk45654 wrote:

Another option might be to add page_watchers to the page table itself, which would be changed whenever a user watches/unwatches. This might be much less useful, since it would miss articles that have been abandoned.

An optimization is to populate the field with active users only when the watchers count is under 20 or 100. If there are that many watchers we can assume that at least a few of them are active. This would cut down on the need to join to users, though it would be a strange or double query.

(In reply to comment #1)

Per bug11181#c11, no booleans --> bitfields (I only recently found that comment
from Brion, using rc_type was originally my idea at enwiki VPR). The
performance increase from having an rc_watched column is offset by having to
*populate* an rc_watched column... which is harder on the servers, querying
once for many pages for the Special:RecentChanges/unwatched, or querying many
times for one page every time an edit is saved to the recentchanges table?

The load for populating rc_watched on edit time shouldn't be too bad: determining whether a single page is watched is a very simple and very fast query.

mk45654 wrote:

(In reply to comment #4)

The load for populating rc_watched on edit time shouldn't be too bad:
determining whether a single page is watched is a very simple and very
fast query.

This would be more useful if we had more information than just a boolean 'has watchers'. Can it be made to count the number of watchers without much strain, and in particular, autoconfirmed watchers active in the past 7 (or 60) days?

Keep in mind that an rc_watched column will only be a page's watch status _at_ the time the action occurred. We won't go back and change old entries when they no longer are watched (or become watched).

Just something to keep in mind.

(In reply to comment #6)

Keep in mind that an rc_watched column will only be a page's watch status _at_
the time the action occurred. We won't go back and change old entries when they
no longer are watched (or become watched).

Just something to keep in mind.

I believe that would be the preferred scenario anyway: we want to know about changes that weren't being watched when they were made, not before or after.

mk45654 wrote:

At the enwiki VPR discussion for this feature (which, it should be noted, currently has 22 unanimous supporters), a couple of editors were concerned that making this feature immediately public might expose unwatched articles to vandalism, were the list subsequently shut down due to bugs. I don't think this is a problem - if even 1000 unwatched articles were edited during this time, editors would have little trouble watching all of them. But the concern deserves mention here. The discussion and straw poll:

http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#Recent_unwatched_changes_straw_poll

(In reply to comment #7)

I believe that would be the preferred scenario anyway: we want to know about
changes that weren't being watched when they were made, not before or after.

This is my view also.

What steps would need to be taken to have this feature implemented? If someone were to try to create a patch, what files should they look at, what should they be aware of, and where should they begin? (Though this seems to be something for those familiar with the code, perhaps a few steps can be taken care of by others.)

Samwalton9-WMF renamed this task from Unwatched recent changes page via rc_watched to Store information on page watcher count in the Recent Changes table.Nov 20 2024, 12:07 PM
Samwalton9-WMF moved this task from Inbox to Product backlog on the Moderator-Tools-Team board.
Samwalton9-WMF raised the priority of this task from Low to Medium.Feb 4 2025, 3:51 PM
Samwalton9-WMF updated the task description. (Show Details)
Scardenasmolinar changed the task status from Open to In Progress.Feb 6 2025, 3:23 AM
Scardenasmolinar claimed this task.
Scardenasmolinar moved this task from Ready to In Progress on the Moderator-Tools-Team (Kanban) board.

Change #1118858 had a related patch set uploaded (by Scardenasmolinar; author: Scardenasmolinar):

[mediawiki/core@master] Add watch count on recentchanges table

https://gerrit.wikimedia.org/r/1118858

Thinking about @stjn's comment at T313581#10550455, should the page watcher counts being added to the recentchanges table be restricted/filtered out on the Data-Services wiki replica databases?

That's a great point - it hadn't occurred to me that this was an option. It would still presumably be available via API?

That's a great point - it hadn't occurred to me that this was an option. It would still presumably be available via API?

For clarity — are you asking if the contents of the new table column would be available via the API? If so, do not take my word for this, but - as long as the current patch is only adding the new column and not (at this stage) adding a new filter/doing anything else - I don’t think it would be, as there won’t have been anything added to the API to query it. I think.

jsn.sherman subscribed.

@A_smart_kitten I talked over your suggestion with data services and redaction sounds like the way to go. We should do a sanity check regarding the API, but I think you are correct. @Scardenasmolinar I'm moving this to "waiting for changes" so that we can figure out the redaction process before we make any changes.

@Scardenasmolinar we'll need to go through the schema change process as documented here:
https://wikitech.wikimedia.org/wiki/Schema_changes
that might be a good time to also ask about redaction for the replicas

Hi @Ladsgroup @jcrespo, I'm including you in this DB schema change before I create the change-specific Phab ticket. TLDR: We are adding a watch count column to the RecentChanges table. We found no performance concerns in our spike investigation (T380719: [SPIKE] How could we enable editors to filter RecentChanges for pages with no watchers? [16H]) and will not include the column in the replica databases.

Hi!
TLDR: Please don't do that yet.

Long version: RC table is basically the worst table in production in terms of issues caused, it caused outages until we put a cap on almost all queries to it. I wrote T307328: Scalability issues of recentchanges table long time ago. Adding more columns will make matters even worse (also you need to add an index for that column too). That's why we have blocked many additions to this table for example T179010: [C-DIS][SW] Re-enable Wikidata Recent Changes integration on Commons. This table is quite bad in Wikidata and Commons. On top of that, a lot of random usages have been dumped on this table without thought being put on the long-term longevity of them. For example, category changes add a row in RC table while not really being queried and this is taking a lot of space. Similarly for wikidata changes (which has good usecases but it's all drown in noise caused by bad subscription of the data via lua modules)

I'm quite flexible on what we can put in relational databases but this table clearly stretching the limits of mariadb. It really shouldn't be there and should be redesigned from ground up but that'd be a gargantuan task. In the mean time, don't put this data in RC table and let's find a better place for it.

I'm not a DBA anymore, you should talk to Amir, Manuel or Federico.

Hi!
TLDR: Please don't do that yet.

Long version: RC table is basically the worst table in production in terms of issues caused, it caused outages until we put a cap on almost all queries to it. I wrote T307328: Scalability issues of recentchanges table long time ago. Adding more columns will make matters even worse (also you need to add an index for that column too). That's why we have blocked many additions to this table for example T179010: [C-DIS][SW] Re-enable Wikidata Recent Changes integration on Commons. This table is quite bad in Wikidata and Commons. On top of that, a lot of random usages have been dumped on this table without thought being put on the long-term longevity of them. For example, category changes add a row in RC table while not really being queried and this is taking a lot of space. Similarly for wikidata changes (which has good usecases but it's all drown in noise caused by bad subscription of the data via lua modules)

I'm quite flexible on what we can put in relational databases but this table clearly stretching the limits of mariadb. It really shouldn't be there and should be redesigned from ground up but that'd be a gargantuan task. In the mean time, don't put this data in RC table and let's find a better place for it.

Thanks for this context. Is there an obvious other place to put this?

One thought I have is that we were talking about potentially extracting the ML-based Recent Changes filters from the ORES extension, so that that extension could focus on retrieval and storage of ML scores. Perhaps this warrants a 'RC filters' extension for both this and that? Feel free to ignore that suggestion if it doesn't make sense.

Samwalton9-WMF changed the task status from In Progress to Stalled.Mar 18 2025, 5:06 PM

Thanks for this context. Is there an obvious other place to put this?

I don't think so. I can set up some time and come up with a set of proposals with different trade-offs but that requires my manager to carve out time/okr for this. I don't think I can do this in this quarter.

One thought I have is that we were talking about potentially extracting the ML-based Recent Changes filters from the ORES extension, so that that extension could focus on retrieval and storage of ML scores. Perhaps this warrants a 'RC filters' extension for both this and that? Feel free to ignore that suggestion if it doesn't make sense.

I'm not following the proposal. ores extension has its own tables that get joined with rc table.

Thanks for this context. Is there an obvious other place to put this?

I don't think so. I can set up some time and come up with a set of proposals with different trade-offs but that requires my manager to carve out time/okr for this. I don't think I can do this in this quarter.

edited to add
If we end up completely stalled on adding new data for rc filtering, then I can start a decision brief to get more structured conversation rolling between teams.

One thought I have is that we were talking about potentially extracting the ML-based Recent Changes filters from the ORES extension, so that that extension could focus on retrieval and storage of ML scores. Perhaps this warrants a 'RC filters' extension for both this and that? Feel free to ignore that suggestion if it doesn't make sense.

I'm not following the proposal. ores extension has its own tables that get joined with rc table.

The context there is that we have been strongly discouraged from rolling out ores to more wikis, so we're having conversations with ML about how to move forward. If adding a column to the rc table is a no-go, but using a one-to-one related table is okay, then using ores (or an ores-like extension) would be fine by us.

Thanks for this context. Is there an obvious other place to put this?

I don't think so. I can set up some time and come up with a set of proposals with different trade-offs but that requires my manager to carve out time/okr for this. I don't think I can do this in this quarter.

edited to add
If we end up completely stalled on adding new data for rc filtering, then I can start a decision brief to get more structured conversation rolling between teams.

This is a technical problem and requires a lot of time to investigate and experiment and so on. Once that's done, there wouldn't be much left to do. I don't think a decision brief would help.

One thought I have is that we were talking about potentially extracting the ML-based Recent Changes filters from the ORES extension, so that that extension could focus on retrieval and storage of ML scores. Perhaps this warrants a 'RC filters' extension for both this and that? Feel free to ignore that suggestion if it doesn't make sense.

I'm not following the proposal. ores extension has its own tables that get joined with rc table.

The context there is that we have been strongly discouraged from rolling out ores to more wikis, so we're having conversations with ML about how to move forward. If adding a column to the rc table is a no-go, but using a one-to-one related table is okay, then using ores (or an ores-like extension) would be fine by us.

So I would like to know why "we have been strongly discouraged from rolling out ores to more wikis". Is it a model number problem (could we use the wiki-agnostic model instead?)? Code quality problem? infrastructure issues (table sizes?).

Thanks for this context. Is there an obvious other place to put this?

I don't think so. I can set up some time and come up with a set of proposals with different trade-offs but that requires my manager to carve out time/okr for this. I don't think I can do this in this quarter.

edited to add
If we end up completely stalled on adding new data for rc filtering, then I can start a decision brief to get more structured conversation rolling between teams.

This is a technical problem and requires a lot of time to investigate and experiment and so on. Once that's done, there wouldn't be much left to do. I don't think a decision brief would help.

One thought I have is that we were talking about potentially extracting the ML-based Recent Changes filters from the ORES extension, so that that extension could focus on retrieval and storage of ML scores. Perhaps this warrants a 'RC filters' extension for both this and that? Feel free to ignore that suggestion if it doesn't make sense.

I'm not following the proposal. ores extension has its own tables that get joined with rc table.

The context there is that we have been strongly discouraged from rolling out ores to more wikis, so we're having conversations with ML about how to move forward. If adding a column to the rc table is a no-go, but using a one-to-one related table is okay, then using ores (or an ores-like extension) would be fine by us.

So I would like to know why "we have been strongly discouraged from rolling out ores to more wikis". Is it a model number problem (could we use the wiki-agnostic model instead?)? Code quality problem? infrastructure issues (table sizes?).

I'm going to paraphrase @calbon here: the primary reason is that the extension is outside the ML team's wheelhouse; they want to be stewards of the ML score data side of things, but not have the risk involved with impacting the UX. They are are gathering information to decide how to move forward, perhaps with a pretty different design for how the score data is accessed.

If you believe that adding filter data to an extension table won't exacerbate the RC performance concerns, then that would be great to hear. The reason I believe that a decision brief could be useful is that we have been tasked with improving RC by adding some requested filters. You've told us not to do this in core, which is reasonable. ML has asked us not to do this in ORES, which is also reasonable. At the moment, those two reasonable positions make it look like we cannot accomplish our goals for the quarter. We don't own ORES, and we are not the DB experts, so I suggested a decision brief because this situation seems to be a reasonable fit for the criteria in our internal documentation:

Signs you need a brief:

    An email thread/PR is going on and on with no clear end in sight.
    Decision owner is ambiguous – no one feels confident to make the final call
    You need agreement between cross-functional teams
    Multiple people have put forth proposals, but no proposal is sufficiently researched or comprehensive enough to address everyone’s criteria

You probably don’t need a brief if:

    The decision is minor and impacts a small group of people who work closely together
    Everyone is already in agreement
    You are confident you can quickly resolve the decision with a short conversation

improving RC by adding some requested filters. You've told us not to do this in core, which is reasonable. ML has asked us not to do this in ORES, which is also reasonable.

What I'm asking is to not add more filters to RC regardless of whether it's in core or extension (new or old). Specially a new table that adds a lot of overload (T359309#9614963), adds more joins to the core the RC query which can break in many different ways (finding join order for RDBMS gets really hard after five six tables joining, RC queries are already hitting that limit) and require setting a dedicated setup to purge them after 30 days.

This ticket is 16 years old, the reason for that is not just noone has gone around to do it. It's complex and messy.

There are a lot of ways that can be improved some might be counter-intuitive, for example, getting rid of wikidata updates or category updates to reduce number of rows in RC table. But it needs a db expert to spend time and research on the problem.

improving RC by adding some requested filters. You've told us not to do this in core, which is reasonable. ML has asked us not to do this in ORES, which is also reasonable.

What I'm asking is to not add more filters to RC regardless of whether it's in core or extension (new or old). Specially a new table that adds a lot of overload (T359309#9614963), adds more joins to the core the RC query which can break in many different ways (finding join order for RDBMS gets really hard after five six tables joining, RC queries are already hitting that limit) and require setting a dedicated setup to purge them after 30 days.

This ticket is 16 years old, the reason for that is not just noone has gone around to do it. It's complex and messy.

There are a lot of ways that can be improved some might be counter-intuitive, for example, getting rid of wikidata updates or category updates to reduce number of rows in RC table. But it needs a db expert to spend time and research on the problem.

I really appreciate this response! We have other filters we were wanting to add to ORES in addition to this work, so that would clearly exacerbate the problem.