
Scalability issues of recentchanges table
Open, Needs Triage, Public

Description

The recentchanges table powers one of the core functionalities of Wikipedia and its sister projects. It is basically the backbone of every patrolling tool, from Special:RecentChanges to the watchlist to Huggle, and so on.

Since most of these tools provide a practically free-form SQL query builder, they can easily lead to full-table-scan queries. That is partly by design: the rc table should be small and nimble, but it is also pretty hot and read in many different ways (it has 10 indexes at the moment). The problem starts when a wiki gets really large, whether due to bot edits or simply the size of the wiki. For example, Wikidata's rc table currently has 22M rows, which is clearly bigger than it should be. At the moment, only three wikis have issues with the recentchanges table: English Wikipedia, Wikidata, and Wikimedia Commons.

Here are several bugs caused by the rc table getting large:

It is also blocking new features such as T179010: [C-DIS][SW] Re-enable Wikidata Recent Changes integration on Commons.

The underlying problem is that the RC table is trying to juggle two things:

  • Detecting vandalism and problematic edits, which requires complex queries on non-patrolled edits (a small subset of all edits), e.g. "get me all edits that have a high ORES score but also a certain tag".
  • A general pulse check of what's going on, for example the watchlist ("what is happening on pages I care about?"). This requires seeing all edits, but with simpler, narrower queries: no complex magic, but the full firehose of edits.
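As a rough illustration of how differently these two workloads hit the table, here is a toy sketch. The schema is heavily simplified, and the rc_ores_score and rc_tag columns are hypothetical stand-ins; in MediaWiki, ORES scores and change tags actually live in separate tables.

```python
import sqlite3

# Toy recentchanges table (illustrative only, not MediaWiki's real schema).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE recentchanges (
        rc_id INTEGER PRIMARY KEY,
        rc_timestamp TEXT,
        rc_title TEXT,
        rc_patrolled INTEGER,  -- 0 = unpatrolled
        rc_ores_score REAL,    -- hypothetical column for illustration
        rc_tag TEXT            -- hypothetical; real tags live in change_tag
    )
""")
con.executemany(
    "INSERT INTO recentchanges VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "20240601120000", "PageA", 0, 0.93, "mobile edit"),
        (2, "20240601130000", "PageB", 1, 0.10, None),
        (3, "20240602090000", "PageA", 0, 0.20, None),
    ],
)

# Workload 1: vandalism patrol. Complex filters, but only over the small
# unpatrolled subset of edits.
patrol = con.execute("""
    SELECT rc_id FROM recentchanges
    WHERE rc_patrolled = 0 AND rc_ores_score > 0.8 AND rc_tag = 'mobile edit'
""").fetchall()

# Workload 2: watchlist pulse check. All edits on watched pages, but only a
# simple, narrow filter.
watchlist = con.execute("""
    SELECT rc_id FROM recentchanges
    WHERE rc_title IN ('PageA') ORDER BY rc_timestamp DESC
""").fetchall()

print(patrol)     # [(1,)]
print(watchlist)  # [(3,), (1,)]
```

The point of the sketch is that the two access patterns want very different indexes, which is why one table serving both gets hot.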

Proposed solutions (some mutually exclusive, some not):

  1. Normalizing the table: this is not a good design and won't help much. This table is the de-normalization table of MediaWiki; it's a summary table.
  2. Reduce how long rc actions are stored on large wikis. Currently it's 30 days. This number did not come from any research; it's used just because it's round. In my role as a Wikipedia volunteer, I rarely looked at anything older than a week.
  3. Split autopatrolled[1] actions out of the rc table into something like patrolled_recentchanges, on which querying would be much more limited (e.g. no tag filtering, no ORES score filtering, etc.).
  4. Reduce how long autopatrolled actions are stored. This is basically a hybrid of solutions 2 and 3: it reduces the size without losing much.
  5. Store one week of data in one table and the rest of the month in another (or one week in the first table and the whole month in the other, meaning the first week would be duplicated), and make queries on the second table more restricted, for example no tag filtering, no ORES filtering, etc.
  6. Move it to a NoSQL solution. While I can see the reasoning, I'm not sure it would fix anything; this massive firehose of data would be slow to query in any system.
    1. That being said, using Elastic sounds like a good idea (see T307328#7894820)
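Solution 4 could be sketched as a periodic purge job along these lines. This is a simplified illustration: the rc_patrolled value 2 corresponds to MediaWiki's PRC_AUTOPATROLLED constant, but the schema and the cutoffs are toy values, and the real job would run batched deletes.

```python
import sqlite3

# Toy sketch of solution 4: keep 30 days of regular rows, but purge
# autopatrolled rows (rc_patrolled = 2) after 14 days.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE recentchanges (rc_id INTEGER, rc_timestamp TEXT, rc_patrolled INTEGER)"
)
con.executemany("INSERT INTO recentchanges VALUES (?, ?, ?)", [
    (1, "20240501000000", 2),  # autopatrolled, 20 days old -> purged
    (2, "20240501000000", 0),  # unpatrolled, 20 days old   -> kept (under 30d)
    (3, "20240515000000", 2),  # autopatrolled, 6 days old  -> kept
])

cutoff_all = "20240421000000"            # "now" (2024-05-21) minus 30 days
cutoff_autopatrolled = "20240507000000"  # "now" minus 14 days

# Standard 30-day purge for everything...
con.execute("DELETE FROM recentchanges WHERE rc_timestamp < ?", (cutoff_all,))
# ...plus the shorter retention window for autopatrolled actions only.
con.execute(
    "DELETE FROM recentchanges WHERE rc_patrolled = 2 AND rc_timestamp < ?",
    (cutoff_autopatrolled,),
)
remaining = sorted(r[0] for r in con.execute("SELECT rc_id FROM recentchanges"))
print(remaining)  # [2, 3]
```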

[1] "Autopatrolled" is only useful for Wikidata and Commons because RC patrolling is not enabled on English Wikipedia. RC patrolling is a MediaWiki feature for finding vandalism and reducing the work of patrollers. I find it disheartening that it's not enabled there (which could be done with a bit of community consultation and some small design fixes).


Event Timeline


I would consider moving it to Elastic and scaling it out. Lucene/ElasticSearch is optimised for large intersections and can supposedly scan faster than MySQL, and sharding is easier.

That seems like the best long-term solution to me. Elasticsearch is intended for exactly the kind of complex ad-hoc queries that Special:Recentchanges tries to provide. Enwiki has 55M pages and even relatively slow search operations like fulltext search run basically immediately on it; I'm sure the 20M watchlist entries wouldn't be a problem. There are drawbacks (robustness, hardware costs) but they seem manageable.

Converting recent changes to Elasticsearch would be a big engineering project though - not something that's likely to happen any time soon.

@Ladsgroup how urgent of a problem is this to solve? Is this something that should be discussed in annual planning that is happening right now, for example?

In principle, Elasticsearch looks like a good idea. A few constraints in our context:

  • We probably don't want to mix this use case with the current Full Text Search cluster, so we would need to set up an additional ES cluster.
  • We treat Elasticsearch as a strictly secondary storage solution. Updates can (and will) be lost. We need to be able to recreate the data from a primary source at all times.
  • This use case probably does not fit into our current update pipeline; another pipeline would need to be developed (or the current one changed significantly).

Reduce how long rc actions are stored on large wikis. Currently it's 30 days. This number did not come from any research; it's used just because it's round. In my role as a Wikipedia volunteer, I rarely looked at anything older than a week.

It's 30 days because it's used for watchlists (and to a lesser extent, RecentChangesLinked). If it was just RC then it could be shorter. […]

Since I got a full-time job, I've not been able to review my Wikipedia watchlist weekly; it's more like once every 3-4 weeks. I think as we grow further and try to spread the load across more patrollers, each doing a little bit, it would make sense to plan for maintaining or growing this window, not shrinking it. With every day you shrink the watchlist length, you essentially require a smaller set of more active and widely engaged reviewers, which seems contrary to (my interpretation of) the Movement Strategy.

I imagine that once in Elasticsearch it might be fairly trivial to ramp up RC age from 30 days back up to the 90 days we were on before, at relatively little cost as it would no longer scale linearly based on what can be scanned at once.

I guess we could approach this somewhat similarly to prefix search and full-text search, which we also do through a service class that CirrusSearch can override. We'd need to build something similar for RecentChanges.

@Ladsgroup how urgent of a problem is this to solve? Is this something that should be discussed in annual planning that is happening right now, for example?

I think it needs a bit of intervention ASAP to specially handle Wikidata's case, but for the rest of the wikis it can wait for FY23-24. So IMHO a band-aid solution for Wikidata plus a long-term solution would be good.

Thanks for flagging this @Ladsgroup. Is there anything we can do on our side of things at the moment?

Do people on Wikidata and Commons rely on watchlists much? Messing with the enwiki watchlist seems scary, but Commons / WD have atypical QA/anti-abuse needs and workflows so maybe just reducing the RC period unconditionally is less of an issue there? Or just excluding bot edits, which is most edits on Wikidata, and they don't show up on most people's watchlists anyway?

Wikidata RecentChanges and Watchlist would benefit from more granular filtering (see T43686: [Story] Ability to filter recent changes and watchlist on Wikidata, and its subtasks). There are a number of other tasks about filtering such lists (random examples: T23674, T42535, T12379, T124214, T45901).

Another thing to consider: if recentchanges is moved out of SQL, it may be a pain to query changes for a large watchlist (about 100k pages).

I rely heavily on my Wikidata watchlist. I generally just look at the last 7 days, but longer is sometimes useful. I think proposed solutions 3 and 4 above (splitting out autopatrolled changes) would work well for my purposes: 99% of the edits I check are unpatrolled ones, so if we could keep 30 days of unpatrolled changes and cut the rest to 14 days or less, that would be fine for me. By the way, I find ORES essentially useless for evaluating the quality of Wikidata edits.

In the case of Commons, 71% of the 19M rows in the rc table come from the mw.categorize source (T9148: Watch edits which add or remove pages from a category). Maybe that can be disabled on Commons?

In the case of Commons, 71% of the 19M rows in the rc table come from the mw.categorize source (T9148: Watch edits which add or remove pages from a category). Maybe that can be disabled on Commons?

I took a look around to see whether this is working as expected or whether something is going wrong, i.e. is 70% a big and unexpected number?

Firstly, I had completely forgotten that RC_CATEGORIZE still exists, and how it works. In a nutshell: when a links-update job re-renders a page, it emits RC events for effective additions to and removals from the category links table.

This largely reflects four kinds of changes:

  1. Amplify each regular edit that directly changes the categories on a page. E.g. edit page X and add [[Category:Foo]] and {{Bar}}, this will emit 1 RC event for the edit as usual, and 2 more RC events, one for Category:Foo and one for Category:Pages_with_Bar with an edit summary describing that X was added to the category.
  2. Amplify each page creation and upload. When uploading a file on Commons, it will typically be in several categories right away. e.g. "CC-BY-4.0", "Media missing source", "Media missing description", and either "Media missing regular categories" or one or more regular categories as specified by the uploader. This upload will then produce N categorize events, one for each of these categories.
  3. Amplify edits to templates and Wikidata entities. If a commonly used template is modified to involve some kind of new hidden category or to modify how it uses these, that will cause a long chain reaction of many categorize events. Similarly, a Wikidata edit will emit events on Commons. E.g. a category about a person has categories based on the Wikidata infobox such as "actors from Kentucky", "actors of Belgian descent", "Fantastic Beasts cast", etc.
  4. Report on completion of maintenance by removing from an automatic hidden maintenance category. E.g. uploads with automatic category "Media missing regular categories" will emit an edit to remove from that category when an edit happens to the file that adds a regular category, similar for edits that fill in missing source, author, description, etc.

Note specifically that RC categorize events use the category as the subject. This means that if you watch an article on Wikipedia and an indirect change from Wikidata or a template modifies which categories it is in, you will not see that on your watchlist, even if you enable showing "Category changes". These changes are only provided to users who watch from the perspective of the category.

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

Only passing by to support everything Tim said (watchlists are the main limiting factor; RC by itself could be trimmed) and to make sure T171027 is in people's minds when thinking about refactoring, both as an example of something broken due to the lack of scalability, and to take into account the needs of users when some functionality was temporarily disabled. RC and watchlists are something that many power users rely on, so extra care should be taken with them. :-)

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

Thanks for the thorough investigation. I looked to see if removing that would help. Maybe the query is wrong somehow, but it's only 30K rows :/

mysql:research@s4-analytics-replica.eqiad.wmnet [commonswiki]> select count(*) from recentchanges join page_props on pp_page = rc_cur_id where rc_source = 'mw.categorize' and pp_propname = 'hiddencat' limit 5;
+----------+
| count(*) |
+----------+
|    30004 |
+----------+
1 row in set (2.335 sec)

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

I usually work on Commons protected edit requests for interface administrators through my watchlist – I find it much more useful to see the additions and removals there than to look at the category in full (which mostly contains pages that have been there for a while and that I can’t handle for one reason or another).

I also agree with @ArthurPSmith that options 3 and 4 seem good; I suspect option 4 is much easier to implement, and could be a good shorter-term solution (with ElasticSearch as a longer-term perspective).

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". […]

Thanks for the thorough investigation. […] Maybe the query is wrong somehow but it's only 30K rows :/

mysql:research@s4-analytics-replica.eqiad.wmnet [commonswiki]> select count(*) from recentchanges join page_props on pp_page = rc_cur_id where rc_source = 'mw.categorize' and pp_propname = 'hiddencat' limit 5;
+----------+
| count(*) |
+----------+
|    30004 |
+----------+
1 row in set (2.335 sec)

I found two issues.

  1. It seems a lot of maintenance categories that are marked as hidden in wikitext (sometimes very indirectly, via layers of templates) are missing the hiddencat page property.

Perhaps this is fallout from the link table deletion incident (ref T299095 etc.), or perhaps refreshLinks jobs were lost at some point: a template was broken, and the jobs from the edit fixing it were lost.

page_id   | page_title                                 | cat_files | hiddencat
105635278 | Media_needing_categories_as_of_22_May_2021 | 21471     | 0
105157987 | Media_needing_categories_as_of_11_May_2021 | 20844     | 0
105077809 | Media_needing_categories_as_of_9_May_2021  | 20526     | 0
105589748 | Media_needing_categories_as_of_21_May_2021 | 20501     | 0
104920655 | Media_needing_categories_as_of_5_May_2021  | 19754     | 0
105194480 | Media_needing_categories_as_of_12_May_2021 | 18434     | 0
104834505 | Media_needing_categories_as_of_3_May_2021  | 18401     | 0
105227601 | Media_needing_categories_as_of_13_May_2021 | 18027     | 0
105543640 | Media_needing_categories_as_of_20_May_2021 | 17529     | 0

I ran refreshLinks.php on commonswiki for the parent of the Media_needing_categories_as_of_* categories (there were only 2000 or so, and they're very tiny and quick to re-parse). This fixed those specifically, but there may be more of these. However, via Quarry I found no categories named like a maintenance category that lack hiddencat and have more than 50 files site-wide, so there's likely nothing else major that would affect recent changes.

After that, re-running your query returned about 5% more in relative terms, but still very little in the absolute sense.

From https://quarry.wmcloud.org/query/38624 (rounded to two digits)
join page_props on pp_page = rc_cur_id
where rc_source = 'mw.categorize'
-- ...
     31,000  categorize-hiddencat
 12,000,000  categorize-all

The following takes your query and is meant to show which (hidden) categories dominate RC categorization events:

SELECT count(*), rc_cur_id, rc_title
FROM recentchanges JOIN page_props ON pp_page = rc_cur_id
WHERE rc_source = 'mw.categorize'
GROUP BY rc_cur_id
-- ...
_count | rc_cur_id | rc_title
    15 |  26925685 | Non-empty_category_redirects
    14 |  31761180 | Pages_with_object_coordinates_from_Wikidata
    14 |  12879996 | Non-empty_category_redirects
    14 | 117676627 | Hidden_categories
    14 | 117624850 | Non-empty_category_redirects
    13 |  46725206 | Non-empty_category_redirects
    13 |  33280299 | Hidden_categories
    13 | 108085116 | Hidden_categories

This result exposes our second problem: despite the GROUP BY clause, we still see the same titles multiple times. What's going on?

rc_title                          | rc_cur_id | rc_this_oldid | rc_source     | rc_params
CC-Zero                           | 116846263 | 648488001     | mw.categorize | a:2:{s:10:"hidden-cat";b:1;s:5:"added";b:1;}
Footbridges_in_Schleswig-Holstein | 116846265 | 648488004     | mw.categorize | a:2:{s:10:"hidden-cat";b:0;s:5:"added";b:1;}
CC-Zero                           | 116846264 | 648488002     | mw.categorize | a:2:{s:10:"hidden-cat";b:1;s:5:"added";b:1;}

Note how the rc_cur_id values differ even for the same title. And indeed https://commons.wikimedia.org/wiki/?curid=116846263 is the file page, not the category page.

So although categorize events materialize in the user interface as having the category as their subject, and this is reflected in rc_namespace/rc_title, the rc_cur_id column is actually populated from the target page (or possibly the page causing the chain reaction, e.g. sometimes a template). I imagine you thought, like me, that this information would be in rc_params or rc_comment instead of frankensteined like this.

From https://quarry.wmcloud.org/query/63402 (rounded to two digits)
  8,300,000  categorize-hiddencat
 12,000,000  categorize-all

8.3/12=69.2%

I've corrected the query to join page_props via page.page_id instead of rc_cur_id, which shows that hiddencat categorizations are responsible for 70% of rc-categorize events (as opposed to the 0.25% the broken join suggested).
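The join bug described above can be reproduced in miniature. The schema below is heavily simplified (no namespaces, no categorylinks), but it shows why joining page_props on rc_cur_id misses the category's hiddencat property while joining via the page title works:

```python
import sqlite3

# Minimal reproduction of the broken join: for mw.categorize rows, rc_title
# names the category, but rc_cur_id points at the page that changed, so
# joining page_props on rc_cur_id almost never hits the category's props.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page (page_id INTEGER, page_title TEXT);
    CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT);
    CREATE TABLE recentchanges (rc_cur_id INTEGER, rc_title TEXT, rc_source TEXT);

    INSERT INTO page VALUES (10, 'CC-Zero'), (99, 'Some_file');
    INSERT INTO page_props VALUES (10, 'hiddencat');  -- CC-Zero is hidden
    -- categorize event: subject is the category, rc_cur_id is the file page
    INSERT INTO recentchanges VALUES (99, 'CC-Zero', 'mw.categorize');
""")

# The original query's join: page_props keyed on rc_cur_id (the file page).
wrong = con.execute("""
    SELECT count(*) FROM recentchanges
    JOIN page_props ON pp_page = rc_cur_id
    WHERE rc_source = 'mw.categorize' AND pp_propname = 'hiddencat'
""").fetchone()[0]

# Corrected join: look up the category page by title, then its props.
# (A real query would also have to match rc_namespace against the
# category namespace.)
right = con.execute("""
    SELECT count(*) FROM recentchanges
    JOIN page ON page_title = rc_title
    JOIN page_props ON pp_page = page.page_id
    WHERE rc_source = 'mw.categorize' AND pp_propname = 'hiddencat'
""").fetchone()[0]

print(wrong, right)  # 0 1
```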

Thanks for the investigation!!!

To move forward:

  • For Wikidata only: let's reduce the time autopatrolled actions are stored to two weeks, if that's okay with @Lydia_Pintscher.
  • For Commons: let's stop injecting rc changes when the changed category is a hidden category. If a hidden category is being used in watchlists, it can simply be made not hidden.

Are people okay with this?

Looking at the data, it doesn't look like we are getting any more RC entries than we have over the past 3 years.

[image.png: graph of RC entry counts over the past 3 years]

Is there something causing this to become more of an issue now?
Is there a quantity of rows we should strive to avoid going over?
Or are we just in the same status quo that we have been for the past years and would generally like to do something to see less timeouts etc?

Some notes:

  • The total number of rows in the rc table is not the full story. The table is used heavily, and any increase in pressure from data in other tables can drastically increase the pressure on the rc table; everything is connected.
  • Just looking at the row count doesn't tell you the distribution of the data. For example, an increase in non-bot edits can have a different impact than an increase in bot edits, given the different indexes and access patterns.
  • Being large a couple of years ago also doesn't mean it wasn't a problem back then. We used to have basically an outage a month caused by Wikidata's db, and we still have issues with it. Just last Sunday, all of Wikipedia went down for a couple of minutes because of the pressure on s8.
  • The RC table is designed to be small and nimble; it has ten indexes, for example. Breaking that assumption causes all sorts of unexpected issues. I want to keep that assumption true, not just for current use cases but for future ones too.
  • The de-normalized nature of the table makes it convenient to use, e.g. for tools. Some info in the table even seems hard to get otherwise (e.g. whether an edit was performed by a user that had a bot flag at the time of making the edit). I guess a lot of tools and analytics efforts rely on this very table, and reducing its scope might break stuff.
  • Having too much faith in autopatrolled actions sounds problematic to me (what exactly are the criteria for an edit to become autopatrolled?). On the other hand, my assumption would be that for most on-wiki use cases it would practically not matter if we reduced the storage time or limited the number of rows in the table. I am, however, not sure about off-wiki use cases.

-> Do we have a good understanding of the consequences of these suggestions for tools and analytics? I would prefer us to be able to point to working alternatives where necessary.

To move forward:

  • For Wikidata only: let's reduce the time autopatrolled actions are stored to two weeks, if that's okay with @Lydia_Pintscher.

I'm trying to understand the implications of this. Would it basically mean these changes (all that have been autopatrolled and are older than 2 weeks) would no longer show up in recent changes and watchlists? And that https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&viewPanel=7&from=1655308199823&to=1655912999823 would no longer be a good measure of how many edits are made in a month and how that changes over time?

Since this has been blocking some more features, I've been thinking about this again, and I'll brain-dump some more ideas that are low-hanging fruit:

  • On large wikis, if there is no condition on rev_timestamp, just add a condition limiting it to the last 24 hours. This way, no matter how complex the filters, the maximum number of rows scanned will be cut to 1/30th.
  • Category membership changes take up a lot of rows, and they are also not conceptually the same thing. Move them to another table.
  • Wikidata entity change rows in client wikis are an absolute mess. Investigate why, and fix that.
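The first idea amounts to injecting a default timestamp floor into the query conditions when the caller didn't supply one. A minimal sketch, with hypothetical names and condition format (not MediaWiki's actual query builder; the RC table's own timestamp column is rc_timestamp):

```python
from datetime import datetime, timedelta

def with_default_time_bound(conds: dict, now: datetime) -> dict:
    """If no timestamp condition was supplied, bound the scan to 24 hours.

    `conds` is a toy {column-expression: value} mapping; an explicit
    timestamp range chosen by the user is left untouched.
    """
    if not any(k.startswith("rc_timestamp") for k in conds):
        cutoff = (now - timedelta(hours=24)).strftime("%Y%m%d%H%M%S")
        conds = {**conds, "rc_timestamp >=": cutoff}
    return conds

now = datetime(2024, 6, 2, 12, 0, 0)
# No timestamp condition: the default 24-hour floor is added.
print(with_default_time_bound({"rc_bot": 0}, now))
# {'rc_bot': 0, 'rc_timestamp >=': '20240601120000'}
# Explicit range: left as the user chose it.
print(with_default_time_bound({"rc_timestamp >=": "20240520000000"}, now))
# {'rc_timestamp >=': '20240520000000'}
```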

On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hours. This way, no matter how complex the filters, the maximum number of rows being scanned will be cut to 1/30th.

The default time range of Recent Changes is 7 days, but users can select a longer one, up to 30 days. Which features use the recentchanges table without a time range?

On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hours. This way, no matter how complex the filters, the maximum number of rows being scanned will be cut to 1/30th.

The default time range of Recent Changes is 7 days, but user can select a longer one for up to 30 days.

I am aware. I literally said "if there is no condition on rev_timestamp". This is to improve the "default" mode, which already improves UX a lot. If a person chooses a complex filter and explicitly picks a long time span, it's fine if the query gets killed.

Some ideas that I believe are still worth looking into before doing anything too drastic:

Redundancy of rc_new, rc_type and rc_source
MariaDB [wikidatawiki_p]> SELECT DISTINCT rc_new, rc_type, rc_source FROM recentchanges;
+--------+---------+---------------+
| rc_new | rc_type | rc_source     |
+--------+---------+---------------+
|      0 |       3 | mw.log        |
|      0 |       0 | mw.edit       |
|      1 |       1 | mw.new        |
|      0 |       6 | mw.categorize |
|      0 |       5 | wb            |
+--------+---------+---------------+

See also T36320#8440663, T244080 and T74157.
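The redundancy claim can be checked mechanically: if each rc_source value maps to exactly one (rc_type, rc_new) pair, the latter two columns carry no extra information. A quick sketch over the five combinations the DISTINCT query returned:

```python
# The five (rc_source, rc_type, rc_new) combinations observed on wikidatawiki.
observed = {
    ("mw.log", 3, 0),
    ("mw.edit", 0, 0),
    ("mw.new", 1, 1),
    ("mw.categorize", 6, 0),
    ("wb", 5, 0),
}

# If rc_source values are unique across the combinations, (rc_type, rc_new)
# is fully derivable from rc_source, i.e. the columns are redundant.
sources = [src for src, _type, _new in observed]
assert len(sources) == len(set(sources))

derived = {src: (rc_type, rc_new) for src, rc_type, rc_new in observed}
print(derived["mw.new"])  # (1, 1)
```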

Wait... why is there rc_source = 'wb' on Wikidata?

[obrazek.png: screenshot]

MariaDB [wikidatawiki_p]> SELECT AVG(IF(rc_source = 'wb', 1, 0)) AS ratio FROM recentchanges;
+--------+
| ratio  |
+--------+
| 0.0346 |
+--------+

3.5% of entries in Wikidata's recent changes are unreachable for everyone.

Rewrite filters with "is not equal"

Some queries filter on rc_type != #. Perhaps they can be rewritten as rc_type IN (...), and possibly use the rc_name_type_patrolled_timestamp index.
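Since rc_type takes values from a small, closed set (per the DISTINCT listing above), the rewrite is mechanical: exclude the unwanted value and enumerate the rest, which gives the planner an equality list it can serve from an index. A toy demonstration that the two forms are equivalent:

```python
import sqlite3

# Known rc_type values, following the DISTINCT result above:
# mw.edit=0, mw.new=1, mw.log=3, wb=5, mw.categorize=6.
RC_TYPES = [0, 1, 3, 5, 6]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE recentchanges (rc_id INTEGER, rc_type INTEGER)")
con.executemany("INSERT INTO recentchanges VALUES (?, ?)",
                [(i, t) for i, t in enumerate(RC_TYPES)])

# Original form: negation, which is hard to serve from an index.
neq = con.execute(
    "SELECT rc_id FROM recentchanges WHERE rc_type != 3").fetchall()

# Rewritten form: enumerate the complement as an IN list.
allowed = [t for t in RC_TYPES if t != 3]
placeholders = ",".join("?" * len(allowed))
rewritten = con.execute(
    f"SELECT rc_id FROM recentchanges WHERE rc_type IN ({placeholders})",
    allowed).fetchall()

assert neq == rewritten
print(rewritten)  # [(0,), (1,), (3,), (4,)]
```

The rewrite only works because the value set is closed; a new rc_source/rc_type value would silently change the query's meaning unless the list is kept in sync.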

Use the "enum trick" for other indexed fields
// rc_new is not an ENUM, but adding a redundant rc_new IN (0,1) gives mysql enough
// knowledge to use an index merge if it wants (it may use some other index though).
$conds += [ 'rc_new' => [ 0, 1 ] ];

The same trick could be applied to rc_type and rc_patrolled, too.

Allow filtering on only one change tag

Filtering on multiple (at least two) change tags requires DISTINCT and GROUP BY. It doesn't work in "strict mode" (T294775). It can be rewritten using a subquery, but that will probably perform worse than a JOIN.

Abuse filters can be updated so that they tag changes with an "umbrella" tag in addition to their distinctive tags.

This would not concern negative tag filters (exclusion) because they use LEFT JOIN.

See also T298225.
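To make the multi-tag cost concrete: each matching tag contributes one joined row, so an rc entry carrying both tags appears twice unless the query groups on rc_id. One way to sketch the "has both tags" query (schema heavily simplified; real tags go through change_tag/change_tag_def):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE recentchanges (rc_id INTEGER);
    CREATE TABLE change_tag (ct_rc_id INTEGER, ct_tag TEXT);
    INSERT INTO recentchanges VALUES (1), (2);
    INSERT INTO change_tag VALUES (1, 'mobile edit'), (1, 'visual edit'),
                                  (2, 'mobile edit');
""")

# Entry 1 joins twice (once per tag), so GROUP BY + HAVING is needed to
# keep only entries carrying both requested tags exactly once in the output.
rows = con.execute("""
    SELECT rc_id FROM recentchanges
    JOIN change_tag ON ct_rc_id = rc_id
    WHERE ct_tag IN ('mobile edit', 'visual edit')
    GROUP BY rc_id
    HAVING count(DISTINCT ct_tag) = 2
""").fetchall()
print(rows)  # [(1,)] -- only rc 1 carries both tags
```

This is exactly the GROUP BY the comment above refers to; the proposed "umbrella" tag would replace the two-tag condition with a single equality and drop the grouping entirely.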

Thanks!

FWIW:

mysql:research@s1-analytics-replica.eqiad.wmnet [enwiki]> select rc_source, count(*) from recentchanges group by rc_source ;
+---------------+----------+
| rc_source     | count(*) |
+---------------+----------+
| mw.categorize |  3045948 |
| mw.edit       |  4654606 |
| mw.log        |   250956 |
| mw.new        |   197564 |
| wb            |  1030227 |
+---------------+----------+
5 rows in set (15.148 sec)

mysql:research@s1-analytics-replica.eqiad.wmnet [enwiki]> select count(*) from recentchanges;
+----------+
| count(*) |
+----------+
|  9179513 |
+----------+
1 row in set (2.042 sec)

recentchanges currently contains a number of unindexed columns (e.g. rc_old_len, rc_new_len, rc_log_type, rc_log_action, rc_params) which simply mirror data in the revision and logging tables (and, due to the polymorphism of recentchanges entries, are in many cases empty). Is it intended that this data be duplicated instead of queried from the original?

About rc_ip:

This column is used to retroactively autoblock a user when CheckUser is not installed (CheckUser has its own implementation of the PerformRetroactiveAutoblock hook), and is otherwise unrelated to Recent Changes. However, to perform autoblocks we do not need to store IPs in the recentchanges table; instead we can create a dedicated table (such as "recentip") for it, with three columns (rip_user, rip_ip, rip_timestamp). This table would be unused and empty in WMF production (but not on the Beta Cluster).
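The proposed table could look like the following sketch, using the three columns named above. The DDL is illustrative SQLite, not a schema change that exists in MediaWiki; the lookup at the end is the retroactive-autoblock access pattern the table would serve.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical "recentip" table, per the proposal above.
con.execute("""
    CREATE TABLE recentip (
        rip_user INTEGER NOT NULL,   -- user to autoblock retroactively
        rip_ip TEXT NOT NULL,        -- IP the user edited from
        rip_timestamp TEXT NOT NULL  -- when, so old rows can be purged
    )
""")
# Retroactive autoblock looks up by user; an index on (user, timestamp)
# also supports time-bounded purging.
con.execute("CREATE INDEX rip_user_timestamp ON recentip (rip_user, rip_timestamp)")

con.execute("INSERT INTO recentip VALUES (42, '198.51.100.7', '20240601120000')")

# On block: find the recent IPs of the blocked user and autoblock them.
ips = [r[0] for r in con.execute(
    "SELECT rip_ip FROM recentip WHERE rip_user = 42")]
print(ips)  # ['198.51.100.7']
```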


A further idea (not directly related to this task and only marginally related to rc_ip): we could also move autoblock data to another table ("autoblock"), which would only require four columns (ab_id, ab_block_id, ab_timestamp, ab_ip_hex). Splitting autoblocks out of the block table into a dedicated feature may prevent a number of autoblock-related issues such as https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1129606. It would add a new table, though.

Change #1180243 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/core@master] WIP: Introduce RecentChangeSelectQueryBuilder

https://gerrit.wikimedia.org/r/1180243

recentchanges currently contains a number of unindexed columns (e.g. rc_old_len, rc_new_len, rc_log_type, rc_log_action, rc_params) which simply mirror data in the revision and logging tables (and, due to the polymorphism of recentchanges entries, are in many cases empty). Is it intended that this data be duplicated instead of queried from the original?

As I said above, the table is the de-normalization/summary table of MediaWiki. The approach to improving its performance is different from what we usually do, and those changes wouldn't give much benefit while actually making things worse.

How are rc_log_type, rc_log_action and rc_params used? There are no indexes on these columns, and it seems no filter at Special:RecentChanges uses them.

Queries. For example, Nuke uses the first two to search for new uploads, and the API's action=query&list=allusers uses rc_log_type to implement the activeusers parameter (yuck! that's an unindexed query in an API GET request). Wikibase queries by rc_params, which don't seem to be log parameters; contrary to the documentation, rc_params stores other kinds of data for non-log RC entries.

(1) Nuke should instead query other tables such as filerevision or logging, paralleling T379147: Update Nuke to use the Revision table instead of Recent Changes.
(2) I see querying rc_log_type != newusers as a common pattern. Why not introduce a new rc_source value for that? (Alternatively, we could exclude all logs from the query, but this would change how active users are defined.)
(3) AbuseFilter's buildTestConditions can be fixed as follows: (a) if we want edits, then we don't need rc_log_type, rc_log_action and rc_params; (b) if we want logs, query the logging table instead; (c) if we want both, query both.
(4) NewFilesPager has a condition on rc_log_type; this condition can simply be removed.
(5) Wikibase's usage of rc_params: see T177802: Move (at least Wikibase) rc_params into a separate table, allow for de-duplication.

Change #1180243 abandoned by Zabe:

[mediawiki/core@master] Introduce RecentChangeSelectQueryBuilder

https://gerrit.wikimedia.org/r/1180243