
Scalability issues of recentchanges table
Open, Needs Triage, Public

Description

The recentchanges table powers one of the core functionalities of Wikipedia and its sister projects. It is basically the backbone of every patrolling tool, from Special:RecentChanges to Watchlist to Huggle, and so on.

Since most of these tools provide a practically free-form SQL query builder, they can easily lead to "full table scan" queries. That is partly by design: the rc table is supposed to be small and nimble, but it is very hot and is read in many different ways (it has 10 indexes at the moment). The problem starts when a wiki gets really large, whether through bot edits or simply the size of the wiki. For example, Wikidata's rc table currently has 22M rows, which is clearly bigger than it should be. At the moment only three wikis have issues with the recentchanges table: English Wikipedia, Wikidata, and Wikimedia Commons.
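For illustration, here is a rough sketch of the kind of ad-hoc query such tools end up issuing. The column names follow the MediaWiki recentchanges schema, but the specific filter combination is hypothetical; the point is that once the combination of predicates no longer matches one of the existing indexes, the query degenerates towards scanning a large slice of the table.

-- Hypothetical filter combination, not a query any particular tool emits.
-- Each predicate is cheap on its own, but together they may not be covered
-- by any of the ~10 indexes on recentchanges.
SELECT rc_id, rc_timestamp, rc_namespace, rc_title
FROM recentchanges
WHERE rc_bot = 0                          -- hide bot edits
  AND rc_patrolled = 0                    -- unpatrolled edits only
  AND rc_namespace = 0                    -- main namespace
  AND rc_timestamp >= '20220601000000'    -- placeholder cutoff
ORDER BY rc_timestamp DESC
LIMIT 50;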

Here are several bugs caused by the rc table getting large:

It is also blocking the addition of more features, such as T179010: [C-DIS][SW] Re-enable Wikidata Recent Changes integration on Commons.

The underlying problem is that the RC table is trying to juggle two things:

  • Detection of vandalism and problematic edits, which requires complex queries on non-patrolled edits (a small subset of all edits), e.g. "get me all edits that have a high ORES score and also a certain tag".
  • A general pulse check of what's going on, for example the watchlist ("what is happening on pages I care about?"). This requires seeing all edits, but with simpler, narrower queries: no complex magic, just the firehose of edits (a simplified query sketch follows this list).
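To make the contrast concrete, here is a hedged sketch of the watchlist query shape, using the real recentchanges and watchlist column names but heavily simplified (the query MediaWiki actually builds has more conditions): it has to touch every recent edit to the user's watched pages, but the predicates themselves are simple.

-- Simplified watchlist-style query: a firehose of edits, but no complex filtering.
SELECT rc_timestamp, rc_namespace, rc_title
FROM watchlist
  JOIN recentchanges
    ON rc_namespace = wl_namespace AND rc_title = wl_title
WHERE wl_user = 12345                     -- placeholder: the viewing user's id
ORDER BY rc_timestamp DESC
LIMIT 250;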

Proposed solutions (some mutually exclusive, some not):

  1. Normalizing the table: this is not a good idea and won't help much. This table is the de-normalization table of MediaWiki; it's a summary table.
  2. Reduce the retention time of rc actions in large wikis. Currently it's 30 days. This number did not come from any research; it's used just because it's a round number. In my role as a Wikipedia volunteer, I barely ever looked further back than a week.
  3. Split autopatrolled[1] actions out of the rc table into something like a patrolled_recentchanges table, where querying would be much more limited (e.g. no tag filtering, no ORES score filtering, etc.).
  4. Reduce the retention time of autopatrolled actions. This is basically a hybrid of solutions 2 and 3: we reduce the size without losing much (a rough SQL sketch of what this amounts to follows the footnote below).
  5. Store one week of data in one table and the rest of the month in another table (or one week in the first table and the whole month in the second, meaning the first week would be duplicated), and make queries on the second table more restricted, for example no tag filtering, no ORES filtering, etc.
  6. Move it to a NoSQL solution. While I can see the reasoning, I'm not sure it would fix anything; such a massive firehose of data would be slow to query in any system.
    1. That being said, using Elasticsearch sounds like a good idea (see T307328#7894820).

[1] "Autopatrolled" is useful only for Wikidata and Commons, because "RC patrolling" is not enabled in English Wikipedia. RC patrolling is a MediaWiki feature for finding vandalism and reducing the workload of patrollers. I find it disheartening that it's not enabled there (which could be done with a bit of community consultation and some small design fixes).
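To illustrate solution 4 on the database side, here is a rough sketch only (assuming rc_patrolled = 2 is the autopatrolled state, as in current MediaWiki; the cutoff and batch size are placeholders, and in practice this would be handled by the existing purge job rather than a one-off query):

-- Sketch: prune autopatrolled rows older than ~2 weeks, in small batches
-- to avoid long-running writes on a hot table. Repeat until no rows match.
DELETE FROM recentchanges
WHERE rc_patrolled = 2                    -- autopatrolled (assumed constant)
  AND rc_timestamp < '20220601000000'     -- placeholder: ~14 days ago
LIMIT 1000;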

Event Timeline

Restricted Application added a subscriber: Aklapper.

Reduce the retention time of rc actions in large wikis. Currently it's 30 days. This number did not come from any research; it's used just because it's a round number. In my role as a Wikipedia volunteer, I barely ever looked further back than a week.

It's 30 days because it's used for watchlists (and to a lesser extent, RecentChangesLinked). If it was just RC then it could be shorter. Domas shortened it and wanted to make it shorter still, but we didn't want to go any shorter because of the impact on watchlists.

Move it to a NoSQL solution. While I can see the reasoning, I'm not sure it would fix anything; such a massive firehose of data would be slow to query in any system.

I would consider moving it to Elastic and scaling it out. Lucene/ElasticSearch is optimised for large intersections and can supposedly scan faster than MySQL, and sharding is easier.

You have to consider the benefits of a feature alongside the costs. We have more money now per edit than we did in ~2005 when the system was at its most stressed. So just being expensive is not necessarily a reason to cut back on features.

We should probably have a closer look at Wikidata use cases and optimisation. Does anyone use watchlists on Wikidata? The expiry time doesn't have to be the same everywhere.

I could definitely see shortening the expiry on some wikis being a workable idea. As an enwiki admin I rarely (if ever) look back past 1 week on RecentChanges or Watchlists.

On smaller wikis, though, 30 days is definitely adequate. When I did some testing recently, even on a medium-sized wiki, just applying an unregistered-edits filter made RecentChanges display more than a week's worth of edits.

Does anyone use watchlists on Wikidata? The expiry time doesn't have to be the same everywhere.

I did when I edited there, mostly to catch vandalism to widely used items and BLP items, and to monitor community discussions. (Though I was also one of the pathological sorts who used the 'add pages you edit to watchlist' option; I couldn't tell you why I had it on.)

As an enwiki admin I rarely (if ever) look back past 1 week on RecentChanges or Watchlists.

I take longer (wiki)vacations, which would be the primary reason to support more than a week. Once you're out past a month, there are probably better ways to catch up (mostly just reading the pages that have changed in that month). "Once-a-weekers" are also the classical reason that RFA is a 1-week process. T4877: Obtain a list of edits to watched pages since last visit and T10681: Do not group changes by day in enhanced recent changes seem tangentially relevant.

One clarification: what I'm proposing here is about only three wikis: enwiki, wikidatawiki, commonswiki. I have no desire or plan to change anything outside of these wikis, and the solution for each of the three can be different from the others.

Adding developers of Wikidata team (and its PM) for the specific issue of rc table in wikidata, maybe we can spin it off to its own ticket.

I would consider moving it to Elastic and scaling it out. Lucene/ElasticSearch is optimised for large intersections and can supposedly scan faster than MySQL, and sharding is easier.

That seems like the best long-term solution to me. Elasticsearch is intended for exactly the kind of complex ad-hoc queries that Special:Recentchanges tries to provide. Enwiki has 55M pages and even relatively slow search operations like fulltext search run basically immediately on it; I'm sure the 20M watchlist entries wouldn't be a problem. There are drawbacks (robustness, hardware costs) but they seem manageable.

Converting recent changes to Elasticsearch would be a big engineering project though - not something that's likely to happen any time soon.

@Ladsgroup how urgent of a problem is this to solve? Is this something that should be discussed in annual planning that is happening right now, for example?

In principle, Elasticsearch looks like a good idea. A few constraints in our context:

  • We probably don't want to mix this use case with the current Full Text Search cluster, so we would need to set up an additional ES cluster.
  • We treat Elasticsearch as a strictly secondary storage solution. Updates can (and will) be lost. We need to be able to recreate the data from a primary source at all times.
  • This use case probably does not fit into our current update pipeline; another pipeline will need to be developed (or the current one changed significantly).

Reduce the retention time of rc actions in large wikis. Currently it's 30 days. This number did not come from any research; it's used just because it's a round number. In my role as a Wikipedia volunteer, I barely ever looked further back than a week.

It's 30 days because it's used for watchlists (and to a lesser extent, RecentChangesLinked). If it was just RC then it could be shorter. […]

Since I got a full-time job, I've not been able to review my Wikipedia watchlist weekly; it's more like once every 3-4 weeks. I think as we grow further and try to spread the load across more patrollers, each doing a little bit, it would make sense to plan for maintaining or growing this window, not shrinking it. Every day you shrink the watchlist window, you essentially require a smaller set of more active reviewers rather than a widespread one, which seems contrary to (my interpretation of) the Movement Strategy.

I imagine that once this is in Elasticsearch it might be fairly trivial to ramp the RC age from 30 days back up to the 90 days we were on before, at relatively little cost, as the cost would no longer scale linearly with what has to be scanned at once.

I guess we could approach this somewhat similarly to Prefixsearch and full-text search, which we also do through a service class that CirrusSearch can override. We'd need to build something similar for RecentChanges.

@Ladsgroup how urgent of a problem is this to solve? Is this something that should be discussed in annual planning that is happening right now, for example?

I think it needs a bit of intervention ASAP to specifically handle Wikidata's case, but for the rest of the wikis it can wait for FY23-24. So IMHO a band-aid solution for Wikidata plus a long-term solution would be good.

Thanks for flagging this @Ladsgroup. Is there anything we can do on our side of things at the moment?

Do people on Wikidata and Commons rely on watchlists much? Messing with the enwiki watchlist seems scary, but Commons / WD have atypical QA/anti-abuse needs and workflows, so maybe just reducing the RC period unconditionally is less of an issue there? Or just excluding bot edits, which make up most edits on Wikidata and don't show up on most people's watchlists anyway?

Wikidata RecentChanges and Watchlist would benefit from more granular filtering (see T43686: [Story] Ability to filter recent changes and watchlist on Wikidata and its subtasks). There are a number of other tasks about filtering such lists (random examples: T23674, T42535, T12379, T124214, T45901).

Another thing to consider: if recentchanges is moved out of SQL it may be a pain to query changes for a large watchlist (about 100k).

I rely heavily on my Wikidata watchlist. I generally just look at the last 7 days, but longer is useful sometimes. I think the proposed solutions 3 and 4 above (splitting out autopatrolled changes) would work well for my purposes - 99% of the edits I check are the unpatrolled ones, so if we could keep 30 days of unpatrolled changes and cut the rest to 14 days or less, that would be fine for me. By the way, I find ORES essentially useless for evaluating the quality of Wikidata edits.

In the case of Commons, 71% of the 19M rows in the rc table come from the mw.categorize source (T9148: Watch edits which add or remove pages from a category). Maybe that can be disabled on Commons?
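Presumably that share was measured with something along these lines (a sketch; the actual query isn't shown here):

-- Breakdown of rc rows by source, showing the share of categorize events.
SELECT rc_source, COUNT(*) AS rows_per_source
FROM recentchanges
GROUP BY rc_source
ORDER BY rows_per_source DESC;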

In the case of Commons, 71% of the 19M rows in the rc table come from the mw.categorize source (T9148: Watch edits which add or remove pages from a category). Maybe that can be disabled on Commons?

I took a look around to see whether this is working as expected or whether something is going wrong, i.e. is 70% a big and unexpected number?

Firstly, I had completely forgotten RC_CATEGORIZE still exists and how it works. In a nutshell, when a links update job is re-rendering a page, it will emit RC events for effective additions to and removals from the category links table.

This largely reflects four kinds of changes:

  1. Amplify each regular edit that directly changes the categories on a page. E.g. editing page X to add [[Category:Foo]] and {{Bar}} will emit 1 RC event for the edit as usual, plus 2 more RC events, one for Category:Foo and one for Category:Pages_with_Bar, each with an edit summary describing that X was added to the category.
  2. Amplify each page creation and upload. When uploading a file on Commons, it will typically be in several categories right away. e.g. "CC-BY-4.0", "Media missing source", "Media missing description", and either "Media missing regular categories" or one or more regular categories as specified by the uploader. This upload will then produce N categorize events, one for each of these categories.
  3. Amplify edits to templates and Wikidata entities. If a commonly used template is modified to involve some kind of new hidden category or to modify how it uses these, that will cause a long chain reaction of many categorize events. Similarly, a Wikidata edit will emit events on Commons. E.g. a category about a person has categories based on the Wikidata infobox such as "actors from Kentucky", "actors of Belgian descent", "Fantastic Beasts cast", etc.
  4. Report on completion of maintenance by removing from an automatic hidden maintenance category. E.g. uploads with automatic category "Media missing regular categories" will emit an edit to remove from that category when an edit happens to the file that adds a regular category, similar for edits that fill in missing source, author, description, etc.

Note specifically that RC categorize events use the category as the subject. This means that if you watch an article on Wikipedia, and an indirect change from Wikidata or a template modifies which categories it is in, you will not see that on your watchlist even if you enable showing "Category changes". These changes are only provided to users who watch from the perspective of the category.

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

Only passing by to support everything Tim said (watchlists are the main limiting factor; RC by itself could be trimmed) and to make sure T171027 is in people's minds when thinking about refactoring - both as an example of something broken due to the lack of scalability, and to take into account the needs of users when some functionality was temporarily disabled. RC and watchlists are something that many power users rely on, so extra care should be taken. :-)

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

Thanks for the thorough investigation. I looked to see if removing that would help. Maybe the query is wrong somehow, but it's only 30K rows :/

mysql:research@s4-analytics-replica.eqiad.wmnet [commonswiki]> select count(*) from recentchanges join page_props on pp_page = rc_cur_id where rc_source = 'mw.categorize' and pp_propname = 'hiddencat' limit 5;
+----------+
| count(*) |
+----------+
|    30004 |
+----------+
1 row in set (2.335 sec)

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". To my knowledge as a user and administrator of Wikimedia Commons, these are primarily monitored (if at all) through navigating to that category to work off the list in real-time, e.g. categories with missing information or requests for speedy deletion etc., not through one's watchlist where it would remain long after the request for information or admin action is dealt with.

I usually work on Commons protected edit requests for interface administrators through my watchlist – I find it much more useful to see the additions and removals there than to look at the category in full (which mostly contains pages that have been there for a while and that I can’t handle for one reason or another).

I also agree with @ArthurPSmith that options 3 and 4 seem good; I suspect option 4 is much easier to implement, and could be a good shorter-term solution (with ElasticSearch as a longer-term perspective).

I think with that context, it might make sense to exclude maintenance categories and other categories marked as "hidden". […]

Thanks for the thorough investigation. […] Maybe the query is wrong somehow but it's only 30K rows :/

mysql:research@s4-analytics-replica.eqiad.wmnet [commonswiki]> select count(*) from recentchanges join page_props on pp_page = rc_cur_id where rc_source = 'mw.categorize' and pp_propname = 'hiddencat' limit 5;
+----------+
| count(*) |
+----------+
|    30004 |
+----------+
1 row in set (2.335 sec)

I found two issues.

  1. It seems a lot of maintenance categories that are marked as hidden in wikitext (sometimes very indirectly, via layers of templates) are missing the hiddencat page property.

Perhaps this is fallout from the link table deletion incident (ref T299095 etc.), or perhaps refreshLinks jobs were lost at some point: a template was broken, and the edit fixing it had its jobs lost.

page_id   | page_title                                  | cat_files | hiddencat
105635278 | Media_needing_categories_as_of_22_May_2021  | 21471     | 0
105157987 | Media_needing_categories_as_of_11_May_2021  | 20844     | 0
105077809 | Media_needing_categories_as_of_9_May_2021   | 20526     | 0
105589748 | Media_needing_categories_as_of_21_May_2021  | 20501     | 0
104920655 | Media_needing_categories_as_of_5_May_2021   | 19754     | 0
105194480 | Media_needing_categories_as_of_12_May_2021  | 18434     | 0
104834505 | Media_needing_categories_as_of_3_May_2021   | 18401     | 0
105227601 | Media_needing_categories_as_of_13_May_2021  | 18027     | 0
105543640 | Media_needing_categories_as_of_20_May_2021  | 17529     | 0

I ran refreshLinks.php on commonswiki for the parent of the Media_needing_categories_as_of_* categories (there were only 2000 or so and they're very tiny and quick to re-parse). This fixed those specifically, but there may be more of these. However, via Quarry I found no categories that are named like a maintenance category, have no hiddencat, and have > 50 files site-wide, so there's likely nothing else major that would affect recent changes.

After that, re-running your query returned about 5% more rows in relative terms, but still very few in the absolute sense.

From https://quarry.wmcloud.org/query/38624 (rounded to two digits)
join page_props on pp_page = rc_cur_id
where rc_source = 'mw.categorize'
-- ...
31,000     | categorize-hiddencat
12,000,000 | categorize-all

The following takes your query and is meant to show which (hidden) categories dominate RC categorization events:

SELECT count(*), rc_cur_id, rc_title
FROM recentchanges JOIN page_props ON pp_page = rc_cur_id
WHERE rc_source = 'mw.categorize'
GROUP BY rc_cur_id
-- ...
_count | rc_cur_id | rc_title
15     | 26925685  | Non-empty_category_redirects
14     | 31761180  | Pages_with_object_coordinates_from_Wikidata
14     | 12879996  | Non-empty_category_redirects
14     | 117676627 | Hidden_categories
14     | 117624850 | Non-empty_category_redirects
13     | 46725206  | Non-empty_category_redirects
13     | 33280299  | Hidden_categories
13     | 108085116 | Hidden_categories

This result exposes our second problem. Despite the GROUP BY clause, we still see the same category title multiple times. What's going on?

rc_title                          | rc_cur_id | rc_this_oldid | rc_source     | rc_params
CC-Zero                           | 116846263 | 648488001     | mw.categorize | a:2:{s:10:"hidden-cat";b:1;s:5:"added";b:1;}
Footbridges_in_Schleswig-Holstein | 116846265 | 648488004     | mw.categorize | a:2:{s:10:"hidden-cat";b:0;s:5:"added";b:1;}
CC-Zero                           | 116846264 | 648488002     | mw.categorize | a:2:{s:10:"hidden-cat";b:1;s:5:"added";b:1;}

Note how the rc_cur_id values differ even for the same title. And indeed https://commons.wikimedia.org/wiki/?curid=116846263 is the file page, not the category page.

So despite categorize events materializing in the user interface as having the category as their subject, and this being reflected in rc_namespace/rc_title, the rc_cur_id column is actually populated from the target page (or possibly the page causing the chain reaction, e.g. sometimes a template). I imagine you thought, like me, that this was probably in rc_params or rc_comment instead of it being frankenstein-ed like this.

From https://quarry.wmcloud.org/query/63402 (rounded to two digits)
8,300,000  | categorize-hiddencat
12,000,000 | categorize-all

8.3/12=69.2%

I've corrected it to join page_props via page.page_id instead of rc_cur_id, which shows that hiddencat categorizations are responsible for 70% of rc-categorize events (as opposed to 0.25%).
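For reference, the corrected shape is roughly the following sketch (assuming the standard page/page_props schema; the actual Quarry query may differ in details):

-- Resolve the category page by namespace/title instead of trusting rc_cur_id,
-- which for categorize events points at the member page, not the category.
SELECT COUNT(*)
FROM recentchanges
  JOIN page ON page_namespace = rc_namespace AND page_title = rc_title
  JOIN page_props ON pp_page = page_id AND pp_propname = 'hiddencat'
WHERE rc_source = 'mw.categorize';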

Thanks for the investigation!!!

To move forward:

  • For Wikidata only: let's reduce the retention time for autopatrolled actions to two weeks, if that's okay with @Lydia_Pintscher.
  • For Commons: let's stop injecting rc rows when the changed category is a hidden category. If a hidden category is actually being used in watchlists, it can simply be made not hidden.

Are people okay with this?

Looking at the data, it doesn't look like we are getting any more RC entries than we have done in the past 3 years

[Attachment: image.png, chart of RC entry counts over the past 3 years]

Is there something causing this to become more of an issue now?
Is there a quantity of rows we should strive to avoid going over?
Or are we just in the same status quo that we have been for the past years and would generally like to do something to see less timeouts etc?

Some notes:

  • The total number of rows in the rc table is not the full story. For example, since it's being used heavily, any increase in pressure from data in other tables can drastically increase the pressure on the rc table; everything is connected.
  • Just looking at the number doesn't tell you how the data is distributed. For example, an increase in non-bot edits can have a different impact than an increase in bot edits, given the different indexes and access patterns.
  • Being large a couple of years ago also doesn't mean it wasn't a problem back then. We used to have basically an outage a month caused by Wikidata's db, and we still have issues with that. Just last Sunday all of Wikipedia went down for a couple of minutes because of the pressure on s8.
  • The RC table is designed to be small and nimble; for example, it has ten indexes. Breaking that assumption causes all sorts of unexpected issues. I want to keep that assumption true not just for current use cases but also for future ones.
  1. The de-normalized nature of the table makes it convenient to use, e.g. for tools. And some info in the table even seems hard to get otherwise (e.g. whether an edit was performed by a user that had a bot flag at the time of making the edit). I guess a lot of tools and analytics efforts might rely on this very table, and reducing its scope might break stuff.
  2. Having too much faith in autopatrolled actions sounds problematic to me (what exactly are the criteria for an edit to become autopatrolled?). On the other hand, my assumption would be that for most on-wiki use cases it would practically not matter if we reduced the retention time or limited the number of rows in the table. I am however not sure about off-wiki use cases.

-> Do we have a good understanding of the consequences of these suggestions for tools and analytics? I would prefer us to be able to point to working alternatives where necessary.

To move forward:

  • For Wikidata only: let's reduce the retention time for autopatrolled actions to two weeks, if that's okay with @Lydia_Pintscher.

I'm trying to understand the implications of this. Would it basically mean these changes (all that have been autopatrolled and are older than 2 weeks) no longer show up on recent changes and watchlist? And https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&refresh=1m&viewPanel=7&from=1655308199823&to=1655912999823 would no longer be a good measure for how many edits are made in a month and how that changes over time?