Page MenuHomePhabricator

[SPIKE] Confirm which recentchanges db-related changes are required for KR 1.3[8H]
Closed, ResolvedPublicSpike

Description

DBA support for RecentChanges (Phab T307328)

Why it matters
The 1.3 KR is all about guiding editors to patrolling tasks and demonstrating impact. The features we develop for this KR are going to be working with recent changes which has run into DB scalability limitations that require SRE-owned schema and/or SRE expertise. We currently have an "avoid" list of wikis when it comes to adding more joins against the recentchanges table:

# top 25 wikis in terms of recentchanges; to avoid
# see https://phabricator.wikimedia.org/P74607
avoid_wikis = [ "sv", "ko", "pt", "pl", "fa", "frwiktionary", "ca", "uk", "mgwiktionary", "ur", "ja", "zh", "arz", "enwiktionary", "it", "de", "thwiktionary", "vi", "es", "ru", "ar", "fr", "en", "wikidatawiki", "commonswiki", ]

Proposal impact summary

proposal from T307328improvementuser impacteffort
On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hoursmediumnonesmall
drop rc_new & rc_type, keep rc_sourcemediumnonemedium
Identify and resolve wikibase ingest issuessmallnonemedium
Move category membership changes to another tablelargemediumlarge
Store one week of data in one table, the rest of the month in another tablex-largelargex-large
Reduce the time of storage rc actions to one weekx-largelargesmall
Split autopatrolled actions out of rc tablemediumx-largelarge
Reduce the time to store autopatrolled actions.mediumx-largelarge
Move it to a NoSQL solutionunknownunknownx-large
Normalizing the tablenonenonelarge

Recommended proposals from the mentioned ticket

[...] - On large wikis, if there is no condition on rc_timestamp, just add a condition to be the last 24 hours. This way, no matter how complex the filters, the maximum number of rows being scanned will be cut to 1/30th. [...]

This might warrant adding a ui hint where it actually comes into play, but seems like an easy win considering that busy wikis only display edits from the last few minutes or seconds due to the volume of edits. This would help reduce the db pressure brought by the recent changes page, but wouldn't reduce the load brought by lookups / joins against the recentchanges tables from more specific queries, either on the recent changes page or from other features we build that query that table

[...]

  1. Redundancy of rc_new, rc_type and rc_source
MariaDB [wikidatawiki_p]> SELECT DISTINCT rc_new, rc_type, rc_source FROM recentchanges;
+--------+---------+---------------+
| rc_new | rc_type | rc_source     |
+--------+---------+---------------+
|      0 |       3 | mw.log        |
|      0 |       0 | mw.edit       |
|      1 |       1 | mw.new        |
|      0 |       6 | mw.categorize |
|      0 |       5 | wb            |
+--------+---------+---------------+

[...]

The recommendation is to keep rc_source and drop rc_new and rc_type. rc_type would be most efficient, but difficult to keep extensions from using colliding values.

[...] category membership changes take a lot of rows, they are also not conceptually the same. Move them to another table [...]

    • This type of rc (type 6, RC_CATEGORIZE) represents a significant portion of rows on the wikis that are currently on our "avoid" list [1]
    • I did a little digging to see how the filters for this change type impact usability
      • based on a reading of our research, category changes seem to be more confusing than useful to many editors on the rc page.
      • a preliminary look at the output of webrequest-rc_showcategorization.py indicates that the "hide category changes" default is often removed, meaning that the functionality of that filter needs to exist somewhere
  • @Ladsgroup proposed moving category changes to a new special page to prevent having to join a new category changes table against the recent changes table. There isn't really a use case for viewing category changes alongside all of the rest, which is why they are hidden by default. I think this proposal makes a lot of sense. This would be joint work that would involve
    • adding the new table; writing category changes to recent changes + the new table for a period
    • adding the new special page
    • removing references to category changes from recentchanges / watchlist pages
    • an open question is how would this impact watchlists?
  • Another possibility would be to keep the category changes in the RecentChanges UI, but make them exclusive of other changes. Eg. selecting category changes deselects all other kinds of changes; selecting any kind of change besides category changes deselects category changes. Avoiding presenting the two types together should preclude the need to join category changes and recent changes tables in queries. This would increase the techincal debt of the recent changes page in order to serve "non-rc" content sometimes.

Areas of inquiry

[...] wikidata entity change rows in client wikis are an absolute mess. Investigate why and fix that [...]

Moderator tools should investigate (or bring in another team to help investigate) why some wikis have so many wikibase changes (type 5):

1('frwiki', {'mw.edit': 766644, 'wb': 790923}) 1.0316691971762644
2('arwiki', {'mw.edit': 321440, 'wb': 763293}) 2.3746049029367846
3('eswiki', {'mw.edit': 515816, 'wb': 665724}) 1.290623012857298
4('ruwiki', {'mw.edit': 498632, 'wb': 550427}) 1.103874199810682
5('itwiki', {'mw.edit': 371579, 'wb': 512170}) 1.3783609945664315
6('ukwiki', {'mw.edit': 149335, 'wb': 476468}) 3.190598319215187
7('arzwiki', {'mw.edit': 372773, 'wb': 424782}) 1.1395192248365626
8('plwiki', {'mw.edit': 191479, 'wb': 384598}) 2.0085649079011274
9('cawiki', {'mw.edit': 110444, 'wb': 353002}) 3.196208033030314
10('zhwiki', {'mw.edit': 389674, 'wb': 350039}) 0.8982867730461873
11('kowiki', {'mw.edit': 214961, 'wb': 326289}) 1.5178985955591944
12('nowiki', {'mw.edit': 44170, 'wb': 318408}) 7.208693683495585
13('jawiki', {'mw.edit': 339167, 'wb': 318023}) 0.9376590293277353
14('trwiki', {'mw.edit': 95043, 'wb': 295747}) 3.111717854023968
15('svwiki', {'mw.edit': 152640, 'wb': 285560}) 1.8708071278825995
16('euwiki', {'mw.edit': 27394, 'wb': 273159}) 9.971490107322772
17('fawiki', {'mw.edit': 209369, 'wb': 272267}) 1.3004169671727905
18('ptwiki', {'mw.edit': 143803, 'wb': 265419}) 1.845712537290599
19('rowiki', {'mw.edit': 48112, 'wb': 239137}) 4.97042317924842
20('srwiki', {'mw.edit': 143324, 'wb': 235977}) 1.646458374033658
21('idwiki', {'mw.edit': 111068, 'wb': 227945}) 2.052301292901646
22('huwiki', {'mw.edit': 70273, 'wb': 227307}) 3.23462780868897
23('cswiki', {'mw.edit': 72716, 'wb': 227030}) 3.1221464326970683
24('hewiki', {'mw.edit': 179019, 'wb': 208286}) 1.163485440092951
25('nlwiki', {'mw.edit': 137928, 'wb': 206277}) 1.4955411519053419
26('hywiki', {'mw.edit': 38306, 'wb': 187199}) 4.886936772307211
27('ttwiki', {'mw.edit': 2411, 'wb': 176835}) 73.34508502695977
28('fiwiki', {'mw.edit': 119381, 'wb': 168641}) 1.4126284752179996
29('dawiki', {'mw.edit': 16460, 'wb': 163255}) 9.918286755771568
30('cewiki', {'mw.edit': 3613, 'wb': 156935}) 43.436202601716026
31('elwiki', {'mw.edit': 49603, 'wb': 152838}) 3.081224925911739
32('eowiki', {'mw.edit': 19537, 'wb': 148286}) 7.590008701438297
33('viwiki', {'mw.edit': 68114, 'wb': 147524}) 2.1658396218104943
34('glwiki', {'mw.edit': 26123, 'wb': 143201}) 5.481797649580829
35('bgwiki', {'mw.edit': 24598, 'wb': 133214}) 5.4156435482559555
36('bewiki', {'mw.edit': 9790, 'wb': 131979}) 13.48100102145046
37('cebwiki', {'mw.edit': 238852, 'wb': 128013}) 0.5359511329191299
38('metawiki', {'mw.edit': 99025, 'wb': 126785}) 1.2803332491795
39('specieswiki', {'mw.edit': 26582, 'wb': 117869}) 4.434165976976901
40('azbwiki', {'mw.edit': 1063, 'wb': 117451}) 110.4901222953904
41('cywiki', {'mw.edit': 67181, 'wb': 111262}) 1.6561527812923296
42('azwiki', {'mw.edit': 27503, 'wb': 104802}) 3.81056612005963
43('astwiki', {'mw.edit': 9552, 'wb': 104274}) 10.916457286432161
44('urwiki', {'mw.edit': 94659, 'wb': 103290}) 1.091179919500523
45('simplewiki', {'mw.edit': 58007, 'wb': 95614}) 1.6483183064112952
46('slwiki', {'mw.edit': 13096, 'wb': 89652}) 6.8457544288332315
47('shwiki', {'mw.edit': 8215, 'wb': 89314}) 10.872063298843578

It's not wrong in itself, but to have so many is an indicator that something is wrong.
My quickly jotted notes from talking to @Ladsgroup:

  • wikibase client is subscribing to things it shouldn't
  • for a given wiki with lots of wb changes, look when the source is wb - look at page and match item / entry
  • try to identify the cause:
    • it gets this because of x aspect
    • deprecating old aspect
    • subscribes to old aliases

6 Move it to a NoSQL solution. While I can see the reasoning, I'm not sure it would fix anything. These massive firehose of data would be slow to query in any system.
A. That being said, Using Elastic sounds like a good idea (see T307328#7894820)

(also see T276699#7686389)
We're not the experts on all of the tradeoffs involved there, but this could be an opportunity to think big about how rc data is handled.

  1. As mentioned in T307328, there are really 2 different kinds of needs being served by recentchanges. It could be worth exploring how these might be teased apart;
    • Vandalism and problematic edits detection, which requires complex queries on non-patrolled edits (which are a small subset of edits). e.g. get me all edits that have high ores score but also certain tag.
    • General pulse check of what's going on. For example, watchlist (=What is happening on pages I care about). This requires seeing all edits but on a more simpler, narrower query. No complex magic but needing the firehose of edits.

Discarded proposals from the mentioned ticket

  1. Normalizing the table: This is not a good design and won't help much. This table is the de-normalization table of mediawiki. It's a summary table.

The irrelevance of this proposal is included directly in its language.

  1. Reduce the time of storage rc actions in large wikis. Currently it's 30 days. This number has not come from any research and it's used just because it's round. In my role of Wikipedia volunteer, I barely went to older than a week.

Watchlist support seems to be the reason for longer rc expiration times on multiple wikis; purging rows from RC after one week means purging them from all watchlists as well [2].

  1. Split autopatrolled[1] actions out of rc table, into something like patrolled_recentchanges. In which querying it would be much limited (e.g. no tag filtering, no ores scores filtering, etc.)

This is similar to the proposed move for category changes; as we've previously discussed, the way that autopatrol is used varies significantly per wiki and would be a less uniform win in my opinion.

  1. Reduce the time to store autopatrolled actions. This is basically hybrid of solution 2 & 3. That way we reduce the size without losing much.

Similar autopatrol concerns as #3 above

  1. Store one week of data in one table, the rest of the month in another table (or one week for the first table and all of the month in the other meaning first week would get duplicated.) and make the query on the second table more restricted For example no tag filtering, no ores filtering, etc.

Same watchlist expiration concerns as #2 above

Notes
[1]. recentchanges row count by type on "avoid" wikis:

svwiki

count(*)        rc_type
159945          0
4875            1
6782            3
253947          5
59902           6

kowiki

count(*)        rc_type
200395          0
38495           1
6869            3
326027          5
317856          6

ptwiki

count(*)        rc_type
141412          0
8457            1
15195           3
261027          5
97431           6

plwiki

count(*)        rc_type
182767          0
9920            1
9587            3
374297          5
78009           6

fawiki

count(*)        rc_type
158955          0
7873            1
38327           3
286643          5
153435          6

frwiktionary

count(*)        rc_type
63783           0
26650           1
1461            3
1702            5
137483          6

cawiki

count(*)        rc_type
116990          0
3853            1
1802            3
353500          5
48416           6

ukwiki

count(*)        rc_type
196450          0
10176           1
14258           3
454176          5
271760          6

mgwiktionary

count(*)        rc_type
255259          0
98848           1
245             3
299             5
206549          6

urwiki

count(*)        rc_type
75383           0
19717           1
1895            3
98048           5
91838           6

jawiki

count(*)        rc_type
332987          0
10538           1
15419           3
320860          5
110978          6

zhwiki

count(*)        rc_type
413616          0
24899           1
21332           3
331235          5
271015          6

arzwiki

count(*)        rc_type
160550          0
3587            1
1347            3
505093          5
160308          6

enwiktionary

count(*)        rc_type
385053          0
53878           1
5383            3
14329           5
596855          6

itwiki

count(*)        rc_type
482969          0
20298           1
17271           3
547776          5
163660          6

dewiki

count(*)        rc_type
748660          0
23701           1
33462           3
168259          5
184209          6

thwiktionary

count(*)        rc_type
35917           0
99619           1
121             3
3099            5
476349          6

viwiki

count(*)        rc_type
74264           0
2460            1
5468            3
122165          5
42521           6

eswiki

count(*)        rc_type
486088          0
19541           1
74064           3
679602          5
151764          6

ruwiki

count(*)        rc_type
460206          0
29495           1
265743          3
561156          5
314731          6

arwiki

count(*)        rc_type
263276          0
24212           1
14137           3
806320          5
564149          6

frwiki

count(*)        rc_type
734515          0
32472           1
44214           3
810730          5
579062          6

enwiki

count(*)        rc_type
4673647         0
183335          1
230626          3
1160455         5
2978653         6

wikidatawiki

count(*)        rc_type
14279623        0
476160          1
19744           3
531624          5
71423           6

commonswiki

count(*)        rc_type
10217747        0
181814          1
2348705         3
26422243        6

[2]. https://www.mediawiki.org/wiki/Help:Watching_pages#Watchlist

Spike

  • Confirm exactly which DB changes are required.
  • Meet Amir to map out work required work
  • Once we identify what DB work is required, draft an SRE-owned hypothesis under 1.3

Details

Due Date
Jun 30 2025, 10:00 PM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJun 10 2025, 3:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jsn.sherman renamed this task from Confirm exactly which recentchanges DB changes are required for KR 1.3 to [SPIKE] Confirm exactly which recentchanges DB changes are required for KR 1.3.Jun 10 2025, 3:02 PM
jsn.sherman set Due Date to Jun 30 2025, 10:00 PM.
jsn.sherman moved this task from Inbox to To be estimated on the Moderator-Tools-Team board.
Scardenasmolinar renamed this task from [SPIKE] Confirm exactly which recentchanges DB changes are required for KR 1.3 to [SPIKE] Confirm exactly which recentchanges DB changes are required for KR 1.3[8H].Jun 10 2025, 3:08 PM
Scardenasmolinar moved this task from To be estimated to Estimated on the Moderator-Tools-Team board.
jsn.sherman changed the task status from Open to In Progress.Jun 10 2025, 4:55 PM
jsn.sherman moved this task from Ready to In Progress on the Moderator-Tools-Team (Kanban) board.
jsn.sherman renamed this task from [SPIKE] Confirm exactly which recentchanges DB changes are required for KR 1.3[8H] to [SPIKE] Confirm which recentchanges db-related changes are required for KR 1.3[8H].Jun 23 2025, 8:09 PM
jsn.sherman updated the task description. (Show Details)

The script about requests to recentchanges table has a bug. There are many calls to RC without any parameters (RC issues a redirect to default but maybe it doesn't sometimes) and when the argument of hidecategorization=0 is not set, the category changes are not shown. So we have a trinary of: 1- hidecategorization=0 2- hidecategorization=1 3- nothing in parameters. 2 and 3 won't show them but 1 does. So you need to check for lack of that. For example I ran the query on kowiki:

SELECT
  uri_host,
  sum(if(instr(uri_query, "hidecategorization=0") = 0, 1, 0)) as hidecategorization,
  sum(if(instr(uri_query, "hidecategorization=0") > 0, 1, 0)) as showcategorization2
FROM wmf.webrequest
  WHERE webrequest_source = "text" AND year = 2025 AND month = 6 AND day = 20 AND hour = 10
  AND uri_host = "ko.wikipedia.org"
  AND is_pageview = true
  AND x_analytics_map['special'] = "Recentchanges"
GROUP BY uri_host;

You can play with it, 1 is zero 2 is 101 hits and 3 is 17. So the showing of category changes is actually zero not 17 (unlike what the script says)

Also worth looking into ideas brought up in T307328#10715066 specially the first one.

You can play with it, 1 is zero 2 is 101 hits and 3 is 17. So the showing of category changes is actually zero not 17 (unlike what the script says)

i.e. I do think that feature is barely used.

The script about requests to recentchanges table has a bug. There are many calls to RC without any parameters (RC issues a redirect to default but maybe it doesn't sometimes) and when the argument of hidecategorization=0 is not set, the category changes are not shown. So we have a trinary of: 1- hidecategorization=0 2- hidecategorization=1 3- nothing in parameters. 2 and 3 won't show them but 1 does. So you need to check for lack of that. For example I ran the query on kowiki:

SELECT
  uri_host,
  sum(if(instr(uri_query, "hidecategorization=0") = 0, 1, 0)) as hidecategorization,
  sum(if(instr(uri_query, "hidecategorization=0") > 0, 1, 0)) as showcategorization2
FROM wmf.webrequest
  WHERE webrequest_source = "text" AND year = 2025 AND month = 6 AND day = 20 AND hour = 10
  AND uri_host = "ko.wikipedia.org"
  AND is_pageview = true
  AND x_analytics_map['special'] = "Recentchanges"
GROUP BY uri_host;

You can play with it, 1 is zero 2 is 101 hits and 3 is 17. So the showing of category changes is actually zero not 17 (unlike what the script says)

Thank you for that; I wasn't aware that there were non-redirect situations and so I though is_pageview would filter out any of the param-free hits since it limits to status=200

I'm changing to

sum(if(instr(uri_query, "hidecategorization=0") > 0, 1, 0)) as showcategorization

which should should now include only that 1st state; again, thanks!

[...] when the argument of hidecategorization=0 is not set, the category changes are not shown. [...]

I just double checked this on enwiki and that does not line up with what you are describing; When I select category changes and deselect the other types, I get the following url:
https://en.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidepageedits=1&hidenewpages=1&hideWikibase=1&hidelog=1&hidenewuserlog=1&limit=50&days=0.5&enhanced=1&urlversion=2

which does not have a hidecategorization parameter. I am seeing the category changes:

image.png (970×1 px, 222 KB)

That's a bug honestly. MediaWiki should not show categorization changes by default.

It looks like the redirect is happening for the js-enabled version only; the nojs web form version does not redirect. I'll amend my query to exclude views that have no query parameters so that the numbers within a row are at least comparable.

Okay, I've brought these learnings into a slightly larger notebook that is checking usage of all of the available parameters because that is a thing we are interested in:
https://gitlab.wikimedia.org/jsn/notebooks/-/blob/main/webrequest-rc_filter_usage.py

1('frwiki', {'mw.edit': 766644, 'wb': 790923}) 1.0316691971762644
2('arwiki', {'mw.edit': 321440, 'wb': 763293}) 2.3746049029367846
3('eswiki', {'mw.edit': 515816, 'wb': 665724}) 1.290623012857298
4('ruwiki', {'mw.edit': 498632, 'wb': 550427}) 1.103874199810682
5('itwiki', {'mw.edit': 371579, 'wb': 512170}) 1.3783609945664315
6('ukwiki', {'mw.edit': 149335, 'wb': 476468}) 3.190598319215187
7('arzwiki', {'mw.edit': 372773, 'wb': 424782}) 1.1395192248365626
8('plwiki', {'mw.edit': 191479, 'wb': 384598}) 2.0085649079011274
9('cawiki', {'mw.edit': 110444, 'wb': 353002}) 3.196208033030314
10('zhwiki', {'mw.edit': 389674, 'wb': 350039}) 0.8982867730461873
11('kowiki', {'mw.edit': 214961, 'wb': 326289}) 1.5178985955591944
12('nowiki', {'mw.edit': 44170, 'wb': 318408}) 7.208693683495585
13('jawiki', {'mw.edit': 339167, 'wb': 318023}) 0.9376590293277353
14('trwiki', {'mw.edit': 95043, 'wb': 295747}) 3.111717854023968
15('svwiki', {'mw.edit': 152640, 'wb': 285560}) 1.8708071278825995
16('euwiki', {'mw.edit': 27394, 'wb': 273159}) 9.971490107322772
17('fawiki', {'mw.edit': 209369, 'wb': 272267}) 1.3004169671727905
18('ptwiki', {'mw.edit': 143803, 'wb': 265419}) 1.845712537290599
19('rowiki', {'mw.edit': 48112, 'wb': 239137}) 4.97042317924842
20('srwiki', {'mw.edit': 143324, 'wb': 235977}) 1.646458374033658
21('idwiki', {'mw.edit': 111068, 'wb': 227945}) 2.052301292901646
22('huwiki', {'mw.edit': 70273, 'wb': 227307}) 3.23462780868897
23('cswiki', {'mw.edit': 72716, 'wb': 227030}) 3.1221464326970683
24('hewiki', {'mw.edit': 179019, 'wb': 208286}) 1.163485440092951
25('nlwiki', {'mw.edit': 137928, 'wb': 206277}) 1.4955411519053419
26('hywiki', {'mw.edit': 38306, 'wb': 187199}) 4.886936772307211
27('ttwiki', {'mw.edit': 2411, 'wb': 176835}) 73.34508502695977
28('fiwiki', {'mw.edit': 119381, 'wb': 168641}) 1.4126284752179996
29('dawiki', {'mw.edit': 16460, 'wb': 163255}) 9.918286755771568
30('cewiki', {'mw.edit': 3613, 'wb': 156935}) 43.436202601716026
31('elwiki', {'mw.edit': 49603, 'wb': 152838}) 3.081224925911739
32('eowiki', {'mw.edit': 19537, 'wb': 148286}) 7.590008701438297
33('viwiki', {'mw.edit': 68114, 'wb': 147524}) 2.1658396218104943
34('glwiki', {'mw.edit': 26123, 'wb': 143201}) 5.481797649580829
35('bgwiki', {'mw.edit': 24598, 'wb': 133214}) 5.4156435482559555
36('bewiki', {'mw.edit': 9790, 'wb': 131979}) 13.48100102145046
37('cebwiki', {'mw.edit': 238852, 'wb': 128013}) 0.5359511329191299
38('metawiki', {'mw.edit': 99025, 'wb': 126785}) 1.2803332491795
39('specieswiki', {'mw.edit': 26582, 'wb': 117869}) 4.434165976976901
40('azbwiki', {'mw.edit': 1063, 'wb': 117451}) 110.4901222953904
41('cywiki', {'mw.edit': 67181, 'wb': 111262}) 1.6561527812923296
42('azwiki', {'mw.edit': 27503, 'wb': 104802}) 3.81056612005963
43('astwiki', {'mw.edit': 9552, 'wb': 104274}) 10.916457286432161
44('urwiki', {'mw.edit': 94659, 'wb': 103290}) 1.091179919500523
45('simplewiki', {'mw.edit': 58007, 'wb': 95614}) 1.6483183064112952
46('slwiki', {'mw.edit': 13096, 'wb': 89652}) 6.8457544288332315
47('shwiki', {'mw.edit': 8215, 'wb': 89314}) 10.872063298843578

Also worth looking into ideas brought up in T307328#10715066 specially the first one.

Agree!

[...]

Redundancy of rc_new, rc_type and rc_source
MariaDB [wikidatawiki_p]> SELECT DISTINCT rc_new, rc_type, rc_source FROM recentchanges;
+--------+---------+---------------+
| rc_new | rc_type | rc_source     |
+--------+---------+---------------+
|      0 |       3 | mw.log        |
|      0 |       0 | mw.edit       |
|      1 |       1 | mw.new        |
|      0 |       6 | mw.categorize |
|      0 |       5 | wb            |
+--------+---------+---------------+

See also T36320#8440663, T244080 and T74157.
[...]

The recommendation is to keep rc_source and drop rc_new and rc_type. I see this as shared work between moderator tools and sre. We (moderator tools) would update lookups to replace rc_new & rc type to use rc_source. sre would modify the database afterwards.

1('frwiki', {'mw.edit': 766644, 'wb': 790923}) 1.0316691971762644
2('arwiki', {'mw.edit': 321440, 'wb': 763293}) 2.3746049029367846
3('eswiki', {'mw.edit': 515816, 'wb': 665724}) 1.290623012857298
4('ruwiki', {'mw.edit': 498632, 'wb': 550427}) 1.103874199810682
5('itwiki', {'mw.edit': 371579, 'wb': 512170}) 1.3783609945664315
6('ukwiki', {'mw.edit': 149335, 'wb': 476468}) 3.190598319215187
7('arzwiki', {'mw.edit': 372773, 'wb': 424782}) 1.1395192248365626
8('plwiki', {'mw.edit': 191479, 'wb': 384598}) 2.0085649079011274
9('cawiki', {'mw.edit': 110444, 'wb': 353002}) 3.196208033030314
10('zhwiki', {'mw.edit': 389674, 'wb': 350039}) 0.8982867730461873
11('kowiki', {'mw.edit': 214961, 'wb': 326289}) 1.5178985955591944
12('nowiki', {'mw.edit': 44170, 'wb': 318408}) 7.208693683495585
13('jawiki', {'mw.edit': 339167, 'wb': 318023}) 0.9376590293277353
14('trwiki', {'mw.edit': 95043, 'wb': 295747}) 3.111717854023968
15('svwiki', {'mw.edit': 152640, 'wb': 285560}) 1.8708071278825995
16('euwiki', {'mw.edit': 27394, 'wb': 273159}) 9.971490107322772
17('fawiki', {'mw.edit': 209369, 'wb': 272267}) 1.3004169671727905
18('ptwiki', {'mw.edit': 143803, 'wb': 265419}) 1.845712537290599
19('rowiki', {'mw.edit': 48112, 'wb': 239137}) 4.97042317924842
20('srwiki', {'mw.edit': 143324, 'wb': 235977}) 1.646458374033658
21('idwiki', {'mw.edit': 111068, 'wb': 227945}) 2.052301292901646
22('huwiki', {'mw.edit': 70273, 'wb': 227307}) 3.23462780868897
23('cswiki', {'mw.edit': 72716, 'wb': 227030}) 3.1221464326970683
24('hewiki', {'mw.edit': 179019, 'wb': 208286}) 1.163485440092951
25('nlwiki', {'mw.edit': 137928, 'wb': 206277}) 1.4955411519053419
26('hywiki', {'mw.edit': 38306, 'wb': 187199}) 4.886936772307211
27('ttwiki', {'mw.edit': 2411, 'wb': 176835}) 73.34508502695977
28('fiwiki', {'mw.edit': 119381, 'wb': 168641}) 1.4126284752179996
29('dawiki', {'mw.edit': 16460, 'wb': 163255}) 9.918286755771568
30('cewiki', {'mw.edit': 3613, 'wb': 156935}) 43.436202601716026
31('elwiki', {'mw.edit': 49603, 'wb': 152838}) 3.081224925911739
32('eowiki', {'mw.edit': 19537, 'wb': 148286}) 7.590008701438297
33('viwiki', {'mw.edit': 68114, 'wb': 147524}) 2.1658396218104943
34('glwiki', {'mw.edit': 26123, 'wb': 143201}) 5.481797649580829
35('bgwiki', {'mw.edit': 24598, 'wb': 133214}) 5.4156435482559555
36('bewiki', {'mw.edit': 9790, 'wb': 131979}) 13.48100102145046
37('cebwiki', {'mw.edit': 238852, 'wb': 128013}) 0.5359511329191299
38('metawiki', {'mw.edit': 99025, 'wb': 126785}) 1.2803332491795
39('specieswiki', {'mw.edit': 26582, 'wb': 117869}) 4.434165976976901
40('azbwiki', {'mw.edit': 1063, 'wb': 117451}) 110.4901222953904
41('cywiki', {'mw.edit': 67181, 'wb': 111262}) 1.6561527812923296
42('azwiki', {'mw.edit': 27503, 'wb': 104802}) 3.81056612005963
43('astwiki', {'mw.edit': 9552, 'wb': 104274}) 10.916457286432161
44('urwiki', {'mw.edit': 94659, 'wb': 103290}) 1.091179919500523
45('simplewiki', {'mw.edit': 58007, 'wb': 95614}) 1.6483183064112952
46('slwiki', {'mw.edit': 13096, 'wb': 89652}) 6.8457544288332315
47('shwiki', {'mw.edit': 8215, 'wb': 89314}) 10.872063298843578

Moderator tools should investigate (or bring in another team to help investigate) why some wikis have so many wikibase changes (type 5).
It's not wrong in itself, but to have so many is an indicator that something is wrong.
My quickly jotted notes from talking to @Ladsgroup:

  • wikibase client is subscribing to things it shouldn't
  • for a given wiki with lots of wb changes, look when the source is wb - look at page and match item / entry
  • try to identify the cause:
    • it gets this because of x aspect
    • deprecating old aspect
    • subscribes to old aliases

@Ladsgroup proposed moving category changes to a new special page to prevent having to join a new category changes table against the recent changes table. There isn't really a use case for viewing category changes alongside all of the rest, which is why they are hidden by default. I think this proposal makes a lot of sense. This would be joint work that would involve

  • adding the new table; writing category changes to recent changes + the new table for a period
  • adding the new special page
  • removing references to category changes from recentchanges / watchlist pages
  • an open question is how would this impact watchlists?

@Ladsgroup proposed moving category changes to a new special page to prevent having to join a new category changes table against the recent changes table. There isn't really a use case for viewing category changes alongside all of the rest, which is why they are hidden by default. I think this proposal makes a lot of sense. This would be joint work that would involve

  • adding the new table; writing category changes to recent changes + the new table for a period
  • adding the new special page
  • removing references to category changes from recentchanges / watchlist pages
  • an open question is how would this impact watchlists?

This sounds like a great place to start. I'm curious what @OTichonova would think about the design of breaking out categories to a new special page?
I'd imagine that people would still want to be able to add category pages to watchlist, so maybe this is where we start introducing a cap on the watchlist or introducing multiple watchlists?

Thank you, @jsn.sherman and @Ladsgroup, for all of your work on this ticket! I have a few comments on some follow-up work/tickets.

adding the new special page

I would create two hypotheses for Mod Tools to work on this: one for design (whether to pursue a new design or use the current RC and Watchlist design) and one for engineering work

removing references to category changes from recentchanges / watchlist pages

Let's create a follow-up task for this and add it to the required hypothesis work.

an open question is how would this impact watchlist?

Let's create a spike for this. I'm not sure this would be a Mod Tools spike, a DB spike or an effort from both teams.

@Ladsgroup proposed moving category changes to a new special page to prevent having to join a new category changes table against the recent changes table. There isn't really a use case for viewing category changes alongside all of the rest, which is why they are hidden by default. I think this proposal makes a lot of sense. This would be joint work that would involve

  • adding the new table; writing category changes to recent changes + the new table for a period
  • adding the new special page
  • removing references to category changes from recentchanges / watchlist pages
  • an open question is how would this impact watchlists?

This sounds like a great place to start. I'm curious what @OTichonova would think about the design of breaking out categories to a new special page?
I'd imagine that people would still want to be able to add category pages to watchlist, so maybe this is where we start introducing a cap on the watchlist or introducing multiple watchlists?

Yes, and @Ladsgroup has mentioned a number of additional watchlist related performance issues:

  • identifying watchlists attached to large bot accounts and clearing them out (after reaching out to their owners)
  • revisit including talk pages when watching; this doubles the row count for a behavior that might not even be desirable
  • normalize watch page targets to link target table to increase perf

When I think about multiple custom watchlists as a feature, I really begin to think that watchlists should just be restarted from scratch. I think this is all definitely something for @JWheeler-WMF and friends to consider for commtech's upcoming work in this area.

Thank you, @jsn.sherman and @Ladsgroup, for all of your work on this ticket! I have a few comments on some follow-up work/tickets.

adding the new special page

I would create two hypotheses for Mod Tools to work on this: one for design (whether to pursue a new design or use the current RC and Watchlist design) and one for engineering work

That sounds reasonable; I've created two placeholders under 1.3, one for engineering and one for design. We discuss as a team and decide if that's how we want to structure it and wordsmith hypotheses.

removing references to category changes from recentchanges / watchlist pages

Let's create a follow-up task for this and add it to the required hypothesis work.

agree

an open question is how would this impact watchlist?

Let's create a spike for this. I'm not sure this would be a Mod Tools spike, a DB spike or an effort from both teams.

Agree; I think this would be an "us" spike.

Regarding the rows coming from wikibase, I looked at arwiki that has around 760K rows from wikidata (2.3 times the actual number of edits in the wiki). I found a couple of issues:

  • O aspect and aliases. Somehow many pages are subscribing to the O aspect and as result, they are getting all alias changes for that item even in languages that has nothing to do with them. That alone is responsible for ~20% of wikidata injected rows.
  • Statement modifier collapse is a problem. C.P1, ..., collapse to C in general after twenty statements (I think 20, not sure). This is causing a lot of random edits on statements being injected on rc We probably should bump the threshold a bit.
  • Changes to qualifiers and references. They are also triggering a lot of injections while they don't change anything (and I don't think client wikis actually load the references, at least it's quite rare AFAIK). That alone is also responsible for another 20% of rows being injected.

I ping Lydia and some other folks in WMDE so it could get investigated and improved.

Thank you, @jsn.sherman and @Ladsgroup, for all of your work on this ticket! I have a few comments on some follow-up work/tickets.

adding the new special page

I would create two hypotheses for Mod Tools to work on this: one for design (whether to pursue a new design or use the current RC and Watchlist design) and one for engineering work

removing references to category changes from recentchanges / watchlist pages

Let's create a follow-up task for this and add it to the required hypothesis work.

an open question is how would this impact watchlist?

Let's create a spike for this. I'm not sure this would be a Mod Tools spike, a DB spike or an effort from both teams.

Categorization changes (CatWatch) design has always been somewhat weird. I can't imagine when I would want to see regular changes mixed with category changes. Usually just one of those. So I wouldn't expect many people to rely on them in the recent changes. I think people have them displayed mainly in the watchlist (you choose which categories are relevant for you). They can also be displayed in related changes (RecentChangesLinked), but there is "that one flaw" (T89582).

So perhaps the design choices made 10 years ago did not meet users' expectations, and a redesign would be appreciated.

On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hours

I think this is a copy-paste of a typo. There is no condition on rev_timestamp whatsoever.

[...]

Redundancy of rc_new, rc_type and rc_source
MariaDB [wikidatawiki_p]> SELECT DISTINCT rc_new, rc_type, rc_source FROM recentchanges;
+--------+---------+---------------+
| rc_new | rc_type | rc_source     |
+--------+---------+---------------+
|      0 |       3 | mw.log        |
|      0 |       0 | mw.edit       |
|      1 |       1 | mw.new        |
|      0 |       6 | mw.categorize |
|      0 |       5 | wb            |
+--------+---------+---------------+

See also T36320#8440663, T244080 and T74157.
[...]

The recommendation is to keep rc_source and drop rc_new and rc_type. I see this as shared work between moderator tools and sre. We (moderator tools) would update lookups to replace rc_new & rc type to use rc_source. sre would modify the database afterwards.

Note that rc_type is a normalized integer, whereas rc_source is a string. rc_type is potentially more efficient.

On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hours

I think this is a copy-paste of a typo. There is no condition on rev_timestamp whatsoever.

Yup, it's on rc_timestamp. The idea is when you go to recentchanges (with js enabled) you are redirected to default value which is currently 30 days (I see days=30) that is useless but if we set the default to days=1 for large wikis, while in reality it won't make a difference (you at most can see a couple of hours), it makes sure with complex filtering or other problems (optimizer bugs, etc.), mariadb would still only scan 1/30th of rows.

The recommendation is to keep rc_source and drop rc_new and rc_type. I see this as shared work between moderator tools and sre. We (moderator tools) would update lookups to replace rc_new & rc type to use rc_source. sre would modify the database afterwards.

Note that rc_type is a normalized integer, whereas rc_source is a string. rc_type is potentially more efficient.

I considered it but decided against it since then now you have to build a system to make sure extensions introducing new rc_types don't overlap which is a non-trivial task also the gain is small since the text is short (int takes 4 bytes, a two letter string takes three bytes, they are not that different) and last but not least rc table isn't that large and is the de-normalization table, it's broken in different ways. So it's not worth the overhead of normalizing (specially adding a new table for normalization)

jsn.sherman moved this task from Eng review to Done on the Moderator-Tools-Team (Kanban) board.

Closing out since the work scoped for the spike is complete; followup conversations are now happening in several venues.