DBA support for RecentChanges (Phab T307328)
Why it matters
The 1.3 KR is all about guiding editors to patrolling tasks and demonstrating impact. The features we develop for this KR are going to be working with recent changes which has run into DB scalability limitations that require SRE-owned schema and/or SRE expertise. We currently have an "avoid" list of wikis when it comes to adding more joins against the recentchanges table:
# top 25 wikis in terms of recentchanges; to avoid # see https://phabricator.wikimedia.org/P74607 avoid_wikis = [ "sv", "ko", "pt", "pl", "fa", "frwiktionary", "ca", "uk", "mgwiktionary", "ur", "ja", "zh", "arz", "enwiktionary", "it", "de", "thwiktionary", "vi", "es", "ru", "ar", "fr", "en", "wikidatawiki", "commonswiki", ]
Proposal impact summary
| proposal from T307328 | improvement | user impact | effort |
|---|---|---|---|
| On large wikis, if there is no condition on rev_timestamp, just add a condition to be the last 24 hours | medium | none | small |
| drop rc_new & rc_type, keep rc_source | medium | none | medium |
| Identify and resolve wikibase ingest issues | small | none | medium |
| Move category membership changes to another table | large | medium | large |
| Store one week of data in one table, the rest of the month in another table | x-large | large | x-large |
| Reduce the time of storage rc actions to one week | x-large | large | small |
| Split autopatrolled actions out of rc table | medium | x-large | large |
| Reduce the time to store autopatrolled actions. | medium | x-large | large |
| Move it to a NoSQL solution | unknown | unknown | x-large |
| Normalizing the table | none | none | large |
Recommended proposals from the mentioned ticket
This might warrant adding a ui hint where it actually comes into play, but seems like an easy win considering that busy wikis only display edits from the last few minutes or seconds due to the volume of edits. This would help reduce the db pressure brought by the recent changes page, but wouldn't reduce the load brought by lookups / joins against the recentchanges tables from more specific queries, either on the recent changes page or from other features we build that query that table
The recommendation is to keep rc_source and drop rc_new and rc_type. rc_type would be most efficient, but difficult to keep extensions from using colliding values.
- This type of rc (type 6, RC_CATEGORIZE) represents a significant portion of rows on the wikis that are currently on our "avoid" list [1]
- I did a little digging to see how the filters for this change type impact usability
- based on a reading of our research, category changes seem to be more confusing than useful to many editors on the rc page.
- a preliminary look at the output of webrequest-rc_showcategorization.py indicates that the "hide category changes" default is often removed, meaning that the functionality of that filter needs to exist somewhere
- @Ladsgroup proposed moving category changes to a new special page to prevent having to join a new category changes table against the recent changes table. There isn't really a use case for viewing category changes alongside all of the rest, which is why they are hidden by default. I think this proposal makes a lot of sense. This would be joint work that would involve
- adding the new table; writing category changes to recent changes + the new table for a period
- adding the new special page
- removing references to category changes from recentchanges / watchlist pages
- an open question is how would this impact watchlists?
- Another possibility would be to keep the category changes in the RecentChanges UI, but make them exclusive of other changes. Eg. selecting category changes deselects all other kinds of changes; selecting any kind of change besides category changes deselects category changes. Avoiding presenting the two types together should preclude the need to join category changes and recent changes tables in queries. This would increase the techincal debt of the recent changes page in order to serve "non-rc" content sometimes.
Areas of inquiry
Moderator tools should investigate (or bring in another team to help investigate) why some wikis have so many wikibase changes (type 5):
It's not wrong in itself, but to have so many is an indicator that something is wrong.
My quickly jotted notes from talking to @Ladsgroup:
- wikibase client is subscribing to things it shouldn't
- for a given wiki with lots of wb changes, look when the source is wb - look at page and match item / entry
- try to identify the cause:
- it gets this because of x aspect
- deprecating old aspect
- subscribes to old aliases
6 Move it to a NoSQL solution. While I can see the reasoning, I'm not sure it would fix anything. These massive firehose of data would be slow to query in any system.
A. That being said, Using Elastic sounds like a good idea (see T307328#7894820)
(also see T276699#7686389)
We're not the experts on all of the tradeoffs involved there, but this could be an opportunity to think big about how rc data is handled.
- As mentioned in T307328, there are really 2 different kinds of needs being served by recentchanges. It could be worth exploring how these might be teased apart;
- Vandalism and problematic edits detection, which requires complex queries on non-patrolled edits (which are a small subset of edits). e.g. get me all edits that have high ores score but also certain tag.
- General pulse check of what's going on. For example, watchlist (=What is happening on pages I care about). This requires seeing all edits but on a more simpler, narrower query. No complex magic but needing the firehose of edits.
Discarded proposals from the mentioned ticket
- Normalizing the table: This is not a good design and won't help much. This table is the de-normalization table of mediawiki. It's a summary table.
The irrelevance of this proposal is included directly in its language.
- Reduce the time of storage rc actions in large wikis. Currently it's 30 days. This number has not come from any research and it's used just because it's round. In my role of Wikipedia volunteer, I barely went to older than a week.
Watchlist support seems to be the reason for longer rc expiration times on multiple wikis; purging rows from RC after one week means purging them from all watchlists as well [2].
- Split autopatrolled[1] actions out of rc table, into something like patrolled_recentchanges. In which querying it would be much limited (e.g. no tag filtering, no ores scores filtering, etc.)
This is similar to the proposed move for category changes; as we've previously discussed, the way that autopatrol is used varies significantly per wiki and would be a less uniform win in my opinion.
- Reduce the time to store autopatrolled actions. This is basically hybrid of solution 2 & 3. That way we reduce the size without losing much.
Similar autopatrol concerns as #3 above
- Store one week of data in one table, the rest of the month in another table (or one week for the first table and all of the month in the other meaning first week would get duplicated.) and make the query on the second table more restricted For example no tag filtering, no ores filtering, etc.
Same watchlist expiration concerns as #2 above
Notes
[1]. recentchanges row count by type on "avoid" wikis:
svwiki count(*) rc_type 159945 0 4875 1 6782 3 253947 5 59902 6 kowiki count(*) rc_type 200395 0 38495 1 6869 3 326027 5 317856 6 ptwiki count(*) rc_type 141412 0 8457 1 15195 3 261027 5 97431 6 plwiki count(*) rc_type 182767 0 9920 1 9587 3 374297 5 78009 6 fawiki count(*) rc_type 158955 0 7873 1 38327 3 286643 5 153435 6 frwiktionary count(*) rc_type 63783 0 26650 1 1461 3 1702 5 137483 6 cawiki count(*) rc_type 116990 0 3853 1 1802 3 353500 5 48416 6 ukwiki count(*) rc_type 196450 0 10176 1 14258 3 454176 5 271760 6 mgwiktionary count(*) rc_type 255259 0 98848 1 245 3 299 5 206549 6 urwiki count(*) rc_type 75383 0 19717 1 1895 3 98048 5 91838 6 jawiki count(*) rc_type 332987 0 10538 1 15419 3 320860 5 110978 6 zhwiki count(*) rc_type 413616 0 24899 1 21332 3 331235 5 271015 6 arzwiki count(*) rc_type 160550 0 3587 1 1347 3 505093 5 160308 6 enwiktionary count(*) rc_type 385053 0 53878 1 5383 3 14329 5 596855 6 itwiki count(*) rc_type 482969 0 20298 1 17271 3 547776 5 163660 6 dewiki count(*) rc_type 748660 0 23701 1 33462 3 168259 5 184209 6 thwiktionary count(*) rc_type 35917 0 99619 1 121 3 3099 5 476349 6 viwiki count(*) rc_type 74264 0 2460 1 5468 3 122165 5 42521 6 eswiki count(*) rc_type 486088 0 19541 1 74064 3 679602 5 151764 6 ruwiki count(*) rc_type 460206 0 29495 1 265743 3 561156 5 314731 6 arwiki count(*) rc_type 263276 0 24212 1 14137 3 806320 5 564149 6 frwiki count(*) rc_type 734515 0 32472 1 44214 3 810730 5 579062 6 enwiki count(*) rc_type 4673647 0 183335 1 230626 3 1160455 5 2978653 6 wikidatawiki count(*) rc_type 14279623 0 476160 1 19744 3 531624 5 71423 6 commonswiki count(*) rc_type 10217747 0 181814 1 2348705 3 26422243 6
[2]. https://www.mediawiki.org/wiki/Help:Watching_pages#Watchlist
Spike
- Confirm exactly which DB changes are required.
- Meet Amir to map out work required work
- Once we identify what DB work is required, draft an SRE-owned hypothesis under 1.3
