
Decide on scalable approach for watchlist integration of Wikidata
Open, Medium, Public

Description

Originally, we inserted a recentchanges row for each local page affected by a change in a connected Wikibase repo (e.g. if the label of Q159 was used on 10000 pages, we inserted 10000 rc rows when that label was changed). However, this was found to generate too much load (see e.g. T171027), so a hard cut-off was introduced as a quick fix; see https://gerrit.wikimedia.org/r/#/c/383384/

That situation is, however, not satisfactory. At a minimum, we want to be smarter about which pages to "ping" via the recentchanges mechanism - e.g. insert rc rows only for the most-watched pages affected by the change (a sketch of this idea follows below).
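A minimal sketch of that prioritization, in Python purely for illustration (the actual Wikibase client code is PHP, and the cap value, function name, and data shapes here are assumptions, not the existing implementation):

```python
# Illustrative sketch only: given the pages affected by an entity change and
# their watcher counts, pick the subset that should get recentchanges rows -
# all of them if the fan-out is small, otherwise only the most-watched pages
# up to a hard cap (analogous to the existing cut-off).

MAX_RC_ROWS_PER_CHANGE = 1000  # hypothetical cap


def select_rc_targets(affected_pages: dict[int, int],
                      cap: int = MAX_RC_ROWS_PER_CHANGE) -> list[int]:
    """affected_pages maps page_id -> number of watchers."""
    if len(affected_pages) <= cap:
        return list(affected_pages)
    # Prefer the most-watched pages, so the rc rows we do insert are the
    # ones most likely to show up on someone's watchlist.
    return sorted(affected_pages, key=affected_pages.get, reverse=True)[:cap]


# Example: an entity used on four pages, with a cap of 2 rc rows.
print(select_rc_targets({101: 5, 102: 0, 103: 42, 104: 1}, cap=2))  # [103, 101]
```

The open question is whether watcher counts can be queried cheaply enough at dispatch time for this to be viable at scale.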

Event Timeline

Small correction/clarification on "this was found to generate too much load": as an ops person, I interpret load as throughput/backlog work. The insertion [load] itself was not the problem (the spikes on inserts were too large, but that is something that could be smoothed); the problem lies in the proportion of wb-originated changes vs. others and in the size of the recentchanges table itself. Strictly speaking, the issues could be solved by making the 2 million different query patterns of recentchanges better, but I am going to assume that is more difficult than changing the wb rows behaviour :-).

The idea is that a more accurate phrasing would be "this causes some recentchanges- and watchlist-related queries to have >60 seconds of latency, and the table became operationally unmaintainable for some wikis".