Page MenuHomePhabricator

don't dispatch changes to all affected pages for highly used items
Closed, ResolvedPublic

Description

When a highly-used item is edited we are creating too many recent changes entries on the client wikis. To reduce this noise we should limit the creation of recent changes entries for these cases to the pages that are linked via a sitelink.

So if more than N pages on the current client wiki are affected by a repo change, the change should only be in the RC for the directly linked page. Other types of updates should probably not be affected (page purges, …). N should be configurable.

Open question: Where should the cut-off initially be?

Event Timeline

What was the original idea behind pushing changes to an item to all linked items? This doesn't happen for MediaWiki and seems useless to me (N=1 seems perfect to me right now), but perhaps I'm missing something.

Not all link items but all articles that make use of the data in that item. The idea is that everyone using that data should see when it changes.

Isn't it possible to determine the data used and segment the RC feed based on that? For instance, very few articles make use of the language links except the article linked to the item. On the other hand, many articles use only the localized labels of items, which means it's probably not interesting to push changes to item properties to those articles.

For example, [[:ro:Pola Illéry]] is the only article using the IW data for [[:d:Q459912]]. It also make uses of some properties in [[:d:Q271395]], but only the label in [[:d:Q30]]. So recent changes to IW for [[:d:Q459912]] should not go to other articles except [[:ro:Pola Illéry]] and property changes for [[:d:Q30]] should not go to [[:ro:Pola Illéry]].

Yes that's what @hoo is working on. It's more complicated than it sounds though.

Note that this means there may be a specifically-relevant change to a watched page (if you're not watching the directly-linked page for the entity) that affects that page's rendering, but doesn't show on your Watchlist.

This is probably already intended to be temporary, and I understand it's a necessary performance workaround, but I think this should be flagged to be updated later.

It's probably fine for Special:RecentChanges.

I think the materialization is a wrong approach, and we should try to materialize the changes only on query. E.g.:

Watchlist x page x wikidata_usage JOINS being queried on real time. Of course, that makes the query an order of magnitude slower and more complex but- some tricks could be used, and memory is probably going to be faster than pure on-disk materialization. That would be the general idea, I do not think it is easy, but at least it would not affect queries not requiring including Wikidata changes and it would not be dangerous for other reasons (see NDA related tickets).

Recentchanges has already been affected because bad performance queries where possible in the past, with the new materialized entries, those are now taking over 60 seconds to execute and fail to return to the user (you can see the rate of connection lost on API queries to recentchanges).

This is the rate on the last 7 days of Apiqueryrecentchanges:

https://logstash.wikimedia.org/goto/7a5cebd94f0120ead4ca10a34f6b2b54

Coincidentally, it is much larger on ruwiki and commons, even if ruwiki was added as backend a borrowed large server.

@jcrespo I see no good way to implement your suggestion in Wikibase. If I understand correctly, you are essentially proposing to get rid of the recentchanges table altogether. That may be the right thing to do, but it's not possible in the short term. In fact, it seems like a long term project that will need a lot of deliberation.

So, while we can push in that direction, we need a solution that we can implement her and now.

you are essentially proposing to get rid of the recentchanges table altogether

No, I was asking to keep recentchanges as it was before, with the previous load/meaning and implement "new features" on separate tables, or on separate servers. I just gave a couple of imperfect ideas on the other ticket, if they are wrong or impossible to implement, I apologize. I was just trying to not just disable things but give options to move forward (trying to be more constructive). As I said on other ticket, I will finish the purge at T177772 and let other people discuss the right way to move forward. I wonder though, if this issue could happen on similar multiplexing patterns like MCR- and if someone has evaluate its impact. After all, even if very dissimilar situations, both start from the idea of implementing table inheritance "wrongly". But completely offtopic- that should be discussed elsewhere. CC @brion .

One last thought, it is early to say, but commonswiki seem to have stopped having insert spikes of 100x the normal rate: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1068&var-port=9104&from=now-7d&to=now (click "inserts") Something to take into account as a replication performance issue.

@jcrespo wrote:

No, I was asking to keep recentchanges as it was before, with the previous load/meaning and implement "new features" on separate tables, or on separate servers.

Well... it's not really a new feature. The recentchanges table is the standard mechanism for surfacing changes to users via the recentchanges and watchlist UIs and APIs. IF we don't use the recentchanges tbale for this, we have to implement "on the fly join" logic in several places, and make it work efficiently with filtering and paging. This would be a major overhaul, it's not doable on short notice.

Also, since you said that you "think the materialization is a wrong approach", we should probably consider getting rid of recentchanges altogether in that case.

This can't be done short term, and probably not even mid term. This transition would probably take a couple of years (MCR scale). We can't live without notifications for so long, that would probably kill the SDC project. So we need an alternative solution that is feasible as a mid term solution. Do you have an idea?

both start from the idea of implementing table inheritance "wrongly".

How is that? In my mind, "table inheritance" refers to the attempt to store entities that share only some properties in a relational database - correctly using references, or wrongly using nullable fields. But how is this the case here? Recentchanges rows that are triggered by wikidata do not have different properties from regular entries. The only difference is in how they are triggered.

Change 383384 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Quick fix for flooding of the RC table.

https://gerrit.wikimedia.org/r/383384

"Open question: Where should the cut-off initially be?" - the problem I have here is that it's particularly useful to see changes to highly-used pages, as it's those same pages that need to be watched closer to make sure that vandalism is quickly reverted. E.g. if a country name is vandalised, that needs to be caught quickly. If that's not feasible, then I'm not sure I see the benefit of having a middle-region, perhaps it should only appear for the directly linked page... Or maybe it should show in RelatedChanges rather than RecentChanges.

"Open question: Where should the cut-off initially be?" - the problem I have here is that it's particularly useful to see changes to highly-used pages, as it's those same pages that need to be watched closer to make sure that vandalism is quickly reverted. E.g. if a country name is vandalised, that needs to be caught quickly. If that's not feasible, then I'm not sure I see the benefit of having a middle-region, perhaps it should only appear for the directly linked page... Or maybe it should show in RelatedChanges rather than RecentChanges.

RelatedChanges, RecentChanges, and watchlist are all the same behind the scenes. They all have the same feasibility concerns.

RelatedChanges, RecentChanges, and watchlist are all the same behind the scenes. They all have the same feasibility concerns.

Sure, I thought that this might help reduce the amount of duplicate info stored, since presumably you'd just store the change with the linked item and then be able to pick up that linked item change through RelatedChanges. Apologies if I'm missing the problem.

Change 383384 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Quick fix for flooding of the RC table

https://gerrit.wikimedia.org/r/383384

Change 383551 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Add tracking for fallout from fix for T177707

https://gerrit.wikimedia.org/r/383551

Change 383551 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add tracking for fallout from fix for T177707

https://gerrit.wikimedia.org/r/383551

As those two patches got merged as a workaround and as T178063 exists to find a more sustainable approach, is this task still "Unbreak now" priority?

Lydia_Pintscher lowered the priority of this task from Unbreak Now! to High.Oct 16 2017, 1:12 PM

I added tracking for the number of discarded recentchanges entries to the WikidataClient change handling dashboard, see https://grafana.wikimedia.org/dashboard/db/wikidataclient-change-handling-wikipageupdater?orgId=1