
Updating user edit count is pretty expensive in some wikis
Open, Medium, Public

Description

While investigating the load on the s8 databases (T246415: Investigate a different db load groups for wikidata / wikibase), one of the queries with the highest "total latency" on the master came up as this:
UPDATE user SET user_editcount = user_editcount + ? WHERE user_id = ? AND ( user_editcount IS NOT NULL )
Which in total is responsible for 6% of all latency on the master of s8 (for comparison, updating the term store is only 3%).

This can definitely be improved. We could maybe push it to jobs and deduplicate them (somehow without losing the value), given that on some wikis bots edit at a really high rate (sometimes up to 500 edits per minute), causing lots of locking to happen at the same time on their row in the user table. We could also go in the direction of site_stats and keep the count in WAN cache, flushing it to the database from time to time.
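The deduplication idea could work by accumulating pending increments per user in memory and flushing them as a single UPDATE, so 500 back-to-back bot edits touch the row once instead of 500 times. A minimal sketch of that batching (hypothetical helper names; SQLite standing in for the production MySQL table):

```python
import sqlite3
from collections import defaultdict

# In-memory stand-in for the production user table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_editcount INTEGER)")
db.execute("INSERT INTO user VALUES (1, 100)")

# Pending increments, keyed by user: repeated bot edits collapse into one delta.
pending = defaultdict(int)

def record_edit(user_id):
    pending[user_id] += 1  # no database write on the hot path

def flush():
    # One UPDATE per user, with the summed delta.
    for user_id, delta in pending.items():
        db.execute(
            "UPDATE user SET user_editcount = user_editcount + ? "
            "WHERE user_id = ? AND user_editcount IS NOT NULL",
            (delta, user_id),
        )
    pending.clear()
    db.commit()

for _ in range(500):   # a bot editing at 500 edits/minute
    record_edit(1)
flush()                # a single row lock instead of 500

print(db.execute("SELECT user_editcount FROM user WHERE user_id = 1").fetchone()[0])
# → 600
```

The trade-off is that increments held in memory are lost if the flush never runs, which is why the marker-based variant below recounts rather than trusts the deltas.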

Event Timeline

Afaik these are already deduplicated, and already deferred or in a job. Are these queries notably slow, or are they responsible for a lot of db time in aggregate perhaps because they're (naturally) some of the most common writes?

Is there an observed issue with db load or end-user latency?

Afaik these are already deduplicated, and already deferred or in a job.

I assume aggregation of these updates helps. Also, I think it's worth double-checking that it works as intended (there might be something buggy there).

Are these queries notably slow, or are they responsible for a lot of db time in aggregate perhaps because they're (naturally) some of the most common writes?

No. Let me explain it this way: the term store also gets updated on every edit and it's massive (billions of rows across six tables), and yet its load pressure is half of this. wb_changes is written twice per edit and yet its load is lower than this (while having a less than optimal design).

My suspicion is that when a handful of users edit really fast (sometimes 10/sec for hours), multiple threads wait for the lock on just that one row, which makes the query slow. There may also be missing indexes that would have made it faster (especially on user_editcount), given that this table is pretty big on wikidatawiki.

Is there an observed issue with db load or end-user latency?

Yes, I was asked by @Marostegui to create this ticket.

I can imagine a number of different solutions for this issue. Deciding which one is a question of system architecture, I think. Possible solutions include:

  • deduplicated SQL queries in a job, as suggested by Amir. This would probably require us to add a field to the user table that contains some kind of marker (e.g. revision ID and/or timestamp) of the last update, so we can calculate the number of edits since that marker.
  • I suspect we can easily get the number of edits per user from the analytics cluster. Perhaps we can just copy that to the user table every now and then?
  • Maintain the counter outside of MySQL entirely, and keep it updated using stream processing or some similar technique.

Which approach we can use also depends on how urgent the solution is.
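For the first option, a marker field would make the count self-healing: instead of trusting a chain of +1 updates, a job recounts revisions newer than the marker. A rough sketch, where user_editcount_rev is a hypothetical new column and the revision table is heavily simplified:

```python
import sqlite3

# Sketch of the "marker" variant: store the last revision ID that was counted,
# then derive the increment from the revision table instead of trusting +1s.
# Table and column names only loosely mirror MediaWiki; this is illustrative.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE user (user_id INTEGER PRIMARY KEY, "
    "user_editcount INTEGER, user_editcount_rev INTEGER)"
)
db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_user INTEGER)")
db.execute("INSERT INTO user VALUES (1, 2, 200)")  # counted up to rev 200
db.executemany("INSERT INTO revision VALUES (?, 1)", [(100,), (200,), (300,), (400,)])

def resync(user_id):
    # Count edits newer than the marker and advance both fields together.
    # A lost job is harmless: the next run just counts a larger window.
    new_edits, newest = db.execute(
        "SELECT COUNT(*), MAX(rev_id) FROM revision "
        "WHERE rev_user = ? AND rev_id > "
        "(SELECT user_editcount_rev FROM user WHERE user_id = ?)",
        (user_id, user_id),
    ).fetchone()
    if new_edits:
        db.execute(
            "UPDATE user SET user_editcount = user_editcount + ?, "
            "user_editcount_rev = ? WHERE user_id = ?",
            (new_edits, newest, user_id),
        )
    db.commit()

resync(1)
print(db.execute("SELECT user_editcount, user_editcount_rev FROM user").fetchone())
# → (4, 400)
```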

I can imagine a number of different solutions for this issue. Deciding which one is a question of system architecture, I think. Possible solutions include:

  • deduplicated SQL queries in a job, as suggested by Amir. This would probably require us to add a field to the user table that contains some kind of marker (e.g. revision ID and/or timestamp) of the last update, so we can calculate the number of edits since that marker.

I doubt that adding the field would be necessary, because the edit count is never supposed to be 100% accurate; that is even explicitly documented in several places. I think any job that doesn't fail can add its increment, and if a job fails it's not a big deal (I hope that's less than 1% of the jobs).

  • I suspect we can easily get the number of edits per user from the analytics cluster. Perhaps we can just copy that to the user table every now and then?

I don't know if it's possible to move data between the analytics cluster and production (IIRC it's not), but in general it's a good idea to have a pipeline between the two, so we could update some tables from the analytics cluster (like the QueryPage special pages and the querycache table).

  • Maintain the counter outside of MySQL entirely, and keep it updated using stream processing or some similar technique.

Which approach we can use also depends on how urgent the solution is.

There are some other options (I'm brain dumping, sorry if some sounds so weird):

  • Moving the field to a dedicated table. Then it wouldn't need to lock that row in the user table. I assume it would make things quite a bit faster (and reduce storage, as most users won't have a row in the new table, since most users don't edit at all)
  • Removing this functionality from core (behind a feature flag of course so it wouldn't break for third parties) and let AQS handle it
    • But I assume lots of core itself needs the field, in which case this is not a good idea
  • I suspect we can easily get the number of edits per user from the analytics cluster. Perhaps we can just copy that to the user table every now and then?

I don't know if it's possible to move data between the analytics cluster and production (IIRC it's not), but in general it's a good idea to have a pipeline between the two, so we could update some tables from the analytics cluster (like the QueryPage special pages and the querycache table).

It would be, if this data were exposed in some nice form (similar to the pageview APIs).
Doing something like this, however, raises a bunch of interesting questions.
These edit counts are used for things like autoconfirmed, for example; so although an occasional import/update from some other source might be nice if we start missing updates, we still need some counter to be live-ish?

There are some other options (I'm brain dumping, sorry if some sounds so weird):

  • Moving the field to a dedicated table. Then it wouldn't need to lock that row in the user table. I assume it would make things quite a bit faster (and reduce storage, as most users won't have a row in the new table, since most users don't edit at all)

Sounds pretty fine to me.
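A sketch of what that dedicated table might look like, with hypothetical table and column names (user_editcount / uec_*) and SQLite standing in for MySQL. The upsert creates a row only on a user's first edit, so users who never edit cost no storage, and hot increments lock this narrow row rather than the wide user row:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical side table: one narrow row per user who has actually edited.
db.execute(
    "CREATE TABLE user_editcount ("
    "uec_user INTEGER PRIMARY KEY, "
    "uec_count INTEGER NOT NULL)"
)

def increment(user_id, delta=1):
    # Upsert: insert the row on a user's first edit, increment afterwards.
    db.execute(
        "INSERT INTO user_editcount (uec_user, uec_count) VALUES (?, ?) "
        "ON CONFLICT(uec_user) DO UPDATE SET "
        "uec_count = uec_count + excluded.uec_count",
        (user_id, delta),
    )
    db.commit()

increment(7)       # first edit creates the row
increment(7, 5)    # later edits (or an aggregated batch) just add to it
print(db.execute(
    "SELECT uec_count FROM user_editcount WHERE uec_user = 7"
).fetchone()[0])
# → 6
```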

  • Removing this functionality from core (behind a feature flag of course so it wouldn't break for third parties) and let AQS handle it
    • But I assume lots of core itself needs the field, in which case this is not a good idea

I think this has the same concerns attached to it that I raised above?

I can imagine a number of different solutions for this issue. Deciding which one is a question of system architecture, I think. Possible solutions include:

  • deduplicated SQL queries in a job, as suggested by Amir. This would probably require us to add a field to the user table that contains some kind of marker (e.g. revision ID and/or timestamp) of the last update, so we can calculate the number of edits since that marker.
  • I suspect we can easily get the number of edits per user from the analytics cluster. Perhaps we can just copy that to the user table every now and then?
  • Maintain the counter outside of MySQL entirely, and keep it updated using stream processing or some similar technique.

Which approach we can use also depends on how urgent the solution is.

The simplest and most idiomatic solution, I think, is to let MW handle this through its job queue and deferred-updates aggregation, which is a cheap and reliable way of stashing and executing these updates without any novel or exciting complexity.

Based on what Amir described about these writes taking up notable time during something around Wikidata edits (what exactly? which update or job, and when and where does this happen?), I suspect something is bypassing this logic right now, which would explain all the symptoms.

I suspect as such that something is causing this edit-count write to happen synchronously as part of the "huge" transaction for Wikidata stuff, and thus it holds locks that other threads need for much longer than necessary. Letting DeferredUpdates do its thing in its own transaction should solve all of that at once.

AFAIK it's not a job, it's a deferred update; if it could be a job instead (with proper aggregation, of course), I assume it would improve performance a lot: https://gerrit.wikimedia.org/g/mediawiki/core/+/d26a689edd407e8dd62867ea32a117452cba130a/includes/deferred/UserEditCountUpdate.php

The problem is that on Wikidata (unlike other wikis), bots edit back to back (at really high speed) a lot, meaning their deferred updates probably spend a lot of time waiting for one another to finish. Maybe after the changes we made to the rate limit for bots, this has improved?

I don't think we need to do anything here if we can have an appropriate rate limit for bots, or otherwise encourage the Wikidata bots to stop doing this.