
Investigate: How to make the GUC query performant
Closed, Resolved · Public

Description

Background

We are making a GUC-like feature in MediaWiki. It will need to query all wikis for contributions made by temporary accounts, given an IP address or range. Other possible filters include the date and the wikis to search.

How can we do this in a way that isn't too slow and still returns a useful amount of information?

Questions to be answered
  • How do the existing GUC and XTools make equivalent queries?
  • If they are doing anything that won't be possible from MW with temp accounts enabled, what can we do instead?
  • Will we be able to offer sorting by time and by wiki?

Event Timeline

I don't know how to make it WMF-agnostic, but XTools and I believe GUC basically do the following:

  • Fetch (and cache) which sections there are and the databases they contain (Repository::getDbList())
  • All queries are then made by slice, not by each individual database. For example, frwiki and ruwiki are both in s6, so the query would use the connection to s6 to UNION the queries against each db. This is important because it means we make (as of the time of writing) at most 8 queries, as opposed to 900+.
  • Fetch the actor IDs for each wiki (GlobalContribsRepository::getDbNamesAndActorIds())
    • In the case of the Toolforge replicas, we have the specialized view actor_revision. By querying it instead of just revision, we effectively get the actor ID only if there are edits on that wiki. In prod, I suppose you could just JOIN on revision to do the same thing.
  • Now we know which wikis have edits, and we have the actor IDs for them.
  • Query the actual revisions. Again, we do this by slice, not by each individual db. The UNION-ed queries are wrapped in an outer query that is sorted by timestamp/namespace/etc. (GlobalContribsRepository::getRevisions())
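The per-slice strategy above can be sketched roughly as follows. This is illustrative only: `group_by_slice` and `build_union_query` are invented names, not the real XTools/GUC APIs, and the generated SQL is simplified (real code would use the framework's query builder and parameterized values):

```python
# Sketch of the per-slice UNION strategy: group wikis by db slice, then
# issue one UNION-ed query per slice instead of one query per wiki.
from collections import defaultdict

def group_by_slice(db_list):
    """db_list: iterable of (dbname, slice) pairs, e.g. from a cached dblist."""
    slices = defaultdict(list)
    for dbname, slice_name in db_list:
        slices[slice_name].append(dbname)
    return slices

def build_union_query(dbnames, actor_ids, limit=50):
    """One query per slice: UNION ALL the per-wiki SELECTs, then sort the
    combined result by timestamp in an outer query."""
    parts = [
        f"(SELECT '{db}' AS dbname, rev_id, rev_timestamp "
        f"FROM {db}.revision WHERE rev_actor = {actor_ids[db]})"
        for db in dbnames
    ]
    return (
        "SELECT * FROM (" + " UNION ALL ".join(parts) + ") AS u "
        f"ORDER BY rev_timestamp DESC LIMIT {limit}"
    )

slices = group_by_slice([("frwiki", "s6"), ("ruwiki", "s6"), ("enwiki", "s1")])
sql = build_union_query(slices["s6"], {"frwiki": 7, "ruwiki": 42})
```

The point of the outer query is that sorting and limiting happen once over the combined rows, so each slice connection does one round trip regardless of how many wikis it hosts.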

If you are given an account (or temporary account, I guess) instead of an IP, it's much easier as you can fetch the wikis with edits from CentralAuth (GlobalContribsRepository::globalEditCountsFromCentralAuth()).

XTools attains fairly fast speeds using this method, even for wide IP ranges. I can only assume production will enjoy even faster speeds. However, you'd also have to check on which wikis the user has permission to view (and thus query) IP data. I'd use the same tactic of querying against slices to speed that query up. For stewards or users with privileged global rights, you could skip this permissions check entirely.

As an additional note, it'd be super handy to surface the usual things you see in change lists, such as tags. mw-reverted in particular is very useful.

Thanks @MusikAnimal, this is really helpful.

Noting down some thoughts following a conversation with @Ladsgroup.

> If you are given an account (or temporary account, I guess) instead of an IP, it's much easier as you can fetch the wikis with edits from CentralAuth (GlobalContribsRepository::globalEditCountsFromCentralAuth()).

We might be able to do something similar to this for IP addresses. We could have a central table that would store the IP addresses of temp accounts and which wiki they edited. Elaborating on this idea:

  • Overall, we'd (1) look up in this table which wikis were edited by this IP/range; (2) query the CheckUser tables on those wikis for the edits, along with the temp account name
  • This table would need to be purged (similar to the CheckUser tables) since we can't store this information indefinitely
  • For this reason we'd probably need a separate row for each edit, rather than keeping a running count (fields could be IP, revision ID, timestamp)
  • IPs would be stored in a hex representation so ranges are easier to look up (the ip_changes table is currently used similarly for Special:Contributions/<IP range>)

I didn't elaborate on IP ranges, but doing that is pretty fast as-is, by simply querying ip_changes (relevant code). To be clear, I'm not claiming this to be the "best" way to do it, rather I can vouch that it's fast enough – and that's using the slower Toolforge replicas.

That said, while a new table sounds great, you could just query cu_changes directly using the same per-slice strategy and run no more than 8 queries (on our cluster) for a single IP/range/account. In that case you don't even need to JOIN on revision or whatever, as cu_changes only reflects live edits, and it also gets purged already. Duplicating the rows seems like data redundancy, when we have the existing performance advantage of querying by slice. But the db slices, as I understand it, may be more or less unique to WMF, so there's the trade-off of losing most 3rd party support if it's hard-coded to work off slices.

To that end, would the new table be part of CentralAuth? I'm guessing for this project 3rd party support isn't a hard requirement for an MVP, but it would be cool to remove the dependency on CentralAuth, specifically, as that's not really recommended outside WMF. The dependency on CheckUser is of course fine and necessary in our case.

> I didn't elaborate on IP ranges, but doing that is pretty fast as-is, by simply querying ip_changes (relevant code). To be clear, I'm not claiming this to be the "best" way to do it, rather I can vouch that it's fast enough – and that's using the slower Toolforge replicas.

> That said, while a new table sounds great, you could just query cu_changes directly using the same per-slice strategy and run no more than 8 queries (on our cluster) for a single IP/range/account.

In production that would be complicated; e.g. for s3, you'd need to union 900 queries together, which can cause issues.

My recommendation is that the new table should only hold three fields: IP hex value, wiki ID, and timestamp of last edit (for purging). You don't need to add rev_id or anything like that; just bump the timestamp if there is a new edit. That already handles most cases well enough: you select wiki_id WHERE *rules you want*, then query those wikis separately via the CU tables. That would make the whole code much simpler, make it portable to other platforms if needed (CA is used outside WMF, but not often), and make it far less dependent on how our infra is set up, which frees up how the DBAs do maintenance (and generally, code should be decoupled as much as possible from the infra and how it's set up).
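A minimal sketch of that three-field table and the two-step lookup, using SQLite as a stand-in and invented column names (te_*); the real schema, naming, and purge mechanism would of course be decided by the extension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE temp_edits (
        te_ip_hex    TEXT NOT NULL,  -- hex form of the IP (range-scannable)
        te_wiki      TEXT NOT NULL,  -- wiki ID
        te_timestamp TEXT NOT NULL,  -- last edit, used only for purging
        PRIMARY KEY (te_ip_hex, te_wiki)
    )
""")

def record_edit(ip_hex, wiki, ts):
    # One row per (IP, wiki); a new edit just bumps the timestamp.
    conn.execute(
        "INSERT INTO temp_edits VALUES (?, ?, ?) "
        "ON CONFLICT(te_ip_hex, te_wiki) DO UPDATE SET "
        "te_timestamp = excluded.te_timestamp",
        (ip_hex, wiki, ts))

def wikis_for_range(lo_hex, hi_hex):
    # Step 1 of the lookup: which wikis did this IP/range edit?
    # Step 2 (not shown) queries the CU tables on just those wikis.
    rows = conn.execute(
        "SELECT DISTINCT te_wiki FROM temp_edits "
        "WHERE te_ip_hex BETWEEN ? AND ?", (lo_hex, hi_hex))
    return sorted(r[0] for r in rows)

def purge_before(cutoff_ts):
    # Rows whose last edit predates the retention window are dropped.
    conn.execute("DELETE FROM temp_edits WHERE te_timestamp < ?", (cutoff_ts,))

record_edit("01020304", "frwiki", "20240101000000")
record_edit("01020304", "frwiki", "20240105000000")  # bumps the timestamp
record_edit("010203FF", "ruwiki", "20240101000000")
print(wikis_for_range("01020300", "010203FF"))  # ['frwiki', 'ruwiki']
```

Because a new edit only updates an existing row's timestamp, the table stays small (bounded by distinct IP/wiki pairs within the retention window) rather than growing with every edit.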

> I didn't elaborate on IP ranges, but doing that is pretty fast as-is, by simply querying ip_changes (relevant code). To be clear, I'm not claiming this to be the "best" way to do it, rather I can vouch that it's fast enough – and that's using the slower Toolforge replicas.

ip_changes won't be written to for temp users, since their IP addresses will only be stored in CheckUser tables. Would work on the same principle though.

> [...]
> To that end, would the new table be part of CentralAuth? I'm guessing for this project 3rd party support isn't a hard requirement for an MVP, but it would be cool to remove the dependency on CentralAuth, specifically, as that's not really recommended outside WMF. The dependency on CheckUser is of course fine and necessary in our case.

Having GUC in CheckUser makes sense to me too - I think the fewer places we store and look up IPs the better - mostly so the permissions are all handled in one place, but also to keep purging in sync, in case the time we're allowed to keep the data ever changes. For that reason, ideally the central table would be in CheckUser too, but I'm not sure if that's possible to do cleanly since CheckUser isn't aware of other wikis. (Could we have a config for whether to use the central table? Can you manage a single central table from an extension installed on many wikis?)

> My recommendation is that the new table should only hold three fields: IP hex value, wiki ID, and timestamp of last edit (for purging). You don't need to add rev_id or anything like that; just bump the timestamp if there is a new edit. That already handles most cases well enough: you select wiki_id WHERE *rules you want*, then query those wikis separately via the CU tables. That would make the whole code much simpler, make it portable to other platforms if needed (CA is used outside WMF, but not often), and make it far less dependent on how our infra is set up, which frees up how the DBAs do maintenance (and generally, code should be decoupled as much as possible from the infra and how it's set up).

Sounds good - "revision ID" in my initial comment was a typo - should be "wiki ID" of course, or else you wouldn't know the wiki :) But yes bumping the timestamp rather than storing a new row makes sense, since we only need to know edited recently vs didn't edit. What I didn't want to do was keep a running count of how many edits, since that could lead to storing old data.

>> I didn't elaborate on IP ranges, but doing that is pretty fast as-is, by simply querying ip_changes (relevant code). To be clear, I'm not claiming this to be the "best" way to do it, rather I can vouch that it's fast enough – and that's using the slower Toolforge replicas.

> ip_changes won't be written to for temp users, since their IP addresses will only be stored in CheckUser tables. Would work on the same principle though.

>> [...]
>> To that end, would the new table be part of CentralAuth? I'm guessing for this project 3rd party support isn't a hard requirement for an MVP, but it would be cool to remove the dependency on CentralAuth, specifically, as that's not really recommended outside WMF. The dependency on CheckUser is of course fine and necessary in our case.

> Having GUC in CheckUser makes sense to me too - I think the fewer places we store and look up IPs the better - mostly so the permissions are all handled in one place, but also to keep purging in sync, in case the time we're allowed to keep the data ever changes. For that reason, ideally the central table would be in CheckUser too, but I'm not sure if that's possible to do cleanly since CheckUser isn't aware of other wikis. (Could we have a config for whether to use the central table? Can you manage a single central table from an extension installed on many wikis?)

It should be doable with virtual domains. Here is an example of how it's being applied https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReadingLists/+/985162

Tchanders claimed this task.