
Optimize edit count queries in XTools
Closed, Resolved · Public · 8 Estimated Story Points

Description

The current queries are inefficient. We could make them better by using the CentralAuth edit data when it is available.

Event Timeline

kaldari triaged this task as Medium priority. · Apr 18 2017, 11:20 PM
kaldari set the point value for this task to 3.
kaldari moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.

Some users have argued against using the CentralAuth data due to inaccuracy. I don't recall the exact reasons, though.

Anyway, regardless of whether we should use that data or not, if you want to optimize the existing approach, you can!

See https://tools.wmflabs.org/guc/ for example, which does it fairly quickly.

The gist of it is that it queries many wikis at once. In theory it could query all wikis at once (since the Labs DB servers each host all wikis; the s1-7 names merely distribute load and optimize for row caching), but I found that in practice running up to 7 queries (instead of up to 800) is good enough. In fact, doing these queries separately may actually perform better for the same reason the slices exist: caching.

The combination query is made by using the UNION keyword.

For example:

SELECT
  COUNT(rev_id) AS counter,
 'enwiki' as dbname
FROM enwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle'

UNION ALL

SELECT
  COUNT(rev_id) AS counter,
 'nlwiki' as dbname
FROM nlwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle'

/* .. */

https://quarry.wmflabs.org/query/18239

| counter | dbname |
| 1895    | enwiki |
| 8455    | nlwiki |

And then loop through the results, adding the counters together, with the ability to also track which count came from which wiki. Or, if that isn't needed:

SELECT SUM(counter) FROM (
SELECT
  COUNT(rev_id) AS counter,
 'enwiki' as dbname
FROM enwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle'
UNION ALL
SELECT
  COUNT(rev_id) AS counter,
 'nlwiki' as dbname
FROM nlwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle'
) as sub;

https://quarry.wmflabs.org/query/18240

SUM(counter)
10350

Source code at https://github.com/wikimedia/labs-tools-guc/blob/abf9902dff15dd3348eeb292b98ea424dd65aeb3/src/Main.php#L185-L239

Why do we not want to use the user_editcount field in the user table?

> Why do we not want to use the user_editcount field in the user table?

Not sure, I guess it's fine. That's what Special:Preferences, the API, and various other interfaces to the edit count also use. CentralAuth actually aggregates this as well from all local databases. It doesn't store its own count anywhere.

Is that what XTools uses currently? If so, I'm unsure why that would be slow.

It's currently using the CentralAuth API to get the edit count. However, it's the getting of "recent edits across all wikis" that's slow; it's currently selecting the 10 most recent edits from all of a user's wikis (and showing the most recent 10 of them). Is there an easier way to do this than querying the revision table for each wiki?

@Samwilson Not really. But querying revision on all wikis shouldn't be too slow. It should be possible to make it pretty fast in most cases.

Ensure you're doing the following:

  • Query meta_p.wiki once to find all db names and their associated db slice (see the sketch after this list).
  • Use a quick query to find out which wikis have at least one edit. GUC uses a UNION over multiple wikis at once (e.g. all the wikis on the same db slice) to find the edit count. But you have this already, so that should save time.
  • Use revision_userindex instead of revision. The default revision view doesn't have this index for various reasons, but revision_userindex should be identical for most practical purposes, except that it performs better for by-user queries.
  • When doing the queries that are per-wiki (i.e. not UNION), at least re-use the same database connection for wikis on the same slice. This way, even though you're running 100+ queries, it only establishes <= 7 connections, which saves a lot of round-trip time.
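
For the first bullet, the slice lookup is a single cheap query. Here's a sketch (assuming the meta_p.wiki view exposes dbname, slice and is_closed columns):

SELECT dbname, slice
FROM meta_p.wiki
WHERE is_closed = 0; /* skip closed wikis; group the results by slice to batch the UNION queries */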

If after that it is still slow, consider using recentchanges (recentchanges_userindex) instead of revision (revision_userindex). This has the obvious drawback of not showing all "latest 10 edits" if some were made longer ago. (Although that may be considered a feature.) If using recent changes, be sure to filter rc_type properly (only use edit and new) since it contains a lot of other entries as well.
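
For example, such a recentchanges query might look roughly like this (a sketch; the rc_type values 0 and 1 correspond to edits and page creations):

SELECT rc_namespace, rc_title, rc_timestamp, rc_this_oldid
FROM enwiki_p.recentchanges_userindex
WHERE rc_user_text = 'Krinkle'
  AND rc_type IN (0, 1) /* edits and new pages only; excludes log entries, external (Wikidata) changes, etc. */
ORDER BY rc_timestamp DESC
LIMIT 10;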

Alternatively, you can try using UNION for the actual revision query as well. GUC doesn't actually do that right now. We just quickly query them one after the other, re-using the same connection, and only for wikis with a non-zero edit count.
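
If you do try that, it would look roughly like this (untested sketch; each per-wiki query is wrapped as a derived table so it can keep its own ORDER BY/LIMIT before the final merge):

SELECT * FROM (
  SELECT 'enwiki' AS dbname, rev_id, rev_timestamp
  FROM enwiki_p.revision_userindex
  WHERE rev_user_text = 'Krinkle'
  ORDER BY rev_timestamp DESC
  LIMIT 10
) AS en
UNION ALL
SELECT * FROM (
  SELECT 'nlwiki' AS dbname, rev_id, rev_timestamp
  FROM nlwiki_p.revision_userindex
  WHERE rev_user_text = 'Krinkle'
  ORDER BY rev_timestamp DESC
  LIMIT 10
) AS nl
/* .. one branch per wiki with a non-zero edit count .. */
ORDER BY rev_timestamp DESC
LIMIT 10;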

Optimised revision query: https://tools.wmflabs.org/guc/?user=Samwilson

Recent changes only: https://tools.wmflabs.org/guc/?src=rc&user=Samwilson

Source code:

It looks like we don't use the user_editcount column because it is considered "approximate", which begs the question, why is it approximate?

My edit counts for English Wikipedia:
user_editcount: 58,268
Live edits (xtools): 57,790
Total edits (xtools): 59,870

I'm not sure if user_editcount is supposed to approximate the live edit count or the total edit count, but it isn't close to either :( Guess I should open a new ticket for that.

> It looks like we don't use the user_editcount column because it is considered "approximate", which begs the question, why is it approximate?
>
> My edit counts for English Wikipedia:
> user_editcount: 58,268
> Live edits (xtools): 57,790
> Total edits (xtools): 59,870
>
> I'm not sure if user_editcount is supposed to approximate the live edit count or the total edit count, but it isn't close to either :( Guess I should open a new ticket for that.

This discrepancy has been brought up many times over the years, and I feel like there must be a ticket somewhere for it. I think some of it is because data failed to replicate. At the time of writing, this is what I've got for me:

| Source             | SQL                                     | Live edits | Deleted | Total   |
| Tool Labs replicas | SELECT COUNT(*) FROM revision_userindex | 106,779    | 4,013   | 110,792 |
| Tool Labs replicas | SELECT user_editcount FROM user         | 106,818    |         |         |
| Live production    | SELECT COUNT(*) FROM revision           | 106,852    | 4,020   | 110,872 |
| Live production    | SELECT user_editcount FROM user         | 106,891    |         |         |

I tried to count revision on the replicas but it timed out. Production has no revision_userindex (the same indexing apparently exists on revision).

Also, user_editcount indeed is meant to include deleted edits. For example, going through the deletion log, I found a user who has one live edit at the time of writing, but 8 edits if you include those at Special:DeletedContributions (admin-only page), and 8 edits if you query user_editcount.

That being said, there's obviously a lot of discrepancies no matter which way you go about counting edits. I've found that it is consistent for users with lower edit count (like the aforementioned example), but for those with larger counts (or maybe older accounts?), the numbers don't match up. Frankly I don't know which to believe...

It looks like the old XTools only did the expensive revision query for the single requested wiki, but used the user_editcount numbers for the top 10 other wikis list under "General statistics". This seems like a good idea that we should also follow.
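
For reference, the per-wiki lookup that implies would be something like this (a sketch, assuming the replica user view exposes user_editcount as discussed above):

SELECT user_editcount
FROM enwiki_p.user
WHERE user_name = 'Krinkle'; /* repeat (or UNION) for each wiki returned by CentralAuth */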

revision and revision_userindex are not real tables; they are views of the same table (revision), so the discrepancy in edit counts is just the time between one query and the other. It is true that revision on labsdb1001 and labsdb1003 may have lost some rows due to faulty replication. The new servers (labsdb1009/10/11) have an improved replication mechanism, and as far as we can tell, they should be identical to production (minus the filtering).

@Samwilson: FWIW, the counts in the user_editcount field should now stay closer to the XTools-specific edit counts, thanks to T163966. Are you planning on using user_editcount for the top 10 other wikis list?

@kaldari Cool. Yeah, it's currently using the globaluserinfo API call, which uses user_editcount, doesn't it? — but I'll switch it to use the DB directly (to be more in line with everything else, and a little bit faster).

I've put in more stopwatch calls, so it's easier to see in the profiler what's taking a long time.

> Also, user_editcount indeed is meant to include deleted edits. For example, going through the deletion log, I found a user who has one live edit at the time of writing, but 8 edits if you include those at Special:DeletedContributions (admin-only page), and 8 edits if you query user_editcount.

I think saying "meant to" is a stretch; I suspect that if a patch were written to keep user_editcount in sync with deletions, it would get approved. It's clearly documented (https://www.mediawiki.org/wiki/Manual:User_table#user_editcount) that user_editcount is not going to be equal to the number of live edits.

> That being said, there's obviously a lot of discrepancies no matter which way you go about counting edits. I've found that it is consistent for users with lower edit count (like the aforementioned example), but for those with larger counts (or maybe older accounts?), the numbers don't match up. Frankly I don't know which to believe...

Trust production obviously. Labs replicas also hide revdel'd edits, which could explain the discrepancy.

So, the edit counter is currently getting a user's 40 most recent edits by first getting a list of edited wikis from CentralAuth (ordered by edit count) and then looping through each of them, requesting the most recent 40 edits. Once it's got 40 (which often happens with the first query, because that's the wiki where the user has the most edits), it limits the query to only those revisions newer than the oldest one in the 40, so each iteration is (usually) a quicker query than earlier ones. (If that makes sense?!)
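
Roughly speaking, each per-wiki iteration ends up being something like this (a sketch only; the column list and the cutoff value are illustrative, not the exact query XTools builds):

SELECT rev_id, rev_page, rev_timestamp, rev_comment
FROM frwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle'
  AND rev_timestamp > '20170501000000' /* oldest timestamp among the 40 edits collected so far */
ORDER BY rev_timestamp DESC
LIMIT 40;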

The other way to do it would be to UNION all those queries, but then it wouldn't be able to use progressively tighter timestamp conditions, so it would perhaps take longer anyway.

What's the best approach?

(As for the actual edit count, the first part above, I figure we can leave that up to CentralAuth — and if it's returning the wrong thing then that'll be fixed elsewhere? For non-CentralAuth wikis, it loops through all of them and does a COUNT(rev_id) on the revision table, limiting by rev_user_text.)

> So, the edit counter is currently getting a user's 40 most recent edits by first getting a list of edited wikis from CentralAuth (ordered by edit count) and then looping through each of them, requesting the most recent 40 edits. Once it's got 40 (which often happens with the first query, because that's the wiki where the user has the most edits), it limits the query to only those revisions newer than the oldest one in the 40, so each iteration is (usually) a quicker query than earlier ones. (If that makes sense?!)

How long does this take, compared to other things the edit counter looks up? If it's always going to be slow, I'd consider moving this functionality to an API endpoint and load it into the edit counter interface asynchronously (e.g. the AutoEdits tool does this for fetching non-automated contributions). Most users aren't going here to see global contributions, that's what GUC is for. I'm still finding the new production edit counter to be around twice as slow as the legacy xtools. Edit counter is by far the most popular tool, so speed is paramount and should have priority over showing any non-critical, nice-to-have data that the old tool offered.

Good idea. It's probably a bit under half the load time for many users. It has been moved to load via JS.

DannyH raised the priority of this task from Medium to High. · Jul 3 2017, 11:47 PM

So now the page loads quickly, but eventually I get an error under Latest global edits saying: Error querying Global contributions API: timeout. We should still work on optimizing the queries themselves, possibly just re-using the queries from the old XTools if they are reliably faster.

The speed has improved a bit by doing the ~30 queries for global edits in one UNION'd query. Still times out for some contributors, but works for lots. Is the load time of the main EC page acceptable now (excluding the global edits) do you think?

Try it out on http://xtools.wmflabs.org/ec

Hmm, it still seems slower than the old XTools' edit counter (even without loading global edits), but I can't prove that due to query caching.

http://xtools.wmflabs.org/ec/fr.wikipedia.org/Kaldari:
Executed in 64.903 second(s). · Taken 6.28 megabytes of memory to execute.

A bit more rejigging and things are improving (I think). Now with a cold cache https://xtools.wmflabs.org/ec/fr.wikipedia.org/Kaldari is:

Executed in 3.234 second(s). · Taken 6.33 megabytes of memory to execute.

But that seems too good, so perhaps the DB server is doing some caching too?

> But that seems too good, so perhaps the DB server is doing some caching too?

Yeah, probably still had some of the queries cached in the database.

I'll try running http://xtools.wmflabs.org/ec/en.wikipedia.org/Kaldari today and then tomorrow against the old XTools:
New XTools: Executed in 26.229 second(s). · Taken 6.27 megabytes of memory to execute.

http://tools.wmflabs.org/xtools-ec/?user=Kaldari&project=en.wikipedia.org
Old XTools: Executed in 34.51 second(s). · Taken 4.5 megabytes of memory to execute. (Peak: 8)

Congratulations! The new one is faster!

Hmm, tried loading the new XTools again and this time it was slower :(
http://xtools.wmflabs.org/ec/en.wikipedia.org/Kaldari
Executed in 46.718 second(s). · Taken 6.2 megabytes of memory to execute.

:-(

Let's wait till we've redeployed to the new larger prod servers (T169590), and see if that changes the load time.

> Hmm, tried loading the new XTools again and this time it was slower :(
> http://xtools.wmflabs.org/ec/en.wikipedia.org/Kaldari
> Executed in 46.718 second(s). · Taken 6.2 megabytes of memory to execute.

This might be because at the time you tested it, we weren't looking at every edit the user made to compute the automated edits, instead only looking at the last 1,000 edits. We need it to look at all edits.

Old EditCounter vs new:

The big issue is that for many users we can't load the full data (not just the edit summaries) into memory – it just won't fit. So with the old XTools, the data was pulled from the replicas and added to a temporary MySQL table, and calculations were done on that. This transfers the memory load to a different server, and while it caused problems of its own (T133321), it also is evidently faster than the way we're doing it with the new XTools. I like that we're not doing this crazy temporary table thing – our approach makes things easier to test and the code a lot less sloppy.

So I propose we go about other means of optimization. For starters, we can load more things asynchronously, namely the non-automated stats, since they're more or less a duplication of the AutoEdits tool. This data isn't super important and is shown way down at the bottom of the page, so a little wait time is fine. I have created T170185 and will work on this soon :)

Other ideas:

After my recent conversation with Bryan, I realized we should be avoiding fetchAll and instead calculating things iteratively with fetch and a while loop, updating our totals along the way. This will keep only one record in memory at a time. I don't know if we'd be able to make such optimizations here with EditCounter, or how saving memory would affect the overall speed, but it's something to think about.

Even more ideas:

Symfony has an awesome built-in profiler that you can use in the dev environment, showing all queries and their run time. One thing it doesn't show, which I really wish it did, is external requests. Maybe that's not easy to do, but it would be very interesting to see. We may be making too many API calls, or could be combining some. For instance, any time we call displayTitle on a page it has to make an API query, unless getInfo for that page has already been called and cached. If we have a group of pages for which we only need the displayTitle, we can put it all in one query (or use the massApi helper).

Does anyone know if there's an easy way to see what requests are being made? Hopefully we don't have to resort to one of those OS-level network monitoring tools, like https://superuser.com/questions/123001/is-there-a-good-tool-for-monitoring-network-activity-on-mac-os-x

I have a half-completed Symfony profiler data collector for addwiki/mediawiki-api-base calls, but I couldn't get it working properly. I'll try harder, and publish it as a thing.

The other thing to keep in mind is the stopwatch component, which can help find things that take a long time (it includes all DB queries by default, but anything else can be timed by it too; I've done a bit in the EditCounterRepository).

MusikAnimal moved this task from Archive to Needs Discussion on the Community-Tech board.

Reopening for further investigation. While investigating T172374 I've noticed that, for dewiki at least, the new Edit Counter loads considerably more slowly than the old one.

The old took around 47 seconds: https://tools.wmflabs.org/xtools-ec/?user=Baumfreund-FFM&project=de.wikipedia.org

The new takes around 153 seconds, and that's before it even loads the global contributions: https://xtools.wmflabs.org/ec/de.wikipedia.org/Baumfreund-FFM :(

Part of this I think is due to T171264#3468493, but I'm baffled as to how the old edit counter was able to compute this data so much more quickly overall. We should find out!

So I think we should look into using temporary tables again, just like the old version. Our arguments against this were:

  1. they caused a lot of problems, T133321 mainly
  2. it wasn't very testable

For #1, we should just delete these tables in the code, and/or create a cron to delete them. I don't know how PHP execution paths work, but we just need to make sure that if the request is abandoned, the table still gets deleted. This should theoretically happen anyway with CREATE TEMPORARY TABLE but obviously it did not, for whatever reason.

For #2, despite my claims that it isn't testable, I bet it could be. We'd first query for every revision the user made, directly storing it in a temporary table. This means memory is completely off-loaded to the MySQL server (which is a bonus). From there we can do individual queries/processing similar to what we're doing now, and hence make them testable. The idea being since our temporary table is much smaller than the real revision table, it must be considerably faster, so we probably can afford to do individual testable queries. Does that make sense? I'm just theorizing all of this in my head.
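
Very roughly, something like this (a sketch only; it assumes a connection where CREATE TEMPORARY TABLE is permitted, the table name and column choices are illustrative, and in practice the rows might have to be pulled from the replica and inserted into a table on another server, as described above):

CREATE TEMPORARY TABLE tmp_user_revisions AS
SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_len
FROM enwiki_p.revision_userindex
WHERE rev_user_text = 'Krinkle';

/* Individual, testable stats queries then run against the much smaller temp table */
SELECT COUNT(*) FROM tmp_user_revisions WHERE rev_minor_edit = 1;
SELECT MIN(rev_timestamp), MAX(rev_timestamp) FROM tmp_user_revisions;

DROP TEMPORARY TABLE IF EXISTS tmp_user_revisions;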

It's probably a lot of work, but I think this may be worth our while.

Hmm, yeah, it seems even slower than it was back in July:
http://xtools.wmflabs.org/ec/en.wikipedia.org/Kaldari: Executed in 71.401 second(s). · Taken 6.24 megabytes of memory to execute.

kaldari changed the point value for this task from 3 to 8.
kaldari moved this task from Q1 2018-19 to Ready on the Community-Tech-Sprint board.

Just noting that since caeaba8d non-enwiki pages load much faster. In our example it took 62 seconds (down from ~150 secs). This is still slower than the old XTools, though :/

Preliminary testing of the new async approach (note: xtools-dev is on the ec-performance-async branch at the time of writing, and might not be when you click on the links):

These are all faster than the old XTools :) We've also found that running on actual production is a good bit faster than xtools-dev, so who knows... maybe we'll far exceed the legacy version!

I'm no longer using temporary tables. They might speed things up even more, but we have some concerns about them being future-proof, so I figure let's hold off on that idea for the time being.

The other thing is that here I'm making a grand total of 6 requests at the same time, 7 if you include the parent request. That probably should make me nervous but we're discussing using a dedicated API server. We have enough quota for another big box, and we can dedicate all of its resources to the almighty API :)

I'll report back here when I have a PR ready!

PR: https://github.com/x-tools/xtools/pull/96/files

The code itself is ready for review, but I still have this listed as "WIP" because I want to get the new API server up and running and document how to make it work in administration.rst. So basically, in the Apache config of our main app server, we need to make all requests to /api go to xtools-prod02.xtools.eqiad.wmflabs. That server will have an identical installation of XTools.

I can probably figure out how to set this up in Apache, but I'm not sure how to set up the new production instance. https://wikitech.wikimedia.org/wiki/Tool:XTools is missing a few steps I think (e.g. there is no /var/www directory), and our Puppet config apparently was never finished (T170514). @Samwilson I think you said you might want to look into that?

There is potentially one other outstanding issue. Ideally we would make this Edit Counter API for internal use only, since it is resource-heavy and hence isn't really well-suited for public use. The "Symfony" way of doing this is to configure access control in security.yml, see http://symfony.com/doc/current/security/access_control.html. I got this to work, but it slows down the tool significantly (there are 5 different requests made to the API, and each one has to be authenticated). I'm wondering if this is just an issue with xtools-dev, since it's not completely a production environment? Maybe there is some extra overhead there that's causing this? If we can get the new XTools prod instance up and running, I can use it for testing this issue.

@MusikAnimal: Those new runtimes look promising! Glad you were able to squeeze some more speed from the async requests. Regarding setting up a dedicated API server, I'm worried that we're treading into over-optimization at the expense of simplicity and maintainability. One of the main goals of rewriting XTools was to streamline it for maintainability, so that any technically competent community member could take it over at some point in the future and keep it running. At this point the new XTools is already pretty complicated, and I think a lot of technically competent community members would be reluctant to mess with it due to the complexity. Having a dedicated API server would only make that worse, IMO. I'm inclined to leave things how they are unless performance is significantly worse than the old XTools.

> @MusikAnimal: I'm worried that we're treading into over-optimization at the expense of simplicity and maintainability. One of the main goals of rewriting XTools was to streamline it for maintainability, so that any technically competent community member could take it over at some point in the future and keep it running. At this point the new XTools is already pretty complicated, and I think a lot of technically competent community members would be reluctant to mess with it due to the complexity. Having a dedicated API server would only make that worse, IMO. I'm inclined to leave things how they are unless performance is significantly worse than the old XTools.

To retouch on what we talked about in the retro, I think we should at least give this a try. It will for one give us an opportunity to refine the puppet script. Moreover, I'm frankly scared to flip the switch when the async stuff makes 5+ queries to itself. Edit Counter is super duper resource-heavy. I'd give more weight and priority to stability than keeping things simple. At any rate, this rendition of the Edit Counter by no means requires a different server, and we can undo this at any time. The same is true for future maintainers. Should they have to rebuild a production XTools environment from scratch, they could do so with a single app server -- but hopefully they'll have our puppet script to do all the work for them :)

> To retouch on what we talked about in the retro, I think we should at least give this a try.

OK, but let's try to finish this soon. It would be great to finish work on XTools by the end of this week, which is the end of the quarter. We really shouldn't be doing any significant work on XTools in Q2.

Looks like we're done here!

https://xtools.wmflabs.org/ec/en.wikipedia.org/Kaldari ~ 20 secs (down from 71 secs)
https://xtools.wmflabs.org/ec/de.wikipedia.org/Baumfreund-FFM ~ 40 secs (down from 153 secs)
https://xtools.wmflabs.org/ec/en.wikipedia.org/GoldenRing ~ 3 secs (down from 20 secs)

However we still want to try offloading the API requests to a dedicated server (T176826). For now, just to be safe, I'm going to keep the old Edit Counter where it is. In the meantime I'll forward a few more of the other older tools to the new version, which will give us a better idea of how stable we are under higher traffic. Then we'll reconsider forwarding the Edit Counter :)