Page MenuHomePhabricator

Compare using recentchanges table against fully using the cu_changes table for range contribution queries
Closed, ResolvedPublic1 Story Points

Description

The test queries that we ran against recentchanges in T156318 were surprisingly slow. Let's compare using recentchanges against fully using the cu_changes table (for both revision info and IP address info) and see if there is a significant difference. If using cu_changes for both revision info and IP address info is faster, we'll need to re-evaluate putting Special:RangeContributions in core.

Event Timeline

kaldari created this task.Feb 9 2017, 5:15 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 9 2017, 5:15 PM
kaldari triaged this task as Normal priority.Feb 9 2017, 5:45 PM
kaldari set the point value for this task to 1.
Bawolff added a subscriber: Bawolff.Feb 9 2017, 6:22 PM

The problem is ultimately about doing a range query in 2-dimensions (time and ip space). If you truly need it to scale well with widely varying ranges in both dimensions, consider looking into full text indexes (e.g.elastic search), maybe, or perhaps whatever we do for gps coordinates. But that makes actually making the feature much more complicated.

In the traditional db model I expect the best you could do is add hex ip fields to recentchanges, and rely on the fact that rc gets pruned after 30 days so there is not that much to sort through in the list of all rc entries that match a given range

MusikAnimal edited projects, added Community-Tech-Sprint; removed Community-Tech.
MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.

Initial outlook not good :( I tried similar queries that I did with revision and only fetching rc_id, and it didn't seem to be that much faster. But see my comment at T156318#3018600. If we could (pretty please) replicate the timestamp in our new table I think things will be beyond reasonably fast. That's assuming my tests are accurate and I'm not being fooled by query caching.

@Bawolff @jcrespo Any ideas on how I could definitely prevent query caching? It seems minute changes to the timestamp and other clauses in the SQL don't get around it. I have a collection of IP ranges to test, but they were carefully picked out because they are wide and contain lots of edits. So simply changing to any ole IP range is not ideal for the purposes of testing.

Initial outlook not good :( I tried similar queries that I did with revision and only fetching rc_id, and it didn't seem to be that much faster.

Please post the queries and times if possible. Are you saying that queries using cu_changes for revision data weren't much faster than queries using recentchanges for revision data?

jcrespo added a comment.EditedFeb 11 2017, 5:36 PM

@Bawolff @jcrespo Any ideas on how I could definitely prevent query caching?

There is not such a thing as query caching on production hosts:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/production.my.cnf.erb;c0c6dc2a2ae9044dc40bbeecb8942a2d33afd10e$63

If you are getting cached results, that is not MySQL.

If you execute twice the same queries, and the second time works faster, that is data getting into memory instead of disk, and that is a normal optimization (but the queries themselves are not cached). On production hosts, almost all data will be in memory except rarely-accessed tables.

Whenever you test mysql performance, please compare the session handler statistics, to check the execution plan: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/74

There is not such a thing as query caching on production hosts:
If you are getting cached results, that is not MySQL.
If you execute twice the same queries, and the second time works faster, that is data getting into memory instead of disk, and that is a normal optimization (but the queries themselves are not cached). On production hosts, almost all data will be in memory except rarely-accessed tables.
Whenever you test mysql performance, please compare the session handler statistics, to check the execution plan: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/74

First off, thanks for the link to your talk! I plan to go through all of it at some point :) However I didn't have much luck with the handler stats. When I ran SHOW SESSION STATUS LIKE 'Hand%', I got the same results before, during and after my test queries, which were mostly zeros (sorry didn't copy the results).

I'm also using the Analytics prod replicas, so I'm not sure how caching and what not would differ with production. I asked @kaldari to run one of my test queries over and over in succession on production, and he also found the first query to be much slower than the identical ones that followed it. I didn't have him check the handler stats.

Overall, admittedly with limited SQL knowledge, I'm going to stand by my conclusion here: T156318#3023578. I have not done any more tests on recentchanges because I'm convinced we can get away with using revision, which is preferable.

You are supposed to use them like this:

mysql> FLUSH STATUS; -- drop previous session stats
mysql> SELECT ... ; -- run your query
mysql> SHOW STATUS like 'Hand%'; -- show status for your query only

That will help you in the future to compare efficiency, independently of the memory state of the server.

DannyH moved this task from Estimated to Bug backlog on the Community-Tech board.
MusikAnimal closed this task as Resolved.Jun 19 2017, 6:54 PM
MusikAnimal moved this task from Bug backlog to Archive on the Community-Tech board.

We're no longer piggybacking off of CheckUser, and the schema for a new table has been approved at T156318. Closing.