Compare using recentchanges table against fully using the cu_changes table for range contribution queries
Closed, ResolvedPublic1 Estimated Story Points
Actions

Assigned To

Authored By

	kaldari
	Feb 9 2017, 5:15 PM

Description

The test queries that we ran against recentchanges in T156318 were surprisingly slow. Let's compare using recentchanges against fully using the cu_changes table (for both revision info and IP address info) and see if there is a significant difference. If using cu_changes for both revision info and IP address info is faster, we'll need to re-evaluate putting Special:RangeContributions in core.

Related Objects
Search...

Status	Assigned	Task
Declined	None	T152114 Show provider and country for IPs in Special:RangeContributions
Declined	None	T145912 Create new Special:RangeContributions page to support viewing contributions across an IP range
Resolved	MusikAnimal	T157699 Compare using recentchanges table against fully using the cu_changes table for range contribution queries

Event Timeline

kaldari created this task.Feb 9 2017, 5:15 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 9 2017, 5:15 PM

kaldari moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.Feb 9 2017, 5:15 PM

kaldari triaged this task as Medium priority.Feb 9 2017, 5:45 PM

kaldari set the point value for this task to 1.

• DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.Feb 9 2017, 5:45 PM

The problem is ultimately about doing a range query in 2-dimensions (time and ip space). If you truly need it to scale well with widely varying ranges in both dimensions, consider looking into full text indexes (e.g.elastic search), maybe, or perhaps whatever we do for gps coordinates. But that makes actually making the feature much more complicated.

In the traditional db model I expect the best you could do is add hex ip fields to recentchanges, and rely on the fact that rc gets pruned after 30 days so there is not that much to sort through in the list of all rc entries that match a given range

MusikAnimal claimed this task.Feb 10 2017, 9:31 PM

MusikAnimal edited projects, added Community-Tech-Sprint; removed Community-Tech.

MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.

Initial outlook not good :( I tried similar queries that I did with revision and only fetching rc_id, and it didn't seem to be that much faster. But see my comment at T156318#3018600. If we could (pretty please) replicate the timestamp in our new table I think things will be beyond reasonably fast. That's assuming my tests are accurate and I'm not being fooled by query caching.

@Bawolff @jcrespo Any ideas on how I could definitely prevent query caching? It seems minute changes to the timestamp and other clauses in the SQL don't get around it. I have a collection of IP ranges to test, but they were carefully picked out because they are wide and contain lots of edits. So simply changing to any ole IP range is not ideal for the purposes of testing.

Initial outlook not good :( I tried similar queries that I did with revision and only fetching rc_id, and it didn't seem to be that much faster.

Please post the queries and times if possible. Are you saying that queries using cu_changes for revision data weren't much faster than queries using recentchanges for revision data?

@Bawolff @jcrespo Any ideas on how I could definitely prevent query caching?

There is not such a thing as query caching on production hosts:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/templates/mariadb/production.my.cnf.erb;c0c6dc2a2ae9044dc40bbeecb8942a2d33afd10e$63

If you are getting cached results, that is not MySQL.

If you execute twice the same queries, and the second time works faster, that is data getting into memory instead of disk, and that is a normal optimization (but the queries themselves are not cached). On production hosts, almost all data will be in memory except rarely-accessed tables.

Whenever you test mysql performance, please compare the session handler statistics, to check the execution plan: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/74

In T157699#3019450, @jcrespo wrote:

There is not such a thing as query caching on production hosts:

If you are getting cached results, that is not MySQL.

If you execute twice the same queries, and the second time works faster, that is data getting into memory instead of disk, and that is a normal optimization (but the queries themselves are not cached). On production hosts, almost all data will be in memory except rarely-accessed tables.

Whenever you test mysql performance, please compare the session handler statistics, to check the execution plan: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/74

First off, thanks for the link to your talk! I plan to go through all of it at some point :) However I didn't have much luck with the handler stats. When I ran SHOW SESSION STATUS LIKE 'Hand%', I got the same results before, during and after my test queries, which were mostly zeros (sorry didn't copy the results).

I'm also using the Analytics prod replicas, so I'm not sure how caching and what not would differ with production. I asked @kaldari to run one of my test queries over and over in succession on production, and he also found the first query to be much slower than the identical ones that followed it. I didn't have him check the handler stats.

Overall, admittedly with limited SQL knowledge, I'm going to stand by my conclusion here: T156318#3023578. I have not done any more tests on recentchanges because I'm convinced we can get away with using revision, which is preferable.

You are supposed to use them like this:

mysql> FLUSH STATUS; -- drop previous session stats
mysql> SELECT ... ; -- run your query
mysql> SHOW STATUS like 'Hand%'; -- show status for your query only

That will help you in the future to compare efficiency, independently of the memory state of the server.

• DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint.Mar 14 2017, 11:07 PM

• DannyH moved this task from Up Next (June 3-21) to Bug backlog on the Community-Tech board.

MusikAnimal added a parent task: T145912: Create new Special:RangeContributions page to support viewing contributions across an IP range.Mar 22 2017, 4:06 PM

BethNaught subscribed.Mar 22 2017, 5:00 PM

We're no longer piggybacking off of CheckUser, and the schema for a new table has been approved at T156318. Closing.

Nemo_bis added a project: MediaWiki-Special-pages.Mar 3 2018, 6:42 AM

Compare using recentchanges table against fully using the cu_changes table for range contribution queriesClosed, ResolvedPublic1 Estimated Story PointsActions

Description

Related ObjectsSearch...

Event Timeline

Compare using recentchanges table against fully using the cu_changes table for range contribution queries
Closed, ResolvedPublic1 Estimated Story Points
Actions

Related Objects
Search...