
Review category queries
Open, High, Public

Description

Currently, to count edits within a category we're doing something like:

SELECT COUNT(DISTINCT(revs.rev_page)) AS edited,
    IFNULL(SUM(CASE WHEN revs.rev_parent_id = 0 THEN 1 ELSE 0 END), 0) AS created
FROM `enwiki_p`.`revision_userindex` revs
INNER JOIN `enwiki_p`.`page` ON page_id = rev_page
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = rev_page
WHERE (revs.rev_user_text IN ('MusikAnimal', 'Samwilson'))
AND (cl_to IN ('Living_people', 'Art_in_New_York_City'))
AND (revs.rev_timestamp BETWEEN 20180101000000 AND 20180131235959)
AND page_namespace = 0

This is pretty fast (~0.20 sec in this case) but not so much when you don't filter by user:

SELECT COUNT(DISTINCT(revs.rev_page)) AS edited,
    IFNULL(SUM(CASE WHEN revs.rev_parent_id = 0 THEN 1 ELSE 0 END), 0) AS created,
    COUNT(DISTINCT(revs.rev_user_text)) AS editors
FROM `enwiki_p`.`revision_userindex` revs
INNER JOIN `enwiki_p`.`page` ON page_id = rev_page
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = rev_page
WHERE (cl_to IN ('Living_people', 'Art_in_New_York_City'))
AND (revs.rev_timestamp BETWEEN 20180101000000 AND 20180131235959)
AND page_namespace = 0

This times out entirely (our max is 900 secs), mainly because "Living people" is so big. If you run it for only "Art in New York City" it's fairly fast (~1.20 secs). I imagine most organizers won't be using a category this big, but it's good to prepare for the possibility. Note that here we're also counting the number of unique editors. To make matters worse, we would also, if possible, like to exclude edits made by bots. That would mean another JOIN on user_groups, which I imagine can only make the query slower.
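
A rough sketch of what the bot-exclusion join might look like, as an anti-join against user_groups (this assumes rev_user on revision_userindex lines up with ug_user, and is untested against the replicas):

```sql
-- Hypothetical sketch: exclude bot edits via an anti-join on user_groups.
-- Assumes revision_userindex.rev_user matches user_groups.ug_user; untested.
SELECT COUNT(DISTINCT revs.rev_page) AS edited
FROM `enwiki_p`.`revision_userindex` revs
INNER JOIN `enwiki_p`.`page` ON page_id = rev_page
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = rev_page
LEFT JOIN `enwiki_p`.`user_groups` ug
    ON ug.ug_user = revs.rev_user
    AND ug.ug_group = 'bot'
WHERE cl_to IN ('Living_people', 'Art_in_New_York_City')
AND revs.rev_timestamp BETWEEN 20180101000000 AND 20180131235959
AND page_namespace = 0
AND ug.ug_user IS NULL  -- keep only rows with no matching 'bot' group
```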

Is there a better way to construct this query?

Finally, for our so-called "revision browser", we want the same queries as above, but SELECTing each individual revision with basic data such as username, page title, and diff size.
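
For illustration, the revision-browser variant might look something like this (a hypothetical sketch with the same joins and filters; note rev_len is the page size after the edit, so a true diff size would also need the parent revision's length):

```sql
-- Hypothetical sketch of the "revision browser" query: same joins/filters,
-- selecting individual revisions instead of aggregates. Untested.
SELECT revs.rev_id, revs.rev_timestamp, revs.rev_user_text,
       page_title, revs.rev_len
FROM `enwiki_p`.`revision_userindex` revs
INNER JOIN `enwiki_p`.`page` ON page_id = rev_page
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = rev_page
WHERE cl_to IN ('Living_people', 'Art_in_New_York_City')
AND revs.rev_timestamp BETWEEN 20180101000000 AND 20180131235959
AND page_namespace = 0
ORDER BY revs.rev_timestamp DESC
LIMIT 50
```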

Event Timeline

@Milimetric @Nuria If I may take you up on the offer to review our queries... :)

I do like the idea of using EventStreams or other technologies, but from a programming standpoint ideally we'd have one way of doing things, regardless of the size of the category, length of event, etc.

aah! sorry I missed this, you should ping me on IRC, Phab notifications get grouped with the endless email firehose.

Ok, will look at these right away and give you any thoughts.

This is really hard! The real answer is that it's too heavy a query for this schema, considering the lack of usable indices (like no PRIMARY on categorylinks). I got it to return in 34 seconds, but only for a single day and a single category; I still haven't been able to get it to return for the whole month (UPDATE: it finishes in 25 minutes for the whole month, which is bumping up against the timeout limit, so it seems kind of dangerous to me):

 select count(distinct rev_page) as edited,
        sum(parent) as created,
        count(distinct rev_user_text) as editors

   from (select distinct rev_page, rev_user_text, if(rev_parent_id = 0, 1, 0) as parent
           from revision_userindex
                    inner join
                page                on page_id = rev_page
                                    and page_namespace = 0
                                    and rev_timestamp between '20180101000000' and '20180101235959'
                    inner join
                categorylinks       on cl_from = rev_page
                                    and cl_to = 'Living_people'
        ) edited_page_with_category
;

In my opinion, the proper thing to do is to define a denormalized data source that we could ingest into Druid and query externally, for example (rev_page, rev_user_text, rev_parent_id = 0, page_namespace, array of categorylinks for rev_page). That should answer the kinds of queries you have really, really fast. If you want you can poke the DBAs, Jaime or Manuel; they can look at the query too, and maybe agree to make an index. But to me the longer-term answer is to collaborate on a better shape of data to solve this problem, get that launched sometime in the next couple of quarters, and move all the querying to it.
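
For illustration only, the denormalized shape described above could be sketched as a table definition like the following (a Druid datasource would actually be defined via an ingestion spec, not SQL DDL; all names here are hypothetical):

```sql
-- Hypothetical sketch of the denormalized row shape, one row per revision.
CREATE TABLE revision_denormalized (
    rev_page       INT UNSIGNED   NOT NULL,  -- page the revision belongs to
    rev_user_text  VARBINARY(255) NOT NULL,  -- editor name
    is_creation    TINYINT(1)     NOT NULL,  -- 1 if rev_parent_id = 0
    page_namespace INT            NOT NULL,
    categories     TEXT           NOT NULL   -- categorylinks for rev_page
);
```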

(Here's me babbling about how I got to the query above.) I tried to isolate the heavy part from the answer that you needed, so I split it up into a bigger subquery. That didn't improve performance at all, but it did get rid of the duplicate rows caused by categorylinks matching multiple times for each rev_page. I also tweaked the count and sum statements a bit, just in case it matters (I think the optimizer sees both versions as equivalent, though).

I was finally able to get anything at all to return on enwiki by telling it to USE INDEX(PRIMARY) on the categorylinks join: really you're only using cl_from and cl_to, which are the PRIMARY key, but it was using cl_timestamp for some reason (probably it thinks it has to sort the output). This is ugly, nonstandard SQL, but it seems to run faster. It still takes forever to go through the whole month, but at least it returns. The other thing I did was ask for one category at a time, i.e. = 'Living_people' instead of IN ('Living_people', ...).

So, this is the version that at least runs on enwiki in the private dbstore1002, with the USE INDEX syntax. I think this is a slower box than you have in the cloud, but the USE INDEX helps. Maybe if you could have that index in the cloud, it would work?

select count(distinct rev_page) as edited,
       sum(parent) as created,
       count(distinct rev_user_text) as editors

  from (select distinct rev_page, rev_user_text, if(rev_parent_id = 0, 1, 0) as parent
          from revision
                   inner join
               page            on page_id = rev_page
                               and page_namespace = 0
                               and rev_timestamp between '20180101000000' and '20180131235959'
                   inner join
               categorylinks   USE INDEX(PRIMARY)
                               on cl_from = rev_page
                               and cl_to = 'Living_people'
       ) edited_page_with_category

;

Thanks for the thorough investigation! I hesitate to bug the DBAs unless I really need to, but here there is an index on production that's not on the replicas. Maybe they can help.

@Marostegui if I may ping you :) Would it be possible to add a PRIMARY index on categorylinks, as described above? This is for Grant-Metrics, a Foundation-built project for event coordinators and grantees. They need to be able to get various statistics about pages that are in a category and within a date range.

The query in question that's too slow:

SELECT COUNT(DISTINCT rev_page) as edited,
    SUM(parent) as created,
    COUNT(DISTINCT rev_user_text) as editors
FROM (
    SELECT DISTINCT rev_page, rev_user_text, IF(rev_parent_id = 0, 1, 0) AS parent
    FROM revision
    INNER JOIN page ON page_id = rev_page
        AND page_namespace = 0
        AND rev_timestamp BETWEEN 20180101000000 AND 20180131235959
    INNER JOIN categorylinks ON cl_from = rev_page
        AND cl_to = 'Living_people'
) edited_page_with_category

an alternative variant that I was using (even slower):

SELECT COUNT(DISTINCT(revs.rev_page)) AS edited,
    IFNULL(SUM(CASE WHEN revs.rev_parent_id = 0 THEN 1 ELSE 0 END), 0) AS created,
    COUNT(DISTINCT(revs.rev_user_text)) AS editors
FROM `enwiki_p`.`revision` revs
INNER JOIN `enwiki_p`.`page` ON page_id = rev_page
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = rev_page
WHERE (cl_to = 'Living_people')
AND (revs.rev_timestamp BETWEEN 20180101000000 AND 20180131235959)
AND page_namespace = 0

Ideally we could use cl_to IN ('Living_people', 'Art_in_New_York_City') etc. and be able to query for multiple categories. But if it's better we can just break them out into separate queries.

If it is not strictly necessary I would rather not create a new index on labs to avoid it drifting too much from production. So if it is possible to split the query into smaller ones, that'd be ideal.
If that is not possible and the indexes are really necessary, please send a patch for https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/profile/files/labs/db/views/maintain_replica_indexes.py

Milimetric moved this task from Incoming to Mentoring on the Analytics board.

Have we tried breaking this into two queries?

  1. Get the page IDs that are in the category and the timeframe.
  2. Query the revisions to those page IDs.

At least then we'd have a good sense which part of this is particularly slow.
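
The two-step split might be sketched roughly like this (hypothetical, untested; step 2 would likely run in batches rather than with one giant IN list):

```sql
-- Step 1: collect the page IDs in the category (and namespace).
SELECT page_id
FROM `enwiki_p`.`page`
INNER JOIN `enwiki_p`.`categorylinks` ON cl_from = page_id
WHERE cl_to = 'Living_people'
AND page_namespace = 0;

-- Step 2: query revisions for those page IDs, substituting the IDs from
-- step 1 (elided here).
SELECT COUNT(DISTINCT rev_page) AS edited,
       COUNT(DISTINCT rev_user_text) AS editors
FROM `enwiki_p`.`revision_userindex`
WHERE rev_page IN (/* page IDs from step 1 */)
AND rev_timestamp BETWEEN 20180101000000 AND 20180131235959;
```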

We could also do some of the counting by returning raw data in an array and looping through it to count whatever criteria we need. If that makes the queries faster, it could be worth it: looping through things in code is generally pretty fast, depending on how large the array is.

@aezell I did try that when I was trying to tune the query. There are both lots of pages in big categories and way too many records in enwiki_p.revision_userindex; joining those two is the big problem. So you could force it to finish by working through that page-ID list one by one or ten by ten, querying for the revisions as you go. But that seems like an awful lot of pain when what we really need is a better analytics-oriented way to look at this data. Instead of hacks like that, I would recommend transforming the data in a big batch job once per day or week (or whatever's feasible), and then querying a better dataset (like a denormalized, indexed revision table with a category array column).

@Milimetric I made the mistake of posting before reading all the comments or it would have been obvious from your earlier comment that you likely tried that already.

@MusikAnimal How does this change once we store the Page IDs?

From a very cursory investigation, having the page IDs on hand seems to really help. We will be putting a LIMIT on it. Category:Living_people was an extreme example just to test the waters. I don't expect we'll want to allow querying of that many pages without having usernames to narrow down the query.