
PageTriageUtil::getTopTriagers() is not accurate
Closed, Resolved · Public · 3 Estimated Story Points

Description

PageTriageUtil::getTopTriagers() currently relies on data from the pagetriage_log table. However, this data is not reliable, since pagetriage_log only includes data for pages that are still in the PageTriage system (either because they are unreviewed or because they haven't been garbage-collected yet).

We need to either switch PageTriageUtil::getTopTriagers() to look for data in the main logging table, or make pagetriage_log into a permanent log that isn't synced with pagetriage_page. (See updatePageTriageQueue.php and PageTriage::deleteFromPageTriage() for the code that keeps them in sync.)

I'm not really sure what the purpose of the pagetriage_log table is, as it seems to largely duplicate information already in the logging table.
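For context, here is a minimal sketch of the kind of aggregation getTopTriagers() effectively runs against pagetriage_log today. The column names (ptrl_user_id, ptrl_reviewed, ptrl_timestamp) are assumed from the extension schema and are not verified here; the point is that rows for pages that have left the PageTriage queue are purged, so counts over longer windows come out too low.

-- Sketch only; column names assumed, not the actual extension code
SELECT ptrl_user_id AS `reviewer_id`,
       COUNT(*) AS `reviews`
FROM pagetriage_log
WHERE ptrl_reviewed = 1
AND ptrl_timestamp > '20170519000000'  -- e.g. the last 30 days
GROUP BY ptrl_user_id
ORDER BY reviews DESC;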

Event Timeline

kaldari triaged this task as High priority. Jun 19 2017, 7:40 PM
kaldari updated the task description.
kaldari added a subscriber: MusikAnimal.

If we decide to switch to using the logging table, we'll need to look for actions with log_type of pagetriage-curation and log_action of reviewed.

This is perhaps why the old code only supported stats for the past day and week. My fault that this was overlooked :-/

As far as I can tell, pagetriage_log is only used for these stats, so my guess is that it was created because querying logging took too long. However, the stats are cached, and things are probably faster than they were in 2012, so maybe we could now do without the replicated table. If not, let's just remove the last-month option, or change the purging so that data isn't removed until it is a month old, rather than removing the purging altogether.

I ran https://quarry.wmflabs.org/query/19672 on production and it took 25 seconds. That's probably too slow... but we could add a cron job to pre-compute and cache the stats, say every hour. Consumers of the API would only ever get the cached result, so they are never stuck waiting. How does that sound?

Since querying the logging table is slow, it seems like the easiest solution is to just convert pagetriage_log into a permanent log and remove the syncing/purging. Setting up a new purging arrangement (that is separate from the pagetriage_page purging) just sounds like unneeded complexity. Once the table is too big to query efficiently (in like 10 years) we can purge it again :)

That AND log_params NOT LIKE '%::auto";i:1%' is anything but efficient. You should be able to replace it with AND log_action = 'patrol' (i.e. excluding automatic patrols), after which you could at least use the type_action index.

If you need to query based on data in log_params, you should put it in the log_search table, which is indexed and easily queryable.
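For context, the general log_search pattern being suggested looks something like this (a sketch only; the ls_field/ls_value pair here is purely illustrative, and the auto-patrol flag may not actually be stored in log_search):

-- Illustrative sketch: querying log_params-derived data via the indexed
-- log_search table (ls_log_id / ls_field / ls_value) instead of a LIKE
-- over the serialized log_params blob
SELECT l.log_id,
       l.log_user_text,
       l.log_timestamp
FROM logging l
JOIN log_search s ON s.ls_log_id = l.log_id
WHERE s.ls_field = 'some_field'   -- hypothetical field name
AND s.ls_value = 'some_value';    -- hypothetical value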

That AND log_params NOT LIKE '%::auto";i:1%' is anything but efficient. You should be able to replace it with AND log_action = 'patrol' (i.e. excluding automatic patrols), after which you could at least use the type_action index.

Yes, you are correct, that isn't needed here! Run time using AND log_action = 'patrol' was 10 seconds and I got the same numbers :) The log_params clause was a relic from queries over older time periods, when, due to a bug, some log entries were incorrectly marked as automatic patrols but not in log_params. There is a task somewhere about cleaning up that data, I believe.

I wrote a quick bot task to generate top page reviewers via any means of patrolling, using the above query: https://en.wikipedia.org/wiki/User:MusikBot/TopPageReviewers/Report. This seems to satisfy those looking for such data, but it'd be nice if we had dedicated PageTriage data through our API. If we query just for PageTriage data, which I think was the point of this task (sorry, I misread), the run time is only ~1.5 seconds:

-- Top reviewers for the past month: pagetriage-curation reviews in the main namespace
SELECT log_user_text AS `reviewer`,
       COUNT(log_id) AS `reviews`
FROM logging
WHERE log_timestamp BETWEEN '20170519000000' AND '20170619000000'
AND log_namespace = 0
AND log_type = 'pagetriage-curation'
AND log_action = 'reviewed'
GROUP BY reviewer
ORDER BY reviews DESC;

So maybe we should use logging?

@MusikAnimal: If querying only for pagetriage-curation reviews in the logging table is relatively fast, let's do that. 1.5 seconds (or 1.7 in my test) still feels slow for an API response. I tried querying a week instead of a month and it only took 0.6 seconds. How about we restrict the API to querying only the past day or week (as it was originally), and let your bot handle the longer time frames (albeit for both patrolling and reviewing)?

How about we restrict the API to querying only the past day or week (as it was originally), and let your bot handle the longer time frames (albeit for both patrolling and reviewing)?

Sounds good to me :)

kaldari set the point value for this task to 3.
DannyH lowered the priority of this task from High to Medium. Jul 3 2017, 11:46 PM
kaldari lowered the priority of this task from Medium to Low. Aug 16 2017, 11:18 PM
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.

Change 379425 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/PageTriage@master] Fixing the topreviewers API in PageTriage

https://gerrit.wikimedia.org/r/379425

Change 379604 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/PageTriage@master] Don't delete stuff from the pagetriage_log

https://gerrit.wikimedia.org/r/379604

Change 379425 abandoned by Kaldari:
Fixing the topreviewers API in PageTriage

Reason:
In favor of Id589a6f50

https://gerrit.wikimedia.org/r/379425

Change 379604 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Don't delete stuff from the pagetriage_log unless it's a year old

https://gerrit.wikimedia.org/r/379604
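For reference, the merged change amounts to a purge condition along these lines (a sketch only: the real purge lives in the extension's PHP maintenance code, and ptrl_timestamp is assumed from the extension schema):

-- Sketch of the new purge behaviour: pagetriage_log rows are only removed
-- once they are more than a year old, so recent review stats stay complete
DELETE FROM pagetriage_log
WHERE ptrl_timestamp < DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%S');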