
PageTriageUtil::getTopTriagers() is not accurate
Closed, Resolved · Public · 3 Estimated Story Points

Description

PageTriageUtil::getTopTriagers() currently relies on data from the pagetriage_log table. However, this data is not reliable, since pagetriage_log only includes data for pages that are still in the PageTriage system (either because they are unreviewed or because they haven't been garbage-collected yet).

We need to either switch PageTriageUtil::getTopTriagers() to look for data in the main logging table, or make pagetriage_log into a permanent log that isn't synced with pagetriage_page. (See updatePageTriageQueue.php and PageTriage::deleteFromPageTriage() for the code that keeps them in sync.)

I'm not really sure what the purpose of the pagetriage_log table is, as it seems to largely duplicate information already in the logging table.
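For context, here is a minimal sketch of the kind of aggregation getTopTriagers() effectively runs against pagetriage_log today. The column names (ptrl_user_id, ptrl_reviewed, ptrl_timestamp) are assumed from the extension schema and are not verified here; the point is that rows for pages that have left the PageTriage queue are purged, so counts over longer windows come out too low.

-- Sketch only; column names assumed, not the actual extension code
SELECT ptrl_user_id AS `reviewer_id`,
       COUNT(*) AS `reviews`
FROM pagetriage_log
WHERE ptrl_reviewed = 1
AND ptrl_timestamp > '20170519000000'  -- e.g. the last 30 days
GROUP BY ptrl_user_id
ORDER BY reviews DESC;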

Event Timeline

kaldari triaged this task as High priority. Jun 19 2017, 7:40 PM
kaldari updated the task description.
kaldari added a subscriber: MusikAnimal.

If we decide to switch to using the logging table, we'll need to look for actions with log_type of pagetriage-curation and log_action of reviewed.

This is perhaps why the old code only supported stats for the past day and week. My fault that this was overlooked :-/

As far as I can tell, pagetriage_log is only used for these stats, so my guess is that it was created because querying logging took too long. However, the stats are cached, and things are probably faster than they were in 2012, so maybe we could now do without the replicated table. If not, let's just remove the last-month option, or change the purging so that data isn't removed until it is a month old, rather than removing the purging altogether.

I ran https://quarry.wmflabs.org/query/19672 on production and it took 25 seconds. That's probably too slow... but we could add a cron job to pre-compute and cache the stats, say every hour. Consumers of the API would only ever get the cached result, so they are never stuck waiting. How does that sound?

Since querying the logging table is slow, it seems like the easiest solution is to just convert pagetriage_log into a permanent log and remove the syncing/purging. Setting up a new purging arrangement (that is separate from the pagetriage_page purging) just sounds like unneeded complexity. Once the table is too big to query efficiently (in like 10 years) we can purge it again :)

That AND log_params NOT LIKE '%::auto";i:1%' is anything but efficient. You should be able to replace it with AND log_action = 'patrol' (i.e. excluding automatic patrols), after which you could at least use the type_action index.

If you need to query based on data in log_params, you should put it in the log_search table, which is indexed and easily queryable.
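For context, the general log_search pattern being suggested looks something like this (a sketch only; the ls_field/ls_value pair here is purely illustrative, and the auto-patrol flag may not actually be stored in log_search):

-- Illustrative sketch: querying log_params-derived data via the indexed
-- log_search table (ls_log_id / ls_field / ls_value) instead of a LIKE
-- over the serialized log_params blob
SELECT l.log_id,
       l.log_user_text,
       l.log_timestamp
FROM logging l
JOIN log_search s ON s.ls_log_id = l.log_id
WHERE s.ls_field = 'some_field'   -- hypothetical field name
AND s.ls_value = 'some_value';    -- hypothetical value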

That AND log_params NOT LIKE '%::auto";i:1%' is anything but efficient. You should be able to replace it with AND log_action = 'patrol' (i.e. excluding automatic patrols), after which you could at least use the type_action index.

Yes, you are correct, that isn't needed here! Run time using AND log_action = 'patrol' was 10 seconds and I got the same numbers :) The log_params clause was a relic from queries over older time periods, when, due to a bug, some log entries were incorrectly marked as automatic patrols but not in log_params. There is a task somewhere about cleaning up that data, I believe.

I wrote a quick bot task to generate top page reviewers via any means of patrolling, using the above query: https://en.wikipedia.org/wiki/User:MusikBot/TopPageReviewers/Report. This seems to satisfy those looking for such data, but it'd be nice if we had dedicated PageTriage data through our API. If we query just for PageTriage data, which I think was the point of this task (sorry, I misread), the run time is only ~1.5 seconds:

-- Top reviewers for the past month: pagetriage-curation reviews in the main namespace
SELECT log_user_text AS `reviewer`,
       COUNT(log_id) AS `reviews`
FROM logging
WHERE log_timestamp BETWEEN '20170519000000' AND '20170619000000'
AND log_namespace = 0
AND log_type = 'pagetriage-curation'
AND log_action = 'reviewed'
GROUP BY reviewer
ORDER BY reviews DESC;

So maybe we should use logging?

@MusikAnimal: If querying only for pagetriage-curation reviews in the logging table is relatively fast, let's do that. 1.5 seconds (or 1.7 in my test) still feels slow for an API response. I tried querying a week instead of a month and it only took 0.6 seconds. How about we restrict the API to querying only the past day or week (as it was originally), and let your bot handle the longer time frames (albeit for both patrolling and reviewing)?

How about we restrict the API to querying only the past day or week (as it was originally), and let your bot handle the longer time frames (albeit for both patrolling and reviewing)?

Sounds good to me :)

kaldari set the point value for this task to 3.
DannyH lowered the priority of this task from High to Medium. Jul 3 2017, 11:46 PM
kaldari lowered the priority of this task from Medium to Low. Aug 16 2017, 11:18 PM
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.

Change 379425 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/PageTriage@master] Fixing the topreviewers API in PageTriage

https://gerrit.wikimedia.org/r/379425

Change 379604 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/PageTriage@master] Don't delete stuff from the pagetriage_log

https://gerrit.wikimedia.org/r/379604

Change 379425 abandoned by Kaldari:
Fixing the topreviewers API in PageTriage

Reason:
In favor of Id589a6f50

https://gerrit.wikimedia.org/r/379425

Change 379604 merged by jenkins-bot:
[mediawiki/extensions/PageTriage@master] Don't delete stuff from the pagetriage_log unless it's a year old

https://gerrit.wikimedia.org/r/379604
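For reference, the merged change amounts to a purge condition along these lines (a sketch only: the real purge lives in the extension's PHP maintenance code, and ptrl_timestamp is assumed from the extension schema):

-- Sketch of the new purge behaviour: pagetriage_log rows are only removed
-- once they are more than a year old, so recent review stats stay complete
DELETE FROM pagetriage_log
WHERE ptrl_timestamp < DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%S');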