Page MenuHomePhabricator

New Pages Feed: run ORES backfill script in English Wikipedia
Closed, ResolvedPublic

Description

Before rolling out ORES scores in English Wikipedia, we'll need to run the backfill script that was created in T198982, so that all pages in the New Pages Feed will have scores.

Event Timeline

JTannerWMF moved this task from Inbox to Q1 2018-19 on the Growth-Team board.Sep 6 2018, 9:05 AM
SBisson claimed this task.Sep 25 2018, 7:58 PM
SBisson edited projects, added Growth-Team (Current Sprint); removed Growth-Team.
SBisson moved this task from Incoming to In Progress on the Growth-Team (Current Sprint) board.

@SBisson -- after the backfilling is run in English Wikipedia (but before the feature is exposed), is it easy to produce some counts from the database on how many pages are flagged with which ORES categories from the two models? It would be good to break this down by:

  • NPP/AfC
  • Reviewed/Unreviewed status (for the NPP side)
  • Whether something is a redirect or not (for the NPP side)
  • AfC state (for the AfC side)

@SBisson -- after the backfilling is run in English Wikipedia (but before the feature is exposed), is it easy to produce some counts from the database on how many pages are flagged with which ORES categories from the two models? It would be good to break this down by:

  • NPP/AfC
  • Reviewed/Unreviewed status (for the NPP side)
  • Whether something is a redirect or not (for the NPP side)
  • AfC state (for the AfC side)

Sure, I'll build those queries when the data is in.

Mentioned in SAL (#wikimedia-operations) [2018-09-27T18:17:13Z] <stephanebisson> Starting mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286)

Just dumping some queries here as a reference. Data from testwiki

Pages per draftquality class and namespace
mysql:research@s3-analytics-slave [testwiki]> select page_namespace, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=4 and oresc_is_predicted=1 and page_namespace in (0, 2, 118) and page_is_redirect=0 group by page_namespace, oresc_class;
+----------------+-------------+--------------------+
| page_namespace | oresc_class | count(page_latest) |
+----------------+-------------+--------------------+
|              0 |           0 |                  3 |
|              0 |           1 |               1719 |
|              0 |           2 |                147 |
|              0 |           3 |                387 |
|              2 |           1 |                 14 |
|              2 |           2 |                  8 |
|              2 |           3 |                 12 |
|            118 |           1 |                 29 |
|            118 |           2 |                  8 |
|            118 |           3 |                  1 |
+----------------+-------------+--------------------+
10 rows in set (0.12 sec)
Pages per articlequality class and namespace
mysql:research@s3-analytics-slave [testwiki]> select page_namespace, CASE WHEN oresc_probability <0.1666 THEN 'Stub' WHEN oresc_probability < 0.333 THEN 'Start' WHEN oresc_probability<0.5 THEN 'C' WHEN oresc_probability<0.666 THEN 'B' WHEN oresc_probability<0.8333 THEN 'GA' ELSE 'FA' END as articlequality_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=6 and page_namespace in (0, 2, 118) and page_is_redirect=0 group by page_namespace, CASE WHEN oresc_probability <0.1666 THEN 'Stub' WHEN oresc_probability < 0.333 THEN 'Start' WHEN oresc_probability<0.5 THEN 'C' WHEN oresc_probability<0.666 THEN 'B' WHEN oresc_probability<0.8333 THEN 'GA' ELSE 'FA' END;
+----------------+----------------------+--------------------+
| page_namespace | articlequality_class | count(page_latest) |
+----------------+----------------------+--------------------+
|              0 | B                    |                  7 |
|              0 | C                    |                 21 |
|              0 | GA                   |                  5 |
|              0 | Start                |                136 |
|              0 | Stub                 |               2087 |
|              2 | Start                |                  1 |
|              2 | Stub                 |                 33 |
|            118 | B                    |                  3 |
|            118 | GA                   |                  1 |
|            118 | Start                |                  4 |
|            118 | Stub                 |                 30 |
+----------------+----------------------+--------------------+
11 rows in set (0.03 sec)
name=NPP pages per review state, draftquality class and namespace
mysql:research@s3-analytics-slave [testwiki]> select page_namespace, ptrp_reviewed, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=4 and oresc_is_predicted=1 and page_namespace in (0, 2) and page_is_redirect=0 group by page_namespace, ptrp_reviewed, oresc_class;
+----------------+---------------+-------------+--------------------+
| page_namespace | ptrp_reviewed | oresc_class | count(page_latest) |
+----------------+---------------+-------------+--------------------+
|              0 |             0 |           0 |                  3 |
|              0 |             0 |           1 |               1718 |
|              0 |             0 |           2 |                146 |
|              0 |             0 |           3 |                383 |
|              0 |             3 |           1 |                  1 |
|              0 |             3 |           2 |                  1 |
|              0 |             3 |           3 |                  4 |
|              2 |             0 |           1 |                 14 |
|              2 |             0 |           2 |                  4 |
|              2 |             0 |           3 |                  7 |
|              2 |             2 |           3 |                  1 |
|              2 |             3 |           2 |                  4 |
|              2 |             3 |           3 |                  4 |
+----------------+---------------+-------------+--------------------+
13 rows in set (0.05 sec)
name=NPP pages per namespace, draftquality class, and whether it's a redirect or not
mysql:research@s3-analytics-slave [testwiki]> select page_namespace, page_is_redirect, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=4 and oresc_is_predicted=1 and page_namespace in (0, 2) group by page_namespace, page_is_redirect, oresc_class;
+----------------+------------------+-------------+--------------------+
| page_namespace | page_is_redirect | oresc_class | count(page_latest) |
+----------------+------------------+-------------+--------------------+
|              0 |                0 |           0 |                  3 |
|              0 |                0 |           1 |               1719 |
|              0 |                0 |           2 |                147 |
|              0 |                0 |           3 |                387 |
|              2 |                0 |           1 |                 14 |
|              2 |                0 |           2 |                  8 |
|              2 |                0 |           3 |                 12 |
+----------------+------------------+-------------+--------------------+
7 rows in set (0.03 sec)
AfC pages per draftquality class and afc state
mysql:research@s3-analytics-slave [testwiki]> select ptrpt_value as afc_state, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id inner join pagetriage_page_tags on ptrp_page_id=ptrpt_page_id and ptrpt_tag_id=18 where oresc_model=4 and oresc_is_predicted=1 and page_namespace=118 group by ptrpt_value, oresc_class;
+-----------+-------------+--------------------+
| afc_state | oresc_class | count(page_latest) |
+-----------+-------------+--------------------+
| 1         |           1 |                  5 |
| 1         |           2 |                  2 |
| 1         |           3 |                  1 |
| 2         |           1 |                 13 |
| 2         |           2 |                  2 |
| 3         |           1 |                  1 |
| 3         |           2 |                  1 |
| 4         |           1 |                 10 |
| 4         |           2 |                  3 |
+-----------+-------------+--------------------+
9 rows in set (0.01 sec)

Mentioned in SAL (#wikimedia-operations) [2018-09-27T23:07:59Z] <stephanebisson> Finished mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286)

I started the backfill script on 09/27 and it was still running when I stopped working that day. The next morning, it had finished but I ran it again and it processed a few hundred revisions. About 10h later, there's a lot more revisions that are missing scores (I check using the --dry-run option of the script.)

It is possible that new revisions are not all scored correctly. I don't see that happening locally. Need to find logs about that.

It is possible (likely?) that a bunch of scores are being deleted as their associated recent changes entries fall off the 30-day RC window. There's a model config to prevent that (keepForever) but it's not used in production AFAICT. @Ladsgroup would we get away with setting it to true for articlequality and draftquality on enwiki? What if I promise to find a solution to cleanup those rows when the associated pages are removed from PageTriage?

Any other idea about the possible cause for the ever increasing missing rows?

Regarding articlequality the option of keepForever is actually designed just for that model (as we are removing the parent revision so it won't take much rows) so it should be just enabled and hopefully it should work (I tested it and it worked) but regarding draftquality, I'm for it if you think we need to enable it. Do you have any proposal for removing them once they removed from the PageTriage queue?

Regarding articlequality the option of keepForever is actually designed just for that model (as we are removing the parent revision so it won't take much rows) so it should be just enabled and hopefully it should work (I tested it and it worked) but regarding draftquality,

I'm for it if you think we need to enable it.

I think we do. The PageTriage queue is much smaller and much slower-paced than the RC queue. Most pages haven't been updated in the last 30 days but we still need to show ORES predictions in PageTriage to help reviewers get through their backlog.

Do you have any proposal for removing them once they removed from the PageTriage queue?

Architecturally, I was hoping for something similar to the CheckModels hook where the ORES extension plans to delete scores unless a hook handler says: "please keep those" but I don't think that would work here. The deletion is currently based on revisions going through RecentChangesPurgeRows so PageTriage would be able to prevent rows from being deleted when they fall off RC but not to force them to be deleted with they fall off PT, which is the main problem we are trying to solve here. If the deletion was based on "everything older than X" then maybe this kind of approach could work.

One option is for PageTriage to notify the ORES extension when pages are reviewed or removed from the queue so that their revisions can be cleaned up. We would make that interface convenient for ORES and completely independent from PT.

Yet another option is to configure draftquality just like articlequality with keepForever and cleanParent. That would be ideal for PageTriage but I don't know about other consumers of this model. (or even cleanParentButFirst so we only store the first and last revisions?)

Change 463945 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[operations/mediawiki-config@master] Don't purge articlequality, draftquality scores

https://gerrit.wikimedia.org/r/463945

Change 463945 merged by jenkins-bot:
[operations/mediawiki-config@master] Don't purge articlequality, draftquality scores

https://gerrit.wikimedia.org/r/463945

Mentioned in SAL (#wikimedia-operations) [2018-10-03T11:54:08Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463945|Don't purge articlequality, draftquality scores (T203286)]] (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-10-03T12:12:41Z] <stephanebisson> Starting mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286)

Mentioned in SAL (#wikimedia-operations) [2018-10-03T12:26:29Z] <stephanebisson> Finished mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286)

I checked in enwiki (wmf.23 with ?ores=true). @SBisson - thank you for the queries, they are really helpful.

A question about "the intervals" for articlequality, e.g. oresc_probability <0.1666. Are they calculated for each wiki and and then re-calculated if max(oresc_probability) (or min) is changed? Or all is good as it is and the numbers close enough across wikis and do not change significantly?

I compared couple of drafts for damaging/goodfaith scoring vs draftquality/articlequality - I did not see any discrepancies, although it'd be interesting to check if there are articles that are scored as 'vandalism' but scored high on 'goodfaith' and very low on 'damaging' etc.

@MMiller_WMF

Two tiny things to be aware of, but very minor:
(1) The numbers between db and the counter on AfC page for different filers do not match but they are very close and, given the volume of data, there should be no impact on users.

(2) When AfC filters are changed ('Set filters' clicked) there is a noticeable (not a dramatic one though) delay before the counter for number of pages would change.

A question about "the intervals" for articlequality, e.g. oresc_probability <0.1666. Are they calculated for each wiki and and then re-calculated if max(oresc_probability) (or min) is changed? Or all is good as it is and the numbers close enough across wikis and do not change significantly?

Those are not calibrated thresholds. The probability of all 6 classes are combined into a scale from 0 to 1 using (p0+2*p1+3*p2+...)/6

It is possible to tweak them in configuration but any issue with a model should probably be looked at in the model itself by the scoring team before we start tweaking our interpretation of the scores.

Two tiny things to be aware of, but very minor:
(1) The numbers between db and the counter on AfC page for different filers do not match but they are very close and, given the volume of data, there should be no impact on users.

There is lag in a few places and enwiki is pretty active so it's expected that they are a bit different. However they should match on testwiki.

Thanks, @SBisson - all makes sense.

Some data from enwiki

Pages per draftquality class and namespace
mysql:research@s3-analytics-slave [enwiki]> select CASE WHEN page_namespace=0 THEN '0-Article' WHEN page_namespace=2 THEN '2-User' WHEN page_namespace=118 THEN '118-Draft' END as namespace, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=33 and oresc_is_predicted=1 and page_namespace in (0, 2, 118) and page_is_redirect=0 group by namespace, oresc_class;
+-----------+-------------+--------------------+
| namespace | oresc_class | count(page_latest) |
+-----------+-------------+--------------------+
| 0-Article |           1 |              19132 |
| 0-Article |           2 |                 24 |
| 0-Article |           3 |                  4 |
| 118-Draft |           0 |                 35 |
| 118-Draft |           1 |              33943 |
| 118-Draft |           2 |               7186 |
| 118-Draft |           3 |               1217 |
| 2-User    |           0 |                 74 |
| 2-User    |           1 |              19104 |
| 2-User    |           2 |               4633 |
| 2-User    |           3 |               5841 |
+-----------+-------------+--------------------+
11 rows in set (2 min 22.75 sec)
Pages per articlequality class and namespace
mysql:research@s3-analytics-slave [enwiki]> select CASE WHEN page_namespace=0 THEN '0-Article' WHEN page_namespace=2 THEN '2-User' WHEN page_namespace=118 THEN '118-Draft' END as namespace, CASE WHEN oresc_probability <0.1666 THEN '0-Stub' WHEN oresc_probability < 0.333 THEN '1-Start' WHEN oresc_probability<0.5 THEN '2-C' WHEN oresc_probability<0.666 THEN '3-B' WHEN oresc_probability<0.8333 THEN '4-GA' ELSE '5-FA' END as articlequality_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=35 and page_namespace in (0, 2, 118) and page_is_redirect=0 group by namespace, articlequality_class order by namespace, articlequality_class;
+-----------+----------------------+--------------------+
| namespace | articlequality_class | count(page_latest) |
+-----------+----------------------+--------------------+
| 0-Article | 0-Stub               |              12932 |
| 0-Article | 1-Start              |               4390 |
| 0-Article | 2-C                  |               1513 |
| 0-Article | 3-B                  |                322 |
| 0-Article | 4-GA                 |                 46 |
| 118-Draft | 0-Stub               |              26274 |
| 118-Draft | 1-Start              |              12910 |
| 118-Draft | 2-C                  |               2966 |
| 118-Draft | 3-B                  |                220 |
| 118-Draft | 4-GA                 |                 14 |
| 2-User    | 0-Stub               |              26931 |
| 2-User    | 1-Start              |               2216 |
| 2-User    | 2-C                  |                418 |
| 2-User    | 3-B                  |                 75 |
| 2-User    | 4-GA                 |                 16 |
+-----------+----------------------+--------------------+
15 rows in set (3.70 sec)
name=NPP pages per review state, draftquality class and namespace
mysql:research@s3-analytics-slave [enwiki]> select CASE WHEN page_namespace=0 THEN '0-Article' WHEN page_namespace=2 THEN '2-User' WHEN page_namespace=118 THEN '118-Draft' END as namespace, ptrp_reviewed, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=33 and oresc_is_predicted=1 and page_namespace in (0, 2) and page_is_redirect=0 group by namespace, ptrp_reviewed, oresc_class;
+-----------+---------------+-------------+--------------------+
| namespace | ptrp_reviewed | oresc_class | count(page_latest) |
+-----------+---------------+-------------+--------------------+
| 0-Article |             0 |           1 |               2809 |
| 0-Article |             0 |           2 |                 15 |
| 0-Article |             0 |           3 |                  3 |
| 0-Article |             1 |           1 |               4156 |
| 0-Article |             1 |           2 |                  3 |
| 0-Article |             2 |           1 |               1674 |
| 0-Article |             2 |           2 |                  3 |
| 0-Article |             2 |           3 |                  1 |
| 0-Article |             3 |           1 |              10500 |
| 0-Article |             3 |           2 |                  1 |
| 2-User    |             0 |           0 |                 74 |
| 2-User    |             0 |           1 |              16073 |
| 2-User    |             0 |           2 |               4455 |
| 2-User    |             0 |           3 |               5661 |
| 2-User    |             1 |           1 |                106 |
| 2-User    |             1 |           2 |                 33 |
| 2-User    |             1 |           3 |                 43 |
| 2-User    |             2 |           1 |                780 |
| 2-User    |             2 |           2 |                 61 |
| 2-User    |             2 |           3 |                 60 |
| 2-User    |             3 |           1 |               2147 |
| 2-User    |             3 |           2 |                 85 |
| 2-User    |             3 |           3 |                 77 |
+-----------+---------------+-------------+--------------------+
23 rows in set (2.16 sec)
name=NPP pages per namespace, draftquality class, and whether it's a redirect or not
mysql:research@s3-analytics-slave [enwiki]> select CASE WHEN page_namespace=0 THEN '0-Article' WHEN page_namespace=2 THEN '2-User' WHEN page_namespace=118 THEN '118-Draft' END as namespace, page_is_redirect, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id where oresc_model=33 and oresc_is_predicted=1 and page_namespace in (0, 2) group by namespace, page_is_redirect, oresc_class;
+-----------+------------------+-------------+--------------------+
| namespace | page_is_redirect | oresc_class | count(page_latest) |
+-----------+------------------+-------------+--------------------+
| 0-Article |                0 |           1 |              19143 |
| 0-Article |                0 |           2 |                 22 |
| 0-Article |                0 |           3 |                  4 |
| 0-Article |                1 |           1 |              28492 |
| 2-User    |                0 |           0 |                 74 |
| 2-User    |                0 |           1 |              19105 |
| 2-User    |                0 |           2 |               4633 |
| 2-User    |                0 |           3 |               5841 |
| 2-User    |                1 |           1 |                 48 |
| 2-User    |                1 |           2 |                  2 |
+-----------+------------------+-------------+--------------------+
10 rows in set (3.56 sec)
AfC pages per draftquality class and afc state
mysql:research@s3-analytics-slave [enwiki]> select CASE WHEN ptrpt_value=1 THEN '1-Unsubmitted' WHEN ptrpt_value=2 THEN '2-Pending' WHEN ptrpt_value=3 THEN '3-UnderReview' ELSE '4-Declined' END as afc_state, oresc_class, count(page_latest) from ores_classification inner join page on oresc_rev=page_latest inner join pagetriage_page on ptrp_page_id=page_id inner join pagetriage_page_tags on ptrp_page_id=ptrpt_page_id and ptrpt_tag_id=20 where oresc_model=33 and oresc_is_predicted=1 and page_namespace=118 group by ptrpt_value, oresc_class;
+---------------+-------------+--------------------+
| afc_state     | oresc_class | count(page_latest) |
+---------------+-------------+--------------------+
| 1-Unsubmitted |           0 |                 30 |
| 1-Unsubmitted |           1 |              15507 |
| 1-Unsubmitted |           2 |               4022 |
| 1-Unsubmitted |           3 |               1194 |
| 2-Pending     |           1 |               3519 |
| 2-Pending     |           2 |                386 |
| 3-UnderReview |           1 |                  3 |
| 4-Declined    |           0 |                  5 |
| 4-Declined    |           1 |              14984 |
| 4-Declined    |           2 |               2780 |
| 4-Declined    |           3 |                 23 |
+---------------+-------------+--------------------+
11 rows in set (3.77 sec)
MMiller_WMF closed this task as Resolved.Oct 5 2018, 5:55 PM

I just checked this out in English Wikipedia and went over the counts. Everything seems to be in working order! Thank you.