
New Pages Feed: generate ORES scores (3.1)
Closed, ResolvedPublic

Description

The work in this task and in T195927 makes up the third useful feature change that we could roll out to users. The work in this task is part of accomplishing these user stories:

  • As a reviewer, I need to be able to filter by the four categories in the ORES draftquality model (vandalism, spam, attack, ok).
  • As a reviewer, I need to be able to filter by the six categories in the ORES wp10 model (Stub, Start, C-class, B-class, Good, Featured).

Specifically, the work is to generate scores, while the filtering of scores is tasked in T195927:

  • For all pages in the New Pages Feed, including the Article, Draft, and User namespaces, generate scores from both the draftquality and wp10 models described on the ORES page. It is unlikely that we will use the scores from the User namespace for anything.
  • The draftquality model returns four scores and the wp10 model returns six scores. I recommend that we store or otherwise have access to all ten of those scores, and separately apply logic to determine which classes to display for those two models, as described in T195927. This may give us needed flexibility to change our logic later on.
  • In terms of when to score the models, we have some flexibility. Ideally, we would be able to score pages upon their first appearance in the New Pages Feed, and rescore them on each successive edit as long as they are part of the feed. Rescoring the models is more important for NPP work than AfC work, as new articles do tend to be edited soon after they are first created, whereas new drafts submitted to AfC do not. We have two main options here, according to @Halfak and @Ladsgroup:
    • The Scoring team is currently in the process of adding the draftquality and wp10 models to the set of models that are already being rescored on every edit, storing the scores in the MediaWiki database for uses like the New Pages Feed (T190471). We need to decide whether this will work for our use case and on our timeline.
    • New Pages Feed could also use the ORES API to query for new model scores. As a side note, for fastest throughput, @Halfak recommends two parallel connections requesting scores on 50 pages at a time.
  • If we decide for technical reasons that rescoring models with every edit is not a good idea, these are some potential alternative business rules that the team can discuss:
    • Rescore models once a day (or other time period) on pages that have been changed in the previous day (or other time period).
    • Rescore models on a given page after a certain number of edits.
    • Rescore models on a given page after a certain number of bytes changed.

Two other notes:

  • User:SQL made a page that scores the two ORES models on all submitted drafts each day. Perhaps there are some things we can learn from that user's implementation: https://en.wikipedia.org/wiki/User:SQL/AFC-Ores.
  • It would be great if we could sanity-check our scores before integrating them into the software. As we're working on this development, it would be good to be able to export lists of scored pages so that humans can look them over and make sure the scores and cutoffs make sense.
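
As one way to do that spot-checking, here is a minimal sketch of an export query, assuming the scores end up in the ores_classification / ores_model tables that come up later in this task (the 30-day window and namespace filter are placeholders, not decided values):

-- Sketch only: dump recent draftquality/wp10 scores with page titles so a
-- human can eyeball them; adjust the time window and namespaces as needed.
SELECT page_namespace, page_title, oresm_name, oresc_class,
       oresc_probability, oresc_is_predicted
FROM page
JOIN revision ON rev_page = page_id
JOIN ores_classification ON oresc_rev = rev_id
JOIN ores_model ON oresm_id = oresc_model AND oresm_is_current = 1
WHERE oresm_name IN ('draftquality', 'wp10')
  AND page_namespace IN (0, 2, 118)   -- Article, User, Draft
  AND rev_timestamp >= DATE_FORMAT(NOW() - INTERVAL 30 DAY, '%Y%m%d%H%i%s')
ORDER BY page_namespace, page_title, rev_id;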

Note: the specifics listed above may be changed by ongoing community conversation around the design, which can be found here.

Event Timeline

MMiller_WMF renamed this task from New Pages Feed: add ORES models (3) to New Pages Feed: generate ORES scores (3.1).May 29 2018, 10:22 PM
MMiller_WMF updated the task description. (Show Details)
MMiller_WMF added a subscriber: Ladsgroup.
kaldari set the point value for this task to 8.May 30 2018, 12:16 AM

It looks like the tables ores_classification and ores_model are already in production, ...and on the replicas! I looked at Special:RecentChanges, and there are scores even for the most recent edits, as soon as they are made. I'm guessing it's safe to go off of this?

This would be awesome because then I can mock the ORES data in my local MediaWiki database, instead of attempting to install ORES (reportedly very nontrivial) in order to use the API.

That's great news. Maybe @Halfak can verify that these values are now consistently generated and safe to rely on.

In T195796#4247003, @MusikAnimal wrote:

It looks like the tables ores_classification and ores_model are already in production, ...and on the replicas! I looked at Special:RecentChanges, and there are scores even for the most recent edits, as soon as they are made. I'm guessing it's safe to go off of this?

This would be awesome because then I can mock the ORES data on my local MediaWiki database, instead of attempting to install ORES (reportedly very nontrivial) in order to use the API.

Spoke too soon :( I talked to Marshall and indeed it looks like only damaging and good-faith scores are being stored. So, I'll proceed as planned and talk to @Ladsgroup (pinging if you want to comment here) to see if we should count on the draftquality and wp10 scores being added in the near future. I hope it happens; it would be so useful for many applications!

@MusikAnimal: I'm guessing that the reason it isn't storing those currently is that they are much more expensive and slow to generate. For @Halfak and @Ladsgroup's benefit, what we want to do is generate the draftquality and wp10 scores for each revision of articles that are in the patrolling queues (New Page Patrol and Articles for Creation). Our initial thought was to generate these via the ORES API and store the scores as tags within the PageTriage tables. Would it be OK if we stored these within the existing ORES tables instead? Are those scores retained in the database indefinitely?

T190471 is the relevant task. Award your tokens there! :)

Would it be OK if we stored these within the existing ORES tables instead?

That is an option! But my hope is that this will happen outside PageTriage, as that would cut a lot of work on our end and would apply to wikis that don't use PageTriage.

@kaldari Would we still need to add PageTriage tags for the scores if the same data is already easily accessible in the MediaWiki database? We might be able to just do a JOIN or two and get it just as efficiently. I ask because if we don't need to duplicate the data on our end, the impact of this task will be equivalent to T190471 in terms of database storage (so whatever anxiety there is about storage applies to us too, if that is at all a concern).

Yeah, I would also love for these scores to get recorded completely independently of PageTriage. I don't know if that's going to happen any time soon, though. Regardless, I don't think this data should be duplicated across two different tables. So I guess the options, in order of preference, would be:

  • ORES automatically stores draftquality and wp10 scores in the ORES tables for the most recent revisions of articles and draft articles (at least for articles less than 90 days old).
  • PageTriage generates draftquality and wp10 scores for new articles and draft articles via the ORES API and stores them in the ORES tables.
  • PageTriage generates draftquality and wp10 scores for new articles and draft articles via the ORES API and stores them as tags in the PageTriage tables.

I concur! Note that according to T175757, the plan is to store only the most recent prediction, which I think will satisfy most use cases, PageTriage included.

@MusikAnimal and I discussed with @Ladsgroup this morning. My notes are below. @MusikAnimal and @Ladsgroup please fill in or correct things I've missed.

  • The two models that New Pages Feed needs are draftquality and wp10.
  • draftquality is already in the MediaWiki database with these business rules:
    • Scored on the first revision of all new pages.
    • Not scored on any subsequent revisions.
    • Deleted after 30 days.
    • Stored as four scores per page (one for each of the four classes: spam, vandalism, attack, OK).
  • wp10 is not yet in there. It is awaiting code review from his team (or our team if we want to get up to speed). It is also awaiting approval from the DBAs. It will have these business rules:
    • Scored on the most recent revision of all new pages.
    • Rescored with each revision.
    • Not deleted.
    • Stored as one score that spans the class scale.
  • The draftquality business rules (e.g. scoring on first revision, and deletion after 30 days) could be changed if we decide that a change is important for our feature.

A question I have for @Ladsgroup: under the current business rules, which articles currently have scores? Only the subset of pages created since the model was implemented?

The next step for CommTech is that @MusikAnimal is going to test out the draftquality scores that are already in the database.

FYI @Halfak

Stored as one score that spans the class scale.

Can you describe this in more detail? Does that mean that it just stores the most likely class or that it stores scores for all of the classes in a single JSON blob or serialized format?

A question I have for @Ladsgroup: under the current business rules, which articles currently have scores? Only the subset of pages created since the model was implemented?

I don't understand the question fully. It varies between different models and different namespaces. We have maintenance scripts to back-populate the tables, but we barely run them as the tables get populated really fast.

One potential problem is that even though these scores are ostensibly for pages, they are only associated with revisions in the ores_classification table. While there is a page_latest field in the page table, which will allow us to easily retrieve the wp10 score, there is no page_first field, so retrieving the draftquality score in a performant way is going to be hard. If https://gerrit.wikimedia.org/r/#/c/399897/ ever gets merged, it might provide a hacky work-around, but otherwise we may need to denormalize this data by either:

  • Storing the associated page in the ores_classification table
  • Storing the first revision of the page in the pagetriage_page table

Alternatively, the script that generates the draftquality scores could be changed to run on the most recent revisions, similar to wp10.
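
For illustration only (not a recommendation), the naive lookup without a page_first field would approximate the first revision with a per-page subquery, which is exactly the kind of thing that is hard to make performant at feed scale; the page ID below is made up:

-- Sketch: fetch the draftquality scores for a page's first revision,
-- approximated here as the smallest rev_id for a hypothetical page 12345.
SELECT oresc_class, oresc_probability, oresc_is_predicted
FROM ores_classification
JOIN ores_model ON oresm_id = oresc_model AND oresm_is_current = 1
WHERE oresm_name = 'draftquality'
  AND oresc_rev = (SELECT MIN(rev_id) FROM revision WHERE rev_page = 12345);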

Storing the first revision of the page in the pagetriage_page table makes more sense to me, especially since it can be used for other use cases outside of the ORES application.

@Ladsgroup -- to clarify my question, I was asking about draftquality. But in thinking this through, I got the answer to my own question. Since those scores are deleted after 30 days, all the pages that have been created in the last 30 days have scores in the database. No earlier than that.

@kaldari -- my understanding of the plan for wp10 is that they are reducing the separate scores for each class (Start, Stub, etc.) to all be boiled down into one master score, where the top end of the range is like "Featured" and the bottom is like "Stub". And then we would decide on the ranges of scores that we map to the discrete class values.

@Ladsgroup @Halfak -- is there any documentation or information about the particulars of that change for the wp10 score?

@Ladsgroup: What do you think about storing the draftquality score for the most recent revision rather than the first revision? This would help on a number of fronts.

@Ladsgroup @Halfak -- is there any documentation or information about the particulars of that change for the wp10 score?

Not to my knowledge. The only thing is that the class scores are turned into a weighted sum. I can explain in more depth if you think it's needed.

@Ladsgroup: What do you think about storing the draftquality score for the most recent revision rather than the first revision? This would help on a number of fronts.

I thought about it, and it's not hard to treat it like the wp10 model, but we would end up with lots of useless scores in the database for articles that were created ages ago but were recently edited (like the "Barack Obama" article), which doesn't make sense. I can't find any inexpensive way to prevent that right now, but ideas are more than welcome. FWIW, some analysis showed the scores are not very different between the first and the latest versions.

The WMF Growth Team (lead engineer: @Catrope) will be working on this task, rather than Community Tech as previously planned. We had a meeting today to transfer knowledge from one team to the other, and reviewed the open questions we have for using ORES models. I'm writing them down below to record them, and @Catrope will pursue them and ask Scoring Team members for input as needed:

  • We are not currently 100% sure about the status of the draftquality and wp10 scores in the MediaWiki databases. The relevant Phabricator tasks may not have all the detailed information.
  • We still need to resolve two open questions around the draftquality model:
    • On which revision(s) should it be scored: earliest, latest, or all?
    • How long should scores be retained? The original intention was to delete them after 30 days, but that is flexible.
  • We need to learn how to set cutoffs for the new manifestation of the wp10 model, which is one score that goes from 0 to 1 instead of 6 orthogonal scores. Aaron Halfaker offered to write a function for us to "re-normalize" the score into six parts.
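
As a stopgap sketch of what such a re-normalization could look like (assuming six equal-width cutoffs, as proposed later in this task; the real function may differ):

-- Sketch only: map the aggregated wp10 score to a class id 0-5
-- (0 = Stub ... 5 = FA), assuming [0, 1[ is split into six equal parts.
SELECT oresc_rev,
       LEAST(FLOOR(oresc_probability * 6), 5) AS predicted_class
FROM ores_classification
JOIN ores_model ON oresm_id = oresc_model AND oresm_is_current = 1
WHERE oresm_name = 'wp10';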

And some additional notes and thoughts about using ORES:

  • There is a concern that ArticleCompile will run before scores arrive. This may not be an issue because of the way tables are set up. But if there is an issue, two ideas on that front:
    • Add a hook to ORES so that when the scores arrive, we can rerun ArticleCompile.
    • If the models are fast enough, we can just hit the ORES API directly from ArticleCompile (possibly with the DB as a cache in between).
  • There may be a race condition if a score for a later revision arrives before the score for an earlier revision.
  • A clarified requirement: though we want pages to always have their latest scores, the scores do not need to update in the front-end. It will be sufficient that they update on page refresh.
Vvjjkkii renamed this task from New Pages Feed: generate ORES scores (3.1) to j4baaaaaaa.Jul 1 2018, 1:07 AM
Vvjjkkii removed MusikAnimal as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from j4baaaaaaa to New Pages Feed: generate ORES scores (3.1).Jul 2 2018, 3:30 PM
CommunityTechBot assigned this task to MusikAnimal.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot set the point value for this task to 8.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

We still need to resolve two open questions around the draftquality model:
On which revision(s) should it be scored: earliest, latest, or all?

Probably latest.

How long should scores be retained? The original intention was to delete them after 30 days, but that is flexible.

We would need the draftquality scores to be retained for at least 90 days (assuming we are only retaining one score per article).

@Catrope is going to post the plan for investigations here in the task (FYI @kostajh @SBisson).

Here's the result of my investigation into what the ORES extension already does today:

  • wp10 scores are already stored in the database in production
    • only on English Wikipedia
    • only for the main namespace
    • only for the latest revision of each page (when a new score comes in, the score for the previous revision is deleted)
    • stored as an "aggregated" score (only one row, with oresc_class=0 and oresc_is_predicted always 0), which we'll have to ask the Scoring team how to interpret
  • draftquality scores are also already stored in the database in production
    • only on English Wikipedia
    • only for the main namespace
    • only for new page creations; scores for subsequent edits are not computed
    • scores for OK (oresc_class=1), spam (2), and vandalism (3) are stored, but the one for attack (0) is not; we can compute it, though, because the four scores always add up to 1
    • I can find no evidence that old scores are deleted after 30 days, 90 days, or any other amount of time; there are lots of scores from January in the DB

Example queries for getting scores for a given revision:

mysql:research@s3-analytics-slave [enwiki]> select oresc_class, oresc_probability, oresc_is_predicted from ores_classification join ores_model on oresm_id=oresc_model where oresc_rev=848604103 and oresm_name='draftquality' and oresm_is_current=1;
+-------------+-------------------+--------------------+
| oresc_class | oresc_probability | oresc_is_predicted |
+-------------+-------------------+--------------------+
|           1 |             0.496 |                  1 |
|           2 |             0.483 |                  0 |
|           3 |             0.018 |                  0 |
+-------------+-------------------+--------------------+
3 rows in set (0.00 sec)

mysql:research@s3-analytics-slave [enwiki]> select oresc_class, oresc_probability, oresc_is_predicted from ores_classification join ores_model on oresm_id=oresc_model where oresc_rev=848608039 and oresm_name='wp10' and oresm_is_current=1;
+-------------+-------------------+--------------------+
| oresc_class | oresc_probability | oresc_is_predicted |
+-------------+-------------------+--------------------+
|           0 |             0.534 |                  0 |
+-------------+-------------------+--------------------+
1 row in set (0.00 sec)
  • wp10 scores are already stored in the database in production
    • stored as an "aggregated" score (only one row, with oresc_class=0 and oresc_is_predicted always 0), which we'll have to ask the Scoring team how to interpret

@Halfak @Ladsgroup @awight: Could you help us understand how to extract the predictions from the aggregated score?

I've looked at https://gist.github.com/halfak/b925a2d45a3903a3e10dc5d6cd7c01b1 and the implementation in the ORES extension, and I noted a few differences, most notably that "Stub" has class 0 in the extension, so its score is not included in the weighted sum. I don't know if it matters.

This is how to interpret, and filter by, aggregated scores:

class id | class name | range
0        | Stub       | [0, 1/6[
1        | Start      | [1/6, 2/6[
2        | C          | [2/6, 3/6[
3        | B          | [3/6, 4/6[
4        | GA         | [4/6, 5/6[
5        | FA         | [5/6, 1[

I did not reduce the fractions to make it more obvious that the range [0, 1[ is divided into 6 equal parts (number of classes) and each part represents a prediction.

Querying for revisions that are Start or B can be done like this:

WHERE ( oresc_probability >= 1/6 AND oresc_probability < 2/6 )
   OR ( oresc_probability >= 3/6 AND oresc_probability < 4/6 )

That can be simplified for continuous ranges and the first and last classes but that's the basic idea.
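
For example, under the same assumption, a continuous range such as Start through B collapses to a single pair of bounds, and the first and last classes each need only one bound (a sketch of the simplification, not tested against production data):

WHERE oresc_probability >= 1/6 AND oresc_probability < 4/6   -- Start, C, or B
-- WHERE oresc_probability < 1/6                              -- Stub only
-- WHERE oresc_probability >= 5/6                             -- FA only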

I would recommend encapsulating the creation of the where clause in the ORES extension. It already contains pretty much the same thing for the damaging and goodfaith rc filters.

@SBisson -- did that set of cutoffs come from the Scoring team? Or how did you produce it?

@SBisson -- did that set of cutoffs come from the Scoring team? Or how did you produce it?

I produced it based on my conversation with @Ladsgroup

Reassigning this to @SBisson, who is working on it now.

Filtering on draftquality may not be as straightforward as we'd like, especially for attacks (class 0).

The ORES extension stores 3 rows for every revision scored with draftquality. It explicitly ignores class 0. Its value can be deduced from the other rows by subtracting the sum of their probabilities from 1 (1 - (p(OK) + p(SPAM) + p(VANDALISM))). If none of the other classes has oresc_is_predicted === 1, then class 0 logically has it.

Here is an example of a revision that is probably an attack.

mysql:research@s3-analytics-slave [enwiki]> select * from ores_classification where oresc_model=33 and oresc_rev=845715298;
+-----------+-----------+-------------+-------------+-------------------+--------------------+
| oresc_id  | oresc_rev | oresc_model | oresc_class | oresc_probability | oresc_is_predicted |
+-----------+-----------+-------------+-------------+-------------------+--------------------+
| 232150135 | 845715298 |          33 |           1 |             0.060 |                  0 |
| 232150136 | 845715298 |          33 |           2 |             0.258 |                  0 |
| 232150137 | 845715298 |          33 |           3 |             0.312 |                  0 |
+-----------+-----------+-------------+-------------+-------------------+--------------------+
3 rows in set (0.00 sec)
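
As a hedged illustration with that same revision, the missing class-0 probability can be derived directly in SQL: 1 - (0.060 + 0.258 + 0.312) = 0.370, which is higher than any of the stored probabilities, so attack is the implied prediction.

-- Sketch: derive the unstored attack probability for one revision,
-- using the fact that the four class probabilities sum to 1.
SELECT oresc_rev, 1 - SUM(oresc_probability) AS attack_probability
FROM ores_classification
WHERE oresc_model = 33 AND oresc_rev = 845715298
GROUP BY oresc_rev;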

Revisions that are attacks can be found with a query like this:

mysql:research@s3-analytics-slave [enwiki]> select oresc_rev, count(oresc_is_predicted) as c, sum(oresc_is_predicted) as s from ores_classification where oresc_model=33 group by oresc_rev having c=3 and s=0 limit 4;
+-----------+---+------+
| oresc_rev | c | s    |
+-----------+---+------+
| 819044721 | 3 |    0 |
| 819086300 | 3 |    0 |
| 819201556 | 3 |    0 |
| 819241003 | 3 |    0 |
+-----------+---+------+
4 rows in set (3.58 sec)

Storing class 0 for draftquality would take up more space but make filtering much easier.

That sum query is probably not gonna be great performance-wise. You could also do this with something like LEFT JOIN ores_classification ON oresc_model=33 AND oresc_rev=rev_id AND oresc_is_predicted=1 WHERE oresc_probability IS NULL, it's possible that that would be more performant, but it would also flag all unscored revisions as attacks.
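
Assembled into a full statement, that alternative might look roughly like the sketch below (untested; as noted above, it would also match revisions that have no scores at all):

-- Sketch of the anti-join idea: revisions with no predicted draftquality class.
SELECT rev_id
FROM revision
LEFT JOIN ores_classification
       ON oresc_rev = rev_id
      AND oresc_model = 33
      AND oresc_is_predicted = 1
WHERE oresc_probability IS NULL;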

I agree that we should ask the ORES team to store class 0 for draftquality.

SBisson removed the point value for this task.Jul 13 2018, 2:39 PM

Now that all other ORES tickets are Done, this old parent task is also Done. Anything outstanding is ticketed separately.