
AfC - Review (AFCH) added categories change ORES scoring for the same draft
Closed, InvalidPublic

Description

See specific example in the comments:

  1. Create an article in the Draft namespace and check which AfC filters apply to it. The draft will be found by the following filters:

State: Unsubmitted
Predicted class: Start
Predicted issues: spam

  2. Without modifying the article, submit it for review (a new category will be added). Now the filters that find the article are:

State: Awaiting review
Predicted class: Start
Predicted issues: N/A

  3. Without modifying the article, mark it as under review:

State: Under review
Predicted class: C-class
Predicted issues: N/A

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 27 2018, 10:53 PM

A specific example for the steps described in the ticket: check the revisions of the following page; there are three revisions scored differently in the ores_classification table.

select * from revision where rev_page=192313\G
MMiller_WMF added a subscriber: MMiller_WMF.

@Aklapper -- this is not a direct work item for the Scoring team, but rather is about the Growth team's usage of ORES, so I don't think that tag applies.

MMiller_WMF added a subscriber: SBisson. · Edited · Jul 30 2018, 8:56 PM

@SBisson -- @Etonkovidova told me a little bit about this, but it would be great if you could add your perspective on why this is happening. Is it possible from your perspective to get any counts or estimates on how often it occurs? If this is an edge case, then we might not have to worry about it.

ORES scores are tied to revisions, not pages. It is expected that when a draft is being reviewed, a new revision is created to add the template, and the score appears as N/A until scoring for this new revision is done and saved.
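The N/A behavior described above can be sketched from a consumer's point of view. The sketch below shows how a client like PageTriage might read per-revision wp10 predictions out of an ORES v3 response (the endpoint shape follows the public `https://ores.wikimedia.org/v3/scores/...` API, but the sample payload and revision IDs are invented for illustration):

```python
def wp10_prediction(ores_response, wiki, rev_id):
    """Return the predicted class for a revision, or None (rendered as N/A)
    when the score for that revision has not been computed and saved yet."""
    scores = ores_response.get(wiki, {}).get("scores", {})
    model = scores.get(str(rev_id), {}).get("wp10", {})
    if "score" not in model:  # still pending or errored -> N/A in the UI
        return None
    return model["score"]["prediction"]


# Two revisions of the same draft: the older one is scored, the newer one
# (created by adding the AfC template) is not scored yet.
sample = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Start"}}},
            "1002": {"wp10": {}},  # score not saved yet
        }
    }
}

print(wp10_prediction(sample, "enwiki", 1001))  # Start
print(wp10_prediction(sample, "enwiki", 1002))  # None -> shown as N/A
```

Because scores are keyed by revision, the newer revision legitimately shows N/A until its own score lands, exactly as described above.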

'draftquality' will stay as N/A until T199357: New Pages Feed: score draftquality on most recent revision is done and deployed because the ORES extension currently only scores the first revision of a page.

I would not expect the 'wp10' score to change much because a template and a category were added but

  1. It's a discussion to have with the scoring team; ORES is a black box from PageTriage's perspective.
  2. More importantly, depending on where you are testing, it may not be running a real model or using the real data.
    1. Both models are disabled for test.wikipedia.org because 'testwiki' in ORES doesn't support them. (T198997 is related).
    2. Both models are enabled for enwiki in betalabs, but my guess is that they are giving scores based on revision content in enwiki production. That's what's happening locally for me. (To clarify: you ask for the wp10 score for rev 1234, and it fetches rev 1234's content from enwiki production to do the scoring.)

@SBisson -- thanks for this explanation. That helps. But given that ORES doesn't work right in testwiki or betalabs, how do you think we should test that our implementation of ORES is working correctly before we push it to production?

SBisson added a subscriber: Halfak. · Aug 1 2018, 4:09 PM

@SBisson -- thanks for this explanation. That helps. But given that ORES doesn't work right in testwiki or betalabs, how do you think we should test that our implementation of ORES is working correctly before we push it to production?

I think we have to be able to test on testwiki. I commented on T198997. @MMiller_WMF you can also add your perspective there to push things a little.

As for betalabs, I really don't know if it does or can have real ORES scores. That's a question for @Halfak.

@SBisson

I would not expect the 'wp10' score to change much because a template and a category were added

The fact that all revisions are subject to ORES scoring makes it quite probable that the scores will differ. And, yes, ideally it would be logical to exclude revisions where only AfC-related templates and categories were added.

Btw, the damaging and goodfaith models are also applied to all revisions (testwiki wmf.15). The screenshot shows that drafts (Draft:Test Zilant13 2) submitted for review and placed under review get re-evaluated even though the content was not changed.

@SBisson

I would not expect the 'wp10' score to change much because a template and a category were added

The fact that all revisions are subject to ORES scoring makes it quite probable that the scores will differ. And, yes, ideally it would be logical to exclude revisions where only AfC-related templates and categories were added.

I agree in general. But in practice, that's a shortcut the ORES server can take when it judges two consecutive revisions to be similar enough. As consumers of this data, I don't think we should get into such complexities.
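The "shortcut" mentioned above is a hypothetical server-side optimization, not something ORES is confirmed to do. A minimal sketch of what such a heuristic could look like, assuming it only needs to detect that two revisions differ solely in templates and category links (the regexes here are simplified and would not handle nested templates):

```python
import re

# Strip templates ({{...}}) and category links ([[Category:...]]).
# Simplified: does not handle nested templates or other edge cases.
STRIP = re.compile(r"\{\{[^{}]*\}\}|\[\[Category:[^\]]*\]\]")


def content_without_markup(wikitext):
    """Wikitext with templates and category links removed."""
    return STRIP.sub("", wikitext).strip()


def can_reuse_score(old_text, new_text):
    """True when the substantive content is unchanged between revisions,
    so in principle the previous score could be reused instead of rescoring."""
    return content_without_markup(old_text) == content_without_markup(new_text)


draft_v1 = "Some draft prose about a topic."
draft_v2 = ("{{AfC submission}}\n"
            "Some draft prose about a topic.\n"
            "[[Category:AfC pending submissions]]")

print(can_reuse_score(draft_v1, draft_v2))  # True
```

As the comment above says, this kind of logic belongs on the scoring side if anywhere; consumers like PageTriage should just display whatever scores they receive.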

Btw, the damaging and goodfaith models are also applied to all revisions (testwiki wmf.15). The screenshot shows that drafts (Draft:Test Zilant13 2) submitted for review and placed under review get re-evaluated even though the content was not changed.

Indeed, that's exactly how things are supposed to work.

I suggest declining this task as invalid.

Etonkovidova renamed this task from AfC - Review (AFCH) added categories change draft_quality and wp10 scores for the same draft to AfC - Review (AFCH) added categories change ORES scoring for the same draft. · Aug 2 2018, 4:07 PM
MMiller_WMF closed this task as Invalid. · Aug 3 2018, 1:00 AM

Declining as invalid. We can revisit if the community's experience shows that the scores are problematically erratic.

Halfak added a comment. · Aug 3 2018, 2:29 PM

Cool. FWIW, it's always good to throw a tag like articlequality-modeling on something like this, since my team might have insights to share :)

MMiller_WMF added a comment. · Edited · Aug 3 2018, 4:46 PM

@Halfak -- I did just want to ask you briefly, what are your thoughts on whether the model scores do/should change as categories and templates change? Were categories and templates part of the training data? I could see the case being made either way.

Halfak added a comment. · Aug 6 2018, 6:24 PM

Yes. Templates and categories are part of the features fed to the model. If we want the quality score not to change, we'll need to train the model that some templates don't matter. We'd need to add these observations to the training data and figure out a good strategy for re-balancing.
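To make concrete why adding an AfC template or category can shift the prediction: a sketch of structural feature extraction, loosely in the spirit of article-quality feature sets. The feature names and regexes here are invented for illustration, not the actual articlequality features:

```python
import re

def structural_features(wikitext):
    """Toy structural features: counts of templates and category links,
    plus the length of the remaining content. Illustrative only."""
    return {
        "template_count": len(re.findall(r"\{\{", wikitext)),
        "category_count": len(re.findall(r"\[\[Category:", wikitext)),
        "content_chars": len(
            re.sub(r"\{\{[^{}]*\}\}|\[\[Category:[^\]]*\]\]", "", wikitext)
        ),
    }


before = "Some draft prose."
after = ("{{AfC submission}}\n"
         "Some draft prose.\n"
         "[[Category:Pending AfC submissions]]")

print(structural_features(before))
print(structural_features(after))
# The prose is identical, but the feature vectors differ, so the model's
# prediction can legitimately shift between the two revisions.
```

This is why the score change reported in this task is expected behavior rather than a bug: the model's input genuinely changed.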

Do the predictions change substantially?

@Halfak -- we're not 100% sure yet. Once T198997 is deployed, I think we'll be able to play with the models in Test Wiki more easily and get a feel for how much they change. I know we could do this over the API, but having it in Test Wiki will allow less technical people to check it out.