
Flagged revs approve model to fiwiki
Closed, Resolved · Public

Description

Aaron Halfaker asked that this be written down as a ticket at the Wikimedia Hackathon 2017.

Create a model which tries to determine whether an edit will be manually approved with FlaggedRevs OR whether it will be reverted. No user interface is required, only API support. The goal is to see how well a model trained on FlaggedRevs approvals and reverts performs compared to the goodfaith and damaging models.

  • Good diffs (approved manually)
  • Bad diffs (changes generated using reverts detected by summary)


Event Timeline

Halfak triaged this task as Medium priority. Jun 1 2017, 2:31 PM
Halfak added a project: editquality-modeling.
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.
Halfak subscribed.

I talked to @Ladsgroup about this task and wanted to record a couple points of clarity. The point of this task is to train a brand new model on explicit flagged revs approvals/non-approvals. Then we'll allow fiwiki to use the model for managing their flagged revs backlog. The upside vs. the "damaging" model is that there should be far more observations.

We'll need to dig through the logging table a bit to figure out what the data looks like.

Some links,
https://www.mediawiki.org/wiki/Extension:FlaggedRevs
https://www.mediawiki.org/wiki/API:Review
https://www.mediawiki.org/wiki/Help:Extension:FlaggedRevs

Exploring the logging table structure on fiwiki,

select
        distinct log_action,
        log_type
from logging
where
        log_id > 9000000
order by
        log_type, log_action;

+-------------------+---------------+    
| log_action        | log_type      |    
+-------------------+---------------+    
| modify            | abusefilter   |    
| block             | block         |    
| reblock           | block         |    
| unblock           | block         |    
| delete            | delete        |    
| delete_redir      | delete        |    
| event             | delete        |    
| flow-delete-post  | delete        |    
| flow-delete-topic | delete        |    
| restore           | delete        |    
| revision          | delete        |    
| dwhitelist        | gblblock      |    
| flow-lock-topic   | lock          |    
| merge             | merge         |    
| move              | move          |    
| move_redir        | move          |    
| autocreate        | newusers      |    
| byemail           | newusers      |    
| create            | newusers      |    
| create2           | newusers      |    
| autopatrol        | patrol        |    
| patrol            | patrol        |    
| modify            | protect       |    
| protect           | protect       |    
| unprotect         | protect       |    
| renameuser        | renameuser    |    
| approve           | review        |    
| approve-a         | review        |    
| approve-i         | review        |    
| approve-ia        | review        |    
| approve2          | review        |    
| approve2-i        | review        |    
| unapprove         | review        |    
| unapprove2        | review        |    
| rights            | rights        |    
| hit               | spamblacklist |    
| config            | stable        |    
| modify            | stable        |    
| move_stable       | stable        |    
| reset             | stable        |    
| revision          | suppress      |    
| update            | tag           |    
| thank             | thanks        |    
| overwrite         | upload        |    
| upload            | upload        |    
+-------------------+---------------+

I think the ones we care about are approve* and unapprove*. From the code frontend/FlaggedRevsReviewLogFormatter.php, I glean that any approve-*a is an "auto" approval. The "2" seems to have something to do with "re-reviewing", where one reviewer can either confirm or remove the last reviewer's flag.

Log record contents are,

Approval:

       log_id: 9445215
     log_type: review
   log_action: approve  
log_timestamp: 20170629004808                    
     log_user: 242194   
log_namespace: 0        
    log_title: Mikael_Gabriel                    
  log_comment: [Tila: Silmäilty]                 
   log_params: a:3:{i:0;i:16580135;i:1;i:16560239;i:2;s:14:"20170628223936";}                     
  log_deleted: 0        
log_user_text: Seegge   
     log_page: 721071

Unapproval [sic]:

       log_id: 9443844
     log_type: review
   log_action: unapprove
log_timestamp: 20170628144603
     log_user: 305160
log_namespace: 0
    log_title: Tuuliranta
  log_comment: 
   log_params: a:3:{i:0;i:16579214;i:1;i:16570109;i:2;s:14:"20170628143155";}
  log_deleted: 0
log_user_text: Parantaja asiantuntija
     log_page: 982674

log_params looks like [rev_id_after_change, old_rev_id, change_timestamp].
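
For illustration, a minimal Python sketch (not the project's code) of how one might pull those three fields out of the PHP-serialized log_params shown above; it only handles the a:3:{...} layout, and the function name is just for this example:

import re

# Sketch only. Parses a PHP-serialized review log_params value like the ones
# above into its three fields. The legacy, space-separated format (mentioned
# later in this task) would need separate handling.
PARAMS_RE = re.compile(
    r'a:3:\{i:0;i:(?P<new_rev_id>\d+);'
    r'i:1;i:(?P<old_rev_id>\d+);'
    r'i:2;s:\d+:"(?P<change_timestamp>\d{14})";\}')

def parse_review_log_params(log_params):
    """Return (rev_id_after_change, old_rev_id, change_timestamp), or None."""
    match = PARAMS_RE.search(log_params)
    if match is None:
        return None
    return (int(match.group('new_rev_id')),
            int(match.group('old_rev_id')),
            match.group('change_timestamp'))

# For the approval record above:
# parse_review_log_params('a:3:{i:0;i:16580135;i:1;i:16560239;i:2;s:14:"20170628223936";}')
# => (16580135, 16560239, '20170628223936')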

This is the flaggedrevs record for the approval,

       fr_rev_id: 16580135
fr_rev_timestamp: 20170628223936
      fr_page_id: 721071
         fr_user: 242194
    fr_timestamp: 20170629004808
      fr_quality: 0
         fr_tags: accuracy:1

        fr_flags: ,dynamic
     fr_img_name: NULL
fr_img_timestamp: NULL
     fr_img_sha1: NULL

I can't find an equivalent record for the unapproval.

logging.log_action

  • approve, approve2 = reviewing the pending changes
  • approve-i, approve2-i = an article's first approval
  • approve-a, approve2-a = autoreviewed
  • approve-ia, approve2-ia = autoreviewed first approvals (articles created by users with the autoreview right)
  • unapprove, unapprove2 = removing the approval

If there is a "2" after the keyword, the selected review level for the revision is "quality"; without it, the level is labeled "checked" in the user interface (see the FlaggedRevs docs). The difference between the levels is that when the quality level is in use, you can link to the latest quality version or select it as the default version instead of the latest stable or current version of the page. However, there is no practical difference between the quality and checked levels in fiwiki or in any of the WMF wiki configurations, so all of these can be thought of as "checked" reviews.
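
As a rough illustration of this taxonomy (my own summary of the list above, not code from the FlaggedRevs extension), a decoder might look like:

def decode_review_action(log_action):
    """Split a review log_action into (kind, level, auto, first_approval)."""
    kind = 'unapprove' if log_action.startswith('unapprove') else 'approve'
    base, _, suffix = log_action.partition('-')
    # A trailing "2" on the keyword means the "quality" level; in practice all
    # WMF wikis treat it the same as "checked".
    level = 'quality' if base.endswith('2') else 'checked'
    auto = 'a' in suffix             # approve-a, approve-ia = autoreviewed
    first_approval = 'i' in suffix   # approve-i, approve-ia = article's first approval
    return kind, level, auto, first_approval

# decode_review_action('approve-ia')  => ('approve', 'checked', True, True)
# decode_review_action('approve2')    => ('approve', 'quality', False, False)
# decode_review_action('unapprove2')  => ('unapprove', 'quality', False, False)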

Flaggedrevs table
The flaggedrevs table contains only the current review state of a revision. If the revision is re-reviewed, the row is updated; if the revision is unapproved, the row is deleted.

@Zache Thank you for digging this up! Your queries also help; now I understand that "unapprove" isn't the same as a reversion. Actually, your queries are pretty much all we need to make this task happen :-)

@Zache I could use more eyeballs on:

https://quarry.wmflabs.org/query/20200
https://quarry.wmflabs.org/query/20201

My questions are stated in https://github.com/wiki-ai/editquality/commit/0426c71c2c1c, but to repeat here,

  • I'm unsure whether the reverted query is getting what we intended. When does "rv" get added to the ChangeTags, and is it specific to pages using flaggedrevs?
  • As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

I haven't been able to find any multi-diff approvals. Using the condition AND r1.rev_parent_id != r2.rev_id, I was able to find changes which were approved and then, perhaps, merged onto later versions of the page? But it doesn't appear that the change was cumulative. Any thoughts would be appreciated.

Update after discussing with @Halfak: I'm going to train our trial model using the FR approved revisions, plus the autolabeled set. We won't include the reverted query above.

I'm unsure whether the reverted query is getting what we intended. When does "rv" get added to the ChangeTags, and is it specific to pages using flaggedrevs?

It is added via a cron job to namespace-0 edits whose comment matches the regexp below AND whose comment does not match the text "special:contributions/" + rc_user_text (i.e. crude self-revert detection). The full query to get those can be found in Quarry 20339.

"(\\b[Rr][Vv][Vv]?\\b|[Ss]otkemis|\\b[Ss]otkua\\b|andalism|sivu palautettiin|ontributions/|uokkaukset|ylättiin|alautettin|umottiin)"

Update after discussing with @Halfak: I'm going to train our trial model using the FR approved revisions, plus the autolabeled set. We won't include the reverted query above.

OK, the rv tags should be good enough, I think.

As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

My select merges multiple edits into a single diff; for example, the diff for review log id 9521589 contains three edits from two users.

As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

I think the best solution would be to group edits from a single user / single edit session into a single diff, and to ignore the diff if it contains edits from multiple users.

The next best solution would be to ignore any diff that contains multiple edits.

The worst results would come from multi-diffs made by multiple users OR spanning multiple edit sessions, which are hard cases even for humans to review.
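
To make the first option concrete, a minimal sketch of that grouping heuristic (my own illustration, not existing project code; the one-hour session cutoff is an assumption):

from datetime import timedelta

SESSION_GAP = timedelta(hours=1)  # assumed cutoff between edit sessions

def classify_approval(revisions):
    """revisions: list of (user_text, timestamp) in the approved chain,
    ordered oldest to newest. Keep only single-user, single-session chains."""
    users = {user for user, _ in revisions}
    if len(users) > 1:
        return 'ignore'  # multiple users: hard even for humans to review
    timestamps = [ts for _, ts in revisions]
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > SESSION_GAP:
            return 'ignore'  # same user, but separate edit sessions
    return 'keep'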

I haven't been able to find any multi-diff approvals.

Some multi-diff examples: Quarry 20356. Sorry about the messy SQL, but the result list should be usable enough.


@Zache /o\ That was a wicked query! So it seems that multi-revision approvals are actually approving a chain of potentially unrelated edits. I'll go ahead and remove them from our data set.

@Halfak I need some advice on how to merge the "approved" list with existing data. Should I use known human-labeled damaging edits for contrast? Should I use any of the good, autolabeled changes which were not approved via FR? Should I use FR changes with the non-"approved" outcome? What proportion of the data set should be approved changes vs. bad changes? The merge_labels tool doesn't look like it wants to touch a new "approved" column. Should I use a different tool, or generalize merge_labels to handle unexpected columns and add "approved: False" to the non-approved rows?

Test results

make models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
revscoring test_model \
models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model \
damaging \
--observations=datasets/fiwiki.labeled_revisions_testing.w_cache.5k_2016.json > models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
2017-07-26 18:22:38,669 INFO:revscoring.utilities.test_model -- Testing model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_features="log2", min_samples_leaf=1, loss="deviance", subsample=1.0, scale=true, max_leaf_nodes=null, random_state=null, balanced_sample=false, center=true, presort="auto", init=null, min_samples_split=2, max_depth=5, learning_rate=0.01, balanced_sample_weight=true, min_weight_fraction_leaf=0.0, n_estimators=700, warm_start=false, verbose=0
 - version: 0.0.1
 - trained: 2017-07-25T20:50:13.806134

Table:
                 ~False    ~True
        -----  --------  -------
        False      4589      138
        True        137      121

Accuracy: 0.945
Precision:
        -----  -----
        False  0.971
        True   0.467
        -----  -----

Recall:
        -----  -----
        False  0.971
        True   0.469
        -----  -----

ROC-AUC:
        -----  ---
        False  0.9
        True   0.9
        -----  ---

PR-AUC:
        -----  -----
        False  0.993
        True   0.437
        -----  -----

Compared to revscoring model_info models/fiwiki.damaging.gradient_boosting.model, the model trained on flagged_revisions found fewer of the damaging edits. The ROC-AUC has also fallen.

Nice work on this. I'm starting to think that we're going to want some tuning to get this working better. @Zache, can you confirm that we should treat all "approved" edits as non-damaging/goodfaith?

If we do this again, noting one minor thing I messed up: We should have thrown out multi-revision approvals, but I never added that to my query.

Nice work on this. I'm starting to think that we're going to want some tuning to get this working better. @Zache, can you confirm that we should treat all "approved" edits as non-damaging/goodfaith?

If an approved edit was also reverted, then it should be treated as damaging and not goodfaith, OR it can be filtered out.

Also, if the content of an approved edit is not reverted but rewritten, it could be considered neutral (e.g. filtered out from the examples), but I think there is currently no easy way to detect rewrites? These are cases where a (goodfaith) edit is not good enough for some reason, and the reviewer approves the edit first and then fixes it.

One thing we could also try is filtering out reviews by the users "SeulojaBot" and "Zache", because their reviews include "bot" reviews.

A couple of questions:

#1 For the model to be useful, at some point we will need API access to it. Do I create a new phab ticket for that?

#2 How should the results be interpreted ( https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2017-07-26 )?

#3 I think that we will at least try, out of curiosity, to replicate what awight did using the makefile and revscoring. However, is there documentation somewhere on how to create models?

I think that we did not fully understand the meaning of "approved", and that has resulted in us not using that part of the dataset properly. @awight, can you use the autolabel script on the "approved" edits and then filter out all of the edits that are autolabeled "reverted_for_damage": true? That should give us just the good edits, and our labeling strategy should then work.
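
A minimal sketch of that filtering step, assuming the autolabeled data is one JSON object per line with a boolean reverted_for_damage field (the exact key layout in the real autolabel output may differ, so adjust the lookup accordingly):

import json
import sys

def keep_unreverted(in_file, out_file):
    """Drop approved edits that autolabel marked as reverted for damage."""
    for line in in_file:
        obs = json.loads(line)
        if obs.get("reverted_for_damage") is True:
            continue
        out_file.write(json.dumps(obs) + "\n")

# e.g. keep_unreverted(open("fiwiki.approved_autolabeled.json"), sys.stdout)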

I'm doing another iteration of this experiment, addressing the critiques that came up:

  • Omit approvals where more than one revision was approved.
  • Omit approvals which were later reverted for being damaging.
  • Omit approvals by users "SeulojaBot" and "Zache".

There are many more approval logs than I had realized at first. log_params was only serialized beginning in December 2016, and when we relax the serialized data match on log_params, there are about 320k rows to work with. I'll try to include this data and parse both the legacy and new format params.

This script gives us 310k rows in the desired format, but this form will only work on the "stat" machines. It needs to be tweaked to run on Quarry and given temporary-table privileges on a new db.

USE fiwiki;

DROP TABLE IF EXISTS test.fiwiki_flaggedrevs_approvals;
-- Uncomment and make temporary after debugging.
CREATE /*TEMPORARY*/ TABLE test.fiwiki_flaggedrevs_approvals (
  params TEXT,
  rev_id_start INTEGER,
  rev_id_end INTEGER,
  INDEX rev_id_start (rev_id_start),
  INDEX rev_id_end (rev_id_end)
);
-- Parse out the start and end revisions in the chain being approved, and
-- implicitly cast to int.
INSERT INTO test.fiwiki_flaggedrevs_approvals
  (params)
SELECT
  log_params
FROM
  logging
WHERE
  log_action IN ('approve', 'approve2')
  AND log_type = 'review'
  AND log_namespace=0
  -- User is not Zache or SeulojaBot
  AND log_user not in (4128, 324508);

-- Parse PHP-serialized params
UPDATE test.fiwiki_flaggedrevs_approvals
SET
  rev_id_start = 0 + REGEXP_REPLACE(params, '^.*i:0;i:(\\d+);i:1;i:(\\d+);.*$', '\\2'),
  rev_id_end = 0 + REGEXP_REPLACE(params, '^.*i:0;i:(\\d+);i:1;i:(\\d+);.*$', '\\1')
WHERE
  params like 'a:3:{i:0;i:%;i:1;i:%;i:2;s:14:"%";}';

-- Parse legacy serialized params
UPDATE test.fiwiki_flaggedrevs_approvals
SET
  rev_id_start = 0 + REGEXP_REPLACE(params, '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$', '\\2'),
  rev_id_end = 0 + REGEXP_REPLACE(params, '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$', '\\1')
WHERE
  params rlike '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$';

SELECT
  rev_id_end AS rev_id,
  concat("https://fi.wikipedia.org/w/index.php?diff=", rev_id_end, "&oldid=", rev_id_start) AS diff,
  'true' AS approved,
  'false' AS damaging,
  'true' AS goodfaith
FROM
  test.fiwiki_flaggedrevs_approvals,
  revision AS r1,
  revision AS r2
WHERE
  r1.rev_id = rev_id_end
  AND r2.rev_id = rev_id_start
  AND r1.rev_parent_id = rev_id_start;

-- Note that we still need to filter out edits that were later reverted.  We'll
-- accomplish that with autolabel.

After implementing the suggested fixes and rebuilding the model, its fitness seems to have gone down slightly, even with the bigger and presumably well-labeled training set.

cat datasets/fiwiki.flaggedrevs_training.w_cache.225k.json | \
revscoring train_model \
  revscoring.scorer_models.GradientBoosting \
  editquality.feature_lists.fiwiki.damaging \
  damaging \
  --observations "<stdin>" \
  -p 'learning_rate=0.01' \
  -p 'max_features="log2"' \
  -p 'max_depth=5' \
  -p 'n_estimators=700' \
  --balance-sample-weight \
  --version 0.0.1 \
  --center --scale > models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model
2017-08-02 04:15:34,005 INFO:revscoring.utilities.train_model -- Training model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_depth=5, random_state=null, init=null, min_weight_fraction_leaf=0.0, presort="auto", learning_rate=0.01, max_leaf_nodes=null, balanced_sample=false, subsample=1.0, verbose=0, warm_start=false, min_samples_leaf=1, scale=true, center=true, min_samples_split=2, max_features="log2", balanced_sample_weight=true, loss="deviance", n_estimators=700
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

No stats available

revscoring test_model \
models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model \
damaging \
--observations=datasets/fiwiki.labeled_revisions_testing.w_cache.5k_2016.json > models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
2017-08-02 04:43:57,726 INFO:revscoring.utilities.test_model -- Testing model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

Table:
                 ~False    ~True
        -----  --------  -------
        False      4588      139
        True        138      120

Accuracy: 0.944
Precision:
        -----  -----
        False  0.971
        True   0.463
        -----  -----

Recall:
        -----  -----
        False  0.971
        True   0.465
        -----  -----

ROC-AUC:
        -----  -----
        False  0.878
        True   0.878
        -----  -----

PR-AUC:
        -----  -----
        False  0.991
        True   0.401
        -----  -----

I'll post the recipe tomorrow in case anyone wants to replicate the experiment.

Interesting and surprising. Could it be that our labels for "damaging": false are recording something different from the assumed labels for approved-and-not-reverted? I guess it could also be that we're giving the model a lot of examples of good contributions by newcomers (since I imagine most edits caught by flaggedrevs are from anons and newcomers) and the model is simply down-weighting its confidence that newcomer/anon edits are crappy.

If we feel that this data can be considered a known-good, then I'm in favor of adding it to the training *and* testing sets and going from there. If we want to explore the approved-and-not-reverted data more carefully, then I suggest we turn that into a follow-up task.

@Zache We would love if you weighed in with how you would like to proceed. Our second experiment showed a slight drop in fitness which we can't fully explain, but @Halfak is considering mixing the Flagged Revs approval set into our training and test data for fiwiki anyway.

The question seems to be, how confident are we that the approved revisions used in this experiment are not damaging? Here's the approvals data set after making refinements: https://github.com/wiki-ai/editquality/blob/fiwiki.flaggedrevs/datasets/fiwiki.flaggedrevs_autolabeled_unreverted.210k_2017.json.bz2

@Zache: nudge; we're hoping to get your opinion on the question above. Just spot-check the data in that .json.bz2 file, and let us know whether you're confident that the approvals are roughly as good as the Wiki Labels output.

Sorry about the delay. Just too many things to do. I am trying to do this today.

@awight, after toying with the .json.bz2 I would say that we can be pretty sure the edits aren't damaging. I didn't find any clear vandalism, and in the context of damaging the biggest problem seems to be a rather high level of unsourced (goodfaith) edits. However, I personally don't see unsourced goodfaith edits as a problem.

Not directly related, but it seems that huwiki is moving to $wgFlaggedRevsOverride = false; mode (T121995) for a 6-month testing period. Currently on huwiki all edits are reviewed before becoming visible. After the change they will be directly visible, as on fiwiki, so there will be a bigger need to find bad edits. Also, they don't have ORES goodfaith/damaging model labeling done ( http://labels.wmflabs.org/stats/huwiki/ ).

Cleaned up and rebased. @awight, please have another look.