
Flagged revs approve model to fiwiki
Closed, Resolved · Public

Description

Aaron Halfaker asked that this be written down as a ticket at the Wikimedia Hackathon 2017.

Create a model which tries to determine whether an edit will be manually approved with FlaggedRevs OR whether it will be reverted. No user interface is required, only API support. The goal is to see how well a model trained on FlaggedRevs approvals and reverts performs compared to the goodfaith and damaging models.

  • Good diffs (approved manually)
  • Bad diffs (changes generated using reverts detected by summary)


Event Timeline

Halfak triaged this task as Medium priority. Jun 1 2017, 2:31 PM
Halfak added a project: editquality-modeling.
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.
Halfak subscribed.

I talked to @Ladsgroup about this task and wanted to record a couple points of clarity. The point of this task is to train a brand new model on explicit flagged revs approvals/non-approvals. Then we'll allow fiwiki to use the model for managing their flagged revs backlog. The upside vs. the "damaging" model is that there should be far more observations.

We'll need to dig through the logging table a bit to figure out what the data looks like.

Some links,
https://www.mediawiki.org/wiki/Extension:FlaggedRevs
https://www.mediawiki.org/wiki/API:Review
https://www.mediawiki.org/wiki/Help:Extension:FlaggedRevs

Exploring the logging table structure on fiwiki,

select
        distinct log_action,
        log_type
from logging
where
        log_id > 9000000
order by
        log_type, log_action;

+-------------------+---------------+    
| log_action        | log_type      |    
+-------------------+---------------+    
| modify            | abusefilter   |    
| block             | block         |    
| reblock           | block         |    
| unblock           | block         |    
| delete            | delete        |    
| delete_redir      | delete        |    
| event             | delete        |    
| flow-delete-post  | delete        |    
| flow-delete-topic | delete        |    
| restore           | delete        |    
| revision          | delete        |    
| dwhitelist        | gblblock      |    
| flow-lock-topic   | lock          |    
| merge             | merge         |    
| move              | move          |    
| move_redir        | move          |    
| autocreate        | newusers      |    
| byemail           | newusers      |    
| create            | newusers      |    
| create2           | newusers      |    
| autopatrol        | patrol        |    
| patrol            | patrol        |    
| modify            | protect       |    
| protect           | protect       |    
| unprotect         | protect       |    
| renameuser        | renameuser    |    
| approve           | review        |    
| approve-a         | review        |    
| approve-i         | review        |    
| approve-ia        | review        |    
| approve2          | review        |    
| approve2-i        | review        |    
| unapprove         | review        |    
| unapprove2        | review        |    
| rights            | rights        |    
| hit               | spamblacklist |    
| config            | stable        |    
| modify            | stable        |    
| move_stable       | stable        |    
| reset             | stable        |    
| revision          | suppress      |    
| update            | tag           |    
| thank             | thanks        |    
| overwrite         | upload        |    
| upload            | upload        |    
+-------------------+---------------+

I think the ones we care about are approve* and unapprove*. From the code frontend/FlaggedRevsReviewLogFormatter.php, I glean that any approve-*a is an "auto" approval. The "2" seems to have something to do with "re-reviewing", where one reviewer can either confirm or remove the last reviewer's flag.

Log record contents are,

Approval:

       log_id: 9445215
     log_type: review
   log_action: approve  
log_timestamp: 20170629004808                    
     log_user: 242194   
log_namespace: 0        
    log_title: Mikael_Gabriel                    
  log_comment: [Tila: Silmäilty]                 
   log_params: a:3:{i:0;i:16580135;i:1;i:16560239;i:2;s:14:"20170628223936";}                     
  log_deleted: 0        
log_user_text: Seegge   
     log_page: 721071

Unapproval [sic]:

       log_id: 9443844
     log_type: review
   log_action: unapprove
log_timestamp: 20170628144603
     log_user: 305160
log_namespace: 0
    log_title: Tuuliranta
  log_comment: 
   log_params: a:3:{i:0;i:16579214;i:1;i:16570109;i:2;s:14:"20170628143155";}
  log_deleted: 0
log_user_text: Parantaja asiantuntija
     log_page: 982674

log_params looks like [rev_id_after_change, old_rev_id, change_timestamp].
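
For illustration, a minimal Python sketch (not the project's code) of how one might pull those three fields out of the PHP-serialized log_params shown above; it only handles the a:3:{...} layout, and the function name is just for this example:

import re

# Sketch only. Parses a PHP-serialized review log_params value like the ones
# above into its three fields. The legacy, space-separated format (mentioned
# later in this task) would need separate handling.
PARAMS_RE = re.compile(
    r'a:3:\{i:0;i:(?P<new_rev_id>\d+);'
    r'i:1;i:(?P<old_rev_id>\d+);'
    r'i:2;s:\d+:"(?P<change_timestamp>\d{14})";\}')

def parse_review_log_params(log_params):
    """Return (rev_id_after_change, old_rev_id, change_timestamp), or None."""
    match = PARAMS_RE.search(log_params)
    if match is None:
        return None
    return (int(match.group('new_rev_id')),
            int(match.group('old_rev_id')),
            match.group('change_timestamp'))

# For the approval record above:
# parse_review_log_params('a:3:{i:0;i:16580135;i:1;i:16560239;i:2;s:14:"20170628223936";}')
# => (16580135, 16560239, '20170628223936')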

This is the flaggedrevs record for the approval,

       fr_rev_id: 16580135
fr_rev_timestamp: 20170628223936
      fr_page_id: 721071
         fr_user: 242194
    fr_timestamp: 20170629004808
      fr_quality: 0
         fr_tags: accuracy:1

        fr_flags: ,dynamic
     fr_img_name: NULL
fr_img_timestamp: NULL
     fr_img_sha1: NULL

I can't find an equivalent record for the unapproval.

logging.log_action

  • approve, approve2 = reviewing the pending changes
  • approve-i, approve2-i = an article's first approval
  • approve-a, approve2-a = autoreviewed
  • approve-ia, approve2-ia = autoreviewed first approvals (articles created by users with the autoreview right)
  • unapprove, unapprove2 = removing the approval

If there is a "2" after the keyword, the selected review level for the revision is "quality"; without it, the level is labeled "checked" in the user interface (see the FlaggedRevs docs). The difference between the levels is that when the quality level is in use, you can link to the latest quality version or select it as the default version instead of the latest stable or current version of the page. However, there is no practical difference between the quality and checked levels in fiwiki or in any of the WMF wiki configurations, so all of these can be thought of as "checked" reviews.
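
As a rough illustration of this taxonomy (my own summary of the list above, not code from the FlaggedRevs extension), a decoder might look like:

def decode_review_action(log_action):
    """Split a review log_action into (kind, level, auto, first_approval)."""
    kind = 'unapprove' if log_action.startswith('unapprove') else 'approve'
    base, _, suffix = log_action.partition('-')
    # A trailing "2" on the keyword means the "quality" level; in practice all
    # WMF wikis treat it the same as "checked".
    level = 'quality' if base.endswith('2') else 'checked'
    auto = 'a' in suffix             # approve-a, approve-ia = autoreviewed
    first_approval = 'i' in suffix   # approve-i, approve-ia = article's first approval
    return kind, level, auto, first_approval

# decode_review_action('approve-ia')  => ('approve', 'checked', True, True)
# decode_review_action('approve2')    => ('approve', 'quality', False, False)
# decode_review_action('unapprove2')  => ('unapprove', 'quality', False, False)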

Flaggedrevs table
The flaggedrevs table contains only the current review state of a revision. If the revision is re-reviewed, the row is updated; if the revision is unapproved, the row is deleted.

@Zache Thank you for digging this up! Your queries also help; now I understand that "unapprove" isn't the same as a reversion. Actually, your queries are pretty much all we need to make this task happen :-)

@Zache I could use more eyeballs on:

https://quarry.wmflabs.org/query/20200
https://quarry.wmflabs.org/query/20201

My questions are stated in https://github.com/wiki-ai/editquality/commit/0426c71c2c1c, but to repeat here,

  • I'm unsure whether the reverted query is getting what we intended. When does "rv" get added to the ChangeTags, and is it specific to pages using flaggedrevs?
  • As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

I haven't been able to find any multi-diff approvals. Using the condition AND r1.rev_parent_id != r2.rev_id, I was able to find changes which were approved and then, perhaps, merged onto later versions of the page? But it doesn't appear that the change was cumulative. Any thoughts would be appreciated.

Update after discussing with @Halfak: I'm going to train our trial model using the FR approved revisions, plus the autolabeled set. We won't include the reverted query above.

I'm unsure whether the reverted query is getting what we intended. When does "rv" get added to the ChangeTags, and is it specific to pages using flaggedrevs?

It is added via a cron job to namespace-0 edits whose comment matches the regexp below AND whose comment does not match the text "special:contributions/" + rc_user_text (i.e. crude self-revert detection). The full query to get those can be found in Quarry 20339.

"(\\b[Rr][Vv][Vv]?\\b|[Ss]otkemis|\\b[Ss]otkua\\b|andalism|sivu palautettiin|ontributions/|uokkaukset|ylättiin|alautettin|umottiin)"

Update after discussing with @Halfak: I'm going to train our trial model using the FR approved revisions, plus the autolabeled set. We won't include the reverted query above.

OK, the rv tags should be good enough, I think.

As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

My select merges multiple edits into a single diff; for example, the diff for review log id 9521589 contains three edits from two users.

As you said, the approved query is only taking the first diff, and there can be multiple edits per approval. I'm wondering if we want to ignore these multi-diff approvals, or take all the included diffs as approved?

I think the best solution would be to group edits from a single user / single edit session into a single diff, and to ignore the diff if it contains edits from multiple users.

The next best solution would be to ignore any diff that contains multiple edits.

The worst results would come from multi-diffs made by multiple users OR spanning multiple edit sessions, which are hard cases even for humans to review.
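
To make the first option concrete, a minimal sketch of that grouping heuristic (my own illustration, not existing project code; the one-hour session cutoff is an assumption):

from datetime import timedelta

SESSION_GAP = timedelta(hours=1)  # assumed cutoff between edit sessions

def classify_approval(revisions):
    """revisions: list of (user_text, timestamp) in the approved chain,
    ordered oldest to newest. Keep only single-user, single-session chains."""
    users = {user for user, _ in revisions}
    if len(users) > 1:
        return 'ignore'  # multiple users: hard even for humans to review
    timestamps = [ts for _, ts in revisions]
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > SESSION_GAP:
            return 'ignore'  # same user, but separate edit sessions
    return 'keep'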

I haven't been able to find any multi-diff approvals.

Some multi-diff examples: Quarry 20356. Sorry about the messy SQL, but the result list should be usable enough.


@Zache /o\ That was a wicked query! So it seems that multi-revision approvals are actually approving a chain of potentially unrelated edits. I'll go ahead and remove them from our data set.

@Halfak I need some advice on how to merge the "approved" list with existing data. Should I use known human-labeled damaging edits for contrast? Should I use any of the good, autolabeled changes which were not approved via FR? Should I use FR changes with the non-"approved" outcome? What proportion of the data set should be approved changes vs. bad changes? The merge_labels tool doesn't look like it wants to touch a new "approved" column. Should I use a different tool, or generalize merge_labels to handle unexpected columns and add "approved: False" to the non-approved rows?

Test results

make models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
revscoring test_model \
models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model \
damaging \
--observations=datasets/fiwiki.labeled_revisions_testing.w_cache.5k_2016.json > models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
2017-07-26 18:22:38,669 INFO:revscoring.utilities.test_model -- Testing model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_features="log2", min_samples_leaf=1, loss="deviance", subsample=1.0, scale=true, max_leaf_nodes=null, random_state=null, balanced_sample=false, center=true, presort="auto", init=null, min_samples_split=2, max_depth=5, learning_rate=0.01, balanced_sample_weight=true, min_weight_fraction_leaf=0.0, n_estimators=700, warm_start=false, verbose=0
 - version: 0.0.1
 - trained: 2017-07-25T20:50:13.806134

Table:
                 ~False    ~True
        -----  --------  -------
        False      4589      138
        True        137      121

Accuracy: 0.945
Precision:
        -----  -----
        False  0.971
        True   0.467
        -----  -----

Recall:
        -----  -----
        False  0.971
        True   0.469
        -----  -----

ROC-AUC:
        -----  ---
        False  0.9
        True   0.9
        -----  ---

PR-AUC:
        -----  -----
        False  0.993
        True   0.437
        -----  -----

Compared to revscoring model_info models/fiwiki.damaging.gradient_boosting.model, the model trained on flagged_revisions found fewer of the damaging edits. The ROC-AUC has also fallen.

Nice work on this. I'm starting to think that we're going to want some tuning to get this working better. @Zache, can you confirm that we should treat all "approved" edits as non-damaging/goodfaith?

If we do this again, noting one minor thing I messed up: We should have thrown out multi-revision approvals, but I never added that to my query.

Nice work on this. I'm starting to think that we're going to want some tuning to get this working better. @Zache, can you confirm that we should treat all "approved" edits as non-damaging/goodfaith?

If an approved edit was also reverted, then it should be treated as damaging and not goodfaith, OR it can be filtered out.

Also, if the content of an approved edit is not reverted but rewritten, it could be considered neutral (e.g. filtered out from the examples), but I think there is currently no easy way to detect rewrites? These are cases where a (goodfaith) edit is not good enough for some reason, and the reviewer approves the edit first and then fixes it.

One thing we could also try is filtering out reviews by the users "SeulojaBot" and "Zache", because their reviews include "bot" reviews.

A couple of questions:

#1 For the model to be useful, at some point we will need API access to it. Do I create a new phab ticket for that?

#2 How should the results be interpreted ( https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2017-07-26 )?

#3 I think that we will at least try, out of curiosity, to replicate what awight did using the makefile and revscoring. However, is there documentation somewhere on how to create models?

I think that we did not fully understand the meaning of "approved", and that has resulted in us not using that part of the dataset properly. @awight, can you use the autolabel script on the "approved" edits and then filter out all of the edits that are autolabeled "reverted_for_damage": true? That should give us just the good edits, and our labeling strategy should then work.
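
A minimal sketch of that filtering step, assuming the autolabeled data is one JSON object per line with a boolean reverted_for_damage field (the exact key layout in the real autolabel output may differ, so adjust the lookup accordingly):

import json
import sys

def keep_unreverted(in_file, out_file):
    """Drop approved edits that autolabel marked as reverted for damage."""
    for line in in_file:
        obs = json.loads(line)
        if obs.get("reverted_for_damage") is True:
            continue
        out_file.write(json.dumps(obs) + "\n")

# e.g. keep_unreverted(open("fiwiki.approved_autolabeled.json"), sys.stdout)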

I'm doing another iteration of this experiment, addressing the critiques that came up:

  • Omit approvals where more than one revision was approved.
  • Omit approvals which were later reverted for being damaging.
  • Omit approvals by users "SeulojaBot" and "Zache".

There are many more approval logs than I had realized at first. log_params was only serialized beginning in December 2016, and when we relax the serialized data match on log_params, there are about 320k rows to work with. I'll try to include this data and parse both the legacy and new format params.

This script gives us 310k rows in the desired format, but this form will only work on the "stat" machines. It needs to be tweaked to run on Quarry and given temporary-table privileges on a new db.

USE fiwiki;

DROP TABLE IF EXISTS test.fiwiki_flaggedrevs_approvals;
-- Uncomment and make temporary after debugging.
CREATE /*TEMPORARY*/ TABLE test.fiwiki_flaggedrevs_approvals (
  params TEXT,
  rev_id_start INTEGER,
  rev_id_end INTEGER,
  INDEX rev_id_start (rev_id_start),
  INDEX rev_id_end (rev_id_end)
);
-- Parse out the start and end revisions in the chain being approved, and
-- implicitly cast to int.
INSERT INTO test.fiwiki_flaggedrevs_approvals
  (params)
SELECT
  log_params
FROM
  logging
WHERE
  log_action IN ('approve', 'approve2')
  AND log_type = 'review'
  AND log_namespace=0
  -- User is not Zache or SeulojaBot
  AND log_user not in (4128, 324508);

-- Parse PHP-serialized params
UPDATE test.fiwiki_flaggedrevs_approvals
SET
  rev_id_start = 0 + REGEXP_REPLACE(params, '^.*i:0;i:(\\d+);i:1;i:(\\d+);.*$', '\\2'),
  rev_id_end = 0 + REGEXP_REPLACE(params, '^.*i:0;i:(\\d+);i:1;i:(\\d+);.*$', '\\1')
WHERE
  params like 'a:3:{i:0;i:%;i:1;i:%;i:2;s:14:"%";}';

-- Parse legacy serialized params
UPDATE test.fiwiki_flaggedrevs_approvals
SET
  rev_id_start = 0 + REGEXP_REPLACE(params, '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$', '\\2'),
  rev_id_end = 0 + REGEXP_REPLACE(params, '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$', '\\1')
WHERE
  params rlike '^\\s*(\\d+)\\s+(\\d+)\\s+\\d+\\s*$';

SELECT
  rev_id_end AS rev_id,
  concat("https://fi.wikipedia.org/w/index.php?diff=", rev_id_end, "&oldid=", rev_id_start) AS diff,
  'true' AS approved,
  'false' AS damaging,
  'true' AS goodfaith
FROM
  test.fiwiki_flaggedrevs_approvals,
  revision AS r1,
  revision AS r2
WHERE
  r1.rev_id = rev_id_end
  AND r2.rev_id = rev_id_start
  AND r1.rev_parent_id = rev_id_start;

-- Note that we still need to filter out edits that were later reverted.  We'll
-- accomplish that with autolabel.

After implementing the suggested fixes and rebuilding the model, its fitness seems to have gone down slightly, even with the bigger and presumably well-labeled training set.

cat datasets/fiwiki.flaggedrevs_training.w_cache.225k.json | \
revscoring train_model \
  revscoring.scorer_models.GradientBoosting \
  editquality.feature_lists.fiwiki.damaging \
  damaging \
  --observations "<stdin>" \
  -p 'learning_rate=0.01' \
  -p 'max_features="log2"' \
  -p 'max_depth=5' \
  -p 'n_estimators=700' \
  --balance-sample-weight \
  --version 0.0.1 \
  --center --scale > models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model
2017-08-02 04:15:34,005 INFO:revscoring.utilities.train_model -- Training model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_depth=5, random_state=null, init=null, min_weight_fraction_leaf=0.0, presort="auto", learning_rate=0.01, max_leaf_nodes=null, balanced_sample=false, subsample=1.0, verbose=0, warm_start=false, min_samples_leaf=1, scale=true, center=true, min_samples_split=2, max_features="log2", balanced_sample_weight=true, loss="deviance", n_estimators=700
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

No stats available

revscoring test_model \
models/fiwiki.damaging_w_flaggedrevs_wo_testinfo.gradient_boosting.model \
damaging \
--observations=datasets/fiwiki.labeled_revisions_testing.w_cache.5k_2016.json > models/fiwiki.damaging_w_flaggedrevs.gradient_boosting.model
2017-08-02 04:43:57,726 INFO:revscoring.utilities.test_model -- Testing model...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_leaf_nodes=null, warm_start=false, subsample=1.0, verbose=0, max_features="log2", random_state=null, min_samples_split=2, loss="deviance", init=null, n_estimators=700, learning_rate=0.01, balanced_sample_weight=true, scale=true, max_depth=5, center=true, min_weight_fraction_leaf=0.0, min_samples_leaf=1, presort="auto", balanced_sample=false
 - version: 0.0.1
 - trained: 2017-08-02T04:43:42.045973

Table:
                 ~False    ~True
        -----  --------  -------
        False      4588      139
        True        138      120

Accuracy: 0.944
Precision:
        -----  -----
        False  0.971
        True   0.463
        -----  -----

Recall:
        -----  -----
        False  0.971
        True   0.465
        -----  -----

ROC-AUC:
        -----  -----
        False  0.878
        True   0.878
        -----  -----

PR-AUC:
        -----  -----
        False  0.991
        True   0.401
        -----  -----

I'll post the recipe tomorrow in case anyone wants to replicate the experiment.

Interesting and surprising. Could it be that our labels for "damaging": false are recording something different from the assumed labels for approved-and-not-reverted? I guess it could also be that we're giving the model a lot of examples of good contributions by newcomers (since I imagine most edits caught by flaggedrevs are from anons and newcomers) and the model is simply down-weighting its confidence that newcomer/anon edits are crappy.

If we feel that this data can be considered a known-good, then I'm in favor of adding it to the training *and* testing sets and going from there. If we want to explore the approved-and-not-reverted data more carefully, then I suggest we turn that into a follow-up task.

@Zache We would love if you weighed in with how you would like to proceed. Our second experiment showed a slight drop in fitness which we can't fully explain, but @Halfak is considering mixing the Flagged Revs approval set into our training and test data for fiwiki anyway.

The question seems to be, how confident are we that the approved revisions used in this experiment are not damaging? Here's the approvals data set after making refinements: https://github.com/wiki-ai/editquality/blob/fiwiki.flaggedrevs/datasets/fiwiki.flaggedrevs_autolabeled_unreverted.210k_2017.json.bz2

@Zache: nudge; we're hoping to get your opinion on the question above. Just spot-check the data in that .json.bz2 file, and let us know whether you're confident that the approvals are roughly as good as the Wiki Labels output.

Sorry about the delay. Just too many things to do. I am trying to do this today.

@awight, after toying with the .json.bz2 I would say that we can be pretty sure the edits aren't damaging. I didn't find any clear vandalism, and in the context of damaging the biggest problem seems to be a rather high level of unsourced (goodfaith) edits. However, I personally don't see unsourced goodfaith edits as a problem.

Not directly related, but it seems that huwiki is moving to $wgFlaggedRevsOverride = false; mode (T121995) for a 6-month testing period. Currently on huwiki all edits are reviewed before becoming visible. After the change they will be directly visible, as on fiwiki, so there will be a bigger need to find bad edits. Also, they don't have ORES goodfaith/damaging model labeling done ( http://labels.wmflabs.org/stats/huwiki/ ).

Cleaned up and rebased. @awight, please have another look.