
Implement wp10 model for trwiki
Closed, ResolvedPublic

Description

Event Timeline

We use a project template like this:

{{VikiProje |Proje = ____ |Sınıf = ____ |Önem = ____ }}

Parameters and possible values:
Proje (name of project) = 10K, Edebiyat, Sanat, Sinema, Çin, Azerbaycan, Sovyetler Birliği, etc.
Sınıf (class, quality) = SM (featured article), SL (featured list), KM (good article), B, C, Başlangıç (start), Taslak (stub), Liste (list), Gerekli (needed), Şablon (template)
Önem (importance) = En (Top), Çok (High), Orta (Mid), Az (Low)
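
For reference, those class values line up with the familiar assessment scale roughly like this (a quick Python sketch for illustration only; the mapping below is mine and not taken from any existing code):

# Illustrative mapping from trwiki class labels to the enwiki-style scale,
# based on the glosses above (not part of the wikiclass codebase).
TRWIKI_CLASS_MAP = {
    "SM": "FA",           # featured article
    "SL": "FL",           # featured list
    "KM": "GA",           # good article
    "B": "B",
    "C": "C",
    "Başlangıç": "Start",
    "Taslak": "Stub",
    "Liste": "List",
}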

Here is the project table for 10K. Half of these articles don't have a class label.

Our community definitely needs WikiClass. Thanks.

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Halfak triaged this task as Medium priority. May 11 2017, 2:36 PM
Halfak moved this task from Unsorted to Research & analysis on the Machine-Learning-Team board.
Halfak added a subscriber: Nettrom.

How do InfoBoxes work? Are they used like on English Wikipedia?

Are there "citation needed" templates? How do they work?

Yes.
Template:Citation needed: https://tr.wikipedia.org/wiki/Şablon:Kaynak_belirt
with redirects from: Olgu, Fact, Delil

We usually imitate English Wikipedia. You can find all the necessary templates via the interwiki links.

@Mavrikant: thanks for getting the code for the trwiki extractor up at https://github.com/Mavrikant/wikiclass/blob/master/wikiclass/extractors/trwiki.py, it makes everything a lot easier!

I chatted with @Halfak about it on IRC, and I have two comments:

  1. The article quality model is not tested on lists, so we should not pick up featured list labels. I have not studied quality criteria for lists on Wikipedia, so I have no idea whether the model would work for them, but we do know that it works well for regular articles.
  2. The code that strips HTML comments has trouble with some cases where there are newlines, leading it to not pick up the quality label correctly. I identified this bug a few months ago and wrote some updated code to try to fix it, but didn't get around to properly testing it. Nonetheless, there's now a pull request up at https://github.com/wiki-ai/wikiclass/pull/35 The exact change that shows how to alter the code is here: https://github.com/wiki-ai/wikiclass/pull/35/commits/d9687dfdc7a97309282e00c07b8c596910d9e111 (the idea is roughly sketched below).
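
For reference, the gist of the fix is roughly this (just a sketch; the commit linked above is the authoritative change):

import re

# HTML comments can span multiple lines, so the pattern needs DOTALL;
# without it, a comment containing a newline is not removed and the
# quality label that follows it gets missed.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(wikitext):
    """Remove HTML comments, including multi-line ones, from wikitext."""
    return COMMENT_RE.sub("", wikitext)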

Let me know if there's anything I can help with!

@Nettrom I changed the trwiki extractor.

  1. We can skip lists.
  2. I'm 100% sure there are no comments in the templates and 99.9% sure there are no newlines. I edited all the pages with a bot two weeks ago (example edit). They are all in standard form.

@Mavrikant Excellent! The extractor looks good to go as far as I can tell. Also, happy to hear that you don't have HTML comments in your WikiProject templates, that makes life a lot easier :)

I used the extractor on trwiki and found zero instances of the template. Something must have gone wrong. Can you give me a page that has the template to process?

Got the issue: we were lower-casing the template name and then checking for equality with an upper-cased string. I have a new branch on the main repo where I have merged the work of @Mavrikant and am working on a new extraction. Once it *works*, I'll submit a PR that includes @Mavrikant's commits. :)
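
For the record, the bug boiled down to something like this (a sketch, not the actual diff):

# Buggy check: the name is lower-cased but compared against a
# mixed-case literal, so it can never match.
name = str(template.name).strip().lower()
if name == "VikiProje":   # always False
    ...

# Fix: compare against the lower-cased literal instead.
if name == "vikiproje":
    ...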

Just finished a run over the XML dumps and only found 1712 labels on 1636 pages. It looks like they were all added by Mavrikant bot. Since we need to find the version of the article that was originally labeled, we'll need to handle the templates as they were before Mavrikant bot came along to clean them up. I'll be looking into this.

As an example, this revision has the following quality templates:

{{VikiProje Türkiye
 |sınıf=B
 |önem=En
}}
{{VikiProje GM|sınıf=B| önem = En }}
{{VikiProje 10K | sınıf = B | önem = En }}
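
For concreteness, one way to pull the parameters out of templates like these with mwparserfromhell (a sketch only; the extractor in the repo is what actually runs and handles more variants):

import mwparserfromhell

def quality_labels(wikitext):
    """Yield (project, class) pairs from VikiProje templates.

    The old templates use lower-case "sınıf" and the unified one uses
    "Sınıf", so both parameter names are checked.
    """
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        name = str(template.name).strip().lower()
        if not name.startswith("vikiproje"):
            continue
        for param in ("sınıf", "Sınıf"):
            if template.has(param):
                project = name[len("vikiproje"):].strip() or None
                yield project, str(template.get(param).value).strip()
                break

For the second template above, this would yield ("gm", "B").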

It's good to know the AI looks at the revision from that time for training. Most of our labels are outdated.
Some things you may need to know:

  • After Mavrikant Bot unified the project templates, our users added more than 500 labels to talk pages (example: Soviet Union WikiProject collaboration + new project Feminism).
  • There was an A-class before the unification, but we removed this class and moved its pages to B-class (58 pages).
  • Currently there are 4698 unique pages with a class label (PetScan).
  • Mavrikant Bot added empty WikiProject templates and importance labels to talk pages by looking at enwiki (importance adding, empty WikiProject template adding) (~2000 edits).
  • The unification process mostly looks like this: "{{VikiProje XXXXX |sınıf=B |önem=En }}" -> "{{VikiProje |Proje=XXXXX |Sınıf=B |Önem=En }}" (a rough sketch of this rewrite appears after this list).
    • In some cases the project name changes. These projects have few pages and probably already have another WikiProject's template, so you can ignore this case (example1, example2).
    • The old template may have lower-case characters ( {{[Vv]iki[Pp]roje [Ss]por ...}} -> {{VikiProje |Proje=Spor ...}} ).
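
Roughly, the bot's rewrite was equivalent to something like this (a simplified Python sketch, not the bot's actual code):

import re

# Old per-project form, possibly with mixed case and newlines:
#   {{VikiProje Spor |sınıf=B |önem=En}}
OLD_FORM_RE = re.compile(
    r"\{\{\s*vikiproje\s+(?P<proje>[^|{}]+?)\s*"
    r"\|\s*sınıf\s*=\s*(?P<sinif>[^|{}]*?)\s*"
    r"\|\s*önem\s*=\s*(?P<onem>[^|{}]*?)\s*\}\}",
    re.IGNORECASE | re.DOTALL,
)

def unify(wikitext):
    """Rewrite old per-project VikiProje templates into the unified form."""
    def replacement(match):
        return "{{{{VikiProje |Proje={0} |Sınıf={1} |Önem={2}}}}}".format(
            match.group("proje").strip(),
            match.group("sinif").strip(),
            match.group("onem").strip(),
        )
    return OLD_FORM_RE.sub(replacement, wikitext)
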
$ wc trwiki.observations.first_labelings.20170501.json 
  12996  124270 1465684 trwiki.observations.first_labelings.20170501.json
$ cat trwiki.observations.first_labelings.20170501.json | json2tsv page_title | sort | uniq | wc
  10722   25506  200154
$ cat trwiki.observations.first_labelings.20170501.json | json2tsv wp10 | sort | uniq -c
    919 b
   1485 c
    293 km
    272 sm
  10027 taslak

It would be useful to build a model based on balanced observations. Looks like we can get a balanced set of 272*5 = 1360 observations. This should be enough for an initial model. We could then use this model to find good examples of sm/km predictions that are not yet promoted, to boost the number of those observations.
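
To spell out the arithmetic: sm is the smallest class at 272 observations, so downsampling each of the 5 classes to 272 gives 272*5 = 1360. A sketch of that kind of downsampling over the line-delimited observations file (illustrative only; the actual training run below just passes --balance-sample to revscoring):

import json
import random
from collections import defaultdict

def balanced_sample(path, per_class=272, seed=0):
    """Downsample each wp10 class to the size of the smallest class."""
    by_class = defaultdict(list)
    with open(path) as f:
        for line in f:
            obs = json.loads(line)
            by_class[obs["wp10"]].append(obs)
    rng = random.Random(seed)
    sample = []
    for label, observations in by_class.items():
        sample.extend(rng.sample(observations, min(per_class, len(observations))))
    return sample

# e.g. balanced_sample("trwiki.observations.first_labelings.20170501.json")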

Just started extracting features. I should have a baseline model ready in a few hours.

cat datasets/trwiki.labeling_revisions.w_cache.2k.json | \
        revscoring cv_train \
          revscoring.scorer_models.GradientBoosting \
          wikiclass.feature_lists.trwiki.wp10 \
          wp10 \
          --version 0.5.0 \
          -p 'max_depth=5' \
          -p 'learning_rate=0.01' \
          -p 'max_features="log2"' \
          -p 'n_estimators=300' \
          -s 'table' -s 'accuracy' -s 'roc' -s 'f1' \
          --balance-sample \
          --center --scale > \
        models/trwiki.wp10.gradient_boosting.model
2017-06-15 20:58:28,448 INFO:revscoring.utilities.cv_train -- Cross-validating model statistics for 10 folds...
2017-06-15 20:58:28,489 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 1...
2017-06-15 20:58:28,493 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 2...
2017-06-15 20:58:28,496 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 3...
2017-06-15 20:58:28,499 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 4...
2017-06-15 20:58:28,502 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 5...
2017-06-15 20:58:28,504 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 6...
2017-06-15 20:58:28,507 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 7...
2017-06-15 20:58:28,509 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 8...
2017-06-15 20:58:28,512 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 9...
2017-06-15 20:58:28,519 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 10...
2017-06-15 20:58:32,303 INFO:revscoring.utilities.cv_train -- Training model on all data...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: min_samples_leaf=1, learning_rate=0.01, max_features="log2", random_state=null, max_depth=5, center=true, n_estimators=300, balanced_sample_weight=false, init=null, min_samples_split=2, presort="auto", warm_start=false, scale=true, min_weight_fraction_leaf=0.0, loss="deviance", max_leaf_nodes=null, verbose=0, balanced_sample=true, subsample=1.0
 - version: 0.5.0
 - trained: 2017-06-15T20:58:35.694308

Table:
                  ~b    ~c    ~km    ~sm    ~taslak
        ------  ----  ----  -----  -----  ---------
        b        126    65     39     34          4
        c         43   168     25      7         23
        km        17    12    203     31          4
        sm        11     3     33    218          4
        taslak     6    22      2      0        238

Accuracy: 0.712
ROC-AUC:
        --------  -----
        'b'       0.847
        'c'       0.893
        'km'      0.93
        'sm'      0.96
        'taslak'  0.983
        --------  -----

F1:
        ------  -----
        b       0.529
        km      0.711
        sm      0.777
        c       0.621
        taslak  0.876
        ------  -----

Pull request incoming.

I've got the model deployed to our WMFLabs service. See https://ores.wmflabs.org/v3/scores/trwiki/18724885/wp10 for a score of the most recent version of https://tr.wikipedia.org/wiki/DNA

This is ready for some experimentation.
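
If you want to hit it programmatically, something like this works against the endpoint above (a sketch; it just dumps whatever JSON ORES returns):

import json
import urllib.request

# Fetch the wp10 score for a trwiki revision from the WMFLabs ORES instance.
rev_id = 18724885
url = "https://ores.wmflabs.org/v3/scores/trwiki/{}/wp10".format(rev_id)
with urllib.request.urlopen(url) as response:
    print(json.dumps(json.load(response), indent=2, ensure_ascii=False))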

I talked to @Mavrikant in IRC and he told me that a whole class was missing from the data. So I'm now re-building the model. New PR coming soon.

Here's the newest version of the model:

ScikitLearnClassifier
 - type: GradientBoosting
 - params: warm_start=false, subsample=1.0, init=null, loss="deviance", n_estimators=300, balanced_sample=true, balanced_sample_weight=false, learning_rate=0.01, min_weight_fraction_leaf=0.0, presort="auto", center=true, min_samples_split=2, scale=true, max_features="log2", min_samples_leaf=1, verbose=0, max_depth=5, max_leaf_nodes=null, random_state=null
 - version: 0.5.0
 - trained: 2017-06-19T16:03:11.379319

Table:
                    ~b    ~baslagıç    ~c    ~km    ~sm    ~taslak
        --------  ----  -----------  ----  -----  -----  ---------
        b          133           16    49     42     29          2
        baslagıç     5          149    54      6      0         52
        c           27           58   145     21      7          9
        km          15            5    16    197     32          2
        sm           6            2     3     28    227          3
        taslak       2           51     9      1      0        205

Accuracy: 0.657
ROC-AUC:
        ----------  -----
        'b'         0.863
        'baslagıç'  0.895
        'c'         0.867
        'km'        0.94
        'sm'        0.969
        'taslak'    0.967
        ----------  -----

F1:
        --------  -----
        km        0.699
        taslak    0.757
        baslagıç  0.541
        b         0.58
        c         0.532
        sm        0.804
        --------  -----