It seems they have a wp10-like system:
https://tr.wikipedia.org/wiki/Tart%C4%B1%C5%9Fma:T%C3%BCrkiye?veaction=editsource
Contact person: @Mavrikant
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Halfak | T168099 Mid June 2017 ORES deployment
Resolved | | Halfak | T164671 Implement wp10 model for trwiki
We use a project template like this:
{{VikiProje |Proje = ____ |Sınıf = ____ |Önem = ____ }}
Parameters and possible values:
Proje (name of project) = 10K, Edebiyat, Sanat, Sinema, Çin, Azerbaycan, Sovyetler Birliği, etc....
Sınıf (class, quality) = SM (featured article) , SL (featured list) , KM (Good articles) , B , C , Başlangıç (start) , Taslak (stub) , Liste (list) , Gerekli (needed) , Şablon (template)
Önem (importance) = En (Top) , Çok(High) , Orta (Mid) , Az (Low)
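As a small illustration of the parameter values listed above, here is a lookup sketch in Python. The English glosses are taken from the list in this comment; the names `CLASS_LABELS`, `IMPORTANCE_LABELS` and `describe` are made up for this example and do not come from wikiclass.

```python
# Sınıf (class) values and their English glosses, as listed above.
CLASS_LABELS = {
    "SM": "featured article",
    "SL": "featured list",
    "KM": "good article",
    "B": "B",
    "C": "C",
    "Başlangıç": "start",
    "Taslak": "stub",
    "Liste": "list",
    "Gerekli": "needed",
    "Şablon": "template",
}

# Önem (importance) values and their English glosses, as listed above.
IMPORTANCE_LABELS = {"En": "top", "Çok": "high", "Orta": "mid", "Az": "low"}


def describe(sinif, onem):
    """Return the English glosses for a (Sınıf, Önem) pair.

    Unknown values come back as None instead of raising, since half of
    the articles are reported to have no class label at all.
    """
    return CLASS_LABELS.get(sinif), IMPORTANCE_LABELS.get(onem)


print(describe("KM", "En"))
```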
Project table for 10K. Half of these articles don't have a class label.
Our community definitely needs WikiClass. Thanks.
How do InfoBoxes work? Are they used like on English Wikipedia?
Are there "citation needed" templates? How do they work?
Yes.
Template:Citation needed: https://tr.wikipedia.org/wiki/Şablon:Kaynak_belirt
Redirects to it: Olgu, Fact, Delil
We usually imitate English Wikipedia. You can find all the necessary templates through interwiki links.
@Mavrikant: thanks for getting code for the trwiki extractor up on https://github.com/Mavrikant/wikiclass/blob/master/wikiclass/extractors/trwiki.py, it makes everything a lot easier!
I chatted with @Halfak about it on IRC, and I have two comments:
Let me know if there's anything I can help with!
@Nettrom I changed the trwiki extractor.
@Mavrikant Excellent! The extractor looks good to go as far as I can tell. Also, happy to hear that you don't have HTML comments in your WikiProject templates, that makes life a lot easier :)
I used the extractor on trwiki and found zero instances of the template. Something must have gone wrong. Can you give me a page that has the template to process?
Got the issue. We were lower-casing the template name and then checking for equality with an upper-cased string. I have a new branch on the main repo where I have merged the work of @Mavrikant and am working on a new extraction. Once it *works*, I'll submit a PR that includes @Mavrikant's commits. :)
Just finished a run over the XML dumps and only found 1712 labels on 1636 pages. It looks like they were all added by Mavrikant bot. Since we need to find the version of the article that was originally labeled, we'll need to handle the templates as they were before Mavrikant bot came to clean them up. I'll be looking into this.
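For readers following along, the bug described above can be reproduced in a few lines. This is only a sketch of the pattern, not the actual extractor code; both function names are hypothetical.

```python
def is_project_template_buggy(name):
    # Bug: the left side is lower-cased, but the literal on the right
    # still contains capitals, so the comparison can never be True.
    return name.lower() == "VikiProje"


def is_project_template(name):
    # Fix: normalise both sides the same way before comparing.
    return name.strip().lower() == "vikiproje"


print(is_project_template_buggy("VikiProje"))
print(is_project_template("VikiProje"))
```

With the buggy version every template is rejected, which is consistent with an extraction run that finds zero instances.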
As an example, this revision has the following quality templates:
{{VikiProje Türkiye |sınıf=B |önem=En }} {{VikiProje GM|sınıf=B| önem = En }} {{VikiProje 10K | sınıf = B | önem = En }}
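To make the older template form concrete, here is a rough regex-based sketch that pulls the project, class and importance out of wikitext like the example above. It is a simplification written for this thread, not the wikiclass extractor: the function name and the returned dictionary keys are assumptions, and nested templates or HTML comments are not handled.

```python
import re

# Matches {{VikiProje <optional project name> | params... }}.
# The project name is part of the template name in the older form.
TEMPLATE_RE = re.compile(
    r"\{\{\s*VikiProje\b([^{}|]*)\|([^{}]*)\}\}", re.IGNORECASE
)


def extract_labels(wikitext):
    """Return one dict per VikiProje template found in the wikitext."""
    labels = []
    for match in TEMPLATE_RE.finditer(wikitext):
        project = match.group(1).strip()
        params = {}
        for part in match.group(2).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                # Parameter names appear as both "Sınıf" and "sınıf".
                params[key.strip().lower()] = value.strip()
        labels.append({
            # Newer form: the project is given as a "Proje" parameter.
            "project": project or params.get("proje", ""),
            "class": params.get("sınıf"),
            "importance": params.get("önem"),
        })
    return labels


sample = (
    "{{VikiProje Türkiye |sınıf=B |önem=En }} "
    "{{VikiProje GM|sınıf=B| önem = En }} "
    "{{VikiProje 10K | sınıf = B | önem = En }}"
)
for label in extract_labels(sample):
    print(label)
```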
It's good to know that the AI looks at the revision from the time of labeling for training. Most of our labels are outdated.
Some things you may need to know:
$ wc trwiki.observations.first_labelings.20170501.json
  12996  124270 1465684 trwiki.observations.first_labelings.20170501.json
$ cat trwiki.observations.first_labelings.20170501.json | json2tsv page_title | sort | uniq | wc
  10722   25506  200154
$ cat trwiki.observations.first_labelings.20170501.json | json2tsv wp10 | sort | uniq -c
    919 b
   1485 c
    293 km
    272 sm
  10027 taslak
It would be useful to build a model based on balanced observations. The smallest class (sm) has 272 observations, so it looks like we can get a balanced set of 272*5=1360 observations. This should be enough for an initial model. We could then use this model to find some good examples of sm/km predictions that are not yet promoted, to boost the number of those observations.
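The balancing arithmetic can be sketched as follows. This is an illustration only: `balanced_sample` is a made-up helper, the observations are placeholders generated from the per-class counts reported above, and the real pipeline may sample differently.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(observations, label_key="wp10", seed=0):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for obs in observations:
        by_label[obs[label_key]].append(obs)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], n))
    return sample


# Per-class counts from the `uniq -c` output; the records are placeholders.
counts = {"b": 919, "c": 1485, "km": 293, "sm": 272, "taslak": 10027}
observations = [
    {"wp10": label} for label, count in counts.items() for _ in range(count)
]

sample = balanced_sample(observations)
print(len(sample), Counter(obs["wp10"] for obs in sample))
```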
Just started extracting features. I should have a baseline model ready in a few hours.
cat datasets/trwiki.labeling_revisions.w_cache.2k.json | \
  revscoring cv_train \
    revscoring.scorer_models.GradientBoosting \
    wikiclass.feature_lists.trwiki.wp10 \
    wp10 \
    --version 0.5.0 \
    -p 'max_depth=5' \
    -p 'learning_rate=0.01' \
    -p 'max_features="log2"' \
    -p 'n_estimators=300' \
    -s 'table' -s 'accuracy' -s 'roc' -s 'f1' \
    --balance-sample \
    --center --scale > \
  models/trwiki.wp10.gradient_boosting.model
2017-06-15 20:58:28,448 INFO:revscoring.utilities.cv_train -- Cross-validating model statistics for 10 folds...
2017-06-15 20:58:28,489 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 1...
2017-06-15 20:58:28,493 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 2...
2017-06-15 20:58:28,496 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 3...
2017-06-15 20:58:28,499 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 4...
2017-06-15 20:58:28,502 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 5...
2017-06-15 20:58:28,504 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 6...
2017-06-15 20:58:28,507 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 7...
2017-06-15 20:58:28,509 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 8...
2017-06-15 20:58:28,512 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 9...
2017-06-15 20:58:28,519 INFO:revscoring.scorer_models.sklearn_classifier -- Performing cross-validation 10...
2017-06-15 20:58:32,303 INFO:revscoring.utilities.cv_train -- Training model on all data...
ScikitLearnClassifier
 - type: GradientBoosting
 - params: min_samples_leaf=1, learning_rate=0.01, max_features="log2", random_state=null, max_depth=5, center=true, n_estimators=300, balanced_sample_weight=false, init=null, min_samples_split=2, presort="auto", warm_start=false, scale=true, min_weight_fraction_leaf=0.0, loss="deviance", max_leaf_nodes=null, verbose=0, balanced_sample=true, subsample=1.0
 - version: 0.5.0
 - trained: 2017-06-15T20:58:35.694308

Table:
          ~b    ~c    ~km    ~sm    ~taslak
------  ----  ----  -----  -----  ---------
b        126    65     39     34          4
c         43   168     25      7         23
km        17    12    203     31          4
sm        11     3     33    218          4
taslak     6    22      2      0        238

Accuracy: 0.712

ROC-AUC:
--------  -----
'b'       0.847
'c'       0.893
'km'      0.93
'sm'      0.96
'taslak'  0.983
--------  -----

F1:
------  -----
b       0.529
km      0.711
sm      0.777
c       0.621
taslak  0.876
------  -----
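As a sanity check on the output above, the reported accuracy can be recomputed from the table. This sketch assumes rows are the actual labels and the `~` columns are the predicted labels, which is my reading of the output rather than something stated in it; the helper names are made up.

```python
LABELS = ["b", "c", "km", "sm", "taslak"]

# Counts copied from the first model's table, in row order.
TABLE = [
    [126, 65, 39, 34, 4],
    [43, 168, 25, 7, 23],
    [17, 12, 203, 31, 4],
    [11, 3, 33, 218, 4],
    [6, 22, 2, 0, 238],
]


def accuracy(table):
    """Share of observations on the diagonal (predicted == actual)."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def per_class_recall(labels, table):
    """Diagonal count divided by row total, per label (rows assumed actual)."""
    return {
        label: table[i][i] / sum(table[i]) for i, label in enumerate(labels)
    }


print(round(accuracy(TABLE), 3))  # 0.712, matching the reported accuracy
print(per_class_recall(LABELS, TABLE))
```

The reported F1 values may not be reproducible from the pooled table in the same way if they were averaged across folds, so the sketch does not try to.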
Pull request incoming.
I've got the model deployed to our WMFLabs service. See https://ores.wmflabs.org/v3/scores/trwiki/18724885/wp10 for a score of the most recent version of https://tr.wikipedia.org/wiki/DNA
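For anyone who wants to script against this, here is a minimal sketch for building the score URL and reading a prediction out of the response. The URL pattern is taken from the link above. The nesting of the JSON (`wiki -> "scores" -> rev_id -> model -> "score"`) is my assumption about the v3 response layout and should be checked against a real response; both function names are hypothetical.

```python
ORES_BASE = "https://ores.wmflabs.org/v3/scores"


def build_score_url(wiki, rev_id, model):
    """Build a URL of the same shape as the link above."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def extract_prediction(response, wiki, rev_id, model):
    """Pull the predicted class out of a decoded JSON response.

    Assumes the layout wiki -> "scores" -> rev_id -> model -> "score".
    Returns None if the expected keys are missing (for example when the
    service returned an error document instead of a score).
    """
    try:
        score = response[wiki]["scores"][str(rev_id)][model]["score"]
        return score["prediction"]
    except (KeyError, TypeError):
        return None


print(build_score_url("trwiki", 18724885, "wp10"))
```

Fetching is left to whichever HTTP client you prefer; pass the decoded JSON to `extract_prediction`.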
This is ready for some experimentation.
I talked to @Mavrikant on IRC and he told me that a whole class was missing from the data, so I'm now rebuilding the model. New PR coming soon.
Here's the newest version of the model:
ScikitLearnClassifier
 - type: GradientBoosting
 - params: warm_start=false, subsample=1.0, init=null, loss="deviance", n_estimators=300, balanced_sample=true, balanced_sample_weight=false, learning_rate=0.01, min_weight_fraction_leaf=0.0, presort="auto", center=true, min_samples_split=2, scale=true, max_features="log2", min_samples_leaf=1, verbose=0, max_depth=5, max_leaf_nodes=null, random_state=null
 - version: 0.5.0
 - trained: 2017-06-19T16:03:11.379319

Table:
            ~b    ~baslagıç    ~c    ~km    ~sm    ~taslak
--------  ----  -----------  ----  -----  -----  ---------
b          133           16    49     42     29          2
baslagıç     5          149    54      6      0         52
c           27           58   145     21      7          9
km          15            5    16    197     32          2
sm           6            2     3     28    227          3
taslak       2           51     9      1      0        205

Accuracy: 0.657

ROC-AUC:
----------  -----
'b'         0.863
'baslagıç'  0.895
'c'         0.867
'km'        0.94
'sm'        0.969
'taslak'    0.967
----------  -----

F1:
--------  -----
km        0.699
taslak    0.757
baslagıç  0.541
b         0.58
c         0.532
sm        0.804
--------  -----