See https://gist.github.com/halfak/f00ea4efb2b158fe3d1924b3dfb10d58 for the goal config.
Description
Event Timeline
Restricted Application added a project: artificial-intelligence. Mar 28 2018, 10:07 PM (UTC+0)
@awight, I implemented what we talked about because I got blocked on a weird corner of the current template for huwiki and wanted to just get this fixed.
awight moved this task from Review to Parked on the Machine-Learning-Team (Active Tasks) board. Apr 4 2018, 9:49 PM (UTC+0)
Yeah. Was looking into it and it'll take some work to revive it. It doesn't seem pressing now.
Example configuration:
name: huwiki
label: Hungarian Wikipedia
host: hu.wikipedia.org

external_samples:
  sampled_revisions.40k_2016:
    quarry_url: "http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true"
  human_labeled_revisions.raw.5k_2016:
    labeling_campaign: "https://labels.wmflabs.org/campaigns/huwiki/12/"

autolabeled_samples:
  trusted_edits: 1000
  trusted_groups:
    - sysop
    - oversight
    - trusted
    - bot
    - rollbacker
    - checkuser
    - abusefilter
    - bureaucrat
    - editor
    - templateeditor
    - interface-editor
  labeled_samples:
    autolabeled_revisions.40k_2016: sampled_revisions.40k_2016

balanced_5k_samples:
  revisions_for_review.5k_2016: autolabeled_revisions.40k_2016

merged_samples:
  labeled_revisions.40k_2016:
    - autolabeled_revisions.40k_2016
    - human_labeled_revisions.5k_2016

extracted_samples:
  labeled_revisions.w_cache.40k_2016:
    sample: labeled_revisions.20k_2016
    features_for: [damaging, goodfaith]


models:
  damaging:
    observations: labeled_revisions.w_cache.40k_2016
    label: damaging
    pop_rate_true: 0.01
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
  goodfaith:
    observations: labeled_revisions.w_cache.40k_2016
    label: goodfaith
    pop_rate_true: 0.99
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
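How a `merged_samples` entry could turn into a Makefile rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual generator: the helper names `dataset_path` and `merge_rule` are hypothetical, the `datasets/<wiki>.<sample>.json` naming is inferred from the example output, and plain string formatting stands in for whatever templating the real tool uses.

```python
def dataset_path(wiki, sample):
    """Build a dataset file name using the pattern seen in the example output."""
    return "datasets/{0}.{1}.json".format(wiki, sample)


def merge_rule(wiki, target, sources):
    """Render one Makefile rule that merges several labeled samples.

    Hypothetical helper for illustration; assumes `sources` is non-empty.
    """
    header = dataset_path(wiki, target) + ": \\\n"
    # Prerequisites are joined with line continuations, one per line.
    prereqs = " \\\n".join("    " + dataset_path(wiki, s) for s in sources)
    # A Makefile recipe line must start with a tab character.
    recipe = "\n\t./utility merge_labels $^ > $@"
    return header + prereqs + recipe


# A slice of the example configuration, written as a plain dict.
config = {
    "name": "huwiki",
    "merged_samples": {
        "labeled_revisions.40k_2016": [
            "autolabeled_revisions.40k_2016",
            "human_labeled_revisions.5k_2016",
        ],
    },
}

rules = [
    merge_rule(config["name"], target, sources)
    for target, sources in config["merged_samples"].items()
]
print("\n\n".join(rules))
```

Keeping the path construction in one helper means the target of one rule and the prerequisite of another cannot drift apart in spelling, which matters because Make matches them by exact file name.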
Example output:
$ ./utility generate_make --config test-config
# This file is built automatically using cg.py file and Makefile.j2
# Any change you make on this file will be lost in the next run.

# Remove target files after command failure.
.DELETE_ON_ERROR:

models: \
        huwiki_models

tuning_reports: \
        huwiki_tuning_reports

touch:
    touch datasets/*
    touch models/*

include Makefile.manual


############################# Hungarian Wikipedia ################################
datasets/huwiki.human_labeled_revisions.raw.5k_2016.json:
    ./utility fetch_labels \
        https://labels.wmflabs.org/campaigns/huwiki/12/ > $@
datasets/huwiki.sampled_revisions.40k_2016.json:
    wget -qO- http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true > $@

datasets/huwiki.autolabeled_revisions.40k_2016.json: \
        datasets/huwiki.sampled_revisions.40k_2016.json
    cat $< | \
    ./utility autolabel --host=https://hu.wikipedia.org \
        --trusted-groups=sysop,oversight,trusted,bot,rollbacker,checkuser,abusefilter,bureaucrat,editor,templateeditor,interface-editor \
        --trusted-edits=1000 \
        --revert-radius=3 \
        --revert-window=48 \
        --verbose > $@

datasets/huwiki.revisions_for_review.5k_2016.json: \
        datasets/huwiki.autolabeled_revisions.40k_2016.json
    ( \
        cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep '"needs_review": (true|"True") | \
        shuf -n 2500; \
        cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep '"needs_review": (false|"False") | \
        shuf -n 2500 \
    ) | shuf > $@

datasets/huwiki.labeled_revisions.40k_2016.json: \
        datasets/huwiki.autolabeled_revisions.40k_2016.json \
        datasets/huwiki.human_labeled_revisions.5k_2016.json ./utility merge_labels $^ > $@

datasets/huwiki.labeled_revisions.w_cache.40k_2016.json: \
        datasets/huwiki.labeled_revisions.20k_2016.json
    revscoring extract \
        editquality.feature_lists.huwiki.damaging \
        editquality.feature_lists.huwiki.goodfaith \
        --host https://hu.wikipedia.org \
        --extractors $(max_extractors) \
        --verbose > $@

tuning_reports/huwiki.damaging.md: \
        datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
    cat $< | \
    revscoring tune \
        config/classifiers.params.yaml \
        editquality.feature_lists.huwiki.damaging \
        damaging \
        roc_auc.labels.true \
        --label-weight $(damaging_label_weight) \
        --pop-rate "true=0.01" \
        --pop-rate "false=0.99" \
        --center --scale \
        --cv-timeout 60 \
        --debug > $@

models/huwiki.damaging.gradient_boosting.model: \
        datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
    cat $< | \
    revscoring cv_train
        damaging \
        --version=$(damaging_major_minor). \
        -p 'learning_rate=0.01' \
        -p 'max_depth=7' \
        -p 'max_features="log2"' \
        -p 'n_estimators=700' \
        --label-weight $(damaging_label_weight) \
        --pop-rate "true=0.01" \
        --pop-rate "false=0.99" \
        --center --scale > $@

    revscoring model_info $@ > model_info/huwiki.damaging.md

tuning_reports/huwiki.goodfaith.md: \
        datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
    cat $< | \
    revscoring tune \
        config/classifiers.params.yaml \
        editquality.feature_lists.huwiki.goodfaith \
        goodfaith \
        roc_auc.labels.true \
        --label-weight $(goodfaith_label_weight) \
        --pop-rate "true=0.99" \
        --pop-rate "false=0.010000000000000009" \
        --center --scale \
        --cv-timeout 60 \
        --debug > $@

models/huwiki.goodfaith.gradient_boosting.model: \
        datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
    cat $< | \
    revscoring cv_train
        goodfaith \
        --version=$(goodfaith_major_minor). \
        -p 'learning_rate=0.01' \
        -p 'max_depth=7' \
        -p 'max_features="log2"' \
        -p 'n_estimators=700' \
        --label-weight $(goodfaith_label_weight) \
        --pop-rate "true=0.99" \
        --pop-rate "false=0.010000000000000009" \
        --center --scale > $@

    revscoring model_info $@ > model_info/huwiki.goodfaith.md


huwiki_models: \
        models/huwiki.goodfaith.gradient_boosting.model \
        models/huwiki.damaging.gradient_boosting.model

huwiki_tuning_reports: \
        tuning_reports/huwiki.goodfaith.md \
        tuning_reports/huwiki.damaging.md
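The `false=0.010000000000000009` value in the goodfaith rules looks like a binary floating-point artifact of computing `1 - 0.99` from `pop_rate_true`. The sketch below reproduces that value and shows one way to avoid it. It assumes the complement is derived by plain subtraction, which the output suggests but does not confirm; the function name `pop_rate_flags` and the rounding precision are illustrative choices, not part of the generator.

```python
def pop_rate_flags(pop_rate_true, digits=10):
    """Return the two --pop-rate arguments for a boolean label.

    Hypothetical helper: rounding the complement hides the noise that
    binary floating-point subtraction can introduce.
    """
    pop_rate_false = round(1 - pop_rate_true, digits)
    return [
        '--pop-rate "true={0}"'.format(pop_rate_true),
        '--pop-rate "false={0}"'.format(pop_rate_false),
    ]


# Plain subtraction reproduces the value seen in the example output.
print(1 - 0.99)

# Rounded complements for the two models in the example configuration.
print(pop_rate_flags(0.01))  # damaging
print(pop_rate_flags(0.99))  # goodfaith
```

Rounding changes the emitted value only far beyond the precision of the configured rate, so it should not meaningfully affect the model, but it keeps the generated Makefile readable and stable across platforms.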
This is now finished and tested. See notes in the PR for suggestions on how to review.
Halfak moved this task from Review to Pending deployment on the Machine-Learning-Team (Active Tasks) board. Mar 22 2019, 4:28 PM (UTC+0)
Halfak moved this task from Pending deployment to Completed on the Machine-Learning-Team (Active Tasks) board. Apr 23 2019, 3:06 PM (UTC+0)