Simplify and modularize the Makefile template
Closed, Resolved, Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

@awight, I implemented what we talked about because I got blocked on a weird corner of the current template for huwiki and wanted to just get this fixed.

Can we shelve this in the backlog?

Yeah. Was looking into it and it'll take some work to revive it. It doesn't seem pressing now.

Ladsgroup triaged this task as Medium priority. (Nov 28 2018, 6:38 AM)
Ladsgroup raised the priority of this task from Medium to Needs Triage.
Ladsgroup moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

Example configuration:

name: huwiki
label: Hungarian Wikipedia
host: hu.wikipedia.org

external_samples:
  sampled_revisions.40k_2016:
    quarry_url: "http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true"
  human_labeled_revisions.raw.5k_2016:
    labeling_campaign: "https://labels.wmflabs.org/campaigns/huwiki/12/"

autolabeled_samples:
  trusted_edits: 1000
  trusted_groups:
    - sysop
    - oversight
    - trusted
    - bot
    - rollbacker
    - checkuser
    - abusefilter
    - bureaucrat
    - editor
    - templateeditor
    - interface-editor
  labeled_samples:
    autolabeled_revisions.40k_2016: sampled_revisions.40k_2016

balanced_5k_samples:
  revisions_for_review.5k_2016: autolabeled_revisions.40k_2016

merged_samples:
  labeled_revisions.40k_2016:
    - autolabeled_revisions.40k_2016
    - human_labeled_revisions.5k_2016

extracted_samples:
  labeled_revisions.w_cache.40k_2016:
    sample: labeled_revisions.20k_2016
    features_for: [damaging, goodfaith]


models:
  damaging:
    observations: labeled_revisions.w_cache.40k_2016
    label: damaging
    pop_rate_true: 0.01
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
  goodfaith:
    observations: labeled_revisions.w_cache.40k_2016
    label: goodfaith
    pop_rate_true: 0.99
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
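
To make the mapping from configuration to Makefile rules concrete, here is a minimal Python sketch of how a `merged_samples` entry could be turned into a make rule. It is not the actual `cg.py` / `Makefile.j2` code: the helper names `dataset_path` and `render_merge_rule` are hypothetical, the config is inlined as a dict rather than loaded from YAML, and only the `./utility merge_labels` command and the `datasets/<wiki>.<sample>.json` naming are taken from this task.

```python
# Hypothetical sketch only: this is NOT the real cg.py / Makefile.j2 logic.
# It shows one way a `merged_samples` entry could become a make rule.


def dataset_path(wiki, sample):
    """Build a dataset path following the datasets/<wiki>.<sample>.json pattern."""
    return "datasets/{0}.{1}.json".format(wiki, sample)


def render_merge_rule(wiki, target, sources):
    """Render a single make rule that merges several labeled samples."""
    deps = [dataset_path(wiki, source) for source in sources]
    lines = [dataset_path(wiki, target) + ": \\"]
    # Every prerequisite except the last one needs a line continuation.
    for dep in deps[:-1]:
        lines.append("\t\t" + dep + " \\")
    lines.append("\t\t" + deps[-1])
    # The recipe goes on its own tab-indented line.
    # $^ expands to all prerequisites and $@ to the target.
    lines.append("\t./utility merge_labels $^ > $@")
    return "\n".join(lines) + "\n"


# Inlined stand-in for the relevant part of the YAML configuration.
config = {
    "name": "huwiki",
    "merged_samples": {
        "labeled_revisions.40k_2016": [
            "autolabeled_revisions.40k_2016",
            "human_labeled_revisions.5k_2016",
        ],
    },
}

rules = [
    render_merge_rule(config["name"], target, sources)
    for target, sources in config["merged_samples"].items()
]
print("\n".join(rules))
```

Keeping the recipe on its own tab-indented line matters here: if it ends up on the same line as the last prerequisite, make treats the command words as additional prerequisites.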

Example output:

$ ./utility generate_make --config test-config
# This file is built automatically using cg.py file and Makefile.j2
# Any change you make on this file will be lost in the next run.

# Remove target files after command failure.
.DELETE_ON_ERROR:

models: \
		huwiki_models

tuning_reports: \
		huwiki_tuning_reports

touch:
	touch datasets/*
	touch models/*

include Makefile.manual


############################# Hungarian Wikipedia ################################
datasets/huwiki.human_labeled_revisions.raw.5k_2016.json:
	./utility fetch_labels \
	  https://labels.wmflabs.org/campaigns/huwiki/12/ > $@
datasets/huwiki.sampled_revisions.40k_2016.json:
	wget -qO- http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true > $@

datasets/huwiki.autolabeled_revisions.40k_2016.json: \
		datasets/huwiki.sampled_revisions.40k_2016.json
	cat $< | \
	./utility autolabel --host=https://hu.wikipedia.org \
	  --trusted-groups=sysop,oversight,trusted,bot,rollbacker,checkuser,abusefilter,bureaucrat,editor,templateeditor,interface-editor \
	  --trusted-edits=1000 \
	  --revert-radius=3 \
	  --revert-window=48 \
	  --verbose > $@

datasets/huwiki.revisions_for_review.5k_2016.json: \
		datasets/huwiki.autolabeled_revisions.40k_2016.json
	( \
	  cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep -E '"needs_review": (true|"True")' | \
	  shuf -n 2500; \
	  cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep -E '"needs_review": (false|"False")' | \
	  shuf -n 2500 \
	) | shuf > $@

datasets/huwiki.labeled_revisions.40k_2016.json: \
		datasets/huwiki.autolabeled_revisions.40k_2016.json \
		datasets/huwiki.human_labeled_revisions.5k_2016.json
	./utility merge_labels $^ > $@

datasets/huwiki.labeled_revisions.w_cache.40k_2016.json: \
		datasets/huwiki.labeled_revisions.20k_2016.json
	revscoring extract \
	  editquality.feature_lists.huwiki.damaging \
	  editquality.feature_lists.huwiki.goodfaith \
	  --host https://hu.wikipedia.org \
	  --extractors $(max_extractors) \
	  --verbose > $@

tuning_reports/huwiki.damaging.md: \
		datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring tune \
	  config/classifiers.params.yaml \
	  editquality.feature_lists.huwiki.damaging \
	  damaging \
	  roc_auc.labels.true \
	  --label-weight $(damaging_label_weight) \
	  --pop-rate "true=0.01" \
	  --pop-rate "false=0.99" \
	  --center --scale \
	  --cv-timeout 60 \
	  --debug > $@

models/huwiki.damaging.gradient_boosting.model: \
		datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring cv_train \
	  damaging \
	  --version=$(damaging_major_minor). \
	  -p 'learning_rate=0.01' \
	  -p 'max_depth=7' \
	  -p 'max_features="log2"' \
	  -p 'n_estimators=700' \
	  --label-weight $(damaging_label_weight) \
	  --pop-rate "true=0.01" \
	  --pop-rate "false=0.99" \
	  --center --scale > $@

	revscoring model_info $@ > model_info/huwiki.damaging.md

tuning_reports/huwiki.goodfaith.md: \
		datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring tune \
	  config/classifiers.params.yaml \
	  editquality.feature_lists.huwiki.goodfaith \
	  goodfaith \
	  roc_auc.labels.true \
	  --label-weight $(goodfaith_label_weight) \
	  --pop-rate "true=0.99" \
	  --pop-rate "false=0.010000000000000009" \
	  --center --scale \
	  --cv-timeout 60 \
	  --debug > $@

models/huwiki.goodfaith.gradient_boosting.model: \
		datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring cv_train \
	  goodfaith \
	  --version=$(goodfaith_major_minor). \
	  -p 'learning_rate=0.01' \
	  -p 'max_depth=7' \
	  -p 'max_features="log2"' \
	  -p 'n_estimators=700' \
	  --label-weight $(goodfaith_label_weight) \
	  --pop-rate "true=0.99" \
	  --pop-rate "false=0.010000000000000009" \
	  --center --scale > $@

	revscoring model_info $@ > model_info/huwiki.goodfaith.md


huwiki_models: \
		models/huwiki.goodfaith.gradient_boosting.model \
		models/huwiki.damaging.gradient_boosting.model

huwiki_tuning_reports: \
		tuning_reports/huwiki.goodfaith.md \
		tuning_reports/huwiki.damaging.md
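
A side note on the `false=0.010000000000000009` value in the goodfaith rules: it looks like the result of computing `1 - 0.99` with binary floating point. Assuming the generator derives the "false" rate as the complement of `pop_rate_true` (an assumption, the task does not say how it is computed), a small sketch of a decimal-based alternative could look like this. The function name `pop_rates` is hypothetical.

```python
from decimal import Decimal


def pop_rates(pop_rate_true):
    """Return the (true, false) population rates as strings.

    Hypothetical helper: assumes the "false" rate is the complement of
    pop_rate_true. Going through str() and Decimal keeps the short
    decimal form instead of the binary float approximation.
    """
    rate_true = Decimal(str(pop_rate_true))
    rate_false = Decimal(1) - rate_true
    return str(rate_true), str(rate_false)


# Plain float subtraction: the result is not exactly 0.01.
naive_false = 1 - 0.99
print(naive_false)

# Decimal arithmetic keeps the values as written in the config.
print(pop_rates(0.99))
print(pop_rates(0.01))
```

The long value is harmless numerically, so this is purely about keeping the generated Makefile readable and stable across runs.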

@Ladsgroup, I'd love to have your notes on this one.

This is now finished and tested. See notes in the PR for suggestions on how to review.