Simplify and modularize the Makefile template
Closed, Resolved, Public

Event Timeline

Halfak created this task. Mar 28 2018, 10:07 PM
Restricted Application added a project: artificial-intelligence. Mar 28 2018, 10:07 PM
Restricted Application added a subscriber: Aklapper.
Halfak added a subscriber: awight. Mar 28 2018, 10:09 PM

@awight, I implemented what we talked about because I got blocked on a weird corner of the current template for huwiki and wanted to just get this fixed.

Halfak claimed this task. Apr 2 2018, 4:44 PM

Can we shelve this in the backlog?

Halfak removed Halfak as the assignee of this task. Oct 12 2018, 4:08 PM

Yeah. Was looking into it and it'll take some work to revive it. It doesn't seem pressing now.

Ladsgroup triaged this task as Medium priority. Nov 28 2018, 6:38 AM
Ladsgroup raised the priority of this task from Medium to Needs Triage.
Ladsgroup moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.
Ladsgroup triaged this task as Low priority. Dec 5 2018, 2:31 PM

Example configuration:

name: huwiki
label: Hungarian Wikipedia
host: hu.wikipedia.org

external_samples:
  sampled_revisions.40k_2016:
    quarry_url: "http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true"
  human_labeled_revisions.raw.5k_2016:
    labeling_campaign: "https://labels.wmflabs.org/campaigns/huwiki/12/"

autolabeled_samples:
  trusted_edits: 1000
  trusted_groups:
    - sysop
    - oversight
    - trusted
    - bot
    - rollbacker
    - checkuser
    - abusefilter
    - bureaucrat
    - editor
    - templateeditor
    - interface-editor
  labeled_samples:
    autolabeled_revisions.40k_2016: sampled_revisions.40k_2016

balanced_5k_samples:
  revisions_for_review.5k_2016: autolabeled_revisions.40k_2016

merged_samples:
  labeled_revisions.40k_2016:
    - autolabeled_revisions.40k_2016
    - human_labeled_revisions.5k_2016

extracted_samples:
  labeled_revisions.w_cache.40k_2016:
    sample: labeled_revisions.20k_2016
    features_for: [damaging, goodfaith]


models:
  damaging:
    observations: labeled_revisions.w_cache.40k_2016
    label: damaging
    pop_rate_true: 0.01
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
  goodfaith:
    observations: labeled_revisions.w_cache.40k_2016
    label: goodfaith
    pop_rate_true: 0.99
    tune: true
    cv_train:
      algorithm: GradientBoosting
      parameters:
        max_depth: 7
        learning_rate: 0.01
        max_features: log2
        n_estimators: 700
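
The sample names in a configuration like this one refer to each other, so a quick consistency check before generating the Makefile may be useful. The following is a minimal Python sketch, not part of the actual tooling: the function name `find_undefined_samples` and the trimmed-down dictionary are assumptions for illustration, and the meaning of each section is inferred from the example above.

```python
def find_undefined_samples(config):
    """Return sample names that are referenced but never defined in config."""
    defined = set(config.get("external_samples", {}))
    referenced = set()

    # Autolabeled and balanced samples map an output name to one input name.
    labeled = config.get("autolabeled_samples", {}).get("labeled_samples", {})
    balanced = config.get("balanced_5k_samples", {})
    for section in (labeled, balanced):
        for output_name, input_name in section.items():
            defined.add(output_name)
            referenced.add(input_name)

    # Merged samples map an output name to a list of input names.
    for output_name, input_names in config.get("merged_samples", {}).items():
        defined.add(output_name)
        referenced.update(input_names)

    # Extracted samples name their input under the "sample" key.
    for output_name, spec in config.get("extracted_samples", {}).items():
        defined.add(output_name)
        referenced.add(spec["sample"])

    # Models read from the dataset named under "observations".
    for spec in config.get("models", {}).values():
        referenced.add(spec["observations"])

    return sorted(referenced - defined)


# Trimmed copy of the example configuration (only the name references).
example = {
    "external_samples": {
        "sampled_revisions.40k_2016": {},
        "human_labeled_revisions.raw.5k_2016": {},
    },
    "autolabeled_samples": {
        "labeled_samples": {
            "autolabeled_revisions.40k_2016": "sampled_revisions.40k_2016",
        },
    },
    "balanced_5k_samples": {
        "revisions_for_review.5k_2016": "autolabeled_revisions.40k_2016",
    },
    "merged_samples": {
        "labeled_revisions.40k_2016": [
            "autolabeled_revisions.40k_2016",
            "human_labeled_revisions.5k_2016",
        ],
    },
    "extracted_samples": {
        "labeled_revisions.w_cache.40k_2016": {
            "sample": "labeled_revisions.20k_2016",
        },
    },
    "models": {
        "damaging": {"observations": "labeled_revisions.w_cache.40k_2016"},
        "goodfaith": {"observations": "labeled_revisions.w_cache.40k_2016"},
    },
}

undefined = find_undefined_samples(example)
print(undefined)
```

For the example configuration this reports `human_labeled_revisions.5k_2016` and `labeled_revisions.20k_2016`. Neither is necessarily an error: such names may be provided elsewhere, for example by rules in the included Makefile.manual.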

Example output:

$ ./utility generate_make --config test-config
# This file is built automatically using cg.py file and Makefile.j2
# Any change you make on this file will be lost in the next run.

# Remove target files after command failure.
.DELETE_ON_ERROR:

models: \
	huwiki_models

tuning_reports: \
	huwiki_tuning_reports

touch:
	touch datasets/*
	touch models/*

include Makefile.manual


############################# Hungarian Wikipedia ################################
datasets/huwiki.human_labeled_revisions.raw.5k_2016.json:
	./utility fetch_labels \
	  https://labels.wmflabs.org/campaigns/huwiki/12/ > $@
datasets/huwiki.sampled_revisions.40k_2016.json:
	wget -qO- http://quarry.wmflabs.org/run/79645/output/0/json-lines?download=true > $@

datasets/huwiki.autolabeled_revisions.40k_2016.json: \
	datasets/huwiki.sampled_revisions.40k_2016.json
	cat $< | \
	./utility autolabel --host=https://hu.wikipedia.org \
	  --trusted-groups=sysop,oversight,trusted,bot,rollbacker,checkuser,abusefilter,bureaucrat,editor,templateeditor,interface-editor \
	  --trusted-edits=1000 \
	  --revert-radius=3 \
	  --revert-window=48 \
	  --verbose > $@

datasets/huwiki.revisions_for_review.5k_2016.json: \
	datasets/huwiki.autolabeled_revisions.40k_2016.json
	( \
	  cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep '"needs_review": (true|"True") | \
	  shuf -n 2500; \
	  cat datasets/huwiki.autolabeled_revisions.40k_2016.json | grep '"needs_review": (false|"False") | \
	  shuf -n 2500 \
	) | shuf > $@

datasets/huwiki.labeled_revisions.40k_2016.json: \
	datasets/huwiki.autolabeled_revisions.40k_2016.json \
	datasets/huwiki.human_labeled_revisions.5k_2016.json ./utility merge_labels $^ > $@

datasets/huwiki.labeled_revisions.w_cache.40k_2016.json: \
	datasets/huwiki.labeled_revisions.20k_2016.json
	revscoring extract \
	  editquality.feature_lists.huwiki.damaging \
	  editquality.feature_lists.huwiki.goodfaith \
	  --host https://hu.wikipedia.org \
	  --extractors $(max_extractors) \
	  --verbose > $@

tuning_reports/huwiki.damaging.md: \
	datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring tune \
	  config/classifiers.params.yaml \
	  editquality.feature_lists.huwiki.damaging \
	  damaging \
	  roc_auc.labels.true \
	  --label-weight $(damaging_label_weight) \
	  --pop-rate "true=0.01" \
	  --pop-rate "false=0.99" \
	  --center --scale \
	  --cv-timeout 60 \
	  --debug > $@

models/huwiki.damaging.gradient_boosting.model: \
	datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring cv_train
	  damaging \
	  --version=$(damaging_major_minor). \
	  -p 'learning_rate=0.01' \
	  -p 'max_depth=7' \
	  -p 'max_features="log2"' \
	  -p 'n_estimators=700' \
	  --label-weight $(damaging_label_weight) \
	  --pop-rate "true=0.01" \
	  --pop-rate "false=0.99" \
	  --center --scale > $@

	revscoring model_info $@ > model_info/huwiki.damaging.md

tuning_reports/huwiki.goodfaith.md: \
	datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring tune \
	  config/classifiers.params.yaml \
	  editquality.feature_lists.huwiki.goodfaith \
	  goodfaith \
	  roc_auc.labels.true \
	  --label-weight $(goodfaith_label_weight) \
	  --pop-rate "true=0.99" \
	  --pop-rate "false=0.010000000000000009" \
	  --center --scale \
	  --cv-timeout 60 \
	  --debug > $@

models/huwiki.goodfaith.gradient_boosting.model: \
	datasets/huwiki.labeled_revisions.w_cache.40k_2016.json
	cat $< | \
	revscoring cv_train
	  goodfaith \
	  --version=$(goodfaith_major_minor). \
	  -p 'learning_rate=0.01' \
	  -p 'max_depth=7' \
	  -p 'max_features="log2"' \
	  -p 'n_estimators=700' \
	  --label-weight $(goodfaith_label_weight) \
	  --pop-rate "true=0.99" \
	  --pop-rate "false=0.010000000000000009" \
	  --center --scale > $@

	revscoring model_info $@ > model_info/huwiki.goodfaith.md


huwiki_models: \
	models/huwiki.goodfaith.gradient_boosting.model \
	models/huwiki.damaging.gradient_boosting.model

huwiki_tuning_reports: \
	tuning_reports/huwiki.goodfaith.md \
	tuning_reports/huwiki.damaging.md
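
The goodfaith rules in the output above carry `--pop-rate "false=0.010000000000000009"`, which is what `1 - 0.99` evaluates to with Python floats, so the false rate was presumably derived as the complement of `pop_rate_true`. One way a generator could keep the short form is decimal arithmetic. The sketch below is an illustration under that assumption, not the code in cg.py, and the helper name `pop_rate_flags` is hypothetical.

```python
from decimal import Decimal


def pop_rate_flags(pop_rate_true):
    """Build --pop-rate flags whose true and false rates sum to exactly 1."""
    # Going through str() keeps the short literal (0.99) rather than the
    # full binary expansion of the float.
    rate_true = Decimal(str(pop_rate_true))
    rate_false = Decimal(1) - rate_true
    return [
        '--pop-rate "true={}"'.format(rate_true),
        '--pop-rate "false={}"'.format(rate_false),
    ]


# Plain float subtraction reproduces the long value seen in the output.
print(1 - 0.99)

# Decimal arithmetic yields the short complement instead.
for flag in pop_rate_flags(0.99):
    print(flag)
```

Rounding the float result to a fixed number of digits would also work; the decimal route simply avoids having to choose a precision.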

Halfak claimed this task. Feb 13 2019, 8:37 PM
Halfak moved this task from Active to Review on the Scoring-platform-team (Current) board.

@Ladsgroup, I'd love to have your notes on this one.

This is now finished and tested. See notes in the PR for suggestions on how to review.

awight removed a subscriber: awight. Mar 21 2019, 4:02 PM
Halfak closed this task as Resolved. Jun 18 2019, 1:39 PM