Thanks @Halfak for the invite to this task! I am interested in understanding how long-standing editor practices can best be encoded in the editing interface, both to help editors do better work and to let us build robust models from structured editing data that automatically flag outstanding issues for editors to fix. One of the problems with building effective automated content-flaw detection to help Wikipedians is the lack of precise information about historical edits (for example, what exactly was improved in a given edit?).
Feb 24 2022
May 7 2020
I'm curious to know what kind of features the Wikidata topic model API is using. Are they the same features as the original topic model developed for English Wikipedia, or something different?
Mar 1 2020
Please test run your solutions locally. If the code runs and gives the expected results, submit a PR so it can be reviewed; if it doesn't, seek help regarding the error. A screenshot of code doesn't give much context to comment on.
Feb 28 2020
In T246438#5927946, @Chtnnh wrote:
Jul 9 2018
ROC_AUC:
roc_auc (micro=0.943, macro=0.948):
------------------------------------------- -----
Geography.Maps 0.971
Geography.Europe 0.929
Culture.Media 0.951
STEM.Physics 0.975
Geography.Oceania 0.966
STEM.Meteorology 0.987
Culture.Internet culture 0.969
History_And_Society.Military and warfare 0.968
Culture.Performing arts 0.982
STEM.Engineering 0.954
Culture.Language and literature 0.949
STEM.Space 0.987
STEM.Geosciences 0.972
STEM.Technology 0.942
Geography.Landforms 0.987
STEM.Biology 0.956
Culture.Broadcasting 0.973
Culture.Sports 0.977
STEM.Chemistry 0.98
Assistance.Maintenance 0.838
Culture.Visual arts 0.969
Culture.Plastic arts 0.966
History_And_Society.Transportation 0.977
STEM.Mathematics 0.98
Culture.Entertainment 0.971
STEM.Medicine 0.974
STEM.Information science 0.969
STEM.Time 0.973
History_And_Society.Education 0.969
History_And_Society.Politics and government 0.941
Culture.Food and drink 0.975
Assistance.Contents systems 0.95
History_And_Society.Business and economics 0.948
Assistance.Article improvement and grading 0.684
Geography.Countries 0.893
History_And_Society.History and society 0.868
Culture.Philosophy and religion 0.936
Assistance.Files 0.773
STEM.Science 0.935
Geography.Cities 0.969
Culture.Crafts and hobbies 0.965
Culture.Arts 0.985
Geography.Bodies of water 0.987
Jul 1 2018
May 5 2018
counts (n=84480):
label n TP FN FP TN
--------------------------------------------- ----- --- ----- ---- ---- -----
'STEM.Mathematics' 1454 --> 938 516 98 82928
'Assistance.Files' 350 --> 28 322 111 84019
'Culture.Food and drink' 2264 --> 1559 705 156 82060
'STEM.Biology' 3134 --> 1772 1362 266 81080
'History_And_Society.Business and economics' 6075 --> 2993 3082 834 77571
'Assistance.Contents systems' 1953 --> 686 1267 142 82385
'Culture.Language and literature' 19588 --> 14199 5389 2390 62502
'Culture.Media' 2039 --> 596 1443 261 82180
'Culture.Philosophy and religion' 3840 --> 1693 2147 451 80189
'STEM.Physics' 2376 --> 1259 1117 360 81744
'STEM.Chemistry' 2083 --> 1287 796 265 82132
'History_And_Society.Military and warfare' 3921 --> 2453 1468 392 80167
'Geography.Europe' 15349 --> 8930 6419 2580 66551
'History_And_Society.Education' 2633 --> 1603 1030 252 81595
'Geography.Landforms' 2148 --> 1710 438 139 82193
'Assistance.Article improvement and grading' 67 --> 16 51 3082 81331
'Culture.Plastic arts' 3717 --> 2116 1601 404 80359
'STEM.Space' 2117 --> 1731 386 102 82261
'Geography.Maps' 2421 --> 1370 1051 69 81990
'Culture.Performing arts' 4180 --> 3313 867 389 79911
'Geography.Cities' 791 --> 493 298 111 83578
'Culture.Broadcasting' 2807 --> 1586 1221 434 81239
'STEM.Engineering' 2133 --> 768 1365 267 82080
'Assistance.Maintenance' 5028 --> 1112 3916 244 79208
'History_And_Society.History and society' 7010 --> 1371 5639 520 76950
'STEM.Time' 2216 --> 1520 696 102 82162
'Culture.Sports' 4844 --> 3970 874 369 79267
'Culture.Crafts and hobbies' 1988 --> 1138 850 64 82428
'STEM.Information science' 2037 --> 1148 889 117 82326
'History_And_Society.Politics and government' 4047 --> 1572 2475 508 79925
'History_And_Society.Transportation' 3680 --> 2508 1172 341 80459
'Culture.Arts' 1999 --> 1488 511 101 82380
'Geography.Countries' 24068 --> 14352 9716 4136 56276
'Geography.Bodies of water' 2232 --> 1732 500 154 82094
'STEM.Meteorology' 1753 --> 1360 393 72 82655
'Geography.Oceania' 4025 --> 2479 1546 213 80242
'STEM.Medicine' 1951 --> 1116 835 266 82263
'Culture.Visual arts' 4563 --> 2594 1969 544 79373
'STEM.Science' 2133 --> 545 1588 160 82187
'Culture.Internet culture' 1839 --> 922 917 222 82419
'STEM.Technology' 3825 --> 1330 2495 597 80058
'Culture.Entertainment' 5529 --> 3597 1932 577 78374
'STEM.Geosciences' 1987 --> 1183 804 125 82368
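Per-label precision and recall can be derived from the rows of the counts table. The sketch below assumes the four numbers after the arrow are TP, FN, FP, TN in that order, which is the ordering consistent with n = TP + FN in every row; `Counts`, `parse_row` and `precision_recall` are hypothetical helper names, not part of revscoring.

```python
from typing import NamedTuple, Tuple


class Counts(NamedTuple):
    n: int   # observations that truly carry the label
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives


def parse_row(row: str) -> Tuple[str, Counts]:
    """Parse one row such as "'STEM.Space' 2117 --> 1731 386 102 82261"."""
    left, right = row.split("-->")
    # The label may contain spaces, so split off only the trailing count.
    label, n = left.strip().rsplit(" ", 1)
    tp, fn, fp, tn = (int(x) for x in right.split())
    return label.strip("'"), Counts(int(n), tp, fn, fp, tn)


def precision_recall(c: Counts) -> Tuple[float, float]:
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    return precision, recall


label, counts = parse_row("'STEM.Space' 2117 --> 1731 386 102 82261")
# Sanity check on the assumed column order: TP + FN must equal n.
assert counts.tp + counts.fn == counts.n
precision, recall = precision_recall(counts)
print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```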
pr_auc (micro=0.761, macro=0.724):
------------------------------------------- -----
Culture.Arts 0.911
Culture.Internet culture 0.685
Culture.Language and literature 0.871
Culture.Performing arts 0.912
History_And_Society.Transportation 0.858
Assistance.Files 0.042
STEM.Science 0.498
STEM.Medicine 0.743
Culture.Crafts and hobbies 0.813
History_And_Society.Military and warfare 0.812
STEM.Technology 0.56
STEM.Meteorology 0.919
Assistance.Maintenance 0.458
Culture.Philosophy and religion 0.633
STEM.Engineering 0.578
Culture.Entertainment 0.84
History_And_Society.Business and economics 0.7
Geography.Landforms 0.927
STEM.Biology 0.748
Assistance.Contents systems 0.611
Geography.Maps 0.835
STEM.Geosciences 0.8
History_And_Society.Education 0.777
Geography.Bodies of water 0.914
STEM.Mathematics 0.845
History_And_Society.Politics and government 0.615
Geography.Europe 0.763
STEM.Physics 0.717
Assistance.Article improvement and grading 0.004
STEM.Space 0.938
History_And_Society.History and society 0.486
Geography.Oceania 0.838
Geography.Countries 0.779
STEM.Time 0.86
STEM.Chemistry 0.779
Geography.Cities 0.73
Culture.Food and drink 0.856
Culture.Broadcasting 0.735
STEM.Information science 0.79
Culture.Sports 0.914
Culture.Media 0.497
Culture.Visual arts 0.776
Culture.Plastic arts 0.774
------------------------------------------- -----
May 4 2018
Looks like an issue with wordvectors returning [[0]] for an empty string '' instead of the usual null vector of shape (300,).
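One way to guard against this is sketched below in plain Python, with a toy lookup table standing in for the real keyed vectors. The names `vectorize_words` and `VECTOR_DIM` are hypothetical, and averaging the word vectors is an assumption made for illustration; this is not the fix that went into revscoring.

```python
from typing import Dict, List, Sequence

VECTOR_DIM = 300  # dimensionality mentioned in the comment above


def vectorize_words(words: Sequence[str],
                    vectors: Dict[str, List[float]],
                    dim: int = VECTOR_DIM) -> List[float]:
    """Average the vectors of known words; fall back to a zero vector."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # Empty input (or only unknown words): always return a flat
        # zero vector of length `dim`, never a nested [[0]].
        return [0.0] * dim
    # zip(*known) iterates over the vectors column by column.
    return [sum(column) / len(known) for column in zip(*known)]


toy_vectors = {"wiki": [1.0] * VECTOR_DIM, "edit": [3.0] * VECTOR_DIM}
empty = vectorize_words("".split(), toy_vectors)
mean = vectorize_words("wiki edit".split(), toy_vectors)
```

With this shape guarantee, downstream aggregation sees the same vector length whether or not the text was empty.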
May 3 2018
Apr 17 2018
Apr 2 2018
Mar 22 2018
Mar 21 2018
In T190288#4068904, @Halfak wrote: I wonder if you could figure out where the hangup is happening by adding "--debug" to the tune utility call.
Mar 20 2018
Yeah, we'll need scipy >= 0.18.1, but I see that revscoring already pins scipy as: scipy >= 0.13.3, < 1.0.999
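A quick check of how the two constraints combine, in plain Python with a hypothetical `parse` helper rather than a packaging library: 0.18.1 falls inside the existing range, so the current pin allows it, but the lower bound would have to be raised to actually require it.

```python
def parse(version: str) -> tuple:
    """Turn '0.18.1' into (0, 18, 1) for simple numeric comparison."""
    return tuple(int(part) for part in version.split("."))


current_lower, current_upper = parse("0.13.3"), parse("1.0.999")
needed_lower = parse("0.18.1")

# The needed version is installable under the current pin...
allowed_today = current_lower <= needed_lower < current_upper
# ...but older versions are still accepted as well.
still_allows_older = current_lower < needed_lower
# When both constraints must hold, the stricter lower bound wins.
combined_lower = max(current_lower, needed_lower)
```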
Mar 16 2018
Mar 15 2018
The recommended order for review should be - 18, 20, 19
Final resolution done by using a wrapper function - https://github.com/wiki-ai/revscoring/pull/394
Mar 13 2018
In T189364#4047619, @Halfak wrote: I made a demo of this problem to try to see if I could reproduce it in isolation. See https://github.com/halfak/demo_shared_memory
TL;DR: it didn't work. I get the exact same output for both strategies!
@Ragesoss there's ongoing work on topic modeling for English Wikipedia using WikiProject topics as the basis. If the Education Program Dashboard has a similar categorization of articles around pre-defined topics, a similar model can be built to predict topics as well as recommend them. Let me know if you want to talk more about it.
In T188892#4023246, @Paarmita wrote: @Jayprakash12345 Could I take up this?
In T189364#4046883, @awight wrote: @Sumit please link to the code changes you're making that seem to improve memory sharing.
Refer to the gist in the first comment for the code changes that make it multiprocessing friendly.
Mar 10 2018
Test code for benchmarking using word2vec as an external module contained in english_vectors:
from multiprocessing import Pool, cpu_count
import functools
from revscoring.dependencies import solve
from revscoring.datasources.meta import vectorizers
from revscoring.features.meta import aggregators
from revscoring.languages import english
from revscoring.languages.english_vectors import google_news_kvs
from revscoring.datasources import revision_oriented
Test code for benchmarking vectorizers with a global keyed_vector in the vectorizers file( https://gist.github.com/codez266/bde0d2384ef1cda0e105b8f59d25524a#file-vectors_only_once-py-L21 ):
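The gist itself is not reproduced here. As a rough illustration of the load-once idea only (this is not the gist's code, and every name below is hypothetical), a module-level object can be loaded lazily and then reused by each call in a worker process. A tiny dict stands in for the real keyed vectors.

```python
from multiprocessing import Pool
from typing import Dict, List, Optional

# Hypothetical stand-in for the large keyed-vector object. It is loaded
# at most once per process and reused for every later call.
_KEYED_VECTORS: Optional[Dict[str, List[float]]] = None


def get_keyed_vectors() -> Dict[str, List[float]]:
    """Return the module-level vectors, loading them on first use."""
    global _KEYED_VECTORS
    if _KEYED_VECTORS is None:
        # A real loader would read the word2vec binary here.
        _KEYED_VECTORS = {"wiki": [1.0, 0.0], "edit": [0.0, 1.0]}
    return _KEYED_VECTORS


def vectorize(text: str) -> List[List[float]]:
    """Look up a vector for each known word in `text`."""
    kv = get_keyed_vectors()
    return [kv[w] for w in text.split() if w in kv]


def benchmark(texts: List[str], workers: int = 2) -> List[List[List[float]]]:
    """Vectorize `texts` in a process pool."""
    # Load in the parent first so that, with the "fork" start method,
    # workers inherit the object instead of loading it again.
    get_keyed_vectors()
    with Pool(workers) as pool:
        return pool.map(vectorize, texts)
```

Calling `benchmark(["wiki edit", "edit"])` from a script's `if __name__ == "__main__":` block exercises the pool. Under the "spawn" start method nothing is inherited, so each worker would load its own copy on first use.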
Mar 8 2018
With the wordvectors blockers now cleared, building the drafttopic model on ores-stat-01.
Feb 27 2018
In T187217#4001947, @awight wrote: Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/
@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?
Feb 13 2018
Feb 5 2018
Jan 29 2018
Jan 22 2018
A common use case of fetch_text is augmenting the dataset with X info from Y API. This will address:
- fetching edits - currently supported by revscoring
- fetching text - currently required by wikiclass, drafttopic and draftquality for getting article text
- fetch_item_info - currently required by wikiclass for fetching item info from Wikidata.
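The shared pattern behind these three cases can be sketched as a single generic step that takes observations, calls some fetcher, and attaches the result under a new field. The function `augment`, the stub fetcher, and the sample data are all hypothetical and only illustrate the shape of the idea; they are not an existing revscoring utility.

```python
import json
from typing import Callable, Dict, Iterable, Iterator


def augment(observations: Iterable[Dict],
            fetch: Callable[[Dict], object],
            field: str) -> Iterator[Dict]:
    """Yield a copy of each observation with `field` set to fetch(ob)."""
    for ob in observations:
        enriched = dict(ob)  # do not mutate the caller's observation
        enriched[field] = fetch(ob)
        yield enriched


# Stub fetcher standing in for a real API call (made-up data).
FAKE_TEXTS = {101: "Some article text.", 102: "More article text."}


def fake_fetch_text(ob: Dict) -> str:
    return FAKE_TEXTS.get(ob["rev_id"], "")


lines = ['{"rev_id": 101, "label": "B"}', '{"rev_id": 102, "label": "C"}']
augmented = list(augment(map(json.loads, lines), fake_fetch_text, "text"))
```

Swapping `fake_fetch_text` for a fetcher that returns edits, article text, or Wikidata item info would cover each bullet without changing the loop.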
Jan 17 2018
The binary *was* on ores-misc-01, which is now nuked. I'll upload it again from my system to ores-staging-01, from where it can be put somewhere public.
Jan 16 2018
I've taken a backup of the tuning reports and of the GradientBoosting and RandomForest models.
Dec 22 2017
Dec 20 2017
Dec 11 2017
Nov 28 2017
We now have a dataset at figshare - https://doi.org/10.6084/m9.figshare.5640526.v1 \o/
In T179311#3793141, @Halfak wrote: @Sumit, please move to the "done" column before closing tasks. We need this in order to consistently report what has been "done".
In T179311#3723831, @Halfak wrote: Looks like we don't include the top level category names yet. @Sumit said he'd like to do that in a separate PR.
Nov 22 2017
Nov 21 2017
Nov 6 2017
Nov 4 2017
Nov 3 2017
Could free up 2.2G more...
Removed 800MB of my stuff which included cached models and datasets.
