Description

For certain analysis tasks we'll need an easily accessible dataset of ORES goodfaith and damaging judgements (an SQL dump with revision / damaging / goodfaith columns, presumably). @Halfak says the ores Python package can be used to produce it.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Bencemac | T210224 Revert FlaggedRevs changes on the Hungarian Wikipedia |
| Open | | None | T223892 [EPIC] Support Hungarian Wikipedia editor retention project |
| Resolved | | Tgr | T209224 Analyze effect of huwiki FlaggedRevs configuration change on problematic edits and new user retention |
| Resolved | | ACraze | T228078 Retrain damaging/goodfaith models for huwiki |
| Resolved | | Halfak | T223882 Re-label huwiki damaging and badfaith edits |
| Open | | None | T223899 Information about finished campaigns should be accessible in Wikilabels |
| Open | | None | T325925 Analyze effect of huwiki FlaggedRevs configuration change on problematic edits and new user retention (round2) |
| Open | | None | T223900 Create ORES dataset for huwiki edits in the last two years or so |
Event Timeline
@Halfak says the ores Python package can be used to produce it.
It only seems to have utilities to fetch a given revision ID though, not all revision IDs in a time range, so using the action API seems simpler. That would mean paging through
```
action: query
generator: allrevisions
garvprop:
garvlimit: max
garvnamespace: *
garvstart: 2019-07-01T00:00:00.000Z
garvend: 2017-07-01T00:00:00.000Z
prop: revisions
rvprop: oresscores
```
with a continuation-aware client.
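A minimal sketch of that loop, using plain requests rather than a dedicated client (the output file name is chosen to match the score_revisions command below, and ids is added to rvprop as an assumption so that each output line carries its rev_id):

```python
# Sketch only: page through allrevisions via the standard action API
# continuation protocol and write one JSON object per line (json-lines).
import json
import requests

API_URL = "https://hu.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "format": "json",
    "generator": "allrevisions",
    "garvprop": "",
    "garvlimit": "max",
    "garvnamespace": "*",
    "garvstart": "2019-07-01T00:00:00.000Z",
    "garvend": "2017-07-01T00:00:00.000Z",
    "prop": "revisions",
    "rvprop": "ids|oresscores",  # "ids" added so each record includes its revid
}

with open("huwiki_2017-2019_revisions.json", "w") as out:
    continuation = {}
    while True:
        data = requests.get(API_URL, params={**PARAMS, **continuation}).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                out.write(json.dumps(rev) + "\n")
        if "continue" not in data:
            break
        # The API hands back the parameters needed to fetch the next batch.
        continuation = data["continue"]
```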
My usual process is to generate a random sample of revisions using Quarry, and then to score those revisions using the ORES API.
That query will work, but I don't believe there will be any oresscores for revisions that are no longer present in the recentchanges table (which only covers roughly the last 30 days).
I would load the "json-lines" output of this query into the ores score_revisions utility as follows.
```
cat huwiki_2017-2019_revisions.json | \
  ores score_revisions https://ores.wikipedia.org "youremail@domain.com" huwiki damaging goodfaith > \
  huwiki_2017-2019_revisions.scored.json
```
The ORES score_revisions utility will handle batching and parallelization in a way that maximizes performance while minimizing the load on the ORES service.
On Toolforge that command gets OOM-killed after consuming around 1G of memory, if I read the logs right:
```
Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021409] ores invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
...
Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021463] memory: usage 1048576kB, limit 1048576kB, failcnt 14613
...
Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021465] Memory cgroup stats for /user.slice/user-2355.slice: cache:0KB rss:1042516KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:174236KB active_anon:868280KB inactive_file:0KB active_file:0KB unevictable:0KB
...
Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.086951] oom_reaper: reaped process 24615 (ores), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
Maybe it's trying to store the whole JSON output in memory instead of dumping it sequentially?
Hmm. It's likely the input dataset that is the problem. Do you think we could work with a random sample of revisions instead of the whole huge data file? Alternatively, I could run the same script on a beefier machine that could handle the memory usage.
I can just split the file into chunks. It seems like something worth fixing in the ores utility in the long run, though - there are a couple of Python libraries, such as JSONAutoArray or jsonstreams, that allow streaming JSON output.
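By chunking I mean roughly this (a sketch; the file name follows the command above, and the 500,000-lines-per-chunk figure is an arbitrary guess meant to keep each score_revisions run within the memory limit):

```python
# Split the json-lines input into fixed-size chunks so that each chunk can be
# piped through `ores score_revisions` separately and the outputs concatenated.
CHUNK_SIZE = 500_000  # lines per chunk; tune to the available memory

out = None
with open("huwiki_2017-2019_revisions.json") as infile:
    for line_number, line in enumerate(infile):
        if line_number % CHUNK_SIZE == 0:
            if out is not None:
                out.close()
            chunk_index = line_number // CHUNK_SIZE
            out = open(f"huwiki_2017-2019_revisions.{chunk_index:03d}.json", "w")
        out.write(line)
if out is not None:
    out.close()
```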
SQL dump at https://people.wikimedia.org/~tgr/huwiki_2017-2019_revisions_ores.sql.7z (40M, 300M unzipped).
Probably worth redoing once T228078: Retrain damaging/goodfaith models for huwiki is done.
SQL dump with the new model at https://people.wikimedia.org/~tgr/huwiki_2017-2019_revisions_ores_v2.sql.7z
@Halfak is there a way to recover the threshold settings for the old (pre-T228078) model? I'd like to check how much the number of likely-damaging etc. anonymous edits has changed due to the update.
Oh good question. You'd need historical scores for it too, right? I wonder if @Groceryheist already has this data as he has been generating historical ORES scores and scanning for historical threshold settings.
In general that would be a nice dataset to make accessible. In this case I already have a dump of the old data for 2017-18 though, I just didn't realize at the time that the thresholds are not directly included in the MediaWiki configuration and so I didn't save those.
Got it. I'll let @Groceryheist respond because he's probably already done this digging, but if he doesn't have it, I can help out.
Hi Tgr. I'm working on this! I should be able to send the thresholds over in the next day or so. This is very much a research project, so buyer beware! You can check out my code at https://github.com/groceryheist/ores_bias_project/blob/master/ores_archaeologist.py
Hey @Tgr. Here's my best guess at what the historical thresholds were and when they changed. Missing values indicate that no threshold was set for a given class of edit. These numbers are the result of a fairly complicated process based on parsing old configuration files and loading old versions of the models, and I'm still troubleshooting some aspects of it, so I would really appreciate it if you could let me know whether this looks right to you. Thanks!
Thanks @Groceryheist!
So the CSV has:

```
2019-09-03  damaging_likelygood      0.061
2019-09-03  goodfaith_verylikelybad  0.131
2019-09-03  damaging_maybebad        0.073
2019-09-03  goodfaith_likelybad      0.131
2018-07-17  damaging_likelygood      0.048
2018-07-17  goodfaith_verylikelybad  0.007
2018-07-17  damaging_maybebad        0.097
2018-07-17  goodfaith_likelybad      0.007
2018-05-09  damaging_likelygood      0.048
2018-05-09  goodfaith_verylikelybad  0.007
2018-05-09  damaging_maybebad        0.097
2018-05-09  goodfaith_likelybad      0.007
```
(note the last two blocks are identical, I presume the tool is mistakenly detecting a change where there wasn't any)
Per the summary of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/536732, the current settings are:

```
Damaging:  maybebad 0.073, likelybad 0.767, verylikelybad 0.851, likelygood 0.05
Goodfaith: maybebad 0.924, likelybad 0.247, verylikelybad 0.169, likelygood 0.949
```
which seems completely different (and has twice as many values). Spot-checking the first one: damaging_likelygood is maximum recall @ precision >= 0.999, and https://ores.wikimedia.org/v3/scores/huwiki?models=damaging&model_info=statistics.thresholds.false."maximum%20recall%20@%20precision%20>=%200.999" gives threshold: 0.95. That value has to be flipped since it is defined on the false class (1 - x), so the raw threshold is 0.05, just like the commit summary says... am I doing something wrong?
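For reference, that flip can be reproduced against the ORES v3 API roughly like this (a sketch; the nesting of the response and the one-element list under thresholds.false are assumptions from memory):

```python
# Fetch the "maximum recall @ precision >= 0.999" optimization on the *false*
# side of the huwiki damaging model and flip it into a raw score threshold.
import requests

model_info = 'statistics.thresholds.false."maximum recall @ precision >= 0.999"'
data = requests.get(
    "https://ores.wikimedia.org/v3/scores/huwiki",
    params={"models": "damaging", "model_info": model_info},
).json()

thresholds = data["huwiki"]["models"]["damaging"]["statistics"]["thresholds"]
false_threshold = thresholds["false"][0]["threshold"]  # ~0.95 per the query above
raw_threshold = 1 - false_threshold                    # ~0.05, matching the commit summary
print(false_threshold, raw_threshold)
```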
Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to the assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)