For certain analysis tasks we'll need an easily accessible dataset of ORES goodfaith and damaging judgements (an SQL dump with revision / damaging / goodfaith, presumably). @Halfak says the ores Python package can be used to produce it.
|Open||None||T223892 [EPIC] Support Hungarian Wikipedia editor retention project|
|Open||None||T209224 Analyze effect of huwiki FlaggedRevs configuration change on problematic edits and new user retention|
|Open||Tgr||T223900 Create ORES dataset for huwiki edits in the last two years or so|
|Open||ACraze||T228078 Retrain damaging/goodfaith models for huwiki|
|Resolved||Halfak||T223882 Re-label huwiki damaging and badfaith edits|
|Open||None||T223899 Information about finished campaigns should be accessible in Wikilabels|
@Halfak says the ores Python package can be used to produce it.
It only seems to have utilities for scoring given revision IDs though, not for listing all revision IDs in a time range, so using the action API seems simpler. That would mean paging through
    action: query
    generator: allrevisions
    garvprop:
    garvlimit: max
    garvnamespace: *
    garvstart: 2019-07-01T00:00:00.000Z
    garvend: 2017-07-01T00:00:00.000Z
    prop: revisions
    rvprop: oresscores
with a continuation-aware client.
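For illustration, a continuation-aware client for that query could look roughly like the sketch below. It is not taken from any existing tool: the endpoint URL, the User-Agent string, the `format`/`formatversion` parameters and the function names are assumptions, and the `fetch` argument exists only so the paging logic can be exercised without network access. Each response's `continue` object is merged into the next request until the API stops returning one.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the huwiki action API endpoint.
API_URL = "https://hu.wikipedia.org/w/api.php"

# Query parameters from the comment above; format/formatversion are added
# here (assumption) so that the response is JSON.
BASE_PARAMS = {
    "action": "query",
    "format": "json",
    "formatversion": "2",
    "generator": "allrevisions",
    "garvprop": "",
    "garvlimit": "max",
    "garvnamespace": "*",
    "garvstart": "2019-07-01T00:00:00.000Z",
    "garvend": "2017-07-01T00:00:00.000Z",
    "prop": "revisions",
    "rvprop": "oresscores",
}


def http_get(params):
    """Send one GET request to the action API and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with real contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "huwiki-ores-dataset-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_batches(params, fetch=http_get):
    """Yield the "query" part of each response, following continuation."""
    continuation = {}
    while True:
        data = fetch({**params, **continuation})
        if "error" in data:
            raise RuntimeError(data["error"])
        yield data.get("query", {})
        if "continue" not in data:
            break  # no more results
        # Pass every key of the "continue" object back in the next request.
        continuation = data["continue"]
```

Note that this sketch yields each batch as returned; it does not merge partial results for the same page across batches.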
My usual process is to generate a random sample of revisions using Quarry, and then to score those revisions using the ORES API.
That query will work, but I don't believe there will be any oresscores for revisions that are no longer present in the recentchanges table (that is, revisions older than 30 days).
I would load the "json-lines" output of this query into the ores score_revisions utility as follows.
    cat huwiki_2017-2019_revisions.json | \
      ores score_revisions https://ores.wikipedia.org "firstname.lastname@example.org" huwiki damaging goodfaith > \
      huwiki_2017-2019_revisions.scored.json
The ORES score_revisions utility will handle batching and parallelization in a way that maximizes performance while minimizing the load on the ORES service.
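To make "batching and parallelization" concrete, here is a generic sketch of the idea. It is not the implementation used by score_revisions: the function names, the batch size and the worker count are assumptions, and `score_batch` is a hypothetical callable standing in for whatever sends one batch of revision IDs to the scoring service.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def score_all(rev_ids, score_batch, batch_size=50, workers=4):
    """Score revision IDs in batches, with at most `workers` batches in flight.

    `score_batch` (hypothetical) takes a list of revision IDs and returns a
    dict mapping each ID to its score.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map queues every batch up front but runs only `workers`
        # of them concurrently; results come back in input order.
        for result in pool.map(score_batch, batched(rev_ids, batch_size)):
            yield from result.items()
```

Capping the number of workers is what keeps the load on the service bounded, while batching reduces the number of requests.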
On Toolforge that command gets OOM-killed after consuming around 1G of memory, if I read the logs right:
    Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021409] ores invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
    ...
    Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021463] memory: usage 1048576kB, limit 1048576kB, failcnt 14613
    ...
    Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.021465] Memory cgroup stats for /user.slice/user-2355.slice: cache:0KB rss:1042516KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:174236KB active_anon:868280KB inactive_file:0KB active_file:0KB unevictable:0KB
    ...
    Jul 8 13:07:24 tools-sgebastion-07 kernel: [3545133.086951] oom_reaper: reaped process 24615 (ores), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Maybe it's trying to store the whole JSON output in memory instead of dumping it sequentially?
Hmm. It's likely the input dataset that is the problem. Do you think we could work with a random sample of revisions instead of the whole huge data file? Alternatively, I could run the same script on a beefier machine that could handle the memory usage.
I can just split the file into chunks. It seems like something worth fixing in the ores utility in the long run, though - there are a couple of Python libraries like JSONAutoArray or jsonstreams that allow streaming JSON output.
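As a rough sketch of the chunking workaround, the json-lines input could be split like this so that each piece can be scored separately. The function name, the chunk size and the output naming scheme are assumptions made for illustration; it also assumes the input really has one JSON document per line.

```python
from itertools import islice
from pathlib import Path


def split_json_lines(src, out_dir, lines_per_chunk=100_000):
    """Split a json-lines file into files of at most `lines_per_chunk` lines.

    Returns the list of chunk paths in order. Only one chunk is held in
    memory at a time.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with src.open("r", encoding="utf-8") as infile:
        index = 0
        while True:
            lines = list(islice(infile, lines_per_chunk))
            if not lines:
                break
            # Hypothetical naming scheme: <input stem>.partNNNN.json
            path = out_dir / f"{src.stem}.part{index:04d}.json"
            path.write_text("".join(lines), encoding="utf-8")
            chunk_paths.append(path)
            index += 1
    return chunk_paths
```

Each resulting chunk can then be piped through the score_revisions command shown earlier, and the scored outputs concatenated afterwards.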
SQL dump at https://people.wikimedia.org/~tgr/huwiki_2017-2019_revisions_ores.sql.7z (40M, 300M unzipped).
Probably worth redoing once T228078: Retrain damaging/goodfaith models for huwiki is done.