
Create ORES dataset for huwiki edits in the last two years or so
Open, Needs Triage, Public

Description

For certain analysis tasks we'll need an easily accessible dataset of ORES goodfaith and damaging judgements (an SQL dump with revision / damaging / goodfaith, presumably). @Halfak says the ores Python package can be used to produce it.

Event Timeline

@Halfak says the ores Python package can be used to produce it.

It only seems to have utilities to fetch a given revision ID though, not all revision IDs in a time range, so using the action API seems simpler. That would mean paging through

action: query
generator: allrevisions
garvprop:
garvlimit: max
garvnamespace: *
garvstart: 2019-07-01T00:00:00.000Z
garvend: 2017-07-01T00:00:00.000Z
prop: revisions
rvprop: oresscores

with a continuation-aware client.
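A minimal sketch of such a continuation-aware loop, assuming the standard action API continuation protocol (the `continue` object of each response is merged into the next request). The function name `iter_query_batches` and the `fetch` callable are hypothetical; `fetch` stands in for whatever HTTP client is used, so the loop itself makes no network calls.

```python
def iter_query_batches(fetch, params):
    """Yield each raw API response for a query, following continuation.

    `fetch` is a caller-supplied (hypothetical) function that takes a dict
    of request parameters and returns the decoded JSON response as a dict.
    """
    cont = {}
    while True:
        # Merge the continuation values from the previous response, if any.
        response = fetch({**params, **cont})
        yield response
        if "continue" not in response:
            break  # no more batches
        cont = response["continue"]


# The parameters listed above, as a dict.
PARAMS = {
    "action": "query",
    "generator": "allrevisions",
    "garvprop": "",
    "garvlimit": "max",
    "garvnamespace": "*",
    "garvstart": "2019-07-01T00:00:00.000Z",
    "garvend": "2017-07-01T00:00:00.000Z",
    "prop": "revisions",
    "rvprop": "oresscores",
}
```

In practice `fetch` could be a thin wrapper that adds `format=json` and sends the request to the wiki's `api.php` endpoint; that wrapper is left out here because it depends on the HTTP library in use.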

My usual process is to generate a random sample of revisions using quarry, and then to score those revisions using the ORES API.

That query will work, but I don't believe there will be any oresscores for revisions that are no longer present in the recentchanges table (that is, revisions older than 30 days).

I would load the "json-lines" output of this query into the ores score_revisions utility as follows.

cat huwiki_2017-2019_revisions.json | \
  ores score_revisions https://ores.wikipedia.org "youremail@domain.com" huwiki damaging goodfaith > \
  huwiki_2017-2019_revisions.scored.json

The ORES score_revisions utility will handle batching and parallelization in a way that maximizes performance while minimizing the load on the ORES service.

On Toolforge that command gets OOM-killed after consuming around 1G of memory, if I read the logs right:

Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021409] ores invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021463] memory: usage 1048576kB, limit 1048576kB, failcnt 14613
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021465] Memory cgroup stats for /user.slice/user-2355.slice: cache:0KB rss:1042516KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:174236KB active_anon:868280KB inactive_file:0KB active_file:0KB unevictable:0KB
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.086951] oom_reaper: reaped process 24615 (ores), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Maybe it's trying to store the whole JSON output in memory instead of dumping it sequentially?

Hmm. It's likely the input dataset that is the problem. Do you think we could work with a random sample of revisions instead of the whole huge data file? Alternatively, I could run the same script on a beefier machine that could handle the memory usage.

I can just split the file into chunks. It seems like something worth fixing in the ores utility in the long run, though: there are a couple of Python libraries, like JSONAutoArray or jsonstreams, that allow streaming JSON output.
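A rough sketch of the chunking workaround, assuming the input is a JSON-lines file with one revision per line. The function name `split_jsonlines`, the `.partNNNN` naming scheme, and the default chunk size are all made up for illustration; each chunk can then be piped through `ores score_revisions` separately.

```python
import itertools


def split_jsonlines(path, lines_per_chunk=100_000):
    """Split a JSON-lines file into numbered chunk files; return their paths.

    Only `lines_per_chunk` lines are held in memory at any time.
    """
    chunk_paths = []
    with open(path, encoding="utf-8") as src:
        for index in itertools.count():
            # Read the next block of lines from where the last one stopped.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = f"{path}.part{index:04d}"
            with open(chunk_path, "w", encoding="utf-8") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

The chunk size that stays under the 1G limit would have to be found by trial, since the memory use per revision is not known here.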

@Halfak is there a way to recover the threshold settings for the old (pre-T228078) model? I'd like to check how much the number of likely-damaging etc. anonymous edits has changed due to the update.

Oh good question. You'd need historical scores for it too, right? I wonder if @Groceryheist already has this data as he has been generating historical ORES scores and scanning for historical threshold settings.

You'd need historical scores for it too, right?

In general that would be a nice dataset to make accessible. In this case I already have a dump of the old data for 2017-18 though, I just didn't realize at the time that the thresholds are not directly included in the MediaWiki configuration and so I didn't save those.

Got it. I'll let @Groceryheist respond because he's probably already done this digging, but if he doesn't have it, I can help out.

Hi Tgr. I'm working on this! I should be able to send the thresholds over in the next day or so. This is very much a research project, so buyer beware! You can check out my code at https://github.com/groceryheist/ores_bias_project/blob/master/ores_archaeologist.py

@Tgr, it sounds like you have the old scores, right?

Hey @Tgr. Here's my best guess at what the historical thresholds were and when they changed. Missing values indicate that no threshold was set for a given class of edit. These numbers are the result of a fairly complicated process based on parsing old configuration files and loading old versions of the models and I'm still troubleshooting some aspects of it. So I will really appreciate it if you can let me know whether this looks right to you. Thanks!

Thanks @Groceryheist!

So the CSV has

2019-09-03    damaging_likelygood        0.061
2019-09-03    goodfaith_verylikelybad    0.131
2019-09-03    damaging_maybebad          0.073
2019-09-03    goodfaith_likelybad        0.131

2018-07-17    damaging_likelygood        0.048
2018-07-17    goodfaith_verylikelybad    0.007
2018-07-17    damaging_maybebad          0.097
2018-07-17    goodfaith_likelybad        0.007

2018-05-09    damaging_likelygood        0.048
2018-05-09    goodfaith_verylikelybad    0.007
2018-05-09    damaging_maybebad          0.097
2018-05-09    goodfaith_likelybad        0.007

(Note that the last two blocks are identical; I presume the tool is mistakenly detecting a change where there wasn't any.)
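One way to filter out such spurious changes, sketched under the assumption that the CSV has been loaded into dated sets of thresholds sorted oldest first. The helper name `collapse_unchanged` is hypothetical and is not part of the ores_archaeologist code; the values are the ones quoted above.

```python
def collapse_unchanged(history):
    """Drop dated entries whose thresholds equal the preceding (older) entry.

    `history` is a list of (date, thresholds_dict) pairs sorted oldest first.
    """
    kept = []
    for date, thresholds in history:
        if kept and kept[-1][1] == thresholds:
            continue  # same settings as the previous date: not a real change
        kept.append((date, thresholds))
    return kept


OLD = {
    "damaging_likelygood": 0.048,
    "goodfaith_verylikelybad": 0.007,
    "damaging_maybebad": 0.097,
    "goodfaith_likelybad": 0.007,
}
NEW = {
    "damaging_likelygood": 0.061,
    "goodfaith_verylikelybad": 0.131,
    "damaging_maybebad": 0.073,
    "goodfaith_likelybad": 0.131,
}
history = [("2018-05-09", dict(OLD)), ("2018-07-17", dict(OLD)), ("2019-09-03", NEW)]

change_dates = [date for date, _ in collapse_unchanged(history)]
# change_dates == ["2018-05-09", "2019-09-03"]
```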

Per the summary of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/536732 the current settings are

Damaging: maybebad 0.073, likelybad 0.767, verylikelybad 0.851, likelygood 0.05
Goodfaith: maybebad 0.924, likelybad 0.247, verylikelybad 0.169, likelygood 0.949

which seems completely different (also twice as many values). Spot-checking the first, damaging_likelygood is maximum recall @ precision >= 0.999, and https://ores.wikimedia.org/v3/scores/huwiki?models=damaging&model_info=statistics.thresholds.false."maximum%20recall%20@%20precision%20>=%200.999" gives threshold: 0.95, which has to be flipped since this is a false (1-x) model, so the raw threshold is 0.05 like the commit summary says... am I doing something wrong?
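The arithmetic of that spot check, written out as a small sketch. It assumes, as described above, that a threshold reported for the `false` outcome is expressed as P(false) and so maps to a raw score of 1 - x; the helper name `raw_threshold` is made up for illustration.

```python
def raw_threshold(reported, outcome):
    """Convert a threshold reported for `outcome` into a raw score cutoff.

    Assumption: thresholds for the False outcome are given as P(false),
    so the raw cutoff is 1 - reported (rounded to three decimals, the
    precision used in the values quoted above).
    """
    return reported if outcome else round(1 - reported, 3)


api_threshold = 0.95      # reported by ORES for the "false" outcome
commit_likelygood = 0.05  # damaging likelygood in the commit summary
csv_likelygood = 0.061    # damaging_likelygood for 2019-09-03 in the CSV

flipped = raw_threshold(api_threshold, outcome=False)
matches_commit = flipped == commit_likelygood  # True
matches_csv = flipped == csv_likelygood        # False
```

So the flipped API value agrees with the commit summary and not with the CSV, which is the discrepancy in question.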