
Create ORES dataset for huwiki edits in the last two years or so
Open, Needs Triage, Public

Description

For certain analysis tasks we'll need an easily accessible dataset of ORES goodfaith and damaging judgements (presumably an SQL dump with revision / damaging / goodfaith columns). @Halfak says the ores Python package can be used to produce it.

Event Timeline

Tgr created this task. May 20 2019, 11:50 AM
Restricted Application added a subscriber: Aklapper. May 20 2019, 11:50 AM
Tgr moved this task from Backlog to Huwiki on the User-Tgr board. May 20 2019, 11:53 AM
Hsync7 added a subscriber: Hsync7. May 24 2019, 7:02 AM
Tgr added a comment. Sat, Jun 29, 7:28 PM

@Halfak says the ores Python package can be used to produce it.

It only seems to have utilities to fetch a given revision ID though, not all revision IDs in a time range, so using the action API seems simpler. That would mean paging through

action: query
generator: allrevisions
garvprop:
garvlimit: max
garvnamespace: *
garvstart: 2019-07-01T00:00:00.000Z
garvend: 2017-07-01T00:00:00.000Z
prop: revisions
rvprop: oresscores

with a continuation-aware client.
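
A rough sketch of such a loop (Python with the requests library; the output file name and User-Agent are placeholders, and "ids" is added to rvprop here so each revision ID lands in the output alongside its scores):

import json
import requests

API = "https://hu.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "format": "json",
    "generator": "allrevisions",
    "garvlimit": "max",
    "garvnamespace": "*",
    "garvstart": "2019-07-01T00:00:00.000Z",
    "garvend": "2017-07-01T00:00:00.000Z",
    "prop": "revisions",
    "rvprop": "ids|oresscores",
}

session = requests.Session()
session.headers["User-Agent"] = "huwiki-ores-dataset (youremail@domain.com)"

with open("huwiki_2017-2019_revisions.json", "w") as out:
    continuation = {}
    while True:
        data = session.get(API, params={**PARAMS, **continuation}).json()
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                # One JSON object per line ("json-lines"), one revision each.
                # oresscores may be missing for older revisions (see below).
                out.write(json.dumps({"rev_id": rev["revid"],
                                      "oresscores": rev.get("oresscores")}) + "\n")
        if "continue" not in data:
            break
        continuation = data["continue"]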

Halfak added a comment. Mon, Jul 1, 2:14 PM

My usual process is to generate a random sample of revisions using quarry, and then to score those revisions using the ORES API.

That query will work, but I don't believe there will be any oresscores for revisions that are no longer present in the recentchanges table (i.e. anything older than about 30 days).
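
For reference, scoring revision IDs directly against the ORES API looks roughly like this sketch (Python with the requests library; the User-Agent and the example revision IDs are placeholders, and the response layout shown is the v3 scores format):

import requests

ORES = "https://ores.wikimedia.org/v3/scores/huwiki/"

def score_batch(rev_ids, models="damaging|goodfaith"):
    """Score a batch of revision IDs against the ORES v3 scores endpoint."""
    response = requests.get(
        ORES,
        params={"models": models, "revids": "|".join(map(str, rev_ids))},
        headers={"User-Agent": "huwiki-ores-dataset (youremail@domain.com)"},
    )
    response.raise_for_status()
    # v3 responses are keyed by wiki, then by revision ID, then by model.
    return response.json()["huwiki"]["scores"]

# Arbitrary example revision IDs, just to show the call shape.
print(score_batch([21500000, 21500001]))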

I would load the "json-lines" output of this query into the ores score_revisions utility as follows.

cat huwiki_2017-2019_revisions.json | \
  ores score_revisions https://ores.wikimedia.org "youremail@domain.com" huwiki damaging goodfaith > \
  huwiki_2017-2019_revisions.scored.json

The ORES score_revisions utility will handle batching and parallelization in a way that maximizes performance while minimizing the load on the ORES service.

Tgr added a comment. Mon, Jul 8, 10:30 PM

On Toolforge that command gets OOM-killed after consuming around 1G of memory, if I read the logs right:

Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021409] ores invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021463] memory: usage 1048576kB, limit 1048576kB, failcnt 14613
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.021465] Memory cgroup stats for /user.slice/user-2355.slice: cache:0KB rss:1042516KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:174236KB active_anon:868280KB inactive_file:0KB active_file:0KB unevictable:0KB
...
Jul  8 13:07:24 tools-sgebastion-07 kernel: [3545133.086951] oom_reaper: reaped process 24615 (ores), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Maybe it's trying to store the whole JSON output in memory instead of dumping it sequentially?

Halfak added a comment.

Hmm. It's likely the input dataset that is the problem. Do you think we could work with a random sample of revisions instead of the whole huge data file? Alternatively, I could run the same script on a beefier machine that could handle the memory usage.

Tgr added a comment. Sun, Jul 14, 6:50 PM

I can just split the file into chunks. It seems like something worth fixing in the ores utility in the long run, though; there are a couple of Python libraries, like JSONAutoArray or jsonstreams, that allow streaming JSON output.
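
A minimal sketch of that chunking approach (Python standard library only; the chunk size and the .chunkNNN naming are arbitrary, and the input file name assumes the json-lines dump produced above):

import itertools

def split_json_lines(path, chunk_size=100000):
    """Split a json-lines file into numbered chunk files of at most chunk_size lines."""
    with open(path) as infile:
        for index in itertools.count():
            chunk = list(itertools.islice(infile, chunk_size))
            if not chunk:
                break
            with open(f"{path}.chunk{index:03d}", "w") as outfile:
                outfile.writelines(chunk)

split_json_lines("huwiki_2017-2019_revisions.json")

Each chunk file can then be piped through ores score_revisions separately and the scored outputs concatenated afterwards.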