
Generate dump of scored-revisions from 2018-2020 for English Wikipedia
Closed, Resolved · Public

Description

I'm working on a project with @Suriname0 to help people audit ORES. We're looking for a dataset of revisions saved to English Wikipedia between 2018 and 2020 with their associated ORES predictions. I believe that there is a table in HDFS containing this data.

Is there a place where we can already download the contents of this table? Or is there a way we could access the table from Toolforge? If not, we'd like to have someone generate a dump of the table between Jan 2018 and Jan 2021 so that we can use it for research/audits.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Halfak updated the task description.

This looks super interesting. When the data is out there, I'd love to have it posted for a potential internship or GSoC project.

Thanks to @elukey for chatting with me on IRC and prompting me to provide additional info. @Halfak may be able to provide more targeted info about the HDFS table, but I can tell you what data we need.

Currently, this data release is blocking the ongoing development of the Toolforge tool ORES-Inspect (read more on the research page for this tool). We have an existing dataset of the 55.5 million enwiki rev_ids between 2018-01-01 and 2019-01-01 in all namespaces (with derived columns added on top of what exists in the replica tables). What we need are the ORES edit quality predictions for these 55.5 million rev_ids. While we could retrieve these via the ORES API (and we have already retrieved 32 million of them), the ORES model has changed between 2018 and now: what we need are the historical predictions from when the edit actually happened. As the purpose of this tool is to inspect historical ORES predictions and the community response to the corresponding edits, these historical predictions are essential to the goal of the interface. HDFS contains this data: the cached predictions generated back in 2018 by whatever version of the ORES model was running when each revision was saved.
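For concreteness, this is roughly how we've been pulling predictions from the ORES API so far; it's a minimal sketch, and it only ever returns scores from the model version deployed today, which is exactly why we still need the historical dump:

```python
# Minimal sketch of pulling *current* ORES predictions for a small batch of
# rev_ids. This returns scores from whatever model version is deployed now,
# not the historical predictions this task asks for.
import requests

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

def fetch_current_scores(rev_ids, models="damaging|goodfaith"):
    """Return {rev_id: {model: prediction}} for a batch of revisions."""
    params = {"models": models, "revids": "|".join(str(r) for r in rev_ids)}
    resp = requests.get(ORES_URL, params=params, timeout=30)
    resp.raise_for_status()
    scores = resp.json()["enwiki"]["scores"]
    return {
        rev_id: {
            model: result.get("score", {}).get("prediction")
            for model, result in per_model.items()
        }
        for rev_id, per_model in scores.items()
    }
```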

So, the basic data we need has just three columns/keys: rev_id, prediction_timestamp, ores_damaging_prediction. While we're currently blocked on most backend development until we can get access to this data, waiting a few weeks is fine; if it takes longer, we'll revise our expectations for the tool based on the data that is available. I'll note that while we only really need the damaging model predictions, the good faith model predictions would be useful as well if they exist in the same table. Similarly, data for 2019 and 2020 (not just 2018) would also be useful. If there are any questions, I'm happy to help.
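To make the ask concrete, the rough shape of query I have in mind is sketched below. The table name, column layout, and partitioning are my guesses about how the cached scores are stored in HDFS, not confirmed details:

```python
# Rough sketch only: the table name, column layout, and partition columns
# below are assumptions about how the cached scores are stored, not confirmed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ores-score-dump").getOrCreate()

dump = spark.sql("""
    SELECT rev_id,
           meta.dt                        AS prediction_timestamp,
           scores['damaging'].prediction  AS ores_damaging_prediction,
           scores['goodfaith'].prediction AS ores_goodfaith_prediction
    FROM event.mediawiki_revision_score   -- assumed table name
    WHERE `database` = 'enwiki'
      AND year = 2018                     -- assumed partition column
""")

dump.write.json("ores_scores_damaging_goodfaith_enwiki_2018")
```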

Thanks @JAllemandou! I don't seem to have permission to access the newly-created file dumps. Please let me know if I missed some documentation for authing against analytics.wikimedia.org! I have access to other files in the one-off folder, so this may only be affecting these ORES dump files.

Arf - My bad - Let me try to fix that :)

Hi @JAllemandou, thanks so much for doing this. Inspecting the 2021 data, the format looks great (and coverage looks good too, with 17M revs in Jan-Mar 2021 and only ~4K multi-requested revs, which sounds right to me; see the quick check sketched after the list below). Two concerns:

  1. I can't actually download the 2019 and 2020 files. I've tried on a few different machines/networks over the last day, but analytics.wikimedia.org terminates the connection before the full file can be retrieved. I'm not sure if this is a problem with the connection settings or with the files themselves. I left wget running for a while and it hit 20 retries before terminating, both last night and this morning. I'm now trying wget -c, but it doesn't seem to be making much progress. Note that I'm /not/ attempting simultaneous downloads from any other *.wikimedia.org site, so it should just be the one open connection. Any thoughts?
  2. I am actually primarily seeking data from 2018. If the 2018 data is easy to generate, that would be great. Regardless, the 2019 and 2020 data is much appreciated.
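For reference, the coverage check mentioned above is essentially the sketch below, assuming the dump is gzipped JSON lines with one object per scored revision and a rev_id field (the field name is my guess about the format):

```python
# Sketch of the coverage check: count distinct rev_ids and how many rev_ids
# appear more than once (i.e. were scored/requested multiple times).
import gzip
import json
from collections import Counter

def summarize_dump(path):
    """Return (distinct rev_ids, rev_ids that appear more than once)."""
    rev_counts = Counter()
    with gzip.open(path, "rt") as f:
        for line in f:
            rev_counts[json.loads(line)["rev_id"]] += 1
    multi = sum(1 for count in rev_counts.values() if count > 1)
    return len(rev_counts), multi

# total, multi = summarize_dump("path/to/ores_scores_dump.json.gz")
```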

Hi @Suriname0,
About 1: I'd prefer to defer to one of our SREs (@Ottomata, @razzi - any ideas?).
About 2: we only have data starting December 4th for 2018, which is why I haven't generated it.

I've tried on a few different machines/networks over the last day, but analytics.wikimedia.org terminates the connection before the full file can be retrieved.

Is it 15 minutes? There is a 15-minute timeout on HTTP connections. However, I'd expect a download to be able to continue with a byte-range request, so I'd expect wget -c to work. If you add --show-progress, does it print anything useful about starting from where it left off?
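For illustration, resuming with a byte-range request looks roughly like the minimal sketch below (plain Python, nothing analytics-specific; this is approximately what wget -c does under the hood):

```python
# Minimal sketch of resuming a partial download with an HTTP Range request:
# ask the server to start sending from the byte offset we already have on disk.
import os
import requests

def resume_download(url, dest):
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # 206 means the server honored the range; 200 means start over.
        mode = "ab" if resp.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```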

@JAllemandou thanks for clarifying! I will go without the 2018 data :)

@Ottomata It seems to be completely stuck at byte 5363371137 of ores_scores_damaging-goodfaith_enwiki_2019_01-12.json.gz.

I left it running overnight with wget -c --tries=inf --no-check-certificate https://analytics.wikimedia.org/published/datasets/one-off/ores/scores_dumps/damaging_goodfaith_enwiki/ores_scores_damaging-goodfaith_enwiki_2019_01-12.json.gz. No dice. If this is some super-weird problem that only affects me, I'll try to get a colleague to download these, but I've tried from two different servers and my local browser, all of which failed.

I was able to download the 2019 data on a Toolforge server (connecting via dev.toolforge.org), which I figured would make things go more smoothly, but the download there has still ground to a halt on the 2020 data.

Milimetric claimed this task.
Milimetric triaged this task as High priority.
Milimetric added a project: Analytics-Kanban.
Milimetric moved this task from Next Up to Done on the Analytics-Kanban board.