Review request for data export
Closed, Resolved · Public

Description

Per Data Access Guidelines we need the approval of Analytics and Security before exporting data from WMF servers/cluster. We have a request for data export. See below for more.

@Groceryheist's access to the servers expires on 2020-09-30. They need to scp data out of the servers/cluster and have submitted a request to export the following items. Can you please review and approve or raise questions?

Note: I understand the request is last minute. Let's do what we can to accommodate. If they lose access and you allow me to export, I can do that and send it their way.

  1. readingtime_nonsensitive contains code (including jupyter notebooks) and plots only (no data) from the reading time analysis I did in 2018.
  2. ores_project_code contains code from the ores project I worked on with halfak in 2019 and 2020.
  3. ores_project_bias_analysis contains code and data for an analysis of calibration and balance of the ores models. All the data in this sub-project is public and is entirely derived from wmf.mediawiki_page_history, wmf.mediawiki_history, wikidata parquet files located at /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202 and /usr/joal/wmf/data/wmf/wikidata/item_page_link/20191202, and human-labeled edits from wikilabels.
  4. ores_project_data contains an additional full dataset and a stratified sample. Not all inputs to this data are public. Specifically, ORES scores from event.mediawiki_revision_score are joined with data from wmf.mediawiki_page_history and wmf.mediawiki_history. I think releasing revision scores is low risk, as they can be obtained from the ORES API or generated by running publicly available source code. In fact, initially in this project I was generating scores from ORES models running locally. This approach was bug-prone, which is why I switched to using the event.mediawiki_revision_score table. Publicly available information obtained from SAL and Wikimedia open source code repositories was also used to select date ranges and identify times that ORES models were deployed.
  5. mw_revert_tool_detector is a source-code repository for a tool I began developing to attempt to identify tools used in reverting edits.
  6. ores_bias_plots contains a handful of plots that I would like to export.

Event Timeline

leila created this task. Sep 30 2020, 11:55 PM
Restricted Application added a subscriber: Aklapper. Sep 30 2020, 11:55 PM

No objection from me. The ORES data referred to in this ticket is not sensitive or PII.

Nuria added a subscriber: Nuria. Oct 1 2020, 6:16 PM

I think these should be fine to export, agreed.

Great! Can I get some help transferring this data?

razzi assigned this task to Ottomata. Oct 15 2020, 4:00 PM
razzi moved this task from Incoming to Operational Excellence on the Analytics board.

Hiya @Groceryheist are you still on IRC? Ping me sometime and I'll help you get the data.

@MoritzMuehlenhoff, would it be ok to temporarily re-add @Groceryheist's access while he copies out some data? If not, I'll somehow get it to him manually.

> @MoritzMuehlenhoff, would it be ok to temporarily re-add @Groceryheist's access while he copies out some data?

There's no current active MOU, so that would be quite some overhead (renewing with Legal etc).

> If not, I'll somehow get it to him manually.

That sounds better.

Ottomata closed this task as Resolved. Oct 19 2020, 10:03 PM

I temporarily copied the nathante_wmf_export.tar.gz file, which @Groceryheist prepared before he left, to analytics.wikimedia.org/published/datasets/one-off, and he downloaded it from there. Closing this task.