Per the Data Access Guidelines, we need approval from Analytics and Security before exporting data from the WMF servers/cluster. We have received a data export request; details are below.
@Groceryheist 's access to the servers expires on 2020-09-30. They need to scp data out of the servers/cluster and have submitted a request to export the following items. Can you please review and either approve or raise questions?
Note: I understand the request is last-minute. Let's do what we can to accommodate it. If they lose access and you allow me to do the export, I can do that and send the data their way.
- readingtime_nonsensitive contains only code (including Jupyter notebooks) and plots, with no data, from the reading time analysis I did in 2018.
- ores_project_code contains code from the ORES project I worked on with halfak in 2019 and 2020.
- ores_project_bias_analysis contains code and data for an analysis of the calibration and balance of the ORES models. All the data in this sub-project is public and is derived entirely from wmf.mediawiki_page_history, wmf.mediawiki_history, the Wikidata Parquet files located at /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202 and /usr/joal/wmf/data/wmf/wikidata/item_page_link/20191202, and human-labeled edits from Wikilabels.
- ores_project_data contains an additional full dataset and a stratified sample. Not all inputs to this data are public. Specifically, ORES scores from event.mediawiki_revision_score are joined with data from wmf.mediawiki_page_history and wmf.mediawiki_history. I think releasing revision scores is low risk, as they can be obtained from the ORES API or generated by running publicly available source code. In fact, early in this project I generated scores from ORES models running locally. That approach was bug-prone, which is why I switched to the event.mediawiki_revision_score table. Publicly available information from SAL and the Wikimedia open-source code repositories was also used to select date ranges and identify when ORES models were deployed.
- mw_revert_tool_detector is a source-code repository for a tool I began developing to attempt to identify tools used in reverting edits.
- ores_bias_plots contains a handful of plots that I would like to export.