Ideally, we can load a dump file containing a snapshot articlequality score for each page. In the worst case, we could request a score for every page individually.
The goal is to compare current ORES scores against scores taken 1-2 years from now, and to see whether there are correlations (either positive or negative) between score shifts and indicators related to our reference feature work.
Perhaps https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/ORES/Historified_scores ? Check the data source to understand how long the data is retained. Do we need to copy a subset of the data in order to preserve it? Is this allowed according to our data retention policies? Are we confident that a comparable data source will still exist in 1-2 years? Can we be alerted before the data source becomes deprecated?
Results:
- The Hive data source doesn't exist / never existed.
- Articlequality scores are stored in MySQL in the ores_classification table and could be queried or exported in bulk.
- These rows are written by the ORES extension. The data source is definitely "endangered"; however, it may be safe over our 1-2 year time frame because RecentChanges filtering still depends on this data.
- The only way to stay ahead of future changes is to communicate with the WMF machine learning team.
- There should be no issues with data retention. The table includes data going back to its inception in 2018, and the contents of each row are purely numeric and internally generated.
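
A bulk export of the current articlequality rows could look roughly like the sketch below. Only the ores_classification table name comes from the findings above. The ores_model table and every column name (oresc_rev, oresc_model, oresc_class, oresc_probability, oresm_id, oresm_name, oresm_is_current) are assumptions that should be checked against the ORES extension schema before use. The helper names export_snapshot and demo_connection are hypothetical, and the in-memory SQLite database with made-up rows stands in for the real MySQL replica so the sketch can run anywhere.

```python
import csv
import io
import sqlite3

# Assumed schema: column names and the ores_model join are not confirmed by
# the findings above; verify them against the ORES extension before use.
# Note: sqlite3 uses "?" placeholders, most MySQL drivers use "%s".
SNAPSHOT_QUERY = """
SELECT c.oresc_rev, c.oresc_class, c.oresc_probability
FROM ores_classification AS c
JOIN ores_model AS m ON m.oresm_id = c.oresc_model
WHERE m.oresm_name = ? AND m.oresm_is_current = 1
ORDER BY c.oresc_rev, c.oresc_class
"""


def export_snapshot(conn, out, model="articlequality"):
    """Write one CSV row per (revision, class) pair; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["rev_id", "class_id", "probability"])
    count = 0
    # conn.execute() is a sqlite3 shortcut; with a MySQL driver, create a
    # cursor and call cursor.execute() instead.
    for row in conn.execute(SNAPSHOT_QUERY, (model,)):
        writer.writerow(row)
        count += 1
    return count


def demo_connection():
    """Build an in-memory stand-in database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ores_model (
            oresm_id INTEGER PRIMARY KEY,
            oresm_name TEXT,
            oresm_is_current INTEGER
        );
        CREATE TABLE ores_classification (
            oresc_id INTEGER PRIMARY KEY,
            oresc_rev INTEGER,
            oresc_model INTEGER,
            oresc_class INTEGER,
            oresc_probability REAL
        );
        INSERT INTO ores_model VALUES (1, 'articlequality', 1);
        INSERT INTO ores_model VALUES (2, 'damaging', 1);
        INSERT INTO ores_classification VALUES (1, 100, 1, 0, 0.25);
        INSERT INTO ores_classification VALUES (2, 100, 1, 1, 0.75);
        INSERT INTO ores_classification VALUES (3, 100, 2, 0, 0.1);
        """
    )
    return conn


if __name__ == "__main__":
    buffer = io.StringIO()
    rows = export_snapshot(demo_connection(), buffer)
    print(f"exported {rows} rows")
    print(buffer.getvalue())
```

Storing the export with a timestamp in the file name would give us the "now" half of the comparison. The same query could be rerun in 1-2 years if the table still exists.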