Find efficient ORES articlequality data source
Closed, ResolvedPublic

Description

Ideally, we can load a dump file containing a snapshot of the articlequality score for each page. In the worst case, we could request a score for every page individually.

The goal is that we can compare ORES scores from now against 1-2 years later, and see if there are correlations (either positive or negative) between score shifts and indicators related to our reference feature work.
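As a rough illustration of the comparison described above, the following sketch computes per-page score shifts between two snapshots and correlates them with a per-page indicator. The function names, the dictionary-based snapshot format, and the choice of Pearson correlation are all assumptions made for this example, not something decided in this task.

```python
from math import sqrt


def score_shifts(before, after):
    """Return {page_id: after - before} for pages present in both snapshots.

    `before` and `after` are hypothetical {page_id: score} mappings.
    """
    return {pid: after[pid] - before[pid] for pid in before.keys() & after.keys()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def shift_indicator_correlation(before, after, indicator):
    """Correlate score shifts with a per-page indicator.

    `indicator` is a hypothetical {page_id: value} mapping, for example a
    count of reference-related edits. Pages missing from any input are skipped.
    """
    shifts = score_shifts(before, after)
    pages = sorted(pid for pid in shifts if pid in indicator)
    return pearson([shifts[p] for p in pages], [indicator[p] for p in pages])
```

A positive result would suggest that pages with a higher indicator value tended to gain more in score, a negative result the opposite. Whether Pearson is the right measure depends on how the scores and indicators are distributed.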

Perhaps https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/ORES/Historified_scores ? Check the data source to understand how long the data is retained--do we need to copy a subset of the data in order to preserve it? Is this allowed according to our data retention policies? Are we confident that a comparable data source will still exist in 1-2 years? Can we be alerted before the data source becomes deprecated?

Results:

  • The Hive data source doesn't exist / never existed.
  • Articlequality scores are stored in MySQL in the ores_classification table and could be queried or exported in bulk.
  • These rows are written by the ORES extension. The data source is definitely "endangered", but it may be safe over our 1-2 year time frame because RecentChanges filtering still depends on this data.
  • The only way to stay ahead of future changes is to communicate with the WMF machine learning team.
  • There should be no issues with data retention. The table includes data back to inception in 2018, and the contents of each row are purely numeric and internally-generated.
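A bulk export from the ores_classification table could look roughly like the sketch below. Only the table name comes from the findings above. The column names (oresc_rev, oresc_model, oresc_class, oresc_probability), the ores_model lookup table, and its columns are assumptions about the ORES extension schema and should be checked against the live database. The in-memory SQLite database and its rows are made up so that the example runs on its own.

```python
import sqlite3

# Assumed schema: column names and the ores_model join are not confirmed
# by this task and need to be verified against the real tables.
EXPORT_QUERY = """
SELECT c.oresc_rev, c.oresc_class, c.oresc_probability
FROM ores_classification AS c
JOIN ores_model AS m ON m.oresm_id = c.oresc_model
WHERE m.oresm_name = ?
ORDER BY c.oresc_rev, c.oresc_class
"""


def export_scores(conn, model_name="articlequality"):
    """Return all (revision, class, probability) rows for one model."""
    return conn.execute(EXPORT_QUERY, (model_name,)).fetchall()


def make_demo_db():
    """Build a throwaway in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ores_model (
            oresm_id INTEGER PRIMARY KEY,
            oresm_name TEXT
        );
        CREATE TABLE ores_classification (
            oresc_rev INTEGER,
            oresc_model INTEGER,
            oresc_class INTEGER,
            oresc_probability REAL
        );
        INSERT INTO ores_model VALUES (1, 'articlequality'), (2, 'damaging');
        INSERT INTO ores_classification VALUES
            (100, 1, 0, 0.42),
            (101, 1, 0, 0.87),
            (100, 2, 1, 0.05);
        """
    )
    return conn


if __name__ == "__main__":
    for row in export_scores(make_demo_db()):
        print(row)
```

Against the production MySQL replicas the same query would be run through a MySQL client library, where the placeholder syntax usually differs from SQLite's `?`.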

Event Timeline

Sadly, I created the page about ORES in Hive many years ago, and as far as I can tell it has always been wrong. There is no such data store at the moment.

awight claimed this task.
awight updated the task description.
awight moved this task from Sprint Backlog to Done on the WMDE-TechWish-Sprint-2023-03-14 board.