We currently have several levels of caching for ORES scores, but they're all short-term caches and only contain an incomplete set of scores. Researchers may often need the entire set of scores for a given model, which we could store in Hadoop for convenient retrieval.
|Status|Assignee|Task|
|Open|None|T209611 [Epic] Make ORES scores for wikidata available as a dump|
|Resolved|None|T209731 Choose HDFS paths and partitioning for ORES scores|
|Open|None|T209732 Wire ORES recent_score events into Hadoop|
|Resolved|Ottomata|T197000 Modify revision-score schema so that model probabilities won't conflict|
|Resolved|Halfak|T197828 Fix "score_schema" -- invalid JSON Schema|
|Open|None|T214545 Emit synthetic mediawiki.revision-score events for both datacenters|
|Open|None|T209734 Include feature values in ORES changeprop stream|
|Open|None|T209737 Backfill ORES Hadoop scores with historical data|
|Open|None|T209739 Produce dump files for ORES scores|
|Open|None|T209742 Purge ORES scores from Hadoop and begin backfill when model version changes|
|Open|None|T211069 Decide whether we will include raw features|
|Open|None|T214723 Modify revscoring extract utility to include root datasources|
|Open|None|T212264 Precache should include bot edits to wikidata|
Notes from IRC:
- Backfill old values by hitting production with a maintenance job.
- Pipe new scores into Hadoop by reading from the changeprop stream.
- May want to implement a new query parameter to prevent caching in Redis.
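Backfilling against production would mean batching revision IDs into requests to the ORES v3 scores API. A minimal sketch of the URL construction in Python (the helper name and the batch size of 50 are assumptions; the proposed no-cache query parameter doesn't exist yet, so it is omitted):

```python
from urllib.parse import urlencode

ORES_BASE = "https://ores.wikimedia.org/v3/scores"

def ores_batch_urls(context, model, rev_ids, batch=50):
    """Yield ORES scores URLs for rev_ids in batches.

    `batch=50` is an assumed cap; check the API's actual limit
    before running a large backfill job.
    """
    for i in range(0, len(rev_ids), batch):
        chunk = rev_ids[i:i + batch]
        query = urlencode({"models": model,
                           "revids": "|".join(map(str, chunk))})
        yield f"{ORES_BASE}/{context}/?{query}"
```

A maintenance job could then fetch these URLs at a throttled rate so the backfill doesn't compete with live scoring traffic.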
We have an open request (via email) from the Human-Centered Computing team at Freie Universität Berlin to retrieve all itemquality scores for wikidata. We should gather other use cases in order to prioritize the order of backfilling.
Once the data is stored in Hadoop, it will be easy to produce pageviews-like dump files.
As for how to compute the scores, hitting the ORES API is probably the shortest solution in terms of schedule; a next step might be to reproduce ORES scoring in PySpark. I already started a POC of that, but haven't yet tackled the core of the problem: feature extraction from MediaWiki text instead of the MediaWiki API.
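If we go the API route, the response flattening is straightforward. A minimal sketch of turning an ORES v3 scores response into rows for loading into Hadoop (the row shape here is illustrative, not the final Hive schema; revisions that error out come back without a "score" key and are skipped):

```python
def flatten_scores(response, context, model):
    """Flatten an ORES v3 scores response into one row per scored revision.

    `response` has the shape {context: {"scores": {rev_id: {model: {...}}}}};
    entries carrying an "error" instead of a "score" are dropped.
    """
    rows = []
    for rev_id, models in response[context]["scores"].items():
        score = models[model].get("score")
        if score is None:  # scoring error for this revision; skip it
            continue
        rows.append({
            "rev_id": int(rev_id),
            "model": model,
            "prediction": score["prediction"],
            "probability": score["probability"],
        })
    return rows
```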
Hey heyyy! We deployed changes for T197000 today. I also re-enabled Hive refinement of this data, so we now have an event.mediawiki_revision_score table with this schema:
```
database        string
meta            struct<domain:string,dt:string,id:string,request_id:string,schema_uri:string,topic:string,uri:string>
page_id         bigint
page_namespace  bigint
page_title      string
rev_id          bigint
rev_parent_id   bigint
rev_timestamp   string
scores          array<struct<model_name:string,model_version:string,prediction:array<string>,probability:array<struct<name:string,value:double>>>>
datacenter      string
year            bigint
month           bigint
day             bigint
hour            bigint

# Partition Information
# col_name      data_type       comment
datacenter      string
year            bigint
month           bigint
day             bigint
hour            bigint
```
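Note that probability is now an array of name/value structs rather than a map keyed by class name, which is what the T197000 change was for: per-model class labels no longer collide in the refined schema. A sketch of that transformation in Python (the helper name is hypothetical, and sorting by class name is an assumption for determinism, not necessarily what refinement does):

```python
def probability_to_structs(probability):
    """Convert an ORES probability map {class: p} into the
    array<struct<name,value>> shape used by the refined table.
    Sorted by class name for deterministic output (an assumption)."""
    return [{"name": name, "value": value}
            for name, value in sorted(probability.items())]
```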