
[Epic] Make ORES scores for wikidata available as a dump
Open, Lowest · Public

Description

We currently have several levels of caching for ORES scores, but they're all short-term caches and only contain an incomplete set of scores. Researchers may often need the entire set of scores for a given model, which we could store in Hadoop for convenient retrieval.

If we can also supply analytics dump files like we do for pageviews, it would lift the obstacle of having to arrange for research data access.

Event Timeline

awight created this task. Nov 15 2018, 5:41 PM
Restricted Application added a project: artificial-intelligence. Nov 15 2018, 5:41 PM
Restricted Application added a subscriber: Aklapper.

Notes from IRC:

  • Backfill old values by hitting production with a maintenance job.
  • Pipe new scores into ORES by reading from the changeprop stream.
  • May want to implement a new query parameter to prevent caching in Redis.
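
A minimal sketch of the backfill bullet above: it only builds batched request URLs for a maintenance job and does not send anything. It assumes the public ORES v3 URL layout (`/v3/scores/{context}/?models=...&revids=...`). The batch size of 50 and the helper names `batched` and `score_urls` are made up for illustration. The proposed cache-bypass query parameter does not exist yet, so it is not shown.

```python
from itertools import islice
from urllib.parse import urlencode

# Assumed base URL of the public ORES v3 scoring endpoint.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def batched(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score_urls(context, model, rev_ids, batch_size=50):
    """Yield one scoring URL per batch of revision IDs.

    batch_size=50 is an illustrative assumption, not a documented limit.
    """
    for chunk in batched(rev_ids, batch_size):
        # ORES expects revision IDs joined with "|"; urlencode escapes it as %7C.
        query = urlencode({
            "models": model,
            "revids": "|".join(str(r) for r in chunk),
        })
        yield f"{ORES_BASE}/{context}/?{query}"
```

A backfill job could walk the revision IDs of wikidatawiki in order and fetch each URL with whatever rate limiting production can tolerate.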

We have an open request (in email) from the Human-Centered Computing team at Freie Universität Berlin to retrieve all itemquality scores for wikidata. We should gather other use cases to decide the order of backfilling.

awight updated the task description. Nov 15 2018, 6:07 PM
awight renamed this task from Store ORES scores in Hadoop to [Epic] Make ORES scores available in Hadoop and as a dump. Nov 16 2018, 10:32 PM
awight updated the task description. Nov 16 2018, 11:33 PM

Once the data is stored in Hadoop, it is easy to supply pageviews-like dump files.
As for how to compute the scores, hitting the ORES API is probably the quickest solution in terms of schedule. A next step might be to try to reproduce ORES scoring in PySpark. I already started a POC of that, but didn't get to the core of the problem: feature extraction from mediawiki-text instead of from the mediawiki-api.
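
To make the "pageviews-like dump files" idea concrete, here is a rough sketch of writing already-computed score rows to a gzip-compressed, tab-separated file. The TSV layout, the header row, and the `write_dump` name are assumptions for illustration only. The task does not specify a dump format, and the real pageviews dumps may be laid out differently.

```python
import csv
import gzip
import io


def write_dump(rows, fileobj, columns):
    """Write dict rows as gzip-compressed TSV to a binary file object.

    `columns` fixes the column order; a header row is written first.
    The layout is a hypothetical one, not an agreed dump format.
    """
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        with io.TextIOWrapper(gz, encoding="utf-8", newline="") as text:
            writer = csv.writer(text, delimiter="\t", lineterminator="\n")
            writer.writerow(columns)
            for row in rows:
                writer.writerow([row[c] for c in columns])
```

In practice the rows would come from a Hadoop export rather than from memory, and `fileobj` would be a file opened in binary mode.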

fdans triaged this task as Normal priority. Nov 19 2018, 5:19 PM
fdans raised the priority of this task from Normal to Needs Triage.
fdans moved this task from Incoming to Geowiki on the Analytics board.
fdans moved this task from Geowiki to Radar on the Analytics board.

@bmansurov This might be interesting to you. Please let us know if the design will be compatible with your article suggestion model!

Hey heyyy! We deployed changes for T197000 today. I also re-enabled Hive refinement of this data, so we now have an event.mediawiki_revision_score table with this schema:

database            	string
meta                	struct<domain:string,dt:string,id:string,request_id:string,schema_uri:string,topic:string,uri:string>
page_id             	bigint
page_namespace      	bigint
page_title          	string
rev_id              	bigint
rev_parent_id       	bigint
rev_timestamp       	string
scores              	array<struct<model_name:string,model_version:string,prediction:array<string>,probability:array<struct<name:string,value:double>>>>
datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

# Partition Information
# col_name            	data_type           	comment

datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

See also https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/1.yaml#L81
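
Since `scores` is a nested array of structs, consumers will probably want to flatten it before writing a dump. A small sketch in plain Python, assuming each record arrives as a dict shaped like the schema above. The function name `flatten_scores`, the output column names, and the sample values are made up for illustration.

```python
def flatten_scores(event):
    """Yield one flat dict per (revision, model, class) from a score record.

    `event` is assumed to be a dict mirroring the Hive schema above.
    """
    for score in event.get("scores") or []:
        predicted = set(score.get("prediction") or [])
        for prob in score.get("probability") or []:
            yield {
                "database": event["database"],
                "rev_id": event["rev_id"],
                "model_name": score["model_name"],
                "model_version": score["model_version"],
                "class_name": prob["name"],
                "probability": prob["value"],
                # True when this class is among the model's predictions.
                "predicted": prob["name"] in predicted,
            }


# Made-up sample record, for illustration only.
sample = {
    "database": "wikidatawiki",
    "rev_id": 123,
    "scores": [{
        "model_name": "itemquality",
        "model_version": "0.0.0",
        "prediction": ["C"],
        "probability": [
            {"name": "B", "value": 0.25},
            {"name": "C", "value": 0.75},
        ],
    }],
}

for row in flatten_scores(sample):
    print(row["rev_id"], row["model_name"], row["class_name"], row["probability"])
```

The same reshaping could be done in Hive or Spark by exploding `scores` and then `probability`. The plain-Python version is only meant to show the target row shape.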

@awight thanks for the ping. I'll keep an eye on the task.

Ladsgroup triaged this task as High priority. Nov 28 2018, 6:33 AM
Ladsgroup raised the priority of this task from High to Needs Triage.
Ladsgroup triaged this task as High priority.
Ladsgroup moved this task from Untriaged to Research & analysis on the Scoring-platform-team board.

I'm reducing the scope of this task to just one pilot integration, for wikidata.

awight renamed this task from [Epic] Make ORES scores available in Hadoop and as a dump to [Epic] Make ORES scores for wikidata available as a dump. Dec 18 2018, 11:33 PM
Harej lowered the priority of this task from High to Low. Mar 19 2019, 9:28 PM
Harej added a subscriber: Harej.

Having scores available as a dump is a great idea, but unfortunately I don't think it's a pressing priority. (If you feel strongly otherwise, I am interested in hearing why.)

awight removed a subscriber: awight. Mar 21 2019, 4:04 PM
Harej lowered the priority of this task from Low to Lowest. Mar 26 2019, 9:19 PM
Harej removed a subscriber: Harej. Jul 4 2019, 9:26 AM