
Invalid field names in ORES models causing downstream Hive ingestion to fail
Closed, ResolvedPublic

Description

We just did T167180: Emit revision-score event to EventBus and expose in EventStreams. A cool feature of doing this is that all ORES events will be added to a Hive table called event.mediawiki_revision_score. Hive is a SQL engine, and as such does not allow numeric column names. Event keys are mapped directly to column names, and it seems that some ORES models have numeric keys, e.g. in this event. The one we've seen so far is for the pagelevel model:

"scores":[  
      {  
         "model_name":"pagelevel",
         "model_version":"0.1.0",
         "prediction":"4",
         "probability":{  
            "0":0.00692137910807248,
            "1":0.2913254357756565,
            "3":0.33305123865790265,
            "4":0.3687019464583685
         }
      }
   ]

Can we fix this (and any other offending models) so that numeric keys are never used?

See also: https://github.com/wiki-ai/articlequality/blob/master/Makefile

Event Timeline

Ottomata created this task.May 30 2018, 3:08 PM
Restricted Application added a subscriber: Aklapper.

Change 436529 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist mediawiki_revision_score from Hive refinement

https://gerrit.wikimedia.org/r/436529

Change 436529 merged by Ottomata:
[operations/puppet@production] Blacklist mediawiki_revision_score from Hive refinement

https://gerrit.wikimedia.org/r/436529

Nuria moved this task from Incoming to Radar on the Analytics board.May 31 2018, 4:38 PM
Ladsgroup added a subscriber: Tpt.Jun 1 2018, 1:30 PM

Pinging @Tpt since he was involved in the discussion. Do you think it's okay to change the keys for ORES responses from "1" to "one" (or "un")? Do you know who uses the French Wikisource models? Thank you!

Tpt added a comment.Jun 1 2018, 7:12 PM

Do you think it's okay to change the keys for ores responses from "1" to "one".

The keys are the "page quality levels" managed by ProofreadPage. I would do this renaming:

  • "0" to "without_text"
  • "1" to "not_proofread"
  • "2" to "problematic"
  • "3" to "proofread"
  • "4" to "validated"
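
The renaming proposed in the list above can be sketched in Python as follows. This is an illustration only: the helper name `rename_quality_keys` and the idea of rewriting a score after it has been produced are assumptions, not a description of what the articlequality change actually does.

```python
# Mapping from ProofreadPage numeric quality levels to the proposed labels.
QUALITY_LABELS = {
    "0": "without_text",
    "1": "not_proofread",
    "2": "problematic",
    "3": "proofread",
    "4": "validated",
}


def rename_quality_keys(score):
    """Return a copy of a pagelevel score with numeric keys replaced by labels.

    Hypothetical helper for illustration; unknown keys are left unchanged.
    """
    renamed = dict(score)
    renamed["prediction"] = QUALITY_LABELS.get(
        score["prediction"], score["prediction"]
    )
    renamed["probability"] = {
        QUALITY_LABELS.get(key, key): value
        for key, value in score["probability"].items()
    }
    return renamed


# Sample score shaped like the one in the task description (values shortened).
score = {
    "model_name": "pagelevel",
    "model_version": "0.1.0",
    "prediction": "4",
    "probability": {"0": 0.0069, "1": 0.2913, "3": 0.3331, "4": 0.3687},
}
print(rename_quality_keys(score))
```

With labels like these, every probability key is a plain identifier that can be mapped to a column name without quoting.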

Do you know who uses French Wikisource models?

Currently no one, to my knowledge. It is an experiment we did with Aaron at the Vienna hackathon and we have not worked on it since then.

https://github.com/wiki-ai/articlequality/pull/68

Also announced it in wikisource-l, ai-l and Village pump of French Wikisource.

Oh, cool, the fix is deployed? If so I'll reenable Hive stuff.

awight added a comment.Jun 5 2018, 1:28 PM

@Ottomata Pending deployment, we'll update here when deployed.

Deployed, but I'm not resolving the task until after the Hive revert.

Thanks @awight. I just tried to re-enable, but there are more (possibly MANY more) problems with this data. I just noticed the following:

"scores": [
    {
      "model_name": "drafttopic",
      "model_version": "0.1.0",
      "prediction": [
        "STEM.Technology"
      ],
      "probability": {
        "Assistance.Article improvement and grading": 0.0006265457823861294,
...

No way we can support '.' or ' ' in column names. The more I think about it, the less sure I am that we will really be able to support revision-score events in Hive. The single hour I just tried to refine ended up attempting to create a Hive table like:

18/06/12 13:44:34 INFO DataFrameToHive: Running Hive DDL statement:
CREATE EXTERNAL TABLE `otto`.`mediawiki_revision_score` (
`database` string,
`meta` struct<`domain`:string,`dt`:string,`id`:string,`request_id`:string,`schema_uri`:string,`topic`:string,`uri`:string>,
`page_id` bigint,
`page_namespace` bigint,
`page_title` string,
`rev_id` bigint,
`rev_parent_id` bigint,
`rev_timestamp` string,
`scores` array<struct<model_name:string,model_version:string,prediction:string,probability:struct<A:double,Assistance.Article improvement and grading:double,Assistance.Contents systems:double,Assistance.Files:double,Assistance.Maintenance:double,B:double,C:double,Culture.Arts:double,Culture.Broadcasting:double,Culture.Crafts and hobbies:double,Culture.Entertainment:double,Culture.Food and drink:double,Culture.Internet culture:double,Culture.Language and literature:double,Culture.Media:double,Culture.Performing arts:double,Culture.Philosophy and religion:double,Culture.Plastic arts:double,Culture.Sports:double,Culture.Visual arts:double,D:double,E:double,Geography.Bodies of water:double,Geography.Cities:double,... 30 more fields>>>
)
PARTITIONED BY (
`datacenter` string,
`year` bigint,
`month` bigint,
`day` bigint,
`hour` bigint
)
STORED AS PARQUET
LOCATION '/user/otto/external/eventbus6/mediawiki_revision_score'
18/06/12 13:44:34 ERROR DataFrameToHive: Error executing Hive-DDL commands
Error while compiling statement: FAILED: ParseException line 10:117 mismatched input '.' expecting : near 'Assistance' in column specification
18/06/12 13:44:34 ERROR Refine: Failed refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/event/eqiad_mediawiki_revision-score/hourly/2018/06/12/11 -> `otto`.`mediawiki_revision_score` (datacenter="eqiad",year=2018,month=6,day=12,hour=11).

Every field found in any model's probability object will be merged into the same struct field. In general, this would work, but only if every model that used the same field name also used the same field type. E.g. we could never have one model with a field "A": true and another one with a field "A": 99.9.
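
The merging constraint described here can be illustrated with a small Python sketch. It is not the Refine code; the function `merge_field_types` and the model names in the example are hypothetical, and Python type names stand in for Hive column types.

```python
def merge_field_types(models):
    """Merge per-model probability fields into one {field: type_name} schema.

    `models` maps a model name to its probability object. Raises ValueError
    when two models use the same field name with different value types.
    """
    merged = {}
    for model_name, probability in models.items():
        for field, value in probability.items():
            type_name = type(value).__name__
            # Record the type on first sight, otherwise compare with it.
            previous = merged.setdefault(field, type_name)
            if previous != type_name:
                raise ValueError(
                    f"field {field!r} is {previous} in an earlier model "
                    f"but {type_name} in {model_name}"
                )
    return merged


# Compatible: both hypothetical models use doubles for the shared field "A".
print(merge_field_types({"model_a": {"A": 0.1, "B": 0.9}, "model_b": {"A": 0.5}}))

# Incompatible: "A" is a boolean in one model and a double in the other.
try:
    merge_field_types({"model_a": {"A": True}, "model_b": {"A": 99.9}})
except ValueError as error:
    print("cannot merge:", error)
```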

We can keep doing this as is, but if we ever want to support ingestion of this type of data into systems outside of ORES, we are going to need stricter schemas for the model scores.

Hm, I wonder... what if, instead of a scores array, we used a scores object with model_name keys? This might get a little unwieldy, but at least it would solve the problem of the schemaless probability object causing conflicts, because they would be different struct fields altogether, rather than having all model probability schemas merged into one.
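
As a rough Python sketch of that idea, the reshaping from a scores array to a scores object keyed by model_name could look like this. The function name `scores_by_model` is made up for illustration, and the sample values are shortened; nothing here reflects an agreed event schema.

```python
def scores_by_model(event):
    """Return a copy of the event whose "scores" is a dict keyed by model_name."""
    keyed = {}
    for score in event["scores"]:
        # model_name becomes the key, so it is dropped from the value.
        keyed[score["model_name"]] = {
            key: value for key, value in score.items() if key != "model_name"
        }
    return {**event, "scores": keyed}


event = {
    "rev_id": 1,
    "scores": [
        {
            "model_name": "pagelevel",
            "model_version": "0.1.0",
            "prediction": "4",
            "probability": {"0": 0.0069, "4": 0.3687},
        },
        {
            "model_name": "drafttopic",
            "model_version": "0.1.0",
            "prediction": ["STEM.Technology"],
            "probability": {"STEM.Technology": 0.9},
        },
    ],
}
print(scores_by_model(event)["scores"].keys())
```

Each model then gets its own struct, so a type clash between two models' probability fields no longer arises. Invalid field names inside a single model would still need fixing separately.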

I'll bring the event schema issues up in another ticket with Petr. But for now, can we keep using this ticket to address bad field names in downstream models? I'll change the title.

Ottomata renamed this task from Numeric keys in ORES models causing downstream Hive ingestion to fail to Invalid field names in ORES models causing downstream Hive ingestion to fail.Jun 12 2018, 1:50 PM

@Ottomata Thanks for the investigation and explanations! This should be fun ;-)

Hey @Ladsgroup @awight ...

https://ores.wikimedia.org/v3/scores/enwiki/?model_info=score_schema

What is the likelihood we could fix these models so that they don't have bad field names? (I'm assuming not good).

We'd want to not use keywords like "false" and "true" for field names, and also not use field names that have characters like '.' and ' ' in them. All field names should be useable as SQL columns! :)
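
A check along these lines could be sketched in Python as below. The naming rule (letter or underscore first, then letters, digits or underscores) and the short reserved-word list are assumptions chosen for illustration; a real check would need the actual identifier rules and keyword list of the target SQL engine.

```python
import re

# Deliberately incomplete sample of reserved words (assumption, not Hive's list).
RESERVED = {"true", "false", "null", "select", "table"}

# Assumed rule: a letter or underscore, then letters, digits or underscores.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_safe_column_name(name):
    """True if the name matches the assumed rule and is not a reserved word."""
    return bool(SAFE_NAME.match(name)) and name.lower() not in RESERVED


def find_bad_field_names(value):
    """Recursively collect every dict key that fails is_safe_column_name."""
    bad = []
    if isinstance(value, dict):
        for key, child in value.items():
            if not is_safe_column_name(key):
                bad.append(key)
            bad.extend(find_bad_field_names(child))
    elif isinstance(value, list):
        for item in value:
            bad.extend(find_bad_field_names(item))
    return bad


# Sample combining the kinds of keys mentioned in this task (values made up).
sample = {
    "scores": [
        {
            "model_name": "example",
            "probability": {
                "0": 0.1,
                "true": 0.2,
                "Assistance.Article improvement and grading": 0.3,
                "B": 0.4,
            },
        }
    ]
}
print(find_bad_field_names(sample))
```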

Halfak added a subscriber: Halfak.Jun 25 2018, 9:30 PM

Well, the field names have nothing wrong with them. Essentially, "true" and "false" are very useful terms in ORES predictions. I think this needs to be fixed downstream of us.

Vvjjkkii renamed this task from Invalid field names in ORES models causing downstream Hive ingestion to fail to gzbaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii removed Ladsgroup as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: Ladsgroup; removed: gerritbot, Aklapper.
MusikAnimal updated the task description. (Show Details)Jul 1 2018, 5:08 AM
Restricted Application added a project: User-Ladsgroup.Jul 1 2018, 5:18 AM
Community_Tech_bot renamed this task from gzbaaaaaaa to Invalid field names in ORES models causing downstream Hive ingestion to fail.Jul 1 2018, 5:25 AM
Community_Tech_bot raised the priority of this task from High to Needs Triage.

^ Rollback script gone wrong, bear with me!

Restricted Application added a project: User-Ladsgroup.Jul 1 2018, 5:36 AM
Community_Tech_bot renamed this task from Invalid field names in ORES models causing downstream Hive ingestion to fail to Upgrade prometheus-jmx-exporter on all services using it.Jul 1 2018, 5:36 AM
Community_Tech_bot edited subscribers, added: Pnorman, herron, RobH and 4 others; removed: MusikAnimal, gerritbot, Halfak and 4 others.
CommunityTechBot renamed this task from Upgrade prometheus-jmx-exporter on all services using it to Invalid field names in ORES models causing downstream Hive ingestion to fail.Jul 3 2018, 3:28 AM
CommunityTechBot updated the task description. (Show Details)
Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board.Aug 22 2018, 6:09 PM
Ladsgroup closed this task as Resolved.Apr 17 2019, 7:00 PM
Ladsgroup added a project: ORES.

This is done