Change event.mediawiki_revision_score schema to use map types
Closed, Resolved · Public · 13 Estimated Story Points

Description

When we first set up the revision-score event, we could not support map types in JSON data, because it is not possible to distinguish between a 'struct' and a 'map' from JSON data alone: maps look identical to structs. To work around this, we used an array of score model objects, each of which had an array of (score name, probability value) objects for its probabilities.

We can now support map types by declaring a field to be an object with arbitrary (string) keys and explicitly declared value types. Querying array-based data in Hive is difficult, so we'd like to use map types instead.

Currently, the scores data field is an array of objects, like:

[
  {
    "model_name": "awesomeness",
    "model_version": "1.0.1",
    "prediction": ["yes", "mostly"],
    "probability": [
      {"name": "yes",    "value": 0.99},
      {"name": "mostly", "value": 0.90},
      {"name": "hardly", "value": 0.01}
    ]
  },
  {
    "model_name": "other_model_name",
    ...
  },
  ...
]

I believe (we should check with @JAllemandou and @Pchelolo) that we'd like to change the schema so that the scores field will be an object like:

{
  "awesomeness" {
    "model_name": "awesomeness",
    "model_version": "1.0.1",
    "prediction": ["yes", "mostly"],
    "probability": {
      "yes": 0.99,
      "mostly", 0.90,
      "hardly", 0.01
    }
  },
  "other_model_name": { ... },
  ...
}

The change here is that instead of an array of scores, the scores field will be a map (object) from model name to score object. The score object will be mostly the same, except we want to change the probability field to also be a map, from prediction name to probability (decimal) value.

In JSONSchema, we define a 'map' type as an object whose additionalProperties are declared with a particular value type, like:

"map_field": {
  "type": "object",
  "additionalProperties": {
    "type": "string" // or whatever type your values are.
  }
}

See the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/current.yaml#L46-L52 | api/request schema params ]] field for an example. Note that the map value "type" does not have to be a simple primitive like "string"; it can also be an "object", as long as the object schema is explicitly declared. This will be the case for the top-level scores field: it should have additionalProperties: { type: object, properties: { ... }}.
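To make that concrete, here is a sketch of what the new scores field schema could look like (property names are taken from the example above; this is not the final schema):

"scores": {
  "type": "object",
  "description": "Map of model name to score object.",
  "additionalProperties": {
    "type": "object",
    "properties": {
      "model_name": { "type": "string" },
      "model_version": { "type": "string" },
      "prediction": {
        "type": "array",
        "items": { "type": "string" }
      },
      "probability": {
        "type": "object",
        "description": "Map of prediction name to probability value.",
        "additionalProperties": { "type": "number" }
      }
    }
  }
}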

So, we need to change the JSONSchema for mediawiki/revision/score to accept the new map type data. (We'll also likely want to change the errors field to be a similar map keyed by model name.) Please read https://github.com/wikimedia/mediawiki-event-schemas/blob/master/README.md for some information about how to make schema changes.

change-propagation emits the revision-score event, so we'll need to modify the producer code there to send the new format. @Pchelolo to provide instructions here.
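To illustrate the reshaping the producer will need to do (change-propagation itself is a Node.js service; this Python sketch, with field names taken from the examples above, only shows the transformation, not the actual producer code):

def scores_array_to_map(scores_array):
    """Convert the old array-based scores format to the new map format (sketch)."""
    scores_map = {}
    for score in scores_array:
        score = dict(score)  # shallow copy so we don't mutate the input
        # probability: array of {name, value} objects -> map of name -> value
        score['probability'] = {p['name']: p['value'] for p in score['probability']}
        # key each score object by its model name
        scores_map[score['model_name']] = score
    return scores_map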

Note that this will be a backwards incompatible change! No one is using this data at the moment (we never released it publicly in EventStreams), so making this change backwards incompatible is OK. Once the schema change and new data are live, we will most likely have to move the old incompatible data out of the way (and eventually delete it) on the analytics cluster side. @Ottomata will handle that part.

See also:
T167180: Emit revision-score event to EventBus and expose in EventStreams
T197000: Modify revision-score schema so that model probabilities won't conflict

Event Timeline

This would use a map type instead of arrays of (key, value) objects for the probability field.

fdans triaged this task as Medium priority.Jun 6 2019, 4:37 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

@JAllemandou @Pchelolo Q: do we want to have scores be a map type keyed by model name, or still an array of objects? (We'll certainly change the probability field to a map of prediction -> value.)

In my experience having it mapped by the name is easier, but I'd rather go with the expert opinion of @JAllemandou

I like the idea of having the model-name as map-key. Only limitation I can think of is that only one model version can be reported on a revision, except if we put an array (or a version map !) as the map value ... Seems overkill.

I think we were told by ORES folks that that would never happen (at least not for a given revision-score event). If a revision is scored again by a different version (and we get an event somehow... which we won't from precache + change-prop right now), it would be a totally new event.

I recall that as well @Ottomata - My note was purely theoretical :) I'm fully in favor of having a map keyed by model names.

Confirmed. We expect no overlap in revision-score where a revision is scored twice, but should that happen, it would be a new event.

@Halfak I think we've confirmed this before too, but I want to make super sure! All predictions are strings (or can be cast to strings), and all probabilities are decimal values. Yes?

Hmm yes. You can cram a bool or int into a string. Predictions are all int, bool, string, or list of strings.

Change 536301 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/event-schemas@master] Change event.mediawiki_revision_score schema to use map types

https://gerrit.wikimedia.org/r/536301

Change 536301 merged by Ottomata:
[mediawiki/event-schemas@master] Change event.mediawiki_revision_score schema to use map types

https://gerrit.wikimedia.org/r/536301

Change 538927 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump mediawiki/event-schemas to 6b90d96 to get new revision/score version

https://gerrit.wikimedia.org/r/538927

Change 538927 merged by Ottomata:
[eventgate-wikimedia@master] Bump mediawiki/event-schemas to 6b90d96 to get new revision/score version

https://gerrit.wikimedia.org/r/538927

Change 539373 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Bump eventgate-main image version and pre cache revision-score 2.0.0

https://gerrit.wikimedia.org/r/539373

Change 539373 merged by Ottomata:
[operations/deployment-charts@master] Bump eventgate-main image version and pre cache revision-score 2.0.0

https://gerrit.wikimedia.org/r/539373

Mentioned in SAL (#wikimedia-operations) [2019-09-26T17:41:34Z] <ppchelko@deploy1001> Started deploy [changeprop/deploy@2db4bff]: Modify ORES processor for new-style events T225211

Mentioned in SAL (#wikimedia-operations) [2019-09-26T17:43:38Z] <ppchelko@deploy1001> Finished deploy [changeprop/deploy@2db4bff]: Modify ORES processor for new-style events T225211 (duration: 02m 04s)

This is looking great so far!

I will fix the Hive table on Monday.

The Hive table is looking good and new data is coming in. I moved the old data away and created a table in my user database with the old stuff. Now to backfill...

Backfilling was really hard! Impossible in SQL, so I switched to Spark. This took me all day, but I think I'm very close!

@JAllemandou here's what I'm working with:
https://gist.github.com/ottomata/b9c59bc0858832bdf4ed1ebcd7187397
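The gist has the full script; the core of the conversion is roughly the following (a simplified PySpark sketch with assumed table name and output path, requiring Spark 2.4+ for the transform and map_from_entries functions):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('revision_score_backfill').getOrCreate()

# Hypothetical source: the old array-based data moved into a user database.
old = spark.read.table('otto.mediawiki_revision_score_old')

# scores: array of score structs -> map of model_name -> score struct,
# with probability converted from an array of (name, value) structs to a map.
new = old.withColumn('scores', F.expr("""
    map_from_entries(transform(scores, s -> struct(
        s.model_name,
        named_struct(
            'model_name', s.model_name,
            'model_version', s.model_version,
            'prediction', s.prediction,
            'probability', map_from_entries(
                transform(s.probability, p -> struct(p.name, p.value))
            )
        )
    )))
"""))

# Write directly to Parquet (hypothetical path), to be moved into place later.
new.write.mode('overwrite').parquet('/user/otto/revision_score_backfill')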

The insert into my temp table is running overnight...hopefully it'll be done in the morning. (Who am I kidding it will probably die somewhere tonight...🚽 )

Andrew, the script to backfill might have been a bit aggressive; the HDFS RPC queue jumped like crazy from 22 UTC onward:

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&from=1569964002384&to=1569974708471

From the HDFS audit log I can see a lot of entries like:

[..]
2019-10-01 22:40:22,440 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.53.35 cmd=getfileinfo src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215709_0013_m_000408_0/datacenter=eqiad/year=2019/month=4/day=5/hour=1/part-00408-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet      dst=null        perm=null       proto=rpc
2019-10-01 22:40:22,440 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.36.109        cmd=getfileinfo src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215735_0013_m_001228_0/datacenter=eqiad/year=2019/month=4/day=6/hour=22/part-01228-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet     dst=null        perm=null       proto=rpc
2019-10-01 22:40:22,440 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.36.109        cmd=getfileinfo src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215734_0013_m_001229_0/datacenter=eqiad/year=2019/month=4/day=3/hour=23/part-01229-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet     dst=null        perm=null       proto=rpc
2019-10-01 22:40:22,440 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.5.29  cmd=create      src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215704_0013_m_000078_0/datacenter=eqiad/year=2019/month=4/day=12/hour=1/part-00078-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet     dst=null        perm=otto:hadoop:rw-r--r--      proto=rpc
2019-10-01 22:40:22,441 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.5.20  cmd=create      src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215713_0013_m_000967_0/datacenter=eqiad/year=2019/month=4/day=6/hour=1/part-00967-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet      dst=null        perm=otto:hadoop:rw-r--r--      proto=rpc
2019-10-01 22:40:22,441 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.21.112        cmd=create      src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215715_0013_m_000705_0/datacenter=eqiad/year=2019/month=4/day=4/hour=9/part-00705-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet      dst=null        perm=otto:hadoop:rw-r--r--      proto=rpc
2019-10-01 22:40:22,441 INFO FSNamesystem.audit: allowed=true   ugi=otto (auth:SIMPLE)  ip=/10.64.21.123        cmd=getfileinfo src=/user/hive/warehouse/otto.db/mediawiki_revision_score_all_backfill0/_temporary/0/_temporary/attempt_20191001215709_0013_m_000913_0/datacenter=eqiad/year=2019/month=3/day=30/hour=9/part-00913-cdbed28b-f405-4538-9bd1-bbdc607a29ce.c000.snappy.parquet     dst=null        perm=null       proto=rpc
[..]

The number of files on HDFS jumped from ~38M to 56M :D

The good thing is that the Namenode JVM didn't thrash! The G1 GC seems to work fine!

Iinnnteresting. And the job failed too (and deleted all the temporary output files it created).

I did get it to work for a smaller set...I will try and reduce the number of partitions it works on at once.

Ok, now doing half a month at a time, and writing directly to Parquet files rather than going through Hive. Seems to be working (and the gist is updated). I'll then move the files back into the event.mediawiki_revision_score table location and run MSCK REPAIR TABLE.
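That last step should be something like (the backfill output path here is hypothetical):

hdfs dfs -mv /user/otto/revision_score_backfill/* /wmf/data/event/mediawiki_revision_score/
hive -e 'MSCK REPAIR TABLE event.mediawiki_revision_score'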

Woo hoo!

hive -e 'show partitions event.mediawiki_revision_score' | wc -l
10897

hdfs dfs -du -s -h /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/*

8.0 G  24.0 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=1
7.8 G  23.5 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=2
9.2 G  27.7 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=3
9.5 G  28.5 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=4
7.8 G  23.5 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=5
6.5 G  19.4 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=6
4.8 G  14.4 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=7
6.1 G  18.4 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=8
4.8 G  14.4 G  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=9
264.0 M  792.1 M  /wmf/data/event/mediawiki_revision_score/datacenter=eqiad/year=2019/month=10

And!

select page_title, scores['goodfaith'].prediction[1] as prediction, scores['goodfaith'].probability as prob from mediawiki_revision_score where "database" = 'enwiki' and year=2019 and month=1 and day=10 and hour=0 limit 10;

                  page_title                   | prediction |                          prob
-----------------------------------------------+------------+--------------------------------------------------------
 1997–98_Highland_Football_League             | true       | {false=0.0023274666025727697, true=0.9976725333974272}
 Wikipedia:Featured_article_candidates         | true       | {false=0.010649714224484796, true=0.9893502857755152}
 Danil_Faizullin                               | true       | {false=0.0029648242013771142, true=0.9970351757986229}
 2018_Mississippi_State_Bulldogs_football_team | true       | {false=0.012042837978452625, true=0.9879571620215474}
 AIDS-Holocaust_metaphor                       | true       | {false=0.0029851674260443772, true=0.9970148325739556}
 1962_Detroit_Lions_season                     | true       | {false=0.004996510513510244, true=0.9950034894864898}
 Medical_claims_on_The_Dr._Oz_Show             | true       | {false=0.017238768537503724, true=0.9827612314624963}
 Tibor_Živković                                | true       | {false=0.012514010325418323, true=0.9874859896745817}
 Hideki_Tojo                                   | true       | {false=0.008441624631921107, true=0.9915583753680789}
 2018–19_Atlanta_Hawks_season                  | true       | {false=0.014599823651628374, true=0.9854001763483716}
(10 rows)

Query 20191002_163926_00050_vanex, FINISHED, 5 nodes
Splits: 106 total, 76 done (71.70%)
0:00 [25.1K rows, 24.8KB] [65.9K rows/s, 65.2KB/s]
Ottomata set the point value for this task to 13.