Discrepancy between `revert_risk` scores retrieved from LiftWing and `risk_observatory.revert_risk_predictions`
Closed, Resolved (Public)

Description

Some variance between the two sources is expected, because the user information that LiftWing retrieves from the MediaWiki API for its features is more recent than the wmf.mediawiki_history monthly snapshots. However, the discrepancy can also be observed for revisions made by users with only a single edit, which stale user information is unlikely to explain.

The cause seems to be a bug in the assemble_features_vector transformation in research datasets, which treats all feature values as floats when deserializing them from JSON. As a result, boolean features are parsed as null and then become zero after a df.na.fill(0, ...). The behavior can be reproduced with this small example:

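from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# One row holding a JSON document with a single boolean feature.
df = spark.createDataFrame(
    data=(('{"has_comment": true}',),),
    schema=("example_json",),
)
# Declaring the boolean field as a float makes from_json return null for it.
schema = T.StructType(
    [
        T.StructField("has_comment", T.FloatType(), True),
    ]
)
df.select(F.from_json("example_json", schema=schema)).show()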
+-----------------------+
|from_json(example_json)|
+-----------------------+
|                 {null}|
+-----------------------+

This means that all previous risk_observatory.revert_risk_predictions snapshots are affected by this bug.

Originally reported in a Slack thread in research-engineering.