Some variance between the two sources is expected, because user information for features retrieved from the MediaWiki API in LiftWing is more recent than in the monthly wmf.mediawiki_history snapshots. However, the discrepancy can also be observed for revisions made by users with only a single edit, which snapshot staleness alone does not readily explain.
The cause seems to be a bug in the assemble_features_vector transformation in research datasets, which treats all feature values as floats when deserializing them from JSON. As a result, boolean values become null and are then set to zero by a subsequent df.na.fill(0, ...). The behavior can be reproduced with this small example:
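from pyspark.sql import functions as F
from pyspark.sql import types as T

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    data=(('{"has_comment": true}',),),
    schema=("example_json",),
)
# Declaring the boolean field as a float reproduces the bug.
schema = T.StructType(
    [
        T.StructField("has_comment", T.FloatType(), True),
    ]
)
df.select(F.from_json("example_json", schema=schema)).show()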
+-----------------------+
|from_json(example_json)|
+-----------------------+
|                 {null}|
+-----------------------+
This does mean that all previous risk_observatory.revert_risk_predictions snapshots are affected by this bug.
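For checking a corrected transformation against expected values, the intended coercion can be written down in plain Python. This is only a reference sketch of the semantics one would presumably want (booleans become 1.0 or 0.0, and only genuinely missing values fall back to the fill value); it is not the code in research datasets. The function name coerce_features and the feature name user_edit_count are hypothetical.

```python
import json


def coerce_features(raw_json, fill_value=0.0):
    """Parse a JSON feature object and return a {name: float} dict.

    Booleans map to 1.0 / 0.0. Null or non-numeric values fall back to
    fill_value, mirroring the effect of a later df.na.fill(0, ...).
    Hypothetical helper for illustration only.
    """
    features = json.loads(raw_json)
    coerced = {}
    for name, value in features.items():
        # Check bool before int/float: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            coerced[name] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            coerced[name] = float(value)
        else:
            coerced[name] = fill_value
    return coerced


# user_edit_count is a made-up feature name used only for this example.
print(coerce_features('{"has_comment": true, "user_edit_count": 3}'))
```

With these semantics, has_comment stays distinguishable from a missing value, whereas the float-typed schema above collapses true, false, and null into the same zero after filling.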
Originally from this Slack thread in research-engineering.