
Add LiftWing streams data to event_sanitized (increase data retention)
Closed, ResolvedPublic

Assigned To
Authored By
isarantopoulos
Sep 23 2025, 2:01 PM
Referenced Files
F66735951: image (4).png
Oct 7 2025, 10:57 AM
F66735948: image (3).png
Oct 7 2025, 10:57 AM

Description

As an ML engineer I would like to have model scores coming from events persisted in the data lake so that I can:

  • Evaluate model performance in its application. For instance, in the case of the revertrisk model, performance can be assessed by calculating precision, recall, and F1 score based on revision outcomes (i.e., whether or not they resulted in a revert).
  • Calculate thresholds and create buckets for downstream applications. For example, in the Recent Changes filters we want to calculate the buckets based on the percentage of false positives.
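
A hedged sketch of that bucket calculation, assuming the prediction events can be joined against wmf.mediawiki_history (the revision_id and revision_is_identity_reverted fields come from that table; the 0.1-wide buckets and snapshot value are illustrative):

```sql
-- Sketch: per score bucket, count predictions and the share of revisions
-- that were in fact never reverted (candidate false positives).
-- Assumes wmf.mediawiki_history and its revision_is_identity_reverted field.
SELECT
  floor(p.predicted_classification.probabilities['true'] * 10) / 10 AS score_bucket,
  count(*) AS n_revisions,
  sum(CASE WHEN NOT h.revision_is_identity_reverted THEN 1 ELSE 0 END) / count(*)
    AS false_positive_rate
FROM event.mediawiki_page_revert_risk_prediction_change_v1 p
JOIN wmf.mediawiki_history h
  ON h.revision_id = p.revision.rev_id
WHERE h.snapshot = '2025-09'  -- mediawiki_history is partitioned by snapshot
GROUP BY 1
ORDER BY score_bucket;
```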

At the moment, everything contained in the Hive event database has a 90-day retention period. We would like to start by adding the following two tables to the event_sanitized schema:

  1. event.mediawiki_page_outlink_topic_prediction_change_v1
  2. event.mediawiki_page_revert_risk_prediction_change_v1

Useful links:
https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Event_Data_retention
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/
https://wikitech.wikimedia.org/wiki/Data_Platform/Event_Sanitization

Event Timeline

Change #1192900 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[analytics/refinery@master] expand_event_sanitized_analytics_allowlist: Add revert_risk prediction results to allowlist.

https://gerrit.wikimedia.org/r/1192900

Update

I will make two independent patches, one for each table.

For revert_risk_prediction_change I used the following schema:

mediawiki_page_revert_risk_prediction_change_v1:
    dt 
    revision:
        rev_id
    predicted_classification:
        model_name
        model_version
        predictions
        probabilities

I'm not sure, but I thought we could keep the rev_id for further usage (joins, etc.).
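
In the refinery sanitization allowlist format, that field selection would look roughly like the fragment below (a sketch of the YAML style used in the linked static_data/sanitization directory, not the exact content of the merged patch):

```yaml
# Sketch: keep only the fields listed in the schema above.
mediawiki_page_revert_risk_prediction_change_v1:
    dt: keep
    revision:
        rev_id: keep
    predicted_classification:
        model_name: keep
        model_version: keep
        predictions: keep
        probabilities: keep
```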

Change #1192909 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[analytics/refinery@master] expand_event_sanitized_analytics_allowlist: Add outlink_topic prediction results to allowlist.

https://gerrit.wikimedia.org/r/1192909

Change #1192909 abandoned by Gkyziridis:

[analytics/refinery@master] expand_event_sanitized_analytics_allowlist: Add outlink_topic prediction results to allowlist.

Reason:

The changes added in https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1192900

https://gerrit.wikimedia.org/r/1192909

Change #1192900 merged by Aqu:

[analytics/refinery@master] expand_event_sanitized_analytics_allowlist: Add revert_risk prediction results to allowlist.

https://gerrit.wikimedia.org/r/1192900

Update

Since neither table includes PII data, we configure them under static_data/sanitization/event_sanitized_main_allowlist.yaml in the analytics/refinery repo:

  • event.mediawiki_page_outlink_topic_prediction_change_v1: keep_all
  • event.mediawiki_page_revert_risk_prediction_change_v1: keep_all

This way, both tables are exported to the event_sanitized schema with all of their columns kept.
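
As a sketch, the corresponding allowlist entries would look like this (keyed by table name, following the format in the linked sanitization directory):

```yaml
# Sketch of the keep_all entries described above.
mediawiki_page_outlink_topic_prediction_change_v1: keep_all
mediawiki_page_revert_risk_prediction_change_v1: keep_all
```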

Can we verify that the tables exist before we resolve this? I ran a quick check and the table event_sanitized.mediawiki_page_outlink_topic_prediction_change_v1 doesn't seem to exist. If I understand the documentation correctly, there is a cron job that runs every hour, but perhaps something else is needed the first time(?)


Indeed, there is a cron job that runs every hour and exports data into event_sanitized, but we first need to deploy refinery to HDFS.
So merging the changes into the refinery repo is not enough; it also needs a deployment.
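
Once deployed, a quick existence check can be run from a spark-sql session (standard Spark SQL; the LIKE pattern is illustrative):

```sql
-- Check whether the sanitized copies of the two tables have appeared yet.
SHOW TABLES IN event_sanitized LIKE '*prediction_change*';
```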

I will update the task when the changes are deployed.

Thanks for clearing that up! I assume Data Engineering are the ones who deploy this, but we could open the patch for it.


They just deployed it.
I found instructions for deploying with scap here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Refine/Deploy_Refinery, but they (DE) are the ones who need to run the deployment.

image (3).png (938×2 px, 161 KB)

image (4).png (944×2 px, 178 KB)

kostajh subscribed.

@gkyziridis I'm testing this out today but only seeing revertrisk-language-agnostic for an example revision on enwiki, is that expected?

spark-sql (default)> select predicted_classification from event.mediawiki_page_revert_risk_prediction_change_v1 where revision.rev_id = 1333904928;
predicted_classification
{"model_name":"revertrisk-language-agnostic","model_version":"3","predictions":["false"],"probabilities":{"false":0.7348057627677917,"true":0.26519423723220825}}


Yes, this is expected, since this is the patch that is deployed: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1192900 . We keep the predictions from the language-agnostic model.

We currently do not store the predictions from the rr-multilingual model anywhere, so we cannot export them in the same way as we do for the rr-language-agnostic one.
If there is a need for this, I can open a new Phabricator task to start developing the first step: saving a slice of the rr-multilingual predictions into the event stream. Then we can add them to refinery and export them into event_sanitized as we do for rr-language-agnostic.