As an ML engineer, I would like model scores from events to be persisted in the data lake so that I can:
- Evaluate model performance in its application. For instance, for the revertrisk model, performance can be assessed by calculating precision, recall, and F1 score against revision outcomes (i.e., whether or not a revision was reverted)
- Calculate thresholds and create buckets for downstream applications. For example, in the Recent Changes filters we want to calculate buckets based on the percentage of false positives.
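The metrics above can be sketched as follows. This is an illustrative example only: the sample scores, the 0.5 threshold, and the function names are assumptions, not anything specified in this task. In practice these counts would come from joining the persisted prediction events with revert outcomes in the data lake.

```python
# Sketch: evaluating revert-risk predictions against actual revert outcomes.
# All inputs here are hypothetical; real scores/outcomes would be read from
# the persisted event tables.

def confusion_counts(scores, outcomes, threshold):
    """Count TP/FP/FN/TN for a given score threshold.

    scores   -- model revert-risk probabilities, one per revision
    outcomes -- True if the revision was actually reverted
    """
    tp = fp = fn = tn = 0
    for score, reverted in zip(scores, outcomes):
        predicted = score >= threshold
        if predicted and reverted:
            tp += 1
        elif predicted and not reverted:
            fp += 1
        elif not predicted and reverted:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Standard precision/recall/F1, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Sweeping `threshold` over a range and inspecting the resulting false-positive counts is one way the Recent Changes filter buckets could be derived.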
At the moment, everything in the Hive `event` database has a 90-day retention period. We would like to start with the following two tables and add them to the `event_sanitized` database:
- event.mediawiki_page_outlink_topic_prediction_change_v1
- event.mediawiki_page_revert_risk_prediction_change_v1
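In practice this would mean adding entries for these tables to the sanitization allowlist in the refinery repository linked below. The fragment here is purely hypothetical: the exact allowlist syntax and field names must be checked against the files under `static_data/sanitization/` before use.

```yaml
# Hypothetical allowlist fragment -- verify the real syntax in
# analytics/refinery static_data/sanitization/ before submitting.
mediawiki_page_outlink_topic_prediction_change_v1: keep_all
mediawiki_page_revert_risk_prediction_change_v1: keep_all
```

Whether `keep_all` is appropriate (versus keeping only selected, non-sensitive fields) would need review as part of the sanitization process.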
Useful links:
https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Event_Data_retention
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/
https://wikitech.wikimedia.org/wiki/Data_Platform/Event_Sanitization

