We have identified data quality issues with the event.page_title column in the EditorJourney schema's table in the Data Lake. A patch to stop logging data for that column is being prepared in T213974. We would like to have the existing data in the column blanked (replaced with an empty string).
Description
Related Objects
- Mentioned Here
- T213974: EditorJourney records HTML tags in page_title field
Event Timeline
@nettrom_WMF Data in Hive cannot be modified in place; it has to be deleted and rewritten (that is right, there are no alters on data in Hadoop: you can change a column's type but not its contents), so we do not do this type of modification. Our recommendation here would be to document the issues on the talk page.
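To illustrate what "deleted and rewritten" means in practice, here is a minimal sketch in plain Python. It is not how the data is actually stored or processed: the function name `blank_page_title`, the dict-based row layout, and the `action` field are assumptions made up for the example. The point is only that blanking a column amounts to producing a full new copy of every record, which would then replace the old data.

```python
from copy import deepcopy


def blank_page_title(rows):
    """Return rewritten copies of the rows with event.page_title set to ''.

    Hypothetical helper: rows are modeled as nested dicts, which is an
    assumption for illustration, not the real storage format.
    """
    rewritten = []
    for row in rows:
        new_row = deepcopy(row)  # the original records are never touched
        event = new_row.get("event")
        if isinstance(event, dict) and "page_title" in event:
            event["page_title"] = ""
        rewritten.append(new_row)
    return rewritten


# Made-up sample record; "action" is a placeholder field name.
original = [{"event": {"page_title": "<b>Foo</b>", "action": "ready"}}]
rewritten = blank_page_title(original)
```

Every record is copied even though only one field changes, which is why this kind of modification is costly compared with letting the data be purged on its normal schedule.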
EditorJourney data is not in the whitelist, or am I totally spacing out? https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml
We're fine with letting this data get purged as it otherwise would, so I'm closing this.