Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech
Closed, Resolved | Public

Event Timeline

nettrom_WMF triaged this task as Medium priority. Jul 2 2019, 8:59 PM

@ifried or @aezell: I'm not sure which of you to contact, so I'm pinging you both, sorry! It would be great if someone on the team could review this. If the team has any whitelisted EventLogging schemas, and the whitelist entry keeps a token, a patch to the whitelist should be created so that those tokens are hashed.
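
For readers unfamiliar with the request, here is a minimal sketch of what "hashing a token" could look like, assuming a salted HMAC-SHA256. This is an illustration only: the actual sanitization job may use a different algorithm and salt handling, and `hash_token`, the salt, and the sample token are hypothetical names and values.

```python
import hashlib
import hmac


def hash_token(token: str, salt: bytes) -> str:
    """Return a salted, one-way hash of a temporary identifier.

    Hypothetical helper: not taken from the refinery code base.
    """
    return hmac.new(salt, token.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up values for illustration only.
salt = b"example-salt"
hashed = hash_token("a1b2c3d4e5f6", salt)
```

The same token and salt always give the same hash, so events from one session can still be grouped, while the original token is no longer stored.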

Thanks @nettrom_WMF. We will take a look. Off the top of my head, I don't know whether we have any such schemas.

Is there a way for me to discover the schemas that this team has created? Our institutional memory is fuzzy. It'd be great to have an authoritative list of the schemas we created that are still gathering data.

Hi @aezell

The whitelist that we use as the single source of truth for which EL data gets sanitized and kept indefinitely is:
https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml
You may be able to find your schemas there by name.

If you have access to the Analytics cluster (i.e. stats machines), you can see a list of the schemas that have seen new data since 2017 with:
hdfs dfs -ls /wmf/data/event_sanitized

Also, you can visit each schema's talk page on Meta, which may indicate the schema's owner, creator, or team.
For instance: https://meta.wikimedia.org/wiki/Schema_talk:Popups

Cheers!
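
The by-name lookup in the whitelist suggested above could be sketched roughly as follows. It assumes each schema appears as an unindented top-level key in the YAML file, which should be checked against the real file. `whitelisted_schemas` is a hypothetical helper, and the sample text and the `ExampleSchema` and `AnotherSchema` names are made up rather than copied from `whitelist.yaml`.

```python
import re


def whitelisted_schemas(whitelist_text: str) -> set:
    """Collect unindented top-level keys, assumed here to be schema names."""
    return set(re.findall(r"^([A-Za-z0-9_]+):", whitelist_text, flags=re.MULTILINE))


# Made-up sample, NOT the real whitelist contents.
sample = """\
ExampleSchema:
    action: keep
AnotherSchema:
    event:
        sessionId: keep
"""

# Schemas a team believes it owns (ExampleSchema is hypothetical).
team_schemas = ["TemplateWizard", "ExampleSchema"]
found = sorted(name for name in team_schemas if name in whitelisted_schemas(sample))
```

With the real file's text in place of `sample`, `found` would list the team's schemas that are whitelisted and so need a closer look.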

@mforns Thanks for that info. It was very helpful.

@nettrom_WMF As best I can tell from the info provided, Comm Tech is not responsible for any of the EL data that is kept indefinitely and should be sanitized. So, I don't think there's anything for this team to do?

@aezell : I agree with you. I went through all the schemas listed on meta, and from what I could tell the only schema that Comm Tech owns is TemplateWizard. That schema was labelled as "in development" but appears to be actively gathering data, so I updated its status to reflect that. It isn't whitelisted, though, so as far as I can tell, there's nothing more to do here. Closing as "resolved".

@aezell We would need to go through a list of our team's projects and review whether each of them is using EL or not.

@MaxSem Thanks for that. It turns out the concern isn't use of EL as such, but having whitelisted schemas whose data is kept for longer than the normal retention period. We do not have any such schemas.