
Hash all pageTokens or temporary identifiers from the EL Sanitization white-list
Closed, Resolved, Public


Some background for this task:

Historically, we agreed to hash all permanent identifiers in EventLogging data via the EL sanitization white-list.
When hashing, we add a salt to the token, and we rotate the salt every 3 months.
This way, hashed tokens can still link events generated by the same token-holder,
but only for events that fall within the same 3-month period (the salt rotation breaks the link).
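
As a rough illustration of the scheme described above, here is a minimal Python sketch. The function names, the use of HMAC-SHA256, and the hard-coded salts are all assumptions made for this example; the actual sanitization job may hash and store salts differently.

```python
import hashlib
import hmac
from datetime import date


def quarter_key(day: date) -> str:
    """Return a label such as '2019-Q1' identifying the salt period."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def hash_token(token: str, salt: bytes) -> str:
    """Keyed hash of a token (HMAC-SHA256 is an assumption of this sketch)."""
    return hmac.new(salt, token.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical salts, one per quarter; real salts would be random and secret.
SALTS = {"2019-Q1": b"salt-for-q1", "2019-Q2": b"salt-for-q2"}


def sanitize(token: str, day: date) -> str:
    """Hash a token with the salt of the quarter the event belongs to."""
    return hash_token(token, SALTS[quarter_key(day)])


# Same token, same quarter: the hashes match, so events can be linked.
same_quarter = sanitize("abc", date(2019, 1, 5)) == sanitize("abc", date(2019, 3, 30))
# Same token, different quarters: the hashes differ, so the link is broken.
across_quarters = sanitize("abc", date(2019, 3, 30)) == sanitize("abc", date(2019, 4, 1))
```

Here `same_quarter` is true and `across_quarters` is false, which is the linking behavior the salt rotation is meant to provide.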

There is, though, one potential weakness in this approach. Around the salt rotation time (end/beginning of a quarter),
if another non-hashed identifier, such as a short-lived token that we decided not to hash, spans a little before
and after the rotation, then it can be used to link events on both sides of it
and associate the hashed tokens before and after, creating a chain that defeats the effect of hashing+salting.
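
To make the chain concrete, here is a small Python sketch of how such a link could be found. The event layout, the placeholder hash values, and the function name are hypothetical; only the field names `app_install_id` and `sessionToken` come from the white-list.

```python
from collections import defaultdict


def link_across_rotation(events):
    """Map each un-hashed token to the set of hashed IDs seen with it.

    A token seen with more than one hashed ID bridges the salt rotation
    and lets the two hashed IDs be attributed to the same holder.
    """
    seen = defaultdict(set)
    for event in events:
        seen[event["sessionToken"]].add(event["app_install_id"])
    return {token: ids for token, ids in seen.items() if len(ids) > 1}


# Placeholder hashed IDs; "s1" is a session that straddles the rotation.
events = [
    {"app_install_id": "hash_q1_aaa", "sessionToken": "s1"},  # before rotation
    {"app_install_id": "hash_q2_bbb", "sessionToken": "s1"},  # after rotation
    {"app_install_id": "hash_q1_ccc", "sessionToken": "s2"},  # no overlap
]
links = link_across_rotation(events)
```

In this toy data, `links` contains only `"s1"`, whose two hashed IDs can now be treated as the same install even though their salts differ.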

I believe the risk is not high, because only a very small proportion of users generate events
close enough to the salt rotation time for the non-hashed temporary tokens to form the chain.
However, the longer-lived the temporary token, the higher the risk: non-hashed tokens that span half an hour or less are OK,
but non-hashed tokens that span 1 week or more are not OK.

The actual task:

This task is about finding such non-hashed tokens and marking them to be hashed in the EL white-list.
In principle, the only disadvantage of doing so is that they will be "cut" at the end/start of a quarter. For example, if a non-hashed
"session" token spans about half an hour and straddles the salt rotation time, then hashing will split it into two sessions,
one before the salt rotation and one after. All other sessions, not around the end/start of a quarter, will remain intact.
The shorter-lived a token, the smaller the negative effect hashing will have on it.

In the following white-list excerpt, app_install_id is hashed but sessionToken is not, which potentially invalidates the effect of hashing+salting app_install_id.

        action: keep
        appInstallID: hash
        app_install_id: hash
        newLang: keep
        oldLang: keep
        sessionToken: keep
        source: keep
        timeSpent: keep
        client_dt: keep
        os_family: keep
        wmf_app_version: keep
    webhost: keep
    wiki: keep
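
One possible way to look for candidates is to scan the parsed white-list for token-like field names that are still marked `keep`. The sketch below assumes the white-list has already been loaded into nested dictionaries; the schema name `ExampleSchema`, the `event` nesting, and the name-matching heuristic are all assumptions of this example, and any hit would still need manual review.

```python
import re

# Heuristic (an assumption): field names ending in "token", "_id", "Id" or "ID".
TOKEN_LIKE = re.compile(r"(?:[Tt]oken|_id|I[dD])$")


def find_unhashed(node, path=()):
    """Return dotted paths of token-like fields whose action is 'keep'."""
    found = []
    for key, value in node.items():
        if isinstance(value, dict):
            # Recurse into nested sections such as a schema or its event body.
            found.extend(find_unhashed(value, path + (key,)))
        elif value == "keep" and TOKEN_LIKE.search(key):
            found.append(".".join(path + (key,)))
    return found


# Hypothetical white-list entry modeled on the excerpt above.
whitelist = {
    "ExampleSchema": {
        "event": {
            "action": "keep",
            "appInstallID": "hash",
            "app_install_id": "hash",
            "sessionToken": "keep",
            "timeSpent": "keep",
        },
        "webhost": "keep",
        "wiki": "keep",
    }
}
candidates = find_unhashed(whitelist)
```

For this input, `candidates` holds the single path `ExampleSchema.event.sessionToken`, the field that would be switched from `keep` to `hash`.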

Event Timeline

fdans moved this task from Incoming to Data Quality on the Analytics board.
mforns added a project: Product-Analytics.
mforns added a subscriber: nettrom_WMF.

This is unlikely to be an issue for the schemas that the Growth Team works with, as those only store short-term tokens.

Currently the token in the HelpPage schema is not hashed. I'd like to wait to submit a patch for that until I also whitelist the two Homepage-related schemas, as the token is shared between those.

Adding the other analysts as this might affect schemas they work with.

kzimmerman added a subscriber: kzimmerman.

@mforns it looks like the subtasks we had for this have all been resolved; can this be closed?

And thank you *a lot* @kzimmerman, @nettrom_WMF and all others for taking care of this annoying task.