
Some event data (like that coming from MediaWiki events such as revision create) should not get sanitized
Open, Normal · Public

Description

Currently, all data goes into the event database, which is subject to sanitization. Some data, however, is public by nature and should probably not be sanitized. Could it go to a different DB, or rather, should all public data be whitelisted?

Event Timeline

Nuria created this task. · Feb 27 2019, 6:02 PM
Restricted Application added a subscriber: Aklapper. · Feb 27 2019, 6:02 PM
Milimetric renamed this task from Public event data incoming from eventgate should go into db that does not get sanitized? to Some event data should not get sanitized. · Feb 28 2019, 5:40 PM
Milimetric triaged this task as Normal priority.
Milimetric updated the task description.
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.
Nuria renamed this task from Some event data should not get sanitized to Some event data (like that coming from MediaWiki events such as revision create) should not get sanitized. · May 17 2019, 4:54 PM
Nuria updated the task description.
mforns added a subscriber: mforns. · Edited · May 17 2019, 5:02 PM

I believe, as Andrew once suggested, that instead of having an "event" db and an "event_sanitized" db, we should have an "event_unsanitized" db and an "event" db.
This way, event_unsanitized would contain only temporary data that is deleted after 90 days, and the event db would contain the final (sanitized if necessary) data.
The data sets that we control and do not need sanitization could be directly ingested into the final event database.
This would make everything easier.
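
To make the proposed layout concrete, here is a minimal sketch of how ingestion could be routed. It is only an illustration under stated assumptions: the `Dataset` class, the `needs_sanitization` flag, both function names, and the table name `some_eventlogging_schema` are hypothetical; only the two database names come from the proposal above.

```python
from dataclasses import dataclass

# Database names as proposed in this comment.
UNSANITIZED_DB = "event_unsanitized"  # temporary data, deleted after 90 days
FINAL_DB = "event"                    # final data (sanitized if necessary)


@dataclass(frozen=True)
class Dataset:
    name: str                 # hypothetical table name
    needs_sanitization: bool  # hypothetical per-dataset flag


def ingestion_target(dataset: Dataset) -> str:
    """Return the database a dataset is first ingested into."""
    return UNSANITIZED_DB if dataset.needs_sanitization else FINAL_DB


def sanitization_jobs(datasets: list[Dataset]) -> list[tuple[str, str]]:
    """Return (source, destination) table pairs for the sanitization step."""
    return [
        (f"{UNSANITIZED_DB}.{d.name}", f"{FINAL_DB}.{d.name}")
        for d in datasets
        if d.needs_sanitization
    ]


datasets = [
    Dataset("some_eventlogging_schema", needs_sanitization=True),
    Dataset("mediawiki_revision_create", needs_sanitization=False),
]

for d in datasets:
    print(d.name, "->", ingestion_target(d))
print(sanitization_jobs(datasets))
```

In this sketch, public data sets never pass through the sanitization step at all; only data sets flagged as sensitive get a sanitization job that copies them from the temporary database into the final one.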

Nuria added a comment. · May 17 2019, 6:21 PM

The idea looks fine, but I do not think it would be wise to change naming at this stage.

If I understand correctly, the event database now contains the eventlogging data for up to 90 days, plus all the other events that never get sanitized (they contain only public data) since inception. The event_sanitized database now contains only sanitized eventlogging data.

It is confusing because we keep the mediawiki_blah events in the event database because we are not dropping them, not because they are being published to event_sanitized, right?

The idea looks fine, but I do not think it would be wise to change naming at this stage.

I think changing naming now will indeed be a bit of work (databases, jobs, coordinated deployment, documentation, notifying people, etc.).
And it's likely that we would mess something up and have to do backfilling and such.
But I think the advantages of this approach are also significant:

  • No need for a blacklist in the deletion-after-90-days script (which is dangerous: if it fails, non-EL data could be deleted from the event database).
  • No need for a blacklist in the sanitization process (not so dangerous, but avoidable).
  • Better organization of data, which would make it easier to have more data sets deleted after 90 days and sanitized, and would avoid confusion overall.
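
The first point can be illustrated with a small sketch. It assumes a simplified purge that drops partitions older than 90 days and takes an optional exclusion list; the `purge` function, the in-memory table layout, and the table names are hypothetical and do not reflect the actual deletion script.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=90)

Tables = dict[str, list[date]]  # table name -> partition dates


def purge(tables: Tables, today: date,
          excluded: frozenset[str] = frozenset()) -> Tables:
    """Drop partitions older than the retention window, skipping excluded tables."""
    cutoff = today - RETENTION
    return {
        name: list(parts) if name in excluded
        else [p for p in parts if p >= cutoff]
        for name, parts in tables.items()
    }


today = date(2019, 5, 17)
old, recent = date(2018, 1, 1), date(2019, 5, 1)

# Current layout: public and sensitive tables share one database, so the
# purge depends on an exclusion list to protect the public ones.
event_db = {
    "some_eventlogging_schema": [old, recent],
    "mediawiki_revision_create": [old, recent],
}
ok = purge(event_db, today, excluded=frozenset({"mediawiki_revision_create"}))
broken = purge(event_db, today)  # exclusion list missing: public history is lost

# Proposed layout: the purge only ever sees event_unsanitized, which holds
# no public tables, so no exclusion list is needed.
event_unsanitized_db = {"some_eventlogging_schema": [old, recent]}
proposed = purge(event_unsanitized_db, today)
```

With the shared database, a single mistake in the exclusion list removes old partitions of the public table; with a separate temporary database, the purge cannot reach that table in the first place.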

If I understand correctly, the event database now contains the eventlogging data for up to 90 days, plus all the other events that never get sanitized (they contain only public data) since inception. The event_sanitized database now contains only sanitized eventlogging data.

Yes

It is confusing because we keep the mediawiki_blah events in the event database because we are not dropping them, not because they are being published to event_sanitized, right?

Yes, exactly. You said it, it's confusing :]
But I think we should not copy them over to the sanitized database; they should not undergo an all-whitelisted sanitization process if they are already non-privacy-sensitive. They should be ingested directly into the sanitized database, only it should be named just "event". This way we make the sanitized version of the data the "default" and natural option, and leave the unsanitized version as temporary/secondary.