
Investigate the effects of IP Masking on Data Eng systems
Closed, Resolved | Public | 5 Estimated Story Points

Description

Plan: this feels like a "many eyes make all bugs shallow" kind of thing. There are a lot of data pipelines involved and we can find all the affected ones together!

Those of us who are familiar with the pipelines, especially @JAllemandou, @mforns, @Milimetric for any batch jobs and @Ottomata for any EventBus-generated streams, should definitely be involved. But @ntsako, @Snwachukwu, and @Antoine_Quhen have all worked on at least some of the affected jobs. It's probably useful for everyone to search and think on their own for a couple hours and then have a couple of meetings where we brainstorm together. Beyond that though, if we miss something, the impact should be very obvious and we should be able to recover easily by going back to the source data and re-running jobs. So it probably isn't worth spending a huge amount of time making sure we don't miss a single line of SQL.

Steps

  • Work in pairs or small groups to discuss what data pipelines might be affected by IP masking
  • Brainstorm in a bigger meeting the workflows, processes or data pipelines affected (if any)
  • Decide as a team if IP masking affects our team or not and communicate it via this phab ticket
  • Fix accordingly

Timeline: If there are any fixes needed, they should be done by Q1 next fiscal year

*useful links*

Event Timeline

EChetty updated the task description.
Milimetric raised the priority of this task from High to Needs Triage. Jan 26 2023, 5:28 PM
Milimetric updated the task description.
EChetty set the point value for this task to 5.

@mforns to take the lead on this.

Quick update from our last conversation:

It seems that an is_temp column was considered but rejected by Data Persistence as something that would cause too much trouble. We decided to investigate that a bit, to understand exactly what would be involved, because the tech debt we have to incur by not adding the column is also significant. So we're currently trying to schedule a conversation with Data Persistence.

The general plan, currently being reviewed by Product-Analytics, is as follows:

  • Change pipelines as far upstream as possible to make temporary users look pretty much like anonymous users
    • So anywhere we have user_is_anonymous or similar, this should be true for old anon users and new temporary users
    • However, since new temporary users will have a real user_id, downstream consumers will still need to verify their logic does not rely on something like user_id == 0 to mean "anonymous"
  • Track new data about temporary users and check our assumption that it does not affect or blur top line metrics too much
  • After some analysis, decide whether to make this decision permanent or add new fields (e.g. user_is_temporary) to help create more granular metrics that treat anonymous and temporary separately
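
To make the first bullet concrete, here is a rough Python sketch of how an upstream step might set user_is_anonymous for both old anonymous users and new temporary users, and why a downstream user_id == 0 check would miss the latter. Everything in it is an assumption for illustration: the UserRecord type, the looks_temporary helper, and especially the "TEMP-" name prefix are hypothetical placeholders, not the real MediaWiki naming scheme or any existing pipeline code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UserRecord:
    """Hypothetical, simplified view of a user as seen by a pipeline."""
    user_id: Optional[int]  # 0 or None for old IP-based anonymous users
    user_name: str


def looks_temporary(user_name: str) -> bool:
    # ASSUMPTION: placeholder rule only. The real way to recognise a
    # temporary account is not specified in this task.
    return user_name.startswith("TEMP-")


def user_is_anonymous(
    user: UserRecord,
    is_temporary: Callable[[str], bool] = looks_temporary,
) -> bool:
    """True for old anonymous users and for new temporary users."""
    if not user.user_id:  # None or 0: old-style anonymous user
        return True
    # Temporary users have a real user_id, so they need a separate check.
    return is_temporary(user.user_name)


def fragile_is_anonymous(user: UserRecord) -> bool:
    """The downstream pattern to look for: relies on user_id == 0."""
    return user.user_id == 0


records = [
    UserRecord(user_id=0, user_name="203.0.113.7"),  # old anonymous user
    UserRecord(user_id=42, user_name="TEMP-0001"),   # new temporary user
    UserRecord(user_id=7, user_name="Alice"),        # registered user
]

upstream_flags = [user_is_anonymous(r) for r in records]
fragile_flags = [fragile_is_anonymous(r) for r in records]
```

In this sketch the two approaches disagree only on the temporary user, which is the case downstream consumers would need to audit for.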

We are moving this task to done, since it was an initial investigation. The next steps are:

  • Continue conversation about this plan and validate that it's sound by looking at affected code
  • Plan and estimate the work that we'll need to do in Data-Engineering
NOTE: The main rationale for this plan is that if we change our mind at any point, we don't lose any data in this pipeline. So we can regenerate output with any new approach we decide on. This makes it safer to keep changes to a minimum right now and re-assess.