Page MenuHomePhabricator

Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2)
Closed, ResolvedPublic8 Estimated Story Points

Description

When the research on the 2017 survey is finished (December 2017) the privacy-sensitive fields should be removed from the white-list for the schemas to be in compliance with the data retention guidelines.

The fields to remove are:

  • QuickSurveyInitiation: userAgent
  • QuickSurveysResponses: userAgent, event_pageId, event_pageTitle

We also need to issue an update statement to EL's slaves to nullify the corresponding table fields since the beginning of time.

Event Timeline

Change 405727 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Remove sensitive fields from whitelist for QuickSurvey schemas

https://gerrit.wikimedia.org/r/405727

@fdans this is ongoing research and we cannot lose this data now. Please do not merge the code.

@mforns we're ready to proceed with this task. Can you, for archive happiness, call out which fields will be dropped after the purge? And, let's make sure we don't lose the whole table please. :)

Ping @fdans to continue with task looks like @leila needs to review the whitelist changes and fields that will be kept

@leila the fields to be nullified are the ones detailed in the task:

QuickSurveyInitiation: userAgent
QuickSurveysResponses: userAgent, event_pageId, event_pageTitle

These fields will be removed from the whitelist and then we'll nullify them from the beginning of time. The rest of the data for these two schemas will be kept intact.

@fdans got it. If I remember correctly this was the initial list and when @mforns and I talked a few weeks ago we arrived at a different list that would allow us to keep the event_pageId, event_pageTitle. @mforns can you let us know?

Yes, @leila, @fdans and all

I totally forgot to formalize our agreement in this task, sorry. Here it goes:

  • QuickSurveyInitiation: DROP userAgent; KEEP everything else.
  • QuickSurveysResponses: DROP userAgent, event_userLanguage, event_skin, event_countryCode; KEEP everything else (including event_pageId and event_pageTitle).

Added myself to the gerrit change as a reviewer.

Change 405727 merged by Ottomata:
[operations/puppet@production] Remove sensitive fields from whitelist for QuickSurvey schemas

https://gerrit.wikimedia.org/r/405727

@fdans to apply changes to historical data

Quick sanity check:

UPDATE QuickSurveyInitiation_15278946 SET userAgent = null;

UPDATE QuickSurveysResponses_15266417 SET userAgent = null, event_countryCode = null, event_userLanguage = null, event_skin = null;

These are the update queries taking into account 3 months ago, but:

MariaDB [log]> select count(*) from QuickSurveyInitiation_15278946 WHERE timestamp < '20171221000000';
+-----------+
| count(*)  |
+-----------+
| 321655959 |
+-----------+
1 row in set (1 min 22.47 sec)

MariaDB [log]> select count(*) from QuickSurveyInitiation_15278946 WHERE timestamp > '20171221000000';
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

MariaDB [log]> select count(*) from QuickSurveysResponses_15266417 WHERE timestamp > '20171221000000';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [log]> select count(*) from QuickSurveysResponses_15266417 WHERE timestamp < '20171221000000';
+----------+
| count(*) |
+----------+
|  1182164 |
+----------+
1 row in set (0.40 sec)

So we are thinking about removing the where clause to just sanitize all. @mforns whenever you have time can you code review?

@elukey
Queries look good to me!
Yea, we can remove the where timestamp < ... clauses.
I hope the updates do not take days...
+2

Mentioned in SAL (#wikimedia-analytics) [2018-03-26T13:09:19Z] <fdans> stopped 2 mysql consumers as precaution for T174386

Update queries executed on db1108 and db1107, all good!

fdans set the point value for this task to 8.Apr 6 2018, 11:15 AM