Page MenuHomePhabricator

Make non-nullable columns in EL database nullable
Closed, ResolvedPublic13 Story Points

Description

Get all columns that are non-nullable in EL tables (i.e. via a select statement)
And alter-table them to become nullable.


Read more for context here:

Use case:
We need to apply data retention guidelines to EL database.
To do that, we're developing this script in T156933.
It uses this white-list to decide which tables and fields to purge.
There are 2 types of purge: full purge and partial purge. Full purge deletes the whole records after 90 days, whereas partial purge only sets to NULL the fields of the table that are not in the white-list (after 90 days as well).

Issue:
EL schemas allow to specify that a field is "required". EL interprets this and creates the corresponding table with the corresponding non-nullable field. This makes total sense, but prevents the purging script to set that field to NULL, in cases where the field is part of a partially purged schema.

Proposed solution:
Change EL so that all table fields are nullable, even if they are specified to be "required" in the schema definition. The "required" flag would not loose its meaning, because EL's event processor would still validate the event against the schema and check that the received event indeed possesses the required field. Only the database would be more "relaxed" in the way it stores the data, thus allowing the purging script to set values to NULL for all fields.

Derived tasks:

  • Modify EventLogging code so that all generated columns are nullable
  • Alter tables in EventLogging database to make all non-nullable columns nullable (THIS TASK)

Other potential solutions:

  1. Use a garbage value instead of NULL depending on field type, i.e.: 0 for numbers, false for booleans, '' for strings, etc. This is the easiest to implement, but it's super inconvenient for querying the tables, because you don't know whether a value is redacted or not.
  2. Do not allow schemas with required fields to be partially purged and force full purge. Also easy to implement, but super inconvenient for schema owners, because the would loose data that otherwise could be kept following the data retention guidelines.

Details

Related Gerrit Patches:

Event Timeline

mforns created this task.Jun 6 2017, 5:37 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 6 2017, 5:37 PM
mforns claimed this task.Jun 7 2017, 7:25 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 357665 had a related patch set uploaded (by Mforns; owner: Mforns):
[eventlogging@master] [WIP] Make required fields into nullable columns

https://gerrit.wikimedia.org/r/357665

elukey added a comment.EditedJun 12 2017, 1:32 PM

I came up with a preliminary list of ALTER statements to run: https://phabricator.wikimedia.org/P5570

Procedure:

  1. SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH FROM information_schema.COLUMNS WHERE TABLE_SCHEMA='log' AND IS_NULLABLE='NO' AND COLUMN_NAME NOT IN ('id', 'uuid', 'timestamp');
  1. Transformed in a CSV
  1. Generated SQL statements with the following Python script:
from __future__ import print_function
import csv
from collections import defaultdict

with open('not_nullable.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    tables = defaultdict(list)
    for row in reader:
        tables[row[0]].append((row[1],row[2],row[3]))
    for table in tables:
	alter = "ALTER TABLE {} ".format(table)
    	for attribute in tables[table]:
	    if "int" not in attribute[1] and "float" not in attribute[1]:
		alter += "MODIFY COLUMN `{}` {}({}) NULL, ".format(attribute[0],attribute[1],attribute[2])
            else:
                alter += "MODIFY COLUMN `{}` {} NULL, ".format(attribute[0],attribute[1])
	alter = alter.rstrip().rstrip(",")
	print(alter + ";")
mforns added a comment.EditedJun 12 2017, 5:27 PM

I think the idea and script are pretty good!

If we want to reduce the number of alter statements, we could join all statements that pertain to the same table and have a statement like:

ALTER TABLE blah_123 MODIFY col_1 type_1 NULL, MODIFY col_2 type_2 NULL, ... MODIFY col_N type_N NULL;

We'd go from 1627 to less than 400, though I don't think this is super necessary.

mforns reassigned this task from mforns to elukey.Jun 13 2017, 3:10 PM
Nuria set the point value for this task to 3.Jun 13 2017, 3:10 PM

Sorry, the gerrit patch belongs to another task.
This task is about the alter tables to modify non-nullable columns in EL database.

Nuria removed the point value for this task.Jun 13 2017, 3:15 PM

Change 357665 merged by Nuria:
[eventlogging@master] Make required fields into nullable columns

https://gerrit.wikimedia.org/r/357665

Following @Marostegui's advice I changed the script to generate more precise alter tables, adding to varchar and varbinary their length (like varchar(191)).

Milimetric triaged this task as Medium priority.Jun 22 2017, 3:07 PM
elukey moved this task from Backlog to In Progress on the User-Elukey board.Jun 23 2017, 10:22 AM

Mentioned in SAL (#wikimedia-operations) [2017-06-27T08:46:32Z] <elukey> executing alter tables to the log database on db1047 for https://phabricator.wikimedia.org/T167162#3340421

Mentioned in SAL (#wikimedia-operations) [2017-06-27T09:49:54Z] <marostegui> executing alter tables to the log database on dbstore1002 for https://phabricator.wikimedia.org/T167162#3340421

We'll use this task to track the work on db1047 and dbstore1002, but it will likely take days before completion.

I think the idea and script are pretty good!
If we want to reduce the number of alter statements, we could join all statements that pertain to the same table and have a statement like:

ALTER TABLE blah_123 MODIFY col_1 type_1 NULL, MODIFY col_2 type_2 NULL, ... MODIFY col_N type_N NULL;

We'd go from 1627 to less than 400, though I don't think this is super necessary.

You were completely right, this morning I changed the script to produce the new grouped alter table statements after reading how mysql implements them internally (copy/create tmp table, alter schema, inserts rows, rename tmp table).

Tested with the Edit table and I was able to run 4 modify column in the same time as 1.

@elukey @Marostegui

Hey! I think I have good news.

tl;dr
At least the top 10 biggest EL tables in the log DB do not need to be altered, because all the fields that need to be sanitized are already nullable.

detailed
By some divine chance, the top 10 biggest (in MB) tables do not need to be altered, because their non-nullable fields are already in the white-list, so those fields will not be set to NULL by the purging script.
Potential downside: It's not ideal to leave them non-nullable, because (in an unlikely scenario) we might want to remove such a non-nullable field from the white-list. If that case, the script would (gracefully) fail.
However, these fields are white-listed because they are non-sensitive in their context, so it would be really unlikely that we have to remove them from the white-list in the future.

See the tables affected in the following analysis:

+------------------------------------------------------------+------------+
| Table                                                      | Size in MB |
+------------------------------------------------------------+------------+
| MobileWebUIClickTracking_10742159_15423246                 |  531785.73 |
| PageContentSaveComplete_5588433_15423246                   |  486468.91 |
| MediaViewer_10867062_15423246                              |  346605.74 |
| Edit_11448630                                              |  171497.45 |
| _Edit_11448630_old                                         |  169625.44 |
| MobileWikiAppToCInteraction_10375484_15423246              |  146949.85 |
| Edit_13457736_15423246                                     |  136356.73 |
| MobileWebSectionUsage_15038458                             |   88912.17 |
| MobileWikiAppToCInteraction_8461467                        |   85811.01 |
| MobileWikiAppSearch_10641988_15423246                      |   85581.97 |



non-nullable fields in MobileWebUIClickTracking_10742159_15423246:
wiki, event_mobileMode -> both already white-listed -> YAY!

non-nullable fields in PageContentSaveComplete_5588433_15423246:
none -> YAY!

non-nullable fields in MediaViewer_10867062_15423246:
wiki, event_action, event_samplingFactor -> all already white-listed -> YAY!

non-nullable fields in Edit_11448630:
wiki, event_action, event_editingSessionId, event_editor, event_integration, event_mediawiki.version, event_platform, event_user.editCount, event_user.id, event_version
-> all already white-listed -> YAY!

non-nullable fields in _Edit_11448630_old:
(same as Edit_11448630) -> this table is not in the whitelist, and will be fully purged -> YAY!

non-nullable fields in MobileWikiAppToCInteraction_10375484_15423246:
wiki, event_action, event_appInstallID -> all already white-listed -> YAY!

non-nullable fields in Edit_13457736_15423246:
(same as Edit_11448630) -> all already white-listed -> YAY!

non-nullable fields in MobileWebSectionUsage_15038458:
wiki, event_eventName, event_hasServiceWorkerSupport, event_isTablet, event_namespace, event_pageId, event_sectionCount, event_sessionId
-> all already white-listed -> YAY!

non-nullable fields for MobileWikiAppToCInteraction_8461467:
wiki, event_action, event_tocInteractionToken -> all already white-listed -> YAY!

non-nullable fields for MobileWikiAppSearch_10641988_15423246:
wiki, event_action, event_appInstallID, event_searchSessionToken -> all already white-listed -> YAY!

Let me know if this makes sense to you!
Cheers

Hey,

I don't have much to said about what needs to be NULLABLE and what not really, that is for you guys to decide, as you have all the context.
However, I am happy that those do not need to be altered, as I am pretty sure it would take weeks, yes, weeks to get it done because I tried to alter revision table on db1047, which size is:

+----------+------------+
| Table    | Size in MB |
+----------+------------+
| revision |  266623.53 |
+----------+------------+
1 row in set (0.11 sec)

So smaller than those, and it took 13 days to complete around 48% of it...
@elukey maybe it is worth generating the alters again excluding those tables?

Thanks!

I am currently running the alter tables without the any table having #rows > 250M. I'll leave them running until Monday since it seems that 250M is a achievable limit (even if db1047 is still trashing). In case this doesn't work I'll exclude also the tables that Marcel mentioned ok?

Thanks @Marostegui for the note on alter table timing! Certainly it wouldn't be possible :]

Awesome @elukey, let's see how fast those tables are. If they are still a bit slow, I can look at the second half of the top 20 biggest tables and see if they can also be left without alter table! It would take me 1 hour or so, and if we're lucky we can skip all tables with #rows > 100M as well!

I excluded the following tables from the alters:

PageContentSaveComplete_5588433
PageCreation_7481635_15423246
MobileWebSectionUsage_14321266
NavigationTiming_15485142_15423246
Popups_15906495
MobileWikiAppMediaGallery_10923135_15423246
MobileWikiAppPageScroll_14591606_15423246
MobileWebEditing_8599025
MobileWikiAppArticleSuggestions_11448426_15423246
MobileWebSectionUsage_15038458
MobileWikiAppToCInteraction_8461467
MobileWikiAppSearch_10641988_15423246
Edit_13457736_15423246
Edit_11448630
MobileWikiAppToCInteraction_10375484_15423246
MediaViewer_10867062_15423246
MobileWebUIClickTracking_10742159_15423246

These are the ones containing more than 100M rows if I got it correctly :)

Mentioned in SAL (#wikimedia-operations) [2017-07-03T14:40:41Z] <elukey> running EventLogging alter tables on dbstore1002 (script in /home/elukey/dbstore1002.sql) - T167162

elukey added a comment.EditedJul 6 2017, 9:48 AM

Alter tables completed on db1047, these are the remaining tables with NOT NULLABLE columns that are NOT whitelisted afaics:

MobileWikiAppArticleSuggestions_11448426_15423246	event_pageTitle
MobileWikiAppArticleSuggestions_11448426_15423246	event_readMoreList

MobileWikiAppMediaGallery_10923135_15423246	event_imageTitle
MobileWikiAppMediaGallery_10923135_15423246	event_pageTitle

MobileWikiAppPageScroll_14591606_15423246	event_appInstallID

PageCreation_7481635_15423246	userAgent

Size in MB of those tables:

MobileWikiAppArticleSuggestions_11448426_15423246	81512.68
MobileWikiAppMediaGallery_10923135_15423246	55320.82
MobileWikiAppPageScroll_14591606_15423246	46139.81
PageCreation_7481635_15423246	42493.36

The alters would be the following:

ALTER TABLE MobileWikiAppPageScroll_14591606_15423246 MODIFY COLUMN `wiki` varchar(191) NULL, MODIFY COLUMN `event_appInstallID` varchar(191) NULL, MODIFY COLUMN `event_maxPercentViewed` bigint NULL, MODIFY COLUMN `event_pageHeight` bigint NULL, MODIFY COLUMN `event_pageID` bigint NULL, MODIFY COLUMN `event_scrollFluxDown` bigint NULL, MODIFY COLUMN `event_scrollFluxUp` bigint NULL, MODIFY COLUMN `event_timeSpent` bigint NULL;
ALTER TABLE MobileWikiAppArticleSuggestions_11448426_15423246 MODIFY COLUMN `wiki` varchar(191) NULL, MODIFY COLUMN `event_action` varchar(191) NULL, MODIFY COLUMN `event_appInstallID` varchar(191) NULL, MODIFY COLUMN `event_pageTitle` varchar(191) NULL, MODIFY COLUMN `event_readMoreList` varchar(191) NULL, MODIFY COLUMN `event_version` bigint NULL;
ALTER TABLE MobileWikiAppMediaGallery_10923135_15423246 MODIFY COLUMN `wiki` varchar(191) NULL, MODIFY COLUMN `event_action` varchar(191) NULL, MODIFY COLUMN `event_appInstallID` varchar(191) NULL, MODIFY COLUMN `event_gallerySessionToken` varchar(191) NULL, MODIFY COLUMN `event_imageTitle` varchar(191) NULL, MODIFY COLUMN `event_pageTitle` varchar(191) NULL;
ALTER TABLE PageCreation_7481635_15423246 MODIFY COLUMN `userAgent` varchar(191) NULL;

I am currently trying the one for PageCreation_7481635_15423246, if doable I'll proceed with the other ones around the same size but I am a bit afraid that MobileWikiAppArticleSuggestions_11448426_15423246 will take ages (or just simply not finish because the host refuses to do so).

elukey added a comment.Jul 7 2017, 1:09 PM

Alters just finished, ran the following two queries from stat1002:

elukey@stat1002:~$ mysql -h analytics-slave.eqiad.wmnet <<< "SELECT distinct TABLE_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA='log' AND IS_NULLABLE='NO' AND COLUMN_NAME NOT IN ('id', 'uuid', 'timestamp');"
TABLE_NAME
Edit_11448630
Edit_13457736_15423246
MediaViewer_10867062_15423246
MobileWebEditing_8599025
MobileWebSectionUsage_14321266
MobileWebSectionUsage_15038458
MobileWebUIClickTracking_10742159_15423246
MobileWikiAppSearch_10641988_15423246
MobileWikiAppToCInteraction_10375484_15423246
NavigationTiming_15485142_15423246
PageContentSaveComplete_5588433
Popups_15906495
elukey@stat1002:~$ mysql -h analytics-store.eqiad.wmnet <<< "SELECT distinct TABLE_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA='log' AND IS_NULLABLE='NO' AND COLUMN_NAME NOT IN ('id', 'uuid', 'timestamp');"
TABLE_NAME
Edit_11448630
Edit_13457736_15423246
MediaViewer_10867062_15423246
MobileWebEditing_8599025
MobileWebSectionUsage_14321266
MobileWebSectionUsage_15038458
MobileWebUIClickTracking_10742159_15423246
MobileWikiAppSearch_10641988_15423246
MobileWikiAppToCInteraction_10375484_15423246
MobileWikiAppToCInteraction_8461467
NavigationTiming_15485142_15423246
PageContentSaveComplete_5588433
Popups_15906495
_Edit_11448630_old

As far as I can see, all the tables with NOT NULLABLE fields are whitelisted, so this task should be done!

elukey set the point value for this task to 13.Jul 7 2017, 1:09 PM
elukey added a comment.Jul 7 2017, 1:41 PM

Re-checked for both -slave and -store all the not nullable fields, they are whitelisted.

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Jul 7 2017, 1:41 PM
mforns added a comment.Jul 7 2017, 1:55 PM

Awesooome!

elukey moved this task from In Progress to Done on the User-Elukey board.Jul 10 2017, 8:42 AM
Nuria closed this task as Resolved.Jul 10 2017, 3:29 PM