
[EL Sanitization] Translate TSV whitelist into new YAML whitelist
Closed, Resolved · Public · 5 Estimate Story Points

Description

We should write a short script that automatically translates the current TSV whitelist to the new YAML format.
Note that the new YAML format allows partial purging of nested fields. In T164125, we decided to whitelist the parsed userAgent field in its entirety until partial purging could be applied to it. Now that partial purging is available, we should revisit that task and modify the whitelist so that it keeps only the parts of the userAgent map that are safe.
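
As a rough illustration of what partial purging of a nested field means, here is a minimal sketch. It is not the actual EventLogging sanitization code: the `sanitize` function, the sample event, and the `clientIp` and `browser_family` fields are assumptions made up for the example. Only `os_family` and `wmf_app_version` come from this task.

```python
def sanitize(event, whitelist):
    """Return a copy of `event` that keeps only whitelisted fields.

    A whitelist value of 'keep' retains the field as-is. A nested dict
    recurses into a nested field and keeps only the listed subfields.
    """
    sanitized = {}
    for field, spec in whitelist.items():
        if field not in event:
            continue
        if spec == 'keep':
            sanitized[field] = event[field]
        elif isinstance(spec, dict) and isinstance(event[field], dict):
            sanitized[field] = sanitize(event[field], spec)
    return sanitized


# Hypothetical event and whitelist, for illustration only.
event = {
    'uuid': 'abc123',
    'clientIp': '203.0.113.7',
    'userAgent': {
        'os_family': 'Linux',
        'browser_family': 'Firefox',
        'wmf_app_version': '-',
    },
}
whitelist = {
    'uuid': 'keep',
    'userAgent': {'os_family': 'keep', 'wmf_app_version': 'keep'},
}
print(sanitize(event, whitelist))
```

With a flat whitelist, userAgent could only be kept or dropped as a whole. The nested form keeps the two listed subfields and purges the rest.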

Event Timeline

mforns created this task. Mar 14 2018, 3:40 PM
Restricted Application added a subscriber: Aklapper. Mar 14 2018, 3:40 PM
mforns claimed this task. Mar 14 2018, 4:30 PM
mforns edited projects, added Analytics-Kanban; removed Analytics.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.
mforns added a comment. Edited Mar 14 2018, 7:59 PM

I wrote the script that translates the old TSV whitelist into the new YAML format.
I did not push the code to any repo, because this is a one-off. Here's the code:

translate_whitelist.py
import sys

tsv_lines = sys.stdin.readlines()
whitelist = {}

for line in tsv_lines:

    # Parse line, skipping blank or malformed lines that lack a field column.
    split_line = line.split('\t')
    if len(split_line) < 2:
        continue
    schema_name = split_line[0].strip()
    field_name = split_line[1].strip()

    # Open block for schema whitelist.
    # (dict.has_key() was removed in Python 3; use the `in` operator.)
    if schema_name not in whitelist:
        whitelist[schema_name] = {}

    # Wrap fields that belong to the core event.
    if field_name.startswith('event_'):
        if 'event' not in whitelist[schema_name]:
            whitelist[schema_name]['event'] = {}
        # Strip the 'event_' prefix to get the nested field name.
        event_field_name = field_name[len('event_'):]
        whitelist[schema_name]['event'][event_field_name] = 'keep'

    # Wrap user agent fields.
    elif field_name.lower() == 'useragent':
        whitelist[schema_name]['userAgent'] = {
            'os_family': 'keep',
            'wmf_app_version': 'keep'
        }

    # Remove some obsolete fields.
    elif field_name.lower() in ('clientvalidated', 'istruncated'):
        continue

    # Other capsule fields.
    else:
        whitelist[schema_name][field_name] = 'keep'

# Write yaml.
def print_yaml(whitelist, indentation='', separator=None):
    for key in sorted(whitelist.keys()):
        value = whitelist[key]
        if isinstance(value, dict):
            print(indentation + key + ':')
            print_yaml(value, indentation + '    ')
        else:
            print(indentation + key + ': keep')
        if separator is not None:
            print(separator)

print("""
__defaults__:
    dt: keep
    recvFrom: keep
    revision: keep
    seqId: keep
    timestamp: keep
    topic: keep
    uuid: keep
""")
print_yaml(whitelist, separator='')

Note that every userAgent field that is whitelisted in the TSV will appear as partially purged in the YAML whitelist. That is the correct purging strategy for that field; see: https://phabricator.wikimedia.org/T164125#3284763
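
To make the note concrete, here is a small sketch of the YAML that a single userAgent row ends up as. `ExampleSchema` is a hypothetical schema name, and `to_yaml_lines` is a variant of `print_yaml` written for this example that returns lines instead of printing them.

```python
def to_yaml_lines(whitelist, indentation=''):
    """Render a nested whitelist dict as a list of YAML lines."""
    lines = []
    for key in sorted(whitelist):
        value = whitelist[key]
        if isinstance(value, dict):
            lines.append(indentation + key + ':')
            lines.extend(to_yaml_lines(value, indentation + '    '))
        else:
            lines.append(indentation + key + ': keep')
    return lines


# What the translation builds for the hypothetical TSV row
# "ExampleSchema<TAB>userAgent": only the two safe subfields are kept.
translated = {
    'ExampleSchema': {
        'userAgent': {'os_family': 'keep', 'wmf_app_version': 'keep'},
    },
}
print('\n'.join(to_yaml_lines(translated)))
```

The printed block lists `os_family` and `wmf_app_version` under `userAgent`, so any other userAgent subfield is purged for that schema.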

To thoroughly test that the output YAML whitelist was equivalent to the input TSV one, I wrote another script that does the reverse process. Here's the second script:

untranslate_whitelist.py
import yaml

# Load yaml into a dict.
with open('yaml_whitelist.yaml') as yaml_file:
	# safe_load avoids the Loader argument that yaml.load requires in recent PyYAML.
	whitelist = yaml.safe_load(yaml_file)

# Iterate over schemas.
for schema_name in sorted(whitelist.keys()):
	if schema_name.lower() == '__defaults__':
		continue

	schema_whitelist = whitelist[schema_name]

	# Iterate over fields.
	for field_name in sorted(schema_whitelist.keys()):
		field_spec = schema_whitelist[field_name]

		# Handle nested event subfields.
		if field_name == 'event':
			for subfield_name in sorted(field_spec.keys()):
				subfield_spec = field_spec[subfield_name]
				print(schema_name + '\t' + 'event_' + subfield_name)

		# Handle user agent nested field.
		elif field_name == 'userAgent':
			print(schema_name + '\tuserAgent')

		# Other capsule fields
		else:
			print(schema_name + '\t' + field_name)

The results match. If you try this, make sure you sort both the input TSV whitelist and the round-tripped (translated, then untranslated) TSV whitelist before you diff them: the scripts output the whitelist in alphabetical order, but the current TSV whitelist in puppet is not fully sorted.
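
For anyone repeating the check, a minimal sketch of an order-insensitive comparison follows. It swaps the sort-then-diff approach for a set comparison, which also ignores duplicate rows. The function names and sample rows are hypothetical; in practice the two lists would be read from the original TSV file and from the output of untranslate_whitelist.py.

```python
def normalize(tsv_lines):
    """Return the set of (schema, field) pairs found in TSV lines."""
    pairs = set()
    for line in tsv_lines:
        parts = [part.strip() for part in line.split('\t')]
        if len(parts) >= 2 and parts[0]:
            pairs.add((parts[0], parts[1]))
    return pairs


def compare_whitelists(original_lines, roundtrip_lines):
    """Return (missing, extra) as sorted lists of (schema, field) pairs."""
    original = normalize(original_lines)
    roundtrip = normalize(roundtrip_lines)
    return sorted(original - roundtrip), sorted(roundtrip - original)


# Hypothetical inputs, deliberately in different orders.
original = ['SchemaB\twiki\n', 'SchemaA\tevent_action\n']
roundtrip = ['SchemaA\tevent_action\n', 'SchemaB\twiki\n']
missing, extra = compare_whitelists(original, roundtrip)
print(missing, extra)
```

Two empty lists mean the whitelists are equivalent. Since the translation script deliberately drops clientValidated and isTruncated, any such rows in the input would be expected to show up under `missing`.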

mforns added a comment. Edited Mar 14 2018, 8:03 PM

The resulting new EL whitelist in YAML format, which supports partial purging of nested fields, is:

mforns set the point value for this task to 5. Mar 14 2018, 8:03 PM
mforns moved this task from In Code Review to Done on the Analytics-Kanban board. Apr 12 2018, 2:25 PM
mforns moved this task from Done to In Code Review on the Analytics-Kanban board.
mforns moved this task from In Code Review to Done on the Analytics-Kanban board. Apr 17 2018, 6:28 PM
Nuria closed this task as Resolved. May 8 2018, 10:47 PM