Page MenuHomePhabricator

Switch A/B test logs over to a more easily analysable format
Closed, ResolvedPublic

Description

One of the things to come out of the first A/B test in discovery was that the format for our logs is hard to analyse. The suggested fix is outlined at https://docs.google.com/document/d/1oJQb0EcMPlCBU7R1Gc3lOP9jeZwU12zneRWREhX-FZU/edit# but, TL;DR:

  1. Drop xff and group;
  2. Replace group with an x_analytics-like, key=value pair format so we can test multiple parameters at once;
  3. Remove JSON-style formatting and use tab-separation if possible, with query data (other than the specified fields) stored as JSON blobs /in/ a field;
  4. Sanitise the user_agent and query fields to remove tabs and quotes, if at all possible.

Event Timeline

Ironholds assigned this task to EBernhardson.
Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description. (Show Details)
Ironholds added a subscriber: Ironholds.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 12 2015, 8:42 PM

Whats the escaping mechanism for data that includes tabs?

its a common problem in parsing, when the input data contains your separation marker what do you do? It looks like you want to strip them from UA and encode them to json in the query string? I think that covers our bases for user generated data here, but is fragile.

To make sure we mean the same thing, is this basically the format you want?

place script below into ~/old-to-new.py on fluorine.eqiad.wmnet (or wherever)
usage is: head /path/to/CirrusSearchUserTesting | python old-to-new.py

import fileinput
import json

prefix = 'CirrusSearchUserTesting DEBUG: '
for line in fileinput.input():
  pos = line.find(prefix)
  if pos == -1:
    continue
  encoded = line[pos+len(prefix):-2]
  try:
    decoded = json.loads(encoded)
  except ValueError as e:
    # bugs make it possible for two lines to be written
    # out at same time and clobber each other
    continue
  print "\t".join([
    decoded['wiki'],
    'suggest-confidence-' + decoded['tests']['suggest-confidence'],
    json.dumps(decoded['queries']),
    str(decoded['hits']),
    decoded['source'],
    str(decoded['queries'][0]['elasticTook']),
    decoded['ip'],
    decoded['userAgent'].translate("\t\"'")
  ])

If you mean "what do we do when there are tabs", like I said, let's strip them from the query string and the user agent. Quote marks are more complicated but I can just teach the software at my end to not care about quotes, which should work fine ;p.

Will test now!

Deskana triaged this task as High priority.Aug 13 2015, 8:44 PM
Deskana added a subscriber: Deskana.

Good job @EBernhardson! Works great!

Change 232199 had a related patch set uploaded (by EBernhardson):
Switch A/B test logs over to a more easily analysable format

https://gerrit.wikimedia.org/r/232199

Change 232199 merged by jenkins-bot:
Switch A/B test logs over to a more easily analysable format

https://gerrit.wikimedia.org/r/232199

@EBernhardson There's a problem with running the script on more than head of the logs:

> cat ~/CirrusSearchUserTesting.log-20150808 | python old-to-new.py > ~/CirrusSearchUserTesting.log-20150808.tsv
Traceback (most recent call last):
  File "old-to-new.py", line 24, in <module>
    decoded['userAgent'].translate("\t\"'")
AttributeError: 'bool' object has no attribute 'translate'

> tail ~/CirrusSearchUserTesting.log-20150808 | python old-to-new.py > ~/CirrusSearchUserTesting.log-20150808.tsv
Traceback (most recent call last):
  File "old-to-new.py", line 24, in <module>
    decoded['userAgent'].translate("\t\"'")
AttributeError: 'bool' object has no attribute 'translate'
EBernhardson added a comment.EditedAug 19 2015, 8:31 PM

Here is an updated version that handles that boolean properly, and adds some extra error handling:

from __future__ import print_function
import fileinput
import json
import sys

blanked = 0
errors = 0
prefix = 'CirrusSearchUserTesting DEBUG: '
for line in fileinput.input():
  pos = line.find(prefix)
  if pos == -1:
    continue
  encoded = line[pos+len(prefix):-2]
  try:
    decoded = json.loads(encoded)
  except ValueError as e:
    # bugs make it possible for two lines to be written
    # out at same time and clobber each other
    continue
  try:
    if type(decoded['userAgent']) == str:
      userAgent = decoded['userAgent'].translate("\t\"'")
    else:
      userAgent = ''

    output = [
      decoded['wiki'],
      'suggest-confidence-' + decoded['tests']['suggest-confidence'],
      json.dumps(decoded['queries']),
      str(decoded['hits']),
      decoded['source'],
      str(decoded['queries'][0]['elasticTook']),
      decoded['ip'],
      userAgent
    ]
    for i in range(len(output)):
      if type(output[i]) != str:
        output[i] = ''
        blanked += 1

    print("\t".join(output))
  except:
    errors += 1

if errors > 0:
  print("WARNING: swallowed %d errors" % (errors), file=sys.stderr)
if blanked > 0:
  print("WARNING: blanked %d values" % (blanked), file=sys.stderr)
if errors == 0 and blanked == 0:
  print("SUCCESS", file=sys.stderr)
cat CirrusSearchUserTesting.log-20150808 | python old-to-new_v2.py > CirrusSearchUserTesting.log-20150808.tsv
WARNING: swallowed 119774 errors
WARNING: blanked 75238980 values

In fact, the last MANY rows of the resulting dataset just have:

		[{"query": "Best dancer in bhharat", "elasticTook": "", "index": "enwiki_content", "queryType": "prefix", "suggest": ""}]	0				
		[{"query": "No", "elasticTook": "", "index": "ptwiki_content", "queryType": "prefix", "suggest": ""}, {"query": "no", "elasticTook": "", "index": "ptwiki_content", "queryType": "full_text", "suggest": ""}]	689228				
		[{"query": "Virgen", "elasticTook": "", "index": "ptwiki_content", "queryType": "prefix", "suggest": ""}]	7				
		[{"query": "Ef\u00e9pa bom deu", "elasticTook": "", "index": "ptwiki_content", "queryType": "prefix", "suggest": ""}, {"query": "Ef\u00e9pa bom deu", "elasticTook": "", "index": "ptwiki_content", "queryType": "prefix", "suggest": ""}]	0

So something is so awry in one or more of the log entries that it messes up the whole thing, resulting in a dataset that is mostly useless.

Although if it works line-by-line then I don't understand how a malformed line can have such an impact on everything else after it :(

Good news: No more errors.

Bad news: WARNING: blanked 60670280 values :( still getting lots of blanks where there should be data

EBernhardson added a comment.EditedAug 21 2015, 4:42 AM

one more update, it was rejecting unicode strings as not strings. this also handles the changeover from the old style format to the new style format in the logs. Additionally it now reports malformed lines, whereas before it just skipped them.

from __future__ import print_function
import fileinput
import sys

try:
  import simplejson as json
except ImportError:
  import json

blanked = 0
errors = 0
malformed = 0
prefix = 'CirrusSearchUserTesting DEBUG: '
for line in fileinput.input():
  pos = line.find(prefix)
  if pos == -1:
    continue
  encoded = line[pos+len(prefix):-2]
  try:
    decoded = json.loads(encoded)
  except ValueError as e:
    # logs already in new format
    if encoded.count("\t") == 8:
      print(encoded)
    else:
      # bugs make it possible for two lines to be written
      # out at same time and clobber each other
      malformed += 1
    continue
  try:
    if isinstance(decoded['userAgent'], basestring):
      userAgent = decoded['userAgent'].translate("\t\"'")
    else:
      userAgent = ''

    output = [
      decoded['wiki'],
      'suggest-confidence-' + decoded['tests']['suggest-confidence'],
      json.dumps(decoded['queries']),
      str(decoded['hits']),
      decoded['source'],
      str(decoded['queries'][0]['elasticTook']),
      decoded['ip'],
      userAgent
    ]
    for i in range(len(output)):
      if not isinstance(output[i], basestring):
        output[i] = ''
        blanked += 1

    print("\t".join(output))
  except:
    errors += 1

if errors > 0:
  print("WARNING: swallowed %d errors" % (errors), file=sys.stderr)
if blanked > 0:
  print("WARNING: blanked %d values" % (blanked), file=sys.stderr)
if malformed > 0:
  print("NOTICE: %d malformed lines" % (malformed), file=sys.stderr)
if errors == 0 and blanked == 0:
  print("SUCCESS", file=sys.stderr)

Awesome job, Erik! Works great!

WARNING: swallowed 301 errors
NOTICE: 8 malformed lines

and the data looks good! Will let you know if I run into any more problems but I think we're good for now. Thank you!

Deskana closed this task as Resolved.Sep 9 2015, 2:30 AM