
Convert CirrusSearchUserTesting log channel to new format
Closed, Duplicate; Public

Description

Motivation

Right now, the log data looks like:

YYYY-MM-DD HH:MM:SS <SERVER NUMBER> <PROJ> CirrusSearchUserTesting DEBUG:
{
    "wiki": "enwiki",
    "tests": {
        "suggest-confidence": <"a" or "b">
    },
    "queries": [...],
    "hits": N,
    "source": "api",
    "ip": "<IP ADDRESS>",
    "xff": "<IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>",
    "userAgent": "<USER AGENT>"
}

This nested format is not conducive to analysis in most statistical software (such as R), where data must be tidy before it can be analyzed. Tidy data is defined as follows:

  1. Observations are in rows
  2. Variables are in columns
  3. Contained in a single dataset

Since we control the format, we can redesign it to be conducive to analysis. To that end, we propose the following.

Proposed Change

Tabs should be stripped out of queries (and queries sanitized in general), and the values written out as tab-delimited rows with the following structure:

Column        Contents
project       enwiki, mediawiki, meta, etc.
experiment    title like "suggestion confidence"
queries       as JSON (for now)
hits          n
source        api, etc.
time_taken    elastic search time taken
ip            pre-calculated actual IP
user_agent    ua info
parameters    continuous values or group assignment

The format for listing parameters should be: key1=value1; key2=value2; …; keyN=valueN
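As a rough illustration of the proposed structure, here is a minimal Python sketch that turns one already-collected record into a tab-delimited row. The dictionary keys, the `strip_tabs` and `to_row` helpers, and the sample values are all hypothetical; the actual logging code in CirrusSearch would look different.

```python
import json

# Column order taken from the proposed structure above.
COLUMNS = ["project", "experiment", "queries", "hits", "source",
           "time_taken", "ip", "user_agent", "parameters"]


def strip_tabs(text):
    """Replace tabs and line breaks so a value cannot break the row layout."""
    return text.replace("\t", " ").replace("\r", " ").replace("\n", " ")


def format_parameters(params):
    """Render parameters as 'key1=value1; key2=value2; ...'."""
    return "; ".join(f"{key}={value}" for key, value in params.items())


def to_row(record):
    """Build one tab-delimited line from a record dict (hypothetical shape)."""
    fields = [
        record["project"],
        record["experiment"],
        # Queries stay as JSON for now, with tabs stripped first.
        json.dumps([strip_tabs(q) for q in record["queries"]]),
        str(record["hits"]),
        record["source"],
        str(record["time_taken"]),
        record["ip"],
        record["user_agent"],
        format_parameters(record["parameters"]),
    ]
    return "\t".join(strip_tabs(field) for field in fields)


# Made-up sample record; the IP is from a documentation-only range.
example = {
    "project": "enwiki",
    "experiment": "suggestion confidence",
    "queries": ["foo\tbar"],
    "hits": 3,
    "source": "api",
    "time_taken": 12,
    "ip": "192.0.2.1",
    "user_agent": "ExampleAgent/1.0",
    "parameters": {"group": "a", "smoothing": 1.5},
}
row = to_row(example)
```

Each row then has exactly one field per column, so it can be read directly as a delimited file by statistical software.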

The experiment column will enable us to filter out other experiments we may be running (in parallel or serially) that might make it into the logs, as well as provide a reference point for when the particulars of the experiment need to be looked up in documentation.
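To make the filtering use case concrete, here is a small Python sketch that reads rows in the proposed layout and keeps only those belonging to one experiment. The function name `rows_for_experiment` and the sample lines are hypothetical. Quoting is disabled in the reader on the assumption that the JSON in the queries column contains literal double quotes.

```python
import csv

# Column order taken from the proposed structure.
COLUMNS = ["project", "experiment", "queries", "hits", "source",
           "time_taken", "ip", "user_agent", "parameters"]


def rows_for_experiment(lines, experiment):
    """Return the rows (as dicts) whose experiment column matches."""
    reader = csv.DictReader(
        lines,
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # keep JSON quotes in 'queries' untouched
    )
    return [row for row in reader if row["experiment"] == experiment]


# Made-up sample lines from two experiments running in parallel.
sample_lines = [
    'enwiki\tsuggestion confidence\t["foo"]\t3\tapi\t12\t192.0.2.1\tExampleAgent/1.0\tgroup=a',
    'enwiki\tother test\t["bar"]\t0\tapi\t8\t192.0.2.2\tExampleAgent/1.0\tgroup=b',
]
matching = rows_for_experiment(sample_lines, "suggestion confidence")
```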

The parameters column will allow us to perform complex experiments that combine categorical variables (e.g. binary membership in group 'a' or 'b') with continuous variables (e.g. a smoothing parameter drawn at random from 0.5 to 3.0).
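On the analysis side, a parameters value in the `key1=value1; key2=value2` format could be split back into typed values roughly as follows. This is a sketch only: the function name `parse_parameters` is hypothetical, and it assumes that values never contain a semicolon and that anything parseable as a number is a continuous variable (so a group literally named "inf" or "nan" would be misread).

```python
def parse_parameters(text):
    """Parse 'key1=value1; key2=value2' into a dict.

    Numeric values become floats (continuous variables); everything
    else stays a string (categorical variables such as group 'a'/'b').
    """
    params = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon or an empty column
        key, _, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value
    return params


parsed = parse_parameters("suggest-confidence=a; smoothing=1.5")
```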

Event Timeline

EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson subscribed.
EBernhardson renamed this task from Convert CirrusSearchUserTesting channel to new format to Convert CirrusSearchUserTesting log channel to new format. (Aug 12 2015, 8:29 PM)
EBernhardson added a project: CirrusSearch.
EBernhardson set Security to None.