Convert CirrusSearchUserTesting log channel to new format
Closed, Duplicate · Public

Assigned To

None

Authored By

	EBernhardson
	Aug 12 2015, 8:28 PM

Description

Motivation

Right now, the log data looks like:

YYYY-MM-DD HH:MM:SS <SERVER NUMBER> <PROJ> CirrusSearchUserTesting DEBUG:
{
    "wiki": "enwiki",
    "tests": {
        "suggest-confidence": <"a" or "b">
    },
    "queries": [...],
    "hits": N,
    "source": "api",
    "ip": "<IP ADDRESS>",
    "xff": "<IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>",
    "userAgent": "<USER AGENT>"
}

This format is not conducive to data analysis in most statistical software (such as R), which expects data to be tidy before it can be analyzed. Data is tidy when:

Observations are in rows
Variables are in columns
Contained in a single dataset
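To illustrate the gap between the current format and a tidy row, here is a minimal Python sketch that flattens one current-format line into a single observation. The function name `flatten_record`, the sample line, the server name `mw1001`, and the `test_` column prefix are all hypothetical and chosen only for illustration; the real log prefix may differ.

```python
import json

def flatten_record(line):
    """Split one current-format log line into a flat dict (one observation)."""
    # Everything before the first "{" is the log prefix; the rest is the JSON payload.
    prefix, brace, payload = line.partition("{")
    record = json.loads(brace + payload)
    date, time = prefix.split()[:2]
    row = {
        "timestamp": f"{date} {time}",
        "wiki": record["wiki"],
        "hits": record["hits"],
        "source": record["source"],
        # The list of queries has to be re-encoded to fit into a single cell.
        "queries": json.dumps(record["queries"]),
    }
    # Each nested test becomes its own column, e.g. "test_suggest-confidence".
    for name, group in record["tests"].items():
        row[f"test_{name}"] = group
    return row

# Hypothetical sample line; all values are made up.
sample = ('2015-08-12 20:28:00 mw1001 enwiki CirrusSearchUserTesting DEBUG: '
          '{"wiki": "enwiki", "tests": {"suggest-confidence": "a"}, '
          '"queries": ["foo"], "hits": 3, "source": "api"}')
row = flatten_record(sample)
```

Every consumer of the current logs would need a step like this before analysis, which is the work a redesigned format could avoid.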

Since we control the format, we can redesign it to be conducive to analysis. To that end, we propose the following.

Proposed Change

Tabs should be stripped out of queries (and queries sanitized in general), and the values written out as tab-delimited rows with the following structure:

project	experiment	queries	hits	source	time_taken	ip	user_agent	parameters

project: enwiki, mediawiki, meta, etc.
experiment: a title such as "suggestion confidence"
queries: as JSON (for now)
hits: n
source: api, etc.
time_taken: Elasticsearch time taken
ip: pre-calculated actual IP
user_agent: UA info
parameters: continuous values or group assignment, listed in the format `key1=value1; key2=value2; …; keyN=valueN`
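The proposed row layout could be produced along these lines. This is only a sketch in Python, not the CirrusSearch implementation: the helper names (`sanitize`, `format_parameters`, `format_row`) and the sample event are hypothetical, and replacing tabs and line breaks with a single space is an assumption about what "stripped out" should mean.

```python
import json

COLUMNS = ["project", "experiment", "queries", "hits", "source",
           "time_taken", "ip", "user_agent", "parameters"]

def sanitize(value):
    """Replace tabs and line breaks so a value cannot split a row or a column."""
    text = str(value)
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text

def format_parameters(params):
    """Render a dict as `key1=value1; key2=value2`."""
    return "; ".join(f"{key}={value}" for key, value in params.items())

def format_row(event):
    """Build one tab-delimited line from a dict holding every column."""
    fields = dict(event)
    # Strip tabs from each query first, then store the list as JSON.
    fields["queries"] = json.dumps([sanitize(q) for q in event["queries"]])
    fields["parameters"] = format_parameters(event["parameters"])
    return "\t".join(sanitize(fields[column]) for column in COLUMNS)

# Hypothetical event; all values are made up.
event = {
    "project": "enwiki",
    "experiment": "suggestion confidence",
    "queries": ["foo\tbar"],
    "hits": 3,
    "source": "api",
    "time_taken": 12,
    "ip": "192.0.2.1",
    "user_agent": "ExampleAgent/1.0",
    "parameters": {"group": "a", "smoothing": 1.5},
}
line = format_row(event)
```

Sanitizing every field, not only the queries, keeps the column count fixed even if a user agent string happens to contain a tab.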

The experiment column will enable us to filter out other experiments we may be running (in parallel or serially) that might make it into the logs, as well as provide a reference point for when the particulars of the experiment need to be looked up in documentation.
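As a rough sketch of how the experiment column might be used on the analysis side, the following Python reads tab-delimited rows and keeps only those from one experiment. It assumes the file starts with a header line naming the columns, which the proposal does not state; `rows_for_experiment` and the sample rows are hypothetical.

```python
import csv
import io

def rows_for_experiment(stream, experiment):
    """Return the rows whose experiment column matches the given title."""
    # QUOTE_NONE keeps the double quotes inside the JSON queries column literal.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["experiment"] == experiment]

# Hypothetical log contents; all values are made up.
header = ("project\texperiment\tqueries\thits\tsource\t"
          "time_taken\tip\tuser_agent\tparameters\n")
sample = io.StringIO(
    header
    + 'enwiki\tsuggestion confidence\t["foo"]\t3\tapi\t12\t192.0.2.1\tExampleAgent/1.0\tgroup=a\n'
    + 'enwiki\tother experiment\t["bar"]\t0\tapi\t8\t192.0.2.2\tExampleAgent/1.0\tgroup=b\n'
)
matching = rows_for_experiment(sample, "suggestion confidence")
```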

The parameters column will allow us to perform complex experiments that combine categorical variables (e.g. binary membership in group 'a' or 'b') with continuous variables (e.g. a smoothing parameter drawn at random from 0.5 to 3.0).
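One way the parameters column could be read back is sketched below in Python. The function name `parse_parameters` and the keys `group` and `smoothing` are hypothetical. Treating any value that parses as a number as continuous is an assumption, and it means a group label that looks numeric (or is spelled "inf" or "nan") would be converted to a float.

```python
def parse_parameters(field):
    """Parse `key1=value1; key2=value2` into a dict.

    Values that parse as numbers are returned as floats (continuous variables);
    everything else stays a string (categorical variables).
    """
    params = {}
    for pair in field.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon or an empty field
        key, _, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value
    return params

parsed = parse_parameters("group=a; smoothing=1.5")
```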

Event Timeline

EBernhardson created this task. Aug 12 2015, 8:28 PM

EBernhardson raised the priority of this task from to Needs Triage.

EBernhardson updated the task description.

EBernhardson subscribed.

Restricted Application added a subscriber: Aklapper. Aug 12 2015, 8:28 PM

EBernhardson renamed this task from Convert CirrusSearchUserTesting channel to new format to Convert CirrusSearchUserTesting log channel to new format. Aug 12 2015, 8:29 PM

EBernhardson added a project: CirrusSearch.

EBernhardson set Security to None.

Restricted Application added a project: Discovery-ARCHIVED. Aug 12 2015, 8:29 PM

mpopov updated the task description. Aug 13 2015, 8:49 PM

mpopov added a subscriber: Ironholds.

mpopov subscribed. Aug 14 2015, 11:05 PM

EBernhardson closed this task as a duplicate of T108869: Switch A/B test logs over to a more easily analysable format. Aug 17 2015, 5:02 PM

Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Dec 31 2015, 5:08 AM

Restricted Application added a subscriber: StudiesWorld. Dec 31 2015, 5:08 AM
