Motivation
Right now, the log data looks like:
YYYY-MM-DD HH:MM:SS <SERVER NUMBER> <PROJ> CirrusSearchUserTesting DEBUG: { "wiki": "enwiki", "tests": { "suggest-confidence": <"a" or "b"> }, "queries": [...], "hits": N, "source": "api", "ip": "<IP ADDRESS>", "xff": "<IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>, <IP ADDRESS>", "userAgent": "<USER AGENT>" }
This format is not conducive to analysis in most statistical software (such as R), where data must be tidy before it can be analyzed. Tidy data is defined as follows:
- Observations are in rows
- Variables are in columns
- All of the data is contained in a single dataset
Since we control the log format, we can redesign it to be conducive to analysis. To that end, we propose the following.
Proposed Change
Tabs should be stripped from queries (and queries sanitized in general), and the values written out as tab-delimited fields with the following structure:
- project: enwiki, mediawiki, meta, etc.
- experiment: a title such as "suggestion confidence"
- queries: encoded as JSON (for now)
- hits: the number of hits (n)
- source: api, etc.
- time_taken: Elasticsearch time taken
- ip: pre-calculated actual IP
- user_agent: user agent information
- parameters: continuous values or group assignment, listed in the format key1=value1; key2=value2; …; keyN=valueN
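As a rough illustration of the proposed structure, the following Python sketch builds one tab-delimited row. It is not the logging code itself: the function names (sanitize, format_parameters, format_row), the choice to replace tabs and line breaks with spaces, and the record layout as a dict are all assumptions made for the example.

```python
import json

# Column order taken from the proposed structure.
COLUMNS = ["project", "experiment", "queries", "hits", "source",
           "time_taken", "ip", "user_agent", "parameters"]


def sanitize(value):
    """Replace tabs and line breaks with spaces so a field cannot break the row.

    Replacing rather than deleting is an assumption of this sketch.
    """
    text = str(value)
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text


def format_parameters(params):
    """Render a dict as 'key1=value1; key2=value2'."""
    return "; ".join(f"{key}={value}" for key, value in params.items())


def format_row(record):
    """Build one tab-delimited log line from a dict keyed by COLUMNS."""
    fields = dict(record)
    # Strip tabs from each query first, then encode the list as JSON.
    fields["queries"] = json.dumps([sanitize(q) for q in record["queries"]])
    fields["parameters"] = format_parameters(record["parameters"])
    return "\t".join(sanitize(fields[name]) for name in COLUMNS)
```

A row built this way can be read directly by tools that accept tab-separated input, such as read.delim in R, provided the column names are supplied separately or written once as a header line.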
The experiment column will enable us to filter out other experiments we may be running (in parallel or serially) that might make it into the logs, as well as provide a reference point for when the particulars of the experiment need to be looked up in documentation.
The parameters column will allow us to perform complex experiments that combine categorical variables (e.g. membership in group 'a' or 'b') with continuous variables (e.g. a smoothing parameter drawn at random from 0.5 to 3.0).
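On the analysis side, the parameters column would need to be split back into individual variables. A minimal sketch of that step, assuming the key1=value1; key2=value2 format and using hypothetical key names (group, smoothing) that are not part of the proposal:

```python
def parse_parameters(text):
    """Parse 'key1=value1; key2=value2' into a dict.

    Values that look numeric become floats (continuous variables);
    everything else stays a string (categorical variables).
    """
    params = {}
    for pair in text.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';' or an empty column
        key, _, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        try:
            params[key] = float(value)
        except ValueError:
            params[key] = value
    return params


# Hypothetical example: one categorical and one continuous parameter.
example = parse_parameters("group=a; smoothing=1.5")
```

One caveat of this approach is that a group label that happens to parse as a number (for example "1", or "nan") would be converted to a float, so group names are best kept clearly non-numeric.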