Design and agree on an Avro schema for cirrus search request logging to hadoop
Closed, Resolved · Public · 2 Estimated Story Points

Description

Some wiki page should probably define exactly what this is and what the data means. This is the first draft of the Avro schema; it is a direct transliteration of the existing CirrusSearchRequests log:

'CirrusSearchRequests' => array(
        'type' => 'record',
        'name' => 'CirrusSearchRequests',
        'namespace' => 'org.wikimedia.search',
        'fields' => array(
                array( 'name' => 'query',         'type' => 'string' ),
                array( 'name' => 'queryType',     'type' => 'string' ),
                array( 'name' => 'numBatch',      'type' => array( 'int', 'null' ) ),
                array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                array( 'name' => 'source',        'type' => 'string' ),
                array( 'name' => 'executor',      'type' => 'int' ),
                array( 'name' => 'identity',      'type' => 'string' ),
                array( 'name' => 'index',         'type' => 'string' ),
                array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                array( 'name' => 'suggestion',    'type' => 'string' ),
        ),
),

Related Objects

Status	Assigned	Task
Declined	None	T112846 Display automata and humans separately on zero results rate graph
Resolved	EBernhardson	T103505 Create analytics-centric Cirrus logs and have them import into HDFS
Resolved	mpopov	T110590 Add breakdown of zero results rate by language/project pair to dashboard
Resolved	Ironholds	T112295 Design and agree on an Avro schema for cirrus search request logging to hadoop

Event Timeline

EBernhardson created this task. Sep 11 2015, 4:59 PM

EBernhardson raised the priority of this task from to Needs Triage.

EBernhardson updated the task description.

EBernhardson added projects: Discovery-Search (Current work), Discovery-Analysis (Current work), CirrusSearch.

EBernhardson added subscribers: EBernhardson, mpopov, Ironholds.

Restricted Application added a project: Discovery-ARCHIVED. Sep 11 2015, 4:59 PM

Restricted Application added a subscriber: Aklapper.

• ksmith added a project: OKR-Work. Sep 11 2015, 9:53 PM

• ksmith set Security to None.

mpopov added a parent task: T110590: Add breakdown of zero results rate by language/project pair to dashboard. Sep 11 2015, 11:12 PM

@Ironholds @mpopov Erik would like your input on this.

Hey, taking a look now.

I thought we'd been pretty clear (and pretty regularly clear) that the existing schema is nowhere near what we need for a data analysis point of view or a "tracking down user stupid" point of view and we'd like something pretty similar to the A/B test output. What I'd really like is the A/B testing framework with the query-related information stored in a map (or a series of maps). It will drastically reduce the complexity of dashboarding in a way that translating the current logs won't and the fact that we've been using the A/B logs to track down user sillies is...a good sign it's good for that.

In T112295#1639006, @Ironholds wrote:

I thought we'd been pretty clear (and pretty regularly clear) that the existing schema is nowhere near what we need for a data analysis point of view or a "tracking down user stupid" point of view and we'd like something pretty similar to the A/B test output.

Indeed. Erik explicitly mentioned that he doubted what you want is similar to the existing schema. It might be worth you two pairing up over hangouts to hash out the ideal schema rather than doing back-and-forth over Phab.

Coming into this now. It's not clear what exactly Oliver means by "A/B testing framework", unless he means the fields we're collecting now.

@Ironholds can you please include me in that future meeting with Erik? Thanks.

Ironholds claimed this task. Sep 15 2015, 11:16 PM

Ironholds edited a custom field.

Ironholds moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

A quick transliteration (untested) of the data we collect in backend AB testing to avro would be:

'CirrusSearchUserTesting' => array(
        'type' => 'record',
        'name' => 'CirrusSearchTesting',
        'namespace' => 'org.wikimedia.search',
        'fields' => array(
                array( 'name' => 'wikiId',              'type' => 'string' ),
                array( 'name' => 'testsIncludedIn',     'type' => array( 'type' => 'array', 'items' => 'string' ) ),
                array( 'name' => 'queries',             'type' => array( 'type' => 'array', 'items' => 'string' ) ),
                array( 'name' => 'hits',                'type' => 'int' ),
                array( 'name' => 'executionContext',    'type' => array( 'type' => 'enum', 'name' => 'executionContext', 'symbols' => array( 'web', 'api', 'cli' ) ) ),
                array( 'name' => 'elasticTookMs',       'type' => 'int' ),
                array( 'name' => 'ip',                  'type' => 'string' ),
                array( 'name' => 'userAgent',           'type' => 'string' ),
                array( 'name' => 'parameters',          'type' => array( 'type' => 'map', 'values' => 'string' ) )
        ),
),

I think this loses much of the information we are currently collecting, especially considering this represents multiple queries. If the goal is, as we have discussed, to batch together all requests made during a single execution (I can't say web request, because it's not all web), we could use a nested record approach (also untested; I only quickly shuffled the existing arguments into two groups). The Avro support for Hive purports to support arbitrarily nested schemas:

'CirrusSearchSomething' => array(
    'type' => 'record',
    'name' => 'CirrusSearchSomething',
    'namespace' => 'org.wikimedia.search',
    'fields' => array(
        array( 'name' => 'wikiId', 'type' => 'string' ),
        array( 'name' => 'source',        'type' => 'string' ),
        array( 'name' => 'executor',      'type' => 'int' ),
        array( 'name' => 'identity',      'type' => 'string' ),
        array( 'name' => 'requests', 'type' => array(
            'type' => 'array',
            'items' => array(
                'name' => 'request',
                'type' => 'record',
                'fields' => array(
                    array( 'name' => 'query',         'type' => 'string' ),
                    array( 'name' => 'queryType',     'type' => 'string' ),
                    array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'index',         'type' => 'string' ),
                    array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                    array( 'name' => 'suggestion',    'type' => 'string' ),
                ),
            ),
        ) ),
    ),
),

We should probably loop in @dcausse as well because we have found use for some of these data points when debugging. perhaps we could also log some information useful for evaluating scoring methods?
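The batching idea discussed above (one record per execution, with the individual Elasticsearch requests nested inside it) can be sketched in Python. This is only an illustration, not the logging code: it assumes flat log rows shaped like the original CirrusSearchRequests schema, and it assumes that rows sharing the same wikiId, source, executor and identity belong to one execution. The function name `group_requests` and all sample values are made up.

```python
from collections import OrderedDict

# Fields that describe the execution as a whole (top level of the nested schema).
SET_FIELDS = ("wikiId", "source", "executor", "identity")


def group_requests(flat_rows):
    """Group flat per-request rows into one nested record per execution."""
    grouped = OrderedDict()
    for row in flat_rows:
        key = tuple(row[f] for f in SET_FIELDS)
        if key not in grouped:
            record = {f: row[f] for f in SET_FIELDS}
            record["requests"] = []
            grouped[key] = record
        # Everything that is not a set-level field belongs to the single request.
        request = {k: v for k, v in row.items() if k not in SET_FIELDS}
        grouped[key]["requests"].append(request)
    return list(grouped.values())


# Hypothetical sample rows: a prefix search followed by a full text search
# in the same execution, plus one unrelated request.
rows = [
    {"wikiId": "enwiki", "source": "web", "executor": 1, "identity": "abc",
     "query": "avr", "queryType": "prefix", "hitsTotal": 12},
    {"wikiId": "enwiki", "source": "web", "executor": 1, "identity": "abc",
     "query": "avro", "queryType": "full_text", "hitsTotal": 1492},
    {"wikiId": "dewiki", "source": "api", "executor": 2, "identity": "def",
     "query": "hive", "queryType": "full_text", "hitsTotal": 0},
]

request_sets = group_requests(rows)
```

With the sample rows, the two enwiki requests end up in one record and the dewiki request in another, which is the shape the nested schema is meant to hold.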

EBernhardson added a subscriber: dcausse. Sep 16 2015, 5:34 AM

If it is not too late I'd like to add :

- the number of results (limit) and offset requested by the client; in some cases it can be a good signature
- an array of strings, query_features, that we could populate with things like (special_syntax_incategory, special_syntax_insource, special_syntax_insource_regex, common_term_query, wildcard, phrase_query...)
- the number of words/clauses, as we will certainly extract this information very soon (if not already done).

Sorry I hadn't thought about that the first time Oliver asked for feedback concerning this schema but as we are implementing more features I think this would be really useful to us.
It's hard also to predict what we'll need in the future, is it something that is extremely hard to extend?
If it's hard and everything should be added now I'd say we could add something like :

- a nested array of profiles used (suggestion, lang, maybe common term will have some profiles). It could help to encode various params we use into profile names without changing the schema.

Concerning the nested schema: if it's possible it'd be awesome, since we'd be able to add a new value like 'fallback_method' (rewritten_from_suggestion, rewritten_from_langdetect).

For the scoring methods I don't know yet how it could help at query level, I'd say that it will be certainly useful to add an array like query_features for rescore_features we could populate with all the rescore methods we use today (incoming_links, boostTemplates on some wiki, phrase_rescore ...).

With everything I said it would look like :

'CirrusSearchSomething' => array(
    'type' => 'record',
    'name' => 'CirrusSearchSomething',
    'namespace' => 'org.wikimedia.search',
    'fields' => array(
        array( 'name' => 'wikiId', 'type' => 'string' ),
        array( 'name' => 'source',        'type' => 'string' ),
        array( 'name' => 'executor',      'type' => 'int' ),
        array( 'name' => 'identity',      'type' => 'string' ),
        array( 'name' => 'requests', 'type' => array(
            'type' => 'array',
            'items' => array(
                'name' => 'request',
                'type' => 'record',
                'fields' => array(
                    array( 'name' => 'query',         'type' => 'string' ),
                    array( 'name' => 'queryType',     'type' => 'string' ),
                    array( 'name' => 'queryFeatures',   'type' => array( 'type' => 'array', 'items' => 'string' ) ),
                    array( 'name' => 'queryWords',      'type' => 'int' ),
                    array( 'name' => 'rescoreFeatures', 'type' => array( 'type' => 'array', 'items' => 'string' ) ),
                    array( 'name' => 'limit',           'type' => 'int' ),
                    array( 'name' => 'offset',          'type' => 'int' ),
                    array( 'name' => 'profiles', 'type' => array( 'type' => 'array', 'items' => array(
                        'name' => 'profile',
                        'type' => 'record',
                        'fields' => array(
                            array( 'name' => 'type', 'type' => 'string' ), // The profile type (did_you_mean_suggest, completion_suggester, common_term)
                            array( 'name' => 'name', 'type' => 'string' ), // The profile name (default, strict ... )
                        ),
                    ) ) ),
                    array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'index',         'type' => 'string' ),
                    array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                    array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                    array( 'name' => 'suggestion',    'type' => 'string' ),
                ),
            ),
        ) ),
    ),
),

sorry :)

Avro has schema evolution support[1], but it is limited (and we will probably need to adjust the final schema to take these limits into account). Schema evolution happens on deserialization. Avro requires both a 'writer schema', which is the version of the schema the data was written with, and a 'reader schema', which is the version of the schema you are expecting at the application level. The rules (copy/paste) are:

These are the modifications you can safely perform to your schema without any concerns:

A field with a default value is added.

A field that was previously defined with a default value is removed.

A field's doc attribute is changed, added or removed.

A field's order attribute is changed, added or removed.

A field's default value is added, or changed.

Field or type aliases are added, or removed.

A non-union type may be changed to a union that contains only the original type, or vice-versa.

It would probably be good to actually test this with the hive integration (which transforms the avro schema into a hive table/types).

[1] http://docs.oracle.com/cd/E26161_02/html/GettingStartedGuide/schemaevolution.html
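To make the writer/reader distinction above concrete, here is a rough Python sketch of how a decoded record could be projected onto a reader schema. It does not use any Avro library and only covers the record-field rules quoted above (added fields need a default, removed fields are dropped); the function name `resolve_record` and the example schema are hypothetical.

```python
_NO_DEFAULT = object()  # sentinel: distinguishes "no default" from "default is null"


def resolve_record(datum, reader_schema):
    """Project a decoded record (a dict written with some other version of
    the schema) onto the reader schema, field by field."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in datum:
            resolved[name] = datum[name]
        else:
            default = field.get("default", _NO_DEFAULT)
            if default is _NO_DEFAULT:
                raise ValueError("field %r is missing and has no default" % name)
            resolved[name] = default
    # Fields present in datum but absent from the reader schema are dropped.
    return resolved


# Hypothetical reader schema: 'limit' was added with a default,
# and 'executor' was removed.
reader_v2 = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "wikiId", "type": "string"},
        {"name": "limit", "type": ["null", "int"], "default": None},
    ],
}

old_datum = {"wikiId": "enwiki", "executor": 7}
resolved = resolve_record(old_datum, reader_v2)
```

Here the old record gains `limit` with its default and loses `executor`. A reader field without a default that is missing from the data raises an error, which is why adding a field without a default is not on the safe list.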

Erik and David's schemas both look good. I'd like to trend towards more information, sure.

More pressing than what the Avro itself looks like (I'll be honest, I don't really care, I care what data it has), what does this look like in Hive? How is an array, or an array of arrays, represented?

What the Avro looks like is important, because it restricts what exactly we can do. As you brought up, how that transforms from Avro into Hive also restricts what we can do.

Do we have access to create tables in prod hive to try? Is there a hive setup (vagrant? labs?) where we can try this outside of production?

To answer questions about how we query this data i'm pretty sure the only viable option is to TIAS (try it and see).

Good question. @Ottomata ?

Yes, you can create whatever tables you want in prod, just do so in a database under your username, if you can. You should be able to create Hive databases too:

create database ebernhardson;
use ebernhardson;
create table ...

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Sep 16 2015, 8:20 PM

I've created a table named test in the ebernhardson database with the schema david posted above.

The table was created with:

CREATE TABLE test
   ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
   STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
   OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
   TBLPROPERTIES (
      'avro.schema.literal'='{"type":"record","name":"CirrusSearchSomething","namespace":"org.wikimedia.search","fields":[{"name":"wikiId","type":"string"},{"name":"source","type":"string"},{"name":"executor","type":"int"},{"name":"identity","type":"string"},{"name":"requests","type":{"type":"array","items":{"name":"request","type":"record","fields":[{"name":"query","type":"string"},{"name":"queryType","type":"string"},{"name":"queryFeatures","type":{"type":"array","items":"string"}},{"name":"queryWords","type":"int"},{"name":"rescoreFeatures","type":{"type":"array","items":"string"}},{"name":"limit","type":"int"},{"name":"offset","type":"int"},{"name":"profiles","type":{"type":"array","items":{"name":"profile","type":"record","fields":[{"name":"type","type":"string"},{"name":"name","type":"string"}]}}},{"name":"tookMs","type":["int","null"]},{"name":"index","type":"string"},{"name":"elasticTookMs","type":["int","null"]},{"name":"hitsTotal","type":["int","null"]},{"name":"hitsReturned","type":["int","null"]},{"name":"hitsOffset","type":["int","null"]},{"name":"namespaces","type":{"type":"array","items":"int"}},{"name":"suggestion","type":"string"}]}}}]}'
   );
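The avro.schema.literal value above is just the schema serialized as compact JSON. As a rough sketch, the statement could be generated from a schema held as a Python dict; the helper name `hive_avro_ddl` is made up, the table name is not validated, and the sketch assumes the schema JSON contains no single quotes (it refuses to continue otherwise).

```python
import json


def hive_avro_ddl(table, schema):
    """Return a CREATE TABLE statement embedding the schema as avro.schema.literal."""
    literal = json.dumps(schema, separators=(",", ":"))
    if "'" in literal:
        # The literal is wrapped in single quotes below; escaping is not handled here.
        raise ValueError("schema JSON contains a single quote")
    return (
        "CREATE TABLE {table}\n"
        "   ROW FORMAT SERDE\n"
        "   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        "   STORED AS INPUTFORMAT\n"
        "   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'\n"
        "   OUTPUTFORMAT\n"
        "   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'\n"
        "   TBLPROPERTIES (\n"
        "      'avro.schema.literal'='{literal}'\n"
        "   );"
    ).format(table=table, literal=literal)


# Trimmed-down schema, for illustration only.
schema = {
    "type": "record",
    "name": "CirrusSearchSomething",
    "namespace": "org.wikimedia.search",
    "fields": [{"name": "wikiId", "type": "string"}],
}

ddl = hive_avro_ddl("test", schema)
```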

I used the php avro serializer to generate 10 rows (with the same data, doesn't matter). Copied the file to stat1002 and loaded it into the hive table with:

LOAD DATA LOCAL INPATH '/home/ebernhardson/sample.avro' INTO TABLE test

My quick summary is it looks doable, some parts are annoying.

As you might expect you can query top level attributes normally

select wikiId, count(wikiId) from test group by wikiId

If you don't want to aggregate the nested values and just want it to return an array of the nested values, you can easily pluck individual fields from the nested requests:

SELECT wikiId, requests.hitsTotal
FROM test

Which might return something like

wikiid  hitstotal
enwiki  [1492,0]

In theory you could sum values of the nested requests as follows. The only caveat is that it looks like a custom array_sum UDF will have to be written (pretty straightforward I imagine, I mean it's a sum of integers) that accepts an array of integers and returns the sum.

SELECT wikiId, array_sum(requests.hitsTotal) AS aggHitsTotal
FROM test
HAVING aggHitsTotal = 0;
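As a sketch of the behaviour such an array_sum function might have, here it is in Python (a real Hive UDF would be written in Java, so this only illustrates the intended semantics). Since hitsTotal is nullable, how to treat nulls is a choice; the assumption made here is to skip them and to return null when nothing is left, similar to how SQL SUM() treats an all-NULL group.

```python
def array_sum(values):
    """Sum a list of optional integers, ignoring None entries.

    Returns None when the input is None or has no non-null entries,
    so that "no data" is not confused with "zero hits".
    """
    if values is None:
        return None
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present)
```

Returning null rather than 0 for an all-null array matters for a "zero results" filter like the one above, since a request set with no recorded hit counts would otherwise look like one that found nothing.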

We can also break out all the nested records into their own row. This would be, in a way, querying the new data as if it was in the old per-es-req style. This can be done by exploding the array with a lateral view.

SELECT wikiId
FROM test
LATERAL VIEW explode(requests) exploded_table AS es_requests 
WHERE es_requests.queryType = 'full_text';

From the docs:

A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias
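The same idea can be written out in Python, as a rough model of what the lateral view does rather than how Hive implements it. The function name `lateral_view_explode` and the sample rows are made up.

```python
def lateral_view_explode(rows, array_field, alias):
    """Yield one output row per element of row[array_field].

    Each output row keeps all columns of the input row and adds the
    element under `alias`. Rows whose array is empty or missing yield
    nothing, as with LATERAL VIEW without the OUTER keyword.
    """
    for row in rows:
        for element in row.get(array_field) or []:
            out = dict(row)
            out[alias] = element
            yield out


# Hypothetical table contents.
table = [
    {"wikiId": "enwiki", "requests": [{"queryType": "prefix"},
                                      {"queryType": "full_text"}]},
    {"wikiId": "dewiki", "requests": []},
]

# Equivalent of: SELECT wikiId ... WHERE es_requests.queryType = 'full_text'
matches = [
    r["wikiId"]
    for r in lateral_view_explode(table, "requests", "es_requests")
    if r["es_requests"]["queryType"] == "full_text"
]
```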

I still need to fiddle around with how schema upgrades work with hive+avro, but this seems acceptable to me. @Ironholds?

Gotcha. I suspect the LATERAL VIEW approach is going to be rare - so it's effectively an array, then? or, an array of arrays? We can work with that, I think. Writing UDFs for it might be thorny but I can talk to Nuria.

If we use this format we'd probably want to use UDFs for pretty much everything, which is totally fine.

So yeah, this approach LGTM - combining all the existing data with the metadata we need for tracking down issues (IP, UA).

• Deskana mentioned this in T103505: Create analytics-centric Cirrus logs and have them import into HDFS. Sep 17 2015, 4:38 PM

Spent some time playing with the schema evolution. For some things it works plenty well. Add and remove columns (following the rules about defaults) to various top level and nested records works fine. Changing the types of anything is unsupported as expected. Also found out that hive support for unions, beyond nullable types, is incredibly minimal. Basically we can select union fields but can not filter or do calculations on them.

But presumably we can treat the "requests" field as an array of arrays in Java, da? This is why I want to rely more on UDFs; I want to create a class of UDF abstractions that mean people basically never have to touch that array directly.

Ironholds added a parent task: T103505: Create analytics-centric Cirrus logs and have them import into HDFS. Sep 17 2015, 8:26 PM

So it looks like (absent David's modifications) we'd be looking at:

'CirrusSearchSomething' => array(

'type' => 'record',
'name' => 'CirrusSearchSomething',
'namespace' => 'org.wikimedia.search',
'fields' => array(
    array( 'name' => 'wikiId', 'type' => 'string' ),
    array( 'name' => 'source',        'type' => 'string' ),
    array( 'name' => 'executor',      'type' => 'int' ),
    array( 'name' => 'identity',      'type' => 'string' ),
    array( 'name' => 'ip',                  'type' => 'string' ),
    array( 'name' => 'userAgent',           'type' => 'string' ),
    array( 'name' => 'requests', 'type' => array(
        'type' => 'array',
        'items' => array(
            'name' => 'request',
            'type' => 'record',
            'fields' => array(
                array( 'name' => 'query',         'type' => 'string' ),
                array( 'name' => 'queryType',     'type' => 'string' ),
                array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                array( 'name' => 'index',         'type' => 'string' ),
                array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                array( 'name' => 'suggestion',    'type' => 'string' ),
            ),
        ),
    ) ),
),
),

Does that look okay to everyone? @mpopov ? @dcausse, what's the case for the additional information inclusion, and @EBernhardson, how much of a PITA would it be to include it post-hoc?

I think we'll add this extra info later if proven useful and not too painful.
Would you mind just adding "limit" (same as hitsReturned)?

Isn't that handled on a per-query basis? I left it in the "requests" array accordingly.

It's per client query. I'd like to know what's the size requested by the client. I've seen some weird numbers like 51 (reading hitsReturned only when hitsTotal > hitsReturned), with limit I'll be able to read it for all queries.

'CirrusSearchSomething' => array(

'type' => 'record',
'name' => 'CirrusSearchSomething',
'namespace' => 'org.wikimedia.search',
'fields' => array(
    array( 'name' => 'wikiId', 'type' => 'string' ),
    array( 'name' => 'source',        'type' => 'string' ),
    array( 'name' => 'executor',      'type' => 'int' ),
    array( 'name' => 'identity',      'type' => 'string' ),
    array( 'name' => 'ip',                  'type' => 'string' ),
    array( 'name' => 'userAgent',           'type' => 'string' ),
    array( 'name' => 'limit',      'type' => array( 'int', 'null' ) ),
    array( 'name' => 'requests', 'type' => array(
        'type' => 'array',
        'items' => array(
            'name' => 'request',
            'type' => 'record',
            'fields' => array(
                array( 'name' => 'query',         'type' => 'string' ),
                array( 'name' => 'queryType',     'type' => 'string' ),
                array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                array( 'name' => 'index',         'type' => 'string' ),
                array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                array( 'name' => 'suggestion',    'type' => 'string' ),
            ),
        ),
    ) ),
),
),

If it's per-query, we have hitsReturned already, no? Or am I missing something?

Oops I realized that I added it in the requests array in my first example yesterday.
Either way is OK for me.
Sorry for the confusion :)

hitsReturned will return only the number of hits we found, if I ask for 20 results per page and we found only 5 : hitsReturned will be 5 and I won't be able to know that the client requested 20 results.

Cool; no problem! Sounds like we have a final schema then? Or, a final initial schema? :P

Sounds good to me, thanks! :)

So, as you said in IRC, we plan to group all requests that are part of the search conversation (a prefix search, then a full text search), so limit should be in the requests array.
This is the last time I edit this ticket, promise! Please excuse me! :)

'CirrusSearchSomething' => array(

'type' => 'record',
'name' => 'CirrusSearchSomething',
'namespace' => 'org.wikimedia.search',
'fields' => array(
    array( 'name' => 'wikiId', 'type' => 'string' ),
    array( 'name' => 'source',        'type' => 'string' ),
    array( 'name' => 'executor',      'type' => 'int' ),
    array( 'name' => 'identity',      'type' => 'string' ),
    array( 'name' => 'ip',                  'type' => 'string' ),
    array( 'name' => 'userAgent',           'type' => 'string' ),
    array( 'name' => 'requests', 'type' => array(
        'type' => 'array',
        'items' => array(
            'name' => 'request',
            'type' => 'record',
            'fields' => array(
                array( 'name' => 'query',         'type' => 'string' ),
                array( 'name' => 'queryType',     'type' => 'string' ),
                array( 'name' => 'tookMs',        'type' => array( 'int', 'null' ) ),
                array( 'name' => 'index',         'type' => 'string' ),
                array( 'name' => 'elasticTookMs', 'type' => array( 'int', 'null' ) ),
                array( 'name' => 'limit',         'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsTotal',     'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsReturned',  'type' => array( 'int', 'null' ) ),
                array( 'name' => 'hitsOffset',    'type' => array( 'int', 'null' ) ),
                array( 'name' => 'namespaces',    'type' => array( 'type' => 'array', 'items' => 'int' ) ),
                array( 'name' => 'suggestion',    'type' => 'string' ),
            ),
        ),
    ) ),
),
),

LGTM!

Looks good to me! Really excited to see this go live.

One more round of bikeshedding, CirrusSearchRequests as a name is already our existing logging methods, what do we call the new system? With no votes i'm probably going to call it Daryl.

cirrusrequests - it matches webrequests.

talked with Otto, we agreed to call the schema CirrusSearchRequestSet in the org.wikimedia.mediawiki.search namespace. This will be inserted to the mediawiki_CirrusSearchRequestSet kafka topic. I think that means the hive table will have the same name.

Hive table is gonna be up to you guys! You get to manage it! :)

OO, this will be interesting, as I haven't used Avro with Camus (Kafka -> HDFS) yet. Avro should have better support than JSON in Camus, but we will see...

EBernhardson moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Sep 22 2015, 5:04 PM

Ironholds moved this task from In progress to Done on the Discovery-Analysis (Current work) board. Sep 22 2015, 5:38 PM

• Deskana closed this task as Resolved. Sep 23 2015, 5:00 AM

• Deskana triaged this task as High priority.

• Deskana moved this task from Done to Resolved on the Discovery-Analysis (Current work) board.

A couple of minor adjustments: added defaults and made null always first in unions (the default value has to match the first type of the union). Removed the executor id, since it was created to group the requests together and, well, the requests are already grouped. I also added doc strings. We can change these at any time so it's no big deal, but please check it out and clean up the doc where possible. These strings will be shown in Hive when describing the table, among other things.

Also more on the UDF front I see @Ironholds had a few questions that went unanswered:

But presumably we can treat the "requests" field as an array of arrays in Java, da? This is why I want to rely more on UDFs; I want to create a class of UDF abstractions that mean people basically never have to touch that array directly.

The "requests" value will be an array (actually a java.util.List) of objects. This Avro schema will run through a code generation step and spit out all the relevant Java classes. You can also look at the results of code generation, but it's not pretty: https://gist.github.com/ebernhardson/4b90f88779bdc51d894d

{
	"type": "record",
	"name": "CirrusSearchRequestSet",
	"namespace": "org.wikimedia.mediawiki.search",
	"doc": "A set of requests made by CirrusSearch to the elasticsearch user for a single php execution context",
	"fields": [
		{
			"name": "wikiId",
			"doc": "The wiki making this request, such as dewiki or enwiktionary",
			"type": "string"
		},
		{
			"name": "source",
			"doc": "Where the request is coming from. Typically: web, api or cli",
			"type": "string"
		},
		{
			"name": "identity",
			"doc": "A hash identifying the requestor. Includes the IP address and User Agent when available.",
			"type": "string"
		},
		{
			"name": "ip",
			"doc": "The IP address (either ipv4 or ipv6) in string notation",
			"type": [ "null", "string" ],
			"default": null
		},
		{
			"name": "userAgent",
			"doc": "The HTTP User-Agent header, or null if not-applicable",
			"type": [ "null", "string" ],
			"default": null
		},
		{
			"name": "backendUserTests",
			"doc": "List of backend tests the requests are participating in",
			"type": { "type": "array", "items": "string" }
		},                
		{
			"name": "requests",
			"doc": "A list of requests made between mediawiki and elasticsearch in a single execution context",
			"type": {
				"type": "array",
				"items": {
					"name": "CirrusSearchRequest",
					"namespace": "org.wikimedia.mediawiki.search",
					"doc": "An individual request made between mediawiki and elasticsearch",
					"type": "record",
					"fields": [
						{
							"name": "query",
							"doc": "The actual search request",
							"type": "string"
						},
						{
							"name": "queryType",
							"doc": "The general type of query performed, such as full_text, prefix, etc.",
							"type": "string"
						},
						{
							"name": "index",
							"doc": "The list of indices the request was performed against",
							"type": { "type": "array", "items": "string" }
						},
						{
							"name": "tookMs",
							"doc": "The number of milliseconds between passing the query to the client library and getting the response back in the application",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "elasticTookMs",
							"doc": "The number of milliseconds the query took, according to the elasticsearch response",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "limit",
							"doc": "The maximum number of results requested by the application",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "hitsTotal",
							"doc": "The approximate total number of documents matching the query",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "hitsReturned",
							"doc": "The number of results returned to the application",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "hitsOffset",
							"doc": "The offset of the query",
							"type": [ "null", "int" ],
							"default": null
						},
						{
							"name": "namespaces",
							"doc": "Each element is a mediawiki namespace id that was searched.",
							"type": { "type": "array", "items": "int" }
						},
						{
							"name": "suggestion",
							"doc": "The suggestion generated by elasticsearch, or null if not requested",
							"type": [ "null", "string" ],
							"default": null
						}
					]
				}
			}
		}
	]
}
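The "null first" rule mentioned above (a field whose default is null must list "null" as the first branch of its union) can be checked mechanically. The following Python sketch walks a schema held as a dict and reports offending fields; it only handles records, arrays and unions, only checks null defaults, and the function name and the sample schema are hypothetical.

```python
def null_default_problems(schema, path=""):
    """Return paths of fields that default to null but do not list "null" first."""
    problems = []
    if isinstance(schema, list):  # union: look inside every branch
        for branch in schema:
            problems += null_default_problems(branch, path)
    elif isinstance(schema, dict):
        if schema.get("type") == "record":
            for field in schema["fields"]:
                field_path = path + "." + field["name"] if path else field["name"]
                ftype = field["type"]
                if (isinstance(ftype, list) and "default" in field
                        and field["default"] is None and ftype[0] != "null"):
                    problems.append(field_path)
                problems += null_default_problems(ftype, field_path)
        elif schema.get("type") == "array":
            problems += null_default_problems(schema["items"], path)
    return problems


# Made-up schema with one correct and one incorrect nullable field.
sample = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "ip", "type": ["null", "string"], "default": None},
        {"name": "requests", "type": {"type": "array", "items": {
            "type": "record",
            "name": "ExampleRequest",
            "fields": [
                {"name": "tookMs", "type": ["int", "null"], "default": None},
            ],
        }}},
    ],
}

problems = null_default_problems(sample)
```

In the sample, `ip` passes and `requests.tookMs` is reported, because its union still has the older "int first" ordering.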

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board. Sep 24 2015, 4:07 AM

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Dec 31 2015, 5:07 AM
