
Load cirrussearch data into druid
Open, Medium · Public


We often have questions about how data breaks down in CirrusSearch, whether to evaluate changes or to estimate why something is the way it is. Druid should be great for these kinds of breakdowns, but we need to decide what to load. This ticket collects the questions we want to answer:

  • p95/p99 per-wiki for various query types (completion suggester, full text, regex, etc). A dimension on query length might also be useful.
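As a rough illustration of the breakdown requested above, here is a minimal Python sketch that computes nearest-rank percentiles per (wiki, query type) pair. It assumes the percentiles refer to the `backendTime` field and that each request is a dict with `wiki`, `query_type`, and `backendTime` keys; that row layout is an assumption for illustration, not the actual log schema.

```python
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile (p is an integer 1..100) of a sorted, non-empty list."""
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def latency_percentiles(rows, percentiles=(95, 99)):
    """Group rows by (wiki, query_type) and compute backendTime percentiles."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["wiki"], row["query_type"])].append(row["backendTime"])
    result = {}
    for key, values in grouped.items():
        values.sort()
        result[key] = {p: percentile(values, p) for p in percentiles}
    return result
```

In Druid itself an exact computation like this would not be available on rolled-up data; an approximate quantile aggregator would likely be needed instead.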

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. · Jan 23 2017, 5:43 PM
Restricted Application added a subscriber: Aklapper.

Once we figure out what data we need, I can work out the pipeline for getting the data in there. But first we should figure out what we want.

EBernhardson updated the task description. · Jan 23 2017, 5:45 PM
EBernhardson added a subscriber: Amire80.
JAllemandou added a subscriber: JAllemandou. · Edited · Jan 24 2017, 2:49 PM

Quick data volume checks:

  • How much data would this dataset represent (# of lines, # of GB + file format + compression, # of fields)?
  • How much variability would be present in the various dimensions (# of distinct items per dimension)?

This information will help us make sure your need can be met with the Druid infrastructure we have :)
Thanks !

I'll start. For the dimensions, I'd like to have:

  • query_type: single-valued string (# of distinct values is around 10)
  • syntax_used: multi-valued string (# of distinct values is the number of special syntaxes we support; I'd say around 30 today, but it can increase in the future)
  • wiki: single-valued string (# of distinct values is the # of wikis)
  • index: multi-valued string (# of distinct values is around 4 × the # of wikis: $wiki, $wiki_content, $wiki_general, $wiki_titlesuggest)
  • source: single-valued string (2 distinct values: api/web)

For the fields:

  • backendTime
  • hitsReturned
  • hitsTotal

We probably need more...
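To make the proposal above concrete, here is a hedged sketch of how the listed dimensions and fields might map onto the `dimensionsSpec` and `metricsSpec` parts of a Druid ingestion spec, built as a Python dict. The data source name `cirrussearch_requests`, the `_sum` metric names, and the choice of `longSum` aggregators are assumptions for illustration; the real spec would also need timestamp, granularity, and input settings.

```python
import json

# Dimension and field names are taken from the lists above.
DIMENSIONS = ["query_type", "syntax_used", "wiki", "index", "source"]
FIELDS = ["backendTime", "hitsReturned", "hitsTotal"]


def build_data_schema(data_source="cirrussearch_requests"):
    """Return a partial Druid dataSchema; the data source name is hypothetical."""
    return {
        "dataSource": data_source,
        "dimensionsSpec": {"dimensions": list(DIMENSIONS)},
        "metricsSpec": [{"type": "count", "name": "count"}]
        + [
            # Assumption: plain sums per field; percentiles would need
            # a different (approximate) aggregator.
            {"type": "longSum", "name": f"{field}_sum", "fieldName": field}
            for field in FIELDS
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_data_schema(), indent=2))
```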

Concerning size, I think we need one line per request, not per requestSet. I'll try to compute some numbers from Hive.

I checked a week's worth of our data: we are currently talking about 160 to 180 million lines per day. This will increase to roughly 200M lines per day as we ship the sister-search feature.

I would also include these fields (as numeric fields; a single byte each is probably enough for the expected range):

  • Number of terms in search query
  • Number of characters in search query
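A minimal sketch of how these two fields could be derived, assuming whitespace-separated terms and a cap of 255 so that each value fits in a single byte. Both the tokenization and the cap are assumptions; real term counts would depend on the language analysis used by each wiki.

```python
def query_length_features(query, cap=255):
    """Return (number of terms, number of characters), each capped at `cap`."""
    # Assumption: terms are whitespace-separated tokens.
    num_terms = len(query.split())
    num_chars = len(query)
    return min(num_terms, cap), min(num_chars, cap)
```

Capping keeps the cardinality of each field at 256 values at most, which also matters if they are later used as dimensions.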

Thanks for the answers :)
Two things I forgot (which are fairly important):

  • What is the smallest time granularity at which you want to be able to query your events?
  • And how long do you want the data to be available (in a rolling fashion)?

For time granularity, I think daily is probably sufficient for all of our use cases. If it makes a big difference in data size we could perhaps consider weekly, but I would have to think about that.

As to data availability, I'm also not sure. Historic data is always useful for looking at changes over time. But when I use data to make decisions about features, I generally look at perhaps the last month of data, and often I only ask Hive about a week of data so that results come back in a timely fashion.
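For a sense of scale, the daily line counts mentioned earlier in this thread (160 to 180 million today, roughly 200 million projected) can be multiplied out over the week and month windows discussed here. This is only back-of-the-envelope arithmetic on raw input lines; the scenario labels are made up for illustration.

```python
# Daily line counts quoted earlier in the thread.
LINES_PER_DAY = {
    "current_low": 160_000_000,
    "current_high": 180_000_000,
    "projected": 200_000_000,
}


def rows_retained(lines_per_day, retention_days):
    """Raw input lines kept for a rolling window of `retention_days` days."""
    return lines_per_day * retention_days


if __name__ == "__main__":
    for label, per_day in LINES_PER_DAY.items():
        for days in (7, 30):
            print(f"{label}, {days} days: {rows_retained(per_day, days):,} lines")
```

These are upper bounds on stored rows: with daily query granularity, Druid can roll up input lines that share the same dimension values into a single row.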

More dimensions that came up during a meeting we were having earlier today:

  • UA-based groupings. I don't know if we need as much granularity as is used for pageview data, but reusing those UDFs is probably the least-friction way forward, so I imagine we would have similar cardinality.

debt triaged this task as Medium priority. · Jan 26 2017, 11:18 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
Nuria moved this task from Incoming to Radar on the Analytics board. · Jan 30 2017, 5:33 PM
debt moved this task from Up Next to later on... on the Discovery-Search board. · Oct 19 2017, 5:22 PM
debt added a subscriber: debt.

Moving this to later, as we just don't have time for this right now and Analytics is still working on their portion of this type of work.

TJones added a subscriber: TJones. · Jan 3 2018, 5:27 PM
TJones added a comment. · Jan 5 2018, 5:52 PM

This ticket came up again in a discussion earlier in the week, and we decided that adding a few more use cases wouldn't hurt, even if we don't work on it for a while.

To add my 2¢ on time granularity, daily would be great, but weekly would be useful if it makes a big difference in data size, like Erik said.

I'd ideally like both a "plain" and "text" indexing of the queries themselves, though the stemmed version would depend on the language of the wiki (and possibly on the specific wiki), so that may be too much to ask. Otherwise, "text" could just be icu_normalized, I guess.

Along with the moon and a unicorn, I'd also like it if we could add various tags, like many of the features in Zero to Hero, or the TextCat language(s) detected, but those may be too expensive and not useful enough to precompute.

Would we want to—or could we—include user behavior or results quality metrics? I'm thinking rank of first click, number of clicks, rank of last click, whether the query was abandoned, etc.
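The behavior metrics listed above could be derived per query roughly as follows. This sketch assumes the input is the list of result ranks a user clicked, in the order the clicks happened, and that "abandoned" means no clicks at all; both are assumptions for illustration rather than an agreed definition.

```python
def click_metrics(click_ranks):
    """Summarize clicks for one query; `click_ranks` is in click order."""
    if not click_ranks:
        # Assumption: a query with no clicks counts as abandoned.
        return {
            "abandoned": True,
            "num_clicks": 0,
            "first_click_rank": None,
            "last_click_rank": None,
        }
    return {
        "abandoned": False,
        "num_clicks": len(click_ranks),
        "first_click_rank": click_ranks[0],
        "last_click_rank": click_ranks[-1],
    }
```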

Also, a search session ID would be kind of nice, because then you could group queries by session. Something that orders them properly within the session (time stamp or order ID) would be nice too.
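A small sketch of the grouping described above, assuming each event carries a `session_id`, a sortable `timestamp`, and the `query` text. These field names are hypothetical.

```python
from collections import defaultdict


def group_by_session(events):
    """Return {session_id: [queries in timestamp order]}."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["session_id"]].append(event)
    return {
        session_id: [e["query"] for e in sorted(evts, key=lambda e: e["timestamp"])]
        for session_id, evts in sessions.items()
    }
```

An explicit order ID would work the same way; it would simply replace `timestamp` as the sort key.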

(Did I mention that I wanted the moon and a unicorn, too? None of these is required for this to be useful, just icing on the cake.)

debt added a comment. · Jan 5 2018, 9:35 PM

@TJones: 🌓 +🦄 == 🍰