
Create list of current functioning and reliable search instrumentation
Closed, Resolved (Public)

Description

As a search researcher/analyst, I want to know what search signals are currently available and reliable, so I can know what research questions can be answered with them, and what may need to be fixed/built.

Specifically, it'd be good to know what we can already capture or not from the list in https://docs.google.com/document/d/1MMsnbMZSABo32oSFS0tYA1hia6Zc16l7o_n_cGFCLwQ/edit#heading=h.7j7t7j1bnz5t

AC:

  • list of functioning and reliable search instrumentation for Ricardo (and others) to know what we can currently measure in the search experience

Event Timeline

In terms of functioning and reliable, I would dare say not much, if anything. The following exist, but the backend debug logging is the only one I would expect to be accurate.

Unprocessed Event Logs:

  • Frontend desktop web data collection in hadoop at event.searchsatisfaction. Retention of ~90 days. Schema.
    • The UI has changed multiple times since we last looked at this data. Unclear how accurate it is.
  • Backend debug logging of all communication between the backend and the search engine in hadoop at event.mediawiki_cirrussearch_request. Retention of ~90 days. Schema.
    • These are likely the most reliable source of information, but they are structured debug logs and require familiarity with the implementation to extract meaning.

Processed data:

  • Click logs for full text desktop web search. Found in hadoop at discovery.query_clicks_{hourly,daily}. Schema.
    • Generated by joining the backend debug logging against incoming web requests.
    • Hasn't been reviewed for correctness in several years, but still generates data that trains plausible ranking models.
    • Hourly data contains all searches, with or without a click-through, but isn't sessionized.
    • Daily data is sessionized and filters sessions without any clickthroughs.
  • Mjolnir intermediate data. Full text desktop web only. None of this was intended for consumption; it is intermediate data used when building LTR models. Still, it might have some useful information. Found in hadoop in the mjolnir database. Further refines the click logs with labels from click modeling, light clustering of similar queries, collection of prod feature vectors, etc.
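
To make the hourly vs. daily distinction above concrete, here is a minimal Python sketch of sessionizing hourly-style rows and then dropping sessions without any click-through. Everything in it is an assumption for illustration: the field names (identity, timestamp, query, clicked_page_ids), the 30-minute inactivity gap, and the sample rows are hypothetical and are not taken from the discovery.query_clicks_{hourly,daily} schema or the actual pipeline.

```python
from collections import defaultdict

# Hypothetical inactivity gap. The rule the real pipeline uses is not stated here.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(rows, gap=SESSION_GAP_SECONDS):
    """Group rows per identity and start a new session after `gap` seconds of inactivity."""
    by_identity = defaultdict(list)
    for row in rows:
        by_identity[row["identity"]].append(row)

    sessions = []
    for identity_rows in by_identity.values():
        identity_rows.sort(key=lambda r: r["timestamp"])
        current = [identity_rows[0]]
        for row in identity_rows[1:]:
            # A long pause between consecutive searches closes the current session.
            if row["timestamp"] - current[-1]["timestamp"] > gap:
                sessions.append(current)
                current = []
            current.append(row)
        sessions.append(current)
    return sessions


def drop_clickless_sessions(sessions):
    """Keep only sessions in which at least one search led to a click."""
    return [s for s in sessions if any(r["clicked_page_ids"] for r in s)]


# Made-up rows shaped like unsessionized hourly data (all searches, clicked or not).
hourly_rows = [
    {"identity": "a", "timestamp": 0, "query": "cats", "clicked_page_ids": []},
    {"identity": "a", "timestamp": 60, "query": "cat breeds", "clicked_page_ids": [42]},
    {"identity": "a", "timestamp": 7200, "query": "dogs", "clicked_page_ids": []},
    {"identity": "b", "timestamp": 10, "query": "birds", "clicked_page_ids": []},
]

sessions = sessionize(hourly_rows)
daily_like = drop_clickless_sessions(sessions)
```

One consequence worth keeping in mind: anything computed from the filtered, daily-style output cannot measure abandonment, because sessions with no click-through have already been removed. Questions of that kind would need the hourly data.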

@EBernhardson what about any other sources we can get reliable search data from at scale? For example, webrequest logs. I admit that the data in webrequest logs specifically won't be sufficient for the majority of the questions we are considering. However, I expect it to be reliable, and we may need to pull reliable data from multiple sources and combine it to get what we need (at least in the short term), so having the fuller list will be helpful.

cc: @cchen , in case you have some additional insight into what search instrumentation/metrics are currently available

+1 to event.mediawiki_cirrussearch_request being reliable

In addition to the Search Satisfaction instrumentation on desktop, both the Wikipedia iOS (Schema) and Android (Schema) apps have their search instrumented. Unfortunately, until T305228 is resolved, that's our only view into how users search on mobile.

Search Satisfaction is a rich dataset, although its reliability is suspect, and its explicit session identification is useful for linking together multiple searches that the user performed in a single session (e.g. distinct queries, but also query reformulations).

Based on last week's check-in meeting, I think Erik's comment is a sufficient answer and we can close this ticket.