Page MenuHomePhabricator

Try combining cirrus search logs with engagement data
Closed, ResolvedPublic2 Story Points

Description

In an email thread, Erik mentioned that

There is the ability now though to join TestSearchSatisfaction2 against CirrusSearchRequestSet, although we've never done it. The id field in CirrusSearchRequestSet should match the searchToken field in TestSearchSatisfaction2.

Which, if this is possible now, opens the door for us to, say, deconstruct queries into features in Hive via DeconstructSearchQueryUDF and then combine that with clicks data from TestSearchSatisfaction2.

Now, this task isn't to answer a question like "do queries that use regex have a higher clickthrough rate than queries that don't", but just to see if this is something we can actually do if a future analysis requires us to do this.

Event Timeline

mpopov created this task.Sep 8 2016, 7:44 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 8 2016, 7:44 PM

This may not actually require the CirrusSearchRequestSet data, the TestSearchSatisfaction2 schema already holds the query strings and is accessible with hive. The following is based off of https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive. This is also a bit messy, but could be cleaned up into something sane-ish. Also note this is fairly slow and not particularly suited to hive due to the data size. Somehow this query took 278s, although it has a single partition with only 205kB of compressed data (it probably calculated everything, deserializing the json 3x per row, but still).

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
USE ebernhardson;
CREATE EXTERNAL TABLE `TestSearchSatisfaction2` (
  `json_string` string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 
  '/wmf/data/raw/eventlogging/eventlogging_TestSearchSatisfaction2';

ALTER TABLE TestSearchSatisfaction2
ADD PARTITION (year=2016,month=9,day=8,hour=20)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_TestSearchSatisfaction2/hourly/2016/09/08/20';

ADD JAR /mnt/hdfs/wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION deconstruct as 'org.wikimedia.analytics.refinery.hive.DeconstructSearchQueryUDF';
SELECT query, deconstruct(query)
  FROM ( SELECT get_json_object(json_string, '$.event.query') as query
           FROM TestSearchSatisfaction2
          WHERE year=2016 
            AND get_json_object(json_string, '$.event.action') = 'searchResultPage'
            AND get_json_object(json_string, '$.event.source') = 'fulltext'
          LIMIT 10
       ) x;

This results in valid answers, which i wont post here though because i would have to validate the various queries as safe for release.

mpopov claimed this task.

@EBernhardson: Neat! :P

This also works:

SELECT *
FROM log.TestSearchSatisfaction2_15700292
WHERE LEFT(timestamp, 10) = '2016091112'
  AND event_source = 'fulltext'
  AND event_query IS NOT NULL
  AND event_searchToken = 'ds93k5gct8e30rlnsuufee87e';
SELECT
  id, identity,
  requests[size(requests)-1].query AS query,
  requests[size(requests)-1].querytype AS query_type
FROM wmf_raw.CirrusSearchRequestSet
WHERE
  year = '2016'
  AND month = '9'
  AND day = '11'
  AND id = 'ds93k5gct8e30rlnsuufee87e';

Great job, Erik & David! :D This is very cool.

debt closed this task as Resolved.Sep 16 2016, 7:00 PM

Yay!