
Extract a set of a few hundred most popular abandoned queries
Open, Stalled, MediumPublic

Description

The recent AB test for machine learned ranking was able to improve the average click position, but the clickthrough rate stayed relatively flat. This potentially suggests that while the machine learning is able to improve the results we already have, there may be an underlying recall issue preventing good results from being found. One way to investigate this will be to pull a list of the most popular queries that have high abandonment and look into them to see what is going on there.

Potentially this data can be extracted by looking at the frequency of distinct queries in the click logs vs frequency of those same distinct queries in the complete search logs. Doing some form of query normalization, like we do in mjolnir, may be helpful but perhaps too expensive to do on the full dataset of queries.
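As a rough illustration of that idea, the extraction could look something like the sketch below. The search_log_queries and click_log_queries aggregates (and their column names) are placeholders for whichever log-derived tables end up being used, not existing tables:

-- Sketch only: compare how often a query is searched vs. how often it is clicked.
SELECT
  s.query,
  s.n_searches,
  COALESCE(c.n_clicks, 0) AS n_clicks,
  COALESCE(c.n_clicks, 0) / s.n_searches AS click_ratio
FROM search_log_queries AS s        -- placeholder: per-query counts from the complete search logs
LEFT JOIN click_log_queries AS c    -- placeholder: per-query counts from the click logs
  ON c.query = s.query
WHERE s.n_searches >= 100           -- only consider reasonably popular queries
ORDER BY click_ratio ASC, s.n_searches DESC
LIMIT 500;                          -- roughly "a few hundred" of the most abandoned

A cheap normalization such as lowercasing and trimming could be applied to the query before aggregating if the full mjolnir-style normalization is too expensive at this scale.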

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. Sep 28 2017, 5:51 PM
Restricted Application added a subscriber: Aklapper.
EBernhardson updated the task description. Sep 28 2017, 5:52 PM
EBernhardson added a subscriber: mpopov.
debt triaged this task as Medium priority. Sep 28 2017, 6:02 PM
mpopov added a comment (edited). Sep 28 2017, 6:36 PM

Using just the event logging data from 2017-08-01 to today (2017-09-28), here's a glimpse at queries from abandoned full-text searches:

-- Per-query clickthrough rates for full-text searches on enwiki,
-- from TestSearchSatisfaction2 event logging, 2017-08-01 onward.
SELECT
  search_query, dym_visible,
  COUNT(*) AS times_searched,
  SUM(had_clickthrough)/COUNT(*) AS clickthrough_rate
FROM (
  -- Collapse each search session to one row: its (normalized) query,
  -- whether "did you mean" was shown, and whether anything was clicked.
  SELECT
    session_id,
    MAX(dym_visible) AS dym_visible,
    MAX(search_query) AS search_query,
    SUM(IF(event = 'click', 1, 0)) > 0 AS had_clickthrough
  FROM (
    -- Deduplicate the raw EventLogging events.
    SELECT DISTINCT
      event_searchSessionId AS session_id,
      TRIM(LOWER(CONVERT(event_query USING latin1))) AS search_query,
      event_hitsReturned AS hits_returned,
      event_action AS event,
      event_didYouMeanVisible AS dym_visible,
      event_uniqueID AS event_id
    FROM TestSearchSatisfaction2_16909631
    WHERE wiki = 'enwiki'
      AND event_subTest IS NULL
      AND event_source = 'fulltext'
      AND event_action IN ('searchResultPage', 'click')
      AND LEFT(timestamp, 8) >= '20170801'
  ) AS deduped
  GROUP BY session_id
) AS searches
GROUP BY search_query, dym_visible
HAVING times_searched > 1 AND clickthrough_rate < 1 AND search_query != ''
ORDER BY clickthrough_rate, times_searched DESC, search_query;
| search_query | dym_visible | times_searched | clickthrough_rate |
| vmware | no | 15 | 0.0% |
| smartbear | yes | 10 | 0.0% |
| something from something | no | 5 | 0.0% |
| foreign assassination | no | 2 | 0.0% |
| friv | yes | 2 | 0.0% |
| ghosts of shepherdstown | no | 2 | 0.0% |
| kim seon-ho (actor) | yes | 2 | 0.0% |
| lady bloodfight | no | 2 | 0.0% |
| list of displays by pixel density | no | 2 | 0.0% |
| list of spells in harry potter | no | 2 | 0.0% |
| nicolette shea | no | 2 | 0.0% |
| part of an url | no | 2 | 0.0% |
| pingdom | yes | 2 | 0.0% |
| rich piana | yes | 2 | 0.0% |
| badshaho | yes | 2 | 50.0% |
| charles ross (actor) | no | 2 | 50.0% |
| dotard | no | 2 | 50.0% |
| iphone 8 | no | 2 | 50.0% |
| iphone 9 | no | 2 | 50.0% |
| it 2017 | no | 2 | 50.0% |
| it stephen king | no | 2 | 50.0% |
| life of the black tiger | no | 2 | 50.0% |
| surviving escobar | no | 2 | 50.0% |
| the night king | no | 2 | 50.0% |
| what carter lost | no | 2 | 50.0% |
| it film | no | 4 | 75.0% |

The next step would be to remove the times_searched > 1 condition and then join with the complete search log.

Edit: whoops, forgot to include DYM in the final output.
Update: DYM is now included above.
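Sketched out, that next step might look roughly like the following, where eventlogging_ctr stands in for the result of the query above without the times_searched > 1 restriction, and full_search_log(search_query, total_searches) is a placeholder for the per-query counts from the complete search log:

-- Sketch only: join the sampled EventLogging clickthrough rates against
-- popularity from the complete (unsampled) search log.
SELECT
  el.search_query,
  el.dym_visible,
  el.clickthrough_rate,          -- from the sampled EventLogging data
  f.total_searches               -- popularity from the complete search log
FROM eventlogging_ctr AS el
JOIN full_search_log AS f
  ON f.search_query = el.search_query
WHERE el.clickthrough_rate < 1
ORDER BY f.total_searches DESC
LIMIT 500;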

EBernhardson added a comment (edited). Sep 28 2017, 9:26 PM

I'd be tempted to try and source this data from the unsampled click logs that we have in Hive at discovery.query_clicks_daily. This is already somewhat curious, though: vmware returns the company first and then a bunch of related technologies, which doesn't seem all that bad. Smartbear does return somewhat poor results, with the title match down below the fold. I wonder if a feature on prefix matches could help that, but all the top 5 articles do seem to be about smartbear technologies.

TJones added a comment. Oct 2 2017, 3:45 PM

A few things that come to mind:

  • A nice large list would give us a better idea of the distribution of queries. Are there some really common things that people bail on, or is it all low frequency? One day isn't enough to tell, though it looks like the long tail is very long since @mpopov dropped the unique items, leaving a pretty short list.
  • Looking at the long tail could also be instructive (though PII is an issue, so posting here is probably not good). There may be patterns that indicate bots or bot-like sources. (i+am+still+thinking+of+that+one+weird+zrr+one Agentina)
  • Getting referers would be nice if they exist. Maybe there's a link or a form somewhere that is consistently directing people to Wikipedia, and then they just bounce. Or maybe not. But searching for some of these is kind of hard because they roll over to the article right away: vmware, iphone 8, iphone 9, and rich piana for example. Suspicious.
    • It may be more work than it is worth to figure out, but I wonder how often these same queries show up in a given session. I'm imagining an attempt to influence ranking by having bots searching for particular queries. "Hey, everyone is searching for X, so we should rank X higher in other results, too!" is not a terrible idea, and this could be an attempt to influence that.
  • Are we tracking sister search clicks? Does that count as abandonment? Should it?
  • It would be pretty subjective, but we could try to categorize results based on whether there's good info available in the snippets (e.g. for smartbear) or a good link in the sister search, or just crap results (and other categories, I'm sure, will emerge).
mpopov added a comment. Oct 2 2017, 7:19 PM

A few things that come to mind:

  • A nice large list would give us a better idea of the distribution of queries. Are there some really common things that people bail on, or is it all low frequency? One day isn't enough to tell, though it looks like the long tail is very long since @mpopov dropped the unique items, leaving a pretty short list.

Yeah, the unique items shorten the list considerably. Also this isn't 1 day. It's from almost 2 months of data. Needless to say, when randomly selecting users for tracking, we basically don't get query repeats.

  • Are we tracking sister search clicks? Does that count as abandonment? Should it?

Yeah, we're tracking those, but I did not include them in the query (modifying the query to include "ss-click" would be easy; see the sketch at the end of this comment). I think for the purposes of LTR, a user not clicking on a same-wiki result but instead clicking on another wiki's result should _not_ count, especially since the model could be deployed to other, non-Wikipedia projects.

  • It would be pretty subjective, but we could try to categorize results based on whether there's good info available in the snippets (e.g. for smartbear) or a good link in the sister search, or just crap results (and other categories, I'm sure, will emerge).

Oh shoot, looks like I forgot to add a HAVING hits_returned > 0 filter (after all, a user can't click on a result if there are none). Sorry!!! But yes, we can also look for the presence of "iw" in event_extraParams for "searchResultPage" events to mark searches where there were sister project search results.
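Put together, the session-level part of the earlier query could be extended along these lines. This is a sketch only: the 'ss-click' action name and the "iw" marker in event_extraParams are taken from the comments above and should be verified against the schema before use.

-- Session-level aggregation, extended per the notes above (sketch).
SELECT
  session_id,
  MAX(dym_visible) AS dym_visible,
  MAX(search_query) AS search_query,
  MAX(had_iw_results) AS had_iw_results,                              -- sister-project results shown
  SUM(IF(event = 'click', 1, 0)) > 0 AS had_clickthrough,
  SUM(IF(event = 'ss-click', 1, 0)) > 0 AS had_sister_clickthrough    -- exact action name to verify
FROM (
  SELECT DISTINCT
    event_searchSessionId AS session_id,
    TRIM(LOWER(CONVERT(event_query USING latin1))) AS search_query,
    event_hitsReturned AS hits_returned,
    event_action AS event,
    event_didYouMeanVisible AS dym_visible,
    event_extraParams LIKE '%"iw"%' AS had_iw_results,
    event_uniqueID AS event_id
  FROM TestSearchSatisfaction2_16909631
  WHERE wiki = 'enwiki'
    AND event_subTest IS NULL
    AND event_source = 'fulltext'
    AND event_action IN ('searchResultPage', 'click', 'ss-click')
    AND LEFT(timestamp, 8) >= '20170801'
) AS deduped
GROUP BY session_id
HAVING MAX(hits_returned) > 0;   -- a user cannot click when there are no results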

TJones added a comment. Oct 2 2017, 8:08 PM

Also this isn't 1 day. It's from almost 2 months of data. Needless to say, when randomly selecting users for tracking, we basically don't get query repeats.

D'oh, I read ..08.. to ..09.. as one day. So much for skimming being efficient, huh? So it's the sampling that's giving us such a low number of queries overall. If it's 1/200 (is that right?) over ~60 days, that's less than a third of a day's worth (60 × 1/200 = 0.3 days of unsampled traffic).

More input! Input! Input! Input!

EBernhardson added a comment (edited). Oct 3 2017, 5:41 AM

Sampling is actually 1/2000 on enwiki. I'm working out a method to extract this from the Hive tables we have, but it might be too expensive to calculate on the full webrequest history (60 days). Doing just raw fulltext clicks / fulltext searches is cheap enough, because we already extract clicks from webrequest data. It turns out, though, that this ends up looking very funny, because the 'go' feature throws the click/search ratios way off: many of the queries are ones that should have triggered 'go'. I reworked this to also extract so-called 'go clicks', where a search redirects the user directly to a page, but that requires reading the relatively raw webrequest data.

Maybe for future iterations of this kind of thing it would be useful to have an hourly Oozie job that extracts these go clicks; I could see them being generally useful for looking into what is happening with search. We can't get these straight from the search logs because some don't even trigger CirrusSearch, as Title::newFromText($query) returns an appropriate page before CirrusSearch is even tried. For now I've started up a query that will populate ebernhardson.click_ratios in Hive using the entirety of last month's data. Hopefully I can get some plausible results out of it tomorrow.
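For context, the click_ratios aggregation has roughly this shape (a sketch only; fulltext_searches, fulltext_clicks, and go_clicks are placeholders for however those per-query counts are actually extracted from the Hive logs, not real table names):

-- Sketch: per-query click/search ratios, counting 'go' redirects as both a
-- search and a click. All three source tables are placeholders.
CREATE TABLE ebernhardson.click_ratios AS
SELECT
  s.query,
  s.n + COALESCE(g.n, 0)              AS searches,   -- fulltext searches + 'go' searches
  COALESCE(c.n, 0) + COALESCE(g.n, 0) AS clicks,     -- fulltext clicks + 'go' clicks
  (COALESCE(c.n, 0) + COALESCE(g.n, 0)) / (s.n + COALESCE(g.n, 0)) AS click_ratio
FROM fulltext_searches AS s
LEFT JOIN fulltext_clicks AS c ON c.query = s.query
LEFT JOIN go_clicks       AS g ON g.query = s.query;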

EBernhardson added a comment (edited). Oct 4 2017, 4:57 PM

First draft of abandoned queries from combining the Hive logs (limited to WMF-NDA because I haven't verified there is no PII): P6075

Here queries is the number of full text searches plus the number of searches that used 'go' to end up directly at the article, and clicks is the number of full text clicks plus the number of 'go' clicks. It's filtered to everything with clicks/searches < 0.15 and then ordered by number of searches, descending.

Same data but a slightly different view: this is all the data ordered by searches - clicks, descending (same WMF-NDA limit): P6076
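In other words, the two pastes are just two orderings over the same aggregate, roughly as follows (sketch; column names follow the description above):

-- P6075-style view: high-abandonment queries, most-searched first.
SELECT query, searches, clicks, clicks / searches AS click_ratio
FROM ebernhardson.click_ratios
WHERE clicks / searches < 0.15
ORDER BY searches DESC;

-- P6076-style view: the full data, ordered by the absolute gap.
SELECT query, searches, clicks, searches - clicks AS abandoned_searches
FROM ebernhardson.click_ratios
ORDER BY searches - clicks DESC;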

TJones added a comment. Oct 4 2017, 8:59 PM

First draft of abandoned queries from combining the Hive logs (limited to WMF-NDA because I haven't verified there is no PII): P6075

I've reviewed the first 100 (lines 5-104). I suggest removing 79, 84, 89, and 90 as names not associated with people who have chosen to be in the public spotlight. 99 doesn't have an enwiki page, but has appeared on national TV programs as an expert. Personally, I'd change 72 to "[website URL]" or similar, since I don't like giving publicity to random websites.

I also recognized that media player with the queries "dvd1" film and "dvd2" film. They give results, but not good title matches, so they probably get ignored by the media player; and they are likely relatively common file names.

If someone else wants to review the top 100 and offer any others to remove, could we then share them without NDA?

Should I review the top of the list in P6076?

Another option would be to go through the list and replace many of the actual queries with categories ([person], [website], [url], etc.), so we can see the kinds of queries we're getting, if not the specific queries.


In other news, yesterday @mpopov and I were looking at another list of queries (most commonly asked questions) and investigated some that were too unusual to be repeated by chance: the question included a grammatical mistake and some extra spaces. The IP addresses were all very similar, and the queries all came from the same hour in the logs, so it looks like they could all have been done by one person. (There are a few scenarios where it seems plausible that one person could unintentionally reload the same search results page with slightly different IPs.) I discussed this with @EBernhardson and @dcausse today, too.

I know it's not easy to get to IP and user agent info, but it could be interesting to see user agent info for abandoned queries—especially ones that are in the 5-10 range. I wonder if a particular OS or device is more likely to re-issue queries. I know (some?) iOS browsers will dump pages in other tabs to save memory, so changing tabs does a reload.

It may be too difficult to do automatically, but we need to keep in mind that when we have relatively small numbers over relatively short periods, the queries we see may not actually represent multiple users.

EBernhardson added a comment.

The data above was for a single day, but I can get a run going with a full month of data. I wanted to get a bit more data cleaning in there; I'll poke at your suggestions and see if I can find nicely automated ways to filter them out of the data.

debt added a subscriber: debt. Oct 10 2017, 11:28 PM

Thanks, @EBernhardson, the extra data will be useful and interesting! :)

This comment was removed by debt.
debt changed the task status from Open to Stalled. Dec 15 2017, 4:48 PM
debt added a project: Discovery-Analysis.
This comment was removed by debt.
Restricted Application added a project: Product-Analytics. Apr 19 2018, 12:20 AM
MBinder_WMF moved this task from Triage to Backlog on the Product-Analytics board. May 3 2018, 8:26 PM