Page MenuHomePhabricator

Investigate intriguing behaviors in search metrics for desktop events
Closed, ResolvedPublic4 Estimated Story Points

Description

The graph that displays desktop search events is suspicious. The number of Results page opened seems to double from time to time with no particular reason.

date (2016)# events
Nov 516K
Nov 631K
Nov 718K
Nov 838K
Nov 917K

This behavior seems to affect only Results page opened other metrics remain relatively stable.

Event Timeline

Extracting raw data from TestSearchsatisfaction2 shows:

select LEFT(timestamp, 8) ts, event_source, event_action, count(*), count(distinct(event_searchSessionId)) from TestSearchSatisfaction2_15922352 where timestamp between '20161105000000' and '20161110000000' and event_action in ('searchResultPage') group by LEFT(timestamp, 8), event_source, event_action order by ts, event_source, event_action;
tsevent_sourceevent_actioncount(*)count(distinct(event_searchSessionId))
20161105autocompletesearchResultPage14719514068
20161105fulltextsearchResultPage178175846
20161106autocompletesearchResultPage168318 (+14%)16289
20161106fulltextsearchResultPage32591 (+82%)6775
20161107autocompletesearchResultPage205904 (+22%)19824
20161107fulltextsearchResultPage19786 (-39%)8573
20161108autocompletesearchResultPage200385 (-2%)19481
20161108fulltextsearchResultPage38847 (+27%)8526
20161109autocompletesearchResultPage186886 (-6%)18776
20161109fulltextsearchResultPage19173 (-50%)7955

This does not seem to be a bug in the dashboard but a problem in the data itself.
It sounds strange that the number of searchResultPage@fulltext events can double from one day to another while the number of searchResultPage@autocomplete remains stable.
The number of unique search sessions remains relatively stable as well.
Possible cause could be:

  • a bot?
  • duplicated data for searchResultPage@fulltext?

I'm leaning toward bot (with very low confidence), if it was a bug somewhere that generated duplicated events we would maybe see strange numbers elsewhere?

@mpopov I know that you're working on bot detection and I wonder if it'd be interesting to test your system on this data: TestSearchSatisfaction2_15922352 between 20161105 and 20161110 for searchResultPage@fulltext events.

Grouping by wikis it seems that the problem affects 2 wikis for 2 different days:

Number of events for enwiki:

tsevent_sourceevent_actioncount(*)count(distinct(event_searchSessionId))
20161105fulltextsearchResultPage89882091
20161106fulltextsearchResultPage224422361
20161107fulltextsearchResultPage69693134
20161108fulltextsearchResultPage84643069
20161109fulltextsearchResultPage70092830

Number of events for frwiki:

tsevent_sourceevent_actioncount(*)count(distinct(event_searchSessionId))
20161105fulltextsearchResultPage631282
20161106fulltextsearchResultPage752344
20161107fulltextsearchResultPage1062351
20161108fulltextsearchResultPage18152360
20161109fulltextsearchResultPage638322
debt triaged this task as Medium priority.Nov 15 2016, 9:11 PM
debt moved this task from Needs triage to Up Next on the Discovery-Analysis board.
mpopov set the point value for this task to 4.Nov 16 2016, 11:21 PM
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

Checking for duplication of events:

SELECT
  DATE(LEFT(`timestamp`, 8)) AS `date`,
  event_source AS source,
  event_searchSessionId AS session_id,
  event_uniqueId AS event_id,
  COUNT(*) AS events
FROM `TestSearchSatisfaction2_15922352`
WHERE `timestamp` BETWEEN '20161105000000' AND '20161110000000'
  AND event_subTest IS NULL
  AND event_action = 'searchResultPage'
GROUP BY `date`, source, session_id, event_id
ORDER BY events DESC, `date`, source, session_id, event_id
LIMIT 20;
datesourcesession_idevent_idevents
2016-11-05autocomplete2b731e6d3d179e94iv53u8du3b3f950172d118eciv54n4p19
2016-11-05autocomplete2b731e6d3d179e94iv53u8dub4c7e8b5e7eccee7iv54dcxq9
2016-11-05autocomplete2b731e6d3d179e94iv53u8du8f4197b34c4fd9cciv54o8878
2016-11-05autocomplete2b731e6d3d179e94iv53u8du17ff6efbceb67299iv54r1gk5
2016-11-05autocomplete2b731e6d3d179e94iv53u8du8387530641777676iv544sim5
2016-11-05fulltext085f3100dbec7f6fiv5hy67g52df631925a2ea8civ5izkpk5
2016-11-08autocomplete7aaa4588433def4biv9f4tqs1a1cd9405f6685fciv9fy3pd5
2016-11-08autocomplete7aaa4588433def4biv9f4tqs7f36d5e42ef8cf61iv9ff1bo5
2016-11-08autocomplete7aaa4588433def4biv9f4tqsb2b771cb785562eeiv9gi4qa5
2016-11-08autocomplete7aaa4588433def4biv9f4tqsf54b7e660b8b1c71iv9feqxo5
2016-11-05autocomplete0b30b5135c0a8fc9iv5hfa248cce82eb010c376aiv5hl8ws4
2016-11-05autocomplete2c86b1f435320ff6iv54twn00923399cd72ba099iv54tzm64
2016-11-06autocomplete3761b84037f056e7iv6ozi85caa1ec2ee72e337eiv6pvjpo4
2016-11-06fulltextb3a992342cd6d7cdiv6qo5w1b98785523577db53iv6qowu64
2016-11-07autocomplete7ab9ad7d713549f1iv7hh0vbd1b4f83564bbfe4civ7hs0y44
2016-11-08autocomplete7aaa4588433def4biv9f4tqs4b2c59248ff00470iv9gikqq4
2016-11-08autocomplete7aaa4588433def4biv9f4tqs926eb6ff368a1d3div9fl04u4
2016-11-08autocomplete7aaa4588433def4biv9f4tqs94302876c90725f7iv9ffbfs4
2016-11-08autocomplete7aaa4588433def4biv9f4tqsc65e80d1239b273fiv9gs8ah4
2016-11-08autocomplete7aaa4588433def4biv9f4tqsd07a9762b764fb4div9gmoho4
SELECT
  `date`, source,
  SUM(IF(events > 1, 1, 0)) AS `SERP events duplicated at least once`,
  COUNT(DISTINCT(session_id)) AS `Unique session IDs`,
  SUM(events) AS `Total number of SERP events`,
  COUNT(DISTINCT(event_id)) AS `Number of unique SERP events`
FROM (
  SELECT
    DATE(LEFT(`timestamp`, 8)) AS `date`,
    event_source AS source,
    event_searchSessionId AS session_id,
    event_uniqueId AS event_id,
    COUNT(*) AS events
  FROM `TestSearchSatisfaction2_15922352`
  WHERE `timestamp` BETWEEN '20161105000000' AND '20161110000000'
    AND event_subTest IS NULL
    AND event_action = 'searchResultPage'
  GROUP BY `date`, source, session_id, event_id
) AS refined_events
GROUP BY `date`, source
ORDER BY source ASC, `date` ASC
LIMIT 1000;
datesourceSERP events duplicated at least onceUnique session IDsTotal number of SERP eventsNumber of unique SERP events
2016-11-05autocomplete24413518143051142768
2016-11-06autocomplete25815730163837163569
2016-11-07autocomplete48219172200489199996
2016-11-08autocomplete47718783194523194003
2016-11-09autocomplete35218090181082180727
2016-11-05fulltext3553361681216769
2016-11-06fulltext4662353149831448
2016-11-07fulltext4779111840918359
2016-11-08fulltext3178663747637443
2016-11-09fulltext3373411780217767

Broken down by enwiki/frwiki/other:

SELECT
  wiki_v2, source, `date`,
  SUM(IF(events > 1, 1, 0)) AS `SERP events duplicated at least once`,
  COUNT(DISTINCT(session_id)) AS `Unique session IDs`,
  SUM(events) AS `Total number of SERP events`,
  COUNT(DISTINCT(event_id)) AS `Number of unique SERP events`
FROM (
  SELECT
    DATE(LEFT(`timestamp`, 8)) AS `date`,
    CASE WHEN wiki IN('enwiki', 'frwiki') THEN wiki ELSE 'other' END AS wiki_v2,
    event_source AS source,
    event_searchSessionId AS session_id,
    event_uniqueId AS event_id,
    COUNT(*) AS events
  FROM `TestSearchSatisfaction2_15922352`
  WHERE `timestamp` BETWEEN '20161105000000' AND '20161110000000'
    AND event_subTest IS NULL
    AND event_action = 'searchResultPage'
  GROUP BY `date`, wiki_v2, source, session_id, event_id
) AS refined_events
GROUP BY `date`, source, wiki_v2
ORDER BY wiki_v2 ASC, source ASC, `date` ASC
LIMIT 1000;
wiki_v2sourcedateSERP events duplicated at least onceUnique session IDsTotal number of SERP eventsNumber of unique SERP events
enwikiautocomplete2016-11-059164936860868514
enwikiautocomplete2016-11-0612073617546175339
enwikiautocomplete2016-11-0717092739635596182
enwikiautocomplete2016-11-0821188468980489555
enwikiautocomplete2016-11-0914683848483184684
enwikifulltext2016-11-0511209189888976
enwikifulltext2016-11-062923612244222410
enwikifulltext2016-11-0721313469696946
enwikifulltext2016-11-089306984648455
enwikifulltext2016-11-0914283070096994
frwikiautocomplete2016-11-051476194469431
frwikiautocomplete2016-11-06138381033810325
frwikiautocomplete2016-11-07986587348725
frwikiautocomplete2016-11-08198751026810249
frwikiautocomplete2016-11-091981388778858
frwikifulltext2016-11-055282631626
frwikifulltext2016-11-060344752752
frwikifulltext2016-11-07935110621053
frwikifulltext2016-11-0853601815218147
frwikifulltext2016-11-090322638638
otherautocomplete2016-11-0513962646499764823
otherautocomplete2016-11-0612575317803877905
otherautocomplete2016-11-0730390349540095089
otherautocomplete2016-11-0824790629445194199
otherautocomplete2016-11-0918788938737487185
otherfulltext2016-11-0519296371937167
otherfulltext2016-11-0617353083048286
otherfulltext2016-11-071744261037810360
otherfulltext2016-11-081744371086010841
otherfulltext2016-11-091941891015510135

Okay, so we have a problem of duplicated events but not to such a high degree that it explains this particular problem. Hm...

SELECT
  wiki, `date`,
  MAX(total_serps) AS `Largest number of (total) SERPs per user`,
  MAX(unique_serps) AS `Largest number of (unique) SERPs per user`
FROM (
  SELECT
    wiki, `date`, session_id,
    SUM(events) AS total_serps,
    COUNT(DISTINCT(event_id)) AS unique_serps
  FROM (
    SELECT
      DATE(LEFT(`timestamp`, 8)) AS `date`,
      wiki,
      event_searchSessionId AS session_id,
      event_uniqueId AS event_id,
      COUNT(*) AS events
    FROM `TestSearchSatisfaction2_15922352`
    WHERE `timestamp` BETWEEN '20161105000000' AND '20161110000000'
      AND event_subTest IS NULL
      AND event_action = 'searchResultPage'
      AND wiki IN('enwiki', 'frwiki')
      AND event_source = 'fulltext'
    GROUP BY `date`, wiki, session_id, event_id
  ) AS refined_events
  GROUP BY `date`, wiki, session_id
) AS per_user
GROUP BY wiki, `date`
ORDER BY wiki ASC, `date` ASC
LIMIT 100;
wikidateLargest number of (total) SERPs per userLargest number of (unique) SERPs per user
enwiki2016-11-0534823477
enwiki2016-11-061737117356
enwiki2016-11-07139139
enwiki2016-11-08875875
enwiki2016-11-09195195
frwiki2016-11-054238
frwiki2016-11-062626
frwiki2016-11-07287285
frwiki2016-11-081699316989
frwiki2016-11-092020
SELECT
  wiki, `date`, session_id, user_agent,
  MAX(search_token) AS a_search_token,
  SUM(events) AS total_serps,
  COUNT(DISTINCT(event_id)) AS unique_serps
FROM (
  SELECT
    DATE(LEFT(`timestamp`, 8)) AS `date`,
    wiki,
    event_searchSessionId AS session_id,
    event_uniqueId AS event_id,
    userAgent AS user_agent,
    event_searchToken AS search_token,
    COUNT(*) AS events
  FROM `TestSearchSatisfaction2_15922352`
  WHERE `timestamp` BETWEEN '20161105000000' AND '20161110000000'
    AND event_subTest IS NULL
    AND event_action = 'searchResultPage'
    AND wiki IN('enwiki', 'frwiki')
    AND event_source = 'fulltext'
  GROUP BY `date`, wiki, session_id, event_id
) AS refined_events
GROUP BY `date`, wiki, session_id, user_agent
HAVING total_serps > 10000
ORDER BY total_serps, `date`, wiki
LIMIT 10;
wikidatesession_iduser_agenta_search_tokentotal_serpsunique_serps
frwiki2016-11-086c5e623e146472d1iv8pzs0dMozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; WOW64; Trident/7.0)zzfdexh3lg0qfsddn5ghj7zy1699316989
enwiki2016-11-06cf6e9903e4e61131iv5r7olhMozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; WOW64; Trident/7.0)zysoeo38hflh4bujconb7p2w1737117356

Let's check for IP addresses in search logs in Hive:

SELECT *
FROM CirrusSearchRequestSet
WHERE
  year = 2016 AND month = 11 AND day IN(6, 8)
  AND id IN('zzfdexh3lg0qfsddn5ghj7zy', 'zysoeo38hflh4bujconb7p2w');

@dcausse: Searches were by the same IP, belonging to Harvard University. My hypothesis is that the numbers we're seeing here aren't so unusual for them, but on those days they just happened to be randomly selected into TSS2 logging. Let's check that hypothesis by counting searches made by that particular IP address (not shown here for privacy reasons):

SELECT
  CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
  COUNT(DISTINCT(wikiid)) AS wikis_searched,
  COUNT(1) AS searches
FROM CirrusSearchRequestSet
WHERE
  year = 2016 AND month = 11 AND day BETWEEN 1 AND 14
  AND source = 'web'
  AND ip = '${HARVARDUNIV}'
  AND requests[size(requests)-1].querytype = 'full_text'
GROUP BY CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
ORDER BY date
LIMIT 100;

Weird. These are the results for Nov 1st-14th:

datewikis_searchedsearches
2016-11-05112920
2016-11-06117350
2016-11-07216079
2016-11-08216991

Not sure what happened there, but at least we have our answer :)

@mpopov Thanks! so this is likely an automata scrapping SERPs directly from web with javascript enabled?
Is it something we have some chances to detect in the future?
Sadly it has some non negligible impact on our dashboards.

@mpopov Thanks! so this is likely an automata scrapping SERPs directly from web with javascript enabled?

I can see that being the case, yes.

Is it something we have some chances to detect in the future?

Sort of. We have T150370+T150901 and T149440 in the works for zero results rate, but not for engagement stats.

Sadly it has some non negligible impact on our dashboards.

:(