
Classify fulltext search abandonment: sampling
Closed, Resolved · Public · 3 Estimated Story Points

Description

Acceptance criteria:

  • For the top 10 most frequently visited Wikipedias (by user pageviews) get 50 fulltext search session abandonments apiece (approach for sampling probably includes a mix of fulltext head queries and some form of random sampling)
  • As a first step after figuring out sampling routine, generate queries from the abandoned sessions
  • Target namespace 0 article searches

Probable deliverable: a structured file and its re-runnable Jupyter notebook

Event Timeline

dr0ptp4kt set the point value for this task to 3. Oct 1 2024, 3:44 PM
dr0ptp4kt updated the task description.
dr0ptp4kt updated the task description.

I put together a notebook for this, some notes:

General sampling notes:

  • Data collection is scoped to the week of Sept 25-Oct 1, 2024 (inclusive).
  • Data collection is scoped to sessions that our metrics dashboard considers abandoned fulltext sessions.
  • Uses pageview_hourly data to decide the top n wikis to sample from. n is set to 10 per the task description.
  • Selects a random sample of n sessions from one week of source data per selected wiki. n is set to 50 per the task description.
  • Each session reports the number of fulltext SERPs shown, the domain name/database name/language code, the session id, and a list of events.
  • Reports click (all variants), visit, and SERP events for both fulltext and autocomplete for each sampled session. The presumption is that including autocomplete events might give more insight into the user's intent.
  • The output is currently simply printed to the notebook output. It would be reasonably trivial to write it to a file, or a file per wiki, but I wasn't sure what output is preferred. Could ponder different ways to format the data, but for now it is a very simple pprint(row.asDict()) (asDict() is not recursive, which has the side benefit of keeping each event on one line, but requires a wide monitor to view comfortably). A rough sketch of the sampling and output follows this list.
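For illustration only, here is a minimal sketch of what the sampling routine could look like in PySpark. It assumes a Spark session named spark and a hypothetical abandoned_sessions DataFrame that already applies the dashboard's definition of an abandoned fulltext session; the partition filters, column names, and the project/domain mapping are simplified and are not the notebook's actual code.

from pprint import pprint

from pyspark.sql import functions as F
from pyspark.sql.window import Window

TOP_N_WIKIS = 10        # per the task description
SESSIONS_PER_WIKI = 50  # per the task description

# Top Wikipedias by user pageviews over the sampled week
# (date/partition filters simplified here).
top_wikis = (
    spark.table("wmf.pageview_hourly")
    .where((F.col("year") == 2024) & (F.col("month") == 9))
    .where(F.col("agent_type") == "user")
    .where(F.col("project").endswith(".wikipedia"))
    .groupBy("project")
    .agg(F.sum("view_count").alias("views"))
    .orderBy(F.desc("views"))
    .limit(TOP_N_WIKIS)
)
top_projects = [r["project"] for r in top_wikis.collect()]

# Random sample of abandoned sessions per selected wiki: order each wiki's
# sessions randomly and keep the first 50.
w = Window.partitionBy("wiki").orderBy(F.rand(seed=42))
sampled = (
    abandoned_sessions  # hypothetical: one row per abandoned fulltext session
    # assumes a project key matching pageview_hourly; mapping from
    # domain_name/wiki to project is elided here
    .where(F.col("project").isin(top_projects))
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") <= SESSIONS_PER_WIKI)
    .drop("rn")
)

# Output is simply pretty-printed; writing a file (or one file per wiki)
# would be a small change at this point.
for row in sampled.collect():
    pprint(row.asDict())  # asDict() is shallow, keeping each event Row on one line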

Per-event notes:

  • Each event includes a sourcePageId. This is the pageId that the event was performed on. If this is None, the event came from a special page, presumably Special:Search, but there is no guarantee.
  • When an autocomplete click event has position = -1, that means they submitted the query to Special:Search. They could potentially still be redirected to a specific page instead of seeing Special:Search results. position >= 0 means they selected an item from the provided results and will most likely go directly to the selected page.
  • While the per-session events are sorted by time, the ordering is not guaranteed to be the order in which the events were generated. For example, an autocomplete 'click' action followed by one or two autocomplete 'serp' actions a few dozen ms later is likely simply mis-ordered. The sourcePageId column can be useful for deciding that the autocomplete serp events likely came before the autocomplete click event. A small illustrative helper is sketched below.
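To make those rules concrete, here is a small helper, purely for illustration (the naming is mine, it is not part of the notebook), that labels one of the event Rows from the output according to the notes above:

def describe_event(event):
    """Label one event Row using the interpretation rules above (illustrative only)."""
    parts = [f"{event.source} {event.action}"]
    if event.sourcePageId is None:
        # No source page id: the event came from a special page,
        # presumably Special:Search, though that is not guaranteed.
        parts.append("from a special page")
    if event.source == "autocomplete" and event.action == "click":
        if event.position == -1:
            # The query was submitted to Special:Search (the user may still be
            # redirected to a specific page instead of seeing results).
            parts.append("query submitted to Special:Search")
        elif event.position is not None and event.position >= 0:
            # A suggested result was selected; the user most likely went
            # straight to that page.
            parts.append(f"selected suggestion at position {event.position}")
    return ", ".join(parts)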

I'm unclear on the best method to share results; for now the notebook is found at stat1008.eqiad.wmnet:~ebernhardson/T3761610-fulltext-abandonment-sample.ipynb and can be viewed by copying it to your own home dir and opening it in JupyterLab. The notebook could be cleared of output data and uploaded to our notebooks repo, but it's hard to review there, so I'll wait on that.

Example session follows.

{'domain_name': 'de.wikipedia.org',
 'events': [Row(dt='2024-09-27T20:08:57.289Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='h', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.290Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='he', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.294Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hell', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.295Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='helle', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.299Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellen', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.300Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='helleni', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.301Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic a', hitsReturned=10),
            Row(dt='2024-09-27T20:08:57.302Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air ', hitsReturned=1),
            Row(dt='2024-09-27T20:08:57.304Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air f', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.304Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air fo', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.305Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic ai', hitsReturned=5),
            Row(dt='2024-09-27T20:08:57.306Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air for', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.306Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air forc', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.306Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air forcw', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.308Z', action='searchResultPage', source='autocomplete', position=None, sourcePageId=12312976, inputLocation='header-navigation', query='hellenic air force', hitsReturned=0),
            Row(dt='2024-09-27T20:08:57.310Z', action='click', source='autocomplete', position=-1, sourcePageId=12312976, inputLocation=None, query=None, hitsReturned=None),
            Row(dt='2024-09-27T20:09:27.884Z', action='searchResultPage', source='fulltext', position=None, sourcePageId=None, inputLocation=None, query='hellenic air force', hitsReturned=51)],
 'language_code': 'de',
 'num_fulltext_serp': 1,
 'searchSessionId': 'b69bbd0e9cfcab8d5d0dm1l5oi27',
 'wiki': 'dewiki'}
EBernhardson added a subscriber: TJones.

@TJones I'm assuming you will want to look at this and comment on the output formatting.

@EBernhardson, the output format is fine. I can definitely work with that. I may have some questions about what some of the details mean, but between your description and looking at the data, it makes sense.

In other news, you have a typo in your file path: T3761610 instead of T376161... but I figured it out!

I'm also surprised at how out of sync the autocomplete partial queries are (though not in your example).

Here's an example with extraneous details removed:

Row(dt='9.266Z', query='perung', hitsReturned=10),
Row(dt='9.267Z', query='Perungalathur high', hitsReturned=0),
Row(dt='9.269Z', query='Perungalathur', hitsReturned=2),
Row(dt='9.269Z', query='Perungalathur highway', hitsReturned=0),
Row(dt='9.269Z', query='perunga', hitsReturned=10),
Row(dt='9.271Z', query='Perungalathur hi', hitsReturned=0),
Row(dt='9.272Z', query='Perungalathur highwa', hitsReturned=0),
Row(dt='9.273Z', query='Perungalathur hig', hitsReturned=0),
Row(dt='9.286Z', query='Perungalathur highw', hitsReturned=0),
Row(dt='9.297Z', query='Perungalathur h', hitsReturned=2),
Row(dt='9.299Z', query='perun', hitsReturned=10),
Row(dt='9.301Z', query='p', hitsReturned=10),
Row(dt='9.301Z', query='per', hitsReturned=10),
Row(dt='9.301Z', query='peru', hitsReturned=10),
Row(dt='9.302Z', query='perungal', hitsReturned=10),

All this in 0.05 seconds... I'm guessing the searcher just had trouble making their initial connection.

In likely typing order (though it makes almost as much sense wrt the time stamps in reverse order):

Row(dt='9.301Z', query='p', hitsReturned=10),
Row(dt='9.301Z', query='per', hitsReturned=10),
Row(dt='9.301Z', query='peru', hitsReturned=10),
Row(dt='9.299Z', query='perun', hitsReturned=10),
Row(dt='9.266Z', query='perung', hitsReturned=10),
Row(dt='9.269Z', query='perunga', hitsReturned=10),
Row(dt='9.302Z', query='perungal', hitsReturned=10),
Row(dt='9.269Z', query='Perungalathur', hitsReturned=2),
Row(dt='9.297Z', query='Perungalathur h', hitsReturned=2),
Row(dt='9.271Z', query='Perungalathur hi', hitsReturned=0),
Row(dt='9.273Z', query='Perungalathur hig', hitsReturned=0),
Row(dt='9.267Z', query='Perungalathur high', hitsReturned=0),
Row(dt='9.286Z', query='Perungalathur highw', hitsReturned=0),
Row(dt='9.272Z', query='Perungalathur highwa', hitsReturned=0),
Row(dt='9.269Z', query='Perungalathur highway', hitsReturned=0),

(I'm impressed with how fast they must have capitalized the first letter...)
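For what it's worth, one crude programmatic way to recover the likely typing order is to sort the autocomplete SERP events by query length, since each keystroke normally extends the previous query. This is my own heuristic, not something in the notebook, and it breaks down if the searcher edits earlier characters rather than only appending:

def likely_typing_order(events):
    """Sort autocomplete SERP events into probable typing order by query length,
    breaking ties with the (unreliable) timestamp."""
    serps = [e for e in events
             if e.source == "autocomplete" and e.action == "searchResultPage"]
    return sorted(serps, key=lambda e: (len(e.query or ""), e.dt))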

I wonder how much autocomplete server load we could save if we used the much more accurate client-side clock and only submitted one query every n milliseconds, and whether searchers would notice a difference. Anyway, that's a topic for another time.

The notebook looks great!

One more quick question, though: should we be worried about the stack trace at the end, after all the session details?