Select an arbitrary dwell-time threshold
Closed, Resolved · Public · 5 Story Points

Description

We need an arbitrary dwell-time threshold for use in the user satisfaction metric. Compute it!

This can be as simple as:

  1. Build the backend code to compute success;
  2. Stick a shiny app up in front of it;
  3. Twiddle the dwell-time knob until it does something consistently.
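
To make step 3 concrete, here's a minimal sketch of what such a Shiny front end could look like. The `dwell_times` vector below is simulated stand-in data, not real schema output; the actual app would read whatever the backend from step 1 computes.

```
# Minimal sketch: a slider for the dwell-time threshold and the resulting
# "success" rate. dwell_times is simulated; swap in real backend output.
library(shiny)

dwell_times <- rexp(10000, rate = 1/60)  # placeholder dwell times, in seconds

ui <- fluidPage(
  sliderInput("threshold", "Dwell-time threshold (seconds)",
              min = 1, max = 120, value = 30),
  textOutput("success_rate")
)

server <- function(input, output) {
  output$success_rate <- renderText({
    rate <- mean(dwell_times >= input$threshold)
    sprintf("Success rate at a %d-second threshold: %.1f%%",
            input$threshold, 100 * rate)
  })
}

shinyApp(ui = ui, server = server)
```
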
Deskana assigned this task to Ironholds.

Okay, so here's where we are.

I took the data we could rely on from the UserSatisfaction2 schema (a week's worth) and subset it to all page visits and search page visits from the available sessions, which came to 122,638 sessions containing 178,003 events. From there I calculated our preliminary success rate using arbitrary thresholds of 1, 10, 20, 30, 60, 90, 100 and 120 seconds.
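
For reviewers, here's a sketch of the kind of calculation involved (not the attached code itself; the column names session_id, event_type and dwell_seconds are illustrative stand-ins for the actual schema fields):

```
# Sketch of the per-threshold success-rate calculation. A session counts as
# a "success" at threshold t if it contains at least one page visit from the
# results page with a dwell time of at least t seconds.
library(data.table)

thresholds <- c(1, 10, 20, 30, 60, 90, 100, 120)

success_by_threshold <- function(events, thresholds) {
  n_sessions <- events[, uniqueN(session_id)]
  sapply(thresholds, function(t) {
    successes <- events[event_type == "visitPage" & dwell_seconds >= t,
                        uniqueN(session_id)]
    successes / n_sessions
  })
}
```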

The highest the success rate can possibly go is ~25%, because a lot of searches don't result in people clicking on anything at all; beyond that we see a very long drop-off as the threshold increases. There's no clear indication from this graph alone that any one threshold is "right" (and I wouldn't expect there to be).

If we look at the daily variation for each threshold:

The lower the threshold goes, the more variation there is, which is interesting but not tremendously surprising; we'd expect it to "settle" after a certain point, and it does. There's not enough trustworthy data in our backlog to give us a real way of saying "yes, X is the best threshold to use". Selecting one is basically going to come down to the work we did on inter-time analysis, where we saw that drop centred around 100 seconds. If we believe that events after that point represent reading, we should set a threshold of ~100 seconds; if we believe the peak at 10 seconds represents people learning what they need and then leaving, we shouldn't. My gut says the former, but I'm not tremendously wedded to either idea. Getting more data in over time would let us make a much better guess at this kind of thing.
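
For reference, the daily-variation comparison boils down to something like this sketch (same caveat: the date, session and event columns are hypothetical names, not the actual schema fields):

```
# Sketch: success rate per day per threshold, then the day-to-day spread
# (standard deviation) at each threshold. Column names are illustrative.
library(data.table)

daily_variation <- function(events, thresholds) {
  daily <- rbindlist(lapply(thresholds, function(t) {
    events[, .(threshold = t,
               success_rate = uniqueN(session_id[event_type == "visitPage" &
                                                   dwell_seconds >= t]) /
                 uniqueN(session_id)),
           by = date]
  }))
  # lower thresholds should show a larger spread across days
  daily[, .(sd_across_days = sd(success_rate)), by = threshold]
}
```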

Separately from this, we should be tracking the actual clickthrough-to-results rate; that's far more robust, gives us a hint as to whether our results are worth anything, and lets us monitor the impact of changes to the search interface. I've attached the code and aggregated data in case @mpopov has comments here.
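
To be explicit about what I mean by clickthrough-to-results: the proportion of search sessions with at least one click through to a result. One way to compute it, as a sketch with the same hypothetical column names as above:

```
# Sketch: clickthrough rate, i.e. the proportion of search sessions that
# include at least one click through to a result page.
library(data.table)

clickthrough_rate <- function(events) {
  per_session <- events[, .(clicked = any(event_type == "visitPage")),
                        by = session_id]
  per_session[, mean(clicked)]
}
```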

We're defining user satisfaction as the area under a monotone decreasing curve to the right of a threshold. So user satisfaction, as we've defined it, will always shrink as the threshold grows.

When the user visits a page from the SERP, there are 4 possible outcomes:
(1a) they stayed for a while because they found what they were looking for (e.g. reading Alan Turing's biography)
(1b) they stayed for a while because they didn't find what they were looking for
(2a) they closed the page quickly because they found what they were looking for (e.g. Beyonce's birthday)
(2b) they closed the page quickly because they didn't find what they were looking for

All of our current approaches lump 1a and 1b together as "good" and 2a and 2b together as "bad". What we actually want to know is how many users fall into 1a+2a versus 1b+2b, so we need to figure out how to differentiate them.

This isn't helpful, I know.

Well, yes, and there's no way around that, which is one of the reasons I'm pushing for clickthrough as an initial, robust metric. It has its own problems, but it at least lets us see whether there are obvious issues with the interface.

+1 on clickthrough rate as the best initial shot we've got out of the bunch.

It's also something we have a lot of historical data on and can calculate for Apps/Mobile/Desktop distinctly.
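
For example, with the same hypothetical columns as the sketches above plus a platform field, the per-platform split is just a grouped version of the same calculation:

```
# Per-platform clickthrough rate, assuming a hypothetical `platform` column
# (e.g. "App", "Mobile", "Desktop") alongside the columns used above.
library(data.table)

clickthrough_by_platform <- function(events) {
  events[, .(clicked = any(event_type == "visitPage")),
         by = .(session_id, platform)][
    , .(clickthrough_rate = mean(clicked)), by = platform]
}
```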

Marking as "done" now we've had the review!

Deskana closed this task as "Resolved". Nov 20 2015, 5:23 AM