
Add a PaulScore approximation to discovery.wmflabs.org
Closed, Resolved · Public · 3 Estimated Story Points

Description

In our weekly search meeting we talked about calculating and graphing an approximate PaulScore of actual user behaviour, as a metric for the quality of search results. I came up with the following SQL, which calculates it for fulltext and autocomplete over the specified number of days. Note that this isn't exactly the same PaulScore we use in RelForge, as we don't have all the data necessary, but the results seem reasonably similar to what we get out of RelForge (0.30 ± 0.02).

We've been using the 0.7^x value in RelForge recently, which gives the 13th position a weight of about 0.01. The 0.5^x variant, also calculated here, puts a little more preference on results appearing higher in the result set, with the 7th result having a weight of about 0.008. We don't actually know the best values to use for these factors; these are mostly guesses.
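To make those weights concrete (using the result's rank as the exponent, as in the figures above):

0.7^13 ≈ 0.0097 ≈ 0.01    (13th result under the 0.7 factor)
0.5^7  = 1/128  ≈ 0.008   (7th result under the 0.5 factor)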

Query

SELECT date, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7,
       ROUND(SUM(pow_5)/COUNT(1), 2) as pow_5
  FROM ( SELECT event_searchSessionId,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7,
                SUM(IF(event_action = 'click',
                      POW(0.5, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, event_source
       ) x
 GROUP BY date, event_source
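As a worked example of the inner query's arithmetic: a hypothetical session with two result pages and clicks at positions 0 and 2 scores pow_5 = (0.5^0 + 0.5^2) / 2 = 0.625. The same calculation as a self-contained toy query (the session events here are made up for illustration):

SELECT SUM(IF(action = 'click', POW(0.5, position), 0))
       / SUM(IF(action = 'searchResultPage', 1, 0)) AS pow_5
  FROM (           SELECT 'searchResultPage' AS action, NULL AS position
         UNION ALL SELECT 'searchResultPage', NULL
         UNION ALL SELECT 'click', 0
         UNION ALL SELECT 'click', 2 ) session_events
-- returns (1 + 0.25) / 2 = 0.625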

Example results for a few days:

+----------+--------------+-------+-------+
| date     | event_source | pow_7 | pow_5 |
+----------+--------------+-------+-------+
| 20160820 | autocomplete |  0.43 |  0.53 |
| 20160820 | fulltext     |  0.30 |  0.28 |
| 20160821 | autocomplete |  0.42 |  0.52 |
| 20160821 | fulltext     |  0.30 |  0.27 |
| 20160822 | autocomplete |  0.44 |  0.55 |
| 20160822 | fulltext     |  0.32 |  0.29 |
| 20160823 | autocomplete |  0.45 |  0.56 |
| 20160823 | fulltext     |  0.32 |  0.29 |
| 20160824 | autocomplete |  0.45 |  0.57 |
| 20160824 | fulltext     |  0.31 |  0.29 |
| 20160825 | autocomplete |  0.44 |  0.56 |
| 20160825 | fulltext     |  0.31 |  0.29 |
| 20160826 | autocomplete |  0.46 |  0.57 |
| 20160826 | fulltext     |  0.31 |  0.28 |
| 20160827 | autocomplete |  0.42 |  0.52 |
| 20160827 | fulltext     |  0.31 |  0.28 |
| 20160828 | autocomplete |  0.42 |  0.53 |
| 20160828 | fulltext     |  0.30 |  0.27 |
| 20160829 | autocomplete |  0.44 |  0.56 |
| 20160829 | fulltext     |  0.32 |  0.30 |
| 20160830 | autocomplete |  0.44 |  0.55 |
| 20160830 | fulltext     |  0.31 |  0.28 |
+----------+--------------+-------+-------+

Event Timeline

For the Discovery-Analysis team, let's go ahead and run the query (with the correct timestamps), create a graph, and then add a note to the user-engagement dashboard page.

The addition of PaulScore (once it's done) will show whether things are getting better, though it probably shouldn't carry a higher or greater weight than our other metrics. This could be determined by doing this exercise per wiki (see the sketch below).
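A per-wiki variant would only need the wiki column added to the select and group-by lists; a minimal sketch (assuming the EventLogging capsule's wiki field is present on this table), shown here with just the 0.7 factor:

SELECT date, wiki, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7
  FROM ( SELECT event_searchSessionId,
                wiki,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, wiki, event_source
       ) x
 GROUP BY date, wiki, event_source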

Maybe @TJones can help as well.

> We don't actually know the best values to use for these factors; these are mostly guesses.

It'd be interesting to use a few different factors—we've already got 0.7 and 0.5—and see how closely they track each other. It's possible that there is a reasonable "best" parameter value, but—given our users' preference for the first 3 results—my guess is that they will track each other fairly closely, with one having somewhat exaggerated moves compared to the other from time to time, as in Erik's sample data.

Would it be possible / desirable to add 0.4 (i.e., pow_4) and 0.6 (i.e., pow_6) to the mix, at least to start?

mpopov set the point value for this task to 3.
mpopov moved this task from Needs triage to Current work on the Discovery-Analysis board.

@EBernhardson: is it safe to generalize that PaulScore 0.n is

SUM(IF(event_action = 'click', POW(0.n, event_position), 0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) AS pow_n

in the inner query and then

ROUND(SUM(pow_n)/COUNT(1), 2) as pow_n

in the outer query?

If yes, then we can have 0.1, …, 0.9 (in increments of 0.1) like @TJones suggested.
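A sketch of what that generalized inner query would look like (showing three of the nine factors):

SELECT event_searchSessionId,
       event_source,
       LEFT(MIN(timestamp), 8) as date,
       SUM(IF(event_action = 'click', POW(0.1, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_1,
       SUM(IF(event_action = 'click', POW(0.5, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5,
       SUM(IF(event_action = 'click', POW(0.9, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_9
  FROM TestSearchSatisfaction2_15700292
 WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
   AND event_action IN ('searchResultPage', 'click')
 GROUP BY event_searchSessionId, event_source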

Change 309702 had a related patch set uploaded (by Bearloga):
Add approximate PaulScore daily tracking

https://gerrit.wikimedia.org/r/309702

@mpopov: if I understand your code correctly (it generates N select fields), then that should be fine, yes.

Change 309702 merged by Bearloga:
Add approximate PaulScore daily tracking

https://gerrit.wikimedia.org/r/309702

mpopov removed a project: Patch-For-Review.

Backfilling PaulScores now. We will add it to the dashboard after we're done with some other stuff.

> If yes, then we can have 0.1, …, 0.9 (in increments of 0.1) like @TJones suggested.

Cool! We have intuitions about how they will behave, but it'll be awesome to actually see it. Thanks!

Change 310461 had a related patch set uploaded (by Bearloga):
Add PaulScore approximations

https://gerrit.wikimedia.org/r/310461

Live on beta: http://discovery-beta.wmflabs.org/metrics/#paulscore_approx

@TJones @EBernhardson Okay, this is where I need your help :)

  1. Do you want to keep the labels as "pow_1, …, pow_9" or do you want those renamed?
  2. I need some documentation below the graphs for what the PaulScore is measuring and how to interpret the graphs.
  3. I put the PaulScore into the Desktop set of pages, since it uses TestSearchSatisfaction2 desktop data, and put autocomplete & fulltext together on the same page; let me know if you'd prefer it in its own set, with autocomplete and fulltext on separate pages.

Hey @mpopov,

I'm looking into this. The results are unexpected (they are in reverse order for fulltext and autocomplete!) but consistent with @EBernhardson's example results above. Some seem to have impossibly high scores, too.

I'm going to review the definitions in the original paper and check the math in the SQL and try to figure out what's happening. If everything is legit, then:

  1. pow_1, etc. should be fine as names, and they can be explained in the docs in #2. We can also probably get rid of some of them—maybe we only need 3 of them.
  2. T144243 is to provide better docs, which you can point to. In the meantime I can try to put something together.
  3. Erik and I discussed this briefly earlier today and being all together sounds fine!
  4. Any idea what happened from July 8 to July 15? Any chance there's a data problem?

Hmm. There has to be an error somewhere.

The max score for any given factor is 1/(1 − factor), so for 0.1, the max score (i.e., every result was clicked on for every query) is 1.111..., but the graph shows scores above 2 for autocomplete.
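That ceiling is just the geometric series for zero-based positions:

sum_{k=0}^{∞} factor^k = 1 / (1 − factor)

so the ceilings work out to about 1.11 for 0.1, 2 for 0.5, and 10 for 0.9.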

Is it possible this is driven by having incompletely captured data? If there were not enough searchResultPage events, then the scores would be inflated.

Still thinking....

Okay, thanks to some help from @mpopov, I was able to get at the data and I found the problem. I don't know what it means, but I can see where the math goes off the rails.

For autocomplete, sometimes the event_position value for the click is -1. When this is used as the exponent, the smaller the factor, the bigger the impact (0.1^-1 = 10, while 0.9^-1 ≈ 1.11), hence the inverted order of the lines on the demo dashboard.

I don't know what the -1 is supposed to signify, and whether we should convert it to something else, or just ignore it, or what—but if we treat them correctly things should improve.

Just by dropping them I get more believable numbers—but I'm not sure that's the right thing to do.

+----------+--------------+-------+-------+-------+
| date     | event_source | pow_1 | pow_5 | pow_9 |
+----------+--------------+-------+-------+-------+
| 20160820 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160820 | fulltext     |  0.24 |  0.28 |  0.35 |
| 20160821 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160821 | fulltext     |  0.24 |  0.27 |  0.34 |
| 20160822 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160822 | fulltext     |  0.26 |  0.29 |  0.36 |
| 20160823 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160823 | fulltext     |  0.26 |  0.29 |  0.36 |
| 20160824 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160824 | fulltext     |  0.26 |  0.29 |  0.35 |
| 20160825 | autocomplete |  0.11 |  0.13 |  0.16 |
| 20160825 | fulltext     |  0.25 |  0.29 |  0.36 |
| 20160826 | autocomplete |  0.11 |  0.13 |  0.16 |
| 20160826 | fulltext     |  0.25 |  0.28 |  0.35 |
| 20160827 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160827 | fulltext     |  0.25 |  0.28 |  0.35 |
| 20160828 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160828 | fulltext     |  0.24 |  0.27 |  0.34 |
| 20160829 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160829 | fulltext     |  0.26 |  0.30 |  0.36 |
| 20160830 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160830 | fulltext     |  0.25 |  0.28 |  0.35 |
+----------+--------------+-------+-------+-------+

@mpopov, for #2, I've put together a first draft of a search glossary. Can you get what you need from it, or just point to the PaulScore entry there?

> @mpopov, for #2, I've put together a first draft of a search glossary. Can you get what you need from it, or just point to the PaulScore entry there?

Awesome, thanks for putting that together! I'll try to summarize it on the dashboard and then point to MW for full definition.

Change 311598 had a related patch set uploaded (by Bearloga):
Add PaulScore documentation & relative display, fix formatting

https://gerrit.wikimedia.org/r/311598

Change 311598 merged by Bearloga:
Add PaulScore documentation & relative display, fix formatting

https://gerrit.wikimedia.org/r/311598

Documentation & other changes are live on beta now: http://discovery-beta.wmflabs.org/metrics/#paulscore_approx Let me know if anything there needs to be changed.

Now just waiting to hear back from @EBernhardson about how to fix the calculation for autocomplete searches after he's done with didyoumean satisfaction integration.

I've verified that the position is -1 when the user searches for something that is not an autocomplete result. This is particularly prevalent for the main search bar on Special:Search. I'm thinking we should only count searches/clicks in the autocomplete bar in the header. When the position is -1 it shouldn't be counted as a successful click, for the purposes of this PaulScore.

This SQL query should handle the above. It differs from the previous one by the two additional AND IF(...) statements in the WHERE clause.

SELECT date, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7,
       ROUND(SUM(pow_5)/COUNT(1), 2) as pow_5
  FROM ( SELECT event_searchSessionId,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7,
                SUM(IF(event_action = 'click',
                      POW(0.5, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
            -- only count autocomplete result pages served by the header search bar
            AND IF(event_source = 'autocomplete' AND event_action = 'searchResultPage', event_inputLocation = 'header', TRUE)
            -- skip autocomplete clicks at position -1 (the user searched for text that was not an autocomplete result)
            AND IF(event_source = 'autocomplete' AND event_action = 'click', event_position >= 0, TRUE)
          GROUP BY event_searchSessionId, event_source
       ) x
 GROUP BY date, event_source

This, unfortunately, does still count clicks to the Special:Search autocomplete, as we don't record anything that distinguishes the click events. I'll work up a patch to update our data collection to also record this information for click events.

debt triaged this task as Medium priority. Sep 20 2016, 8:59 PM

Change 313870 had a related patch set uploaded (by Bearloga):
Deploy dashboard updates

https://gerrit.wikimedia.org/r/313870

Hi @EBernhardson - can you let us know what else we'll need to do before the patch for the data collection goes through? Thanks!

Once we are recording the input location of click events, we will need to change:

IF(event_source = 'autocomplete' AND event_action = 'searchResultPage', event_inputLocation = 'header', TRUE)

to

IF(event_source = 'autocomplete', event_inputLocation = 'header', TRUE)
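For reference, the full WHERE clause would then read (a sketch, pending that data-collection patch):

WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
  AND event_action IN ('searchResultPage', 'click')
  -- restrict all autocomplete events, including clicks, to the header search bar
  AND IF(event_source = 'autocomplete', event_inputLocation = 'header', TRUE)
  -- still skip clicks at position -1 (the user searched for text that was not an autocomplete result)
  AND IF(event_source = 'autocomplete' AND event_action = 'click', event_position >= 0, TRUE)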

Change 316490 had a related patch set uploaded (by Bearloga):
[WIP] Fix PaulScore & update TSS2 refs

https://gerrit.wikimedia.org/r/316490

Waiting on the train to run again during the week of Oct 25 for T138087 to be deployed.

Change 316490 merged by Chelsyx:
Fix PaulScore & update TSS2 refs

https://gerrit.wikimedia.org/r/316490