
Add a PaulScore approximation to discovery.wmflabs.org
Closed, Resolved · Public · 3 Estimated Story Points

Description

In our weekly search meeting we talked about calculating and graphing an approximate PaulScore of actual user behaviour, as a metric for the quality of search results. I came up with the following SQL, which calculates it for fulltext and autocomplete over the specified number of days. Note that this isn't exactly the same PaulScore we use in RelForge, as we don't have all the data necessary, but the results seem reasonably similar to what we get out of RelForge (0.30 ± 0.02).

We've been using the 0.7^x value in RelForge recently, which gives the 13th position a weight of about 0.01. The 0.5^x variant, also calculated here, puts a little more preference on results appearing higher in the result set, with the 7th result having a weight of about 0.008. We don't actually know the best values to use for these factors; these are mostly guesses.
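To make those weights concrete (using the result's rank as the exponent, as in the figures above):

0.7^13 ≈ 0.0097 ≈ 0.01    (13th result under the 0.7 factor)
0.5^7  = 1/128  ≈ 0.008   (7th result under the 0.5 factor)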

Query

SELECT date, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7,
       ROUND(SUM(pow_5)/COUNT(1), 2) as pow_5
  FROM ( SELECT event_searchSessionId,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7,
                SUM(IF(event_action = 'click',
                      POW(0.5, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, event_source
       ) x
 GROUP BY date, event_source
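As a worked example of the inner query's arithmetic: a hypothetical session with two result pages and clicks at positions 0 and 2 scores pow_5 = (0.5^0 + 0.5^2) / 2 = 0.625. The same calculation as a self-contained toy query (the session events here are made up for illustration):

SELECT SUM(IF(action = 'click', POW(0.5, position), 0))
       / SUM(IF(action = 'searchResultPage', 1, 0)) AS pow_5
  FROM (           SELECT 'searchResultPage' AS action, NULL AS position
         UNION ALL SELECT 'searchResultPage', NULL
         UNION ALL SELECT 'click', 0
         UNION ALL SELECT 'click', 2 ) session_events
-- returns (1 + 0.25) / 2 = 0.625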

Example results for a few days:

+----------+--------------+-------+-------+
| date     | event_source | pow_7 | pow_5 |
+----------+--------------+-------+-------+
| 20160820 | autocomplete |  0.43 |  0.53 |
| 20160820 | fulltext     |  0.30 |  0.28 |
| 20160821 | autocomplete |  0.42 |  0.52 |
| 20160821 | fulltext     |  0.30 |  0.27 |
| 20160822 | autocomplete |  0.44 |  0.55 |
| 20160822 | fulltext     |  0.32 |  0.29 |
| 20160823 | autocomplete |  0.45 |  0.56 |
| 20160823 | fulltext     |  0.32 |  0.29 |
| 20160824 | autocomplete |  0.45 |  0.57 |
| 20160824 | fulltext     |  0.31 |  0.29 |
| 20160825 | autocomplete |  0.44 |  0.56 |
| 20160825 | fulltext     |  0.31 |  0.29 |
| 20160826 | autocomplete |  0.46 |  0.57 |
| 20160826 | fulltext     |  0.31 |  0.28 |
| 20160827 | autocomplete |  0.42 |  0.52 |
| 20160827 | fulltext     |  0.31 |  0.28 |
| 20160828 | autocomplete |  0.42 |  0.53 |
| 20160828 | fulltext     |  0.30 |  0.27 |
| 20160829 | autocomplete |  0.44 |  0.56 |
| 20160829 | fulltext     |  0.32 |  0.30 |
| 20160830 | autocomplete |  0.44 |  0.55 |
| 20160830 | fulltext     |  0.31 |  0.28 |
+----------+--------------+-------+-------+

Event Timeline

For the Discovery-Analysis team, let's go ahead and run the query (with the correct timestamps), create a graph, and then add a note to the user-engagement dashboard page.

The addition of PaulScore (once it's done) will show whether things are getting better, though it probably shouldn't carry a higher or greater weight than our other metrics. This could be determined by doing this exercise per wiki (see the sketch below).
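A per-wiki variant would only need the wiki column added to the select and group-by lists; a minimal sketch (assuming the EventLogging capsule's wiki field is present on this table), shown here with just the 0.7 factor:

SELECT date, wiki, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7
  FROM ( SELECT event_searchSessionId,
                wiki,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
          GROUP BY event_searchSessionId, wiki, event_source
       ) x
 GROUP BY date, wiki, event_source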

Maybe @TJones can help as well.

> We don't actually know the best values to use for these factors; these are mostly guesses.

It'd be interesting to use a few different factors—we've already got 0.7 and 0.5—and see how closely they track each other. It's possible that there is a reasonable "best" parameter value, but—given our users' preference for the first 3 results—my guess is that they will track each other fairly closely, with one having somewhat exaggerated moves compared to the other from time to time, as in Erik's sample data.

Would it be possible / desirable to add 0.4 (i.e., pow_4) and 0.6 (i.e., pow_6) to the mix, at least to start?

mpopov set the point value for this task to 3.
mpopov moved this task from Needs triage to Current work on the Discovery-Analysis board.

@EBernhardson: is it safe to generalize that PaulScore 0.n is

SUM(IF(event_action = 'click', POW(0.n, event_position), 0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) AS pow_n

in the inner query and then

ROUND(SUM(pow_n)/COUNT(1), 2) as pow_n

in the outer query?

If yes, then we can have 0.1, …, 0.9 (in increments of 0.1) like @TJones suggested.
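A sketch of what that generalized inner query would look like (showing three of the nine factors):

SELECT event_searchSessionId,
       event_source,
       LEFT(MIN(timestamp), 8) as date,
       SUM(IF(event_action = 'click', POW(0.1, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_1,
       SUM(IF(event_action = 'click', POW(0.5, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5,
       SUM(IF(event_action = 'click', POW(0.9, event_position), 0))
         / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_9
  FROM TestSearchSatisfaction2_15700292
 WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
   AND event_action IN ('searchResultPage', 'click')
 GROUP BY event_searchSessionId, event_source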

Change 309702 had a related patch set uploaded (by Bearloga):
Add approximate PaulScore daily tracking

https://gerrit.wikimedia.org/r/309702

@mpopov: if I understand your code correctly (it generates N select fields), then that should be fine, yes.

Change 309702 merged by Bearloga:
Add approximate PaulScore daily tracking

https://gerrit.wikimedia.org/r/309702

mpopov removed a project: Patch-For-Review.

Backfilling PaulScores now. We will add it to the dashboard after we're done with some other stuff.

> If yes, then we can have 0.1, …, 0.9 (in increments of 0.1) like @TJones suggested.

Cool! We have intuitions about how they will behave, but it'll be awesome to actually see it. Thanks!

Change 310461 had a related patch set uploaded (by Bearloga):
Add PaulScore approximations

https://gerrit.wikimedia.org/r/310461

Live on beta: http://discovery-beta.wmflabs.org/metrics/#paulscore_approx

@TJones @EBernhardson Okay, this is where I need your help :)

  1. Do you want to keep the labels as "pow_1, …, pow_9" or do you want those renamed?
  2. I need some documentation below the graphs for what the PaulScore is measuring and how to interpret the graphs.
  3. I put the PaulScore into the Desktop set of pages, since it uses TestSearchSatisfaction2 desktop data, and put autocomplete & fulltext together on the same page; let me know if you'd prefer it in its own set, with autocomplete and fulltext on separate pages.

Hey @mpopov,

I'm looking into this. The results are unexpected (they are in reverse order for fulltext and autocomplete!) but consistent with @EBernhardson's example results above. Some seem to have impossibly high scores, too.

I'm going to review the definitions in the original paper and check the math in the SQL and try to figure out what's happening. If everything is legit, then:

  1. pow_1, etc. should be fine as names, and they can be explained in the docs in #2. We can also probably get rid of some of them—maybe we only need 3 of them.
  2. T144243 is to provide better docs, which you can point to. In the meantime I can try to put something together.
  3. Erik and I discussed this briefly earlier today and being all together sounds fine!
  4. Any idea what happened from July 8 to July 15? Any chance there's a data problem?

Hmm. There has to be an error somewhere.

The max score for any given factor is 1/(1 − factor), so for 0.1, the max score (i.e., every result was clicked on for every query) is 1.111..., but the graph shows scores above 2 for autocomplete.
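That ceiling is just the geometric series for zero-based positions:

sum_{k=0}^{∞} factor^k = 1 / (1 − factor)

so the ceilings work out to about 1.11 for 0.1, 2 for 0.5, and 10 for 0.9.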

Is it possible this is driven by having incompletely captured data? If there were not enough searchResultPage events, then the scores would be inflated.

Still thinking....

Okay, thanks to some help from @mpopov, I was able to get at the data and I found the problem. I don't know what it means, but I can see where the math goes off the rails.

For autocomplete, sometimes the event_position value for the click is -1. When this is used as the exponent, the smaller the factor, the bigger the impact (0.1^-1 = 10, while 0.9^-1 ≈ 1.11), hence the inverted order of the lines on the demo dashboard.

I don't know what the -1 is supposed to signify, and whether we should convert it to something else, or just ignore it, or what—but if we treat them correctly things should improve.

Just by dropping them I get more believable numbers—but I'm not sure that's the right thing to do.

+----------+--------------+-------+-------+-------+
| date     | event_source | pow_1 | pow_5 | pow_9 |
+----------+--------------+-------+-------+-------+
| 20160820 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160820 | fulltext     |  0.24 |  0.28 |  0.35 |
| 20160821 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160821 | fulltext     |  0.24 |  0.27 |  0.34 |
| 20160822 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160822 | fulltext     |  0.26 |  0.29 |  0.36 |
| 20160823 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160823 | fulltext     |  0.26 |  0.29 |  0.36 |
| 20160824 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160824 | fulltext     |  0.26 |  0.29 |  0.35 |
| 20160825 | autocomplete |  0.11 |  0.13 |  0.16 |
| 20160825 | fulltext     |  0.25 |  0.29 |  0.36 |
| 20160826 | autocomplete |  0.11 |  0.13 |  0.16 |
| 20160826 | fulltext     |  0.25 |  0.28 |  0.35 |
| 20160827 | autocomplete |  0.12 |  0.14 |  0.17 |
| 20160827 | fulltext     |  0.25 |  0.28 |  0.35 |
| 20160828 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160828 | fulltext     |  0.24 |  0.27 |  0.34 |
| 20160829 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160829 | fulltext     |  0.26 |  0.30 |  0.36 |
| 20160830 | autocomplete |  0.12 |  0.13 |  0.16 |
| 20160830 | fulltext     |  0.25 |  0.28 |  0.35 |
+----------+--------------+-------+-------+-------+

@mpopov, for #2, I've put together a first draft of a search glossary. Can you get what you need from it, or just point to the PaulScore entry there?

> @mpopov, for #2, I've put together a first draft of a search glossary. Can you get what you need from it, or just point to the PaulScore entry there?

Awesome, thanks for putting that together! I'll try to summarize it on the dashboard and then point to MW for full definition.

Change 311598 had a related patch set uploaded (by Bearloga):
Add PaulScore documentation & relative display, fix formatting

https://gerrit.wikimedia.org/r/311598

Change 311598 merged by Bearloga:
Add PaulScore documentation & relative display, fix formatting

https://gerrit.wikimedia.org/r/311598

Documentation & other changes are live on beta now: http://discovery-beta.wmflabs.org/metrics/#paulscore_approx Let me know if anything there needs to be changed.

Now just waiting to hear back from @EBernhardson about how to fix the calculation for autocomplete searches after he's done with didyoumean satisfaction integration.

I've verified that the position is -1 when the user searches for something that is not an autocomplete result. This is particularly prevalent for the main search bar on Special:Search. I'm thinking we should only count searches/clicks in the autocomplete bar in the header. When the position is -1 it shouldn't be counted as a successful click, for the purposes of this PaulScore.

This SQL query should handle the above. It differs from the previous one by the two additional AND IF(...) statements in the WHERE clause.

SELECT date, event_source,
       ROUND(SUM(pow_7)/COUNT(1), 2) as pow_7,
       ROUND(SUM(pow_5)/COUNT(1), 2) as pow_5
  FROM ( SELECT event_searchSessionId,
                event_source,
                LEFT(MIN(timestamp), 8) as date,
                SUM(IF(event_action = 'click',
                      POW(0.7, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_7,
                SUM(IF(event_action = 'click',
                      POW(0.5, event_position),
                      0)) / SUM(IF(event_action = 'searchResultPage', 1, 0)) as pow_5
           FROM TestSearchSatisfaction2_15700292
          WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
            AND event_action IN ('searchResultPage', 'click')
            -- only count autocomplete result pages served by the header search bar
            AND IF(event_source = 'autocomplete' AND event_action = 'searchResultPage', event_inputLocation = 'header', TRUE)
            -- skip autocomplete clicks at position -1 (the user searched for text that was not an autocomplete result)
            AND IF(event_source = 'autocomplete' AND event_action = 'click', event_position >= 0, TRUE)
          GROUP BY event_searchSessionId, event_source
       ) x
 GROUP BY date, event_source

This, unfortunately, does still count clicks to the Special:Search autocomplete, as we don't record anything that distinguishes the click events. I'll work up a patch to update our data collection to also record this information for click events.

debt triaged this task as Medium priority. Sep 20 2016, 8:59 PM

Change 313870 had a related patch set uploaded (by Bearloga):
Deploy dashboard updates

https://gerrit.wikimedia.org/r/313870

Hi @EBernhardson - can you let us know what else we'll need to do before the patch for the data collection goes through? Thanks!

Once we are recording the input location of click events, we will need to change:

IF(event_source = 'autocomplete' AND event_action = 'searchResultPage', event_inputLocation = 'header', TRUE)

to

IF(event_source = 'autocomplete', event_inputLocation = 'header', TRUE)
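For reference, the full WHERE clause would then read (a sketch, pending that data-collection patch):

WHERE timestamp BETWEEN '20160820000000' AND '20160831000000'
  AND event_action IN ('searchResultPage', 'click')
  -- restrict all autocomplete events, including clicks, to the header search bar
  AND IF(event_source = 'autocomplete', event_inputLocation = 'header', TRUE)
  -- still skip clicks at position -1 (the user searched for text that was not an autocomplete result)
  AND IF(event_source = 'autocomplete' AND event_action = 'click', event_position >= 0, TRUE)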

Change 316490 had a related patch set uploaded (by Bearloga):
[WIP] Fix PaulScore & update TSS2 refs

https://gerrit.wikimedia.org/r/316490

Waiting on the train to run again during the week of Oct 25 for T138087 to be deployed.

Change 316490 merged by Chelsyx:
Fix PaulScore & update TSS2 refs

https://gerrit.wikimedia.org/r/316490