
Run search relevance survey on enwiki and frwiki
Open, Low, Public

Description

Based on the promising prior work in T174106, which collected survey results from users and mapped them onto relevance labels, run a test with new queries whose relevance we do not know. The results of this test will be fed into the model built in the prior task to label these new queries.

This test will be run on enwiki, the wiki the model was built from, and on frwiki, to evaluate how well the model works for languages it was not trained on.

Steps:

  1. Collect a set of 1k queries each for enwiki and frwiki
  2. Have two native language speakers review the lists and remove any queries that look like PII. This doesn't have to be incredibly rigorous, just err on the side of caution.
  3. Intersect the accepted lists from the two native speakers and keep only the queries that both agreed on
  4. For each query, generate the list of pages we will survey about for that query. Combine (union and deduplicate) the results of the following searches to get 30-50 results per query, for a total of ~40k (query, page) pairs (a rough sketch of steps 3-4 follows this list):
    • top 20 MLR scored results
    • top 20 default scored results
    • top 10 completion search results
    • top 10 "high recall" search results, using relaxed filtering instead of our default AND
    • more?
  5. Process the list of queries, roughly as in https://phabricator.wikimedia.org/T174106#3580502, to generate the appropriate PHP arrays to be inserted into mediawiki-config to configure the test, along with a file of titles suitable for passing into the purgeList.php maintenance script.
    • Based on the prior test analysis, we should aim for at least 40 responses for each (query, title) pair, and ideally 70+.
    • At a rate of 5 per (query, title) per day, running the survey for 14 days should give a reasonable number of responses for all but the least-viewed articles. Given our target of 40k (query, page) pairs, this works out to 200k impressions per day and just under 3M impressions for the full survey (worked numbers follow this list).
  6. Partially revert ab40a07a12442b7e3c5426adf33aa4fa242af682 in WikimediaEvents to bring back the human search relevance survey code that was inadvertently deleted.
  7. Merge and test remaining code updates to survey based on learnings of previous test (patches attached to T176428)
  8. Deploy mediawiki-config with survey sampling rates
  9. Split the files containing titles to purge into new files of around 100 titles each, and set up a loop on terbium that purges one file every 5 minutes until we've run through them all (a splitting sketch follows this list).
  10. Verify the sampling rates are being included in the cached article pages
  11. Monitor the data coming in to see that sampling rates are working as intended
  12. ...
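
A minimal sketch (Python; the file names and per-query input layout are assumptions for illustration, not the actual pipeline) of how steps 3-4 could be mechanized: intersect the two reviewers' accepted query lists, then union and deduplicate the four candidate result lists for each surviving query.

```python
def load_accepted_queries(path):
    """One accepted query per line, as returned by one reviewer."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def combine_results(*ranked_lists):
    """Union the ranked lists, deduplicating while preserving order."""
    seen, combined = set(), []
    for ranked in ranked_lists:
        for page in ranked:
            if page not in seen:
                seen.add(page)
                combined.append(page)
    return combined

def build_survey_pairs(reviewer_a_path, reviewer_b_path, candidates):
    """candidates: query -> dict holding the four ranked page-title lists
    (mlr_top20, default_top20, completion_top10, high_recall_top10)."""
    # Step 3: keep only queries both native speakers accepted.
    accepted = (load_accepted_queries(reviewer_a_path)
                & load_accepted_queries(reviewer_b_path))
    # Step 4: combine candidate lists per query into ~30-50 unique pages.
    return {
        query: combine_results(
            lists["mlr_top20"], lists["default_top20"],
            lists["completion_top10"], lists["high_recall_top10"],
        )
        for query, lists in candidates.items()
        if query in accepted
    }
```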
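
The worked numbers behind the sampling targets in step 5, using only the figures stated above (a sanity check, not measured data):

```python
pairs            = 40_000  # (query, page) pairs to survey
per_pair_per_day = 5       # per the description: 5 per (query, title) per day
days             = 14      # planned survey duration

per_pair_total = per_pair_per_day * days   # 70, matching the 70+ target
per_day_total  = pairs * per_pair_per_day  # 200,000 impressions per day
survey_total   = per_day_total * days      # 2,800,000, "just under 3M"
```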
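
One way step 9 could be scripted (a sketch only: the chunk size and 5-minute cadence come from the step, while the paths and the exact purgeList.php invocation are assumptions; in practice this may well be a plain shell loop on terbium):

```python
import subprocess
import time
from pathlib import Path

CHUNK_SIZE = 100        # ~100 titles per file, per step 9
SLEEP_SECONDS = 5 * 60  # purge one file every 5 minutes

def split_titles(src, out_dir, chunk_size=CHUNK_SIZE):
    """Split one big titles file into numbered chunk files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    titles = [t for t in src.read_text(encoding="utf-8").splitlines() if t]
    chunks = []
    for i in range(0, len(titles), chunk_size):
        chunk = out_dir / f"titles_{i // chunk_size:04d}.txt"
        chunk.write_text("\n".join(titles[i:i + chunk_size]) + "\n",
                         encoding="utf-8")
        chunks.append(chunk)
    return chunks

def purge_all(chunks, wiki="enwiki"):
    """Feed one chunk at a time to purgeList.php, pausing between files."""
    for chunk in chunks:
        with chunk.open("rb") as titles:
            # Hypothetical invocation; verify the real purgeList.php
            # arguments before running anything like this in production.
            subprocess.run(["mwscript", "purgeList.php", f"--wiki={wiki}"],
                           stdin=titles, check=True)
        time.sleep(SLEEP_SECONDS)

if __name__ == "__main__":
    purge_all(split_titles(Path("enwiki_titles.txt"), Path("title_chunks")))
```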

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 2 2018, 10:50 PM

Change 401631 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Bring back human search relevance survey

https://gerrit.wikimedia.org/r/401631

@mpopov I wasn't quite sure from https://wikimedia-research.github.io/Discovery-Search-Adhoc-RelevanceSurveys/#responses_required , is 40 to 70 responses the number of impressions (yes+no+dismiss+timeout), the number of clicks (yes+no+dismiss), or the number of yes+no? I think it was yes+no+dismiss, but it might have been yes+no+dismiss+timeout?

Closer reading of the report:

the model is very accurate with at least 40 yes/no/unsure/dismiss responses and the most accurate with at least 70 responses

I think this is saying that we are not counting timeouts here, which means that with a ~30% response rate, to get 70 responses we would need about 210 impressions?
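
A quick back-of-the-envelope version of that, treating the ~30% response rate as an assumption rather than a measured figure:

```python
target_responses   = 70
response_rate      = 0.30                               # assumed, as above
impressions_needed = target_responses / response_rate   # ~233 (~210 at a rate of 1/3)
```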

mpopov added a comment. · Jan 3 2018, 6:56 PM

Correct, it's the number of (non-timeout) responses aka clicks. Your initial guess was right, sorry about missing the comment yesterday.

debt triaged this task as Low priority. · May 1 2018, 5:33 PM
debt edited projects, added Discovery-Search; removed Discovery-Search (Current work).
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Change 401631 abandoned by EBernhardson:
Bring back human search relevance survey

https://gerrit.wikimedia.org/r/401631

Aklapper removed EBernhardson as the assignee of this task. · Jun 19 2020, 4:27 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

CBogen added a subscriber: CBogen. · Thu, Aug 6, 7:37 PM

This has been de-prioritized until WMF has a better solution for large-scale surveys.