Building on the promising prior work (T174106), which collected survey results from users and mapped them into relevance labels, run a test with new queries whose relevance we do not know. The results will be fed into the model generated by the prior task to label these new queries.
This test will be run on enwiki, from which the model was built, and on frwiki to evaluate how well the model works for a language it was not trained on.
- Collect a set of 1k queries each for enwiki and frwiki
- Have two native language speakers review the lists and remove any queries that look like PII. This doesn't have to be incredibly rigorous, just err on the side of caution.
- Intersect the accepted lists from the two native speakers and keep only the queries that both agreed on
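The reviewer-reconciliation step above is a plain set intersection; a minimal sketch, assuming each reviewer's accepted queries live in a text file with one query per line (file handling and names are illustrative, not part of the actual pipeline):

```python
# Keep only the queries that both native-speaker reviewers accepted.
# Assumes one accepted query per line; blank lines are ignored.

def load_queries(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def accepted_intersection(path_a, path_b):
    # Set intersection keeps only queries present in both accepted lists.
    return sorted(load_queries(path_a) & load_queries(path_b))
```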
- For each query, generate a list of pages we will survey about for that query. Take the deduplicated union of the results from the following searches to get 30-50 results per query, for a total of ~40k (query, page) pairs:
- top 20 MLR scored results
- top 20 default scored results
- top 10 completion search results
- top 10 "high recall" search results, using relaxed filtering instead of our default AND
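Since the four sources above contribute at most 20+20+10+10 = 60 results and the target is 30-50 distinct pages per query, the merge is best read as a deduplicated union rather than a strict intersection. A sketch of a rank-preserving merge, under the assumption that each source is an ordered list of page titles (variable names are illustrative):

```python
def merge_result_lists(*ranked_lists):
    """Union the ranked result lists, keeping first-seen order
    and dropping duplicate pages across sources."""
    seen = set()
    merged = []
    for results in ranked_lists:
        for page in results:
            if page not in seen:
                seen.add(page)
                merged.append(page)
    return merged
```

A call like `merge_result_lists(mlr_top20, default_top20, completion_top10, recall_top10)` (hypothetical names for the four lists above) would then yield the per-query survey set.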
- Process the list of queries, roughly like https://phabricator.wikimedia.org/T174106#3580502, to generate the appropriate PHP arrays that will be inserted into mediawiki-config to configure the test, along with a file of titles suitable for passing to the purgeList.php maintenance script.
- Based on the prior test analysis, we should aim for at least 40 responses for each (query, title) pair, and ideally 70+.
- Running the survey for 14 days at the observed response rate of 5 per (query, title) per day should allow us to get a reasonable number of responses for all but the least-viewed articles. Given our target of 40k (query, page) pairs, this works out to 200k impressions per day, and just under 3M impressions for the full survey.
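The impression math above can be checked directly; a small sketch of the arithmetic (the 5-responses-per-day rate comes from the prior test analysis cited above):

```python
PAIRS = 40_000                 # target (query, page) pairs
RESPONSES_PER_PAIR_DAY = 5     # observed rate from the prior test
DAYS = 14                      # planned survey duration

impressions_per_day = PAIRS * RESPONSES_PER_PAIR_DAY   # 200,000 per day
total_impressions = impressions_per_day * DAYS         # 2,800,000 (~3M)
responses_per_pair = RESPONSES_PER_PAIR_DAY * DAYS     # 70, the ideal target
```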
- Partially revert ab40a07a12442b7e3c5426adf33aa4fa242af682 in WikimediaEvents to bring back the human search relevance survey code that was inadvertently deleted.
- Merge and test remaining code updates to survey based on learnings of previous test (patches attached to T176428)
- Deploy mediawiki-config with survey sampling rates
- Split the files containing titles to purge into new files with around 100 titles per file, and set up a loop on terbium to purge one file every 5 minutes until we've run through them all.
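The splitting step above could be sketched as follows; the ~100-title chunk size is from the task, while the output file naming is hypothetical:

```python
import os

def split_titles(titles_path, out_dir, chunk_size=100):
    """Split a file of titles (one per line) into numbered files of roughly
    chunk_size titles each, suitable for feeding to purgeList.php one at a time."""
    os.makedirs(out_dir, exist_ok=True)
    with open(titles_path, encoding="utf-8") as f:
        titles = [line.rstrip("\n") for line in f if line.strip()]
    paths = []
    for i in range(0, len(titles), chunk_size):
        out_path = os.path.join(out_dir, f"titles_{i // chunk_size:04d}.txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n".join(titles[i:i + chunk_size]) + "\n")
        paths.append(out_path)
    return paths
```

The 5-minute loop on terbium would then iterate over the returned files, running the purge for one file and sleeping roughly 300 seconds between iterations; the exact purgeList.php invocation is deployment-specific and not shown here.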
- Verify the sampling rates are being included in the cached article pages
- Monitor the data coming in to see that sampling rates are working as intended