|mediawiki/extensions/WikimediaEvents : wmf/1.28.0-wmf.1||Add textcat subtest|
|mediawiki/extensions/CirrusSearch : wmf/1.28.0-wmf.1||Adjust textcat data collection for AB test|
|operations/mediawiki-config : master||A/B/C test of control vs textcat vs accept-lang + textcat|
|mediawiki/extensions/WikimediaEvents : master||Add textcat subtest|
|mediawiki/extensions/CirrusSearch : master||Adjust textcat data collection for AB test|
|mediawiki/extensions/CirrusSearch : master||Allowing triggering user tests from query parameter|
- Mentioned In
  - T134319: Turn off TextCat A/B test on the English Wikipedia on or after May 23
  - T134318: Verify data pipeline for TextCat A/B test on English Wikipedia
  - T132706: Validate click events in TestSearchSatisfaction2
  - T121543: Do an A/B Tests on Other Wikis with TextCat for Language Identification
- Mentioned Here
  - T132706: Validate click events in TestSearchSatisfaction2
  - T121539: Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
  - T123537: Generate wikitext-based and query-based language models for TextCat
  - T118287: Run test with different library for detection language through the relevance lab, to decide how promising it is to invest further
  - T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search
  - T121540: Investigate Updating Cybozu / ES Plugin for Language Identification
I was looking at this after a comment Stas made about Italian, and I realized that the set of languages currently in LM-query under TextCat is not the ideal one for this test.
Portuguese and Japanese are missing, which is a minor issue because there are not many Portuguese or Japanese queries on enwiki. Hebrew, Armenian, Georgian, Tamil, and Telugu are present; they won't do much, but they won't hurt.
However, French and German are present, and they both tend to get many more false positives than true positives. Not a ton, but they will bring the overall performance down on enwiki.
As I understand it, the PHP version of TextCat doesn't yet have the ability to specify/limit languages, other than by what's in the requested directory.
Do we want to patch LM-query/ before the A/B test?
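Since the only way to limit languages right now is by what's in the directory, one option is to build a trimmed copy of the model directory and point the test at that. A rough sketch of the idea in Python, assuming one model file per language named `<code>.lm` (an assumption about the layout, not checked against the PHP port); the function name and the output directory are made up for illustration:

```python
import shutil
from pathlib import Path


def build_restricted_model_dir(src_dir, dst_dir, exclude):
    """Copy every <code>.lm model except the excluded languages.

    Assumes one model file per language, named by language code.
    Returns the language codes that were kept, in sorted order.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    kept = []
    for model in sorted(src.glob("*.lm")):
        if model.stem in exclude:
            continue  # e.g. languages with more false than true positives
        shutil.copy2(model, dst / model.name)
        kept.append(model.stem)
    return kept
```

For the enwiki case this would be called with something like `exclude={"fr", "de"}`; adding the missing Portuguese and Japanese models would still need separate model files.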
The primary outstanding question for this task is how to measure the effectiveness of the test. This was discussed briefly in a sprint planning meeting today, but @EBernhardson and @mpopov didn't come to a conclusion. @mpopov will schedule a meeting to discuss this. Marking as stalled until that's done.
Erik, Trey, David, Kevin, and I met this morning to discuss how we're going to handle data collection for the upcoming TextCat test. A big problem in this particular case is that the system wasn't designed/engineered in a way that's conducive to cross-wiki logging / session tracking. And recently we even lost the ability to use the referrer info to see which page the user came from when they go from one wiki to another. (I was told this was done for user privacy reasons.)
Erik said he had recently implemented a click event in the TestSearchSatisfaction2 schema that we might be able to hook into to measure clickthrough rate for users who are eligible for TextCat language detection & get shown results in the language their non-English query is probably written in. Whether we use this, and how much we rely on this particular method of measuring whether TextCat is successful (beyond just measuring how it impacts the zero results rate), depends on the validation of the click events and how they compare to page visit events (which cannot be fired in an interwiki context).
We also discussed an alternative approach that uses web requests, with the caveat that if a user is selected for the test once, they'll be selected every time. So if a particular IP+UA combination is part of the test and performs 2 million searches (as is sometimes the case), then we'll have to do some very careful filtering, which will also exclude some completely valid use cases (a computer lab in a school or a country with only 2 public IP addresses). But we're shooting for being able to use TestSearchSatisfaction2.
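To make the filtering caveat concrete, here is a minimal sketch of what a volume cutoff on IP+UA might look like. The field names (`ip`, `user_agent`) and the threshold are hypothetical, not taken from the actual web request logs:

```python
from collections import Counter


def drop_high_volume_clients(requests, max_searches):
    """Drop all requests from IP+UA pairs with more than max_searches requests.

    `requests` is a list of dicts with hypothetical "ip" and "user_agent" keys.
    """
    counts = Counter((r["ip"], r["user_agent"]) for r in requests)
    return [
        r for r in requests
        if counts[(r["ip"], r["user_agent"])] <= max_searches
    ]
```

Note that a cutoff like this can't tell a bot from a school computer lab behind one IP, which is exactly the problem described above.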
Will add validation of click events in TestSearchSatisfaction2 (T132706) as a blocker.
After discussion, the way we are running this test has changed slightly. The patch provided above does a backend-only test, which doesn't include as much data as we would like to analyze. Will re-work to run a test using our frontend search satisfaction schema.
@mpopov What additional metrics should we collect into the satisfaction schema for users in the textcat test?
Some I'm thinking might be useful:
- # of interwiki results provided
- boolean indicating if click event was interwiki or not
These might be unnecessary though; we are mostly just looking at whether they click through more or not?
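For reference, a rough sketch of how clickthrough per bucket could be computed if the click events carried an interwiki flag. The event shape here (`bucket`, `session`, `action`, `interwiki`) is an assumption for illustration and not the actual TestSearchSatisfaction2 field list:

```python
from collections import defaultdict


def clickthrough_by_bucket(events):
    """Per-bucket share of search sessions with any click / an interwiki click.

    Each event is a dict with hypothetical keys: "bucket", "session",
    "action" ("searchResultPage" or "click") and, on clicks, "interwiki".
    """
    searched = defaultdict(set)
    clicked = defaultdict(set)
    interwiki_clicked = defaultdict(set)
    for e in events:
        if e["action"] == "searchResultPage":
            searched[e["bucket"]].add(e["session"])
        elif e["action"] == "click":
            clicked[e["bucket"]].add(e["session"])
            if e.get("interwiki"):
                interwiki_clicked[e["bucket"]].add(e["session"])

    result = {}
    for bucket, sessions in searched.items():
        n = len(sessions)
        result[bucket] = {
            "sessions": n,
            # only count clicks from sessions that also logged a search
            "ctr": len(clicked[bucket] & sessions) / n,
            "interwiki_ctr": len(interwiki_clicked[bucket] & sessions) / n,
        }
    return result
```

With the boolean on the click event, the overall clickthrough and the interwiki share both fall out of the same pass; the count of interwiki results provided would only be needed to condition on whether any were shown.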