
Test MLR models for zhwiki, jawiki and kowiki
Open, MediumPublic


zhwiki and jawiki because of T219533 and kowiki because of T216738

Per @EBernhardson, new models are automatically built and shipped to prod; they just need to be tested.

Testing consists of running A/B tests.

First, deploy a config patch (see example here); this will wrap the config object with our custom values. The new models also need to be added to wgCirrusSearchMLRModel in that same InitialiseSettings.php file.
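For illustration, a minimal sketch of what the InitialiseSettings.php addition might look like. The per-wiki array with a 'default' key follows the usual wmf-config convention, but the model identifiers below are placeholders, not the actual automatically built models:

```php
// Hypothetical sketch only: the real model identifiers come from the
// automatically built and shipped models, not these placeholder names.
'wgCirrusSearchMLRModel' => [
	'default' => false,
	'zhwiki' => 'zhwiki-placeholder-model',
	'jawiki' => 'jawiki-placeholder-model',
	'kowiki' => 'kowiki-placeholder-model',
],
```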

Then, turn on the A/B tests; example here:

Related Objects

Event Timeline

Restricted Application added subscribers: Stang, revi, Aklapper.
Mstyles updated the task description.
Mstyles added a subscriber: Gehel.
Mstyles renamed this task from Train MLR models for zhwiki, jawiki and kowiki to Test MLR models for zhwiki, jawiki and kowiki. Dec 18 2019, 5:23 PM

Change 559614 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[operations/mediawiki-config@master] Add new MLR models

Change 572063 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/WikimediaEvents@master] Deploy MLR2020 test

Change 559614 merged by jenkins-bot:
[operations/mediawiki-config@master] Add new MLR models

Mentioned in SAL (#wikimedia-operations) [2020-02-13T19:17:28Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T219534 Add new MLR models for Cirrus on zh/ja/kowiki (duration: 01m 03s)

Config changes have been deployed, but due to a configuration conflict (jawiki uses the default for wgCirrusSearchFullTextQueryBuilderProfile), some further changes will have to be made to have the model name show up.

Reviewing the history, I think the primary concern related to bm25 and spaceless languages was:

@dcausse predicted the ZRR would go down because, if I recall correctly, we aren't searching for phrases anymore, so any page that randomly has all the same characters as the query in it will be returned.

Essentially, in spaceless languages without a good analysis pipeline we fall back to character n-grams, and search is only reasonable using phrase search. We previously evaluated a language-specific analysis chain for Japanese, but the community review of that (I think it was done by putting an index with the new configuration on relforge and letting users query it from an MW instance in cloud?) didn't find it to be an improvement, so we are still on 2-grams.
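To illustrate why phrase search matters with a 2-gram fallback, here is a small sketch (not the actual Elasticsearch analyzer, just an approximation of what a character 2-gram tokenizer emits). A page can contain every bigram of the query without containing the query as a phrase, so bag-of-ngrams matching without a phrase constraint returns pages that merely share characters:

```python
def char_ngrams(text, n=2):
    """Character n-grams, roughly what a 2-gram analyzer emits for spaceless text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

query = "東京都"            # intended as one phrase ("Tokyo Metropolis")
page = "京都から東京へ"     # "from Kyoto to Tokyo": no phrase match

q_grams = char_ngrams(query)            # ['東京', '京都']
p_grams = set(char_ngrams(page))

# Every query bigram appears in the page, so a non-phrase query matches
# fully even though the page never contains the phrase "東京都".
all_grams_match = all(g in p_grams for g in q_grams)   # True
phrase_match = query in page                           # False
```

This is the behavior behind the ZRR prediction above: dropping the phrase constraint surfaces more pages, but many of them only coincidentally share the query's characters.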

Additionally, we cannot switch between tf/idf and BM25 with a simple query-time selection. While we do have a custom hack in the search-extra plugin that does exactly that, we've only used it at development time and never in production. The official method to switch is to change the analysis pipeline and re-index the wiki. This means A/B testing the analysis pipeline requires reindexing one of the two search clusters and then splitting A/B traffic between them. Fake latency also has to be injected on the eqiad side to account for round trips between the app servers and the codfw cluster in the other bucket.

That all sounds like a bunch of additional work that is probably not justified. The remaining options are either to adjust the old query builder to be able to generate LTR queries, or to drop jawiki from the A/B test. I haven't had a chance to find the exact code, but I suspect the old query builder not generating LTR queries is a bug of some sort; might be worth looking into.

Testing the Japanese analyzer (Kuromoji) was hard because we weren't set up to test tokenization, and there isn't necessarily a canonical tokenization for some sentences, so my automated testing corpus was not 100% reliable. Using RelForge was also hard because throwing people at a blank search box is not the best search evaluation plan. My write-up is here.

It is possible that Kuromoji has improved in the last 2.5 years, and I'm sure my ability to test it has improved. Korean went much better, and it was easier to identify and even fix some of the problems (by unpacking and customizing the analyzer). I've wanted to look at Kuromoji again for a while, but just hadn't gotten around to it. We could think about doing it in Q4 or Q1 and look at this again after that.

On another note, it would be possible to use RelForge to set up a BM25 index of jawiki and do a comparison of the impact it has on results, even if you can't assess the quality of the change (and it might be possible to get a human to review some of the changes). It may still not be worth the effort, though.


Talked about this today. Short term: Investigate why the classic query building isn't generating an LTR query, it almost certainly should. If that isn't fruitful we should ship the test without jawiki. Longer term: Elasticsearch is deprecating (turning into a noop) the phrase query we use here in 6.8, and removing it in 7.x. We need to re-evaluate the kuromoji analysis chain and hopefully move ja onto a proper analysis chain before upgrading to 6.8. The other spaceless languages are probably not big enough to need specific support, we can perhaps use shingles since the text content is minimal (outside jawiki).

@dcausse looked into the rescore building: RescoreBuilder::isProfileSyntaxSupported is rejecting the profile containing the LTR query because the query_string search query syntax is reported. Essentially this says the search query needs further parsing on the Elasticsearch side, and since our LTR query can't be modified to apply that parsing, it is rejected outright. Most likely we could add a regex to check how simple the query string is and allow through all queries that are plain search strings containing no syntax that would be handled by the query_string query.
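A hypothetical sketch of what such a check could look like (the real check would live in CirrusSearch's PHP RescoreBuilder, and this operator list is illustrative, not exhaustive): reject any query containing characters or keywords that query_string would interpret as syntax, and allow the LTR profile only for plain search strings.

```python
import re

# Characters and keyword operators with special meaning to Elasticsearch's
# query_string syntax. Hypothetical sketch of the allow-list check; the
# production list would need to match what query_string actually parses.
QUERY_STRING_SYNTAX = re.compile(r'[+\-=&|><!(){}\[\]^"~*?:\\/]|\b(AND|OR|NOT)\b')

def is_simple_query(query: str) -> bool:
    """True if the query looks like a plain search string with no
    query_string syntax, so the LTR rescore profile could be allowed."""
    return QUERY_STRING_SYNTAX.search(query) is None
```

Note this is deliberately conservative: a hyphenated word like "foo-bar" would be rejected too, which only means that query falls back to the non-LTR path, not that it fails.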

Instead of filtering the query_string queries, we want to move spaceless languages off of query_string and onto the full text simple match query builder. This will help when we upgrade Elasticsearch and no longer use query strings. To make this move, the FTSM query builder has to be tested in relforge for Japanese. There's currently an upgrade going on with relforge, so this task will be paused until the python upgrade for relforge is complete.

Change 595019 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[operations/mediawiki-config@master] Update ML models

Change 595019 merged by jenkins-bot:
[operations/mediawiki-config@master] Update ML models

Mstyles triaged this task as Medium priority.
Mstyles added a project: Discovery-Search.
Mstyles subscribed.

Change 572063 abandoned by Mstyles:
[mediawiki/extensions/WikimediaEvents@master] Deploy MLR2020 test