
Test the ICU tokenizer on ja, th and zh with relcomp
Closed, Resolved (Public)

Description

The results of the second BM25 A/B test (T147500) showed that the Zero Results Rate (ZRR) dropped below 10% for ja and zh, which is well under our 20% average.
We should measure whether ICU tokenization brings the ZRR for these languages closer to that 20% average.
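
For readers unfamiliar with the metric, here is a minimal sketch of how a Zero Results Rate and its change between two runs could be computed. The function name and the hit counts are made up for illustration; this is not relcomp's actual code.

```python
def zero_results_rate(total_hits):
    """Fraction of queries that returned no results at all."""
    if not total_hits:
        raise ValueError("need at least one query")
    zero = sum(1 for hits in total_hits if hits == 0)
    return zero / len(total_hits)


# Hypothetical hit counts for the same five queries in two runs.
baseline_hits = [0, 12, 0, 3, 0]
candidate_hits = [4, 15, 0, 3, 7]

baseline_zrr = zero_results_rate(baseline_hits)
candidate_zrr = zero_results_rate(candidate_hits)

# The reports in this task show the change in percentage points, e.g. "(-20.0%)".
delta_points = (candidate_zrr - baseline_zrr) * 100
print(f"ZRR: {candidate_zrr:.1%} ({delta_points:+.1f}%)")
```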

Event Timeline

The test was not very promising: ZRR dropped dramatically even with the icu_tokenizer. This suggests that the tokenizer is not smart enough to properly tokenize Chinese and Japanese.

For reference, here is a Lucene issue that describes the problem we are having with spaceless languages and explains why the QueryString option we use (auto_generate_phrase_queries) was added: https://issues.apache.org/jira/browse/LUCENE-2458. Most Lucene devs tend to agree that this option is bad for spaceless languages. On the other hand, not using it exposes bad tokenization behaviors directly to the user. Using auto_generate_phrase_queries=true by default is certainly bad, but it hides those bad behaviors at the cost of very low recall.
The conservative approach would be not to drop this feature until we find proper tokenizers for Chinese and Japanese. Thai seems OK but it probably needs more testing.
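
To make the trade-off concrete, here is a toy model of the two matching behaviors. It is only a sketch: `unigrams` is a crude stand-in for a tokenizer that splits spaceless text into single characters, and the matching functions are simplified illustrations, not Lucene's query implementation. The example strings are invented.

```python
def unigrams(text):
    """Crude stand-in tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def matches_as_phrase(query_tokens, doc_tokens):
    """True if the query tokens appear contiguously and in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def matches_as_terms(query_tokens, doc_tokens):
    """True if every query token appears somewhere in the document."""
    return set(query_tokens) <= set(doc_tokens)


query = unigrams("東京天気")
docs = {
    "exact run of characters": unigrams("今日の東京天気予報"),
    "relevant, but not contiguous": unigrams("東京都の天気"),
    "same characters, unrelated": unigrams("京都の東は天気がよい"),
}

for label, tokens in docs.items():
    print(label,
          "| phrase:", matches_as_phrase(query, tokens),
          "| terms:", matches_as_terms(query, tokens))
```

In this toy model the phrase behavior misses the second document (low recall), while the term behavior also returns the third one, which only shares individual characters with the query.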

The ICU tokenizer is probably a better default than Standard for languages where we do not have any specific analysis chains, and I tend to think that it's still worth merging the patch chain concerning ICU.
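
As a rough illustration of what such a default could look like, the sketch below builds Elasticsearch index settings for a custom analyzer that uses `icu_tokenizer` (provided by the analysis-icu plugin). The analyzer name `text_icu` and the single lowercase filter are assumptions for the example; this is not the configuration from the patch chain.

```python
import json

# Hypothetical index settings; "text_icu" is an illustrative analyzer name.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "text_icu": {
                    "type": "custom",
                    # Requires the analysis-icu plugin to be installed.
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```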

Results:

thai

Comparison run summary: ./relevance/comparisons/th-prod_th-icu-tokenizer/

Stats: 1591 query pairs compared
Baseline: ./relevance/queries/th-prod/results

Metrics:

Query Count: 1591
Zero Results Rate: 48.6%
Poorly Performing Percentage: 57.4%
Top 3 Sorted Results Differ: 70.8%
Top 3 Unsorted Results Differ: 67.9%
Top 5 Sorted Results Differ: 71.9%
Top 5 Unsorted Results Differ: 69.8%
Top 20 Sorted Results Differ: 72.0%
Top 20 Unsorted Results Differ: 69.8%

Delta: ./relevance/queries/th-icu-tokenizer/results

Metrics:

Query Count: 1591
   Num TotalHits Changed: μ: 502.49; σ: 2383.68; median: 9.00; range: [-36, 37070]
   Pct TotalHits Changed: μ: 18161.5%; σ: 134476.7%; median: 88.5%; range: [-12.50%, 2242400.00%]
                     
Zero Results Rate: 24.4% (-24.3%)
Poorly Performing Percentage: 29.5% (-27.8%)
Top 3 Sorted Results Differ: 70.8%
Top 3 Unsorted Results Differ: 67.9%
Top 5 Sorted Results Differ: 71.9%
Top 5 Unsorted Results Differ: 69.8%
   Num Top 5 Results Changed: μ: 2.54; σ: 2.04; median: 3.00; range: [0, 5]
   Pct Top 5 Results Changed: μ: 155.0%; σ: 196.0%; median: 60.0%; range: [0.00%, 500.00%]

Top 20 Sorted Results Differ: 72.0%
Top 20 Unsorted Results Differ: 69.8%
   Num Top 20 Results Changed: μ: 4.85; σ: 4.06; median: 5.00; range: [0, 10]
   Pct Top 20 Results Changed: μ: 273.0%; σ: 389.6%; median: 60.0%; range: [0.00%, 1000.00%]
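
A note on reading the "Top N Sorted/Unsorted Results Differ" lines: the sketch below assumes that "Sorted" compares the top N results in order and "Unsorted" compares them as sets. That reading is an assumption, not taken from relcomp's source, although it is consistent with the Sorted figure never being lower than the Unsorted one in these reports. Function names and page titles are made up.

```python
def top_n_differs(baseline, candidate, n, ordered=True):
    """Compare the first n results of two ranked lists."""
    a, b = baseline[:n], candidate[:n]
    if ordered:
        return a != b          # any change in order or membership counts
    return set(a) != set(b)    # only changes in membership count


def differ_rate(pairs, n, ordered=True):
    """Share of query pairs whose top-n results differ."""
    return sum(top_n_differs(a, b, n, ordered) for a, b in pairs) / len(pairs)


# Hypothetical (baseline, candidate) result lists for four queries.
pairs = [
    (["A", "B", "C"], ["A", "B", "C"]),       # identical
    (["A", "B", "C"], ["B", "A", "C"]),       # reordered only
    (["A", "B", "C"], ["A", "B", "D"]),       # one result replaced
    (["A", "B", "C"], ["A", "B", "C", "D"]),  # differs only beyond the top 3
]

print("Top 3 Sorted Results Differ:", f"{differ_rate(pairs, 3, True):.1%}")
print("Top 3 Unsorted Results Differ:", f"{differ_rate(pairs, 3, False):.1%}")
```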

japanese
Comparison run summary: ./relevance/comparisons/ja-prod_ja-icu-tokenizer/

Stats: 1997 query pairs compared
Baseline: ./relevance/queries/ja-prod/results

Metrics:

Query Count: 1997
Zero Results Rate: 26.5%
Poorly Performing Percentage: 38.2%
Top 3 Sorted Results Differ: 87.8%
Top 3 Unsorted Results Differ: 83.8%
Top 5 Sorted Results Differ: 90.7%
Top 5 Unsorted Results Differ: 86.9%
Top 20 Sorted Results Differ: 91.1%
Top 20 Unsorted Results Differ: 87.9%

Delta: ./relevance/queries/ja-icu-tokenizer/results

Metrics:

Query Count: 1997
   Num TotalHits Changed: μ: 3368.05; σ: 29370.41; median: 87.00; range: [-284856, 723679]
   Pct TotalHits Changed: μ: 59842.3%; σ: 498086.4%; median: 667.0%; range: [-93.33%, 14782625.00%]

Zero Results Rate: 6.5% (-20.0%)
Poorly Performing Percentage: 10.6% (-27.6%)
Top 3 Sorted Results Differ: 87.8%
Top 3 Unsorted Results Differ: 83.8%
Top 5 Sorted Results Differ: 90.7%
Top 5 Unsorted Results Differ: 86.9%
   Num Top 5 Results Changed: μ: 2.91; σ: 1.75; median: 3.00; range: [0, 5]
   Pct Top 5 Results Changed: μ: 157.5%; σ: 186.1%; median: 60.0%; range: [0.00%, 500.00%]

Top 20 Sorted Results Differ: 91.1%
Top 20 Unsorted Results Differ: 87.9%
   Num Top 20 Results Changed: μ: 5.74; σ: 3.46; median: 6.00; range: [0, 10]
   Pct Top 20 Results Changed: μ: 281.0%; σ: 380.3%; median: 70.0%; range: [0.00%, 1000.00%]

chinese
Comparison run summary: ./relevance/comparisons/zh-prod_icu-tokenizer//

Stats: 9998 query pairs compared
Baseline: ./relevance/queries/zh-prod/results

Metrics:

Query Count: 9998
Zero Results Rate: 30.3%
Poorly Performing Percentage: 43.6%
Top 3 Sorted Results Differ: 82.2%
Top 3 Unsorted Results Differ: 78.5%
Top 5 Sorted Results Differ: 84.1%
Top 5 Unsorted Results Differ: 80.2%
Top 20 Sorted Results Differ: 84.2%
Top 20 Unsorted Results Differ: 79.3%

Delta: ./relevance/queries/icu-tokenizer/results

Metrics:

Query Count: 9998
   Num TotalHits Changed: μ: 804.52; σ: 11266.99; median: 14.00; range: [-618270, 148658]
   Pct TotalHits Changed: μ: 24285.0%; σ: 151611.1%; median: 178.5%; range: [-100.00%, 5994100.00%]

Zero Results Rate: 10.6% (-19.7%)
Poorly Performing Percentage: 17.2% (-26.4%)
Top 3 Sorted Results Differ: 82.2%
Top 3 Unsorted Results Differ: 78.5%
Top 5 Sorted Results Differ: 84.1%
Top 5 Unsorted Results Differ: 80.2%
   Num Top 5 Results Changed: μ: 2.81; σ: 1.89; median: 3.00; range: [0, 5]
   Pct Top 5 Results Changed: μ: 153.2%; σ: 187.5%; median: 60.0%; range: [0.00%, 500.00%]

Top 20 Sorted Results Differ: 84.2%
Top 20 Unsorted Results Differ: 79.3%
   Num Top 20 Results Changed: μ: 5.43; σ: 3.85; median: 6.00; range: [0, 10]
   Pct Top 20 Results Changed: μ: 270.1%; σ: 381.3%; median: 70.0%; range: [0.00%, 1000.00%]

I tried jieba. tl;dr: ZRR dropped to 17.5%, which sounds a bit more sane.

raw data: stat1002.eqiad.wmnet:~dcausse/jieba_results.tar.gz

Note: sadly, jieba (the Elasticsearch integration of the Java implementation) does not seem production-ready. I had to hack the tokenizer to make it emit tokens that Lucene will accept (tokens were emitted out of order, which is not allowed when you store offsets).
If it's a valid alternative, it will certainly require some non-negligible work to integrate properly into our task: properly handle position increments and offsets, make sure highlighting is consistent, check perf and memory usage, etc.
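
To illustrate the ordering constraint, here is a small sketch that detects tokens whose offsets go backwards and reorders them by start offset. It is not the actual hack applied to the jieba plugin; the `Token` type, the function names and the sample tokens are all hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset in the original text
    end: int    # end offset (exclusive)


def offsets_go_backwards(tokens: List[Token]) -> bool:
    """True if any token starts before the previous one or has end < start."""
    last_start = 0
    for tok in tokens:
        if tok.start < last_start or tok.end < tok.start:
            return True
        last_start = tok.start
    return False


def reorder_by_offset(tokens: List[Token]) -> List[Token]:
    """Sort tokens so that start offsets never decrease."""
    return sorted(tokens, key=lambda tok: (tok.start, tok.end))


# Hypothetical segmenter output, emitted out of order.
emitted = [Token("人民", 2, 4), Token("中华", 0, 2), Token("共和国", 4, 7)]

print(offsets_go_backwards(emitted))
print(offsets_go_backwards(reorder_by_offset(emitted)))
```

Reordering alone does not settle position increments or highlighting, which is the remaining integration work mentioned above.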

jieba vs production (StandardTokenizer + QueryString & auto_generate_phrase_queries)

Comparison run summary: ./relevance/comparisons/zh-prod_zh-jieba/

Stats: 9998 query pairs compared
Baseline: ./relevance/queries/zh-prod/results

Metrics:

Query Count: 9998
Zero Results Rate: 30.3%
Poorly Performing Percentage: 43.6%
Top 3 Sorted Results Differ: 73.7%
Top 3 Unsorted Results Differ: 68.5%
Top 5 Sorted Results Differ: 76.4%
Top 5 Unsorted Results Differ: 71.3%
Top 20 Sorted Results Differ: 76.6%
Top 20 Unsorted Results Differ: 70.3%

Delta: ./relevance/queries/zh-jieba/results

Metrics:

Query Count: 9998
   Num TotalHits Changed: μ: 9.78; σ: 13668.23; median: 1.00; range: [-793487, 513838]
   Pct TotalHits Changed: μ: 10644.3%; σ: 516681.1%; median: 0.4%; range: [-100.00%, 51383800.00%]

Zero Results Rate: 17.5% (-12.8%)
Poorly Performing Percentage: 28.3% (-15.3%)
Top 3 Sorted Results Differ: 73.7%
Top 3 Unsorted Results Differ: 68.5%
Top 5 Sorted Results Differ: 76.4%
Top 5 Unsorted Results Differ: 71.3%
   Num Top 5 Results Changed: μ: 2.14; σ: 1.84; median: 2.00; range: [0, 5]
   Pct Top 5 Results Changed: μ: 104.6%; σ: 157.3%; median: 40.0%; range: [0.00%, 500.00%]
               
Top 20 Sorted Results Differ: 76.6%
Top 20 Unsorted Results Differ: 70.3%
   Num Top 20 Results Changed: μ: 3.92; σ: 3.63; median: 3.00; range: [0, 10]
   Pct Top 20 Results Changed: μ: 167.7%; σ: 305.8%; median: 40.0%; range: [0.00%, 1000.00%]

jieba vs ICU tokenization

Comparison run summary: ./relevance/comparisons/icu-tokenizer_zh-jieba/

Stats: 9998 query pairs compared
Baseline: ./relevance/queries/icu-tokenizer/results

Metrics:

Query Count: 9998
Zero Results Rate: 10.6%
Poorly Performing Percentage: 17.2%
Top 3 Sorted Results Differ: 66.5%
Top 3 Unsorted Results Differ: 59.1%
Top 5 Sorted Results Differ: 77.2%
Top 5 Unsorted Results Differ: 66.6%
Top 20 Sorted Results Differ: 82.0%
Top 20 Unsorted Results Differ: 73.2%

Delta: ./relevance/queries/zh-jieba/results

Metrics:

Query Count: 9998
   Num TotalHits Changed: μ: -794.74; σ: 7563.37; median: -1.00; range: [-175217, 513838]
   Pct TotalHits Changed: μ: 5877.6%; σ: 514082.3%; median: -3.8%; range: [-100.00%, 51383800.00%]

Zero Results Rate: 17.5% (+6.9%)
Poorly Performing Percentage: 28.3% (+11.1%)
Top 3 Sorted Results Differ: 66.5%
Top 3 Unsorted Results Differ: 59.1%
Top 5 Sorted Results Differ: 77.2%
Top 5 Unsorted Results Differ: 66.6%
   Num Top 5 Results Changed: μ: 1.83; σ: 1.83; median: 1.00; range: [0, 5]
   Pct Top 5 Results Changed: μ: 41.5%; σ: 54.2%; median: 20.0%; range: [0.00%, 500.00%]

Top 20 Sorted Results Differ: 82.0%
Top 20 Unsorted Results Differ: 73.2%
   Num Top 20 Results Changed: μ: 3.63; σ: 3.63; median: 2.00; range: [0, 10]
   Pct Top 20 Results Changed: μ: 46.0%; σ: 85.6%; median: 20.0%; range: [0.00%, 1000.00%]

I agree that jieba is probably too much technical work, but the results do seem better than the ICU tokenizer's. Against the baseline, the median increase in totalHits is only 1 result / 0.4%. That's pretty small. (Of course the outliers are ridiculous beyond measure, as usual.)
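
A small sketch of why the median is the more useful number here: one extreme query is enough to dominate the mean, and a query that goes from a single hit to hundreds of thousands produces a percentage in the tens of millions. The delta values are invented apart from the 513838 maximum taken from the jieba report, and treating a zero baseline as undefined is an assumption (the thread does not say how relcomp handles it).

```python
import statistics


def pct_change(before, after):
    """Percent change in totalHits; undefined when the baseline is zero."""
    if before == 0:
        return None  # assumption: relcomp's handling of this case is not stated
    return (after - before) / before * 100.0


# Hypothetical per-query changes in totalHits, with one extreme outlier.
deltas = [0, 1, 1, 2, 513838]

print("mean:", statistics.mean(deltas))
print("median:", statistics.median(deltas))

# A query going from 1 hit to 513839 hits gains 513838 hits.
print("pct:", pct_change(1, 513839))
```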

It's hard to get numbers by eye, but it looks like a lot more queries ended up in the "0-20 more queries" bucket for jieba, which seems to be a good sign.

I also wonder how well we can generalize from the Latin/spaced/alphabetic languages to the non-Latin/spaceless/non-alphabetic languages. 17.5% ZRR doesn't seem ridiculous.