
Another look at multi-hyphen tokens on enwiki and zhwiki
Closed, Resolved · Public

Description

Based on what we found in T172653, we'd like to test how often these cases come up: survey queries that get zero results on the home wiki, get identified as some language by TextCat, and then get results on the foreign wiki, and extrapolate back to how often that happens in production.
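The survey filter described above can be sketched in miniature. The row fields and the `toy_identify` stand-in below are assumptions for illustration, not the actual TextCat pipeline:

```python
def survey_candidates(log_rows, identify_language):
    """Select queries that (1) got zero results on the home wiki and
    (2) the identifier labels as some language, so we can go check
    whether they get results on the corresponding foreign wiki."""
    out = []
    for row in log_rows:
        if row["home_results"] != 0:
            continue  # got results at home; no cross-wiki fallback needed
        lang = identify_language(row["query"])
        if lang is None:
            continue  # filtered as too short or too ambiguous
        out.append((row["query"], lang))
    return out

def toy_identify(query):
    """Hypothetical stand-in for TextCat: reject very short strings,
    call anything with CJK characters 'zh', everything else 'en'."""
    if len(query.strip()) < 3:
        return None
    return "zh" if any("\u4e00" <= c <= "\u9fff" for c in query) else "en"

rows = [
    {"query": "通配符", "home_results": 0},   # zero home results, looks Chinese
    {"query": "cat",    "home_results": 12},  # got home results; skipped
    {"query": "--",     "home_results": 0},   # too short; filtered
]
print(survey_candidates(rows, toy_identify))  # → [('通配符', 'zh')]
```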

Also: do a quick survey of overall TextCat performance, since I'm going through a bunch of data anyway.

Event Timeline

We haven't figured out why we get queries that are all punctuation, or massive repeats of a single character like "-", but we're pretty sure they aren't typed by humans. In T172653 we figured out how to stop these queries from taking up too much processing time; this task takes that work one step further.
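One cheap guard along these lines is to skip language identification entirely for queries that contain no word characters at all. This is a minimal sketch of that idea, an assumption for illustration rather than the actual filter from T172653:

```python
import re

# Matches strings made up entirely of non-word characters (punctuation,
# symbols, whitespace) plus underscore, e.g. "----------" or "!!!???".
NO_WORD_CHARS = re.compile(r"^[\W_]+$")

def is_junk_query(query):
    """True if the query has no word characters worth identifying."""
    return bool(NO_WORD_CHARS.match(query))

print(is_junk_query("----------"))   # True
print(is_junk_query("...?!"))        # True
print(is_junk_query("naïve query"))  # False: real (Unicode) word characters
```

Python's `\W` is Unicode-aware for `str`, so accented and CJK characters count as word characters and such queries pass through to the identifier.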

The results are in! A brief summary:

  • Current language identification performance looks good; not too many queries are being filtered as too short or too ambiguous, and almost all of the enabled languages show up in a 5K sample.
    • Language identification errors that have results shown to users are very rare on English Wikipedia (which is by far the largest by volume), but more common elsewhere. The most common "bad" result is a non-language string of Latin characters being identified as English (because it's boosted) and then results are found (because English Wikipedia has so many non-word things in it). This isn't terrible, just not always very helpful.
  • Poorly-performing all-punctuation/symbol queries are uncommon, and most don't get cross-language results, though we dug up a small number of examples that do.
    • The Chinese query-based model is nonetheless ridiculously skewed in favor of multiple periods and multiple dashes, and should be fixed.
  • It makes sense to enable some "less ambiguous" languages everywhere, just because the potential for harm is very low, and there is some small upside possibility.
  • Loosening the criteria for allowing DYM suggestion results to block cross-language results seems like a good idea.
  • Our query sampling process is now much improved, though working that out did cause a delay.

Lots more detail is available on MediaWiki: Review of Language Identification in Production, with a Special Focus on Stupid Identification Tricks. (This is one of my longer write-ups, and that's saying something!)

More tickets for the suggested next steps to come.