
Another look at multi-hyphen tokens on enwiki and zhwiki
Closed, Resolved · Public

Description

Based on what we found in T172653, we'd like to test how often these cases come up: survey queries that get zero results on the home wiki, get identified as some language by TextCat, and then get results on the foreign wiki, and extrapolate back to how often that happens in production.
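The survey filter described above can be sketched in miniature. The row fields and the `toy_identify` stand-in below are assumptions for illustration, not the actual TextCat pipeline:

```python
def survey_candidates(log_rows, identify_language):
    """Select queries that (1) got zero results on the home wiki and
    (2) the identifier labels as some language, so we can go check
    whether they get results on the corresponding foreign wiki."""
    out = []
    for row in log_rows:
        if row["home_results"] != 0:
            continue  # got results at home; no cross-wiki fallback needed
        lang = identify_language(row["query"])
        if lang is None:
            continue  # filtered as too short or too ambiguous
        out.append((row["query"], lang))
    return out

def toy_identify(query):
    """Hypothetical stand-in for TextCat: reject very short strings,
    call anything with CJK characters 'zh', everything else 'en'."""
    if len(query.strip()) < 3:
        return None
    return "zh" if any("\u4e00" <= c <= "\u9fff" for c in query) else "en"

rows = [
    {"query": "通配符", "home_results": 0},   # zero home results, looks Chinese
    {"query": "cat",    "home_results": 12},  # got home results; skipped
    {"query": "--",     "home_results": 0},   # too short; filtered
]
print(survey_candidates(rows, toy_identify))  # → [('通配符', 'zh')]
```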

Also: do a quick survey of overall TextCat performance, since I'm going through a bunch of data anyway.

Event Timeline

We haven't figured out why we get queries that are all punctuation, or massive repeats of a single character like "-", but we're pretty sure they aren't typed by humans. In T172653 we figured out how to stop these queries from taking up too much processing time; this task takes that work one step further.
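One cheap guard along these lines is to skip language identification entirely for queries that contain no word characters at all. This is a minimal sketch of that idea, an assumption for illustration rather than the actual filter from T172653:

```python
import re

# Matches strings made up entirely of non-word characters (punctuation,
# symbols, whitespace) plus underscore, e.g. "----------" or "!!!???".
NO_WORD_CHARS = re.compile(r"^[\W_]+$")

def is_junk_query(query):
    """True if the query has no word characters worth identifying."""
    return bool(NO_WORD_CHARS.match(query))

print(is_junk_query("----------"))   # True
print(is_junk_query("...?!"))        # True
print(is_junk_query("naïve query"))  # False: real (Unicode) word characters
```

Python's `\W` is Unicode-aware for `str`, so accented and CJK characters count as word characters and such queries pass through to the identifier.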

The results are in! A brief summary:

  • Current language identification performance looks good; not too many queries are being filtered as too short or too ambiguous, and almost all of the enabled languages show up in a 5K sample.
    • Language identification errors that have results shown to users are very rare on English Wikipedia (which is by far the largest by volume), but more common elsewhere. The most common "bad" result is a non-language string of Latin characters being identified as English (because it's boosted) and then results are found (because English Wikipedia has so many non-word things in it). This isn't terrible, just not always very helpful.
  • Poorly-performing all-punctuation/symbol queries are uncommon, and most don't get cross-language results, though we dug up a small number of examples that do.
    • The Chinese query-based model is nonetheless ridiculously skewed in favor of multiple periods and multiple dashes, and should be fixed.
  • It makes sense to enable some "less ambiguous" languages everywhere, just because the potential for harm is very low, and there is some small upside possibility.
  • Loosening the criteria for allowing DYM suggestion results to block cross-language results seems like a good idea.
  • Our query sampling process is now much improved, though working that out did cause a delay.

Lots more detail is available on MediaWiki: Review of Language Identification in Production, with a Special Focus on Stupid Identification Tricks. (This is one of my longer write-ups, and that's saying something!)

More tickets for the suggested next steps to come.