
investigate multi-hyphen tokens on enwiki and zhwiki
Closed, Resolved · Public

Description

This may become separate tasks, but a quick look into what is going on is needed first.

Searching for four dashes (----) on enwiki is slow (see T169498), but the query also gets language-identified as Chinese. Oddly, it also gives results that don't seem to have any dashes!

  1. Why is TextCat saying ---- is Chinese on enwiki?
  2. Why does searching for ---- on zhwiki give results (and seem to highlight random characters as matches)?

Event Timeline

The English analysis chain, when run against ---------- with both the text and text_search analyzers, returns no tokens. The Chinese analysis chain returns a token for each -, converted to a comma (according to the source, all punctuation is converted to commas).

When running the English analysis chain's top-level tokenizer, Lucene's StandardTokenizerImpl, we hit a code path inside nextToken that throws away punctuation, with the comment:

/* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it. */

The Chinese tokenizer, Lucene's HMMChineseTokenizer, on the other hand converts all punctuation into commas via org.apache.lucene.analysis.cn.smart.hhmm.SegTokenFilter. This appears to have been the default behaviour since smartcn was first introduced to Lucene in 2009.
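
For reference, the difference is easy to reproduce through the _analyze API. A minimal sketch in Python; the endpoint, index names, and analyzer name are assumptions, not the actual production configuration:

```python
import requests

# Minimal sketch, assuming a local Elasticsearch endpoint; the index and
# analyzer names are placeholders and may not match CirrusSearch's setup.
ES = "http://localhost:9200"
HYPHENS = "-" * 10

for index in ("enwiki_content", "zhwiki_content"):
    resp = requests.post(
        f"{ES}/{index}/_analyze",
        json={"analyzer": "text_search", "text": HYPHENS},
    )
    tokens = [t["token"] for t in resp.json().get("tokens", [])]
    # Expected per the observation above: [] for enwiki, ten "," tokens for zhwiki.
    print(index, tokens)
```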

I think the right approach forward might be to mimic StandardTokenizerImpl and add a step to our analysis chains that rejects the , tokens smartcn is generating. The most direct approach seems to be to add a stop words token filter to the Chinese analysis chain and provide it a custom list of stopwords containing only the comma.
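
A rough sketch of what that stop filter could look like in the index settings, written here as a Python dict; the filter and analyzer names are placeholders, and the real zhwiki analysis chain has more pieces than shown:

```python
# Hypothetical settings fragment: a stop filter whose only stopword is the
# comma that SegTokenFilter substitutes for punctuation.
zhwiki_analysis = {
    "analysis": {
        "filter": {
            "smartcn_stop": {              # placeholder filter name
                "type": "stop",
                "stopwords": [","],
            }
        },
        "analyzer": {
            "text": {                      # placeholder; the real chain has more filters
                "type": "custom",
                "tokenizer": "smartcn_tokenizer",  # assumes the smartcn plugin's tokenizer
                "filter": ["smartcn_stop", "lowercase"],
            }
        },
    }
}
```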

EBernhardson changed the visibility from "Public (No Login Required)" to "WMF-NDA (Project)".Aug 7 2017, 11:43 PM

(changing visibility to WMF-NDA because this ticket basically describes how to DOS the search infrastructure)

@dcausse

I should also note that this problem does not appear to be limited to this specific case of queries consisting mostly of dashes; it just happened to come up. I wonder if we should be adjusting all the analysis chains so that the text_search analysis chain has a limit on the number of tokens it will produce (see the sketch after the timings below).

This isn't quite as easy to "accidentally" create in English, but the same behaviour is there, for example using repetitions of the words 'a' (fairly common) and 'the' (the most common English word). As an aside, we only allow 300-character queries, so you can actually only search for 'the' 75 times. Also worth noting: all of these were run while providing Elasticsearch the timeout=10s query parameter.

| word | times | took  |
|------|-------|-------|
| a    | 50    | 30s   |
| a    | 100   | 45s   |
| a    | 200   | 1m12s |
| the  | 50    | 2m50s |
| the  | 100   | 4m28s |

None of these quite rises to the level of what is happening in Chinese, but it seems like a general enough issue that it should be fixed.
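
One generic way to cap token counts, as floated above, is Elasticsearch's limit token filter; a sketch with placeholder names and an arbitrary cap:

```python
# Hypothetical settings fragment: cap how many tokens the search-time
# analyzer will emit. 128 is an illustrative value, not a tuned one.
search_analysis = {
    "analysis": {
        "filter": {
            "token_limit": {
                "type": "limit",
                "max_token_count": 128,
            }
        },
        "analyzer": {
            "text_search": {               # placeholder; real chains differ per wiki
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "token_limit"],
            }
        },
    }
}
```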

There is also a separate problem: the Elasticsearch query timeout feature is not able to cancel these queries (at least not immediately). If I had to guess, the reason is pretty much the same as what we ran into before with regex timeouts (https://phabricator.wikimedia.org/T152895#2872562), except this time it's Lucene/Elasticsearch and not our own plugin, so it will be harder to force the timeout.

Related TextCat weirdness: searching for "Сharkoviensis" (note that the first letter is a Cyrillic С) recommends Simple English results. The Cyrillic C should be fixed, but in the meantime: What the heck?

Update to the Chinese analysis is here. In a 10K-article sample, 16.4% of tokens are indexed as commas, spread across 261 different characters. Stopword filter added.

I think Gerrit can't post here because of the NDA exclusion.

debt triaged this task as Medium priority.Aug 17 2017, 6:17 PM
debt added a project: Discovery-ARCHIVED.
debt subscribed.

This was merged on Aug 16, but it looks like it'll go out with next week's train. Keeping it in the done column for now, until it actually goes into production.

On #1:

I took a quick look at why strings of hyphens are being identified as Chinese. With all languages enabled, it's a race between Chinese and Telugu for which matches best, depending on the exact number of hyphens.

It's generally a junk query (especially when there are 90+ hyphens), and normally, since hyphens aren't indexed, there are no results to show even if the query is identified as Chinese.

This could be fixed by including the hyphen in our list of non-word characters, but I don't think that's worth doing. There probably is some discrimination value, since some languages use hyphens more than others; multiple hyphens are just a pathological case.

Maybe we should have some sanity filter on TextCat results? I.e., if the string does not contain any letters that are used in a language (hopefully we can find a good definition of that... but we can certainly use one that excludes ASCII non-letters), it's probably not in any particular useful language, whatever the TextCat n-grams say.

We could have a second list of non-word characters and say that if a string consists of only those characters, we ignore it. It would need to be a very long list—there are 111 Unicode punctuation characters, 60 number forms, 112 arrows, a zillion emoji, etc., etc. The list of acceptable characters is going to be even longer, since it would have to include all Chinese characters. Unicode ranges or named Unicode patterns might work.
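
As one illustration of the "named Unicode patterns" route, a property-class check sidesteps maintaining explicit ranges. This sketch uses Python's third-party regex module (a \p{L} pattern in PHP's preg_match would be analogous) and is purely illustrative:

```python
import regex  # third-party module; supports Unicode property classes like \p{L}

def only_non_letters(query: str) -> bool:
    """Illustrative: true if the query contains no Unicode letters at all.
    \p{L} covers Latin, Cyrillic, CJK ideographs, etc. without explicit ranges."""
    return regex.search(r"\p{L}", query) is None

# only_non_letters("----------") -> True; only_non_letters("中文") -> False
```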

A smaller list of more common characters might also be reasonable, but there's always the possibility of someone searching for a long series of any given character, or other nonsense.

I tried to filter a lot of this with the addition of the minimum-length and max-proportion-of-max-score filters, which filter out single characters and many things that look like non-language junk.
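
For readers unfamiliar with those filters, the shape of the checks is roughly as follows; the parameter names, thresholds, and scoring convention here are illustrative, not the actual TextCat/CirrusSearch configuration:

```python
def passes_junk_filters(query: str, best_score: float, max_possible_score: float,
                        min_input_length: int = 3,
                        max_proportion: float = 0.85) -> bool:
    """Illustrative sketch only. TextCat scores are treated here as distances
    (lower is better); a best score close to the maximum possible means the
    query barely matched any language model and is probably junk."""
    if len(query) < min_input_length:
        return False
    if best_score > max_proportion * max_possible_score:
        return False
    return True
```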

Still, there will always be edge cases. Here, a few things happened in concert: (1) the Chinese sample used to build the sample distribution happened to have enough hyphens in it that a string of hyphens could still pass the max proportion of max score filter; (2) Chinese happened to have enough more hyphens than other languages that it barely passed the ambiguity check; and (3) there had to be results on Chinese Wikipedia when searching for lots of hyphens.

Things like (1) and (2) will probably always happen for some specific characters. (3) is the totally unexpected element this time, since SmartCN indexes hyphens, which really makes no sense. If "-----" had been identified as French by TextCat, there wouldn't be any results to show, and no one would know!

Once the patch is merged and Chinese Wikipedia re-indexed, most punctuation queries will fail to get any results there—though there are always other weird cases like "--" being a redirect to the decrement operator.

Another option would be to not only require fewer than three results to trigger TextCat, but also to block TextCat if the query gets an exact title match on the home wiki. That would solve the "--" case and would also have solved some of the one-character query cases (like a single quote being labeled "Hebrew").
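
Expressed as a condition, the proposed trigger would look something like this (the names are placeholders for whatever the interwiki-search code actually exposes):

```python
def should_run_textcat(home_result_count: int, exact_title_match: bool) -> bool:
    """Illustrative sketch: only fall back to TextCat language detection when
    the home wiki returns too few results AND the query is not itself a title
    there (e.g. "--" redirecting to the decrement operator article)."""
    return home_result_count < 3 and not exact_title_match
```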

> It would need to be a very long list—there are 111 Unicode punctuation characters, 60 number forms, 112 arrows, a zillion emoji,

Realistically, what we'll probably encounter is ASCII punctuation and maybe emoji, so we can use those ranges. Alternatively, we could use IntlChar::charType or IntlBreakIterator or any other ICU function to check that; I'm pretty sure something in ICU allows doing that without having to build the ranges manually.
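
A per-character version of that check, sketched with Python's standard-library unicodedata standing in for an ICU char-type lookup such as IntlChar::charType:

```python
import unicodedata

def contains_letter(query: str) -> bool:
    """Illustrative: true if any character falls in a Unicode letter category
    (Lu, Ll, Lt, Lm, Lo), so Latin, Cyrillic, CJK ideographs, etc. all count,
    with no hand-maintained ranges."""
    return any(unicodedata.category(ch).startswith("L") for ch in query)
```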

> the Chinese sample used to build the sample distribution happened to have enough hyphens in it that a string of hyphens could still pass the max proportion of max score filter

Maybe instead of using a filter on the search string, we should only apply it to the training sets, and then we won't have bad characters in the n-gram data? Or will that hurt the accuracy?

> Realistically, what we'll probably encounter is ASCII punctuation

I don't know. People do weird stuff. Even something like jkjkjkjkjkjkjk could squeak by all the filters and come up identified as something, but hopefully it wouldn't get any search results. I think the big hole in the safety net was SmartCN indexing punctuation. I should have paid more attention to that when I was working on Chinese.

> Maybe instead of using a filter on the search string, we should only apply it to the training sets, and then we won't have bad characters in the n-gram data? Or will that hurt the accuracy?

We do have a very small filter on the training data: we skip numbers, whitespace, and parens. Other punctuation is clearly distinctive (¿?, «», or 。, for example), and it's possible that the distribution of punctuation is helpful. Shorter sentences mean more periods; some languages might use commas or dashes more often, etc.

I still think this is mostly an edge case that doesn't need more work right now.

Bonus thought from when I'm not right about to run to a meeting: we could also test how often these cases come up by doing a survey of queries that get zero results on the home wiki, get identified as something by TextCat, and then get results on the foreign wiki, and then extrapolating back to how often it happens in production.

Thanks for the additional conversation; I've created a new ticket, T174116, to take another look, based on Trey's comment: T172653#3549738.

Closing this out, as it was pushed into production last week. Thanks again, @TJones, for all the good stuff written up in this ticket!

EBernhardson changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".Jun 1 2018, 5:38 PM

Removing WMF-NDA, as this is no longer a DOS vector.