
UDF for language detection
Open, Lowest, Public


In order to answer questions such as

  • "How many search queries happen in what languages?"
  • "How many files are in lang X?"
  • "How many files/descriptions are in multiple languages?"

we need to be able to identify the language of the search queries in CirrusSearchRequestSet, file descriptions in Edits data lake, and event logging data in Hive. We have Accept-Language data (in a lot of cases, but not always) in Cirrus logs, but it's not a reliable marker – although we can use it together with a more reliable method such as TextCat (a popular algorithm for this task).

Once a library is available, I can write the UDF that uses it, but I'll need help building up a catalog of languages. @TJones and Stas have implemented a PHP version that we use in production, so Trey already has A BUNCH of TextCat language models (one set specifically for search queries and one set built from Wikipedia articles for general language identification). We might need to convert or recreate these in a different format.

Fortunately, there appears to be a Java library implementing TextCat that we could use. So the first step would be making that library available by adding it as a dependency to analytics/refinery/source/refinery-hive. I assume it would need to be reviewed by Ops/Analytics for security reasons? Or maybe there's a way for us to use the PHP implementation from inside Java/Hive?

I should also mention that besides being able to answer important questions for the Structured Data on Commons endeavor, there are also research questions about language of edits & content that could then be asked and answered if this were to be available :)

Event Timeline

mpopov triaged this task as Medium priority. Dec 7 2017, 8:44 PM
mpopov created this task.

So, I think this is a very nifty idea, but there are some potential pitfalls to be aware of.

The claim (copied in the Java port from the original) that TextCat is "a library that was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy" is far from true, especially when dealing with short strings like queries and file names. Getting optimal performance from TextCat requires significant tuning, and knowledge of the kinds of data you are likely to throw at it (which may require manual tagging of data samples).

The Java library is from 2006. I haven't looked at it closely, but I downloaded it and looked at the language models. They are from the original TextCat implementation, and are pre-Unicode. So, there is an "arabic-iso8859_6" model and an "arabic-windows1256" model and "chinese-big5" and "chinese-gb2312" models. The problem isn't so much the models as the potential lack of Unicode support. I did have to upgrade the Perl version to support Unicode. It was easy enough, but did need to be done.
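To make the encoding issue concrete, here's a small illustrative sketch (not from any of the TextCat implementations): the pre-Unicode models work on raw bytes, and the same string produces different byte n-grams under different encodings, so a model built for one encoding can't score text in another.

```python
# Demonstration: the same string yields different *byte* n-grams under
# different encodings, so byte-based language models are encoding-specific.
def byte_bigrams(data: bytes) -> set:
    """All overlapping 2-byte sequences in a byte string."""
    return {bytes(data[i:i + 2]) for i in range(len(data) - 1)}

text = "café"
utf8_grams = byte_bigrams(text.encode("utf-8"))      # 'é' is two bytes: 0xC3 0xA9
latin1_grams = byte_bigrams(text.encode("latin-1"))  # 'é' is one byte: 0xE9
```

Any n-gram touching the accented character differs between the two encodings, which is why an "arabic-iso8859_6" model and an "arabic-windows1256" model had to exist separately, and why Unicode support has to be added rather than assumed.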

The Java version uses language models in the format of the original Perl implementation. My updated Perl implementation already has the same models as the PHP library, but in the original format.

The Java port is also going to be missing all of the other upgrades I made in the Perl update and PHP port. See "Updates" in the Perl README. Important ones include:

  • specifying a minimum input length (otherwise really short strings get essentially random identifications);
  • the "alphabetic sub-sort" (without which n-grams can be loaded in random order, which can cause random score fluctuations);
  • "max proportion of worst score" (without which, unexpected strings like hjdkashljkljkldjklsajdklas, emoji, and scripts you don't have models for can end up with very confident results that are totally bogus).

Useful updates include boosting certain languages (e.g., assuming strings on French Wikipedia are more likely to be French than not, so you might boost based on the user's Accept-Language header, for example), and allowing for multiple language model directories (so that query-based and Wiki-based models can easily be used together). I think the original version of TextCat also always used all available models, which is often not a great idea; if so, the Java port may do the same.
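The two guard-rail updates above can be sketched roughly like this (parameter names and default thresholds here are hypothetical, for illustration only; `distances` would come from TextCat's raw per-language scores, where lower is better):

```python
def guarded_detect(text, distances, num_doc_ngrams,
                   model_size=400, min_length=3, max_worst_frac=0.9):
    """Toy wrapper around raw TextCat distances (illustrative thresholds).

    distances: {language: out-of-place distance}, lower is better.
    num_doc_ngrams: how many n-grams the input text produced.
    """
    if len(text.strip()) < min_length:
        return None  # too short: any identification would be essentially random
    best = min(distances, key=distances.get)
    # Worst possible score: every n-gram in the input is unknown to the model.
    worst = num_doc_ngrams * model_size
    if distances[best] > max_worst_frac * worst:
        return None  # even the best model barely knows these n-grams
    return best
```

The second check is what keeps gibberish, emoji, and unmodeled scripts from getting a confident but bogus label: if even the winning model scored the input nearly as badly as an all-unknown string, no answer is the right answer.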

The PHP port and Perl update also support specifying the model size. I can't recall if the original does (and don't know if the Java port does). This is important, because the current models are very large (10K n-grams). Tuning the model size also improves performance. Bigger models recognize more n-grams (i.e., less common ones), but also give a bigger penalty to unknown n-grams, which can be an issue with queries and file names, where something unexpected could easily pop up.
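For anyone unfamiliar with how TextCat scores text, here's a minimal sketch of the underlying "out-of-place" ranking (the Cavnar-Trenkle measure that all the TextCat implementations are based on). It shows how the `size` parameter does double duty: it caps the model's length *and* sets the penalty for unknown n-grams, which is exactly why bigger models punish unexpected input harder. The parameter values are illustrative, not the production settings.

```python
from collections import Counter

def ngram_profile(text, max_n=5, size=400):
    """Top-`size` character n-grams (lengths 1..max_n) by frequency; the rank order is the model."""
    padded = "_" + text.lower().replace(" ", "_") + "_"
    counts = Counter(padded[i:i + n]
                     for n in range(1, max_n + 1)
                     for i in range(len(padded) - n + 1))
    # Frequency-descending, then alphabetic (the "alphabetic sub-sort" fix)
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc, model, size=400):
    """Sum of rank differences; n-grams missing from the model cost the maximum, `size`."""
    return sum(abs(rank - model[g]) if g in model else size
               for g, rank in doc.items())

def detect(text, models):
    """Pick the language model with the smallest out-of-place distance."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

Real models are built from much larger corpora, of course, but even toy profiles show the mechanics: every unknown n-gram adds a flat `size` penalty, so trimming a 10K-n-gram model changes both what it recognizes and how harshly it treats surprises.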

In general, throwing a lot of languages at short strings can lead to bad results. We have a model for Scots, which, if enabled, can be the ID given for English text much more often than it is the ID for actual Scots text, because we almost never see actual Scots text. French and English get mixed up a lot, too (almost any one word ending in -ble could go either way), as do other pairs of closely related languages. Choosing the languages you are most likely to encounter makes TextCat smarter and faster.

You also have to expect potential weird results for names (of people, places, and things) because names are not really "in" a language. Some correlate with a particular language very highly, while others, like "Maria" can be found all over the world. Even when a name looks like it comes from a particular language, that can be very misleading. (For example, the actor Enrico Colantoni has a very Italian name, but he's Canadian, and mostly famous for being in American movies and TV shows. What language is his name in?) This is a general language ID problem, not specific to TextCat, so you should plan for it unless you also have a name detector.

We chose TextCat for CirrusSearch because it is very lightweight. You might want to consider looking around for something smarter but slower if your application can handle that—particularly something with a reasonable dictionary for each language covered. Dictionaries can easily tell you that incredible is English, incroyable is French, and increíble is Spanish—while TextCat, based only on n-grams, may struggle. On the other hand, a dictionary may not recognize that increíble is also Asturian. And, on the other other hand, adding language coverage for something using dictionaries can be very difficult, while adding a model to TextCat for most languages is pretty easy once you slurp down some text from the relevant Wikipedia (though some very small Wikipedias have lots of text not in their "home" language, so caveat lector before you carpe textum).
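The dictionary point can be made concrete with a toy sketch (the word lists here are tiny, made-up stand-ins, not real dictionaries, and a real one would also surface the Asturian overlap mentioned above):

```python
# Toy word lists, illustrative only: real dictionary-based detectors use
# full lexicons per language and have to handle words shared across them.
DICTS = {
    "en": {"incredible", "the", "house"},
    "fr": {"incroyable", "le", "maison"},
    "es": {"increíble", "el", "casa"},
}

def dict_detect(word):
    """Return every language whose (toy) dictionary contains the word."""
    return sorted(lang for lang, words in DICTS.items() if word.lower() in words)
```

Note that the function returns a list, not a single answer: with dictionaries, "no match" and "matches several languages" are both common outcomes, which is part of why coverage is hard to extend.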

TL;DR: If you aren't expecting the claim of near-perfect accuracy to be true on short strings like queries and file names, the Java port plus the updated Perl language models may take you far enough—though the Java port may need to be updated to handle Unicode correctly, and the language models may need to be trimmed down from 10K n-grams. If you need much higher accuracy, then you'll need to do some sampling and tuning, and you'll want some of the extra features added to the PHP port and Perl update.

I'm here to help!

I wonder if it's possible to make a Hive UDF that uses Google's Compact Language Detector v3 (CLD3) library & model.

There's an R wrapper that I'll try with search queries & a dump of Commons, and if the results are decent, maybe we could make a Java wrapper to it?

The CLD3 page says it is intended to run in a browser and relies on Chromium... that's kinda weird. And I don't see a list of supported/identified languages. The R wrapper is excellent for testing, though!

You can also test TextCat if you have some data. It can take a file of one item per line and output its guesses line by line. You can configure it to give you the best guess or to give multiple guesses to see what its second and third choices are, too. If you have data somewhere, I can easily run it for you, too, if you just want to check out the results. (Despite my earlier caveats, I'm not saying you shouldn't use TextCat, just that it is not at all "near-perfect"—and on short strings, nothing will be.)

> The CLD3 page says it is intended to run in a browser and relies on Chromium... that's kinda weird.

It seems you can install it as a library with a dependency on protocol buffers, but in any case it's a C++ library; to use it in a UDF you'd need Java bindings that call the library via JNI, I assume, so it would be some work. There are JavaScript bindings, though:

How about tagging each request with its language using this library?

The JavaScript bindings open an interesting possibility. We could use them to consume and tag a Kafka topic, publishing to another topic. We've already worked with Kafka in JavaScript via KafkaSSE (and Kasocki before that). So this is possible, but we're not sure where it would fit in the overall infrastructure. It's probably easier to just make the Java bindings, but we don't have room to prioritize that work ourselves.

I don't know that it would be particularly useful here, but I have a WIP patch to expose our language detection via the MediaWiki API that could potentially be useful for external language detection. For bulk access from Hadoop (as a UDF would need), hitting a MediaWiki API is probably undesirable.

debt added a subscriber: debt.

Punting to later - it'd be cool, but not sure we have the time to really dig into this right now.

nshahquinn-wmf lowered the priority of this task from Medium to Lowest.Sep 27 2018, 8:38 PM