In order to answer questions such as
- "How many search queries happen in what languages?"
- "How many files are in lang X?"
- "How many files/descriptions are in multiple languages?"
we need to be able to identify the language of search queries in CirrusSearchRequestSet, file descriptions in the Edits data lake, and event logging data in Hive. We have Accept-Language data in Cirrus logs (in many cases, but not always), but on its own it's not a reliable marker; we can, however, use it together with a more reliable method such as TextCat (a popular n-gram-based algorithm for this task).
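For context, TextCat is based on Cavnar & Trenkle's character n-gram ranking: each language gets a profile of its most frequent n-grams, and a text is classified by the "out-of-place" distance between its profile and each language's. Here's a minimal self-contained sketch of that idea; class name, constants, and training texts are all illustrative, not the API of any actual Java TextCat library:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of TextCat-style language identification
// (Cavnar & Trenkle n-gram ranking). Not any real library's API.
public class TextCatSketch {
    static final int MAX_N = 5;          // n-gram lengths 1..5
    static final int PROFILE_SIZE = 300; // keep the top-ranked n-grams

    // Build a ranked n-gram profile: n-gram -> rank (0 = most frequent).
    static Map<String, Integer> profile(String text) {
        Map<String, Long> counts = new HashMap<>();
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= MAX_N; n++)
            for (int i = 0; i + n <= padded.length(); i++)
                counts.merge(padded.substring(i, i + n), 1L, Long::sum);
        List<String> ranked = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(PROFILE_SIZE)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < ranked.size(); i++) ranks.put(ranked.get(i), i);
        return ranks;
    }

    // "Out-of-place" distance: sum of rank differences between the document
    // profile and a language profile; unseen n-grams get the max penalty.
    static long distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        long d = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer r = lang.get(e.getKey());
            d += (r == null) ? PROFILE_SIZE : Math.abs(r - e.getValue());
        }
        return d;
    }

    // Classify by the closest trained language profile.
    static String identify(String text, Map<String, Map<String, Integer>> models) {
        Map<String, Integer> doc = profile(text);
        return models.entrySet().stream()
                .min(Comparator.comparingLong(
                        (Map.Entry<String, Map<String, Integer>> e) ->
                                distance(doc, e.getValue())))
                .map(Map.Entry::getKey).orElse("unknown");
    }

    public static void main(String[] args) {
        // Toy training data; real models are built from large corpora
        // (e.g. query logs or Wikipedia text).
        Map<String, Map<String, Integer>> models = new HashMap<>();
        models.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        models.put("de", profile("der schnelle braune fuchs springt über den faulen hund und die katze"));
        System.out.println(identify("the dog and the fox", models));
    }
}
```

In production the language profiles would of course be trained from real corpora and loaded from files (which is where Trey's existing model sets come in), but the core comparison is this simple, which is also why short queries are hard: they yield few n-grams to rank.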
Once a library is available, I can write the UDF that uses it, but I'll need help building up a catalog of languages. @TJones and Stas have implemented a PHP version that we use in production, so Trey already has A BUNCH of TextCat language models (one set tuned specifically for search queries and one set built from Wikipedia articles for general language identification). We might need to convert or recreate these in a different format.
Fortunately, there appears to be a Java implementation of TextCat that we could use. So the first step would be making that library available by adding it as a dependency to analytics/refinery/source/refinery-hive. I assume it would need to be reviewed by Ops/Analytics for security reasons? Or maybe there's a way for us to use the PHP implementation from inside Java/Hive?
I should also mention that besides helping answer important questions for the Structured Data on Commons endeavor, there are also research questions about the language of edits & content that could be asked and answered if this were available :)