To answer questions such as:
- "How many search queries happen in what languages?"
- "How many files are in lang X?"
- "How many files/descriptions are in multiple languages?"
we need to be able to identify the language of search queries in [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Cirrus | CirrusSearchRequestSet ]], of file descriptions in the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits | Edits data lake ]], and of [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_&_Hive | event logging data ]] in Hive. Cirrus logs often (though not always) include [[ https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language | Accept-Language ]] data, but that header is not a reliable signal on its own; we could, however, combine it with a more reliable method such as [[ https://www.mediawiki.org/wiki/TextCat | TextCat ]], a widely used n-gram-based language identification algorithm.
Once a library is available, I can write the UDF that uses it, but I'll need help building up a catalog of languages. @TJones and Stas have implemented [[ https://github.com/wikimedia/wikimedia-textcat | a PHP version that we use in production ]], so Trey already has a large collection of TextCat language models ([[ https://github.com/wikimedia/wikimedia-textcat/tree/master/LM-query | one set trained specifically on search queries ]] and [[ https://github.com/wikimedia/wikimedia-textcat/tree/master/LM | one set built from Wikipedia articles for general language identification ]]). We might need to convert or recreate these models in a different format.
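For context, TextCat's rank-order approach can be sketched in pure Java. This is a toy illustration of the algorithm (build ranked character n-gram profiles, then pick the language model with the smallest "out-of-place" distance), not the production PHP code or the SourceForge library; all class/method names and constants here are invented, and real models are trained on large corpora rather than the tiny samples below.

```java
import java.util.*;
import java.util.stream.*;

public class TextCatSketch {
    static final int MAX_NGRAM = 5;        // TextCat uses 1- to 5-grams
    static final int PROFILE_SIZE = 400;   // profile size is a tunable parameter
    static final int MISSING_PENALTY = PROFILE_SIZE;

    /** Build a ranked n-gram profile: n-gram -> rank (0 = most frequent). */
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Normalize whitespace to '_' and pad, as in the classic algorithm.
        String padded = "_" + text.toLowerCase().replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= MAX_NGRAM; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(PROFILE_SIZE)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < ranked.size(); i++) ranks.put(ranked.get(i), i);
        return ranks;
    }

    /** "Out-of-place" distance between a document profile and a language model. */
    public static int distance(Map<String, Integer> doc, Map<String, Integer> model) {
        int d = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer modelRank = model.get(e.getKey());
            // N-grams absent from the model get the maximum penalty.
            d += (modelRank == null) ? MISSING_PENALTY : Math.abs(modelRank - e.getValue());
        }
        return d;
    }

    /** Classify text as the language whose model is nearest. */
    public static String classify(String text, Map<String, Map<String, Integer>> models) {
        Map<String, Integer> doc = profile(text);
        return models.entrySet().stream()
            .min(Comparator.comparingInt(e -> distance(doc, e.getValue())))
            .map(Map.Entry::getKey)
            .orElse("unknown");
    }

    public static void main(String[] args) {
        // Toy "models" built from tiny samples; real ones come from large corpora.
        Map<String, Map<String, Integer>> models = new HashMap<>();
        models.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        models.put("de", profile("der schnelle braune fuchs springt über den faulen hund und die katze"));
        System.out.println(classify("the dog and the fox", models));
    }
}
```

A UDF would essentially wrap `classify`, with the models loaded once from whatever format we settle on.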
**Fortunately**, [[ http://textcat.sourceforge.net/ | there appears to be a Java implementation of TextCat ]] that we could use. The first step would be making that library available by adding it as a dependency of `analytics/refinery/source/refinery-hive`. I assume it would need to be reviewed by Ops/Analytics for security reasons? Alternatively, is there a way for us to call the PHP implementation from inside Java/Hive?
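Adding the dependency would amount to something like the following in refinery-hive's `pom.xml`. The coordinates below are placeholders; whether the SourceForge project is published to Maven Central at all (and under what groupId/artifactId) would need to be checked first:

```
<!-- Hypothetical coordinates: verify the library's actual Maven
     availability before relying on this. -->
<dependency>
    <groupId>org.example</groupId>
    <artifactId>textcat</artifactId>
    <version>1.0</version>
</dependency>
```

If it isn't published anywhere, we'd have to vendor the jar or rebuild it ourselves, which strengthens the case for the security review.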
I should also mention that, besides answering important questions for the Structured Data on Commons effort, making this available would open the door to research questions about the language of edits and content that could then be asked and answered :)