
Automate Process for Selecting Languages for a Given Wiki for Use With TextCat
Closed, Declined (Public)

Description

It takes 1.5 to 2 days per wiki to do the necessary review to manually determine the right set of languages to use with TextCat for a given wiki. (See T132466 for French, German, Italian, and Spanish as an example.)

This does not scale to a large number of wikis (more than 10 is doubtful; more than 25 is definitely out of reach). If we want to update the list of relevant languages regularly (quarterly to annually), reducing the human time and effort involved would help, and could also allow more frequent updates (monthly to quarterly).

It may be possible to automate this process.

Rather than manually filter the queries and manually identify their language as has previously been done, we could optimize the set of languages used with TextCat against the number of results generated by searching against the wikis identified.

Details of how to automatically rank the "quality" of the results by result count, and of the process for optimizing the language set, require some more thought, but there are some obvious first approaches to try (details available upon request; this is already getting long).
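As one illustration of what a first approach might look like (not necessarily the approach intended here), the sketch below greedily grows a language set so that as many queries as possible get at least one result. It assumes two precomputed inputs that this task does not define: `rankings`, mapping each query to TextCat's candidate languages ordered best first, and `result_counts`, mapping each query to the number of results on each language's wiki. It also assumes that restricting TextCat to a language set is equivalent to filtering that ranking, which is a simplification. All names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def pick_language(ranking: List[str], allowed: Set[str]) -> Optional[str]:
    """Return the best-ranked language that is in the allowed set, if any."""
    for lang in ranking:
        if lang in allowed:
            return lang
    return None


def score(language_set: Set[str],
          rankings: Dict[str, List[str]],
          result_counts: Dict[str, Dict[str, int]]) -> int:
    """Count queries whose identified language yields at least one result."""
    hits = 0
    for query, ranking in rankings.items():
        lang = pick_language(ranking, language_set)
        if lang is not None and result_counts.get(query, {}).get(lang, 0) > 0:
            hits += 1
    return hits


def greedy_language_set(candidates: Iterable[str],
                        rankings: Dict[str, List[str]],
                        result_counts: Dict[str, Dict[str, int]],
                        min_gain: int = 1) -> Tuple[Set[str], int]:
    """Add one language at a time while the score improves by >= min_gain."""
    chosen: Set[str] = set()
    best = 0
    while True:
        best_lang, best_score = None, best
        # Sorted iteration keeps tie-breaking deterministic.
        for lang in sorted(set(candidates) - chosen):
            s = score(chosen | {lang}, rankings, result_counts)
            if s > best_score:
                best_lang, best_score = lang, s
        if best_lang is None or best_score - best < min_gain:
            break
        chosen.add(best_lang)
        best = best_score
    return chosen, best


# Toy data (hypothetical): q4 stands in for a junk query with no results anywhere.
rankings = {
    "q1": ["fr", "en"],
    "q2": ["en", "fr"],
    "q3": ["de", "en"],
    "q4": ["it", "fr"],
}
result_counts = {
    "q1": {"fr": 10, "en": 0},
    "q2": {"en": 5, "fr": 0},
    "q3": {"de": 0, "en": 3},
    "q4": {},
}
chosen, hits = greedy_language_set(["en", "fr", "de", "it"], rankings, result_counts)
```

In the toy data, adding a language can lower the score as well as raise it (a language that "wins" a query but returns no results displaces one that would have), which is why each candidate set is rescored from scratch rather than scored incrementally.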

In terms of the wikis we'd like to cover, we have an estimate of query volume by language on the dashboards. There are 12 languages with > 1% of the query volume measured there, and 5 more if we go down to 0.5%. Looking at the list of Wikipedias, there are 13 with > 1M articles, 4 (almost 5) more if we go down to 500K, and lots more if we go down a bit further to 250K or even 100K.

The current set of 4 most recently optimized wikis (T132466, as above) could serve as training data. If a process can come to a similar conclusion for those wikis as the manual process, we would have some confidence that it works.
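A minimal sketch of how "a similar conclusion" could be quantified, assuming the manual and automated outputs are both plain sets of language codes. The Jaccard overlap used here is one possible measure, not one specified in this task, and the example sets are placeholders rather than the actual T132466 results.

```python
from typing import Dict, Iterable, List, Union


def compare_language_sets(manual: Iterable[str],
                          automated: Iterable[str]) -> Dict[str, Union[float, List[str]]]:
    """Summarize agreement between a manual and an automated language set."""
    manual_set, automated_set = set(manual), set(automated)
    overlap = manual_set & automated_set
    union = manual_set | automated_set
    return {
        # Jaccard similarity: 1.0 means identical sets, 0.0 means disjoint.
        "jaccard": len(overlap) / len(union) if union else 1.0,
        # Languages the manual review kept but the automated process dropped.
        "missing": sorted(manual_set - automated_set),
        # Languages the automated process added that the manual review did not.
        "extra": sorted(automated_set - manual_set),
    }


# Placeholder sets for one wiki (hypothetical, not real review output).
report = compare_language_sets({"fr", "en", "ar"}, {"fr", "en", "de"})
```

Listing the missing and extra languages alongside the score makes it easier to see whether a disagreement is a borderline low-volume language or a substantive miss.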

Potential points of failure for the project include the large number of junk and other non-language queries (mostly proper nouns) in the query set, which could swamp the "real" queries since they aren't being manually filtered, and the differences in language processing on the various wikis (e.g., a "wrong" language wiki might return more results because it allows variability in matching that the "right" language wiki does not).

Event Timeline

Deskana lowered the priority of this task from Medium to Low. May 5 2016, 10:10 PM
Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

Good idea! However, given that non-top-10 languages only account for around 5% of our total traffic, and we've not even got this working for our biggest languages yet, I'm not prioritising this right now.

debt subscribed.

Closing this out in favor of T140300.