
A/B Test TextCat settings on non-WP projects
Open, MediumPublic

Description

We want to bring the benefits of language identification (via TextCat) to non-Wikipedia wikis.

Rather than doing a time-consuming manual analysis for each wiki project, we could run an A/B test on some or all projects in a given language, using the default TextCat configuration from the Wikipedia project in that language (for which the analysis has already been done).

Such A/B tests would tell us whether TextCat configurations can be straightforwardly shared across projects in the same language. If they can, that would let us apply language detection to more of the long tail of wiki projects.
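For context, TextCat is based on Cavnar & Trenkle's character n-gram approach to language identification: each language is represented by a ranked profile of its most frequent character n-grams, and a text is assigned to the language whose profile it is closest to under an "out-of-place" rank distance. The sketch below illustrates that general technique only; it is not the actual TextCat code used in search, and all function names are illustrative.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Build a ranked character n-gram profile (Cavnar & Trenkle style)."""
    counts = Counter()
    padded = " " + text.lower() + " "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Keep the top_k n-grams, ordered from most to least frequent.
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    profile incur the maximum penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(rank - ranks.get(gram, max_penalty))
               for rank, gram in enumerate(doc_profile))

def identify(text, profiles):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In this framing, the per-wiki question the A/B test probes is whether the profiles and thresholds tuned for a language's Wikipedia also rank queries correctly on that language's other projects, whose query traffic may look quite different.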

Event Timeline

TJones created this task. · Jul 13 2016, 7:50 PM
Restricted Application added subscribers: Zppix, Aklapper. · Jul 13 2016, 7:50 PM
debt triaged this task as Medium priority. (Edited) · Jul 14 2016, 10:17 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added a subscriber: debt.

Let's pick one or two other project wikis in the same language and determine whether we already have enough data to run a test, or whether we need to gather more. Wiktionary and Wikispecies (in English) might be good candidates to start with.

debt added a subscriber: mpopov. · Aug 5 2016, 7:39 PM

Notes based on a conversation with @TJones:

This seems like an easy A/B test to set up and run: pick a few languages, pick a few non-Wikipedia projects, configure the test, run it for a couple of weeks, and then analyze the results.