Automate Process for Selecting Languages for a Given Wiki for Use With TextCat
Closed, Declined (Public)

Assigned To

None

Authored By

	TJones
	May 4 2016, 8:04 PM

Description

Manually determining the right set of languages to use with TextCat for a given wiki takes 1.5 to 2 days of review per wiki. (See T132466 for French, German, Italian, and Spanish as an example.)

This does not scale to a large number of wikis (more than 10, maybe; more than 25, definitely). If we want to update the list of relevant languages regularly (quarterly to annually), it would also help to reduce the human time and effort involved, which could in turn allow more frequent updates (monthly to quarterly).

It may be possible to automate this process.

Rather than manually filter the queries and manually identify their language as has previously been done, we could optimize the set of languages used with TextCat against the number of results generated by searching against the wikis identified.

Details of how to automatically rank the "quality" of the results by result count, and of the process for optimizing the language set, require some more thought, but there are some obvious first approaches to try (details available upon request; this is already getting long).
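One possible first approach, sketched here purely as an illustration and not as the method the task had in mind: greedy forward selection, where languages are added one at a time as long as doing so increases the number of queries that get at least one result. Everything in the sketch is an assumption: the `rankings` mapping (each query's language guesses, best first, standing in for TextCat output), the `result_counts` mapping (results returned by the wiki of that language), the toy data, and the function names.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical inputs: per-query ranked language guesses (a stand-in for
# TextCat output) and per-(query, language) result counts from that wiki.
Rankings = Dict[str, List[str]]
ResultCounts = Dict[Tuple[str, str], int]


def score(lang_set: Set[str], rankings: Rankings, counts: ResultCounts) -> int:
    """Count queries whose best-ranked allowed language returns any results."""
    hits = 0
    for query, ranked in rankings.items():
        # The identifier may only answer with a language in the allowed set.
        chosen = next((lang for lang in ranked if lang in lang_set), None)
        if chosen is not None and counts.get((query, chosen), 0) > 0:
            hits += 1
    return hits


def greedy_select(candidates: Iterable[str], rankings: Rankings,
                  counts: ResultCounts, min_gain: int = 1) -> List[str]:
    """Add the language with the largest score gain until gains dry up."""
    selected: List[str] = []
    remaining = list(candidates)
    best = 0
    while remaining:
        scored = [(score(set(selected) | {lang}, rankings, counts), lang)
                  for lang in remaining]
        top_score, top_lang = max(scored, key=lambda pair: pair[0])
        if top_score - best < min_gain:
            break  # no remaining language helps enough
        selected.append(top_lang)
        remaining.remove(top_lang)
        best = top_score
    return selected


# Toy data, invented for illustration only.
toy_rankings: Rankings = {
    "q1": ["fr", "en"],
    "q2": ["de", "fr"],
    "q3": ["es", "it"],
    "q4": ["xx"],
}
toy_counts: ResultCounts = {
    ("q1", "fr"): 10,
    ("q2", "de"): 3,
    ("q2", "fr"): 0,
    ("q3", "es"): 0,
    ("q3", "it"): 5,
}
chosen_langs = greedy_select(["fr", "de", "es", "it"], toy_rankings, toy_counts)
```

In the toy data, adding "es" would lower the score, because it outranks "it" for q3 but returns no results; the greedy loop therefore leaves it out. A real version would need a better quality signal than "more than zero results", for the reasons given under the potential points of failure.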

In terms of the wikis we'd like to cover, we have an estimate of query volume by language on the dashboards. There are 12 languages with > 1% of the query volume measured there, and 5 more if we go down to 0.5%. Looking at the list of Wikipedias, there are 13 with > 1M articles, 4 (almost 5) more if we go down to 500K, and lots more if we go down a bit further to 250K or even 100K.

The current set of 4 most recently optimized wikis (T132466, as above) could serve as training data. If a process can come to a similar conclusion for those wikis as the manual process, we would have some confidence that it works.
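As a rough sketch of what "a similar conclusion" could mean in practice, the automatically chosen language set for a wiki can be compared with the manually chosen one using set overlap. The function name, the choice of Jaccard similarity, and the example language codes below are all assumptions made for illustration; the actual language sets from T132466 are not reproduced here.

```python
from typing import Dict, Iterable, List, Union


def compare_language_sets(
    manual: Iterable[str], automatic: Iterable[str]
) -> Dict[str, Union[float, List[str]]]:
    """Summarize agreement between a manual and an automatic language set."""
    manual_set, auto_set = set(manual), set(automatic)
    overlap = manual_set & auto_set
    union = manual_set | auto_set
    return {
        # Jaccard similarity: 1.0 means identical sets, 0.0 means disjoint.
        "jaccard": len(overlap) / len(union) if union else 1.0,
        # Languages the manual review chose but the automatic process missed.
        "missing": sorted(manual_set - auto_set),
        # Languages the automatic process added beyond the manual review.
        "extra": sorted(auto_set - manual_set),
    }


# Hypothetical language codes, not the real results for any wiki.
report = compare_language_sets(["fr", "en", "ar"], ["fr", "en", "de"])
```

The "missing" and "extra" lists are probably more useful than the single similarity number, since they show which languages the two processes disagree on.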

Potential points of failure for the project include the large number of junk and other non-language queries (mostly proper nouns) in the query set, which could swamp the "real" queries since they aren't being manually filtered, and the differences in language processing on the various wikis (e.g., a "wrong" language wiki might return more results because it allows variability in matching that the "right" language wiki does not).

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
		Declined		None	T134430 Automate Process for Selecting Languages for a Given Wiki for Use With TextCat

Event Timeline

TJones created this task. May 4 2016, 8:04 PM

Restricted Application added a subscriber: Zppix. May 4 2016, 8:04 PM

TJones added a project: CirrusSearch. May 4 2016, 8:05 PM

Good idea! However, given that non-top-10 languages only account for around 5% of our total traffic, and we've not even got this working for our biggest languages yet, I'm not prioritising this right now.

TJones mentioned this in T140294: Provide language identification to the long-tail of wikis. Jul 13 2016, 8:02 PM

TJones mentioned this in T140300: Provide language identification to the long-tail of wikis. Jul 13 2016, 8:26 PM

Closing this out in favor of T140300.
