MinT for Wikipedia Readers MVP (T359072) enables users to read content from other languages using machine translation. In some contexts, users may request a translation into a specific language without indicating a source language. Since content can be available in multiple languages with different levels of coverage, it is useful to have a way to identify which language has the most content coverage for a given topic.
This is useful in contexts where the source language is not specified, such as entry points at the target wiki (T363338), reaching the Confirm step (T359512) after searching for a topic across all languages, or when exploring the available source languages (T359863).
Currently, for the MVP, a simple approach is defined in the context of cross-language search: use the language in which the search query matched on Wikidata; if that does not apply, use a default language; and if neither applies, use the first language in the list.
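The fallback order described above can be sketched as follows. This is only an illustration, not the MVP code: the function name `pick_source_language` and its parameters are hypothetical, and it assumes that the list of languages where the article exists has already been obtained.

```python
from typing import Optional, Sequence


def pick_source_language(
    available: Sequence[str],
    matched_language: Optional[str],
    default_language: str,
) -> str:
    """Pick a source language using the simple MVP fallback order.

    `available` is the (non-empty) list of language codes in which the
    article exists. All names here are hypothetical.
    """
    if not available:
        raise ValueError("the article must exist in at least one language")
    # 1. Prefer the language in which the search query matched.
    if matched_language is not None and matched_language in available:
        return matched_language
    # 2. Otherwise fall back to the default language, if the article has it.
    if default_language in available:
        return default_language
    # 3. Otherwise take the first language from the list.
    return available[0]
```

For example, `pick_source_language(["ca", "es"], None, "en")` falls through to the last rule and returns `"ca"`.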
This ticket proposes picking a language based on its content coverage. Initially it was proposed to make the selection based on the number of sections, but this may be too slow, since it may require querying data about the article in potentially hundreds of languages. As part of this ticket we'll explore some viable alternatives.
Some initial ideas to explore:
- Check page byte size instead of number of sections.
- Check sections, but only for a limited set of languages (e.g., the four largest Wikipedias among those where the article is available).
- Use some metadata available on Wikidata (aspects such as “featured”, and maybe others, are captured there).
- Select based on a pre-defined order using some property of the wiki. From the list of Wikipedias we could define some formula (e.g., Number of articles × Depth) to try to capture the notion of overall content coverage.
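To make two of these ideas more concrete, here is a minimal sketch of ranking by page byte size and by a static per-wiki score (Number of articles × Depth). The function names, the input shapes, and all numbers are hypothetical placeholders, not real statistics; the sketch assumes the per-language byte sizes and per-wiki statistics have already been collected by some other means.

```python
from typing import Dict, List, Sequence, Tuple


def rank_by_page_size(page_sizes: Dict[str, int]) -> List[str]:
    """Order language codes by article byte size, largest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(page_sizes, key=lambda lang: (-page_sizes[lang], lang))


def rank_by_wiki_score(
    available: Sequence[str],
    wiki_stats: Dict[str, Tuple[int, float]],
) -> List[str]:
    """Order language codes by a static score: article count x depth.

    `wiki_stats` maps a language code to (article_count, depth).
    Languages without statistics get a score of 0.
    """
    def score(lang: str) -> float:
        article_count, depth = wiki_stats.get(lang, (0, 0.0))
        return article_count * depth

    return sorted(available, key=lambda lang: (-score(lang), lang))


# Made-up numbers, for illustration only.
page_sizes = {"en": 52000, "es": 87000, "ca": 31000}
wiki_stats = {"en": (1000, 10.0), "es": (300, 5.0), "ca": (100, 20.0)}

best_by_size = rank_by_page_size(page_sizes)[0]
best_by_score = rank_by_wiki_score(list(page_sizes), wiki_stats)[0]
```

The two rankings can disagree: byte size reflects the specific article, while the static score only reflects the wiki as a whole. The static score needs no per-article queries, which is why it is cheaper.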