Insights from the Wikidata Languages Landscape project
Closed, Resolved (Public)

Description

As Wikidata PMs we would like to make analytics insights actionable for our product strategy.

This is why we need to check if there are any actionable insights that we can draw from the Wikidata Languages Landscape project.

Please focus only on:

  • UNESCO/Ethnologue language status vs Wikidata statistics (number of labels, reuse, etc).

Currently not in focus:

  • Lapses in the Wikidata ontology for languages.

Acceptance criteria:

  • If there are potentially actionable insights, list them for a sensemaking session (in the session we can discuss and decide together what to put in a strategic report -> follow-up task).

Event Timeline

@Manuel

Here is a concise report that relies on UNESCO Language Status:

The analyses presented here can be completely replicated using the Ethnologue language status categories as well. Please let me know if you find that necessary or interesting. I opted for UNESCO language status simply because I thought it would be better to use one criterion, even if we choose it ad hoc, than to deal with the more complicated situation of using two criteria (UNESCO and Ethnologue).

From my perspective, the most important insights are:

  • Languages that are not endangered are far better represented than endangered or vulnerable languages in terms of how many sitelinks they have; this is probably more relevant for the Wikipedia community than for us, but I thought we should help by informing them since we already have the numbers at hand;
  • Languages that are not endangered have many more labels in Wikidata than languages that are endangered or vulnerable;
  • Beyond that, languages that are not endangered generally label items that are reused more across the Wikimedia projects than the items for which we have labels in endangered or vulnerable languages.

I have used visualizations, labeling languages by their respective code, in order to single out the extremes on the following indicators:

  • number of sitelinks
  • number of items for which a particular language has labels
  • the reuse of items labeled by a particular language.

In conjunction with the tables (all of them are provided in the report), that might help us figure out whether there are specific linguistic communities that we could approach to see if they need any help.

The analysis is exploratory: I did not want to invest any time in statistical hypothesis testing (e.g. comparisons across groups or languages, plus decisions on whether the differences are statistically significant) before we at least have a glimpse of the big picture.
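If a group comparison is wanted later, one possible shape for it is a permutation test on the difference in medians between two status groups. The following is only a sketch under stated assumptions: the function name `permutation_test`, the choice of the median as the statistic, and the label counts are all made up for illustration and are not taken from the report.

```python
import random
from statistics import median


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in medians."""
    rng = random.Random(seed)
    observed = abs(median(group_a) - median(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(median(pooled[:n_a]) - median(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up label counts per language, for illustration only (not report data).
not_endangered = [120_000, 95_000, 240_000, 80_000, 150_000, 60_000]
endangered = [1_200, 300, 4_500, 90, 2_000, 750]

p_value = permutation_test(not_endangered, endangered)
print(f"p = {p_value:.4f}")
```

A permutation test is used here because label counts per language are likely heavily skewed, so it avoids assuming normality; any other nonparametric test would be an equally reasonable choice.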

Please let me know if anything needs further clarification; I am open for a 1:1 on this until Wednesday, 14 July, late CET hours.

It could be interesting to check labels that are unique to a language.

In the report "ast" appears well represented, but I suppose only a small number of its labels differ from "en".

In a different field, unique content by Wikipedia language: WikiProject_Movies/Numbers/unique_films

@Manuel We did not touch upon this one in our 1:1. Do we need anything else here? Please let me know. Thanks!

Raw data for this is regularly updated on-wiki by the community.

@Esc3300

It could be interesting to check labels that are unique to a language.

Please check out our Wikidata Languages Landscape system and let me know if it provides the insights that you are interested in.

The datasets that the Wikidata Languages Landscape uses are found here and here.

If not, please formulate precisely which datasets you would like to have represented and/or visualized, and our PMs will make sure that the request is taken into consideration.
As a Data Scientist for Wikidata, I will be very glad to help. Thank you for following our discussions and making suggestions on how to improve our data analytics!

Here is a sample of the uniqueness count I had in mind: https://w.wiki/3ub4. The sample covers only three items.
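For discussion, one way such a uniqueness count could be computed offline is sketched below. It assumes a definition that the thread does not confirm: a label counts as unique to a language when no other language has the same string on the same item. The function name `unique_label_counts`, the item keys, and the labels are hypothetical and do not reproduce the linked query.

```python
from collections import Counter, defaultdict


def unique_label_counts(labels_by_item):
    """Count, per language, labels that no other language shares on the same item."""
    counts = defaultdict(int)
    for labels in labels_by_item.values():
        # How many languages use each label string on this item.
        usage = Counter(labels.values())
        for lang, text in labels.items():
            if usage[text] == 1:
                counts[lang] += 1
    return dict(counts)


# Hypothetical items and labels, for illustration only.
sample = {
    "item_1": {"en": "Berlin", "ast": "Berlín", "de": "Berlin"},
    "item_2": {"en": "Douglas Adams", "ast": "Douglas Adams", "de": "Douglas Adams"},
    "item_3": {"en": "Earth", "ast": "Tierra", "de": "Erde"},
}

result = unique_label_counts(sample)
print(result)
```

Under this definition a language that mostly copies labels from "en" would score low even if it has many labels, which is the distinction raised above about "ast". Languages with no unique labels are simply absent from the result.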

Manuel claimed this task.