
Global South Contributions to Wikidata
Closed, Resolved (Public)

Description

As a Wikidata Analytics PM, and in relation to the Reimagining Wikidata from the margins process, @Manuel would like to know

  • if it is possible to get meaningful statistics about contributions to Wikidata from countries from the Global South:
    • the number of edits in the last 12 months per country,
    • the number of active editors per country.

In the 1:1 discussions of this task between @GoranSMilovanovic and @Manuel, the following question was also addressed:

  • How precise is the data in the context of the global south countries?

To address the latter question, we will read through the relevant documentation thoroughly and, if necessary, ask the Analytics team to clarify the details.

Event Timeline

@Ladsgroup @Manuel

The API will not be of any help here to provide an answer on the number of active editors per country; from the docs:

Non-Wikipedia projects are not available for this endpoint.

Also, the API will not be of any help to look for the number of edits in the last 12 months per country.

Inspecting alternative data sources now.

@Ladsgroup @Manuel

the following public dataset

Analytics/Data Lake/Edits/Geoeditors/Public

is also not going to be of any help; from the docs:

... at this time the dataset is available just for Wikipedias

@Ladsgroup @Manuel

Ok, the following dataset:

Analytics/Data Lake/Edits/Geoeditors

has everything that we need for this ticket.

Note 1. We need a working definition here: which countries do we count as "Global South"?

Current status:

  • active Wikidata editors dataset per country is produced, fields:
    • country_code
    • YYYY-MM = year/month, the time span is from 2020/09 to 2021/08
    • active_editors = number of distinct, active editors per month
    • active_editors_ns0 = number of distinct, active editors in Namespace 0 per month.
  • Working now on: the number of edits in the last 12 months per country.
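A minimal sketch of how a table with the fields above could be derived. It assumes hypothetical input rows with one record per country, month, and editor (`editor_id`, `edits`, `edits_ns0` are illustrative names), and it assumes "active" means at least 5 edits in the month. Neither the row layout nor the threshold is taken from the Geoeditors documentation, and all values are invented.

```python
from collections import defaultdict

# Assumption: an editor is "active" with at least this many edits per month.
ACTIVE_THRESHOLD = 5

# Hypothetical, invented rows: one per (country, month, editor).
rows = [
    {"country_code": "KE", "month": "2020-09", "editor_id": "a", "edits": 12, "edits_ns0": 10},
    {"country_code": "KE", "month": "2020-09", "editor_id": "b", "edits": 7, "edits_ns0": 2},
    {"country_code": "KE", "month": "2020-09", "editor_id": "c", "edits": 1, "edits_ns0": 1},
    {"country_code": "BR", "month": "2020-09", "editor_id": "d", "edits": 5, "edits_ns0": 5},
]


def active_editors_per_country_month(rows, threshold=ACTIVE_THRESHOLD):
    """Count distinct active editors per (country_code, month)."""
    all_ns = defaultdict(set)  # active in any namespace
    ns0 = defaultdict(set)     # active in Namespace 0
    for r in rows:
        key = (r["country_code"], r["month"])
        if r["edits"] >= threshold:
            all_ns[key].add(r["editor_id"])
        if r["edits_ns0"] >= threshold:
            ns0[key].add(r["editor_id"])
    keys = sorted(set(all_ns) | set(ns0))
    return [
        {
            "country_code": country,
            "month": month,
            "active_editors": len(all_ns[(country, month)]),
            "active_editors_ns0": len(ns0[(country, month)]),
        }
        for (country, month) in keys
    ]


result = active_editors_per_country_month(rows)
```

Using sets keyed by country and month keeps the counts distinct per editor even if the input contains repeated rows.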

Note 2. I will not be sharing the datasets (in spite of the fact that I know that both @Ladsgroup and @Manuel have an NDA signed with the WMF) because the data are highly sensitive while @Manuel needs to share them with third parties (see ticket description).

Analytics: we will need your review of the datasets mentioned in this ticket; I will open a separate ticket for that.

@Ladsgroup @Manuel

Ok, now we also have the data for

the number of edits in the last 12 months per country

The dataset comprises the following fields:

  • country_code
  • month = year/month, the time span is from 2020/09 to 2021/08
  • edits = how many edits from the respective country in the respective month
  • edits_ns0 = how many edits in Namespace 0 from the respective country in the respective month.
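As a rough illustration of how the monthly rows described above could be rolled up into 12-month totals per country, here is a sketch. The field names follow the list above; the helper names and the sample values are invented for the example and are not taken from the actual dataset.

```python
from collections import defaultdict


def months_in_span(start, end):
    """Return all 'YYYY-MM' labels from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    labels = []
    while (year, month) <= (end_year, end_month):
        labels.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


def total_edits_per_country(rows, start="2020-09", end="2021-08"):
    """Sum edits and edits_ns0 per country over the given month span."""
    window = set(months_in_span(start, end))
    totals = defaultdict(lambda: {"edits": 0, "edits_ns0": 0})
    for r in rows:
        if r["month"] in window:
            totals[r["country_code"]]["edits"] += r["edits"]
            totals[r["country_code"]]["edits_ns0"] += r["edits_ns0"]
    return dict(totals)


# Invented sample rows; the 2021-09 row falls outside the span and is ignored.
sample = [
    {"country_code": "KE", "month": "2020-09", "edits": 100, "edits_ns0": 80},
    {"country_code": "KE", "month": "2021-08", "edits": 50, "edits_ns0": 40},
    {"country_code": "KE", "month": "2021-09", "edits": 999, "edits_ns0": 999},
    {"country_code": "BR", "month": "2020-12", "edits": 10, "edits_ns0": 10},
]
totals = total_edits_per_country(sample)
```

Building the month window explicitly makes it easy to confirm that the span covers exactly 12 months.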

@Manuel In order to move on, I need to filter out the countries that are not, in our understanding, part of the "Global South". Please provide a list of countries that we will consider as "Global South".

Next steps.

  • filter out countries that are not part of the "Global South" from the datasets
  • open a Phab ticket for a privacy policy review for Analytics

Change 721516 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/721516

Change 721516 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/721516

@Manuel

Since you have an NDA signed with the Wikimedia Foundation, I can share the existing datasets with you.

The only thing that would be helpful at this point is a list of countries that we consider to be in the "Global South", because the current datasets encompass all available country codes.

For purposes of hypothesis testing, we could keep all countries in the data and contrast the amount of contribution and/or the number of active editors between the "Global North" and "Global South".
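A sketch of what such a contrast could look like once a classification exists. The country codes `AA`, `BB`, `CC` are placeholders and the numbers are invented; which countries belong to the "Global South" is exactly the open question in this ticket, so the set is passed in as a parameter rather than assumed.

```python
def contrast_by_region(totals, global_south):
    """Sum per-country values into two groups and compute each group's share.

    totals: mapping of country_code -> number (edits or active editors).
    global_south: set of country codes treated as "Global South" (an input,
    not something this sketch decides).
    """
    groups = {"Global South": 0, "Global North": 0}
    for country, value in totals.items():
        key = "Global South" if country in global_south else "Global North"
        groups[key] += value
    grand_total = sum(groups.values())
    shares = {
        name: (value / grand_total if grand_total else 0.0)
        for name, value in groups.items()
    }
    return groups, shares


# Placeholder codes and invented values, for illustration only.
groups, shares = contrast_by_region({"AA": 30, "BB": 10, "CC": 60}, {"AA", "BB"})
```

Keeping all countries in the data and grouping at analysis time means the classification can be changed later without regenerating the datasets.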

Let me know your thoughts in this respect. Thank you.

Thx @GoranSMilovanovic! Yes, no need to filter out countries if we can have them all! :)

To be on the safe side: When publishing information from this aggregated dataset outside of the NDA group, please make sure that we remove all information from countries on the Country Protection List (as recommended by Htriedman).
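The filtering step described above could be sketched as follows. The codes `XX` and `YY` are placeholders; the actual contents of the Country Protection List are not reproduced here and must come from the authoritative source. The function name and row layout are assumptions for illustration.

```python
def drop_protected(rows, protected_codes):
    """Return only the rows whose country_code is not on the protected list."""
    protected = set(protected_codes)
    return [r for r in rows if r["country_code"] not in protected]


# Placeholder codes and invented values, for illustration only.
rows_to_publish = drop_protected(
    [
        {"country_code": "XX", "month": "2020-09", "edits": 5},
        {"country_code": "AA", "month": "2020-09", "edits": 7},
        {"country_code": "YY", "month": "2020-10", "edits": 3},
    ],
    protected_codes={"XX", "YY"},
)
```

Applying the filter as the last step before export keeps the full data available for analysis inside the NDA group.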

Change 725039 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/725039

Change 725039 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/725039

Change 730880 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/730880

Change 730880 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAdHocAnalytics@master] T291170

https://gerrit.wikimedia.org/r/730880

@Manuel

The Wikidata Global South Report is ready.

The Report is quite extensive and detailed. However, all analyses follow the same methodological framework, so once you figure out what was done with the Active Editors dataset it will be easy for you to figure out the rest.

A narrative follows the analyses precisely. I would advise focusing on it.

There is a Summary section at the beginning of the Report. The most essential findings are listed there. If I had to communicate this publicly, my choice would be to focus on the Summary section alone. As for the charts, I would focus on the Global North vs Global South contrasts in terms of Change (faceted bar plots; they are easy to find).

If you need any consultations before presenting this at WikidataCon 2021, or to interested third parties before the event begins, please let me know. We can find some time so that I can answer all the questions that you might have.

As I explained in our 1:1 this week, I have completely avoided any statistical hypothesis testing. In my opinion, the data are telling enough. However, if we wish to add the tests, it can be done: the datasets are quite suitable for non-parametric statistical analysis.
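For illustration only, one common non-parametric option is the Mann-Whitney U statistic, which compares two groups without assuming a distribution. The choice of this particular test is an assumption of the sketch, not something decided in this ticket, and the values are invented. The sketch computes the statistic only, not a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) from the pairwise definition; ties count as 0.5.

    U_x counts the pairs (a, b), a from x and b from y, in which a > b.
    """
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Invented values, e.g. per-country monthly counts for two groups.
u_north, u_south = mann_whitney_u([4, 5, 6], [1, 2, 3])
```

The pairwise form is quadratic in the sample sizes, which is acceptable for a few hundred country-level values.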

@Manuel Is there anything else that we need here?