REQUEST Africa data check
Closed, ResolvedPublic

Description

Name for main point of contact and contact preference:
Felix Nartey

What are your goals? Editor participation in the campaigns-partnership collaboration project, Africa Knowledge Initiative - Agenda 2063 WikiProject

How will you use this data or analysis?
Data will be used in the Africa Knowledge Initiative - Agenda 2063 WikiProject deck

What decisions do you need data to inform?

Do you want to share data publicly? yes

Do you want to include data in a narrative or message (e.g. for PR, audience engagement, or fundraising)? yes

What are the details of your request?
Please fact check the following:

  • 19 recognized communities
  • 60 African language projects
  • 16k unique editors/month
  • Only 0.7% of contributions on Wikipedia come from Africa
  • Less than 3% of knowledge on English Wikipedia is about Africa

Include relevant timelines or deadlines TBD
Is there a date after which the analysis will no longer be useful? Please provide any timeline/relevant deadlines, requested formats, examples, links to documentation, or other information that would help us understand your request.

Is this request urgent or time sensitive? TBD
We try to reply to “Urgent” requests immediately and “Time sensitive” requests by the end of the workday. All other requests will be prioritized during our weekly triage.

Details

Other Assignee
Iflorez

Event Timeline

19 recognized communities ---> maybe @DNdubane_WMF can point Felix to a data source for this item?
60 African language projects ---> @rudolph-san, what is the data source for this?
16k unique editors/month ---> what is the source for this? Maybe this is related to active editors?
Only 0.7% of contributions on Wikipedia come from Africa ---> see T290358
Less than 3% of knowledge on English Wikipedia is about Africa ---> see T290358

@Aklapper thank you
I wasn't sure if this task was a better fit for the Global Data & Insights team. We can triage this at the next Product Analytics phab review meeting.

Related to T287715, where I looked at distinct editors in Sub-Saharan Africa (SSA).

26,000 SUM(distinct_editors) in Africa in October 2021

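from wmfdata import hive  # assumes the wmfdata package, which provides hive.run

# Sum of the per-row distinct editor counts for African countries, October 2021
editors_africa = hive.run("""
SELECT SUM(distinct_editors)
FROM wmf.geoeditors_monthly AS g
JOIN canonical_data.countries AS c
ON c.iso_code = g.country_code
WHERE c.maxmind_continent = 'Africa'
AND month = '2021-10'
-- AND c.name IN (SSA_countries)
""")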
ldelench_wmf triaged this task as Medium priority.
ldelench_wmf updated Other Assignee, added: Iflorez.
ldelench_wmf moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Adding on to this discussion: @Iflorez, is it also possible to get data on readership of the African language projects (i.e. unique visitors or page views per month)? Thank you!

@Lescamilla16, to obtain unique devices or pageviews for one or more projects, we need the names of the projects in question. Do you have a list of projects that you can share?
Also note that enwiki and frwiki are both edited and read in Africa, so whether these two are included in the list will significantly change the data reported.

Following up here after a Slack conversation:

I've reviewed the items and recommend the following, given the info I have so far. If you need help deciding which metrics to use, let me know.

19 recognized communities ---> do you mean affiliates? If so, see the Affiliates Data report and Data Portal
Only 0.7% of contributions on Wikipedia come from Africa ---> use data referenced on T290358
Less than 3% of knowledge on English Wikipedia is about Africa ---> use data referenced on T290358
60 African language projects ---> @rudolph-san, what is the data source for this?
29k unique editors/month ---> see the query below

SELECT SUM(distinct_editors)
FROM wmf.geoeditors_monthly AS g
JOIN canonical_data.countries AS c
ON c.iso_code  = g.country_code 
WHERE c.maxmind_continent ='Africa'
AND month >='2020-10'
AND month <'2021-11'

@Iflorez, sorry for the late response to the question on 60 African language projects. I think Felix (point of contact for this ticket) will be the best person to ask.

5.5% of all geotagged articles relate to Africa --- from T290358
29,391 avg monthly editors, in Africa, across all projects, per the Superset PrestoSQL query below.
311,479 avg monthly editors, across all projects --- from the Wikimedia descriptive statistics sheets
9.43% of avg monthly editors across all projects are in Africa --- calculated using the two data points above
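
The percentage above is a plain ratio of the two quoted figures. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the snippet only reproduces the division. It says nothing about whether the two inputs count editors in a comparable way.

```python
# Figures quoted in the comment above.
africa_avg_monthly_editors = 29_391   # avg monthly editors in Africa, all projects
global_avg_monthly_editors = 311_479  # avg monthly editors, all projects

# Share of editors located in Africa, as a percentage.
share_pct = africa_avg_monthly_editors / global_avg_monthly_editors * 100

print(f"{share_pct:.1f}% of avg monthly editors")
```

The result lands at roughly 9.4%, in line with the figure reported in the thread.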

-- Editors across all projects that edit from within Africa
SELECT SUM(distinct_editors)
FROM wmf.geoeditors_monthly AS g
JOIN canonical_data.countries AS c
ON c.iso_code = g.country_code
WHERE c.maxmind_continent = 'Africa'
AND month >= '2020-10'
AND month < '2021-11'

After some fact-checking, we crafted this paragraph via conversations on Slack:

5.5% of all geotagged articles relate to Africa and only 9% of contributors to Wikimedia projects are located on the African continent. The African Knowledge Initiative seeks to increase the number of African contributors to Wikimedia projects and to grow the corpus of African knowledge in multiple African languages.

The calculation for the 9% of all global editors from Africa:

  • 29k avg monthly editors, in Africa, across all projects, per the Superset PrestoSQL query below (whose results were divided by 12 to get a monthly avg).
  • 311,479 avg monthly editors, across all projects - source
  • 9.43% of avg monthly editors across all projects are in Africa

-- Editors across all projects that edit from within Africa
SELECT SUM(distinct_editors)
FROM wmf.geoeditors_monthly AS g
JOIN canonical_data.countries AS c
ON c.iso_code = g.country_code
WHERE c.maxmind_continent = 'Africa'
AND month >= '2020-10'
AND month < '2021-11'
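
The query returns a single sum over every month matched by the filter, so the monthly average depends on how many months that filter covers. Below is a minimal sketch for checking that count; the helper names are hypothetical and not part of any Wikimedia tooling. As written, `month >= '2020-10' AND month < '2021-11'` matches 13 calendar months (October 2020 through October 2021), so it is worth confirming the divisor against the window that was actually run.

```python
def months_in_range(start: str, end_exclusive: str) -> int:
    """Count calendar months in the half-open range [start, end_exclusive).

    Both arguments are 'YYYY-MM' strings, the same format used by the
    `month` filter in the query above.
    """
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end_exclusive.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month)


def monthly_average(total: float, start: str, end_exclusive: str) -> float:
    """Divide a summed total by the number of months the filter covers."""
    n_months = months_in_range(start, end_exclusive)
    if n_months <= 0:
        raise ValueError("end_exclusive must be later than start")
    return total / n_months


# Number of months matched by the filter in the query above.
n_months_in_query = months_in_range("2020-10", "2021-11")
print(n_months_in_query)
```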

Results by calendar year:

Monthly average contributors from the African continent (using the base query noted above):
2019 - 24,808
2020 - 26,893
2021 - 27,406
2022 - 26,291 (average so far)

Calculating the average monthly editors across all projects that edit from within Africa:

There is one self-service method that gives a rough approximation of this percentage. See Method 1.
GDI and Product Analytics, with Data Engineering, are testing methods for getting accurate numbers. See Methods 3-4.
Which method is recommended for calculating the average monthly editors across all projects that edit from within Africa? How close are we to finalizing beta tests?

Method 1: geoeditors in Africa divided by editor_month yearly average editors -- T261015#7915950 and T269625
Method 2: geoeditors in Africa divided by geoeditors global
Method 3: Editors Daily + MWUser + Canonical Data
Method 3b: Editors Daily + MWUser + Canonical Data (a different query method that gathers less information)
Method 4: Editors Daily + MWUserHistory + MediaWikiHistory (used in the regional learning sessions to calculate the % of SSA editors)
Method 4b: GDI is working on this now.
See Jamboard to compare the methods.
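
What the methods above share is a ratio of an Africa count to a global count; they differ in where each count comes from. Below is a minimal sketch of the Method 2 shape, where numerator and denominator come from the same set of per-country rows. The function name and the sample rows are hypothetical and made up for illustration; they are not real geoeditors data.

```python
from typing import Iterable, Set, Tuple


def africa_share(rows: Iterable[Tuple[str, int]], african_codes: Set[str]) -> float:
    """Return the Africa total divided by the global total.

    `rows` holds (country_code, distinct_editors) pairs. Both totals are
    taken from the same rows, so they are counted the same way.
    """
    africa_total = 0
    global_total = 0
    for country_code, distinct_editors in rows:
        global_total += distinct_editors
        if country_code in african_codes:
            africa_total += distinct_editors
    if global_total == 0:
        raise ValueError("no editors in input")
    return africa_total / global_total


# Made-up illustrative rows, NOT real data.
sample_rows = [("NG", 30), ("KE", 20), ("US", 600), ("DE", 350)]
sample_share = africa_share(sample_rows, {"NG", "KE"})
print(f"{sample_share:.1%}")
```

Method 1, by contrast, takes the denominator from a different source than the numerator, which is one reason the thread treats it as a rough estimate.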

Related:
Africa Data Points
Africa Data Review
Jamboard
Calculating the % of editors that edit from within Africa

Related: T310224 New Data Pipeline for Unique Editor metrics by Geo (Country & Region)

I discussed with @nshahquinn-wmf the benefits of the following language:
Roughly 3-4%* of Wikimedia contributors (across all projects) edit from within the continent.
* The earlier calculation incorrectly included anonymous contributors in the numerator but not the denominator.

Further, @JAnstee_WMF notes:
"As for the Aug 2021 extract from that time - Africa continent editors proportion is between 3.1% and 3.2% of global depending on the denominator's inclusion or exclusion of the non-geolocated hits, respectively.
It seems certain that "less than 5%" is accurate. I wonder if it would be more helpful to future data points to say the range from deduplication efforts: between 3-5%."
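
The numerator/denominator mismatch described above can be illustrated with a small sketch. All counts below are hypothetical, chosen only to show the direction and size of the effect; they are not figures from any Wikimedia dataset.

```python
# Hypothetical counts, for illustration only. NOT real data.
africa_logged_in = 3_500
africa_anonymous = 5_500
world_logged_in = 100_000
world_anonymous = 150_000

# Mismatched: anonymous editors are counted in the numerator only.
mismatched = (africa_logged_in + africa_anonymous) / world_logged_in

# Consistent: anonymous editors excluded on both sides.
logged_in_only = africa_logged_in / world_logged_in

# Consistent: anonymous editors included on both sides.
everyone = (africa_logged_in + africa_anonymous) / (world_logged_in + world_anonymous)

print(f"mismatched:     {mismatched:.1%}")
print(f"logged-in only: {logged_in_only:.1%}")
print(f"everyone:       {everyone:.1%}")
```

With these made-up inputs the two consistent ratios sit close together, while the mismatched ratio is far higher. That is the pattern the thread describes.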

Thank you for updating here! It would probably be good to also add what the new approach is. If I remember correctly, you said it was in the new range whether or not you include anonymous contributors. So you could add something like: "Both consistently including and consistently excluding anonymous contributors produce similar results of around 3-4% African contributors."

In practice, it looks like 3-4% of contributors edit from within Africa. However, the best methods to count unique logged-in and IP editors currently allow for pulling data for the latest month only, which makes it hard to get a good snapshot of data over multiple months. Because of this limitation, I'm inclined to say that 3-5% of monthly editors across all projects edit from within Africa; this range covers the variability of the methods used to calculate the total and also covers fluctuations over time. The earlier calculation [9%] incorrectly included anonymous contributors in the numerator but not the denominator when calculating the proportion.

@nshahquinn-wmf, can you take a look at the Jamboard and review the query on the far right, Editors Daily+MW user +cdc / Editors Daily+MW user +cdc, which currently lists 4.7%? I think that query extends the middle query to count not only unique editors but also unique IPs for the latest month. In large part, that far-right query (unique logged-in + unique IP) has me thinking that the number reported should include the higher bound of 5%.

That seems perfectly reasonable!

I've looked through the query and it seems correct! So, yes, giving 3-5% seems like a better choice. One suggestion: please consider following our team SQL style when you write queries. I'm used to using that style now, so it would've been a bit easier for me to process the queries.

Overall, I think you're good to go! Thank you for grappling with all this detail and being dedicated to accuracy 😊

Thank you @nshahquinn-wmf and @Iflorez for all the sleuthing on this! Following Irene's previous update, we revised the Diff post to read:

"At the moment, less than 5% of Wikimedia contributors (logged in, across all projects) edit from within the continent."

Would you be comfortable sticking with that? This post is aimed at a general audience and we don't need to be super precise, or get into the weeds of methodology (unless we get specific inquiries). I think putting a percentage range will actually be distracting and raise questions, when the point we're really trying to make is that African contribution is unacceptably low, as a setup for the steps we're taking to address that. "Less than 5%" feels like a more concise way of landing that point, and doesn't in my view contradict or mislead based on what you've established. But if you feel strongly to the contrary, I can be persuaded!

Thanks again. <3

@BVershbow_WMF from my point of view, that's a great way of phrasing it. Thanks for helping translate our data-science-ese!