Page MenuHomePhabricator

[REQUEST] Validate and update numbers used in Product Platform Strategy presentation & wiki page
Closed, ResolvedPublicAug 14 2021

Description

Name for main point of contact and contact preference
Carol Dunn; Slack

What teams or departments is this for?
Product Department

What are the details of your request? Include relevant timelines or deadlines
Validate/update numbers used in Product Platform Strategy presentation & wiki page; make sure they're consistent with other metrics we share
Product Platform Strategy Wikimania Aug 2021
Draft for Product Platform Strategy Wiki Page

How will you use this data or analysis?
To illustrate the reach and potential reach of our projects as part of the Product Platform Strategy presentation

Is this request urgent or time sensitive?
Yes; presentation scheduled for Wikimania on 2021-08-14

Details

Due Date
Aug 14 2021, 7:00 AM

Event Timeline

kzimmerman triaged this task as Unbreak Now! priority.Aug 18 2021, 8:41 PM
kzimmerman created this task.

I provided numbers for Carol on Friday as follows:

Screen Shot 2021-08-18 at 1.42.34 PM.png (206×1 px, 76 KB)

Screen Shot 2021-08-18 at 1.42.14 PM.png (912×1 px, 132 KB)

For easier replicability & explicability, I recommended using averages from the last fiscal year (ending in June 2021) and comparing it to 5 years before (ending in June 2016). One reason is that our pageviews tracking changed in mid-2015 (numbers before May/June of 2015 are very different and include lots of bots). The other is that we frequently get asked for end-of-fiscal-year numbers, and this will make the data consistent with those requests.

I used the methodology outlined in Wikimedia descriptive statistics (public) to pull numbers for FY15-16 and added tabs for Monthly Content Interactions, Monthly Edits, Monthly Active Editors, and total Wikimedia content pages.

On Saturday Carol notified me that she needed breakouts for existing & emerging markets, in addition to the overall numbers, for Unique Devices ("Use Wikipedia") and Registered Editors ("Contribute").

Unique Devices

I used the following Presto query to calculate unique devices by market, and verified that the output from this query for total unique devices across all markets matched the output from the query used in Key Product Metrics

SELECT
  year, month, project_family, economic_region, SUM(uniques_estimate) AS sum_uniques_estimate
  FROM wmf.unique_devices_per_project_family_monthly
  LEFT JOIN canonical_data.countries
    ON canonical_data.countries.iso_code = wmf.unique_devices_per_project_family_monthly.country_code
  WHERE project_family = 'wikipedia'
  GROUP BY year, month, project_family, economic_region
  ORDER BY year, month, project_family, economic_region
Registered Editors

I used the following Presto queries to calculate registered editors by market (output from these queries)

WARNING: These queries should be used only for rough estimates until they undergo further QA.

Caveats & Notes:

  • Results from these queries do not match results from a previously validated query. The total 12-month average for FY20-21 is 306,008; however, this number is higher than numbers we have obtained using a different query, documented in Phabricator here: https://phabricator.wikimedia.org/T261015#6428301, which resulted in 292,015 editors as the FY20-21 average.
  • One possible cause for the discrepancy is that these queries do not filter out bots by name. There may be other inconsistencies that should be addressed, including possibly in how bots are filtered out using the user_is_bot_by field.
  • The results from these queries indicate that, for FY20-21, editors from emerging markets (Global South) made up 26% percent of total monthly editors, while editors from existing markets (Global North) made up 74% of total monthly editors. These proportions are similar to the numbers we track for active editors: editors from emerging markets made up 25%, while editors from existing markets made up 80% of total monthly active editors (and it seems reasonable that, among active editors, there are more people who have edited from multiple geolocations).
  • Based on similarities with prior queries and related metrics, I calculated a ballpark estimate of monthly editors by market using the percentages obtained by these queries, but applied the percentages to the total number of monthly editors from the query Product Analytics had previously used & validated (https://phabricator.wikimedia.org/T261015#6428301).

Registered Editors by Market:

SELECT
  month, database_group, economic_region, COUNT(DISTINCT user_fingerprint_or_id) AS editors
  FROM wmf.editors_daily
  LEFT JOIN canonical_data.countries
    ON canonical_data.countries.iso_code = wmf.editors_daily.country_code
  JOIN canonical_data.wikis ON wmf.editors_daily.wiki_db = canonical_data.wikis.database_code
  WHERE user_fingerprint_or_id != '0'
    AND user_is_anonymous = FALSE
    AND ARRAY_MAX(user_is_bot_by) IS NULL
    AND action_type = 0
    AND database_group = 'wikipedia'
  GROUP BY month, database_group, economic_region
  ORDER BY month, database_group, economic_region

Total Registered Editors:

SELECT
  month, database_group, COUNT(DISTINCT user_fingerprint_or_id) AS editors
  FROM wmf.editors_daily
  LEFT JOIN canonical_data.countries
    ON canonical_data.countries.iso_code = wmf.editors_daily.country_code
  JOIN canonical_data.wikis ON wmf.editors_daily.wiki_db = canonical_data.wikis.database_code
  WHERE user_fingerprint_or_id != '0'
    AND user_is_anonymous = FALSE
    AND ARRAY_MAX(user_is_bot_by) IS NULL
    AND action_type = 0
    AND database_group = 'wikipedia'
  GROUP BY month, database_group
  ORDER BY month, database_group

Ping - this Unbreak Now ("needs to be fixed immediately, setting anything else aside"!) task has been open for two weeks and has a past Due Date.
Could someone please update | resolve | lower priority? Thanks a lot!

I reviewed my code with @nshahquinn-wmf, and one major cause for the inflated numbers I see in registered editors by market, as compared to the previously validated query for registered editors, is that the registered editors by market query counts editors by user_fingerprint_or_id, which is unique to each Wiki. This means some editors are double-counted. To eliminate double-counting, I would need to group editors by user_name (which is not as readily available).

Resolving this as the ballpark estimate I provided earlier was sufficient for the use case.