Page MenuHomePhabricator

Editor data for regional views
Closed, ResolvedPublic

Description

Provide monthly data broken down by region, since 2018:

  • Active editors (total) from wmf.geoeditors_monthly
  • Active editors (total) from wmf.unique_editors_by_country_monthly (so we can check the differences in scale by region once editors are disaggregated by wiki - though this dataset only goes back to June 2022)

Updated to note: we are not able to provide breakouts by region for new active editors and returning active editors, as neither wmf.geoeditors_monthly nor wmf.unique_editors_by_country_monthly have a flag for whether the editor is new or returning.

Data results

Reference:
Comparing Active Editors and Geoeditors Monthly tables for obtaining Active Editors by Geo data
Core user tables comparison

Related: T294458, T316896

geogroupings: T330780, T324681

Event Timeline

mpopov triaged this task as High priority.Mar 7 2023, 6:13 PM

In progress queries below; see also this KPM Slack conversation for detailed discussion and this NSQ Slack conversation for further discussion.

select 
    g.month, 
    g.namespace_zero_distinct_editors,
    r.wmf_region 
FROM wmf.geoeditors_monthly AS g
--JOIN canonical_data.countries AS c
    --ON c.iso_code  = g.country_code
JOIN ntsako.country_meta_data AS r
    ON g.country_code = r.country_code_iso_2
WHERE g.month == '2023-02' AND
    g.activity_level != '1 to 4' AND
    g.users_are_anonymous = false AND
GROUP BY g.month, r.wmf_region, g.namespace_zero_distinct_editors

Noting differences:

I've run the code above for Latin America & Caribbean and compared results to different methods:

Method 1: query the editors_daily table

editors_daily = spark.run("""
    SELECT COUNT(DISTINCT ed.user_fingerprint_or_name), ed.month, wmf_region
    FROM wmf.editors_daily ed
    JOIN ntsako.country_meta_data AS r
        ON ed.country_code = r.country_code_iso_2
    WHERE 
        ed.month == '2023-02' AND
        ed.user_is_anonymous = false AND 
        ed.user_fingerprint_or_name IS NOT NULL AND 
        --ed.user_fingerprint_or_name  > 0 AND
        ed.action_type IN (0, 1) AND --https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type
        size(ed.user_is_bot_by) = 0 AND
        r.wmf_region = 'Latin America & Caribbean' AND
        namespace_zero_edit_count >= 5
    group by ed.month, r.wmf_region
""")

Method 2: query the Geoeditors by country table

gxc = spark.run("""
SELECT
    gxc.month, 
    gxc.namespace_zero_distinct_editors,
    r.wmf_region 
FROM wmf.unique_editors_by_country_monthly AS gxc
JOIN ntsako.country_meta_data AS r
    ON gxc.country_code = r.country_code_iso_2
WHERE gxc.month == '2023-02' AND
    gxc.activity_level != '1 to 4' AND
    gxc.users_are_anonymous = false AND
    r.wmf_region = 'Latin America & Caribbean'
GROUP BY gxc.month, r.wmf_region, gxc.namespace_zero_distinct_editors

""")

Method 3: query for active_editors from the editors_monthly table with geo data

select
    month,
    count(DISTINCT(user_name)) as active_editors,
    wmf_region
from (
    select
        cast(em.month as date) as month,
        em.user_name,
        sum(em.content_edits) as content_edits,--coalesce(sum(ns_map.namespace_is_content), 0) as content_edits,
        max(em.bot_by_group) as bot_by_group,
        r.wmf_region
    FROM wmf_product.editor_month em
    JOIN wmf.editors_daily ed
        ON ed.user_fingerprint_or_name = em.user_name
    JOIN ntsako.country_meta_data AS r
        ON ed.country_code = r.country_code_iso_2
    WHERE
        em.month = '2023-02' AND
        em.user_id != 0 AND
        r.wmf_region = 'Latin America & Caribbean'
    group by em.month, em.user_name, r.wmf_region
) global_edits_LATAM
where
    content_edits >= 5 and
    not bot_by_group and
    user_name not regexp "bot\\b"
group by month, wmf_region

Method 4: editors_monthly w/geo + active editors filtering

"""
with editor_month_copy AS (
select
    trunc(event_timestamp, "MONTH") as month,
    mwh.wiki_db as wiki,
    event_user_id as user_id,
    max(event_user_text) as user_name, -- Some rows incorrectly have a null `event_user_text` (T218463)
    coalesce(sum(ns_map.namespace_is_content), 0) as content_edits,
    max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
    wmf_region
from wmf.mediawiki_history mwh
inner join canonical_data.wikis 
on
    wiki_db = database_code and
    database_group in (
        "commons", "incubator", "foundation", "mediawiki", "meta", "sources", 
        "species","wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
        "wikisource", "wikiversity", "wikivoyage", "wiktionary"
    )
left join wmf_raw.mediawiki_project_namespace_map ns_map -- Avoid `page_namespace_is_content` to work around T221338
on
    mwh.wiki_db = dbname and
    coalesce(page_namespace_historical, page_namespace) = namespace and
    ns_map.snapshot = "2023-02" and
    mwh.snapshot = "2023-02"
JOIN wmf.editors_daily ed
        ON ed.user_fingerprint_or_name = mwh.event_user_text
    JOIN ntsako.country_meta_data AS r
        ON ed.country_code = r.country_code_iso_2

where
    event_timestamp between "2023-02-01 00:00:0.0" and "2023-03-01 00:00:0.0" and
    event_entity = "revision" and
    event_type = "create" and
    mwh.snapshot = "2023-02" and
    r.wmf_region = 'Latin America & Caribbean'
group by
    trunc(event_timestamp, "MONTH"),
    mwh.wiki_db,
    event_user_id, event_user_text, wmf_region
),


active_editor_copy AS (
select
    cast(emc.month as date) as month,
    emc.user_name,
    sum(emc.content_edits) as content_edits,--coalesce(sum(ns_map.namespace_is_content), 0) as content_edits,
    max(emc.bot_by_group) as bot_by_group,
    emc.wmf_region
FROM editor_month_copy emc
WHERE emc.user_id != 0 
group by emc.month, emc.user_name, emc.wmf_region
)


select
    month,
    count(DISTINCT(user_name)) as active_editors,
    wmf_region
from active_editor_copy
where
    content_edits >= 5 and
    not bot_by_group and user_name not regexp "bot\\b"
group by month, wmf_region

"""
tableGeoeditorsGeoeditors by countryEditors_daily w/GeoActive Editors w/GeoEditor_month w/geo+
LATAM distinct editors with 5+ content edits520152113805*1309913099

*4724 if we take out action_type from the query

Noting differences when reviewing global numbers:

In Feb, I see 87.9k active_editors

When I check wmf.unique_editors_by_country_monthly, I see 76,299 distinct editors with 5+ content edits with the below query

SELECT
    gxc.month, 
    gxc.namespace_zero_distinct_editors
FROM wmf.unique_editors_by_country_monthly AS gxc
JOIN ntsako.country_meta_data AS r
    ON gxc.country_code = r.country_code_iso_2
WHERE gxc.month == '2023-02' AND
    gxc.activity_level != '1 to 4' AND
    gxc.users_are_anonymous = false
GROUP BY gxc.month, gxc.namespace_zero_distinct_editors

I think essentially were talking about two branches of data gathering approaches (related comparison grid):

Branch A: MediaWiki's cu_changes table

  • geoeditors
  • unique_editors_by_country_monthly
  • editors_daily

Branch B: MediaWiki's history table

  • editors_monthly
  • active_editors

Note also T316896 and this question:

For various geoeditors datasets (including wmf.unique_editors_by_country_monthly) how should Unknown country (country code "--") be thought about?

Summary notes:

  1. IP addresses that start with 10. - that's a "reserved range" in the IP standard so we use it for internal traffic like that between Cloud Services and the production cluster; Because these are reserved for internal use, a standard IP geolocation database (like the one we use) will not know what to do with them.
  1. Edits can have their information deleted. That's testable by finding edits that have had information removed and checking whether we have IPs for those somewhere. Our geolocated editor/edits data uses the cu_changes table, and maybe that’s not affected by this.
  2. When from internal requests/edits (e.g. bots on Toolforge), this is now documented in-depth in the data catalog and on wikitech.

See two related slack threads:
https://wikimedia.slack.com/archives/G01EE5WNK2M/p1658795343000599
https://wikimedia.slack.com/archives/CLKDS4MG9/p1678142867705699

The code I've run:

Five_p_editors_total_gm = spark.run("""
select 
    g.month, 
    sum(g.namespace_zero_distinct_editors)  AS editors_gm,
    r.wmf_region 
FROM wmf.geoeditors_monthly AS g
JOIN ntsako.country_meta_data AS r
    ON g.country_code = r.country_code_iso_2
WHERE g.month >= '2018-01' AND
    g.activity_level != '1 to 4' AND
    g.users_are_anonymous = false
GROUP BY g.month, r.wmf_region, g.namespace_zero_distinct_editors
""")

Five_p_editors_total_gxc = spark.run("""
select 
    gxc.month, 
    sum(gxc.namespace_zero_distinct_editors) AS editors_uebcm,
    r.wmf_region 
FROM wmf.unique_editors_by_country_monthly AS gxc
JOIN ntsako.country_meta_data AS r
    ON gxc.country_code = r.country_code_iso_2
WHERE gxc.month >= '2022-06' AND
    gxc.activity_level != '1 to 4' AND
    gxc.users_are_anonymous = false
GROUP BY gxc.month, r.wmf_region, gxc.namespace_zero_distinct_editors
""")

editors_ed = spark.run("""
    SELECT COUNT(DISTINCT ed.user_fingerprint_or_name) as editors_ed, ed.month, wmf_region
    FROM wmf.editors_daily ed
    JOIN ntsako.country_meta_data AS r
        ON ed.country_code = r.country_code_iso_2
    WHERE 
        ed.month >= '2022-11' AND
        ed.user_is_anonymous = false AND 
        ed.user_fingerprint_or_name IS NOT NULL AND 
        --ed.action_type IN (0, 1) AND --https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type
        --ed.edit_count >= 5 AND
        size(ed.user_is_bot_by) = 0 AND
        namespace_zero_edit_count >= 5
    group by ed.month, r.wmf_region
""")

editors_ae_me = spark.run("""
with editor_month_copy AS (
select
    trunc(event_timestamp, "MONTH") as month,
    mwh.wiki_db as wiki,
    event_user_id as user_id,
    max(event_user_text) as user_name, -- Some rows incorrectly have a null `event_user_text` (T218463)
    coalesce(sum(ns_map.namespace_is_content), 0) as content_edits,
    max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
    wmf_region
from wmf.mediawiki_history mwh
inner join canonical_data.wikis 
on
    wiki_db = database_code and
    database_group in (
        "commons", "incubator", "foundation", "mediawiki", "meta", "sources", 
        "species","wikibooks", "wikidata", "wikinews", "wikipedia", "wikiquote",
        "wikisource", "wikiversity", "wikivoyage", "wiktionary"
    )
left join wmf_raw.mediawiki_project_namespace_map ns_map -- Avoid `page_namespace_is_content` to work around T221338
on
    mwh.wiki_db = dbname and
    coalesce(page_namespace_historical, page_namespace) = namespace and
    ns_map.snapshot = "2023-02" and
    mwh.snapshot = "2023-02"
JOIN wmf.editors_daily ed
        ON ed.user_fingerprint_or_name = mwh.event_user_text
    JOIN ntsako.country_meta_data AS r
        ON ed.country_code = r.country_code_iso_2

where
    event_timestamp between "2023-01-01 00:00:0.0" and "2023-03-01 00:00:0.0" and
    event_entity = "revision" and
    event_type = "create" and
    mwh.snapshot = "2023-02" 
    --r.wmf_region = 'Latin America & Caribbean'
group by
    trunc(event_timestamp, "MONTH"),
    mwh.wiki_db,
    event_user_id, event_user_text, wmf_region
),


active_editor_copy AS (
select
    cast(emc.month as date) as month,
    emc.user_name,
    sum(emc.content_edits) as content_edits,--coalesce(sum(ns_map.namespace_is_content), 0) as content_edits,
    max(emc.bot_by_group) as bot_by_group,
    emc.wmf_region
FROM editor_month_copy emc
WHERE emc.user_id != 0 
group by emc.month, emc.user_name, emc.wmf_region
)


select
    month,
    count(DISTINCT(user_name)) as active_editors,
    wmf_region
from active_editor_copy
where
    content_edits >= 5 and
    not bot_by_group and user_name not regexp "bot\\b"
group by month, wmf_region

""")
#groupby
Five_p_editors_total_gm = Five_p_editors_total_gm.groupby(by=["month", 'wmf_region']).sum()
Five_p_editors_total_gxc = Five_p_editors_total_gxc.groupby(by=["month", 'wmf_region']).sum()

#handle datetime column 
editors_ae_me['month'] = pd.to_datetime(editors_ae_me['month'])
editors_ae_me['month']=editors_ae_me.month.dt.strftime('%Y-%m')

df = (Five_p_editors_total_gm.merge(Five_p_editors_total_gxc, how='left', on=['month', 'wmf_region'])
                            .merge(editors_ed, how='left', on=['month', 'wmf_region'])
                            .merge(editors_ae_me, how='left', on=['month', 'wmf_region']))

Next steps - create a new spike task including:

  • ed.action_type IN (0, 1) AND --https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type ...do we want to output a new editors_daily number with action_type included?
  • look into the concept 'content edits'...does editors daily count wikidata and commons? does namespace zero also make sense with these projects? does the method in active_editors catch more non-wiki active editors than other methods?
  • close review of the editors not included in the editors_daily output that are included in the active_editors output... to see who /what we missed.
  • in what context is uebcm being used by GDI?
  • review in detail the monthly_editors + active_editors copy method above...and simplify it to ensure that I'm not adding a bug
  • review the geoeditors_monthly and the unique_editors_by_country_monthly code in detail to compare and contrast the logic with active_editors's logic.

@Iflorez can you update this with March data and let Hua know when that's ready?

Thank you @kzimmerman

@HXi-WMF The updated data can be found in this sheet

Note: I am seeing significant numbers of unique editors whose country/region is coming up null.

Query used:

"""
WITH regions AS (
    SELECT
        wmf_region,
        country_code_iso_2
    FROM gdi.country_meta_data
),

geoeditors_monthly AS(
select 
    month,
    sum(namespace_zero_distinct_editors) AS editors,
    country_code
FROM wmf.geoeditors_monthly
WHERE month >= '2018-01' AND
    activity_level != '1 to 4' AND
    users_are_anonymous = false
GROUP BY 
    month, 
    country_code
)


select 
    month,
    sum(editors),
    wmf_region
FROM geoeditors_monthly g
LEFT JOIN regions AS r
    ON g.country_code = r.country_code_iso_2
GROUP BY
    month, 
    wmf_region

"""

Per @JAnstee_WMF the only country that remains unclassed is Antarctica.

Nulls in the geo field may be best explained in this comment: T331361#8685209

Adding this Comparison sheet of the tables that provide Editors by Geo that was created by @Iflorez
In this sheet you can see the differences in actual active editors by region in the tables we use that are all considering the same definition of active editor.