Page MenuHomePhabricator

Estimate the impact of the growth features on ptwiki
Closed, ResolvedPublic

Description

Growth team partially enabled growth features on Portuguese Wikipedia (ptwiki) on September 30, 2020. Portuguese Wikipedia turned off IP editing by community decision on October 4, 2020. Both interventions could impact active editors, retention rate, and net content edits. In order to better understand the impact of turning off IP editing, we'd like to estimate the impact of growth features for following three metrics:

  • Active editors

Definition: registered editors who have edited on content page in the given month

  • Retention rate

Definition: out of the non-bot users who registered in the week before the previous and made at least one edit in their first 30 days, the proportion who also edited during their second 30 days.

  • Net content edits:

Definition: number of content edits which were not reverted within 48 hrs editing, excluding bot edits, and revert edits in the given month

Event Timeline

Here are the current queries to extract monthly data for each metric. We will try to estimate impact of growth feature using the same definitions if data is available.
Active editors

SELECT
  date_format(event_timestamp,'yyyy-MM-01') AS y_m,
  date_format(event_timestamp,'MM')  AS m,           
  COUNT(DISTINCT event_user_id) wiki_user_editors
FROM wmf.mediawiki_history
WHERE
        event_entity = 'revision' AND
        event_type = 'create' AND
        DATE(event_timestamp) >= '{START_YYYY_MM_DD}' AND DATE(event_timestamp) <= '{END_YYYY_MM_DD}' AND
        event_user_is_anonymous = false  AND page_namespace_is_content
        AND snapshot = '{SNAPSHOT}' AND wiki_db='ptwiki'
GROUP BY   
  date_format(event_timestamp,'yyyy-MM-01') ,
  date_format(event_timestamp,'MM') 
ORDER BY y_m
LIMIT 10000

Retention rate

-- only count non-bot user 
SELECT 
          date_format(1st_month.user_creation_date,'yyyy-MM-01') AS user_creation_cohort_month, 
          ROUND(SUM(CASE WHEN 2nd_month.edits IS NOT NULL THEN 1 ELSE 0 END) /COUNT(1), 6)          AS retention_rate
FROM      ( 
                   SELECT   event_user_text                           AS user_name, 
                            event_user_id                             AS user_id, 
                            substr(event_user_creation_timestamp,1,10) AS user_creation_date, 
                            count(*)                                  AS edits 
                   FROM     wmf.mediawiki_history 
                   WHERE    snapshot = "{SNAPSHOT}" 
                   AND      wiki_db='ptwiki' 
                   AND      event_entity = "revision" 
                   AND      event_type = "create" 
                   AND      NOT event_user_is_created_by_system 
                   AND      size(event_user_is_bot_by) = 0 
                   AND      event_user_creation_timestamp BETWEEN "{START_YYYY_MM_DD}" AND      "{END_YYYY_MM_DD}" 
                   AND      unix_timestamp(event_timestamp, "yyyy-MM-dd HH:mm:ss.0") < (unix_timestamp(event_user_creation_timestamp, "yyyy-MM-dd HH:mm:ss.0") + (30*24*60*60)) 
                   GROUP BY event_user_text, 
                            event_user_id, 
                            event_user_creation_timestamp
                            ) 1st_month 
LEFT JOIN 
          ( 
                   SELECT   event_user_text                           AS user_name, 
                            event_user_id                             AS user_id, 
                            substr(event_user_creation_timestamp,1,10) AS user_creation_date, 
                            count(*)                                  AS edits 
                   FROM     wmf.mediawiki_history 
                   WHERE    snapshot = "{SNAPSHOT}" 
                   AND      wiki_db='ptwiki' 
                   AND      event_entity = "revision" 
                   AND      event_type = "create" 
                   AND      NOT event_user_is_created_by_system 
                   AND      size(event_user_is_bot_by) = 0 
                   AND      event_user_creation_timestamp BETWEEN "{START_YYYY_MM_DD}" AND      "{END_YYYY_MM_DD}" 
                   AND      unix_timestamp(event_timestamp, "yyyy-MM-dd HH:mm:ss.0") >= (unix_timestamp(event_user_creation_timestamp, "yyyy-MM-dd HH:mm:ss.0") + (30*24*60*60)) 
                   AND      unix_timestamp(event_timestamp, "yyyy-MM-dd HH:mm:ss.0") < (unix_timestamp(event_user_creation_timestamp, "yyyy-MM-dd HH:mm:ss.0")  + (60*24*60*60)) 
                   GROUP BY event_user_text, 
                            event_user_id, 
                            event_user_creation_timestamp
                            ) 2nd_month 
ON        ( 
                    1st_month.user_id = 2nd_month.user_id 
          AND       1st_month.user_creation_date = 2nd_month.user_creation_date)
GROUP BY date_format(1st_month.user_creation_date,'yyyy-MM-01')
ORDER BY user_creation_cohort_month
LIMIT 100000

Net content edits

WITH t1 AS
(
SELECT date_format(event_timestamp,'yyyy-MM-01') AS y_m,
    count(revision_id) AS non_reverted_edits
FROM wmf.mediawiki_history
WHERE snapshot= '{SNAPSHOT}' AND wiki_db='ptwiki'
    AND page_namespace_is_content
    -- for bots size(event_user_is_bot_by) returns 1/2, IP editors return -1, registered non-bot editors return 0
    AND  size(event_user_is_bot_by) <= 0 
    AND substr(event_timestamp,1,10) BETWEEN '{START_YYYY_MM_DD}' AND '{END_YYYY_MM_DD}'
    AND event_entity = "revision" AND (NOT revision_is_identity_reverted  OR revision_seconds_to_identity_revert > 172800)
GROUP BY date_format(event_timestamp,'yyyy-MM-01')
ORDER BY y_m
LIMIT 1000000
),
t2 AS
(
SELECT date_format(h1.event_timestamp,'yyyy-MM-01') AS y_m,
    count(distinct h1.revision_id) AS revert_edits
FROM wmf.mediawiki_history AS h1 
LEFT JOIN wmf.mediawiki_history AS h2  -- h1 is revert edits, h2 is the edits being reverted
    ON h1.revision_id=h2.revision_first_identity_reverting_revision_id  
    AND h1.wiki_db = h2.wiki_db
    AND h1.snapshot= h2.snapshot 
WHERE h1.wiki_db = 'ptwiki' AND h1.snapshot='{SNAPSHOT}'
    AND h1.page_namespace_is_content
    -- for bots size(event_user_is_bot_by) returns 1 or 2, IP editors return -1, registered non-bot editors return 0
    AND size(h1.event_user_is_bot_by) <= 0 AND size(h2.event_user_is_bot_by) <= 0
    AND substr(h1.event_timestamp,1,10) BETWEEN '{START_YYYY_MM_DD}' AND '{END_YYYY_MM_DD}'
    AND h1.event_entity = "revision" and h2.event_entity = "revision" 
    AND (NOT h1.revision_is_identity_reverted  OR h1.revision_seconds_to_identity_revert > 172800)
    AND h2.revision_is_identity_reverted AND h2.revision_seconds_to_identity_revert < 172800
    AND h1.revision_is_identity_revert
    AND h1.event_entity = 'revision' and  h2.event_entity = 'revision'
GROUP BY date_format(h1.event_timestamp,'yyyy-MM-01')
ORDER BY y_m
LIMIT 10000
)
SELECT t1.y_m, t1.non_reverted_edits - t2.revert_edits AS net_non_reverted_edits
FROM  t1
LEFT JOIN t2 ON t1.y_m=t2.y_m

@jwang : I fiddled with the active editors query a bit and created a notebook that combines it with the Growth dataset of registrations from October/November 2020 to count the number of them that edits every month from Nov 2020 through May 2021, split by whether they had the Growth features or not. You can find that notebook here. Let me know what questions you might have about this!

@nettrom_WMF , thank you for the active editor data. Using the same method, I queried the other 2 metrics: retention rate and net content edits. You can find the queries and results in the notebook. Let me know if you have questions about it. Look forward to discussing the results with you.

ldelench_wmf triaged this task as Medium priority.
ldelench_wmf moved this task from Triage to Kanban on the Product-Analytics board.

@nettrom_WMF and I have went though the results. As the growth features only applied to the newcomers who registered after October 1st, 2020 and were selected in experiment group ( 80% sample rate), we also estimated the portion of users with growth features in each metric. And then came up with the estimated overall impacts.

Summary

Active editors
We observed a small difference between the control group and the experiment group. It's not statistically significant. The absolute difference is estimated to be +0.168%, based on the average from Nov 2020 to March 2021. The impacted group is 48% of active editors (60% active editors are newcomers after 2020-10-01 * 0.8 experiment rate). Therefore, the overall impact on the active editors is estimated to be +0.08%.

Retention rate
A +0.42% lift is observed between control group and experiment group. The lift is applied on newcomers who are in experimental group, which is 80% of total newcomers. The estimated overall impact is +0.33%.

Net content edits
A +4.36% lift is observed between control group and experiment group. 11.7% of total edits are made by newcomers after 2020-10, based on the average from Oct 2020 to Mar 2021. The overall estimated impact is estimated to be +0.5%.