
DE3.1 - Logged-out reader 21-day retention on web
Open, Medium, Public

Description

Datasource

To improve: repeat experiments, productionize. Temporary hack solution: the jiawang_web.retention_logged_out_cross_wiki and jiawang_web.retention_logged_out_per_wiki tables.

Open Questions

Is twice within 21 days really correct? I thought the metric was different?
At least twice. Please see the definition below.

Definition

Logged-out reader: a user uniquely identified by an edge-unique cookie, within the boundaries of an experiment, who is not logged in

Metric: The percentage of logged-out web users who return at least once by the end of day 21 after their first visit within a given 7-day cohort week.
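
To make the definition concrete, here is a minimal Python sketch of the calculation on in-memory data. It is not the production dbt model. The function name retention_21day, the (user_id, visit_date) input shape, and the day-level granularity are assumptions for illustration, as is counting a return only when it falls on a later calendar day than the first visit.

```python
from datetime import date, timedelta


def retention_21day(visits, cohort_start):
    """Percentage of cohort users who return within 21 days of their first visit.

    visits: iterable of (user_id, visit_date) pairs for logged-out users
            (hypothetical input shape; user_id stands in for the edge-unique cookie).
    cohort_start: first day of the 7-day cohort week.
    """
    cohort_end = cohort_start + timedelta(days=6)

    # Collect the distinct visit days of each user.
    days_by_user = {}
    for user_id, day in visits:
        days_by_user.setdefault(user_id, set()).add(day)

    cohort_size = 0
    returned = 0
    for days in days_by_user.values():
        first = min(days)
        # Assumption: a user belongs to the cohort if their first visit
        # in the supplied data falls inside the cohort week.
        if not (cohort_start <= first <= cohort_end):
            continue
        cohort_size += 1
        # Assumption: the first visit is day 0, so "end of day 21" is first + 21 days.
        deadline = first + timedelta(days=21)
        if any(first < d <= deadline for d in days):
            returned += 1

    return 100.0 * returned / cohort_size if cohort_size else 0.0


sample_visits = [
    ("a", date(2026, 5, 1)), ("a", date(2026, 5, 10)),  # returns on day 9
    ("b", date(2026, 5, 2)),                            # never returns
    ("c", date(2026, 5, 3)), ("c", date(2026, 5, 30)),  # returns on day 27, too late
    ("d", date(2026, 5, 4)), ("d", date(2026, 5, 25)),  # returns on day 21, counts
    ("e", date(2026, 5, 20)),                           # first visit outside the cohort week
]
print(retention_21day(sample_visits, date(2026, 5, 1)))  # 50.0
```

Under these assumptions, "visits at least twice within 21 days" and "returns at least once by the end of day 21" describe the same users, which is how the open question above is resolved.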

Event Timeline

Milimetric renamed this task from DE3.1 Logged-out casual reader retention on web to DE3.1 - Logged-out casual reader retention on web. Apr 28 2026, 6:14 PM
Milimetric triaged this task as Medium priority. May 6 2026, 3:15 PM
Milimetric renamed this task from DE3.1 - Logged-out casual reader retention on web to DE3.1 - Logged-out reader 21-day retention on web. May 7 2026, 8:07 PM
Milimetric updated the task description.
jwang updated the task description.

Status

  • The data for logged-out-retention-round3 has been pulled and stored in the jiawang_web.retention_logged_out_cross_wiki and jiawang_web.retention_logged_out_per_wiki tables.
  • The data for rounds 4–7 has an instrumentation issue: we currently cannot distinguish logged-out from logged-in page visits. (Slack discussion)
  • The next usable dataset will mature in about 10 days with round 8.
Ahoelzl subscribed.
Status Update

Prepared SQL files to manually pull logged-out reader retention metrics per experiment. The models have been tested successfully in the DBT environment.

SQL location (@stat1010): /home/jiawang/share/dbt-jobs/models/web_reader

1. Global retention metrics (across wikis)

  • SQL files

int_retention_21day_logged_out_reader_global.sql
int_retention_2ndweek_logged_out_reader_global.sql
int_retention_2ndday_logged_out_reader_global.sql
mrt_retention_logged_out_reader_global.sql

  • Test command
dbt run --profiles-dir ~/GITLAB_DBT/dbt-jobs/ -s mrt_retention_logged_out_reader_global
  • Test result: jiawang.mrt_retention_logged_out_reader_global

2. Retention metrics per wiki

  • SQL files

int_retention_21day_logged_out_reader_wiki.sql
int_retention_2ndweek_logged_out_reader_wiki.sql
int_retention_2ndday_logged_out_reader_wiki.sql
mrt_retention_logged_out_reader_wiki.sql

  • Test command
dbt run --profiles-dir ~/GITLAB_DBT/dbt-jobs/ -s mrt_retention_logged_out_reader_wiki
  • Test result: jiawang.mrt_retention_logged_out_reader_wiki
Next steps
  • Hand off to Thomas to automate report triggering
  • Replace the source tables with the new schema in the Superset: logged-out reader dashboard

Change #1295087 had a related patch set uploaded (by TChin; author: TChin):

[mediawiki/extensions/WikimediaEvents@master] Update logged out reader retention experiment to be async

https://gerrit.wikimedia.org/r/1295087

tchin updated Other Assignee, added: Milimetric; removed: tchin.

Change #1295087 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Update logged out reader retention experiment to be async

https://gerrit.wikimedia.org/r/1295087

@tchin, I found a few instrumentation limitations in Rounds 4–10 that could affect the accuracy of the baseline metrics. Just wanted to keep you posted.
Rounds 4–7
The performer_is_logged_in field was not enabled between April 1 and April 15, so we cannot distinguish logged-in page visits from logged-out page visits during that period.
There was also a transition from the page_visited event name to page_visit, which may introduce inconsistencies.

Suggestion: Skip snapshotting the baseline for Rounds 4–7.

Rounds 8–10
The mediawiki_database field is missing for page_visit events, which means we cannot calculate retention rates at the wiki level.
Suggestion: Snapshot only the global retention baseline and skip the per-wiki retention baseline for Rounds 8–10.

I also documented the QA findings in the instrumentation spec.

@amastilovic is there an easy way to selectively run modified dbt jobs in production for backfilling like what we might need above?

The mediawiki_database field is missing for page_visit events, which means we cannot calculate retention rates at the wiki level.
Suggestion: Snapshot only the global retention baseline and skip the per-wiki retention baseline for Rounds 8–10.

Is that different from mediawiki.database? That seems to exist

spark-sql (default)> SELECT count(1)
                   > FROM event.product_metrics_web_base
                   > WHERE experiment.enrolled = 'logged-out-retention-round9'
                   >   AND action = 'page_visit'
                   >   AND NOT performer.is_logged_in
                   >   AND mediawiki.`database` IS NOT NULL
                   > LIMIT 1;
count(1)
15896166

The mediawiki.database field has been populated again since May 15. Between April 29 and May 14, the field is NULL.

logged-out-retention-round9 config: https://test-kitchen.wikimedia.org/experiment/logged-out-retention-round9
Presto SQL:

SELECT month, day, count(DISTINCT mediawiki."database")
FROM event.product_metrics_web_base
WHERE experiment.enrolled = 'logged-out-retention-round9'
  AND year = 2026 AND month > 3
  AND action = 'page_visit'
  AND NOT performer.is_logged_in
  AND mediawiki."database" IS NOT NULL
GROUP BY month, day
ORDER BY month, day
LIMIT 100

@amastilovic is there an easy way to selectively run modified dbt jobs in production for backfilling like what we might need above?

For rounds 8–10, I can manually snapshot the data into my database tables: jiawang_web.retention_logged_out_cross_wiki and jiawang_web.retention_logged_out_per_wiki.
Would it be easier for you if I did that and you then wrote the data into the destination schema?

It's okay, I changed a few small things in the SQL, so I can just do it manually on my end.