Hi Growth Team,
I'm a graduate student researcher at the University of Michigan, advised by Prof. Ceren Budak (https://www.si.umich.edu/people/ceren-budak), and I'm studying the mentor program's causal impact on newcomer retention. I have two technical questions about the growthexperiments_mentor_mentee table that I haven't been able to resolve from the documentation alone.
Background on what I've done so far:
I downloaded the latest growthmentorship dump from dumps.wikimedia.org/other/growthmentorship/ (the SQL dump of the growthexperiments_mentor_mentee table, ~5.45 million records with 4 columns: gemm_mentee_id, gemm_mentor_id, gemm_mentor_role, gemm_mentee_is_active).
Separately, I pulled all account creation events from 2021 through early 2026 via the MediaWiki API (logging API with type=newusers), totaling ~10.8 million records. I excluded temporary accounts (usernames starting with ~), leaving ~9.8 million regular user registrations.
I then matched these two datasets by user ID to check what fraction of registrants in each month appear in the mentor-mentee table.
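To make the matching step concrete, here is a minimal sketch of the kind of per-month coverage calculation described above. The function name, the (user_id, timestamp) record shape, and the ISO timestamp format are illustrative assumptions, not the exact pipeline; only gemm_mentee_id comes from the dump itself.

```python
from collections import defaultdict


def monthly_coverage(registrations, mentee_ids):
    """Share of registrants per month that appear in the mentor-mentee dump.

    registrations: iterable of (user_id, iso_timestamp) pairs (assumed shape),
                   e.g. (12345, "2022-03-14T09:21:00Z")
    mentee_ids:    set of gemm_mentee_id values loaded from the dump
    Returns {"YYYY-MM": (registered, matched, percent_matched)}.
    """
    registered = defaultdict(int)
    matched = defaultdict(int)
    for user_id, timestamp in registrations:
        month = timestamp[:7]  # "2022-03-14T09:21:00Z" -> "2022-03"
        registered[month] += 1
        if user_id in mentee_ids:
            matched[month] += 1
    return {
        month: (
            registered[month],
            matched[month],
            round(100 * matched[month] / registered[month], 1),
        )
        for month in sorted(registered)
    }
```

Holding the mentee IDs in a set keeps each lookup constant-time, which matters with roughly 5.45 million dump rows and 9.8 million registrations.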
Question 1: Unexpected ~60% coverage rate
Based on the Growth Team's documented deployment timeline (T323048, T302846), I expected to see a step-wise increase in assignment rates: ~10% for users who registered before July 2023, ~25% for July-September 2023, ~50% for October 2023 through early 2025, and close to 100% after February 2025 when full coverage was enabled.
Instead, I found a flat ~60% assignment rate across all time periods. Some examples from my data:
Registration month 2022-03: 184,071 registrations, 106,170 matched in dump = 57.7% (expected ~10%)
Registration month 2023-10: 161,466 registrations, 97,221 matched = 60.2% (expected ~50%)
Registration month 2025-03: 152,559 registrations, 88,320 matched = 57.9% (expected ~100%)
Registration month 2025-06: 124,800 registrations, 71,028 matched = 56.9% (expected ~100%)
The rate stays flat at ~60% in every period, including after February 2025 when coverage was set to 100%, which makes me think I'm overcounting registrations on my side. Although I excluded temporary accounts, I suspect the remaining records from the logging API (type=newusers) still include account types that are not eligible for mentor assignment, for example cross-wiki autocreated accounts (CentralAuth autocreation, action=autocreate), accounts created by other users (action=create2), or accounts created via email (action=byemail). Could you confirm which account creation methods actually trigger mentor assignment? Once I know this, I can re-filter my registration data and recalculate the actual coverage rate for each period.
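The re-filtering I have in mind would look roughly like the sketch below. The record shape (dicts with an "action" key) and the helper name are hypothetical, and treating only action=create as eligible is a placeholder assumption until the eligible creation methods are confirmed.

```python
from collections import Counter

# Placeholder assumption: only self-created accounts are eligible.
# To be replaced with whatever set of actions actually triggers assignment.
ASSUMED_ELIGIBLE_ACTIONS = frozenset({"create"})


def split_by_action(records, eligible=ASSUMED_ELIGIBLE_ACTIONS):
    """Count registration records per log action and keep the eligible ones.

    records: iterable of dicts with an "action" key (assumed shape),
             e.g. {"user_id": 12345, "action": "autocreate"}
    Returns (counts_per_action, eligible_records).
    """
    records = list(records)
    counts = Counter(record["action"] for record in records)
    kept = [record for record in records if record["action"] in eligible]
    return counts, kept
```

The per-action counts alone would already show how much of the ~9.8 million total comes from autocreate, create2, and byemail events.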
Separately, even if the ~60% figure is wrong due to my overcounting, there's a related question: users who registered in 2021-2022 (when documented coverage was only ~10%) also appear in the current dump at a high rate. This suggests the system may have retroactively assigned mentors to previously unassigned users at some point — perhaps when coverage was increased to higher levels or to 100%. Did such backfilling occur? If so, do you know roughly when it happened? This distinction matters for my research because I need to know whether a user's mentor assignment in the current dump reflects their status at registration time or a later retroactive assignment.
Question 2: Assignment timestamps
Does the growthexperiments_mentor_mentee table (or any related table in the database) record when a mentor was assigned to a mentee? The dump I have only contains the 4 columns listed above, with no timestamp field.
This matters because my research design uses the phased rollout of mentor assignment (10% to 25% to 50% to 75% to 100%) as a source of quasi-random variation. To do this properly, I need to know whether each user was assigned a mentor at the time of registration or was retroactively assigned later. If there is an assignment timestamp in the database, I could reconstruct historical assignment status directly from the current data without needing older dumps.
If no timestamp exists, would it be possible to access archived versions of the weekly dumps from earlier dates? The oldest file currently available on the dump server is from late November 2025. Snapshots from around July 2023, October 2023, and January 2025 would be especially valuable for my analysis.
I'm happy to go through any data access process needed (IRB, data use agreements, etc.). My university supports this research and I can provide documentation.
Thank you for your time. Any pointers would be greatly appreciated.
Best,
Yubo Zhou
University of Michigan