
Question: mentor-mentee assignment timestamps and unexpected ~60% coverage rate
Open, Needs Triage · Public

Description

Hi Growth Team,

I'm a graduate student researcher at the University of Michigan working on a research project about the mentor program's causal impact on newcomer retention, advised by Prof. Ceren Budak (https://www.si.umich.edu/people/ceren-budak). I have two technical questions about the growthexperiments_mentor_mentee table that I haven't been able to resolve from the documentation alone.

Background on what I've done so far:
I downloaded the latest growthmentorship dump from dumps.wikimedia.org/other/growthmentorship/ (the SQL dump of the growthexperiments_mentor_mentee table, ~5.45 million records with 4 columns: gemm_mentee_id, gemm_mentor_id, gemm_mentor_role, gemm_mentee_is_active).
Separately, I pulled all account creation events from 2021 through early 2026 via the MediaWiki API (logging API with type=newusers), totaling ~10.8 million records. I excluded temporary accounts (usernames starting with ~), leaving ~9.8 million regular user registrations.
I then matched these two datasets by user ID to check what fraction of registrants in each month appear in the mentor-mentee table.

Question 1: Unexpected ~60% coverage rate
Based on the Growth Team's documented deployment timeline (T323048, T302846), I expected to see a step-wise increase in assignment rates: ~10% for users who registered before July 2023, ~25% for July-September 2023, ~50% for October 2023 through early 2025, and close to 100% after February 2025 when full coverage was enabled.
Instead, I found a flat ~60% assignment rate across all time periods. Some examples from my data:

Registration month 2022-03: 184,071 registrations, 106,170 matched in dump = 57.7% (expected ~10%)
Registration month 2023-10: 161,466 registrations, 97,221 matched = 60.2% (expected ~50%)
Registration month 2025-03: 152,559 registrations, 88,320 matched = 57.9% (expected ~100%)
Registration month 2025-06: 124,800 registrations, 71,028 matched = 56.9% (expected ~100%)
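
The percentages above follow directly from the listed counts. A minimal Python sketch of that arithmetic, using only the four months quoted in this description (the names EXAMPLES, coverage_rate and summarize are illustrative and not part of any Wikimedia tooling):

```python
# Counts copied from the four example months quoted above.
# Tuple layout: (registrations, matched in dump, expected rate from the rollout timeline).
EXAMPLES = {
    "2022-03": (184_071, 106_170, 0.10),
    "2023-10": (161_466, 97_221, 0.50),
    "2025-03": (152_559, 88_320, 1.00),
    "2025-06": (124_800, 71_028, 1.00),
}


def coverage_rate(registrations: int, matched: int) -> float:
    """Fraction of registrants that also appear in the mentor-mentee dump."""
    if registrations <= 0:
        raise ValueError("registrations must be positive")
    return matched / registrations


def summarize(examples: dict) -> dict:
    """Return {month: (observed %, expected %)}, rounded to one decimal place."""
    return {
        month: (
            round(100 * coverage_rate(regs, matched), 1),
            round(100 * expected, 1),
        )
        for month, (regs, matched, expected) in examples.items()
    }


if __name__ == "__main__":
    for month, (observed, expected) in summarize(EXAMPLES).items():
        print(f"{month}: observed {observed}% vs expected ~{expected}%")
```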

The flat ~60% rate across all periods — including after February 2025 when coverage was set to 100% — makes me think I'm overcounting registrations on my side. My registration data comes from the MediaWiki logging API (type=newusers), and I excluded temporary accounts (usernames starting with ~), but I suspect the remaining records still include account types that are not eligible for mentor assignment — for example, cross-wiki autocreated accounts (CentralAuth autocreation, action=autocreate), accounts created by other users (action=create2), or accounts created via email (action=byemail). Could you confirm which account creation methods actually trigger mentor assignment? Once I know this, I can re-filter my registration data and recalculate the actual coverage rate for each period.

Separately, even if the ~60% figure is wrong due to my overcounting, there's a related question: users who registered in 2021-2022 (when documented coverage was only ~10%) also appear in the current dump at a high rate. This suggests the system may have retroactively assigned mentors to previously unassigned users at some point — perhaps when coverage was increased to higher levels or to 100%. Did such backfilling occur? If so, do you know roughly when it happened? This distinction matters for my research because I need to know whether a user's mentor assignment in the current dump reflects their status at registration time or a later retroactive assignment.

Question 2: Assignment timestamps
Does the growthexperiments_mentor_mentee table (or any related table in the database) record when a mentor was assigned to a mentee? The dump I have only contains the 4 columns listed above, with no timestamp field.
This matters because my research design uses the phased rollout of mentor assignment (10% to 25% to 50% to 75% to 100%) as a source of quasi-random variation. To do this properly, I need to know whether each user was assigned a mentor at the time of registration or was retroactively assigned later. If there is an assignment timestamp in the database, I could reconstruct historical assignment status directly from the current data without needing older dumps.
If no timestamp exists, would it be possible to access archived versions of the weekly dumps from earlier dates? The oldest file currently available on the dump server is from late November 2025. Snapshots from around July 2023, October 2023, and January 2025 would be especially valuable for my analysis.
I'm happy to go through any data access process needed (IRB, data use agreements, etc.). My university supports this research and I can provide documentation.
Thank you for your time. Any pointers would be greatly appreciated.

Best,
Yubo Zhou
University of Michigan

Event Timeline


I think @Urbanecm_WMF is the best person to answer, as he built all this. :)

A few questions from me though:

  • Unexpected ~60% coverage rate:
    • Have you counted the users who opted out of the program (they can do so in their preferences) or the ones who are opted out because they are blocked? I don't think this is the explanation, as these two cases wouldn't account for 40% of exclusions.
    • I think (probably wrongly, as this is old) that we didn't assign a mentor to the oldest accounts.

Hi @Bobocicada,

Thank you very much for the well-prepared question (and @Trizek-WMF for the ping)! I appreciate someone looking at the mentorship program's effectiveness.

Ad question (1)

I'm responding to the hypothesis in the description below, but I'm pretty sure you're running into something different. Historically, we used two distinct configuration values for scaling Growth features to new projects (particularly very early on [in 2019], when we were trying to verify we're doing the right thing, and to very large projects [en.wikipedia, es.wikipedia]). Those were/are:

  • wgGEHomepageNewAccountEnablePercentage (the homepage percentage), which controls how many users are given access to Special:Homepage,
  • wgGEMentorshipNewAccountEnablePercentage (the mentorship percentage), which controls how many users are shown the Mentorship module in Special:Homepage.

Only the mentorship percentage remains, and it is notably set to 70% on es.wikipedia (where we haven't fully scaled the mentorship program yet, cf. T285235).

We implemented the mentorship percentage by not displaying the module, rather than not assigning the mentor in the database. This means that on all wikis, mentors are assigned whenever possible. If the mentorship percentage says so, the module is hidden, but the mentor/mentee relationship continues to exist independently. If someone queries for it directly (in the dumps like you're doing, or using the #mentor magic word (docs)), it'll be revealed. In other words: "an entry exists in growthexperiments_mentor_mentee" does NOT equal "the user sees a mentor in Special:Homepage" (those two behaviours are relatively independent of each other). Details about the implementation (as well as the patch that made this happen) are available at T287903.

This was done to limit potential confusion for community members, who can use the mentors' names in e.g. welcome templates. If the mentor/mentee relationship existed only for a few users, the welcome templates would need to be unnecessarily complicated. It was much easier to generate a mentor for everyone rather than forcing the community to deal with A/B testing quirks.

Implementation-wise, we automatically set the growthexperiments-homepage-mentorship-enabled user property to 0 (meaning "mentorship module disabled") if the randomisation determines the user should have mentorship disabled. Unfortunately, this data (who has what value for growthexperiments-homepage-mentorship-enabled) is not publicly available as of now. I believe having this data would be more helpful for your research (rather than the dumps you use currently). If it would be helpful, I'd be happy to look into making it available to you, so you can consider it in your research.
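
The distinction between "has a row in growthexperiments_mentor_mentee" and "sees the mentorship module" can be sketched as a small classification in Python. This is an illustration only: the function name and the returned labels are hypothetical, and treating any value other than 0 (including a missing property) as "module shown" is an assumption, not confirmed behaviour.

```python
from typing import Optional


def mentorship_exposure(has_mentor_row: bool, enabled_property: Optional[int]) -> str:
    """Classify a user by mentor assignment versus module visibility.

    has_mentor_row: the user appears in growthexperiments_mentor_mentee.
    enabled_property: value of growthexperiments-homepage-mentorship-enabled,
        or None if no value is stored for the user.
    """
    if not has_mentor_row:
        return "no_mentor_assigned"
    if enabled_property == 0:
        # A mentor exists in the database, but the module is hidden.
        return "assigned_but_hidden"
    # ASSUMPTION: anything other than 0 (including None) means the module is shown.
    return "assigned_and_visible"
```

Under this sketch, matching the dump against registrations only measures the first condition (a row exists), which is why it cannot recover the rollout percentages on its own.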

As to the 60% figure, I'm not sure how you arrived at it. I ran my own numbers for December 2021 (before it was switched to 100% of mentorship, cf. T384505). Mentors were assigned to 100% of users (as expected). My analysis is available at https://people.wikimedia.org/~urbanecm/growth-team/T419002_answering_20260309.html. I'm happy to answer any questions about it. I'm also happy to share the list of users that I considered as the 100% (the event_sanitized.serversideaccountcreation query); maybe this would allow you to identify the source of the inaccuracy?

I excluded temporary accounts (usernames starting with ~)

Do note that this also excludes some non-temporary accounts (for example, ~52Turria starts with a tilde but is not a temporary account). Only the ~2 prefix is reserved for temporary accounts (and yes, this means we have a Y3K problem as of now... :-/). See config for more details.
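
To make the difference concrete, here is a minimal Python sketch comparing the two filters. It assumes a plain "~2" prefix test is a good enough approximation of the reserved pattern; the real configuration may be stricter. The function names are illustrative, and the second username in the usage note is a made-up example.

```python
def is_temporary_account(username: str) -> bool:
    """Approximate check: only the "~2" prefix is reserved for temporary accounts."""
    return username.startswith("~2")


def tilde_only_filter(username: str) -> bool:
    """The broader filter from the task description: any name starting with "~"."""
    return username.startswith("~")


def wrongly_excluded(usernames: list) -> list:
    """Names dropped by the tilde-only filter that are not temporary accounts."""
    return [
        name
        for name in usernames
        if tilde_only_filter(name) and not is_temporary_account(name)
    ]
```

For instance, `wrongly_excluded(["~52Turria", "~2025-12345-6", "Example"])` returns only `"~52Turria"`, the regular account that the tilde-only filter would have dropped.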

This suggests the system may have retroactively assigned mentors to previously unassigned users at some point — perhaps when coverage was increased to higher levels or to 100%. Did such backfilling occur?

Partially. We assign a mentor to everyone who visits Special:Homepage (assuming they have it enabled), if they don't already have one. This means if an experienced user wants to try Special:Homepage, they'll receive a mentor, even if they registered before Homepage was created. That being said, we did not do any mass backfilling; it only happened gradually (based on users' visit patterns).

Could you confirm which account creation methods actually trigger mentor assignment?

We assign the mentor in the LocalUserCreated hook (assuming $autocreated = false and $user->isTemp() === false). We target users who create their accounts via Special:CreateAccount by themselves (for them, a mentor should always be auto-assigned) and we don't want autocreated accounts to receive a mentor [unless they visit the Homepage, cf. above] (T292090 is a research spike to eventually enable it). For Growth's use case, the result for manually-created accounts (byemail, create2 and similar) is insignificant, which means we're not testing for that particular scenario, and no particular behaviour can be guaranteed (in theory, mentors should be assigned in that case, but it might've behaved differently in the last few years).
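
As a rough aid for re-filtering the registration log, the rules in this answer can be sketched in Python. This is not the GrowthExperiments code: the function name and return labels are hypothetical, and mapping newusers log actions (create, create2, byemail, autocreate) onto the hook's $autocreated flag is an assumption.

```python
def assignment_at_creation(log_action: str, is_temp: bool) -> str:
    """Expected mentor assignment at account creation, per the description above.

    log_action: action recorded in the newusers log (assumed to mirror the hook's
        $autocreated flag, i.e. "autocreate" means $autocreated = true).
    is_temp: whether the account is a temporary account.
    """
    if is_temp or log_action == "autocreate":
        # Skipped by the hook; such users may still gain a mentor later
        # if they visit Special:Homepage.
        return "not_assigned_at_creation"
    if log_action == "create":
        # Self-created via Special:CreateAccount: a mentor should always be
        # assigned, provided a suitable mentor is available on the wiki.
        return "assigned_at_creation"
    if log_action in {"create2", "byemail"}:
        # Untested scenario: assignment is expected in theory, but not guaranteed.
        return "not_guaranteed"
    return "unknown"
```

A conservative denominator for coverage would then keep only the records labelled `assigned_at_creation`.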

In addition to the quirks mentioned above, note that the mentor assignment is only possible if a suitable mentor is available on the wiki. This can be determined by looking at the wiki's MediaWiki:GrowthMentors.json (example) and its history. The logic for autogenerating a mentor is available in MentorManager, and hasn't changed meaningfully.

If you decide to examine the list of mentors, do note that early on (before September 2022), we used an unstructured mentor list (actually, two of them; the other was used for manually-assigned mentors [that can only claim users, but are not autoassigned]), which looked like this. Its location differed heavily on a wiki-by-wiki basis, and you can identify its location by looking at GEHomepageMentorsList in MediaWiki:GrowthExperimentsConfig.json (example).

Ad question (2)

The dumps you're processing are verbatim copies of the actual DB table in production (cf. the dump generation logic). Unfortunately, we're not storing assignment timestamps anywhere. We also only store the latest 13 dumps (so about a year's worth of data), which has been the case since 2022. I checked whether the dump mirrors have older dumps by any chance, but that doesn't appear to be the case. Years-old data wouldn't be stored in our internal backups either.

Conclusion

I hope that the advice above is helpful to your research, @Bobocicada. Let me know if you want me to take a look into a possible data release of the data for growthexperiments-homepage-mentorship-enabled (I don't see why it shouldn't be possible, but I'd need to get the release approved first). I also remain available if you have any further inquiries about mentorship – I'm happy to address them in writing or schedule a quick video call to go over the details. Please do feel free to get in touch with me either here, or at murbanec@wikimedia.org, and I'd be happy to help. I'm also interested in the research outcomes (once they're available); if possible, I'd appreciate them being shared with me.

Best regards,
Martin Urbanec
Software Engineer, Wikimedia Foundation

@Urbanecm_WMF Thank you so much for the clarification — this resolves one of my biggest confusions about the system design.

I have two follow-up questions:

1. Can users who were not assigned a mentor during the phased rollout later enable mentorship on their own?
For example, if a user registered during a period when the mentorship module was not visible to them (due to rollout percentage), could they later enable it manually through preferences or by visiting Special:Homepage? And if so, is there any way to identify that this happened — e.g., via a timestamp of when mentorship was first enabled for them, or a HomepageVisit event log?

2. Request for data access
Based on your explanation, the growthexperiments-homepage-mentorship-enabled user property seems critical for our causal analysis. Having this data for users who registered between 2021 and February 2025 (before 100% rollout) would be decisive for our study.

Would it be possible to request a release of this data for that time window?

I'll keep my main questions here so others who are interested can also follow along. If needed (after consulting with my advisor), I'd be happy to schedule a video call to discuss further. Thanks again for your help!

Thanks for the follow-ups, @Bobocicada! I'm glad my answer was helpful.

1. Can users who were not assigned a mentor during the phased rollout later enable mentorship on their own?

In theory. User properties are writable by the user (via the API, action=options). However, this is not promoted anywhere in the UI (on the mentee's side). In the hypothetical scenario a newcomer identifies this is how it is implemented and deliberately changes their user properties to gain a mentor, it could happen, but I consider this to be highly unlikely.

However, mentors can enable the mentorship module for anyone by claiming them (setting themselves as a mentor). This is publicly logged under https://en.wikipedia.org/wiki/Special:Log/growthexperiments?subtype=mentorassignmentchanges. If a newcomer is present in this log, they definitely have access to the mentorship module from that moment on. This is likely to be a relatively uncommon route, but it is at least possible to achieve from the web UI.

And if so, is there any way to identify that this happened — e.g., via a timestamp of when mentorship was first enabled for them

Not with 100% certainty. The claim mentee logs linked above cover a portion of this, but there is no central dataset that would say "users have mentor access since this date".

or a HomepageVisit event log

Partially. The HomepageModule schema (see definition) contains information about whether the user sees a mentorship module or not. However, this information is only logged on a visit to Special:Homepage (meaning if the user never visited the homepage, this event wouldn't be recorded). In addition to this, we only keep this data for 90 days for privacy purposes. Given you're analyzing the mentorship module over a long period of time, I'm not sure this would be useful for your research. Let me know if you disagree with my interpretation though :).

In addition to what I mentioned so far, please note that mentees have the ability to opt out of mentorship (as well as to opt back in), if they so decide. We log that this happened (incl. a timestamp), but only for 90 days. If anything is unclear about the opt-out process, please do let me know.

2. Request for data access
Based on your explanation, the growthexperiments-homepage-mentorship-enabled user property seems critical for our causal analysis. Having this data for users who registered between 2021 and February 2025 (before 100% rollout) would be decisive for our study.

Would it be possible to request a release of this data for that time window?

I'll get the approval process started on this. Please be patient, as it might take me a couple of weeks to get an answer from our Legal department. Note that if approved, the data would be from today (we're not storing historical data, meaning if any change happened in the values, it wouldn't be detectable).

To aid the data release process, would you mind clarifying whether you're interested in users on a particular project, or users across all Wikipedias? In other words, are you studying a particular language edition of Wikipedia, or Wikipedia in general?

I'll keep my main questions here so others who are interested can also follow along. If needed (after consulting with my advisor), I'd be happy to schedule a video call to discuss further. Thanks again for your help!

Communication on the ticket is absolutely fine – it was just an offer, in case it would speed things up.

Thanks for the detailed response! @Urbanecm_WMF

To answer your question: our study focuses on English Wikipedia (enwiki).

Also, I want to confirm my understanding of the data limitation you mentioned. Since historical changes to growthexperiments-homepage-mentorship-enabled are not stored, the current value essentially reflects the initial assignment for the vast majority of users — because almost no one would change it on their own (via API), and mentor claims are publicly logged and can be excluded. So after filtering out claimed mentees using the Special:Log, the remaining users' current values should reliably represent their original auto-assignment status. Does this reasoning sound correct to you?

Thanks!

Thanks for the message, @Bobocicada. In that case, I started the approval process by asking the Legal department to review the release (Asana task, link internal to the Foundation). I'll keep you posted on the development.

To answer your question: our study focuses on English Wikipedia (enwiki).

Duly noted, thank you.

So after filtering out claimed mentees using the Special:Log, the remaining users' current values should reliably represent their original auto-assignment status. Does this reasoning sound correct to you?

Mostly, yes. There might be discrepancies (especially on test users used by the Growth team), but they should be affecting very few accounts.
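
The filtering step discussed here can be sketched in Python as follows. The data shapes are assumptions for illustration (a mapping of user ID to the current property value, and a set of user IDs that appear as claimed mentees in the public log), and the function name is hypothetical.

```python
from typing import Dict, Set


def unclaimed_property_values(
    property_by_user: Dict[int, int],
    claimed_user_ids: Set[int],
) -> Dict[int, int]:
    """Drop users who were claimed by a mentor at any point.

    For the remaining users, the current value of
    growthexperiments-homepage-mentorship-enabled is taken as a proxy for
    their original auto-assignment status. Other sources of change (manual
    API edits, Growth team test accounts) are assumed rare and not handled.
    """
    return {
        user_id: value
        for user_id, value in property_by_user.items()
        if user_id not in claimed_user_ids
    }
```

With made-up IDs, `unclaimed_property_values({1: 0, 2: 1, 3: 0}, {3})` keeps users 1 and 2 and drops user 3, whose current value may reflect the claim rather than the original randomisation.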

Hi @Urbanecm_WMF,

One more follow-up question, this time about the Help Panel.

From T275908, I understand that since March 2021, the Help Panel defaults to sending questions to the user's assigned mentor (wgGEHelpPanelAskMentor = true) rather than to the Help Desk. This means newcomers have two entry points to contact their mentor: the mentorship module on Special:Homepage, and the "Ask your mentor" option in the Help Panel.

My question is: for users where growthexperiments-homepage-mentorship-enabled is set to 0 (i.e., the mentorship module is hidden due to the rollout percentage), does the Help Panel also hide the "Ask your mentor" option? Or can these users still reach their mentor through the Help Panel even though the Homepage module is not visible to them?

I can see from the edit tags that the two entry points are distinguished (mentorship module question vs. mentorship panel question), so I can tell them apart in the data. I just want to confirm whether the wgGEMentorshipNewAccountEnablePercentage randomization gates both channels or only the Homepage module.

Thanks!

Hi @Bobocicada,

Thanks for the follow up!

From T275908, I understand that since March 2021, the Help Panel defaults to sending questions to the user's assigned mentor (wgGEHelpPanelAskMentor = true) rather than to the Help Desk. This means newcomers have two entry points to contact their mentor: the mentorship module on Special:Homepage, and the "Ask your mentor" option in the Help Panel.

That is correct. Do note that communities can change this setting independently using Community Configuration (via https://en.wikipedia.org/wiki/Special:CommunityConfiguration/HelpPanel). A log of changes is available in the page history (https://en.wikipedia.org/w/index.php?title=MediaWiki:GrowthExperimentsHelpPanel.json&action=history). Before we created Extension:CommunityConfiguration, we had a custom implementation of the same concept; back then, configuration lived at MediaWiki:GrowthExperimentsConfig.json, and its history should still be publicly available.

My question is: for users where growthexperiments-homepage-mentorship-enabled is set to 0 (i.e., the mentorship module is hidden due to the rollout percentage), does the Help Panel also hide the "Ask your mentor" option? Or can these users still reach their mentor through the Help Panel even though the Homepage module is not visible to them?

I can see from the edit tags that the two entry points are distinguished (mentorship module question vs. mentorship panel question), so I can tell them apart in the data. I just want to confirm whether the wgGEMentorshipNewAccountEnablePercentage randomization gates both channels or only the Homepage module.

The Help Panel module respects the user property value. It might use the Help Desk as a fallback destination though (as in, if a wiki has a Help Desk configured, and if the user doesn't have mentorship enabled, then the Help Panel will direct questions to the Help Desk, even though the community defined mentors as the preferred target).
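
The routing described above can be summarised as a small decision function in Python. This is a sketch of the described behaviour, not the actual Help Panel code: the function and parameter names are hypothetical, and returning None when neither destination applies is an assumption, since that case is not covered in this thread.

```python
from typing import Optional


def help_panel_target(
    ask_mentor_configured: bool,
    mentorship_enabled_for_user: bool,
    help_desk_configured: bool,
) -> Optional[str]:
    """Where a Help Panel question is expected to go for a given user."""
    if ask_mentor_configured and mentorship_enabled_for_user:
        # The community prefers mentors, and the user property allows it.
        return "mentor"
    if help_desk_configured:
        # Fallback: the user property disables mentorship (or mentors are not
        # the configured target), so the question goes to the Help Desk.
        return "help_desk"
    # ASSUMPTION: the behaviour with no available destination is not described here.
    return None
```

In this sketch, the rollout randomisation gates both channels: a user with the property set to 0 never reaches the `"mentor"` branch.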


Regarding the data release approval process, our Legal department got back to me this morning with some extra questions. I'll answer them today, and we should have a decision within a week or two (or so I hope). Thank you for your patience.

@Bobocicada The release of the requested dataset was approved. I created a separate ticket (T420387), where the dataset can be generated and published. I also left a question for you on that ticket.