
Mentor sign-up volume choice labels are mathematically unsound
Open, Needs Triage · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • go to en:Special:EnrollAsMentor
  • (Note: I am unable to see the enrollment page, as I am already a mentor and it shows me my dashboard instead)

What happens?:

  • From memory, the sign-up process offers mentors a tripartite division of mentee numbers or activity volume, labeled something like: 'half the average', 'average', and 'twice the average'. This is circular, and the average may drop to 1, or increase indefinitely.

What should have happened instead?:

  • It should offer volume labels that do not refer to the average, such as 'low', 'medium', 'high', or absolute numbers (e.g., 1-10, 11-20, 21-40).

Details at mw:Talk:Growth.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1214400 had a related patch set uploaded (by Shivaansh Singh; author: Shivaansh Singh):

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Use neutral mentor load labels

https://gerrit.wikimedia.org/r/1214400

Urbanecm_WMF subscribed.

Pulling to sprint, as it has a patch provided already.

Hello!

Thanks for filing this task, and for uploading a patch here. Before we decide on how to proceed, I'd like to ask @AAlhazwani-WMF (Growth's Designer) and @KStoller-WMF (Growth's Product Manager) for their input as well. From my perspective, I understand how the labels can be misleading, especially in edge cases.

Internally, we have a mentor pool, which contains usernames of registered mentors. By default (when on the "Average" setting), everyone is in the pool twice. If someone changes their settings to "twice the average", their username will be in the pool four times. In case they set themselves to "half the average", their username will be in the pool only once. Whenever a new account registers, a random username is taken from the pool, which is the mentor we're assigning to that user. The number of times a mentor's username is in the pool is called the mentor's weight.
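As a rough illustration of the pool described above, here is a minimal Python sketch. It is not the GrowthExperiments implementation: the `WEIGHTS` mapping, the function names, and the example usernames are all assumptions made for illustration, and only the 1 / 2 / 4 weights come from the description.

```python
import random

# Assumed mapping from the sign-up setting to the number of pool entries
# (the "weight"), following the description above. Names are hypothetical.
WEIGHTS = {"half": 1, "average": 2, "twice": 4}


def build_pool(settings):
    """Return a list in which each mentor's username appears `weight` times."""
    pool = []
    for mentor, setting in settings.items():
        pool.extend([mentor] * WEIGHTS[setting])
    return pool


def assign_mentor(pool, rng=random):
    """Pick one username uniformly at random from the pool."""
    return rng.choice(pool)


# Hypothetical mentors: one on each setting.
settings = {"Alice": "average", "Bob": "twice", "Carol": "half"}
pool = build_pool(settings)
mentor_for_new_account = assign_mentor(pool)
```

With these three hypothetical mentors the pool holds seven entries, four of which are Bob's, so Bob is drawn for roughly four out of every seven new accounts.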

This means the only changes that happen are relative: if one mentor's weight is smaller relative to the weight of all the other mentors, that mentor would receive fewer newcomers. Conversely, if one mentor's weight is higher relative to all the other mentors, they would receive more newcomers.

Naturally, this only works if only a small portion of the mentors change their preferences. If everyone sets themselves as "Half the average", the system wouldn't change at all, as there wouldn't be any relative differences.
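The point about relative differences can be checked with a little arithmetic. The sketch below assumes that a mentor's expected share of newcomers is simply their weight divided by the total weight in the pool; the function name and the mentor names are hypothetical.

```python
from fractions import Fraction


def expected_shares(weights):
    """Expected fraction of newcomers per mentor: weight / total weight."""
    total = sum(weights.values())
    return {mentor: Fraction(w, total) for mentor, w in weights.items()}


# Everyone on "Average" (weight 2) versus everyone on "Half the average" (weight 1).
all_default = expected_shares({"A": 2, "B": 2, "C": 2})
all_half = expected_shares({"A": 1, "B": 1, "C": 1})

# Only mentor A switches to "Half the average".
one_half = expected_shares({"A": 1, "B": 2, "C": 2})
```

Under that assumption `all_default` and `all_half` are identical (one third each), while in `one_half` mentor A expects one fifth of newcomers and B and C two fifths each, so A receives half of what each of the others receives.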

Mathematically, this should indeed converge towards half the average or twice the average. The average itself naturally changes, of course, but the system would eventually reach a stable configuration. While I can see how the labels might be confusing, the newly proposed labels are probably even more confusing, as they provide absolutely no information about what the actual difference is. In addition, they would still have the "only works when only some mentors change their settings" problem I described above, so "Low" and "Medium" might actually be the same in some cases.

I'm curious what you think about this.

Best regards,
Martin Urbanec

Thanks for thinking about this, @Mathglot and @ShivaanshSingh!

I agree that the current copy is imperfect, but I do worry that "Low, Medium, and High" is too vague.

One possible approach is to focus on the mentor’s preferred workload. Something like:

• Fewer (I have limited time to support mentees)
• Standard number of mentees (default)
• More (I am eager to support more mentees)

This phrasing keeps the intent clear and lets mentors choose based on their availability rather than the underlying distribution mechanics.
Admittedly, those are all rather long phrases for a drop-down menu, and they will inevitably be even longer when localized into certain languages.

Let's let @AAlhazwani-WMF chime in before we make any changes.

. . . From my perspective, I understand how the labels can be misleading, especially in edge cases.

Internally, we have a mentor pool, which contains usernames of registered mentors. By default (when on the "Average" setting), everyone is in the pool twice. If someone changes their settings to "twice the average", their username will be in the pool four times. In case they set themselves to "half the average", their username will be in the pool only once. Whenever a new account registers, a random username is taken from the pool, which is the mentor we're assigning to that user. The number of times a mentor's username is in the pool is called the mentor's weight.

So what I understand now is that you are assuring mentor load only proportionally, relative to other mentors, and not absolutely, based on any number of mentees. Am I correct in concluding that in this scheme there is no upper bound on possible mentor load, so that a mentor might get 10 mentees, or 100, or 10,000? Without knowing the details of the number of mentors and the total number of mentees in the future, it would be impossible for a prospective mentor upon signing up to estimate load or to set a threshold on the number of mentees they might be assigned.
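To make the concern concrete, here is a small sketch of how expected load would scale, assuming (as the pool description suggests, but as a simplification) that a mentor's expected number of mentees is the number of newcomers times weight over total weight. The function name and all figures are made up for illustration.

```python
def expected_load(weight, total_weight, newcomers):
    """Expected mentees for one mentor: newcomers * weight / total weight."""
    return newcomers * weight / total_weight


# Ten hypothetical mentors, all on the default weight of 2 (total weight 20).
load_small_wiki = expected_load(2, 20, 1_000)
load_large_wiki = expected_load(2, 20, 100_000)
```

In this sketch the same setting yields an expected 100 mentees in one case and 10,000 in the other: the setting fixes the proportion, not the absolute number.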

If that is a correct understanding, then this is a completely untenable system, with a set of labels that are devoid of any real-world meaning. If you are doing piecework on the assembly line, and you assign the novices half the work of the journeyman, and one fourth that of the masters, that is perhaps fair to begin with, until you dump one million parts into the system and the whole system collapses. However fair it is to tell a novice that the masters have to finish four million parts and you only have to finish one million, if a normal person can only do a thousand a day, the fairness of the division of labor is irrelevant.

I would predict that as load gets heavier with increasing number of mentees, those requesting 2x or average will downgrade and there will be a race to the bottom, with everyone eventually ending up at 1/2, thus total equality to divvy up the load. Then when 1/2 is still too much, mentors will start to leave the system, increasing the load on everyone else, with a vicious cycle occurring until you have nobody left willing to mentor.

I no longer believe that the core problem is that the label names are unsound. The system design is unsound. Please tell me I have this all wrong.

Change #1214996 had a related patch set uploaded (by Shivaansh Singh; author: Shivaansh Singh):

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Clarify mentor workload labels

https://gerrit.wikimedia.org/r/1214996

Change #1215037 had a related patch set uploaded (by Shivaansh Singh; author: Shivaansh Singh):

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Clarify mentor workload labels

https://gerrit.wikimedia.org/r/1215037

. . . From my perspective, I understand how the labels can be misleading, especially in edge cases.

Internally, we have a mentor pool, which contains usernames of registered mentors. By default (when on the "Average" setting), everyone is in the pool twice. If someone changes their settings to "twice the average", their username will be in the pool four times. In case they set themselves to "half the average", their username will be in the pool only once. Whenever a new account registers, a random username is taken from the pool, which is the mentor we're assigning to that user. The number of times a mentor's username is in the pool is called the mentor's weight.

So what I understand now is that you are assuring mentor load only proportionally, relative to other mentors, and not absolutely, based on any number of mentees. Am I correct in concluding that in this scheme there is no upper bound on possible mentor load, so that a mentor might get 10 mentees, or 100, or 10,000? Without knowing the details of the number of mentors and the total number of mentees in the future, it would be impossible for a prospective mentor upon signing up to estimate load or to set a threshold on the number of mentees they might be assigned.

yeah, to add even more complexity to the discussion.. the challenge here is that we don't know how active those mentees are going to be (maybe we could estimate the "number of questions per mentee per week" based on previous data, though i defer to engineering to confirm whether that is feasible). one mentor could have hundreds of mentees and only get a few questions a month.. while another mentor might have just a dozen mentees but get several questions per day.


to build off your prompt @Mathglot, i wonder what a mentor actually needs to know to make this decision?

right now we're trying to describe the mechanism (likelihood, weighting, probability), but what mentors actually need to understand to make an informed decision may be simpler?

so my question for the group is.. what are mentors actually trying to signal when they change this setting? are they saying:

  • how much time they have?
  • how experienced they are?
  • how many questions they can answer?
  • how many mentees they can handle?
  • something else?

if we understand their/your intent when using this control, we might be able to write labels that match that intent, even if the technical implementation underneath is more complex.

Change #1214238 had a related patch set uploaded (by Prakhar0804; author: Prakhar0804):

[mediawiki/extensions/GrowthExperiments@master] GrowthExperiments: Mentorship — use Low/Medium/High labels instead of 'average' phrasing

https://gerrit.wikimedia.org/r/1214238

Hello @Prakhar0804 @ShivaanshSingh,

Thank you for the work you spent on this task. However, please note that as of now, the problem is not the lack of code, but rather agreeing on the solution. You can help build agreement by participating in the discussion on this task, specifically by answering the comments posted earlier by @KStoller-WMF, @AAlhazwani-WMF or myself.

It seems that while they understand the current situation is not perfect, they consider the proposed solution to have other problems (possibly more impactful). The team is open to considering a change if it is an improvement on the status quo, but we're not open to replacing one not-so-great solution with another.

Thank you for your understanding.

Sincerely,
Martin Urbanec

Change #1214996 abandoned by Shivaansh Singh:

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Clarify mentor workload labels

https://gerrit.wikimedia.org/r/1214996

Change #1215037 abandoned by Shivaansh Singh:

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Clarify mentor workload labels

https://gerrit.wikimedia.org/r/1215037

Change #1214400 abandoned by Shivaansh Singh:

[mediawiki/extensions/GrowthExperiments@master] Mentorship: Use neutral mentor load labels

https://gerrit.wikimedia.org/r/1214400

to build off your prompt @Mathglot, i wonder what a mentor actually needs to know to make this decision?

. . .

so... what are mentors actually trying to signal when they change this setting?

That is looking into the wrong end of the telescope. Why ask what mentors are trying to signal when responding to bad choices where no response is meaningful or satisfactory? What is your favorite flavor of ice cream: 1) yes; 2) tomorrow; 3) Game of Thrones? Ask mentors what the choices ought to be. Or even, as an interesting post-mortem: "What do you think we were asking when we said, 'avg, 1/2 avg, 2x avg' ? If you pick 2x avg, what do you think happens then?"

Nevertheless, I was signaling something, namely this: "This is a cockamamie option set that should have been murdered in its bed and never seen the light of day, but the ones who will get hurt if I don't answer are the new users who won't get good advice, or any advice, so I guess I have to pick something, for the users' sake; I can always vent later at Phab. Since I am also interested in survey design, I'll start with one extreme, then after a time, switch to the other extreme, and see if I can reverse engineer what is going on under the hood. But the whole thing sounds like voodoo mathematics to me. I wonder how upset they'll be if I just give them my honest opinion."

But I think my signals are probably atypical. Anyway, if you didn't get all that out of my "one-half" response, I guess the cat is out of the bag now.

As far as what a mentor actually needs to know to make the decision, that is unanswerable given the three options. If you are asking without limit to the three options, then the main question (for me) is, "How will this affect my workload?" I can think of tons of ways this system could be rationalized. One approach that would satisfy me adequately would be something similar to the configurability of the en-wiki Wikipedia:Feedback request service, where volunteers sign up and configure their load-throttling in terms that are completely transparent with no voodoo math. (Currently sign-up and throttling are combined on one page and I'd vote for separating those two functions, but this is not a big deal.)

The point there is that Feedback request service volunteers are assured of no more load than they have configured. That means, of course, that volunteer-feedback time is a limited resource, and not every RfC at Wikipedia will always get as many volunteers as one might like to have, but that is the nature of a limited resource (and also of a volunteer project where you get to choose what work you want to do, and how much of it).
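For contrast with the weighted pool, a cap-based scheme of the general kind described above might look like the following sketch. This is not how the Feedback request service or GrowthExperiments is implemented; the function name, the per-mentor caps, and the choice to return `None` when everyone is full are all assumptions made for illustration.

```python
import random


def assign_with_caps(caps, current_load, rng=random):
    """Pick a mentor who is still under their own cap, or None if all are full."""
    available = [m for m, cap in caps.items() if current_load.get(m, 0) < cap]
    if not available:
        # Nobody has capacity left: the newcomer stays unassigned.
        return None
    mentor = rng.choice(available)
    current_load[mentor] = current_load.get(mentor, 0) + 1
    return mentor


# Hypothetical caps chosen by two mentors; four newcomers arrive.
caps = {"Alice": 1, "Bob": 2}
load = {}
assignments = [assign_with_caps(caps, load) for _ in range(4)]
```

Here the first three newcomers are assigned and the fourth is not, which is the trade-off noted above: no mentor exceeds the load they configured, but mentor time becomes an explicitly limited resource.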

The current Mentor question assignment design takes the approach that every newbie gets a mentor, and every question they issue gets posted, without taking into account mentors' max-load wishes (other than the bizarre 'half-the-pain' / 'twice-the-pain' options that are unconvertible into units of time or work). It is an opaque system that just dumps more and more load out as more and more newbies sign up. This is inherently disrespectful of mentor volunteer time.

My guess is that some of the decisions that led to this situation must be deeply baked into the system and cannot easily be undone, or altered in any fundamental way, without a from-scratch redesign, but hopefully I'm wrong about that.

(Apologies for delayed reply; I must have missed the ping.)