Phabricator

Develop a dataset for editor Q&A
Closed, ResolvedPublic

Description

Task

Build a dataset of realistic inputs and outputs for a system that aims to provide support to editors with questions. The inputs likely are questions that have already been asked on wikis -- e.g., extracted from WP:Teahouse -- but the outputs might take a few reasonable forms:

  • Link to the relevant content in a policy/help page. This could have several different levels at different granularities where feasible such as relevant sentence, paragraph, section, page, or even namespace. The relevant namespace might seem trivial/non-useful but one design for an agentic system might involve recommending where to search in the first place.
  • Link to a similar question that has been asked. To keep this discovery task non-trivial, either the question itself would need to be masked from the output data, natural examples would need to be found on-wiki, or the questions would need to be transformed in some way to fuzz them.
  • Text of the actual answer provided to the question.
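A hypothetical example of what one record in such a dataset might look like, combining the output forms listed above (all field names and values here are illustrative assumptions, not a fixed schema):

```python
# Sketch of a single Q&A dataset record; every field name and value
# below is a hypothetical assumption for illustration only.
record = {
    "question": "How do I add a citation to an article?",
    "source": "WP:Teahouse",
    # Relevant content at several granularities, per the list above:
    "relevant_namespace": "Help",
    "relevant_page": "Help:Referencing for beginners",
    "relevant_section": "Inline citations",
    # Link to a similar previously asked question, if one exists:
    "similar_question_url": None,
    # Text of the actual answer provided on-wiki:
    "answer_text": "You can use the cite button in the editing toolbar...",
}

# Basic sanity checks on the record structure.
assert record["relevant_page"].startswith("Help:")
assert "question" in record and "answer_text" in record
```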

Considerations:

  • Size: while larger datasets are better because they provide more detail on the performance of a given approach and its potential errors, quality is likely the most important factor, given that a system would probably not be fine-tuned (but would instead use existing pre-trained language models for generating embeddings or LLMs for selecting answers). So realistically, a few hundred high-quality and diverse examples are much more valuable than 1,000 or more of mixed or unknown quality.
  • Quality: not all answers provided will necessarily be correct. This may require manual evaluation but filtering on answering-editor expertise or other parameters might help with reducing down the scope that needs to be evaluated.
  • Diversity: ideally the questions will cover a wide range of potential topics. This diversity could be measured through text similarity metrics, diversity of where the question was asked or features of the editor who asked it, diversity in the namespaces/pages referenced in answers, or potentially even devising a taxonomy of potential topic areas and annotating questions with the areas they fall into.

When this dataset is compiled, a first task to determine its utility would be to evaluate the current Search API on the dataset and measure its effectiveness at different cutoffs -- i.e., is the correct page returned in the top-k results.
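That top-k evaluation could be sketched as follows. The search function and dataset here are toy stand-ins; a real run would call the Search API and iterate over the compiled dataset:

```python
def hit_at_k(results: list[str], correct_page: str, k: int) -> bool:
    """True if the correct page appears in the top-k search results."""
    return correct_page in results[:k]

def evaluate(dataset, search_fn, ks=(1, 5, 10)):
    """Fraction of questions whose correct page is in the top-k results."""
    scores = {k: 0 for k in ks}
    for question, correct_page in dataset:
        results = search_fn(question)  # ranked list of page titles
        for k in ks:
            if hit_at_k(results, correct_page, k):
                scores[k] += 1
    return {k: scores[k] / len(dataset) for k in ks}

# Toy stand-in for the Search API and a two-question dataset.
dataset = [("how to cite", "Help:Referencing"), ("notability", "WP:N")]
search = lambda q: ["WP:V", "Help:Referencing", "WP:N"]
print(evaluate(dataset, search, ks=(1, 3)))  # → {1: 0.0, 3: 1.0}
```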

Motivation

As an editor, it can be difficult to get prompt guidance on a particular issue. One would hope that it would either be easy to discover the relevant documentation to the question or get guidance from a fellow editor, but both of these have challenges:

  • Mentorship is difficult to scale to the needs of editors:
    • There are many programs/spaces for this mentorship -- e.g., Newcomer Homepage, The Teahouse, Noticeboards, Village pumps, Talk pages, WikiProjects, as well as plenty of off-wiki spaces -- but they might not be discoverable by the editors who would most benefit from the help.
    • The async/distributed nature of the wikis means that it can take a while before a question is answered.
    • Answering these questions can also burn out the editors willing to spend their time providing this support, especially when the questions are repetitive or require saying "no" because the requested support violates Wikipedia policies.
  • Existing help/policy documentation is very difficult to index/discover via traditional keyword-based search:
    • The documentation and related questions that other editors have asked that might be relevant are spread across numerous namespaces and may even be found on other wikis.
    • A lot of Wiki documentation uses highly specific terminology that is not easily discoverable via Search unless you know the name of the policy or the piece of wikitext syntax or name of the extension etc.
    • Many of the help/policy pages are quite long and actually combine many related pieces of guidance together. This makes it a needle-in-the-haystack challenge for Search to find the page that has the one snippet of content that's relevant.
    • It may be difficult for an editor to convert their actual question into effective keywords to search.

While asking questions and receiving mentorship is not just about receiving the "right" answer but also a valuable learning/social process, there are likely many frictions in this process that are not helpful and can frustrate editors and reduce the capacity for beneficial mentorship. This is a good space for improved tooling ranging from more effective search approaches to potentially even AI-generated answers to questions. Before these can be explored, however, there are many basic questions about how to evaluate any potential solutions in this space to determine their effectiveness.

This further aligns with the AI Strategy (Engage new generations of editors with guided mentorship) and builds on creating more time for human judgment, because the primary approach leverages the fact that AI excels at tasks such as information retrieval.

Resources etc.

  • SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions is a relevant past example of a more narrow Q&A challenge that might provide some inspiration for appropriate methods for this broader task.
  • WP:Teahouse is probably the best-studied mentorship space. Because the Teahouse is often semi-protected (so unregistered users and accounts that are not confirmed/autoconfirmed cannot post questions there directly), newcomers can instead get assistance via the {{Help me}} template on their talk page (with the {{Help me-helped}} template indicating an answer). Newcomers can also ask questions of mentors via the Newcomer Homepage (details); these follow a common pattern (Question from..., as can be seen on this user page or this one) and are tagged with the Mentorship module question tag, so they are easy to gather.
  • Reference desk: https://en.wikipedia.org/wiki/Wikipedia:Reference_desk
  • This task is about a dataset for evaluating search for mentorship-type questions. There are also more basic questions that would be beneficial to answer via qualitative methods -- e.g., what do editors need from mentorship? what questions will they ask publicly vs. privately? are AI-generated answers appropriate or should systems stop at providing improved access to relevant documentation? If AI-generated answers are appropriate, there are also interesting design questions about what oversight should be provided -- e.g., how to ensure transparency and some level of accountability/curation of the resulting answers. SpinachBot (AI bot for answering SPARQL-related questions) may be one approach, but presumably there are other designs and aspects to consider in balancing speed and usability for newcomers with curation, transparency, and the ability for more experienced editors to correct answers.

Event Timeline

Isaac renamed this task from [long] Develop a dataset for editor Q&A to Develop a dataset for editor Q&A.Sep 29 2025, 8:34 PM
Isaac claimed this task.
Isaac edited projects, added Research, Essential-Work; removed research-ideas.
Isaac updated the task description. (Show Details)

Weekly update:

  • I began considering what it would mean to extract nicely structured datasets of Q&A from pages -- e.g., WP:Teahouse Archives -- but paused that effort as I realized that a) it was non-trivial, and b) I wasn't fully sure yet what I would want to extract, so it seemed better to return to the question after spending some more time with the data. I want to process the HTML, but that's also a consideration: HTML will make it much easier to extract e.g., policy/help links and nice clean text from the conversations. Because I'm really only interested in the final question+answers, it's okay that HTML largely locks me into working with the current snapshot as opposed to the full history of the conversation. There are some parsers for wikitext+talk pages if I decide to change direction -- they likely wouldn't work exactly for my needs but might have some of the logic around e.g., extracting timestamps, usernames, etc.
  • I pivoted instead to starting some qualitative coding of editor Q&A. I began with newcomer questions via the Newcomer Homepage mentor module largely because there were some discussions happening about the impact of that module that I thought might benefit from more data. I grabbed 100 random mentor questions from English Wikipedia (query below) and have gotten through 15 of them (thanks to @TAndic for helping me think through my codebook). Still very small sample but some early takeaways:
    • 5 did not really receive responses (2 mentors seemed to be generally inactive at the time, 1 was a case of the question simply being ignored, and 2 were cases of the question being off-topic/unintelligible and eventually reverted).
    • Of the 10 with responses: 2 mentor responses came within ~20 minutes; 5 responses came in 12-20 hours; 2 took 1.5 days, and 1 came a month later.
    • The mentor responses were largely helpful/kind -- sometimes directly answering the question, sometimes asking for clarification. Mentees almost never responded back or thanked them though. More common actually was the mentee making a follow-up in a new section (twice) or on their own talk page (once). Only twice did they actually follow-up on the original question.
    • Of the questions where the intention was clearer, 7 were about editing existing articles and 4 were about creating new articles. Most questions were generic (e.g., "how do I create an article?") and probably would have benefited from some follow-up questions/answers. The needs were pretty diverse (general workflow, questions about policies, questions about wikitext/syntax, help with approving articles, etc.)
    • There were reasonable COI concerns in 4 of the questions. On the flip side, several of the newcomers were clearly acting in good faith and just trying to figure things out. For many it was unclear (generic question and not enough other activity to judge).
    • The outcomes for these 15 aren't great though a few mentees made it through:
      • No contributions for a month after question and then returned to edit occasionally
      • Asked again about their draft article on different talk page and on Commons for some reason, but then stopped editing
      • Made two more edits to their draft article about a month later but eventually declined for notability reasons and they never edited again
      • Kept editing but most of it was reverted for lack of sources. Eventually blocked.
      • Never edited beyond the question
      • Never edited beyond the question
      • Never edited beyond the question
      • Made edit but was reverted. Then made more policy-conforming edit and hasn't edited since. Likely COI though.
      • Never made edit they asked about or edited again. Page is still broken 2 years later from their initial attempts
      • Never edited beyond the question
      • Never edited beyond the question
      • Figured it out and kept editing
      • Figured it out and kept editing
      • Fixed typo and asked follow-up in wrong place and then stopped
      • Unclear what was going on with mentee but eventually they dropped off
  • Next steps for me will be to pull some samples from other sources to diversify my sample. Once I have a better sense of what's out there, I'll return to the question of whether to try to more automatically extract some of this or continue in a more manual fashion.
-- 600 (mentorship module question): 26471 instances (questions asked from homepage)
-- 603 (mentorship panel question): 6507 instances (questions asked from specific article context)

SELECT
  CONCAT("https://en.wikipedia.org/wiki/Special:Diff/", ct_rev_id)
FROM change_tag
WHERE
  (ct_tag_id = 600 OR ct_tag_id = 603)
ORDER BY
  RAND()
LIMIT 100

A general reflection too: it's really powerful to go through these editor journeys via the questions they're asking mentors and their Contribution history and to try to figure out what was going on. Many journeys are unfortunately quite short, with many misconceptions evident from their actions, but it's really interesting to see them try at creating user pages, getting help, making edits, etc. And then it's very heartwarming when you see an editor figure it out and keep editing!

Weekly update:

  • I started coding up some questions from Wikipedia Teahouse. After 5 of them, I'm going to pause though. They tend to be far more detailed/advanced and I think out-of-scope for my goals at the moment. These are questions that almost certainly do need the level of detail/context that an editor can provide in their reply (i.e. bad fit for just surfacing documentation). The Teahouse folks are also largely doing a good job of responding pretty quickly -- e.g., 3 of the 5 questions got responses in ~10 minutes. It's telling that in all three of those cases, the conversation was much more in-depth than the usual question + single response (9, 5, and 6 responses) and actually saw the question-asker continue to engage. For the other two (2 hours and 13 hours to first response), the question-asker never re-engaged.
  • Mentees essentially never thank their mentor (despite occasionally using this feature to thank others) and often don't respond to their initial thread if the question isn't answered in the first ~10 minutes. We may want to nudge mentees to thank their mentor when their response is helpful. The more accessible Thanks link on talk pages (details) should be a big help when it's deployed to English Wikipedia but perhaps there's a good place to nudge mentees to use this functionality when they appreciate a mentor response (as it's still slightly hidden).
  • I talked with a number of folks at WikiConference North America about this work, which led to some interesting ideas:
    • Mentorship has at least two goals: giving the question-asker specific feedback on what they should do next (competency) and advising on broader norms within Wikipedia (relatedness). The former is what I think we might address better via improved Search over documentation, while the latter remains important to preserve as a human interaction.
    • At some point, it might be valuable to consider what data could help mentors in assessing their work. This would have to be done carefully because folks are doing this out of their own goodwill and you don't want to transform it into another chore or just plain work -- i.e. it shouldn't feel like grading or surveillance. That said, statistics on mentee survival/success, response times (maybe too surveillance-y?), or other outcome-related data might help in surfacing particularly successful mentors or identifying areas for improvement. So maybe e.g., a public top-list of the best mentors by engagement/outcomes, with folks able to privately view their own statistics about response time etc.
    • In relation to discussions around the progression system (T395678), mentors might eventually be folks who could "sign off" on someone achieving a basic level of skills. This could be purely for feedback purposes or help build the confidence of new editors, a way for a mentee to "graduate" out of the mentorship program if mentors feel they have too many folks on their plate, or perhaps even be tied to receiving some sort of user access level if that's deemed helpful?
    • I'm less convinced about this but leaving it here as a thought: we may want to institute some sort of back-up similar to how Help me templates work. E.g., if a mentee isn't receiving a response within some timeframe, other editors could be pinged. Perhaps more appropriate would be doing that if e.g., the mentor has not edited in the last 24 hours? Generally I see some issues with slow responses on Growth Homepage mentor questions (especially as compared to Teahouse) though it's been pretty rare that a mentor doesn't respond in e.g., 24 hours so I don't really think this is a problem that needs to be solved.
  • Next steps: I'm realizing that there are a lot of approaches to getting feedback (many listed in the description of this task) but it might be helpful to describe them in a bit more organized way -- e.g., whether it pings an individual, a small group, or a large group; how easy to use; how discoverable; etc. This will also help me in deciding whether I want to continue coding up the Newcomer Homepage Mentor questions or switch to a third source.

Another quick thought: I was curious how many mentees actually had an email (necessary to get a notification that their mentor had responded if they logged out from Wikipedia) and it's up around 91% so that's not necessarily a major issue here as far as mentee drop-off. EDIT: added authentication check and that's only 67% of mentees, so perhaps a larger factor.

SELECT
  COUNT(DISTINCT(user_id)) AS num_mentees_who_asked_question,
  COUNT(DISTINCT(user_email)) AS num_mentees_with_email,
  -- NULL (not "") so COUNT(DISTINCT ...) ignores unauthenticated users
  COUNT(DISTINCT IF(user_email_authenticated IS NULL, NULL, user_email)) AS num_mentees_with_email_authenticated
FROM change_tag ct
INNER JOIN revision r
  ON (ct.ct_rev_id = r.rev_id)
INNER JOIN actor a
  ON (r.rev_actor = a.actor_id)
INNER JOIN user u
  ON (a.actor_user = u.user_id)
WHERE
  (ct_tag_id = 600 OR ct_tag_id = 603)

Weekly update:

  • No major update as other urgent work took up most of my time. Had a good discussion with Moyan Zhou of UMN though about the role of AI in mentorship that sparked some thoughts about how AI could potentially both help newcomers with rephrasing their question and mentors in digging up links etc. to make it easier to respond, but largely stay out of the middle of the relationship where that human connection is important.

Related insight from @AJayadi-WMF 's Research:Understanding Organizers' Impact on Newcomer Growth:

"One organizer in this research used to be a Growth mentor. They are taking a break right now. They received so much repeated questions. Usually, what they try to do is giving newcomer editors links to resources that they can read. Even though they are currently taking a break, they still see the value of Growth mentorship feature and open to explore how to socialize this feature to more organizers who would like to be Growth mentors."

Thanks for calling that out @KStoller-WMF ! In my informal conversations with lots of experienced editors, exhaustion is definitely a factor though the motivation/desire to help still exists. I definitely came into this wondering how to help mentees but the more I do it, I think the "how do we help mentors get more enjoyment out of the process" question is also crucial.

From @Easikingarmager's recent work on guided article/section creation, a few other relevant points:

  • On existing usage of GenAI: Some newcomers, mostly but not limited to idwiki, use ChatGPT to understand and paraphrase source content, or to get a reference structure based on which they can look for sources.
  • On the importance of surfacing relevant policy: Guidelines remain an integral reference; even experienced editors who are familiar with creating wiki articles still refer to them for queries while writing an article.
  • On wanting mentors to help navigate all the potential guidance and off-wiki mentorship: Most editors ask other editors for assistance via messaging channels, Teahouse, help page, community training. In idwiki, editors are connected to reviewers and other editors on messaging app groups such as Discord, Telegram and WhatsApp. These could be groups created during editing training/workshops and article creation or improvement campaigns. Some enwiki editors such as those belonging to a university wiki fan club were also connected via messaging apps to other editors and more experienced editors who they turn to for assistance. A few editors prefer this option because the guides can be overwhelming with the amount of information.

Other updates:

  • I started to document many of the mentorship/support spaces around the wikis -- e.g., Newcomer Homepage, Edit Requests, Teahouse, Help Me Template, Help Desk, AfC Help Desk, Reference Desk, Edit-a-thons, Talk Pages, LLMs, Internet Search, Off-wiki chats (including IRC help channel), and various portals (WP:Questions, WP:FAQ, Help:Menu, WP:Request Directory, WP:Help button, Help:Getting started).
  • There are A LOT of spaces and, if I have the time, I'd like to track down a bit more how widely used/viewed each one is and try to get a sense of how an editor discovers them. But the Newcomer mentor questions and Teahouse feel like the most discoverable thus far for newcomers, so I might stick to them for generating ideas for a more curated dataset.
  • For a different Search-related dataset project, I had made the recommendation of not having individual queries but actually having three forms of each query: pure keywords (optimized for current search), a looser natural-language style (how folks often ask when formulating questions), and a fuller natural-language style that actually contains all the necessary context (how folks ideally would ask questions). I might end up applying that here too because I had been struggling with the tension between the messiness of many current questions posted on-wiki and what you hope the question actually ends up being after a few iterations.
  • A number of the portals have search functionality built-in to search their archives. I'd like to look to see what sorts of questions are being asked via that Search, as that might be useful for understanding how editors phrase questions currently (assuming they don't map them to keywords).
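The three-forms idea from the recommendation above might look like this for a single question (all of the wording below is invented purely for illustration):

```python
# One hypothetical question expressed in the three proposed forms;
# the text and the key names are illustrative assumptions.
query_forms = {
    # pure keywords, optimized for current keyword search
    "keywords": "add citation article",
    # looser natural language, how folks often actually ask
    "natural_loose": "how do I add a citation?",
    # fuller natural language carrying all the necessary context
    "natural_full": (
        "I'm editing the article on my town and want to add a citation "
        "to a newspaper article using the visual editor. How do I do that?"
    ),
}
assert set(query_forms) == {"keywords", "natural_loose", "natural_full"}
```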

Updates:

  • I started to look into the {{Help me}} template (notebook + ping @MGerlach as the person who flagged this pathway to me). The code is hacky because we don't have a nice content diff dataset for talk pages so I had to find Help me sections post-hoc and then try to guess who added the request etc., but there were at least 1700 instances on English Wikipedia of editors whose account was <= 10 days old using the template so this could be a good dataset to mine for more newcomer questions. These almost exclusively happen on the newcomer's user page (usage on article talk pages is much more likely to be more experienced editors).
  • I met with Moyan and Tiziano (external researchers) to discuss some ideas about where this could go. We're going to meet again in early December but they're both excited about the space. Looking ahead, we will work to expand the qualitative coding I'm currently doing of Newcomer Homepage questions (and I think I'll add in the Help Me questions from newer users). This already has revealed quite a bit but we'd then choose one potential space for intervention and build out a prototype and evaluate it. Some of the potential intervention ideas (please chime in if you have others) that have already come from our discussions:
    • Natural-language search of Policy/Help namespaces. This was what I came into the project thinking and very likely will still pursue because it should be effective given that these namespaces are relatively constrained in size, not super dynamic, contain a fair bit of jargon, and have many massive/diverse pages that challenge the utility of keyword search. This is also great for prototyping because it's almost purely back-end and easy to incorporate into tooling to test out if we get to that point. Plus it aligns nicely with other work on Semantic Search happening.
    • Same as above but of FAQ / Question spaces only. Essentially rather than providing directly the answer, this would help editors find similar questions and see how other editors responded (with answers, asks for clarification, caution about breaking policies, etc.).
    • LLM agent to help editors rewrite their questions so they are easier to answer. This could support better Search as well, but also ensure there's enough context for an editor to answer directly as opposed to having to first ask for a follow-up (with all the newcomer drop-off that occurs the longer the conversation goes). I like this as a really nicely constrained and principled use of AI that doesn't get in between the interactions between editors (it just tries to ease things from the sidelines). Some similarities to the ideas proposed by Cristian Danescu (meta), but harder to prototype because it needs to be installed by newcomers, so that requires either essentially a full Product deployment or a very limited field study at edit-a-thons where you could individually install it for folks.
    • "I'm just a human" auto-responder for mentors. This is kind of a combination of the above two ideas but with more interesting prototyping opportunities. Essentially, when a mentee asks a question on their mentor's talk page and the mentor has opted in, a bot would automatically collect that question, query an AI agent, and post a quick follow-up depending on the level of context provided. Probably always included is some boilerplate language about how editors are people and might not be active at this moment, so please be patient and check back. If the question has enough info, maybe the response includes a few relevant links from on-wiki documentation / question banks based on the Search prototype. Maybe if the question is lacking context, the bot asks the editor to clarify. Maybe the AI even tries to answer the question. This could be configurable as well -- e.g., an editor could opt in to just the Search links but no answer, or just the clarification component but not the others.
    • Tool for helping newcomers keep track of the questions they've asked. It'd be great to be able to track whether the question was answered etc. but that gets a lot trickier because questions get moved around as pages get archived. Easiest would be to just retain the original section link and allow the DiscussionTools extension to handle discovery of the section even if it's been moved. And then the improved Thank/Reply functionality for the editor figuring out how to follow-up.
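The question-rewriting idea above could be prototyped with a simple prompt template before wiring up any model. The instructions in this template are an invented sketch, not a tested prompt, and the model call itself is deliberately left out:

```python
# Hypothetical prompt template for the question-rewriting idea; the
# instruction wording is an untested assumption.
REWRITE_PROMPT = """You are helping a new Wikipedia editor get their question answered.
Rewrite the question below so that it:
- states what the editor is trying to do and on which page (if mentioned),
- keeps all concrete details from the original,
- adds no new facts,
- is answerable without a follow-up clarification.

Original question:
{question}

Rewritten question:"""

def build_rewrite_prompt(question: str) -> str:
    # The resulting string would be sent to an LLM; model choice and
    # the API call are intentionally out of scope for this sketch.
    return REWRITE_PROMPT.format(question=question)

prompt = build_rewrite_prompt("my article got deleted why")
assert "my article got deleted why" in prompt
```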

Updates:

  • I built a prototype for natural-language search on top of the Help/Policy namespaces on English Wikipedia. There's a backend API that I can use for testing against an eventual dataset of queries (if we have explicit "correct" answers) and a UI for exploration. The API/UI show the natural-language search results alongside what is returned by our existing keyword search as guided by these entrypoints curated by editors.
  • The nearest-neighbor searches are brute-force (as opposed to using an approximate index), so a search takes a second or two. I'm using the Qwen3-Embedding-0.6B model for embeddings; anecdotally, it showed a strong improvement over the much smaller standard sentence-transformers models. I suspect adding a reranking model would help even more, but that would require storing the text too (not just embeddings) and slow things down a good bit further.
  • This is actually the third iteration -- the first one was all Help/Wikipedia namespaces but it was way too messy with all the admin noticeboard etc. pages. Second was only top-level pages (no subpages) to remove all that discussion but that was too coarse because I lost some important Q&A archives and even though the results were higher quality, they mixed together very different contexts -- e.g., policies, help documentation, Q&A. So in this current iteration, I have explicitly separated out the different sources so that in theory they could be separately contextualized for an end-user -- e.g., here are similar questions, here is relevant policy, here's some how-to etc.
  • A future iteration would probably:
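The brute-force nearest-neighbor search described above is essentially a cosine-similarity scan over all stored embeddings. A minimal sketch, with toy 3-d vectors standing in for real Qwen3-Embedding output:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=5):
    """Brute-force cosine-similarity search: score every document,
    return indices of the k best matches (highest similarity first)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q  # one dot product per document
    return np.argsort(-sims)[:k]

# Toy 3-d "embeddings" standing in for real model output.
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.1, 0.0])
print(top_k(query, docs, k=2))  # indices of the two nearest documents
```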

Updates:

  • I updated the code for extracting passages to also grab past questions asked via Growth's mentorship module (i.e. sections on user talk pages that match the format Question from... (<date>)) and questions asked via help-me templates on user talk pages (presence of a help-me-* template). I'm working on generating their embeddings so they can be added to the question-bank corpus in the prototype. (prototype now updated)
  • Moyan shared her code for her previous experiment with providing feedback to new editors via AI: https://github.com/phoebexxxx/newcomer-llms-user-study/tree/main
    • The core functionality is a nearest-neighbor index on top of several core content policies for RAG purposes, combined with an instruction to the agent (gpt-4o-mini) to rephrase the participant's question for better retrieval. I have a working nearest-neighbor index, but I think that "please rephrase this question for..." is a key piece of functionality to explore when we prototype workflows with an LLM.
  • I spoke with @Trizek-WMF about his experiences/thoughts around mentorship. My summary below:
    • Answers are often quite slow with 1:1 mentorship (I've been seeing this too in the data).
    • Lots and lots of repeat questions (I've been seeing this too in the data).
    • A number of editors think their mentor is a bot or AI. Makes me think on one hand that having a bot respond to newcomer questions (one idea we have) could exacerbate this but it might also be a reminder to emphasize that they have a human mentor as well who can provide more context/support/etc. It also might be an opportunity to more clearly set expectations for the newcomer.
    • Sometimes newcomers seem to think their mentors are responsible when things don't go well for them. That's hard to do something about but it makes me wonder whether there aren't ways to help mentors better track their mentees so they can step in earlier (if needed) -- e.g., alerts when a mentee is reverted or a form of RecentChanges that is automatically filtered to their mentees. I don't think this latter exists but should be possible to build as rc_actor is a field in RecentChanges so the hard part is deploying a table that has the actor IDs for a mentor's mentees. (EDIT: does exist; see next comment)
    • Because it takes a while for mentors to respond or mentees to return for the answer, pages have often been archived. While DiscussionTools should fix this issue, in reality the "This topic could not be found on this page, but it does exist on the following page:..." message might be missed (or perhaps just confusing for a newcomer?).
    • Different wikis definitely have different systems/norms around mentorship. French Wikipedia for instance doesn't really use help-me templates but does have a Teahouse equivalent (Forum des nouveaux).
    • Mentorship is not recognized within spaces like Admin bids (in the same way that e.g., experience patrolling is). Part cultural but also might be a function of how hard it is to summarize one's impact via mentorship. This is an opportunity for making available more statistics about positive outcomes from mentorship.
    • Some potential issues with answers: many experienced editors use wikitext but newcomers are on VE; rules evolve and so old answers may not always be right; rules evolve and so documentation may be behind; many "rules" aren't written down in a formal way.
    • He thought a bot that can help handle the repetitive questions very quickly would be welcomed by many folks as it would relieve pressure on quick responses and handle the less interesting inquiries.
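The Question from... section matching in the first bullet above could be sketched with a regex. The exact heading format (username plus a timestamp in parentheses) is an assumption here; real headings may vary across wikis and over time:

```python
import re

# Assumed heading format: "Question from <user> (hh:mm, d Month yyyy)";
# this is an illustrative guess, not a verified on-wiki spec.
QUESTION_HEADING = re.compile(
    r"^Question from .+ \(\d{2}:\d{2}, \d{1,2} \w+ \d{4}\)$"
)

headings = [
    "Question from ExampleUser (14:35, 3 October 2025)",
    "My draft was declined",
    "Question from AnotherUser (09:02, 21 November 2025)",
]
matches = [h for h in headings if QUESTION_HEADING.match(h)]
print(matches)  # only the two "Question from ..." headings survive
```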

> Sometimes newcomers seem to think their mentors are responsible when things don't go well for them. That's hard to do something about but it makes me wonder whether there aren't ways to help mentors better track their mentees so they can step in earlier (if needed) -- e.g., alerts when a mentee is reverted or a form of RecentChanges that is automatically filtered to their mentees. I don't think this latter exists but should be possible to build as rc_actor is a field in RecentChanges so the hard part is deploying a table that has the actor IDs for a mentor's mentees.

Mentors have a link on their Mentor Dashboard that will navigate them to a filtered view of RecentChanges, limited to their mentees:

Screenshot 2025-11-29 at 6.17.24 AM.png (1×1 px, 228 KB)

Mentors have two different mentee filters available in Recent Changes:

Screenshot 2025-11-29 at 6.17.59 AM.png (316×1 px, 65 KB)

We don't currently send mentors any sort of notifications when their assigned mentees are reverted.

Mentors have a link on their Mentor Dashboard that will navigate them to a filtered view of RecentChanges, limited to their mentees:
Mentors have two different mentee filters available in Recent Changes:

Oh I love this @KStoller-WMF -- thank you for the correction!

A few more details and sources. :)

A number of editors think their mentor is a bot or AI.

The English Wikipedia 2025 Successful Newcomers Survey, which is not yet public, highlighted this, along with other interesting elements.

  • 1% of interviewed newcomers think that their mentor is "definitely a bot"
  • 4% of interviewed newcomers think that their mentor is "probably a bot"
  • 12% of interviewed newcomers don't know if their mentor is a human or a bot

Confidence that the mentor is a human increases with the number of edits the newcomer has made.

it makes me wonder whether there aren't ways to help mentors better track their mentees so they can step in earlier (if needed)

TBH, as a mentor, this is a very time-consuming task. Even with close monitoring, it is quite difficult to catch.
Apparently, this happens when the newcomer gets a welcome message on their talk page, signed by their mentor, before getting a warning message. Signatures being at the bottom of all messages, plus warning messages not looking like a proper conversation, can lead the newcomer to believe that their mentor left them the message.

talk page comprehension.jpg (511×660 px, 31 KB)

French Wikipedia for instance doesn't really use help-me templates

I'm not sure how many wikis still use them (though they remain available), or how many responses a user would get from that complicated process (it requires knowing that the template exists, how to copy and paste it, and where to do so).

Thanks for this additional data @Trizek-WMF !

I'm not sure if many wiki still use them

Yeah, re: help-me templates, I should be able to pretty easily determine how many are used per year. I've been meaning to add the earliest date associated with each section to these question banks anyway. I'm realizing that there is a lot of outdated feedback that might not be relevant anymore, and it would be useful to be able to either filter based on date or at least rerank by it.

Apparently, this happens when the newcomer gets a welcome message on their talk page, signed by their mentor, before getting a warning message. Signatures being at the bottom of all messages, plus warning messages not looking like a proper conversation, can lead the newcomer to believe that their mentor left them the message.

That's a particularly interesting one. Something to remember re: design to at least not exacerbate.

Weekly updates:

  • I was curious about the frequency of certain question types given that, qualitatively, I've heard that newcomers often have the same questions. As a quick-and-dirty way of investigating common question-and-answer patterns, I applied some basic k-means (1,000 clusters over 254,137 questions from the question banks on English Wikipedia) and started qualitatively going through the largest clusters by looking at ten random examples from each. This helped identify a lot of irrelevant index pages (essentially lists of links to other archives) that I could trim in later variants. Below are the top relevant clusters, and you can see a few patterns: essentially, a lot of non-questions and then a few heavily-recurring questions around references, images, links, infoboxes, user boxes, and handling vandalism. While this was exploratory, I could later reuse this à la SemDeDup to help remove duplicates from the data when identifying instances of core questions for evaluation data.
    • mostly off-topic questions that are redirected to https://en.wikipedia.org/wiki/Wikipedia:Reference_desk (n=722)
    • people saying hi! (n=697)
    • mostly just "what is your question?" type of questions or things that get redirected to Reference Desk etc. (n=677)
    • the people want User Boxes! (n=621)
    • people reporting vandalism (n=603)
    • Editors leaving Wikipedia and asking how to delete accounts :( (n=593)
    • largely empty usages of {{Help me}} template (n=588)
    • largely nonsense questions/statements (n=539)
    • asking for help to resolve reference errors they caused, often surfaced by https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Qwerfjkl_(bot)_17 which includes an easy way to ask for help (n=533)
    • how to add an infobox (but usually they don't know what it's called) (n=521)
    • how do I handle this reverting of mine or other's content? (n=500)
    • challenges with uploading images (n=499)
    • issues with adding (external) links (n=489)
    • A bit more of a hodgepodge but mostly please review, or fix, or help me find an edit I lost (n=475)
    • (and many more remaining for me to analyze)
  • Met with some folks in Product to discuss where this might fit in and there isn't anything in the current fiscal year or next fiscal year planned yet for mentorship. I'll look for less formal ways of testing out new functionality in the meantime so we can continue to learn about this space.
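To make the clustering step above concrete, here is a minimal sketch of the k-means + inspect-largest-clusters loop. The embeddings are random stand-ins, the cluster count is scaled way down, and the hand-rolled k-means is illustrative; the actual analysis presumably used a real embedding model and an off-the-shelf implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real question embeddings (e.g., from a sentence-embedding
# model); 1,000 random 64-dim vectors here in place of the ~254k questions.
embeddings = rng.normal(size=(1000, 64))

def kmeans(X, k, iters=10, seed=0):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    r = np.random.default_rng(seed)
    centroids = X[r.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members (skip empties).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(embeddings, k=20)

# Qualitative pass: look at the largest clusters via a few random members.
sizes = np.bincount(labels, minlength=20)
for j in sizes.argsort()[::-1][:3]:
    members = np.where(labels == j)[0]
    sample = rng.choice(members, size=min(10, len(members)), replace=False)
    print(f"cluster {j}: n={sizes[j]}, sample indices: {sorted(sample.tolist())}")
```

The SemDeDup-style reuse mentioned above would then prune near-duplicate pairs (high cosine similarity) within each cluster rather than just eyeballing them.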

Weekly updates:

  • There's now a Meta page with a basic summary of where things stand: https://meta.wikimedia.org/wiki/Research:Understanding_newcomer_mentorship_on_Wikipedia
  • I've been reading through the reflections captured by @AJayadi-WMF (thank you!) about organizers and their work to support newcomers (meta). Things that stand out to me:
    • I at some level know that a lot of wiki support happens off-wiki, but it really has been hammered home for me between the forthcoming results from the Successful Newcomers Survey and the mention, by essentially every single organizer interviewed, of their usage of Telegram, Facebook, WhatsApp, etc. to maintain connections with editors and answer questions.
    • Many mentions of the importance of mentorship for ushering the new editors into the wiki world and answering the many questions. For example:
      • We have a mentor-mentee approach where we pair experienced editors with new editors. For us, this approach is very effective. When newbies are paired with mentors, they can reach out to them very easily to ask questions.
      • Another highlight is the Wiki Apoia mentorship program, which provides personalized support to new editors. Experienced community members guide participants through practical steps, deepening their understanding of Wikimedia principles, editing techniques, and collaborative dynamics. The program helps editors transition from initial engagement to autonomous and consistent participation in Wikimedia projects.
      • For newcomer editors: mentorship.
    • Many mentions of the importance of helping newcomers feel connected to the broader projects/communities for long-term retention. For example:
      • We see newcomer editors as "social beings", not just as people who edit. As such, we support them to build a sense of belonging...
      • Our main retention strategy is to foster a sense of belonging and shared purpose within the community.
      • ...new attendees regularly come to our events because someone else told them about a group of people in Mexico with whom they can talk and interact to learn how Wikipedia works and how to contribute.
      • From a motivational perspective, we want people, including newcomers, to have sense of belonging and to keep them interested in contributing.
  • This blogpost was shared with me that touches on the usage of AI for answering technical questions at Anthropic by someone who spent a lot of time helping newcomers. Very quick read and fun (though dark)! It touches on the promise: the rate of questions skyrocketed after switching to AI support, presumably because people have lots of questions and seem more likely to ask them when the bar is very low. The concern in the piece relates to their role being automated away but I wonder about the subtext about them as a senior member of their community losing their sense of value and connection with these newcomers. It's possible that they actually feel that they can engage more deeply with the questions that still do come through, but they don't explicitly say anything about that one way or another.
  • Regarding my initial goal of building a dataset of questions: I re-ran the k-means but at 250 clusters, pulled five random questions from each cluster, and began going through and selecting one from each cluster to form a dataset that will hopefully span the sorts of questions being asked on-wiki. I'm also copying the question from the text and indicating what sort of response would be helpful (request clarification, redirect them to a specific page, etc.). I did 45, but I began to feel that the dataset really overrepresented hyper-specific questions from slightly more experienced editors and had a lot of older questions in it too. So I finally extracted the year when each question in my corpus was posted. Ignore slightly odd years -- e.g., mentor questions before the mentor module existed. I just used simple logic for assigning dates (grab the first date that looks like a standard user signature and use that), so presumably there are a few edge cases where it's wrong, but not enough to change the patterns. I'll likely repeat this but take out anything pre-2024, which will ensure I'm getting more recent questions and give me a much higher proportion of mentorship-module questions (generally true newcomers), which is closer to my goals.
      help-desk  teahouse  mentor  help-me  total-questions
2004         57         0       0        0               57
2005       2806         0       0        0             2806
2006       7286         0       0        3             7289
2007      10885         1       0        3            10889
2008       7217         0       0        4             7221
2009       6873         0       0       18             6891
2010       6338         0       0      504             6842
2011       6506         1       0      965             7472
2012       6433      2063       0      902             9398
2013       5857      3514       0     1036            10407
2014       4716      3870       0     1259             9845
2015       4104      4356       1     1772            10233
2016       3931      3888       0     1903             9722
2017       3678      4713       0     1790            10181
2018       3633      5484       0     1491            10608
2019       3836      6073       1     1366            11276
2020       4642      8391       0     1598            14631
2021       4241      9231     880     1177            15529
2022       3351      6734    2742      748            13575
2023       3788      6274    4849      574            15485
2024       3560      6316    9670      531            20077
2025       2436      4581   14267     1108            22392

# totals
help-desk          106174
teahouse            75490
mentor              32410
help-me             18752
total-questions    232826
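The "grab the first date that looks like a standard user signature" heuristic can be sketched roughly like this. The regex targets the default English Wikipedia timestamp format ("12:34, 5 January 2024 (UTC)"); the function name is illustrative, not the actual extraction code.

```python
import re
from datetime import datetime

# Default English Wikipedia signature timestamps look like
# "12:34, 5 January 2024 (UTC)". This pattern grabs the first such match.
SIG_DATE = re.compile(
    r"\d{2}:\d{2}, (\d{1,2}) (January|February|March|April|May|June|July|"
    r"August|September|October|November|December) (\d{4}) \(UTC\)"
)

def first_signature_year(section_text):
    """Return the year of the first signature-style timestamp, or None."""
    m = SIG_DATE.search(section_text)
    if not m:
        return None
    day, month, year = m.groups()
    # Parse just to validate the date; only the year is aggregated on.
    datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
    return int(year)

text = ("How do I add a reference? Example (talk) 09:15, 3 March 2024 (UTC)\n"
        "Reply here. Mentor (talk) 10:02, 4 March 2024 (UTC)")
print(first_signature_year(text))  # first matching timestamp -> 2024
```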

This blogpost was shared with me that touches on the usage of AI for answering technical questions at Anthropic by someone who spent a lot of time helping newcomers. Very quick read and fun (though dark)! It touches on the promise: the rate of questions skyrocketed after switching to AI support, presumably because people have lots of questions and seem more likely to ask them when the bar is very low. The concern in the piece relates to their role being automated away but I wonder about the subtext about them as a senior member of their community losing their sense of value and connection with these newcomers. It's possible that they actually feel that they can engage more deeply with the questions that still do come through, but they don't explicitly say anything about that one way or another.

Thank you for sharing this. From this reading, I see a lot of value in that AI support that could apply to our context:

  • newcomers have immediate answers to their questions
  • experienced users are getting freer to work on more specific questions
  • like with horses, the role changes from being the workhorse of questions to being a companion: a form of life you can spend more time with and travel with in a more relaxed way.

There is a social aspect here that AI can't replace, while still providing benefits to both the newcomer and the experienced user.


The numbers you pulled regarding the number of questions asked at different places are really interesting. They confirm the importance of mentorship. I made a graph of them for readability:

image.png (599×979 px, 55 KB)

Running these numbers on another wiki would be interesting, to see if the trend is the same. However, none of the wikis with a decent size for comparison (German, French, Italian, Spanish and Portuguese) have all the elements needed:

  • German has no help-me template
  • French Wikipedia has everything, but they used Flow for a while, which means that the contents aren't easily query-able.
  • Italian has no help-me template or Teahouse-like place
  • Spanish has mentorship, but not for all newcomers
  • Portuguese merged the help desk and the Teahouse

There is a social aspect here that AI can't replace, while still providing benefits to both the newcomer and the experienced user.

Yeah, and I think the key is here that we can choose to design these interactions to be more public so the newcomers get the benefit of quick responses but the community still has the ability to curate and build these relationships between newcomers and more experienced editors.

The numbers you pulled regarding the number of questions asked at different places are really interesting. They confirm the importance of mentorship. I made a graph of them for readability:

Love this - thank you! With the caveat that 2025 isn't over yet, so we shouldn't read too much into any dips there, this paints a much clearer picture.

Running these numbers on another wiki would be interesting, to see if the trend is the same.

That's mostly doable, though each one will require a bit of code tweaking -- e.g., I don't think Flow is a major issue, but I see that French timestamps on discussion comments aren't in UTC, so I'll have to adjust my date-extraction logic for that, and I'm sure there are other little things. If I get a moment, I'll look into running this on some other languages. The main thing is just knowing the relevant template names for questions (e.g., Help Me) or subpages where every section is a question (e.g., Teahouse / Help Desk). Growth mentor questions are very easy to count as they carry uniform edit tags, so they just require looking at edit metadata and not the actual content of pages.

Progress-wise: I started over with question clusters pulled only from 2024+2025 but haven't gotten very far in the extraction of actual questions. One anecdotal observation is that Mentorship Module questions seem to have a higher rate of generic 'hellos' than questions on the Teahouse etc., presumably because of the lower barrier to asking, so the counts for Mentorship Module questions come with the caveat that # of usages does not equal # of questions.
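For non-English wikis, the date-extraction tweaks mentioned above include local month names and non-UTC timestamps. Here is a hedged sketch for French-style signature dates; the exact on-wiki format ("3 mars 2024 à 09:15 (CET)") and the CET/CEST offset handling are assumptions to verify against real data.

```python
from datetime import datetime, timedelta

# French-Wikipedia-style signature dates (assumed format):
# "3 mars 2024 à 09:15 (CET)". Month names and the CET/CEST offset
# both need handling before comparison with UTC-based timestamps.
FR_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}
TZ_OFFSETS = {"CET": 1, "CEST": 2}  # hours ahead of UTC

def parse_fr_signature_date(s):
    """Parse '3 mars 2024 à 09:15 (CET)' into a UTC datetime (assumed format)."""
    date_part, rest = s.split(" à ")
    day, month, year = date_part.split()
    time_part, tz = rest.split(" (")
    hour, minute = time_part.split(":")
    local = datetime(int(year), FR_MONTHS[month], int(day),
                     int(hour), int(minute))
    # Subtract the local offset to get a UTC timestamp.
    return local - timedelta(hours=TZ_OFFSETS[tz.rstrip(")")])

print(parse_fr_signature_date("3 mars 2024 à 09:15 (CET)"))  # 2024-03-03 08:15:00
```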

Weekly updates:

  • I've been going through the 2024-2025 question clusters, and wow, there are a lot of questions about article creation / requests for help understanding feedback on a draft article. I don't know how prevalent article creation is among newcomers, but it definitely generates a lot of questions!
  • I'll be taking off for the end of the year now but will make an update to the Meta page in the new year and then close this task.
  • There is still much to learn, but I feel like I have a much stronger grasp of the nature of the challenges facing newcomers + mentors now. Essentially, there are a large number of rather confused newcomers (not surprising, as I'm analyzing folks who are explicitly asking for help on-wiki), many of whom are acting in good faith even if not always editing within the lines of Wikipedia's policies. There are also a lot of excellent mentors, but they are stretched pretty thin. And the back-and-forth required to figure out how to best help a newcomer often seems to take too long and rarely takes advantage of the mentor's expertise.
  • There are open questions about the next steps -- i.e. what is an appropriate intervention to test and how should we go about doing that? What is an appropriate role for AI to play in this space?

Hi @Isaac , I was interested to learn more about your first update from 23 Dec. Did you run any basic analysis on the article creation questions and/or have you filtered and compiled raw data anywhere that would be easy to review? I'm curious what we could learn from this subset of questions that could inform ongoing work @Pginer-WMF is helping lead. Thanks!

Did you run any basic analysis on the article creation questions and/or have you filtered and compiled raw data anywhere that would be easy to review?

@Easikingarmager my initial observation was just based on the number of clusters that seemed to be about article creation. But I just did a quick random sample of 100 questions from 2024/2025, and 35 of them were about some facet of article creation. I'll send you a link to the data separately in case you want to explore the topics more, and I'm happy to generate larger samples to comb through. I don't have a great way to automatically identify them, so I was just skimming and making a judgment. You could probably come up with a regex that's okay, or use an LLM, but given the high occurrence rate (~1/3), it's probably just easier to scan samples of all questions and skip over the non-article-creation ones if you want to dig into it more.
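As a hedged illustration of the "regex that's okay" idea: something along these lines would flag many article-creation questions, though the keyword list below is a guess (not validated against the data), and as noted, skimming samples is probably easier given the ~1/3 occurrence rate.

```python
import re

# Hypothetical keyword pattern; tune against a skimmed sample before trusting it.
ARTICLE_CREATION = re.compile(
    r"\b(create|creating|start|starting|write|writing|submit|submitting)\b"
    r".{0,40}\b(article|draft|page)\b"
    r"|\bdraft\b.{0,40}\b(review|declined|rejected|approved?)\b",
    re.IGNORECASE | re.DOTALL,
)

questions = [
    "How do I create my first article about a local school?",
    "My draft was declined, what do the reviewer's comments mean?",
    "How do I fix this broken reference?",
]
flagged = [q for q in questions if ARTICLE_CREATION.search(q)]
print(flagged)  # the first two match; the reference question does not
```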

Some additional context about what sorts of questions are asked related to article creation:

Getting Started  17  # e.g., "how do I write my first article?"
Approval/Review  14  # e.g., "I created a draft. How do I get it reviewed?"
Notability        2  # e.g., "I have X sources. Is that okay?"
Miscellaneous     1  # in this case, it was a user account naming issue preventing them from creating an article
Editing Help      1  # e.g., "How do I fix this reference in my draft?"

Okay, resolving this task (thanks everyone for tagging along with my ruminations and the many questions/inputs)! If there are further questions, please don't hesitate to still add here or reach out directly or put on the meta talk page. The meta page will likely still continue to evolve and future tasks may pick up additional work in this space.

I have prepared an initial dataset (sampling code; dataset), which was the initial goal of this exploration. Though after spending so much time exploring this space and with the data, I'm no longer convinced that this dataset will help with future offline evaluations. It's useful for giving a sense of the range of questions and redundancy, but the responses to these questions are quite variable and rarely reduce down to sharing a specific link -- e.g., here's the relevant policy or relevant help article. I think the range of potentially acceptable answers is such that any proposed interventions would need to be run offline and then manually evaluated for acceptability of responses (as opposed to pre-defining what is acceptable). Some descriptive stats below. For that reason, I did not put energy into scaling up the dataset as I think 50 is sufficient for exploratory work.

Would Search help?                    # Questions
Yes                                   28
No - requires context / editor input  16
No - not a (relevant) question         6
Keywords                                                                    # Questions
(none)                                                                      22
your first article                                                           5
getting started                                                              3
conflict of interest; article subject                                        3
edit request                                                                 2
cross-wiki editing                                                           1
edit request; getting started                                                1
wikipedia                                                                    1
redlink                                                                      1
interwiki links                                                              1
notability people                                                            1
conflict of interest; notability people; article subject                     1
requested articles                                                           1
newcomer homepage                                                            1
getting started; your first article                                          1
notability people; your first article                                        1
scholarship                                                                  1
your first article; infobox; references; external links; notability people   1
your first article; article subject; conflict of interest                    1
serial comma                                                                 1