Task
Build a dataset of realistic inputs and outputs for a system that aims to support editors who have questions. The inputs are likely questions that have already been asked on wikis -- e.g., extracted from WP:Teahouse -- but the outputs might take a few reasonable forms:
- Link to the relevant content in a policy/help page. Where feasible, this could be provided at several granularities: the relevant sentence, paragraph, section, page, or even namespace. Recommending a namespace might seem trivial or non-useful, but one design for an agentic system might involve recommending where to search in the first place.
- Link to a similar question that has been asked. To keep this discovery task from being trivial, either the question itself would need to be masked from the output data, natural examples would need to be found on-wiki, or the questions would need to be transformed to fuzz them.
- Text of the actual answer provided to the question.
Considerations:
- Size: larger datasets provide more detail about the performance of a given approach and its potential errors, but quality is likely the most important factor. A system in this space would probably not be fine-tuned; instead it would use existing pre-trained language models for generating embeddings or LLMs for selecting answers. Realistically, a few hundred high-quality and diverse examples are much more valuable than 1,000 or more of mixed or unknown quality.
- Quality: not all answers provided will necessarily be correct. This may require manual evaluation, but filtering on answering-editor expertise or other parameters might help reduce the scope that needs to be evaluated.
- Diversity: ideally the questions will cover a wide range of potential topics. This diversity could be measured through text similarity metrics, diversity of where questions were asked or features of the editors who asked them, diversity in the namespaces/pages referenced in answers, or potentially even devising a taxonomy of topic areas and annotating questions with the areas they fall into.
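One way to operationalize the diversity consideration above is a simple pairwise text-similarity score over the candidate questions. The sketch below is a hypothetical, stdlib-only illustration: Jaccard similarity over word sets stands in for the embedding- or TF-IDF-based metrics a real pipeline would likely use.

```python
import itertools

def tokens(text):
    """Lowercased bag of words, as a set. A crude stand-in for embeddings."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 = identical vocabularies)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_similarity(questions):
    """Average similarity over all question pairs; lower means more diverse."""
    sets = [tokens(q) for q in questions]
    pairs = list(itertools.combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low mean score suggests the dataset covers varied topics; near-duplicate questions (e.g., repeated "how do I cite a source" variants) push the score up and flag candidates for deduplication.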
When this dataset is compiled, a first task to determine its utility would be to evaluate the current Search API on the dataset and measure its effectiveness at different cutoffs -- i.e., is the correct page returned in the top-k results?
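The top-k evaluation described above could be sketched as follows. The endpoint and parameter names follow the public MediaWiki Action API (action=query, list=search); the dataset format -- a gold page title per question -- and the function names are assumptions for illustration. The request is built but not sent here, so the scoring functions stay testable offline.

```python
import urllib.parse

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def search_url(query, limit=10):
    """Build a MediaWiki full-text search request for the top-`limit` pages."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)

def hit_at_k(ranked_titles, gold_title, k):
    """1 if the gold page appears in the top-k results, else 0."""
    return int(gold_title in ranked_titles[:k])

def evaluate(dataset, k_values=(1, 5, 10)):
    """dataset: list of (ranked result titles, gold title) pairs.
    Returns the mean hit@k for each cutoff."""
    return {
        k: sum(hit_at_k(titles, gold, k) for titles, gold in dataset) / len(dataset)
        for k in k_values
    }
```

Reporting hit@1, hit@5, and hit@10 side by side would show not only whether Search can find the right page at all, but how much re-ranking headroom a downstream system would have.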
Motivation
As an editor, it can be difficult to get prompt guidance on a particular issue. One would hope that it would be easy either to discover the documentation relevant to the question or to get guidance from a fellow editor, but both of these have challenges:
- Mentorship is difficult to scale to the needs of editors:
- There are many programs/spaces for this mentorship -- e.g., Newcomer Homepage, The Teahouse, Noticeboards, Village pumps, Talk pages, WikiProjects, as well as plenty of off-wiki spaces -- but they might not be discoverable by the editors who would most benefit from the help.
- The async/distributed nature of the wikis means that it can take a while before a question is answered.
- Answering these questions can also burn out the editors willing to spend their time providing this support, since the questions are often repetitive or require saying "no" because the requested support violates Wikipedia policies.
- Existing help/policy documentation is very difficult to index/discover via traditional keyword-based search:
- The relevant documentation, and related questions that other editors have asked, are spread across numerous namespaces and may even be found on other wikis.
- A lot of wiki documentation uses highly specific terminology that is not easily discoverable via Search unless you already know the name of the policy, the piece of wikitext syntax, the name of the extension, etc.
- Many of the help/policy pages are quite long and combine many related pieces of guidance. This makes it a needle-in-the-haystack challenge for Search to find the page that contains the one snippet of relevant content.
- It may be difficult for an editor to convert their actual question into effective keywords to search.
While asking questions and receiving mentorship is not just about receiving the "right" answer but also a valuable learning/social process, there are likely many frictions in this process that are not helpful and can frustrate editors and reduce the capacity for beneficial mentorship. This is a good space for improved tooling ranging from more effective search approaches to potentially even AI-generated answers to questions. Before these can be explored, however, there are many basic questions about how to evaluate any potential solutions in this space to determine their effectiveness.
This further aligns with the AI Strategy (Engage new generations of editors with guided mentorship) and builds on creating more time for human judgment, because the primary approach leverages the fact that AI excels at tasks such as information retrieval.
Resources etc.
- SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions is a relevant past example of a more narrow Q&A challenge that might provide some inspiration for appropriate methods for this broader task.
- WP:Teahouse is probably the best-studied mentorship space. Because the Teahouse is often semi-protected, unregistered users and accounts that are not confirmed/autoconfirmed cannot post their questions there directly; those newcomers can instead get assistance via the {{Help me}} template on their talk page (with the {{Help me-helped}} template indicating an answer). Newcomers can also ask questions of mentors via the Newcomer Homepage (details); these follow a common pattern ("Question from...", as can be seen on this user page or this one) and are tagged with the Mentorship module question tag, so they are easy to gather.
- Reference desk: https://en.wikipedia.org/wiki/Wikipedia:Reference_desk
- This task is about a dataset for evaluating search for mentorship-type questions. There are also more basic questions that would be beneficial to answer via qualitative methods -- e.g., what do editors need from mentorship? What questions will they ask publicly vs. privately? Are AI-generated answers appropriate, or should systems stop at providing improved access to relevant documentation? If AI-generated answers are appropriate, there are also interesting design questions about what oversight should be provided -- e.g., how to ensure transparency and some level of accountability/curation of the resulting answers. SpinachBot (an AI bot for answering SPARQL-related questions) may be one approach, but presumably there are other designs and aspects to consider in balancing speed and usability for newcomers with curation, transparency, and the ability for more experienced editors to correct answers.
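The template-based collection route mentioned above ({{Help me}} and {{Help me-helped}}) maps naturally onto the MediaWiki Action API's list=embeddedin module, which lists pages transcluding a given template. The sketch below builds the request URLs only; continuation handling (eicontinue) and actually fetching/parsing the question text are omitted, and namespace 3 (User talk) is an assumption about where these templates mostly appear.

```python
import urllib.parse

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def embeddedin_url(template, namespace=3, limit=50):
    """Request URL for pages transcluding `template` (ns 3 = User talk)."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": namespace,
        "eilimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)

# Questions that were asked vs. those marked as answered:
asked_url = embeddedin_url("Template:Help me")
answered_url = embeddedin_url("Template:Help me-helped")
```

Intersecting the two lists would give talk pages where a question was both asked and answered, which is exactly the input/output pairing the dataset needs.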



