User Details
- User Since
- Sep 9 2019, 9:50 AM
- IRC Nick
- mgerlach
- LDAP User
- MGerlach
- MediaWiki User
- MGerlach (WMF)
Fri, Nov 28
weekly update:
- Collect a set of representative queries in WP search:
- Conducted privacy check-in about publishing set of queries. As a one-off dataset for English Wikipedia this was approved.
- We will implement an additional filter for the frequency of queries such that analysis is considered high-level (>=25 users)
- Collecting candidate search results:
- Decided and implemented scheme for selecting top-5 paragraphs as candidate search results
- Using annotation tool:
- Requested a privacy survey statement for conducting the data annotation via prolific
- We set up a test-study with synthetic data in the prolific AI task builder to finalize UI of the annotation
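The >=25-user frequency filter mentioned above can be sketched as follows; the query-to-user-set data layout and the helper name are illustrative assumptions, not the actual pipeline:

```python
# Sketch of the planned frequency filter: only queries issued by at
# least 25 distinct users survive, keeping published aggregates high-level.
# The query -> user-set layout is illustrative, not the real pipeline.
MIN_DISTINCT_USERS = 25

def filter_by_user_frequency(query_users, min_users=MIN_DISTINCT_USERS):
    """query_users: dict mapping query -> set of (hashed) user identifiers."""
    return {q for q, users in query_users.items() if len(users) >= min_users}

toy = {
    "capital of france": {f"u{i}" for i in range(30)},  # 30 users -> kept
    "some rare personal query": {"u1", "u2"},           # 2 users -> dropped
}
print(filter_by_user_frequency(toy))  # {'capital of france'}
```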
weekly update
- starting data collection of revisions where maintenance templates are added or removed
weekly update
- incorporated feedback from Debra, Mike, and Yu-Ming
- finalized new revised version available in this doc (internal)
Fri, Nov 21
weekly update:
- We are continuing to make progress on setting up the full pipeline for the dataset generation.
- Collect a set of representative queries in WP search:
- This is completed from a technical side. We have a pipeline to extract a set of representative queries
- We are waiting for the feedback from the privacy consultation about if and how we can store and publish the selected queries for annotation
- Collecting candidate search results:
- We are testing different options for selecting the most relevant paragraphs from a set of search results (obtained from, e.g., Wikipedia search) to present as candidate search results for annotation. This matters for avoiding selection bias: any relevant paragraph missing from the candidate set is implicitly marked as irrelevant, since it is never available for annotation.
- Using annotation tool:
- We are testing the study setup in prolific by using mock-up data (not from the actual query).
- In order to conduct the actual study I am requesting a survey privacy statement. Once I have the details figured out (e.g. retention time and publication) I will submit the request, probably early next week.
- I confirmed that we have available budget in the team to run the study on prolific. I am figuring out the details about the process of how to request/spend the budget correctly.
@BTullis Thank you.
weekly update:
- collaborators can now access stat-machines
- the only blocker is Kerberos access in order to use Hive tables in Spark (T410389: Request kerberos identity for AnkitaM)
- next step is to start collecting the dataset of templates being added/removed
weekly update:
- revising the draft based on feedback I received. I think that I will have a revised version ready by the end of next week.
Tue, Nov 18
@Volans looks like everything is working as expected. Thank you.
Nov 14 2025
@Dzahn: We already signed an MOU/NDA for the formal collaboration with the Research Team. (so it's not staff/contractor)
@KFrancis: could you confirm?
Nov 13 2025
weekly update:
- Collect a set of representative queries in WP search:
- Added filter for navigational queries when there is an exact match of the query with an existing page title
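The navigational-query filter above can be sketched as an exact-match check against existing page titles; the normalization shown (case and whitespace) is an assumption, not necessarily what the pipeline does:

```python
# Drop queries that exactly match an existing page title after light
# normalization; such queries are navigational and better served by
# autocomplete. Normalization details are an assumption.
def normalize(text):
    return " ".join(text.strip().lower().split())

def drop_navigational(queries, page_titles):
    titles = {normalize(t) for t in page_titles}
    return [q for q in queries if normalize(q) not in titles]

queries = ["Barack Obama", "when was obama born"]
print(drop_navigational(queries, ["Barack Obama"]))  # ['when was obama born']
```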
weekly update:
- finished a full first draft. available in this doc (internal only)
- currently shared with research-folks active in reader space for feedback and improvement
weekly update:
- officially set up formal collaboration (see announcement on wiki-research-l)
- now working on onboarding to stat-machines (see subtasks)
Nov 12 2025
Hi, we have a new formal collaborator with the Research Team: @AnkitaM. They need access to the stat machines for a new research project.
Let me know if you require more information -- Thank you.
Nov 7 2025
weekly update:
- Collect a set of representative queries in WP search:
- We built a pipeline to collect and filter queries to full text search in English Wikipedia. https://docs.google.com/document/d/1NtBBGZCF18rS8VKT3PjGtF7OqRoCvwlsH3pszPhkRIQ/edit?tab=t.0#heading=h.m0kvhd7lv51h
- We sampled queries of different lengths in buckets of 2-3 tokens (short), 5-7 tokens (medium), and long (8 or more tokens) to capture lexical as well as natural language queries.
- From this, we extracted a small sample of 100 queries for manual review to filter out queries that contain PII or are non-sensical.
- code: https://gitlab.wikimedia.org/repos/research/search-evaluation-benchmark/-/tree/main/notebooks?ref_type=heads
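The length buckets described above can be sketched as a simple classifier; whitespace tokenization is a simplification, and queries outside the stated ranges (e.g. 1 or 4 tokens) fall into no bucket here:

```python
def length_bucket(query):
    """Assign a query to the token-length buckets used for sampling:
    short (2-3 tokens), medium (5-7 tokens), long (8+ tokens).
    Whitespace tokenization is an assumption of this sketch."""
    n = len(query.split())
    if 2 <= n <= 3:
        return "short"
    if 5 <= n <= 7:
        return "medium"
    if n >= 8:
        return "long"
    return None

print(length_bucket("eiffel tower"))                                  # short
print(length_bucket("who was the first person to walk on the moon"))  # long
```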
weekly update:
- fully re-organized the outline: 3 major themes (participation, interaction with content, data&methodology). Each theme contains detailed description of what we learned so far and proposes 3 sub-themes.
- finished first full write-up. need one more iteration to polish before I will share with others for feedback.
weekly update:
- reached out to Legal for MOU/NDA
- started technical onboarding (e.g. creating accounts in phabricator, wikitech etc)
Oct 31 2025
weekly update:
- We started parsing through the search-logs of full text searches in webrequest logs based on the notebook to calculate the fraction of natural language queries https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/blob/ed747398e8798ec201f794a08d8af4459f9db372/query-analysis/T404822-natural_language_query_estimate.ipynb
- We are now defining criteria for how to filter and sample queries. For example, removing queries containing PII content or identifying well-formed queries.
- Looking into existing tools used for labeling data and if/how this can be adapted for our purposes.
weekly update:
- managed to pick this up again this week and made some minor progress in writing up the existing bullet points
weekly update:
- created project page on meta: https://meta.wikimedia.org/wiki/Research:Understanding_the_use_of_maintenance_templates
- next step: set up MOU/NDA
Oct 28 2025
Oct 24 2025
weekly update:
- We identified 3 main dimensions for categorizing different types of queries, based on existing literature, that we think are relevant for search in Wikipedia (details in this doc)
- query intent: following the traditional web query taxonomy, we focus mostly on informational queries (e.g. navigational queries are well served by autocomplete search and are not considered as part of this work). The main distinction of informational queries is whether they are closed or open-ended.
- query form: this is the distinction between, e.g., (short) lexical queries and (longer) natural language queries.
- query result: a common distinction is the expected result, e.g. a description or an entity or a numeric.
- Understanding the different types of queries is important to i) make sure that the benchmark dataset captures a representative sample of queries, and ii) helps to improve different search models by identifying for which types of queries they perform well or poorly.
- We started work to collect a set of queries for the benchmark dataset. We are considering different potential sources:
- Wikipedia search logs (full text search)
- Queries from surveys or user studies, e.g., the queries mentioned/observed in readers foundational research: (see Appendix of the prototype evaluation for semantic search)
- Search's “golden set” of queries with human-graded results from the Discernatron project
- Public datasets: MS Marco (Bing queries) or Natural Questions (Google queries, though this has very restrictive filters for longer natural language questions)
- We scoped the granularity of annotation of search results. We aim to annotate queries with relevant passages (paragraph-level) of Wikipedia articles. This is motivated by findings in search stating that "retrieving a passage or a shorter piece of text is sufficient to properly answer almost all questions.” Source: An Intent Taxonomy for Questions Asked in Web Search (pdf) In addition, this level of granularity will allow us to quantitatively evaluate performance of different models for semantic search.
weekly update:
- scoped the project with collaborators. they will start drafting a meta-page.
weekly update:
- no update as I didn't manage to dedicate time this week to this project.
Oct 23 2025
Oct 17 2025
weekly update:
- no update this week
- will have coordination meeting with collaborators next week
weekly update:
- Onboarded @Trokhymovych to the project
- Scoped out first subtask to identify relevant query types (e.g. keyword queries vs natural language questions) T407603
- Coordinating how to capture this work as a separate hypothesis in WE3.1
weekly update:
- no update as I didn't manage to dedicate time this week to this project.
weekly update:
- shared draft more widely and incorporated feedback
- closing task as work is completed
Oct 10 2025
weekly update:
- Incorporated feedback from Search Team and Design Research
- Summarized main findings and formulated a set of recommendations
- Finalized full first draft (internal doc)
- Next step: share more widely
weekly update:
- continued writing and updated some of the content to incorporate learnings from showcase presentation
- however, I didn't get very far as I was asked mid-week to dedicate capacity to another urgent, short-term request.
weekly update:
- shared resources with collaborators
- discussing first steps
weekly update:
- no major updates this week
- trying to scope the task
- coordinating potential external support (contractor)
Oct 7 2025
I did an ad-hoc analysis of counting the number of referers from chatgpt some time ago (slack-thread). We saw that traffic from chatgpt showed up in (at least) two different ways:
- F.col("referer")=="https://chatgpt.com/" or
- F.col("uri_query").contains("utm_source=chatgpt.com")
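Restated as plain Python (the expressions above are PySpark `F.col` filters over webrequest rows; field names follow that schema):

```python
# Plain-Python restatement of the two PySpark conditions above for
# spotting ChatGPT-driven traffic in webrequest rows.
def is_chatgpt_traffic(row):
    referer = row.get("referer") or ""
    uri_query = row.get("uri_query") or ""
    return referer == "https://chatgpt.com/" or "utm_source=chatgpt.com" in uri_query

rows = [
    {"referer": "https://chatgpt.com/", "uri_query": ""},
    {"referer": "https://example.org/", "uri_query": "?utm_source=chatgpt.com"},
    {"referer": "https://example.org/", "uri_query": ""},
]
print([is_chatgpt_traffic(r) for r in rows])  # [True, True, False]
```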
Oct 6 2025
Oct 3 2025
weekly update:
- closed subtask on estimating the fraction of natural language queries on WP search
- summarized insights about use of external search to reach/navigate Wikipedia
- with this, I have compiled a rough full first draft of the review
- currently asking for feedback and incorporating changes from Design Research and Search Team as well as polishing the text
- Next: writing a high-level summary with specific recommendations
weekly update:
- no updates, mostly worked on OKR-work for WE3.1.7 (T404848)
Oct 2 2025
Closing the epic as the research is completed and currently no planned work. If we pick up future work on this, we can re-open the epic.
Closing the epic as the research on this project is completed. If we pick up future work on this, we can re-open the epic.
Closing the epic as the research project is completed. If we pick up future work on this, we can re-open the epic.
weekly update:
- I put together a notebook to collect relevant cleanup-templates across Wikipedia (see data and code)
- this starts from cleanup-templates in English Wikipedia: Wikipedia:Template_index/Cleanup and the templates contained in Category:Cleanup_templates. This yields ~500 different templates
- we then get the corresponding templates in other Wikipedia language versions using the Langlinks-API. This yields ~8K templates across all Wikipedias.
- we also add Wikidata qids (to match templates across languages) and all redirect titles (in order to extract usage of aliases in wikitext).
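The Langlinks step can be sketched against the standard MediaWiki API; the request below is a hedged illustration (parameter choices follow the public API docs), not the notebook's actual code:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def langlinks_request_url(title):
    """Build a MediaWiki API request for a template's interlanguage links
    (prop=langlinks). Fetching and continuation are left out of this sketch."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urlencode(params)

print(langlinks_request_url("Template:Confusing"))
```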
Sep 30 2025
Sep 26 2025
weekly update:
- Put together high-level statistics of use of search on Wikipedia
- Summarized known pain points of WP's search and identified themes:
- preference for external search out of habit (e.g. for navigating between articles)
- lack of understanding of how it works (e.g. a missing match in autocomplete is interpreted as absence of coverage)
- UI limitations in reaching/using fulltext search
- community wishlists (template discovery, common queries by newcomers, discussion thread)
- low recall for long queries (not necessarily natural language queries)
- difficulties of media search on Commons
- unmet expectations of readers to find information using natural language queries or within sections
- First estimate for fraction of natural language queries in fulltext search on Wikipedia (4-7%) T404822
weekly update:
- not a lot of progress due to the showcase presentation (as part of that, moved the due date to October)
- the showcase presentation was a good opportunity to get feedback about ideas on future research areas in this space. Most notably, we identified 5 areas: readership progression (e.g. reader-to-editor conversion), improving discoverability (e.g. search), identification of bot traffic, Wikipedia's role in the rapidly changing online ecosystem (e.g. impact of LLMs/chatbots on Wikipedia), and identifying drivers of change in readership (e.g. causes of knowledge gaps or effectiveness of potential interventions). Note: this list is neither exhaustive nor finalized.
Sep 25 2025
@EBernhardson Thanks for putting together the notebook. Looks really good, I appreciate the level of detail with respect to manual verification and having confidence intervals.
- from what I understand, you operationalize natural language queries as all queries which contain one of the words who|what|where|when|why|how (and later do some additional manual filtering). Could you confirm? I think that approach makes sense and is sufficient to get a rough idea of the order of magnitude.
- Do you think it would be (easily) feasible to compare the average number of words in lexical vs natural language queries? I think this could be relevant in the context of the planned hypothesis of search around relaxing matching all keywords?
- I think that the current code is not filtering bot/automated traffic of the webrequest data (agent_type=="user"). Do you think there are many of those requests for search such that the results could significantly change? Similarly, should we filter searches in main article namespace only? (though I assume that there are very few queries that are not in main namespace).
Sep 23 2025
Sep 19 2025
weekly update:
- started scoping the work for this hypothesis
- Collected relevant resources/literature for the review of on- and off-wiki search
- Started analysis of search queries to estimate fraction of natural language queries T404822
- defining a simple-to-implement heuristic for what a natural language query is. one crucial criterion is to check whether query contains any question words via the following regex: \b(who|what|where|when|why|how)\b
- Identifying the best data source to get all full text queries (e.g. using the webrequest table instead of discovery.query_clicks_hourly to also get queries from mobile web)
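The question-word heuristic above, as a runnable sketch (lowercasing before matching is an added assumption):

```python
import re

# A query counts as "natural language" under the simple heuristic above
# if it contains a question word; this uses the stated regex verbatim.
QUESTION_WORDS = re.compile(r"\b(who|what|where|when|why|how)\b")

def looks_natural_language(query):
    return bool(QUESTION_WORDS.search(query.lower()))

print(looks_natural_language("how tall is the eiffel tower"))  # True
print(looks_natural_language("eiffel tower height"))           # False
```

Note that the word boundaries keep substrings from matching, e.g. "whoever" does not trigger the "who" alternative.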
weekly update:
- no update this week because I spent most of my available time this week on the presentation for the research showcase next week
Sep 17 2025
Thanks @NBaca-WMF for the clarification.
I followed the process you described, reaching out to techsupport.
Since that process is outside of Phabricator, I am closing this task as declined.
The NaturalQuestions dataset (natural questions from Google search queries annotated with relevant Wikipedia article sections) uses a heuristic to identify natural language queries (described in Sec. 3.1 of their paper) which might serve as a good starting point for us to adapt. Copying here for reference:
- query was issued by multiple users
- query contains 8 words or more
- query matches one of the following conditions
- start with "who", "when", or "where" directly followed by: a) a finite form of "do" or a modal verb; or b) a finite form of "be" or "have" with a verb in some later position;
- start with "who" directly followed by a verb that is not a finite form of "be";
- contain multiple entities as well as an adjective, adverb, verb, or determiner;
- contain a categorical noun phrase immediately preceded by a preposition or relative clause;
- end with a categorical noun phrase, and do not contain a preposition or relative clause.
- query yields a Wikipedia page in the top 5 search results
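For reference, the surface-level parts of this heuristic (multiple users, >= 8 words, a "who/when/where + do/modal" opening) are easy to sketch; the syntactic conditions (entities, noun phrases, relative clauses) need a POS tagger and are omitted here. The modal-verb list is an assumption based on the description above:

```python
import re

# Only conditions checkable without linguistic analysis are implemented.
OPENING = re.compile(
    r"^(who|when|where)\s+"
    r"(do|does|did|can|could|will|would|shall|should|may|might|must)\b",
    re.IGNORECASE,
)

def passes_surface_filter(query, n_users):
    """True if the query was issued by multiple users, has >= 8 words,
    and opens with who/when/where followed by do/modal."""
    return (
        n_users > 1
        and len(query.split()) >= 8
        and bool(OPENING.match(query.strip()))
    )

print(passes_surface_filter("when did the first world war officially come to an end", 5))  # True
print(passes_surface_filter("world war end date", 5))                                      # False
```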
Sep 5 2025
weekly update:
- wrote a first rough (and partially incomplete) draft
- shared with @YLiou_WMF and @MRaishWMF for early feedback
- no update next week as I am OoO; after that I will continue to complete the first rough draft
Sep 4 2025
Sep 2 2025
@SCherukuwada: pinging you since you handled a similar request in the recent past T396188. Could you help me with getting access to the data or point me to someone else who I could reach out to? Thanks
Aug 28 2025
weekly update:
- re-organized the skeleton with outline of the doc
- Pulled together main talking points for each section
- Next steps: write up first bad version of the doc and share with @YLiou_WMF and @MRaishWMF for feedback
@jcrespo: pinging you since you are listed as member of the search-console-access-request project. Is there any additional information I should provide? Or do you know anyone else I could reach out to about this request?
Thank you for your help.
Aug 22 2025
weekly update:
- not a lot of update as I am gathering feedback
- planning on working on a full iteration next week
weekly update:
- updated analysis to remove disambiguation/list pages; these articles are not relevant for simplification since they just contain lists of links to other articles. There are many of these (e.g. 300K disambiguation pages in enwiki alone) and generally tend to have high FKGL scores skewing the overall stats.
- generated list of 1000 example articles for each of the three approaches (spreadsheet)
- shared results with @ovasileva: the list of articles are a starting point for potential discussions with communities about the problem of difficult-to-read articles
Aug 15 2025
weekly update:
- wrote up findings with 3 different options for prioritization on meta: https://meta.wikimedia.org/wiki/Research:Develop_a_model_for_text_simplification_to_improve_readability_of_Wikipedia_articles/Prioritization_for_simplification
- next step: share with interested folks and look into potential follow-up questions
weekly update:
- gave presentation of early ideas in Applied Research meeting
- gathering feedback from individual folks
- next step: refine and iterate into a draft
Aug 8 2025
weekly updates:
- continued some discussions with researchers.
- drafted a first rough outline of the research direction. Specifically, I synthesized the wide range of potential research questions into 5 main themes to provide a framework about what is important and why.
- put together a presentation for next week's applied research team meeting to gather feedback.
Weekly update:
- Performed detailed analysis of 3rd option using Maintenance templates
- Identified all articles in English Wikipedia using the template {{Confusing}} or {{Technical}} (or any of their redirects) in the lead section of the article (some articles contain these templates only in specific sections; those were discarded in this case).
- this yields 3708 articles
- the average readability (FKGL) of those articles is 14.0. This is substantially higher than the average readability of all articles (11.7)
- Thus, the two selected maintenance templates seem promising options to identify articles that could benefit from simplification.
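FKGL here is the standard Flesch-Kincaid grade-level formula; below is a minimal sketch with a crude syllable counter (the actual analysis likely uses a more careful implementation):

```python
import re

def count_syllables(word):
    """Crude vowel-group heuristic; real pipelines use better counters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    """Flesch-Kincaid grade level (standard formula):
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat on the mat."
hard = "Extraordinarily complicated terminological constructions proliferate incessantly."
print(fkgl(simple) < fkgl(hard))  # True
```

Higher scores mean harder text, which is why the template-flagged articles scoring 14.0 vs. the 11.7 overall average is a meaningful gap.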
Aug 1 2025
Weekly updates:
- ongoing discussions with members of the team, researchers, and folks from other teams (e.g. Product, most notably the Reader-related teams).
- trying to identify major themes in ongoing efforts as well as open questions. For example, a recurring open question was about better understanding if and how readers progress (i.e. "reader funnel")
- Next step: synthesize themes and present initial ideas in Applied Science meeting on August 11
Weekly update:
- Identified 3 potential approaches for prioritization. I started to explore those with articles in enwiki.
