
MGerlach (Martin Gerlach)
Senior Research Scientist

User Details

User Since
Sep 9 2019, 9:50 AM (327 w, 15 h)
Availability
Available
IRC Nick
mgerlach
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 28

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • Collect a set of representative queries in WP search:
    • Conducted privacy check-in about publishing set of queries. As a one-off dataset for English Wikipedia this was approved.
    • We will implement an additional filter for the frequency of queries such that analysis is considered high-level (>=25 users)
  • Collecting candidate search results:
    • Decided and implemented scheme for selecting top-5 paragraphs as candidate search results
  • Using annotation tool:
    • Requested a privacy survey statement for conducting the data annotation via prolific
    • We set up a test-study with synthetic data in the prolific AI task builder to finalize UI of the annotation
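The frequency filter mentioned above (>=25 users) could be sketched roughly as follows. This is a minimal illustration, assuming the threshold refers to distinct users per query; the input shape of (user_id, query) pairs and the name `frequent_queries` are hypothetical and not the actual pipeline.

```python
from collections import defaultdict

MIN_USERS = 25  # threshold from the update (>=25 users)


def frequent_queries(events, min_users=MIN_USERS):
    """Return the queries issued by at least `min_users` distinct users.

    `events` is an iterable of (user_id, query) pairs. This input shape is
    an assumption for illustration, not the real schema.
    """
    users_per_query = defaultdict(set)
    for user_id, query in events:
        users_per_query[query].add(user_id)
    return {q for q, users in users_per_query.items() if len(users) >= min_users}


# Toy data: one query issued by 30 users, another by only 3.
events = [(f"u{i}", "eiffel tower height") for i in range(30)]
events += [(f"u{i}", "rare query") for i in range(3)]
kept = frequent_queries(events)
```

Counting distinct users rather than raw query events keeps a single very active user from pushing a query over the threshold.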
Fri, Nov 28, 4:58 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update

  • starting data collection of revisions where maintenance templates are added or removed
Fri, Nov 28, 4:39 PM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update

  • incorporated feedback from Debra, Mike, and Yu-Ming
  • finalized new revised version available in this doc (internal)
Fri, Nov 28, 12:21 PM · Research (FY2025-26-Research-October-December)

Fri, Nov 21

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • We are continuing to make progress on setting up the full pipeline for the dataset generation.
  • Collect a set of representative queries in WP search:
    • This is completed from a technical side. We have a pipeline to extract a set of representative queries
    • We are waiting for the feedback from the privacy consultation about if and how we can store and publish the selected queries for annotation
  • Collecting candidate search results:
    • We are testing different options to select the most relevant paragraphs from a set of search results (obtained from, e.g., Wikipedia search) to present as candidate search results for annotation. This is important to avoid selection bias: potentially relevant paragraphs that are missing from the candidates cannot be annotated and will therefore be implicitly marked as irrelevant.
  • Using annotation tool:
    • We are testing the study setup in prolific by using mock-up data (not from the actual query).
    • In order to conduct the actual study I am requesting a survey privacy statement. Once I have the details figured out (e.g. retention time and publication) I will submit the request, probably early next week.
    • I confirmed that we have available budget in the team to run the study on prolific. I am figuring out the details about the process of how to request/spend the budget correctly.
Fri, Nov 21, 6:33 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach closed T410389: Request kerberos identity for AnkitaM, a subtask of T406203: Start formal collaboration on understanding the use of maintenance templates, as Resolved.
Fri, Nov 21, 4:08 PM · Research (FY2025-26-Research-October-December)
MGerlach closed T410389: Request kerberos identity for AnkitaM as Resolved.

@BTullis Thank you.

Fri, Nov 21, 4:08 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • collaborators can now access stat-machines
  • only blocker is kerberos access in order to use hive tables in spark T410389: Request kerberos identity for AnkitaM
  • next step is to start collecting the dataset of templates being added/removed
Fri, Nov 21, 1:55 PM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406203: Start formal collaboration on understanding the use of maintenance templates.
Fri, Nov 21, 1:53 PM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • revising the draft based on feedback I received. I think that I will have a revised version ready by the end of next week.
Fri, Nov 21, 1:53 PM · Research (FY2025-26-Research-October-December)

Tue, Nov 18

MGerlach closed T409893: Requesting access to analytics-privatedata-users for AnkitaM, a subtask of T406203: Start formal collaboration on understanding the use of maintenance templates, as Resolved.
Tue, Nov 18, 7:13 PM · Research (FY2025-26-Research-October-December)
MGerlach closed T409893: Requesting access to analytics-privatedata-users for AnkitaM as Resolved.

@Volans looks like everything is working as expected. Thank you.

Tue, Nov 18, 7:13 PM · Data-Engineering, SRE, SRE-Access-Requests
MGerlach added a comment to T409894: Grant Access to ldap/nda for AnkitaM.

Thank you @Dzahn and @Volans

Tue, Nov 18, 5:09 PM · SRE, LDAP-Access-Requests
MGerlach added a subtask for T406203: Start formal collaboration on understanding the use of maintenance templates: T410389: Request kerberos identity for AnkitaM.
Tue, Nov 18, 11:48 AM · Research (FY2025-26-Research-October-December)
MGerlach added a parent task for T410389: Request kerberos identity for AnkitaM: T406203: Start formal collaboration on understanding the use of maintenance templates.
Tue, Nov 18, 11:48 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering
MGerlach created T410389: Request kerberos identity for AnkitaM.
Tue, Nov 18, 11:37 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering

Nov 14 2025

MGerlach added a comment to T409894: Grant Access to ldap/nda for AnkitaM.

@Dzahn: We already signed a MOU/NDA for the formal collaboration with the Research Team. (so it's not staff/contractor)
@KFrancis: could you confirm?

Nov 14 2025, 4:29 PM · SRE, LDAP-Access-Requests

Nov 13 2025

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • Collect a set of representative queries in WP search:
    • Added filter for navigational queries when there is an exact match of the query with an existing page title
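A minimal sketch of such an exact-match filter is shown here. The normalization (lowercasing, underscores to spaces, whitespace collapsing) is an assumption for illustration; the update does not say how "exact match" is defined, and the function names are hypothetical.

```python
def normalize(text):
    """Lowercase, turn underscores into spaces, collapse whitespace.

    This normalization is an assumption; the real filter may compare
    queries and titles differently.
    """
    return " ".join(text.replace("_", " ").lower().split())


def drop_navigational(queries, page_titles):
    """Keep only queries that do not exactly match an existing page title."""
    titles = {normalize(t) for t in page_titles}
    return [q for q in queries if normalize(q) not in titles]


# Toy example with made-up queries and a tiny title list.
queries = ["Albert Einstein", "who discovered penicillin", "eiffel_tower"]
page_titles = ["Albert Einstein", "Eiffel Tower", "Penicillin"]
informational = drop_navigational(queries, page_titles)
```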
Nov 13 2025, 4:18 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach assigned T409559: Collect candidate search results for set of sample queries to Trokhymovych.
Nov 13 2025, 1:31 PM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T409561: Annotate relevance of search results for sample queries.
Nov 13 2025, 1:09 PM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • finished a full first draft. available in this doc (internal only)
  • currently shared with research-folks active in reader space for feedback and improvement
Nov 13 2025, 11:07 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

Nov 13 2025, 11:04 AM · Research (FY2025-26-Research-October-December)
MGerlach added a parent task for T409893: Requesting access to analytics-privatedata-users for AnkitaM: T406203: Start formal collaboration on understanding the use of maintenance templates.
Nov 13 2025, 11:03 AM · Data-Engineering, SRE, SRE-Access-Requests
MGerlach added a parent task for T409894: Grant Access to ldap/nda for AnkitaM: T406203: Start formal collaboration on understanding the use of maintenance templates.
Nov 13 2025, 11:03 AM · SRE, LDAP-Access-Requests
MGerlach added subtasks for T406203: Start formal collaboration on understanding the use of maintenance templates: T409894: Grant Access to ldap/nda for AnkitaM, T409893: Requesting access to analytics-privatedata-users for AnkitaM.
Nov 13 2025, 11:03 AM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406203: Start formal collaboration on understanding the use of maintenance templates.
Nov 13 2025, 11:00 AM · Research (FY2025-26-Research-October-December)

Nov 12 2025

MGerlach updated subscribers of T409898: Set up OpenSearch instance supporting vector search.
Nov 12 2025, 10:58 AM · Essential-Work, Discovery-Search, Research, Data-Platform-SRE (2025.11.07 - 2025.11.28)
MGerlach created T409894: Grant Access to ldap/nda for AnkitaM.
Nov 12 2025, 8:56 AM · SRE, LDAP-Access-Requests
MGerlach updated subscribers of T409893: Requesting access to analytics-privatedata-users for AnkitaM.

Hi. we have a new formal collaborator with the Research Team: @AnkitaM. They need access to the stat machines for a new research project.
Let me know if you require more information -- Thank you.

Nov 12 2025, 8:53 AM · Data-Engineering, SRE, SRE-Access-Requests
MGerlach created T409893: Requesting access to analytics-privatedata-users for AnkitaM.
Nov 12 2025, 8:52 AM · Data-Engineering, SRE, SRE-Access-Requests

Nov 7 2025

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

Nov 7 2025, 3:26 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406207: Create a dataset for evaluation of search on Wikipedia.
Nov 7 2025, 3:23 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach created T409561: Annotate relevance of search results for sample queries.
Nov 7 2025, 3:22 PM · Research (FY2025-26-Research-October-December)
MGerlach created T409559: Collect candidate search results for set of sample queries.
Nov 7 2025, 3:17 PM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406207: Create a dataset for evaluation of search on Wikipedia.
Nov 7 2025, 3:12 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • fully re-organized the outline: 3 major themes (participation, interaction with content, data&methodology). Each theme contains detailed description of what we learned so far and proposes 3 sub-themes.
  • finished first full write-up. need one more iteration to polish before I will share with others for feedback.
Nov 7 2025, 2:45 PM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • reached out to Legal for MOU/NDA
  • started technical onboarding (e.g. creating accounts in phabricator, wikitech etc)
Nov 7 2025, 2:42 PM · Research (FY2025-26-Research-October-December)

Oct 31 2025

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

Oct 31 2025, 8:18 AM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • managed to pick this up again this week and made some minor progress in writing up the existing bullet points
Oct 31 2025, 8:10 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

Oct 31 2025, 8:09 AM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406203: Start formal collaboration on understanding the use of maintenance templates.
Oct 31 2025, 8:09 AM · Research (FY2025-26-Research-October-December)

Oct 28 2025

MGerlach moved T408523: [EPIC] Understanding the use of maintenance template from Backlog to Epics on the Research board.
Oct 28 2025, 9:04 AM · Epic, Research
MGerlach added a parent task for T406203: Start formal collaboration on understanding the use of maintenance templates: T408523: [EPIC] Understanding the use of maintenance template.
Oct 28 2025, 9:01 AM · Research (FY2025-26-Research-October-December)
MGerlach added a subtask for T408523: [EPIC] Understanding the use of maintenance template: T406203: Start formal collaboration on understanding the use of maintenance templates.
Oct 28 2025, 9:01 AM · Epic, Research
MGerlach created T408523: [EPIC] Understanding the use of maintenance template.
Oct 28 2025, 9:00 AM · Epic, Research

Oct 24 2025

MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • We identify 3 main dimensions for categorizing different types of queries based on existing literature that we think are relevant for search in Wikipedia (details in this doc)
    • query intent: following the traditional web query taxonomy, we focus mostly on informational queries (e.g. navigational queries are well served by autocomplete search and are not considered as part of this work). The main distinction of informational queries is whether they are closed or open-ended.
    • query form: this is the distinction between, e.g., (short) lexical queries and (longer) natural language queries.
    • query result: a common distinction is the expected result, e.g. a description or an entity or a numeric.
  • Understanding the different types of queries is important to i) make sure that the benchmark dataset captures a representative sample of queries, and ii) helps to improve different search models by identifying for which types of queries they perform well or poorly.
  • We started work to collect a set of queries for the benchmark dataset. We are considering different potential sources:
  • We scoped the granularity of annotation of search results. We aim to annotate queries with relevant passages (paragraph-level) of Wikipedia articles. This is motivated by findings in search stating that "retrieving a passage or a shorter piece of text is sufficient to properly answer almost all questions." (Source: An Intent Taxonomy for Questions Asked in Web Search, pdf.) In addition, this level of granularity will allow us to quantitatively evaluate performance of different models for semantic search.
Oct 24 2025, 2:05 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • scoped the project with collaborators. they will start drafting a meta-page.
Oct 24 2025, 9:50 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • no update as I didn't manage to dedicate time this week to this project.
Oct 24 2025, 9:49 AM · Research (FY2025-26-Research-October-December)

Oct 23 2025

MGerlach added a comment to T407603: Identify a set of relevant query types.

It's worth keeping in mind that this is a two-stage search system. Navigational queries which dominate normal search pipelines are not seen at nearly the same rate in the on-wiki fulltext search because the first stage of search, the autocomplete, sends users directly to the page and typically satisfies the navigational needs.

Oct 23 2025, 2:46 PM · Research (FY2025-26-Research-October-December)
MGerlach updated the task description for T406207: Create a dataset for evaluation of search on Wikipedia.
Oct 23 2025, 2:30 PM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach created T408121: Collect a set of representative queries for the benchmark dataset.
Oct 23 2025, 1:53 PM · Research (FY2025-26-Research-October-December)

Oct 17 2025

MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • no update this week
  • will have coordination meeting with collaborators next week
Oct 17 2025, 9:46 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • Onboarded @Trokhymovych to the project
  • Scoped out first subtask to identify relevant query types (e.g. keyword queries vs natural language questions) T407603
  • Coordinating how to capture this work as a separate hypothesis in WE3.1
Oct 17 2025, 9:35 AM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • no update as I didn't manage to dedicate time this week to this project.
Oct 17 2025, 9:21 AM · Research (FY2025-26-Research-October-December)
MGerlach closed T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki as Resolved.

weekly update:

  • shared draft more widely and incorporated feedback
  • closing task as work is completed
Oct 17 2025, 9:14 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach updated the task description for T406207: Create a dataset for evaluation of search on Wikipedia.
Oct 17 2025, 8:09 AM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach created T407603: Identify a set of relevant query types.
Oct 17 2025, 8:05 AM · Research (FY2025-26-Research-October-December)

Oct 10 2025

MGerlach added a comment to T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.

weekly update:

  • Incorporated feedback from Search Team and Design Research
  • Summarized main findings and formulated a set of recommendations
  • Finalized full first draft (internal doc)
  • Next step: share more widely
Oct 10 2025, 1:57 PM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach updated the task description for T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.
Oct 10 2025, 1:57 PM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • continued writing and updated some of the content to incorporate learnings from showcase presentation
  • however, I didn't get very far as I was asked mid-week to dedicate capacity to another urgent, short-term request.
Oct 10 2025, 11:44 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • shared resources with collaborators
  • discussing first steps
Oct 10 2025, 11:42 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T406207: Create a dataset for evaluation of search on Wikipedia.

weekly update:

  • no major updates this week
  • trying to scope the task
  • coordinating potential external support (contractor)
Oct 10 2025, 11:41 AM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)

Oct 7 2025

MGerlach added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

I did an ad-hoc analysis of counting the number of referers from chatgpt some time ago (slack-thread). We saw that traffic from chatgpt showed up in (at least) two different ways:

  • F.col("referer")=="https://chatgpt.com/" or
  • F.col("uri_query").contains("utm_source=chatgpt.com")
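For reference, the two conditions can be mirrored in plain Python as below. This is only a sketch for checking the logic on individual rows; the function name and the sample rows are made up, and the original analysis used Spark column expressions rather than this helper.

```python
def is_chatgpt_referral(referer, uri_query):
    """True if a request matches either of the two patterns listed above."""
    referer = referer or ""      # treat missing values as empty strings
    uri_query = uri_query or ""
    return referer == "https://chatgpt.com/" or "utm_source=chatgpt.com" in uri_query


# Made-up rows covering both patterns plus two non-matches.
rows = [
    {"referer": "https://chatgpt.com/", "uri_query": ""},
    {"referer": "-", "uri_query": "?utm_source=chatgpt.com"},
    {"referer": "https://www.google.com/", "uri_query": ""},
    {"referer": None, "uri_query": None},
]
n_chatgpt = sum(is_chatgpt_referral(r["referer"], r["uri_query"]) for r in rows)
```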
Oct 7 2025, 7:53 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform

Oct 6 2025

MGerlach moved T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki from FY2025-26-Research-July-September to FY2025-26-Research-October-December on the Research board.
Oct 6 2025, 3:06 PM · Research (FY2025-26-Research-October-December), Semantic Search

Oct 3 2025

MGerlach added a comment to T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.

weekly update:

  • closed subtask on estimating the fraction of natural language queries on WP search
  • summarized insights about use of external search to reach/navigate Wikipedia
  • with this, I have compiled a rough full first draft of the review
  • currently asking for feedback and incorporating changes from Design Research and Search Team as well as polishing the text
  • Next: writing high-level summary with specific recommendations
Oct 3 2025, 8:58 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach updated the task description for T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.
Oct 3 2025, 8:54 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • no updates, mostly worked on OKR-work for WE3.1.7 (T404848)
Oct 3 2025, 8:54 AM · Research (FY2025-26-Research-October-December)

Oct 2 2025

MGerlach created T406207: Create a dataset for evaluation of search on Wikipedia.
Oct 2 2025, 9:32 AM · Discovery-Search (2025.10.20 - 2025.12.31), Research (FY2025-26-Research-October-December)
MGerlach closed T342614: Models for text summarization using LLMs as Resolved.

Closing the epic as the research is completed and currently no planned work. If we pick up future work on this, we can re-open the epic.

Oct 2 2025, 9:14 AM · Epic, address-knowledge-gaps, Research
MGerlach closed T342526: Improving multilingual support for link recommendation model for add-a-link task as Resolved.

Closing the epic as the research on this project is completed. If we pick up future work on this, we can re-open the epic.

Oct 2 2025, 9:10 AM · address-knowledge-gaps, Epic, Research
MGerlach closed T293036: [EPIC] Understanding curious and critical readers as Resolved.

Closing the epic as the research project is completed. If we pick up future work on this, we can re-open the epic.

Oct 2 2025, 9:08 AM · Research, Epic
MGerlach added a comment to T406203: Start formal collaboration on understanding the use of maintenance templates.

weekly update:

  • I put together a notebook to collect relevant cleanup-templates across Wikipedia (see data and code)
    • this starts from cleanup-templates in English Wikipedia: Wikipedia:Template_index/Cleanup and the templates contained in Category:Cleanup_templates. This yields ~500 different templates
    • we then get the corresponding templates in other Wikipedia language versions using the Langlinks-API. This yields ~8K templates across all Wikipedias.
    • we also add Wikidata qids (to match templates across languages) and all redirect titles (in order to extract usage of aliases in wikitext).
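The langlinks lookup step could look roughly like this sketch, which builds a MediaWiki Action API request (`action=query`, `prop=langlinks`) and parses a response in the default JSON format. It is not the notebook's code: the helper names are hypothetical, no network call is made, and the sample response is a hand-written mock rather than real API output.

```python
from urllib.parse import urlencode


def langlinks_url(template_title, wiki="en.wikipedia.org"):
    """Build an Action API URL asking for the language links of one page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": template_title,
        "lllimit": "max",
        "format": "json",
    }
    return f"https://{wiki}/w/api.php?{urlencode(params)}"


def parse_langlinks(response):
    """Extract (language code, title) pairs from a decoded JSON response."""
    pairs = []
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            pairs.append((link["lang"], link["*"]))
    return pairs


url = langlinks_url("Template:Cleanup")

# Hand-written mock in the shape of the default JSON format (not real data).
mock_response = {
    "query": {
        "pages": {
            "1": {
                "title": "Template:Cleanup",
                "langlinks": [{"lang": "de", "*": "Vorlage:Beispiel"}],
            }
        }
    }
}
pairs = parse_langlinks(mock_response)
```

A real run would also need to follow API continuation for pages with many language links, which this sketch leaves out.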
Oct 2 2025, 8:55 AM · Research (FY2025-26-Research-October-December)
MGerlach created T406203: Start formal collaboration on understanding the use of maintenance templates.
Oct 2 2025, 8:49 AM · Research (FY2025-26-Research-October-December)

Sep 30 2025

MGerlach updated the task description for T397550: Develop a dataset for editor Q&A.
Sep 30 2025, 6:38 AM · Research (FY2025-26-Research-October-December), Essential-Work

Sep 26 2025

MGerlach added a comment to T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.

weekly update:

  • Put together high-level statistics of use of search on Wikipedia
  • Summarized known pain points of WP's search and identified themes: preference of external search out of habit (e.g. for navigating between articles), lack of understanding of how it works (e.g. lack of match in autocomplete is interpreted as absence of coverage), UI limitations in arriving to/using fulltext search, community wishlists (template discovery, common queries by newcomers, discussion thread), low recall for long queries (not necessarily natural language queries), difficulties of media search on commons, unmet expectations of readers to find information using natural language queries or within sections.
  • First estimate for fraction of natural language queries in fulltext search on Wikipedia (4-7%) T404822
Sep 26 2025, 8:54 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach renamed T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki from [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki to [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.
Sep 26 2025, 8:53 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • due to showcase presentation not a lot of progress (as part of that moved due date to October)
  • the showcase presentation was a good opportunity to get some feedback about ideas on future research areas in this space. Most notably, we identified 5 areas: readership progression (e.g. reader to editor conversion), improving discoverability (e.g. search), identification of bot traffic, Wikipedia's role in the rapidly changing online ecosystem (e.g. impact of LLMs/chatbots on Wikipedia), identifying drivers of change in readership (e.g. causes of knowledge gaps or effectiveness of potential interventions). note: this is not exhaustive and also not finalized.
Sep 26 2025, 8:14 AM · Research (FY2025-26-Research-October-December)

Sep 25 2025

MGerlach added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

@EBernhardson Thanks for putting together the notebook. Looks really good, I appreciate the level of detail with respect to manual verification and having confidence intervals.

  • from what I understand, you operationalize natural language queries as all queries which contain one of the words who|what|where|when|why|how (and later do some additional manual filtering). Could you confirm? I think that approach makes sense and is sufficient to get a rough idea of the order of magnitude.
  • Do you think it would be (easily) feasible to compare the average number of words in lexical vs natural language queries? I think this could be relevant in the context of the planned hypothesis of search around relaxing matching all keywords?
  • I think that the current code is not filtering bot/automated traffic of the webrequest data (agent_type=="user"). Do you think there are many of those requests for search such that the results could significantly change? Similarly, should we filter searches in main article namespace only? (though I assume that there are very few queries that are not in main namespace).
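To make the second and third points concrete, here is a rough sketch of comparing average query length by type after dropping non-user traffic. The row schema, the case-insensitive matching, and whitespace tokenization are assumptions for illustration and are not taken from the notebook.

```python
import re
from statistics import mean

# Question-word heuristic; case-insensitive matching is an assumption.
QUESTION_WORDS = re.compile(r"\b(who|what|where|when|why|how)\b", re.IGNORECASE)


def mean_words_by_type(rows):
    """Mean number of words per query type, keeping only agent_type == "user".

    `rows` are dicts with "query" and "agent_type" keys (assumed schema).
    """
    lengths = {"natural_language": [], "lexical": []}
    for row in rows:
        if row["agent_type"] != "user":  # drop bot/automated traffic
            continue
        kind = "natural_language" if QUESTION_WORDS.search(row["query"]) else "lexical"
        lengths[kind].append(len(row["query"].split()))
    return {kind: mean(vals) for kind, vals in lengths.items() if vals}


# Made-up rows: the spider request is filtered out before averaging.
rows = [
    {"query": "how tall is the eiffel tower", "agent_type": "user"},
    {"query": "eiffel tower height", "agent_type": "user"},
    {"query": "paris", "agent_type": "user"},
    {"query": "what is a bot", "agent_type": "spider"},
]
result = mean_words_by_type(rows)
```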
Sep 25 2025, 8:47 AM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research

Sep 23 2025

MGerlach changed Due Date from Sep 29 2025, 11:00 PM to Oct 31 2025, 12:00 AM on T400030: Draft first version of research direction on readers.
Sep 23 2025, 7:55 AM · Research (FY2025-26-Research-October-December)

Sep 19 2025

MGerlach added a comment to T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.

weekly update:

  • started scoping the work for this hypothesis
  • Collected relevant resources/literature for the review of on- and off-wiki search
  • Started analysis of search queries to estimate fraction of natural language queries T404822
    • defining a simple-to-implement heuristic for what a natural language query is. one crucial criterion is to check whether query contains any question words via the following regex: \b(who|what|where|when|why|how)\b
    • Identifying the best data source to get all full text queries (e.g. using webrequest-table instead of discovery.query_clicks_hourly to also get queries from mobile web)
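A small sketch of the question-word criterion, using the regex quoted above. Lowercasing the query before matching and the sample queries are assumptions for illustration; the actual analysis applies further criteria and manual checks.

```python
import re

QUESTION_WORD_RE = re.compile(r"\b(who|what|where|when|why|how)\b")


def is_natural_language(query):
    """Apply only the question-word criterion to a single query."""
    return QUESTION_WORD_RE.search(query.lower()) is not None


# Made-up queries; "showtime" contains "how" but not as a whole word.
queries = [
    "who invented the telephone",
    "telephone history",
    "showtime",
    "how to cite wikipedia",
]
fraction = sum(is_natural_language(q) for q in queries) / len(queries)
```

The `\b` word boundaries are what keep substrings such as the "how" inside "showtime" from counting as question words.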
Sep 19 2025, 8:54 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • no update this week because I spent most of my available time this week on the presentation for the research showcase next week
Sep 19 2025, 8:38 AM · Research (FY2025-26-Research-October-December)

Sep 17 2025

MGerlach closed T400233: Google Search Console access request for MGerlach as Declined.

Thanks @NBaca-WMF for the clarification.
I followed the process you described reaching out to techsupport.
Since that process is outside of phabricator, I am closing this task as declined.

Sep 17 2025, 1:37 PM · Search-Console-access-request
MGerlach added a comment to T404822: Analysis: how many search queries are using natural language vs keywords.

The NaturalQuestions dataset (natural questions from Google search queries annotated with relevant Wikipedia article sections) uses a heuristic to identify natural language queries (described in their paper in Sec. 3.1), which might serve as a good starting point for us to adapt. Copying here for reference:

  • query was issued by multiple users
  • query contains 8 words or more
  • query matches one of the following conditions
    • start with "who", "when", or "where" directly followed by: a) a finite form of "do" or a modal verb; or b) a finite form of "be" or "have" with a verb in some later position;
    • start with "who" directly followed by a verb that is not a finite form of "be";
    • contain multiple entities as well as an adjective, adverb, verb, or determiner;
    • contain a categorical noun phrase immediately preceded by a preposition or relative clause;
    • end with a categorical noun phrase, and do not contain a preposition or relative clause.
  • query yields a Wikipedia page in the top 5 search results
Sep 17 2025, 11:57 AM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
MGerlach added a parent task for T404822: Analysis: how many search queries are using natural language vs keywords: T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.
Sep 17 2025, 11:47 AM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), Semantic Search, Research
MGerlach added a subtask for T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki: T404822: Analysis: how many search queries are using natural language vs keywords.
Sep 17 2025, 11:47 AM · Research (FY2025-26-Research-October-December), Semantic Search
MGerlach created T404848: [FY25/26-WE3.1.7] Literature Review on Search on-wiki and off-wiki.
Sep 17 2025, 11:46 AM · Research (FY2025-26-Research-October-December), Semantic Search

Sep 5 2025

MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • wrote a first rough (and partially incomplete) draft
  • shared with @YLiou_WMF and @MRaishWMF for early feedback
  • no update next week as I am OoO; after that I will continue to complete the first rough draft
Sep 5 2025, 8:03 AM · Research (FY2025-26-Research-October-December)

Sep 4 2025

MGerlach added a comment to T401968: Analyze samples of articles to see how many structured tasks we might be able to generate.

@diego, do you know if there is any Hive table in Data Lake that contains aggregate pageviews per article? Would really appreciate your insights on this :)

Sep 4 2025, 8:05 AM · Research, Revise-Tone-Structured-Task, Growth-Team, OKR-Work, Goal, Machine-Learning-Team

Sep 2 2025

MGerlach added a comment to T400233: Google Search Console access request for MGerlach.

@SCherukuwada: pinging you since you handled a similar request in the recent past T396188. Could you help me with getting access to the data or point me to someone else who I could reach out to? Thanks

Sep 2 2025, 4:36 PM · Search-Console-access-request

Aug 28 2025

MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • re-organized the skeleton with outline of the doc
  • Pulled together main talking points for each section
  • Next steps: write up first bad version of the doc and share with @YLiou_WMF and @MRaishWMF for feedback
Aug 28 2025, 9:36 AM · Research (FY2025-26-Research-October-December)
MGerlach updated subscribers of T400233: Google Search Console access request for MGerlach.

@jcrespo: pinging you since you are listed as member of the search-console-access-request project. Is there any additional information I should provide? Or do you know anyone else I could reach out to about this request?
Thank you for your help.

Aug 28 2025, 9:30 AM · Search-Console-access-request

Aug 22 2025

MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • not a lot of updates as I am gathering feedback
  • planning on working on a full iteration next week
Aug 22 2025, 7:16 AM · Research (FY2025-26-Research-October-December)
MGerlach closed T399567: Identify a set of articles for which a simplified version might be most useful, a subtask of T342614: Models for text summarization using LLMs, as Resolved.
Aug 22 2025, 7:11 AM · Epic, address-knowledge-gaps, Research
MGerlach closed T399567: Identify a set of articles for which a simplified version might be most useful as Resolved.

weekly update:

  • updated analysis to remove disambiguation/list pages; these articles are not relevant for simplification since they just contain lists of links to other articles. There are many of these (e.g. 300K disambiguation pages in enwiki alone), and they generally tend to have high FKGL scores, skewing the overall stats.
  • generated list of 1000 example articles for each of the three approaches (spreadsheet)
  • shared results with @ovasileva: the list of articles is a starting point for potential discussions with communities about the problem of difficult-to-read articles
Aug 22 2025, 7:11 AM · Research

Aug 15 2025

MGerlach added a comment to T399567: Identify a set of articles for which a simplified version might be most useful.

weekly update:

Aug 15 2025, 10:31 AM · Research
MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly update:

  • gave presentation of early ideas in Applied Research meeting
  • gathering feedback from individual folks
  • next step: refine and iterate into a draft
Aug 15 2025, 10:29 AM · Research (FY2025-26-Research-October-December)

Aug 8 2025

MGerlach added a comment to T400030: Draft first version of research direction on readers.

weekly updates:

  • continued some discussions with researchers.
  • drafted a first rough outline of the research direction. Specifically, I synthesized the wide range of potential research questions into 5 main themes to provide a framework about what is important and why.
  • put together a presentation for next week's applied research team meeting to gather feedback.
Aug 8 2025, 9:47 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T399567: Identify a set of articles for which a simplified version might be most useful.

Weekly update:

  • Performed detailed analysis of 3rd option using Maintenance templates
  • Identified all articles in English Wikipedia using the template {{Confusing}} or {{Technical}} (or any of its redirects) in the lead section of the article (some articles contain these templates only in specific sections; those were discarded in this case).
  • this yields 3708 articles
  • the average readability score (FKGL) of those articles is 14.0. This is substantially higher (i.e. harder to read) than the average across all articles (11.7)
  • Thus, the two selected maintenance templates seem promising options to identify articles that could benefit from simplification.
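
For reference, the FKGL score used above is the standard Flesch-Kincaid Grade Level: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal sketch of how it could be computed on plain text. It is not the code used in the analysis: the function names `count_syllables` and `fkgl` are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a plain-text passage.

    Higher values indicate text that is harder to read.
    """
    # Naive sentence and word splitting; real article text would need
    # wikitext markup stripped first.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Quantum chromodynamics describes fundamental interactions."
    print(f"easy: {fkgl(easy):.2f}")
    print(f"hard: {fkgl(hard):.2f}")
```

Because the syllable heuristic is approximate, absolute scores from this sketch may differ from those produced by a dedicated readability library; it is meant only to show how sentence length and word length drive the score.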
Aug 8 2025, 9:40 AM · Research

Aug 1 2025

MGerlach added a comment to T400030: Draft first version of research direction on readers.

Weekly updates:

  • ongoing discussions with members of the team, researchers, and folks from other teams (e.g. Product, most notably the Reader-related teams).
  • trying to identify major themes in ongoing efforts as well as open questions. For example, a recurring open question was about better understanding if and how readers progress (i.e. "reader funnel")
  • Next step: synthesize themes and present initial ideas in Applied Science meeting on August 11
Aug 1 2025, 7:51 AM · Research (FY2025-26-Research-October-December)
MGerlach added a comment to T399567: Identify a set of articles for which a simplified version might be most useful.

Weekly update:

  • Identified 3 potential approaches for prioritization. I started to explore those with articles in enwiki.
Aug 1 2025, 7:38 AM · Research
MGerlach updated the task description for T400632: Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps.
Aug 1 2025, 6:45 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), DPE-Mediawiki-Content