User Details
- User Since
- Sep 9 2019, 9:50 AM
- Availability
- Available
- IRC Nick
- mgerlach
- LDAP User
- MGerlach
- MediaWiki User
- MGerlach (WMF)
Fri, Mar 27
weekly update:
- We have made some progress on collecting the results for the two remaining experiments
- We generated the embeddings and the search results of the semantic search model for 5 different languages (en, de, es, fr, id). We should have the evaluation metrics ready next week.
- We have started running experiments with the multilingual-e5-large-instruct model, which was used for the first prototype. We are currently working out how to run the other embedding models in our existing pipeline using spark-nlp. This brings some challenges but is not yet a blocker. If necessary, we might adapt the set of embedding models in the evaluation.
weekly update:
- we have concluded the first set of exploratory analysis
- Next steps are:
- mapping the maintenance templates to a set of the main policies and guidelines (for English Wikipedia)
- running models to predict maintenance templates in untagged articles
Final update:
- heavily revised the reader research direction based on feedback (latest version here)
- mainly re-organized existing content to make it shorter/more compact and align better with objectives for the next FY
- shared the latest version with Leila; will pick this up in Q4 based on feedback I receive and cover the work in a separate task
- had a more detailed discussion about two proposed projects for Q4: understanding reader retention, and a natural experiment for estimating the causal effect of new content on additional pageviews
- my current understanding is that I will start working on those in Q4.
- the work on these projects will be covered in separate tasks
Wed, Mar 25
Tue, Mar 24
Mar 20 2026
weekly update:
- We started running the experiments for semantic search with 3 different models and 5 different languages. For each case, we are generating embeddings for all (millions of) passages. Once the embeddings are generated, we can obtain the top-10 results for each query and calculate the relevance metric. Hopefully, the data will be ready early next week.
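The retrieval step described above (generate passage embeddings once, then rank them against each query embedding to get the top-10) can be sketched as follows. This is an illustrative sketch only, not the actual spark-nlp pipeline; the function names and the toy 2-dimensional "embeddings" are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, passage_vecs, k=10):
    # Rank all passage embeddings by similarity to the query embedding
    # and return the ids of the k best matches.
    scored = sorted(passage_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [pid for pid, _ in scored[:k]]

# Toy example: three passages with 2-dimensional embeddings.
passages = {"p1": [1.0, 0.0], "p2": [0.7, 0.7], "p3": [0.0, 1.0]}
print(top_k([1.0, 0.1], passages, k=2))  # ['p1', 'p2']
```

At production scale this brute-force scan would be replaced by an approximate nearest-neighbor index, but the ranking logic is the same.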
weekly update:
- We made good progress on analyzing statistics of the dataset. For now this covers 6 wikis during the exploratory phase, but we will scale it up to more (eventually all) wikis:
- Number of templates
- Number of articles with a template
- Distribution of number of templates per article
- Number of add/remove events
- Number of currently added templates (backlog)
- Average time to fix, Fraction fixed in 1st week
- Time evolution of number of templates added to articles
- Per template data: Most common templates across languages, Most/least used templates, Templates which take the longest/shortest from being added to being removed (i.e. time to fix)
- Next step is to organize these results better (plots, tables)
- We will start mapping templates to policies. This is a crucial piece for demonstrating the usefulness of this dataset to broader audiences:
- this paper from WikiWorkshop presents a dataset for studying policy invocation and enactment. Templates (that are mapped to policies and guidelines) could provide a complementary approach as they can be considered implicit invocations.
- this paper discusses the importance of data about Wikipedia policies for improving natural language understanding tasks of large language models.
weekly update:
- reached out to and discussed with Rita, Maryana, and Suman about relevant research questions on readers. Started revising the reader direction, incorporating the feedback.
- got feedback about the project on understanding reader retention. This is a question that is relevant to folks in the Readers teams. My recommendation would be to start by looking at existing data being collected in ongoing projects (attribution and comparative reader research) to get some initial insights. Based on the feedback, we can set up a dedicated experiment in test kitchen in the next FY, if needed.
Mar 18 2026
Mar 16 2026
Mar 13 2026
Weekly update:
- We scoped the tasks around expanding the offline evaluation of the semantic search using benchmark data
- More models: T419397: Get search results for different embedding models from semantic search. Specified the set of embedding models.
- More languages: T419409: Get search results from semantic search using MIRACL benchmark dataset. Specified the data and languages.
- Next step is generating the embeddings so we can run queries and get search results for evaluation
weekly update:
- went back to the original draft of the reader direction and am starting to revise it heavily based on the feedback I received. No major new items were brought up so far. Thus, I plan to re-organize and shorten the content into at most 5 major research directions. I will then try to get a new round of feedback.
- I wrote up different options for a research project on better understanding retention of readers (googledoc). I shared this with Sherry and Hsuanwei to get feedback on what would be most useful for Product.
- I wrote up a research plan for running an analysis using content translation as a natural experiment (googledoc). Shared this with Debra as a potential project for the reader growth bucket. waiting for feedback on prioritization.
weekly update:
- obtained first results for summary statistics of templates in 6 languages. We will refine the analysis based on the paper On the Evolution of Quality Flaws and the Effectiveness of Cleanup Tags in the English Wikipedia. Main extension will be to calculate statistics beyond English Wikipedia.
- Number of templates
- Usage: How often are they used, what are the most commonly used templates
- Evolution: How is usage increasing/decreasing over time? How is the backlog increasing/decreasing? Compare against the total number of articles.
- Time to remove templates
- For this we will also want to aggregate templates by template types (e.g. verifiability vs style issues)
- Discussed with several folks that one way to increase the utility of the dataset would be to match maintenance templates with the main content policies and guidelines.
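The "time to remove templates" statistic above can be computed from the add/remove event stream per (article, template) pair. The sketch below is a simplified illustration, not the actual pipeline code; the event format and function name are hypothetical.

```python
from datetime import datetime, timedelta

def time_to_fix(events):
    # events: (timestamp, action) pairs for one (article, template) pair,
    # where action is "add" or "remove".
    # Returns the durations between each add and its matching remove;
    # templates that are still present (the backlog) yield no entry.
    fixes, pending = [], None
    for ts, action in sorted(events):
        if action == "add" and pending is None:
            pending = ts
        elif action == "remove" and pending is not None:
            fixes.append(ts - pending)
            pending = None
    return fixes

# Toy example: one template fixed after 3 days, one still open (backlog).
events = [
    (datetime(2026, 1, 1), "add"),
    (datetime(2026, 1, 4), "remove"),
    (datetime(2026, 2, 1), "add"),
]
durations = time_to_fix(events)
fixed_first_week = sum(d <= timedelta(days=7) for d in durations) / len(durations)
```

Aggregating `durations` per template (or per template type, e.g. verifiability vs. style) gives the average time to fix and the fraction fixed in the first week.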
Mar 9 2026
Mar 6 2026
weekly update:
- met and discussed with Debra: the focus should be on identifying causal relationships that affect reader metrics. We think that the retention survey will be restricted to correlational insights. In addition, some of the ongoing work from others (attribution research and comparative reader research) will already provide some insights into retention. We would like to first assess how much these insights could serve questions around retention from product teams before investing more.
- We also discussed an alternative: The hypothesis is that additional content being available to internet users will lead to additional traffic from external search engines. Content translations could serve as individual natural experiments where some group of internet users are able to access new content (e.g. when an article is translated into ptwiki it is now accessible to speakers of Portuguese) whereas some group of internet users is not able to access the new content (e.g. when an article is translated into ptwiki it does not make a difference for non-Portuguese speakers). These two groups can be considered treatment and control groups and we can do a diff-in-diff comparison of pageviews with search-engine referer between the two groups. I will sketch an outline of this analysis next week.
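The comparison described above is the classic 2x2 difference-in-differences design. As a minimal sketch (toy numbers only, not real pageview data), the estimate is the change over time in the treatment group minus the change over time in the control group:

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    # Difference-in-differences estimate:
    # (treatment change over time) minus (control change over time).
    # Under the parallel-trends assumption, this isolates the treatment effect.
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical mean daily search-referred pageviews before/after a translation:
# treatment (readers gaining access to the new article): 100 -> 130
# control (readers for whom nothing changed):            100 -> 105
effect = diff_in_diff(100, 130, 100, 105)
print(effect)  # 25
```

The key identifying assumption is parallel trends: absent the translation, search-referred pageviews for both groups would have evolved the same way.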
weekly update:
- Collecting feedback from the first round of analysis for digging deeper.
- We identified different models for testing variations of the semantic search, to assess whether observations from the first round are due to the specific underlying model (qwen-3-0.6B) or hold generally for semantic search:
- Qwen3-8B: a larger version of the same Qwen-3 as a comparison, even if we would not be able to run it in production
- Jina-embeddings-v5-text-small: recently published, state-of-the-art for this model size
- Multilingual-e5-large-instruct: the model used for the prototype deployed on Greek Wikipedia
- We identified a multilingual benchmark dataset to i) test variation across languages; and ii) compare our current model's performance with results reported in literature
- MIRACL is available in EN, DE, ES, FR, ID (but not in IT, NL, PT) among the languages relevant for the current semantic search project
- it only covers natural language questions (not from actual search logs, so not representative of WP queries), but our first round of results indicated that these are the queries where semantic search offers the largest advantage over our current search.
weekly update:
- manual check of 100 samples in 6 languages in this spreadsheet. Overall, the data looks reliable. However, we identified some parsing issues, such as nested templates, multi-tag templates, and some false positives probably due to reverted edits. We plan to fix those in the next iteration.
- We are starting the analysis of high-level metrics
- We are starting parsing of the content of templates to map them to the corresponding policy or guideline (e.g. identifying links to Wikipedia namespace). The first approach is to capture all links from the page to the Wikipedia-namespace and then do a manual filtering.
Feb 27 2026
weekly updates:
- still figuring out the feasibility of work to better understand retention of users. So far, we are considering two threads:
- retention survey: this would require some engineering work on T417185: Migrate QuickSurveys data collection to Test Kitchen. there are ongoing discussions around prioritization of this work
- retention action: this work would try to identify which reader actions are correlated with high or low reader retention. This question might be possible to address with the data/analysis from the attribution research.
- I started to look in more detail into traffic from external search engines over time. Specifically, I stratified by article topic (aggregating timeseries of all articles in English Wikipedia belonging to the same topic). While very exploratory, there are some interesting aspects: for some topics (such as biographies or films) traffic from external search engines has been stable or even increasing; while for other topics (most notably STEM) the traffic from external search engines has been decreasing. https://gitlab.wikimedia.org/mgerlach/external-traffic/-/blob/main/trend-external-traffic_timeseries-topics.ipynb?ref_type=heads
weekly update:
- Refactored the dataset pipeline into 2 parts: i) getting the full (raw) dataset, ii) applying the filtering step. With this, we can now also run the pipeline for enwiki, which makes us fairly confident that we can, in principle, run it on all wikis.
- We built a dataset for an initial set of 6 languages (selected based on language familiarity, so that we can manually check results): bn, de, en, hi, pt, simple. We created a smaller random subsample for manual investigation/verification.
- We started to identify the main summary statistics to report a high-level overview of the dataset (number of templates, number of revisions, number of articles, time between adding/removing a template).
weekly update
- Compiled results for the offline evaluation of semantic search in English Wikipedia using the benchmark dataset and comparing it with our current lexical search.
- Specifically, we evaluate different search models on the new benchmark dataset. We consider Wikipedia search (lexical search) and semantic search (the current model for the MVP) with different variations (e.g. re-ranking results after retrieval). For each model, we get the top-10 search results for each query. We calculate different evaluation metrics to quantify the relevance of the search results using the pytrec_eval package: NDCG, Precision, Recall, Binary preference (bpref). We evaluate the relevance of the retrieved results on the article and the paragraph level by comparing with the annotations in the benchmark dataset.
- Results can be found in this doc: https://docs.google.com/document/d/1xgdzD0TFIqyAw45mf9uHjzdpeMauugMRefQBlEu8i6I/edit?tab=t.x3x7obtlsqmn
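For intuition about the headline metric, here is a minimal pure-Python nDCG@k. This is illustrative only: the actual evaluation uses pytrec_eval, and a proper IDCG is computed from the full relevance judgments for a query, whereas this sketch normalizes over the retrieved list.

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    # ranked_rels: graded relevance scores of the returned results, in rank order.
    # DCG discounts each gain by log2(rank + 1); nDCG divides by the DCG
    # of the ideal (descending-relevance) ordering of the same judgments.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 1]))  # 1.0 -- already ideally ordered
```

A perfectly ordered result list scores 1.0; swapping relevant results down the ranking lowers the score, which is why nDCG rewards putting the best paragraph first.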
@dcausse Thanks for generating this dataset.
We successfully used it to run the offline evaluation with the benchmark dataset.
Feb 20 2026
weekly update:
- from discussions in #talk-to-experiment-platform (thread), my understanding is that, in principle, it is possible to combine quicksurveys with test-kitchen. However, it seems that some engineering work still needs to be done before this is ready. Thus, we might not be able to start working on this immediately.
- I started exploring other opportunities. One potential direction is to understand in more detail the traffic from external search; i.e. what are the factors that lead to more/less traffic from external search engines and potentially identifying quasi-causal relationships via natural experiments.
weekly update:
- We refined the set of metrics for evaluation: nDCG@k, precision@k, recall@k, MAP@k, bpref@k for both paragraph and article level
- We collected search results for the 600 queries of the benchmark dataset for Wikipedia search and the semantic search MVP (qwen-model) with different variations (adding re-ranker, additional context) T417242#11636952
- Next step: calculating metrics
weekly updates
- Extracted full dataset for 4 languages: simple, de, bn, hi. en is still pending as we need to figure out settings for resource allocation to avoid memory issues.
- We are looking manually through a small subset of samples to spot-check any processing issues. One potential issue we have identified is that for some edits, one template is removed and at the same time another one (or more) is added. This might indicate that the former issue is not resolved but rather the latter templates provide a more specific characterization of the issue.
- Next step: Starting analysis of basic summary stats (e.g. number of templates, affected articles) over time.
Feb 13 2026
weekly update:
- discussed with Search Team about how to set up evaluation
- Filtering of benchmark dataset: we will remove annotations from inconsistent raters
- Metrics: nDCG, precision, recall
- Granularity: Article- and paragraph level
- Models: Wikipedia search, and prototype model for semantic search T412338: Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP
weekly update:
- expanded code to capture all maintenance templates for a given wiki. also converted notebook to standalone script https://github.com/au-clan/wiki_templates/blob/development/template_dataset/template_dataset_v2_mod.py
- Next step: running pipeline for different wikis and optimizing queries/setup for larger wikis requiring more resources.
Final update:
- We collected the final dataset
- Documentation is available in this google doc (for now internal only)
Feb 12 2026
Feb 6 2026
weekly update:
- Started to focus in more detail on one project around better understanding retention.
- One potential approach would be to combine a reader survey via quicksurvey with measurement of retention in test kitchen. The first feedback I got was that this is likely possible, but my goal next week is to figure out in more detail what the technical options/limitations are.
weekly update:
- implemented several refinements for the processing pipeline: resolving redirect names of templates, marking edits with specific template that was added/removed
- next step is to expand the pipeline to include all maintenance templates from a single wiki
weekly update:
- confirmed with Jazmin that this should be captured as a hypothesis under WE3.10
- as we have collected the search result relevance annotation, I am starting to think about the best approach to evaluation.
- metric: likely, we will use nDCG@10 as this is the main metric in some of the retrieval benchmarks such as MTEB https://arxiv.org/pdf/2210.07316
- coordinating with Search to make sure our approach is meaningful
weekly update:
- we ran the relevance annotation for the full dataset of 600 queries.
- will spend another week on cleaning the dataset and putting together documentation before closing
Final update:
- we ran the annotation using prolific on the full dataset
- basic stats: 600 queries, 10 candidate search results per query (paragraph), 3 annotations each
- raw data in this spreadsheet (internal): https://docs.google.com/spreadsheets/d/1D4ap1_Xo3uqgpohFAgn_pjy1VLpI28PAgVZcOw0f_XA/edit?gid=0#gid=0
Jan 29 2026
weekly update
- Had a meeting with Miriam and Debra, we reached consensus on a shortlist of potential research projects on readers
- We identified one top contender around understanding factors that affect retention of readers. I will try to sketch an outline of what this project could look like.
- Set up follow-up meeting next week to identify/discuss one other focus area
weekly update
- We collected candidate results for the 600 final selected queries
- We are freezing/storing the corpus containing all paragraphs from all enwiki articles using the 20260125 snapshot
final update - task is completed
- We collected the top-10 article results from Wikipedia search and an external search, respectively.
- We identified the top-10 paragraphs from the selected articles (with at most 2 paragraphs from the same article)
- We collected candidate results for the 600 final selected queries
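The paragraph-selection step above (top-10 paragraphs, at most 2 per article) amounts to a greedy scan with a per-article cap. The sketch below is illustrative only; the data format and function name are hypothetical, not the actual pipeline.

```python
from collections import Counter

def select_paragraphs(ranked, k=10, per_article=2):
    # ranked: (article_id, paragraph_id) pairs in relevance order.
    # Greedily keep the highest-ranked paragraphs while allowing at most
    # `per_article` paragraphs from any single article.
    seen, picked = Counter(), []
    for article, para in ranked:
        if seen[article] < per_article:
            picked.append((article, para))
            seen[article] += 1
        if len(picked) == k:
            break
    return picked

# Toy example: article A dominates the ranking, so its third paragraph
# is skipped in favor of a paragraph from article B.
ranked = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("C", 1)]
print(select_paragraphs(ranked, k=3))  # [('A', 1), ('A', 2), ('B', 1)]
```

The cap keeps the candidate pool diverse, so annotators see paragraphs from several articles rather than one article's top paragraphs only.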
Jan 28 2026
weekly updates:
- no updates this week
Jan 16 2026
weekly update:
- Shared initial results and data pipeline with Search team; received good overall feedback.
- We completed L3SC review of ToS of 3rd party search platforms and identified an option to move forward
- We switched the text processing (extracting paragraphs) to use the Enterprise Structured Content Snapshot dumps
- We completed a pilot study on prolific to get annotations for a small set of queries (sheet internal). The quality of annotations looks promising.
weekly update:
- put together example pipeline to extract revisions for a single template (citation_needed) in a single wiki (simplewiki) https://gitlab.wikimedia.org/repos/research/maintenance-templates/-/blob/main/notebooks/cleanup-templates_collect-events-revisions.ipynb?ref_type=heads
- next step will be formatting the table and scaling to more templates/wikis
weekly update:
- This is currently still blocked by T406207: Create a dataset for evaluation of search on Wikipedia. Once that task is completed, I will pick up work on this task.
weekly update:
- set up a meeting with @Miriam and @DKumar-WMF to have a first round of discussion on prioritization
final update:
- We collected a final set of 915 queries after careful automated and manual filtering. https://docs.google.com/spreadsheets/d/1FMVsCm7AEYw5BsN7u-afMxmaCHkm_nZ1yzKoLb9ygNE/edit?gid=1925455549#gid=1925455549
- We shared our results with the members from the Search Team and received positive feedback that the selected queries constitute a meaningful set for evaluation search
final update
- I drafted a first version of the reader research direction https://docs.google.com/document/d/1ir2JoZEnoSCxqJCx4wiLIMvsGsjboeOYubqR5scVDSY/edit?tab=t.0
- This is still under review within the Research Team and we will keep iterating
- We are starting to think about next steps: T414793: Implementation and dissemination of the reader research direction
Jan 9 2026
weekly updates:
- We collected a set of queries and applied manual filtering (sheet)
- Submitted a request via L3SC for reviewing Terms of Services of 3rd party search platforms for generation of candidate results (asana ticket)
- We are updating the processing pipeline for extracting paragraphs from all articles, switching from wikitext to Enterprise's structured content snapshots, as this provides a cleaner representation of the article text.
weekly update:
- did an iteration to address comments from previous round of feedback.
- shared also with @leila for high-level feedback. based on that I will decide upon next steps.
Dec 19 2025
weekly updates:
- Overall, we are fully on track to get a search result dataset. Before running a small pilot study, we need to make minor tweaks to the query filtering.
- Collect a set of representative queries in WP search:
- We implemented a filter for the frequency of queries such that analysis is considered high-level (>=25 users). For this, we also needed to optimize the processing pipeline so that we can consider queries from all 3 months that are available in the logs.
- We are iterating to improve our query filtering to remove, e.g., navigational queries. One example is to make sure we remove queries that exactly match a page title, including all potential redirects
- We are adding an additional bucket for queries that are formulated as questions. Even when considering long queries (8+ terms), few are actually in the form of natural language questions. However, we want to capture those in our dataset as well even if they are currently rare in our logs.
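The navigational-query filter described above can be sketched as a set lookup against page titles and their redirects. This is a simplified illustration (hypothetical function and data; the real pipeline works over search logs and the full redirect tables):

```python
def filter_queries(queries, titles_and_redirects):
    # Drop navigational queries: those that exactly match a page title
    # or any of its redirects (case-insensitive, whitespace-trimmed).
    known = {t.strip().lower() for t in titles_and_redirects}
    return [q for q in queries if q.strip().lower() not in known]

# Toy example: one canonical title plus a redirect pointing to it.
titles = {"Barack Obama", "Obama"}
queries = ["obama", "speeches by barack obama", "Barack Obama"]
print(filter_queries(queries, titles))  # ['speeches by barack obama']
```

Including redirects in the lookup set is what catches queries like "obama" that navigate to a page without matching its canonical title.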
Closing as currently no work that is planned or ongoing.
weekly update
- shared with team and received lots of feedback from @Miriam. I will try to do an iteration and address these comments in the next days.
weekly update
- Closing this task as completed
- formal collaboration has been started and collaborators are fully onboarded
- we started to generate the dataset of template usage but this will require some additional iterations. This work will be captured in a separate task together with a first exploratory analysis of the dataset.
Nov 28 2025
weekly update:
- Collect a set of representative queries in WP search:
- Conducted privacy check-in about publishing the set of queries. As a one-off dataset for English Wikipedia, this was approved.
- We will implement an additional filter for the frequency of queries such that analysis is considered high-level (>=25 users)
- Collecting candidate search results:
- Decided and implemented scheme for selecting top-5 paragraphs as candidate search results
- Using annotation tool:
- Requested a privacy survey statement for conducting the data annotation via prolific
- We set up a test-study with synthetic data in the prolific AI task builder to finalize UI of the annotation
weekly update
- starting data collection of revisions where maintenance templates are added or removed
weekly update
- incorporated feedback from Debra, Mike, and Yu-Ming
- finalized new revised version available in this doc (internal)
Nov 21 2025
weekly update:
- We are continuing to make progress on setting up the full pipeline for the dataset generation.
- Collect a set of representative queries in WP search:
- This is completed from a technical side. We have a pipeline to extract a set of representative queries
- We are waiting for feedback from the privacy consultation on whether and how we can store and publish the selected queries for annotation
- Collecting candidate search results:
- We are testing different options to select the most relevant paragraphs from a set of search results obtained from, e.g., Wikipedia search, to present as candidate search results to be annotated. This is important to avoid selection bias by missing potential relevant paragraphs as they will be implicitly marked as irrelevant since they will not be available for annotation.
- Using annotation tool:
- We are testing the study setup in prolific using mock-up data (not from the actual queries).
- In order to conduct the actual study I am requesting a survey privacy statement. Once I have the details figured out (e.g. retention time and publication) I will submit the request, probably early next week.
- I confirmed that we have available budget in the team to run the study on prolific. I am figuring out the details about the process of how to request/spend the budget correctly.
@BTullis Thank you.
weekly update:
- collaborators can now access stat-machines
- only blocker is kerberos access in order to use hive tables in spark (T410389: Request kerberos identity for AnkitaM, resolved)
- next step is to start collecting the dataset of templates being added/removed
