
dcausse (David Causse)
User

User Details

User Since
Jun 9 2015, 9:03 AM (566 w, 6 d)
Availability
Available
IRC Nick
dcausse
LDAP User
DCausse
MediaWiki User
DCausse (WMF) [ Global Accounts ]

Recent Activity

Today

dcausse updated subscribers of T423881: RelatedArticles constructs an unofficial page URL instead of amending an official one.

@matmarex encountered similar issues and suggested a change in how we handle wprov when arriving on the landing page (see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1267282). I haven't found a way to reproduce this, but I did not look closely at what happens with the RelatedArticles extension. I'm all for more standardisation on this.

Tue, Apr 21, 7:08 AM · Reader Growth Team, mediawiki.util, Patch-For-Review, RelatedArticles

Yesterday

dcausse added a comment to T418525: Flink base image should not install into system python environment.

Also it would be good to update search-platform/cirrus-streaming-updater/.pipeline/blubber.yaml, but it needs maven build to make it work. cc @dcausse

Here is proposed diff https://gitlab.wikimedia.org/-/snippets/290

Mon, Apr 20, 8:48 AM · Patch-For-Review, good first task, Data-Platform-SRE (2026-03-27 - 2026-04-17), Data-Engineering, Event-Platform
dcausse added a comment to T422928: HTML Pipeline - Performance improvements.

Thanks @JMeybohm and @dcausse. I'd like to try this!

So, IIUC we should:

  • set x-envoy-max-retries: 0
  • set x-envoy-upstream-rq-timeout-ms to our client timeout - 100ms (ENVOY_OVERHEAD_MS) ?
  • set x-forwarded-for: 127.0.0.1

Is that correct?

Is there a way to see the effect? Somewhere in Grafana that shows envoy retries for our service? It'd be nice to see the effect explicitly rather than just hoping. :)

For the retries, I would expect to only see "limit_exceeded" in the upstream request retry panel from the envoy telemetry. Regarding timeouts I'm not so sure, but possibly you should rarely see request timeouts from your app code and mostly see 504 upstream response timeouts instead?
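
A minimal Python sketch of the three header settings discussed above. The header names and the 100 ms margin (ENVOY_OVERHEAD_MS) come from the quoted comment; the helper name and the example client timeout are hypothetical, and whether the mesh honours these headers depends on how the envoy listener is configured.

```python
ENVOY_OVERHEAD_MS = 100  # margin suggested in the comment above (assumption)


def envoy_request_headers(client_timeout_ms: int) -> dict[str, str]:
    """Hypothetical helper: ask envoy not to retry and to time out
    slightly before the client itself gives up."""
    if client_timeout_ms <= ENVOY_OVERHEAD_MS:
        raise ValueError("client timeout must exceed the envoy overhead margin")
    return {
        "x-envoy-max-retries": "0",
        "x-envoy-upstream-rq-timeout-ms": str(client_timeout_ms - ENVOY_OVERHEAD_MS),
        "x-forwarded-for": "127.0.0.1",
    }


# Example with an assumed 5 s client timeout.
headers = envoy_request_headers(client_timeout_ms=5000)
```

Deriving the envoy timeout from the client timeout keeps the two values in sync, so the upstream timeout should fire before the client's own timeout does.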

Mon, Apr 20, 8:00 AM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

Thu, Apr 16

dcausse added a comment to T422928: HTML Pipeline - Performance improvements.

From what I recall there are certain limitations on the headers that can currently be used to reconfigure envoy's behavior from client side since T354853: Service mesh envoy does not treat incoming connections as local. As a workaround an additional service-mesh listener with the expected retry policy could be created/used.

Thu, Apr 16, 7:44 AM · Patch-For-Review, Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform

Wed, Apr 15

dcausse added a comment to T421912: wbsearchentities: strictlanguage parameter does not exclude items with labels only in other languages.

We sometimes use search backend query logs for this, the table is event.mediawiki_cirrussearch_request.
See P90838 for a quick example. A quick extraction shows that 0.64% (1,244,901 out of 193,639,885) of the completion requests are using strictlanguage for the first half of April 2026.

Wed, Apr 15, 6:13 PM · Discovery-Search (2026.04.06 - 2026.05.01), Patch-For-Review, Wikibase Reuse Team
dcausse created P90838 wikidata prefix search with strictlanguage.
Wed, Apr 15, 6:09 PM
dcausse removed projects from T420859: EntityHandlerTestCase causes invalid data provider failures under PHPUnit 10: Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch.

Applied a quick workaround in the WikibaseCirrusSearch test case. I briefly looked over WikibaseLexeme but was unsure about what to do. I think the current design of EntityHandlerTestCase is not flexible enough and assumes that all subclasses have to run all the tests. It seems to me that an approach based on composition with traits might be more appropriate, but it sounds like a non-trivial refactoring.

Wed, Apr 15, 8:25 AM · MW-1.46-notes (1.46.0-wmf.24; 2026-04-14), Wikidata Lexicographical data, Wikidata
dcausse updated the task description for T420859: EntityHandlerTestCase causes invalid data provider failures under PHPUnit 10.
Wed, Apr 15, 8:21 AM · MW-1.46-notes (1.46.0-wmf.24; 2026-04-14), Wikidata Lexicographical data, Wikidata
dcausse added a comment to T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution.

Proper fix:

OpenSearch supports configuring path.plugins in opensearch.yml. If we set it to something like /usr/share/opensearch/wmf-plugins and rebuild wmf-opensearch-search-plugins to deploy there, we sidestep the dpkg conflict entirely. opensearch ignores the upstream-bundled plugins in the default path and only loads from our custom path. No strip/reinstall dance, no dpkg conflicts, survives upgrades cleanly.

This seems like a nice trick to prevent opensearch from loading its default plugins.

Wed, Apr 15, 7:31 AM · Patch-For-Review, Data-Platform-SRE (2026-03-27 - 2026-04-17)

Tue, Apr 14

dcausse added a comment to T418130: Analyse Wrong-Keyboard-Detection Usage.

Things generally look good—and I'm really excited about the second-try searches I see!—but there's one thing that's a little off, though maybe you've accounted for it.

Entries that have position "10" mean that they rolled over to fulltext search, right? (Because the suggestions are 0-9.)

These include all the ones from kawiktionary, one from ruwiki, and two from kawiki. The page titles associated seem to be at least kinda close, but, for example, if you search for "катет" on kawiktionary, you get 10 suggestions in Cyrillic, so there shouldn't be any second-try suggestions shown.

Indeed, I checked the code: the last line of the dropdown ("search for page containing $query") is tracked as a click and was wrongly assigned to the last result of the backend logs. I excluded those and now count them as "srp_clicks".
Results are roughly similar (between 1% and 3% of the interactions are due to dwim on wikipedias):

wiki          dwim_clicks  srp_clicks  non_dwim_clicks  go_clicks  missing data  total   dwim pct
ruwiki        9273         16697       233680           103864     8066          371580  2.495560
kawiki        45           81          588              869        142           1725    2.608696
ruwiktionary  240          607         19334            15551      152           35884   0.668822
hewiki        516          1210        14747            11702      1518          29693   1.737783
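
As a sanity check on the figures above, the "dwim pct" column appears to be dwim_clicks divided by the row total, where the total is the sum of the five click columns. This reading is inferred from the numbers themselves, not stated in the comment. A short Python sketch that recomputes it:

```python
# Column order: dwim_clicks, srp_clicks, non_dwim_clicks, go_clicks, missing data
rows = {
    "ruwiki": (9273, 16697, 233680, 103864, 8066),
    "kawiki": (45, 81, 588, 869, 142),
    "ruwiktionary": (240, 607, 19334, 15551, 152),
    "hewiki": (516, 1210, 14747, 11702, 1518),
}


def dwim_share(counts):
    """Return (total, dwim percentage), assuming total is the sum of all columns."""
    total = sum(counts)
    return total, 100.0 * counts[0] / total


for wiki, counts in rows.items():
    total, pct = dwim_share(counts)
    print(f"{wiki}: total={total} dwim_pct={pct:.6f}")
```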
Tue, Apr 14, 2:41 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse created T423238: ALIS data pipeline produced too many suggestions.
Tue, Apr 14, 8:43 AM · Discovery-Search (2026.04.06 - 2026.05.01), Data-Engineering, Image-Suggestions

Mon, Apr 13

dcausse added a comment to T417694: Perform a one-time clean up of retained data sets in event_sanitize.

@xcollazo

searchsatisfaction — ~2.3M files, 6,426 GB total, actively written, data goes back to 2021. Retention was explicitly requested in T274607 for analytics use. A long-term retention policy may be worth establishing

I no longer work regularly with this dataset. From the product analytics side, I believe it would be fine to delete some of the older data; however, I'd recommend checking with the Search Platform team to confirm. Maybe @EBernhardson or @dcausse?

Mon, Apr 13, 6:24 PM · Essential-Work, Data-Engineering (Q4 FS25/26 April 1st - June 30st)
dcausse claimed T417648: [MEX] M4 - improve findability of properties on lookups.
Mon, Apr 13, 5:26 PM · Patch-For-Review, Discovery-Search (2026.04.06 - 2026.05.01), Wikidata-Omega, Wikidata
dcausse claimed T420582: Migrate Airflow Search instance code away from deprecated VariableProperties.
Mon, Apr 13, 5:24 PM · Discovery-Search (2026.04.06 - 2026.05.01)
dcausse added a comment to T418130: Analyse Wrong-Keyboard-Detection Usage.

@TJones I extracted a sample of query clicks where the clicked result comes from DWIM. If you have a moment, could you check the data?
It's at /user/dcausse/T418130-dwim-analysis/sample.csv and can be extracted locally from a stat machine using hdfs dfs -text /user/dcausse/T418130-dwim-analysis/sample.csv/part* > dwim_sample_query_clicks. Let me know if you spot anything weird (I'm mainly trying to make sure my data extraction is correct).

Mon, Apr 13, 9:20 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch

Fri, Apr 10

dcausse added a comment to T418130: Analyse Wrong-Keyboard-Detection Usage.

Looked briefly into the data, as we should be able to join frontend and backend logs, and the results over 1.5 days are quite promising:

Fri, Apr 10, 5:03 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse moved T419397: Get search results for different embedding models from semantic search from In Progress to Needs Review on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Fri, Apr 10, 3:33 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse added a comment to T422511: page_change.v1 increate partitions to 3.

Unless flink is doing something special during dynamic partition discovery, it might use its default offset reset strategy? If so, and they use latest offsets, they might lose events while the app is restarted or while partitions are discovered:

  • IIRC wdqs uses latest and may lose events for at most 5min (default partition discovery interval)

Urgh. You are right.

FLIP-288 suggests that later discovered partitions will reset from earliest. But this has been implemented in kafka connector >= 4.0.0 (we are still on 3.x.x).

Fri, Apr 10, 8:51 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

@Trokhymovych please find the query result pairs in /user/dcausse/semantic_search/T419397/query_result_pairs/ (file names should be self-explanatory).
My problem with multilingual-e5-large-instruct was due to its small context size: around 300k passages were affected. I reran those, allowing the model to ignore extra tokens.

Fri, Apr 10, 8:33 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Thu, Apr 9

dcausse added a comment to T422511: page_change.v1 increate partitions to 3.

But, perhaps it would be better if applications didn't rely on canaries for watermark advancement? They are more intended for a pipeline health check than a pipeline 'liveness probe' :)

Flink apps generally set source idleness and don't expect canary events, but unfortunately they can't prevent these events from waking up idle sources (unless there are ways to apply a filter very early at the source level?).
I actually don't know if it is a problem to have partitions properly idle and a single one being woken up at regular intervals. It might just be OK, but it sounded a bit odd to have only one partition receiving this traffic.

Thu, Apr 9, 5:07 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
dcausse added a comment to T422511: page_change.v1 increate partitions to 3.

@dcausse @gmodena @gkyziridis please provide your respective team input.

Thu, Apr 9, 3:54 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Event-Platform
dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

@Trokhymovych I think I'll have the data by the end of day.
I have the embeddings extracted and indexed for

  • pplx-embed-context-v1-0.6b: 53M passages in 36h (1024 cores with beefier workers: 64*4G+16G, 16 cores)
  • embbedings-jinaai-embeddings-v5-nano: 53M passages in 19h (1024 cores, workers: 64*4G+8G, 16 cores)
Thu, Apr 9, 8:52 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Wed, Apr 8

dcausse updated the task description for T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 6:18 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse updated the task description for T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 6:18 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse added a comment to T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.

Impact: the semantic search API returned results from a very small subset (~10%) of the articles during these periods:

  • ptwiki: 2026-04-05T14:00:00 to 2026-04-08T08:41:00
  • enwiki: 2026-04-05T15:30:00 to 2026-04-08T13:56:00
  • frwiki: 2026-04-05T15:15:00 to 2026-04-08T16:00:00
Wed, Apr 8, 6:09 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse updated the task description for T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 2:44 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse moved T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages from Incoming to Needs Review on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Wed, Apr 8, 1:05 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse edited projects for T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages, added: Discovery-Search (2026.03.03 - 2026.04.03); removed Discovery-Search.
Wed, Apr 8, 1:05 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse added a comment to T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.

Looking at idwiki I see a huge difference in the number of passages extracted:

20260329	3750496
20260405	724490
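
To put the two idwiki counts above in perspective, a small Python sketch computing the relative drop between snapshots (the helper name is illustrative):

```python
# Passage counts per snapshot for idwiki, copied from the figures above.
passages = {"20260329": 3_750_496, "20260405": 724_490}


def relative_drop(before: int, after: int) -> float:
    """Fraction of passages missing in the newer snapshot."""
    return (before - after) / before


drop = relative_drop(passages["20260329"], passages["20260405"])
print(f"idwiki lost {drop:.1%} of its passages between snapshots")
```

That is a loss of roughly four fifths of the passages.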
Wed, Apr 8, 12:38 PM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse updated the task description for T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 9:57 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse renamed T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages from discovery.wiki_content_passage snapshot 20260405 does not contain all to discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 9:56 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse created T422594: discovery.wiki_content_passage snapshot 20260405 does not contain all passages/pages.
Wed, Apr 8, 9:56 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch

Tue, Apr 7

dcausse added a comment to T417648: [MEX] M4 - improve findability of properties on lookups.

I understand the issue as follows:

  • You search for "pet" hoping to select Pet (Q39201)
  • You see a wide variety of matches starting with pet but not what you expect (Pet Q39201)
  • You add a space and search for "pet " hoping for Q39201 to be preferred
Tue, Apr 7, 2:05 PM · Patch-For-Review, Discovery-Search (2026.04.06 - 2026.05.01), Wikidata-Omega, Wikidata
dcausse created T422463: Change the snapshot partitioning of passage/embedding and cirrus_index_without_content tables.
Tue, Apr 7, 7:19 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

@Ahoelzl yes I can confirm the run scheduled on April 2nd did run properly.

Tue, Apr 7, 6:00 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Thu, Apr 2

dcausse added a comment to T420427: Search shouldn't trim trailing space when suggesting suggestions.

The trailing spaces are not trimmed by MW; rather, the completion algorithm tokenizes the query and the titles, which causes spaces to be ignored most of the time.
We can attempt to change the behavior, but I'd suggest running a quick test to check for possible undesirable side effects.
There is also an issue with wikibase prefix search (which does not use the same algorithm), where trailing spaces are supposed to influence scoring but this got disabled by some custom scoring profiles.
We can fix the custom profiles (en, de, es, fr) to let trailing spaces influence the ranking, but I think this is different from what is requested in T417648.
In T417648 I understand that the trailing space should be used as a signal to tell the search engine to look for "exact" matches on the trimmed search string. The wikibase prefix search should already have a ranking signal on an "exact" label/alias match from the trimmed search string but I suspect the different weights are not doing what we expect in this case.

Thu, Apr 2, 4:25 PM · Discovery-Search (2026.04.06 - 2026.05.01), MW-1.46-notes (1.46.0-wmf.23; 2026-04-07), Patch-For-Review, CirrusSearch
dcausse claimed T420427: Search shouldn't trim trailing space when suggesting suggestions.
Thu, Apr 2, 12:49 PM · Discovery-Search (2026.04.06 - 2026.05.01), MW-1.46-notes (1.46.0-wmf.23; 2026-04-07), Patch-For-Review, CirrusSearch
dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

As for your question, I think the proposed approach should work. That said, I would recommend including the title and section context (e.g., title, parent sections, and section name) in all paragraphs, not just the first one.

Thu, Apr 2, 10:04 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Wed, Apr 1

dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

jina-embeddings-v5-text-small is done: 53M passages in 51.3h with 1000 cores

Wed, Apr 1, 1:59 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse created P90180 pplx-embed-context section passages .
Wed, Apr 1, 1:57 PM
dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

The task finished running during the night, @dcausse do the numbers look good now?

Wed, Apr 1, 8:47 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
dcausse added a comment to T419409: Get search results from semantic search using MIRACL benchmark dataset.

@Trokhymovych sure, please find those in hdfs:///user/dcausse/T419409-miracl/query_result_pairs with names pure_knn_10_top_10_per_article_miracl_$lang.json.
I collected the top-10 passages per article so that even in the worst case (the best matches all belong to the same page) you should have them. If you flatten the passage arrays and re-order by score, you should be able to infer the top-10 passages per query. Please let me know if you spot issues; I can try to reshape my query and output format to drop the per-article breakdown if it is problematic.
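
The flattening step described above could look roughly like the following Python sketch. The record layout (an "articles" list whose entries carry a "passages" list with "passage_id" and "score" fields) is a hypothetical stand-in, since the actual JSON schema of the files is not given here, and the sketch assumes that a higher score means a better match.

```python
import heapq


def top_passages(articles, k=10):
    """Flatten the per-article passage lists and keep the k best by score."""
    flat = (
        passage
        for article in articles
        for passage in article["passages"]
    )
    # Assumes higher score = better match.
    return heapq.nlargest(k, flat, key=lambda p: p["score"])


# Hypothetical query result with a per-article breakdown.
example = [
    {"title": "A", "passages": [{"passage_id": "a1", "score": 0.91},
                                {"passage_id": "a2", "score": 0.40}]},
    {"title": "B", "passages": [{"passage_id": "b1", "score": 0.75}]},
]
best_ids = [p["passage_id"] for p in top_passages(example, k=2)]
```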

Wed, Apr 1, 8:26 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Tue, Mar 31

dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

All looks good to me, @dcausse if you agree I will run the task publish_page_change_weighted_tags.

Awesome, thanks! Yes, please feel free to re-run this task.

Tue, Mar 31, 12:49 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
dcausse removed a project from T228626: Extract interfaces and base classes from SearchResultSet and SearchResult: Patch-Needs-Improvement.
Tue, Mar 31, 6:53 AM · MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Discovery-Search (Current work), Discovery-ARCHIVED, CirrusSearch
dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

Quick update

  • multilingual-e5-large-instruct is done: 53M passages in 29h with 1000 cores
  • jina-embeddings-v5-text-small is running (started yesterday and is 40% done)
Tue, Mar 31, 6:35 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Mon, Mar 30

dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

I have created and validated all my update commands, and will shortly run them all and paste all the results with the validations here. As a next step we can just rerun the image_suggestions_weekly DAG, correct?

Mon, Mar 30, 12:02 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

I think one of the causes of this confusion is the current contract between ALIS/SLIS and search. For historical reasons (it was the first non-search pipeline to push data to search indices) it used search-internal formats, which are very fragile (magic words like __DELETE_GROUPING__ and the internal tag/score string encoding). I filed T414099 to try to address some of this and update ALIS/SLIS to use a newer contract based on an event schema, which is what newer pipelines are using (e.g. revise tone recommendations).

Mon, Mar 30, 10:17 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Hey @dcausse I must have misunderstood and thought the exists| part could be removed. I can update the output of the tables to show the correct form and fix the code to show in the correct form from next run. Does this sound good to you?

Mon, Mar 30, 10:04 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions
dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

@APizzata-WMF sorry I did not spot this earlier, but the value ["970"] is not correct: it unfortunately pushed the string 970 as the tag value instead of the score. The tag value must remain exists, with the score appended after a |, so the full string should look like ["exists|970"].
Is it possible to re-run the pipeline with this fix applied? I'll re-ship the tags right after.
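
To make the expected encoding concrete, here is a small Python sketch of the "value|score" string described above. Both helper names are hypothetical, and the validation only covers the exists tag value mentioned in this comment, not the full search-internal format.

```python
def weighted_tag(score: int, value: str = "exists") -> str:
    """Encode a tag value and its score as 'value|score' (hypothetical helper)."""
    return f"{value}|{score}"


def looks_like_exists_tag(tag: str) -> bool:
    """Check that a string has the 'exists|<integer score>' shape."""
    value, sep, score = tag.partition("|")
    return sep == "|" and value == "exists" and score.isdigit()


tags = [weighted_tag(970)]
```

With this check, the incorrect bare "970" is rejected while "exists|970" passes.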

Mon, Mar 30, 9:00 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Fri, Mar 27

dcausse added a comment to T419397: Get search results for different embedding models from semantic search.

Update:

  • jina-embeddings-v5-text-nano is based on eurobert, which llama.cpp only recently started supporting and which spark-nlp does not support yet. I tried to rebuild spark-nlp with it but hit a blocker, so I'm skipping it for now. An alternative might be to drop spark-nlp and fall back to hf sentence transformers (this requires some adaptation of the current pipeline).
  • pplx-embed-v1-0.6b might be a bit complicated, especially if we want to benefit from the context model, for which I need to adapt the pipeline to emit meaningful batches. I suspect I'll have some questions about how to build the batches to keep a meaningful context.
  • Started to extract embeddings with multilingual-e5-large-instruct using a quantized (Q8) version I built with llama.cpp tooling and uploaded to /user/analytics-search/spark-nlp/models/multilingual_e5_large_instruct_Q8_0_gguf
Fri, Mar 27, 2:54 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Thu, Mar 26

dcausse moved T419397: Get search results for different embedding models from semantic search from Incoming to In Progress on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Thu, Mar 26, 11:09 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse updated the task description for T419409: Get search results from semantic search using MIRACL benchmark dataset.
Thu, Mar 26, 11:07 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse reassigned T419409: Get search results from semantic search using MIRACL benchmark dataset from dcausse to Trokhymovych.

@MGerlach @Trokhymovych query result pairs should be available in hdfs:///user/dcausse/T419409-miracl/query_result_pairs, files are named pure_knn_10_miracl_$lang.json.

Thu, Mar 26, 11:07 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Wed, Mar 25

dcausse added a comment to T419409: Get search results from semantic search using MIRACL benchmark dataset.

Unfortunately the job failed a couple of times last week and the embeddings extraction only finished last night. Recording some timings (1000 cores with llama and qwen3-0.6B-Q8):

  • de: 15.8M, 19h
  • en: 32.8M, 28h
  • fr: 14.6M, 13h
  • id: 1.4M, 2h
  • es: 10.3M, 11h
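
For comparing these runs, the timings above can be turned into a rough throughput figure. The Python sketch below assumes all 1000 cores were busy for the whole wall-clock duration, which is an approximation: the hours are rounded and the comment does not describe actual utilisation.

```python
CORES = 1000  # from the comment above

# language -> (passages, wall-clock hours), copied from the list above
timings = {
    "de": (15.8e6, 19),
    "en": (32.8e6, 28),
    "fr": (14.6e6, 13),
    "id": (1.4e6, 2),
    "es": (10.3e6, 11),
}


def passages_per_core_hour(passages: float, hours: float, cores: int = CORES) -> float:
    """Approximate throughput, assuming every core was used throughout."""
    return passages / (hours * cores)


for lang, (passages, hours) in timings.items():
    print(f"{lang}: {passages_per_core_hour(passages, hours):.0f} passages per core-hour")
```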
Wed, Mar 25, 8:40 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Tue, Mar 24

dcausse edited P89910 cirrus_request_log for semantic search.
Tue, Mar 24, 11:30 AM
dcausse edited P89910 cirrus_request_log for semantic search.
Tue, Mar 24, 11:22 AM
dcausse edited P89910 cirrus_request_log for semantic search.
Tue, Mar 24, 11:20 AM
dcausse edited P89910 cirrus_request_log for semantic search.
Tue, Mar 24, 11:15 AM
dcausse created P89910 cirrus_request_log for semantic search.
Tue, Mar 24, 11:08 AM

Mon, Mar 23

dcausse closed T414091: Import passage vectors into opensearch as Resolved.
Mon, Mar 23, 10:09 PM · Discovery-Search (2026.03.03 - 2026.04.03), Semantic Search
dcausse moved T420886: The search token is no longer propagated in autocomplete search satisfaction logs from In Progress to Needs Review on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Mon, Mar 23, 6:31 PM · Discovery-Search (2026.04.06 - 2026.05.01), MW-1.46-notes (1.46.0-wmf.23; 2026-04-07), CirrusSearch
dcausse created T420886: The search token is no longer propagated in autocomplete search satisfaction logs.
Mon, Mar 23, 9:33 AM · Discovery-Search (2026.04.06 - 2026.05.01), MW-1.46-notes (1.46.0-wmf.23; 2026-04-07), CirrusSearch

Mar 19 2026

dcausse created P89890 opensearch secrets for airflow-search.
Mar 19 2026, 4:53 PM
dcausse added a comment to T419409: Get search results from semantic search using MIRACL benchmark dataset.

Started the job at https://yarn.wikimedia.org/cluster/app/application_1773845446826_8538 the dataset is quite big and probably will take quite some time to complete (hopefully finished early next week).

Mar 19 2026, 2:41 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Mar 18 2026

dcausse added a comment to T419409: Get search results from semantic search using MIRACL benchmark dataset.

@Trokhymovych it's working well now, no problem, thanks for the data!

Mar 18 2026, 8:48 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse added a comment to T419409: Get search results from semantic search using MIRACL benchmark dataset.

@Trokhymovych thanks! I don't have access to these folders. Could you update the perms, or possibly upload them to /user/dcausse/T419409-miracl if you don't want to open the perms on your user folder?

Mar 18 2026, 2:20 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse moved T419409: Get search results from semantic search using MIRACL benchmark dataset from Blocked / Waiting to In Progress on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Mar 18 2026, 2:05 PM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Mar 17 2026

dcausse claimed T418130: Analyse Wrong-Keyboard-Detection Usage.
Mar 17 2026, 9:40 AM · Discovery-Search (2026.04.06 - 2026.05.01), CirrusSearch
dcausse moved T414091: Import passage vectors into opensearch from In Progress to Needs Review on the Discovery-Search (2026.03.03 - 2026.04.03) board.
Mar 17 2026, 9:17 AM · Discovery-Search (2026.03.03 - 2026.04.03), Semantic Search
dcausse renamed T420304: VE-Cite tests (quibble-vendor-mysql-php83-selenium) are blocking merges in some repos from VE-Cite tests are blocking merges in mediawiki/skins/MinervaNeue to VE-Cite tests (quibble-vendor-mysql-php83-selenium) are blocking merges in some repos.
Mar 17 2026, 9:07 AM · MinervaNeue (Tracking), MW-1.46-notes (1.46.0-wmf.21; 2026-03-24), WMDE-TechWish-Sprint-2026-03-17-all-of-the-beans, Cite, VisualEditor-MediaWiki-References, VisualEditor, ci-test-error (WMF-deployed Build Failure)
dcausse triaged T420304: VE-Cite tests (quibble-vendor-mysql-php83-selenium) are blocking merges in some repos as Unbreak Now! priority.

Tentatively raising prio because it's failing CI on all CirrusSearch patches too and I suspect some other repos as well.

Mar 17 2026, 9:02 AM · MinervaNeue (Tracking), MW-1.46-notes (1.46.0-wmf.21; 2026-03-24), WMDE-TechWish-Sprint-2026-03-17-all-of-the-beans, Cite, VisualEditor-MediaWiki-References, VisualEditor, ci-test-error (WMF-deployed Build Failure)

Mar 16 2026

dcausse closed T404597: Eventutilities Flink: port SerDe tests from SUP as Resolved.
Mar 16 2026, 7:40 PM · Discovery-Search (2025.10.20 - 2025.12.31), Data-Engineering-Radar, Event-Platform, Data-Engineering, Essential-Work, CirrusSearch
dcausse added a comment to T382315: Flaky ve-cite Cypress test: Visual Editor Wt 2017 Cite Integration should be able to create a VE-Cite tool template (wt2017Integration.cy.js).

I seem to get failures on ve-cite. I got failures 3 times in a row on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1252597 (quibble-vendor-mysql-php83-selenium):

Mar 16 2026, 9:05 AM · ci-test-error, Browser-Tests, Cite, VisualEditor

Mar 9 2026

dcausse added a project to T419409: Get search results from semantic search using MIRACL benchmark dataset: Discovery-Search.
Mar 9 2026, 11:21 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search
dcausse added a project to T419397: Get search results for different embedding models from semantic search: Discovery-Search.
Mar 9 2026, 11:11 AM · Discovery-Search (2026.04.06 - 2026.05.01), Research, Semantic Search

Feb 26 2026

dcausse created P89064 knn index mapping.
Feb 26 2026, 6:20 PM

Feb 24 2026

dcausse added a comment to T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots.

Embeddings for dewiki, enwiki, eswiki, frwiki, itwiki, nlwiki, ptwiki and idwiki (main namespace only) are extracted weekly on Sundays and available in the table discovery.wiki_content_embeddings (partitions: snapshot, wiki, model='qwen3_0.6B_Q8')

Feb 24 2026, 12:47 PM · Discovery-Search (2026.02.02 - 2026.02.27), Semantic Search
dcausse moved T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots from To be Deployed to Done on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 24 2026, 12:44 PM · Discovery-Search (2026.02.02 - 2026.02.27), Semantic Search

Feb 23 2026

dcausse moved T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots from Needs Review to To be Deployed on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 23 2026, 10:14 AM · Discovery-Search (2026.02.02 - 2026.02.27), Semantic Search
dcausse moved T404858: A/B test using defaultsort with the completion suggester from To be Deployed to Done on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 23 2026, 10:14 AM · Discovery-Search (2026.02.02 - 2026.02.27), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), Essential-Work, CirrusSearch

Feb 20 2026

dcausse moved T417242: Get search results for queries from benchmark dataset for semantic search model from In Progress to Needs Review on the Discovery-Search (2026.02.02 - 2026.02.27) board.

Please find query-result pairs at: hdfs:///user/dcausse/semantic_search/T417242/. The folder contains a set of outputs:

  • pure_knn_10.json: the top 10 of the bare vector search
  • rerank_at_3_no_context.json: same as above but with the top-3 re-ranked using the bare passage
  • rerank_at_10_no_context.json: same as above but the top-10 are re-ranked
  • rerank_at_3_full_context.json: top-3 re-ranked using the passage with additional context (title, parent sections and section)
  • rerank_at_10_full_context.json: same as above but the top-10 are re-ranked
  • rerank_at_3_with_lead.json: top-3 re-ranked using the passage with additional context (title, parent sections and section and the lead paragraph of the page if different from the best passage)
  • rerank_at_10_with_lead.json: same as above but the top-10 are re-ranked
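
The naming scheme of the files listed above encodes the run configuration, which can be recovered with a small parser. This is only a sketch: `RunConfig` and `parse_output_name` are hypothetical names, and nothing is assumed about the JSON content of the files.

```python
import re
from typing import NamedTuple, Optional


class RunConfig(NamedTuple):
    top_k: int              # size of the top-N kept or re-ranked
    reranked: bool          # False for the bare vector search
    context: Optional[str]  # "no_context", "full_context", "with_lead" or None


_PURE = re.compile(r"^pure_knn_(\d+)\.json$")
_RERANK = re.compile(r"^rerank_at_(\d+)_(no_context|full_context|with_lead)\.json$")


def parse_output_name(name: str) -> RunConfig:
    """Map one of the listed file names to the configuration it describes."""
    m = _PURE.match(name)
    if m:
        return RunConfig(int(m.group(1)), False, None)
    m = _RERANK.match(name)
    if m:
        return RunConfig(int(m.group(1)), True, m.group(2))
    raise ValueError(f"unrecognised output file: {name}")


print(parse_output_name("rerank_at_3_with_lead.json"))
```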
Feb 20 2026, 5:07 PM · Discovery-Search (2026.02.02 - 2026.02.27), Research, Semantic Search

Feb 19 2026

dcausse moved T417242: Get search results for queries from benchmark dataset for semantic search model from Incoming to In Progress on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 19 2026, 8:20 AM · Discovery-Search (2026.02.02 - 2026.02.27), Research, Semantic Search

Feb 16 2026

dcausse claimed T414091: Import passage vectors into opensearch.
Feb 16 2026, 2:04 PM · Discovery-Search (2026.03.03 - 2026.04.03), Semantic Search
dcausse moved T404858: A/B test using defaultsort with the completion suggester from Needs Review to To be Deployed on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 16 2026, 2:04 PM · Discovery-Search (2026.02.02 - 2026.02.27), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), Essential-Work, CirrusSearch
dcausse added a comment to T404858: A/B test using defaultsort with the completion suggester.

@TJones thanks for the feedback and suggestions!
We have a repo at https://gitlab.wikimedia.org/repos/search-platform/notebooks but even if some notebooks are re-usable you sometimes have to duplicate and adapt them to run your analysis. The double period issue is, I think, me misusing some of the templates created by Erik.
I tried to update the last report to include more detailed info (per wiki graphs) but unfortunately I waited too long and some of the data is already gone thanks to our retention policy...
An A/A test is a nice idea, I haven't had the chance to do it but there's possibly a way to do it from the data we have? Some wikis have a really low volume (from what I remember bi, gv and igk had fewer than 100 observations in a week); for those we might want to double check https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines and possibly start using thresholds rather than actual numbers.
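
As a rough illustration of reporting thresholds rather than actual numbers, a low count could be replaced by a bound before publication. The helper name `publishable_count` and the default threshold of 100 are assumptions for this sketch; the actual threshold would have to come from the data publication guidelines.

```python
def publishable_count(n: int, threshold: int = 100) -> str:
    """Return the count as text, or a '<threshold' bound when it is too low.

    The default of 100 is an assumed value, not taken from the guidelines.
    """
    if n < 0:
        raise ValueError("count cannot be negative")
    return f"<{threshold}" if n < threshold else str(n)


print(publishable_count(42))    # low-volume wiki: only the bound is shown
print(publishable_count(1500))  # high-volume wiki: the actual number is shown
```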

Feb 16 2026, 1:44 PM · Discovery-Search (2026.02.02 - 2026.02.27), MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), Essential-Work, CirrusSearch

Feb 13 2026

dcausse moved T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots from In Progress to Needs Review on the Discovery-Search (2026.02.02 - 2026.02.27) board.
Feb 13 2026, 4:15 PM · Discovery-Search (2026.02.02 - 2026.02.27), Semantic Search

Feb 12 2026

dcausse added a project to T417242: Get search results for queries from benchmark dataset for semantic search model: Discovery-Search (2026.02.02 - 2026.02.27).
Feb 12 2026, 8:38 AM · Discovery-Search (2026.02.02 - 2026.02.27), Research, Semantic Search

Feb 11 2026

dcausse created P88776 opensearch rerank connector.
Feb 11 2026, 3:21 PM

Feb 10 2026

dcausse updated the task description for T414697: Build the required plugins for opensearch 3.
Feb 10 2026, 8:23 AM · Data-Platform-SRE (2026-02-13 - 2026-03-06), Discovery-Search (2026.02.02 - 2026.02.27)

Feb 9 2026

dcausse moved T415702: Commons search bar suggestions dont include exact matches from needs triage to elastic / cirrus on the Discovery-Search board.

@RoyZuo thanks for raising this. Indeed the ranking strategy for Commons is quite suboptimal and the fuzzy matches tend to hide interesting matches from other namespaces. Fixing this would require a change in how the suggestions coming from multiple namespaces are blended together.
While we ponder how to prioritize and solve this, you can apply a workaround by selecting either:

  • Redirect mode
  • Classic prefix search

from your Search User preferences.
This workaround might help in some cases but not in others (hopefully rarer) where an "exact" category match is still a widely used prefix.

Feb 9 2026, 5:07 PM · Discovery-Search, Commons

Jan 26 2026

dcausse added a comment to T368987: Add an Image: filtering by suggestion "kind" or "confidence".

Currently the indexed search tags have a hard-coded value named exists, but it could be something else: if deemed useful at search time, tags related to image recommendations could be more granular by including the kind as separate tags.
At query time this would become (suggested syntax): hasrecommendation:image=istype-depicts.

  • hasrecommendation:image>0.8 -hasrecommendation:image=istype-depicts could filter recommendations with a score above 0.8 and without an istype-depicts kind.
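
A small sketch of how such clauses could be assembled on the client side. It only mirrors the suggested syntax from this comment, which is a proposal rather than an implemented feature; the helper name `has_recommendation` and its parameters are hypothetical.

```python
def has_recommendation(kind=None, min_score=None, negate=False, tag="image"):
    """Build one hasrecommendation clause following the suggested syntax.

    Exactly one of `kind` (e.g. "istype-depicts") or `min_score` must be given.
    """
    if (kind is None) == (min_score is None):
        raise ValueError("pass exactly one of kind or min_score")
    clause = f"hasrecommendation:{tag}"
    clause += f"={kind}" if kind is not None else f">{min_score}"
    return f"-{clause}" if negate else clause


# Score above 0.8, excluding the istype-depicts kind.
query = " ".join([
    has_recommendation(min_score=0.8),
    has_recommendation(kind="istype-depicts", negate=True),
])
print(query)
```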
Jan 26 2026, 9:38 AM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), Patch-For-Review, Growth-Team, Image-Suggestions

Jan 15 2026

dcausse added a comment to T413969: Make semantic search accessible through Action API.

This will probably be a couple parts:

Jan 15 2026, 10:50 AM · MW-1.46-notes (1.46.0-wmf.20; 2026-03-17), Discovery-Search (2026.03.03 - 2026.04.03), Semantic Search, CirrusSearch
dcausse closed T414066: Download enterprise structured content snapshots in hdfs as Resolved.

Dumps are available as text files (raw ndjson) under /wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots/snapshot=$YYYYMMDD/project=${WIKI}_namespace_0 and will be updated weekly on Sundays.
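
For convenience, the path template above can be expanded as follows. This is a sketch only: `snapshot_path` is a hypothetical helper, and the date used in the example is illustrative rather than a snapshot known to exist.

```python
from datetime import date

# Base directory taken from the comment above.
BASE = "/wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots"


def snapshot_path(snapshot: date, wiki: str) -> str:
    """Expand snapshot=$YYYYMMDD/project=${WIKI}_namespace_0 under BASE."""
    return f"{BASE}/snapshot={snapshot:%Y%m%d}/project={wiki}_namespace_0"


# Illustrative date; the actual snapshot dates need to be checked in HDFS.
print(snapshot_path(date(2026, 1, 11), "enwiki"))
```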

Jan 15 2026, 8:20 AM · Discovery-Search (2026.01.05 - 2026.01.30), Semantic Search

Jan 14 2026

dcausse claimed T414070: Chunk, trim and generate passage embeddings from enterprise structured content snapshots.
Jan 14 2026, 1:45 PM · Discovery-Search (2026.02.02 - 2026.02.27), Semantic Search

Jan 13 2026

dcausse merged T139647: Search box at top right of pages should italicize redirects into T303013: Indicate when search results are from redirects (sometimes).
Jan 13 2026, 6:18 PM · Readers Essential Work, Reader Experience Team, Patch-For-Review, Codex, Vector 2022
dcausse merged task T139647: Search box at top right of pages should italicize redirects into T303013: Indicate when search results are from redirects (sometimes).
Jan 13 2026, 6:18 PM · CirrusSearch, patch-welcome, good first task, Discovery-ARCHIVED
dcausse added a project to T413969: Make semantic search accessible through Action API: Semantic Search.
Jan 13 2026, 3:36 PM · MW-1.46-notes (1.46.0-wmf.20; 2026-03-17), Discovery-Search (2026.03.03 - 2026.04.03), Semantic Search, CirrusSearch
dcausse added a comment to T414426: Migrate airflow dags from the Search Platform instance to Wikidata Platform.

@gmodena the list of dags seems correct to me, there'll be some parts of drop_old_data_daily.py that might be moved over as well (cleanups of rdf data from import_ttl and query analytics).

Jan 13 2026, 11:35 AM · Discovery-Search (2026.04.06 - 2026.05.01), Wikidata Platform Team, Essential-Work, Wikidata

Jan 12 2026

dcausse added a comment to T414066: Download enterprise structured content snapshots in hdfs.

The full HTML snapshots are "Updated twice-monthly (on the 2nd and 21st)" (1) so I'm curious to know whether Structured Contents follows that same cadence.

Jan 12 2026, 9:29 AM · Discovery-Search (2026.01.05 - 2026.01.30), Semantic Search

Jan 8 2026

dcausse moved T414066: Download enterprise structured content snapshots in hdfs from In Progress to Needs Review on the Discovery-Search (2026.01.05 - 2026.01.30) board.
Jan 8 2026, 6:44 PM · Discovery-Search (2026.01.05 - 2026.01.30), Semantic Search