Today
@matmarex encountered similar issues and suggested a change in how we handle wprov when arriving on the landing page (see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1267282). I haven't found a way to reproduce this, and did not look closely at what happens with the RelatedArticles extension. I'm all for more standardisation on this.
Yesterday
For the retries I would expect to see only "limit_exceeded" in the upstream request retry panel of the envoy telemetry. Regarding timeouts I'm not so sure, but you should possibly rarely see request timeouts from your app code, only 504 upstream response timeouts?
Wed, Apr 15
We sometimes use search backend query logs for this, the table is event.mediawiki_cirrussearch_request.
See P90838 for a quick example. A quick extraction shows that 0.64% (1,244,901 out of 193,639,885) of the completion requests are using strictlanguage for the first half of April 2026.
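Something like this (a rough sketch, not the actual P90838 paste; it assumes the table exposes the API parameters as a `params` map column, and field names may need adjusting against the real schema):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Count requests that set strictlanguage over the first half of April 2026.
# NOTE: a real query would also restrict to completion requests; that
# predicate is omitted here because it depends on the schema details.
row = spark.sql("""
    SELECT
      SUM(IF(params['strictlanguage'] IS NOT NULL, 1, 0)) AS strict,
      COUNT(1)                                            AS total
    FROM event.mediawiki_cirrussearch_request
    WHERE year = 2026 AND month = 4 AND day <= 15
""").head()
print(f"{row.strict / row.total:.2%} of requests use strictlanguage")
```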
Applied a quick workaround in the WikibaseCirrusSearch test case; briefly looked over WikibaseLexeme but was unsure about what to do. I think the current design of EntityHandlerTestCase is not flexible enough: it assumes that all subclasses have to run all the tests. An approach based on composition with traits might be more appropriate, but that sounds like a non-trivial refactoring.
This seems like a nice trick to prevent OpenSearch from loading its default plugins.
Tue, Apr 14
Indeed; I checked the code and the last line of the dropdown ("search for pages containing $query") is tracked as a click and was wrongly attributed to the last result of the backend logs. I excluded those and counted them as "srp_clicks" instead.
Results are roughly similar (between 1% and 3% of the interactions are due to dwim on wikipedias):
| wiki | dwim_clicks | srp_clicks | non_dwim_clicks | go_clicks | missing data | total | dwim pct (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ruwiki | 9273 | 16697 | 233680 | 103864 | 8066 | 371580 | 2.495560 |
| kawiki | 45 | 81 | 588 | 869 | 142 | 1725 | 2.608696 |
| ruwiktionary | 240 | 607 | 19334 | 15551 | 152 | 35884 | 0.668822 |
| hewiki | 516 | 1210 | 14747 | 11702 | 1518 | 29693 | 1.737783 |
Mon, Apr 13
@TJones I extracted a sample of query clicks where the clicked result comes from DWIM; if you have a moment, could you check the data?
It's at /user/dcausse/T418130-dwim-analysis/sample.csv and can be extracted locally from a stat machine using `hdfs dfs -text /user/dcausse/T418130-dwim-analysis/sample.csv/part* > dwim_sample_query_clicks`. Let me know if you spot anything weird (mainly trying to make sure my data extraction is not incorrect).
Fri, Apr 10
Looked briefly into the data; we should be able to join frontend and backend logs, and results over 1.5 days are quite promising.
@Trokhymovych please find the query result pairs in /user/dcausse/semantic_search/T419397/query_result_pairs/ (file names should be self-explanatory).
My problem with multilingual-e5-large-instruct was due to its small context size: around 300k passages were affected. I reran those, allowing the model to ignore extra tokens.
Thu, Apr 9
Flink apps generally set source idleness and don't expect canary events, but unfortunately they can't prevent these events from waking up idle sources (unless there is a way to apply a filter very early, at the source level?).
I actually don't know if it is a problem to have most partitions properly idle and a single one being woken up at regular intervals; it might just be OK, but it sounded a bit odd to have only one partition receiving this traffic.
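For illustration, a minimal PyFlink sketch of this idleness setup (the broker, topic, group id and durations are all made up):
```
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka.example.org:9092")   # illustrative broker
    .set_topics("eqiad.mediawiki.page_change.v1")      # illustrative topic
    .set_group_id("example-consumer")
    .set_starting_offsets(KafkaOffsetsInitializer.latest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

# Partitions that stop receiving events are marked idle after one minute and
# stop holding back the watermark; a canary event landing on a single
# partition still wakes that partition and resets its idleness timer, and
# there is no source-level hook here to filter it out first.
watermarks = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(10))
    .with_idleness(Duration.of_minutes(1))
)

stream = env.from_source(source, watermarks, "mediawiki-events")
```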
@Trokhymovych I think I'll have the data by the end of day.
I have the embeddings extracted and indexed for:
- pplx-embed-context-v1-0.6b: 53M passages in 36h (1024 cores with beefier workers: 64*4G+16G, 16 cores)
- embbedings-jinaai-embeddings-v5-nano: 53M passages in 19h (1024 cores, workers: 64*4G+8G, 16 cores)
Wed, Apr 8
Impact: the semantic search API returned results from a very small subset (~10%) of the articles during these periods:
- ptwiki: 2026-05-05T14:00:00 to 2026-05-08T08:41:00
- enwiki: 2026-05-05T15:30:00 to 2026-05-08T13:56:00
- frwiki: 2026-05-05T15:15:00 to 2026-05-08T16:00:00
Looking at idwiki I see a huge difference in the number of passages extracted:
- 20260329: 3,750,496
- 20260405: 724,490
Tue, Apr 7
I understand the issue as follows:
- You search for "pet" hoping to select Pet (Q39201)
- You see a wide variety of matches starting with pet but not what you expect (Pet (Q39201))
- You add a space and search for "pet " hoping for Q39201 to be preferred
@Ahoelzl yes I can confirm the run scheduled on April 2nd did run properly.
Thu, Apr 2
The trailing spaces are not trimmed by MW; rather, the completion algorithm tokenizes the query and the titles, which causes spaces to be ignored most of the time.
We can attempt to change the behavior, but I'd suggest running a quick test to look for possible undesirable side-effects.
There is also an issue with wikibase prefix search (which does not use the same algorithm), where trailing spaces are supposed to influence scoring but this got disabled by some custom scoring profiles.
We can fix the custom profiles (en, de, es, fr) to let trailing spaces influence the ranking, but I think this is different from what is requested in T417648.
In T417648 I understand that the trailing space should be used as a signal to tell the search engine to look for "exact" matches on the trimmed search string. The wikibase prefix search should already have a ranking signal on an "exact" label/alias match from the trimmed search string but I suspect the different weights are not doing what we expect in this case.
Wed, Apr 1
jina-embeddings-v5-text-small is done: 53M passages in 51.3h with 1000 cores
@Trokhymovych sure, please find those in hdfs:///user/dcausse/T419409-miracl/query_result_pairs with names pure_knn_10_top_10_per_article_miracl_$lang.json.
I collected the top-10 passages per article so that in the worst case (if the best matches all belong to the same page) you should still have them. If you flatten the passage arrays, re-ordering by score, you should be able to infer the top-10 passages per query. Please let me know if you spot issues and I can try to re-shape my query and output format to ignore the per-article breakdown if it's problematic.
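A rough sketch of that flattening; the field names ("results", "passages", "score") are guesses about the JSON layout, not the actual schema, and one JSON record per query per line is assumed:
```
import json

def top10_for_query(record: dict) -> list[dict]:
    # Collapse the per-article passage arrays into a single flat list,
    # then keep the 10 best passages by score.
    passages = [p for article in record["results"] for p in article["passages"]]
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:10]

with open("pure_knn_10_top_10_per_article_miracl_en.json") as f:
    top10 = [top10_for_query(json.loads(line)) for line in f]
```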
Tue, Mar 31
Awesome, thanks! Yes, please feel free to re-run this task.
Quick update
- multilingual-e5-large-instruct is done: 53M passages in 29h with 1000 cores
- jina-embeddings-v5-text-small is running (started yesterday and is 40% done)
Mon, Mar 30
I think one of the causes of this confusion is the current contract between ALIS/SLIS and search: for historical reasons (it was the first non-search pipeline to push data to search indices) it used search-internal formats which are very fragile (magic words like __DELETE_GROUPING__ and the tag/score internal string encoding). I filed T414099 to try to address some of this and update the ALIS/SLIS to use a newer contract based on an event schema, which is what newer pipelines (i.e. revise tone recommendations) are using.
@APizzata-WMF sorry I did not spot this earlier, but the value ["970"] is not correct: it unfortunately pushed the string 970 as the tag value instead of the score. The tag value must remain exists, with the score appended after a |; the full string should look like ["exists|970"].
Would it be possible to re-run the pipeline with such a fix applied? I'll re-ship the tags right after.
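In other words (a hypothetical helper to illustrate the encoding, not code from the pipeline):
```
# The tag value stays "exists"; the score is appended after a pipe.
def encode_weighted_tag(score: int, tag: str = "exists") -> str:
    return f"{tag}|{score}"

assert encode_weighted_tag(970) == "exists|970"   # expected value
# ["970"] pushed the bare score as the tag value, which is what broke here.
```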
Fri, Mar 27
Update:
- jina-embeddings-v5-text-nano is based on eurobert; llama.cpp got it supported just recently and spark-nlp does not support it yet. I tried to rebuild spark-nlp with it but faced a blocker, so skipping for now; an alternative might be to not use spark-nlp and fall back to hf sentence transformers (this requires some adaptation of the current pipeline; a minimal sketch follows after this list)
- pplx-embed-v1-0.6b might be a bit complicated, especially if we want to benefit from the context model, for which I need to adapt the pipeline to emit meaningful batches; I suspect I'll have some questions about how to build the batches to keep a meaningful context
- started to extract embeddings with multilingual-e5-large-instruct using a quantized (Q8) version I built with llama.cpp tooling and uploaded to /user/analytics-search/spark-nlp/models/multilingual_e5_large_instruct_Q8_0_gguf
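A minimal sketch of the sentence-transformers fallback mentioned in the first point (the model id and batch size are illustrative, and a real pipeline would stream passages out of Spark rather than a Python list):
```
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
passages = [
    "First passage text ...",
    "Second passage text ...",
]
# Query-side instruction formatting (needed for e5-instruct queries) is
# omitted; passages are encoded as-is.
embeddings = model.encode(passages, batch_size=32, normalize_embeddings=True)
print(embeddings.shape)   # (2, 1024) for this model
```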
Thu, Mar 26
@MGerlach @Trokhymovych query result pairs should be available in hdfs:///user/dcausse/T419409-miracl/query_result_pairs, files are named pure_knn_10_miracl_$lang.json.
Wed, Mar 25
Unfortunately the job failed a couple of times last week and the embeddings extraction just finished last night. Recording some timings (1000 cores with llama.cpp and qwen3-0.6B-Q8):
- de: 15.8M passages, 19h
- en: 32.8M passages, 28h
- fr: 14.6M passages, 13h
- id: 1.4M passages, 2h
- es: 10.3M passages, 11h
Mar 19 2026
Started the job at https://yarn.wikimedia.org/cluster/app/application_1773845446826_8538. The dataset is quite big and will probably take quite some time to complete (hopefully finishing early next week).
Mar 18 2026
@Trokhymovych it's working well now, no problem, thanks for the data!
@Trokhymovych thanks! I don't have access to these folders; could you update the perms, or possibly upload them to /user/dcausse/T419409-miracl if you don't want to open the perms on your user folder?
Mar 17 2026
Tentatively raising prio because it's failing CI on all CirrusSearch patches too, and I suspect on some other repos as well.
Mar 16 2026
I seem to get failures on ve-cite. I got a failure 3 times consecutively on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1252597 (quibble-vendor-mysql-php83-selenium).
Feb 24 2026
Embeddings for dewiki, enwiki, eswiki, frwiki, itwiki, nlwiki, ptwiki and idwiki (main namespace only) are extracted weekly on Sundays and available in the table discovery.wiki_content_embeddings (partitions: snapshot, wiki, model='qwen3_0.6B_Q8')
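A quick sketch of reading one snapshot (the snapshot value and its format are illustrative; non-partition columns are not described here):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
emb = spark.table("discovery.wiki_content_embeddings").where(
    "snapshot = '20260222' AND wiki = 'enwiki' AND model = 'qwen3_0.6B_Q8'"
)
emb.printSchema()
```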
Feb 20 2026
Please find query-result pairs at: hdfs:///user/dcausse/semantic_search/T417242/. The folder contains a set of outputs:
- pure_knn_10.json: the top 10 of the bare vector search
- rerank_at_3_no_context.json: same as above but with the top-3 re-ranked using the bare passage
- rerank_at_10_no_context.json: same as above but the top-10 are re-ranked
- rerank_at_3_full_context.json: top-3 re-ranked using the passage with additional context (title, parent sections and section)
- rerank_at_10_full_context.json: same as above but the top-10 are re-ranked
- rerank_at_3_with_lead.json: top-3 re-ranked using the passage with additional context (title, parent sections and section, and the lead paragraph of the page if different from the best passage)
- rerank_at_10_with_lead.json: same as above but the top-10 are re-ranked
Feb 16 2026
@TJones thanks for the feedback and suggestions!
We have a repo at https://gitlab.wikimedia.org/repos/search-platform/notebooks, but even if some notebooks are re-usable you sometimes have to duplicate and adapt them to run your analysis. The double-period issue is, I think, me misusing some of the templates created by Erik.
I tried to update the last report to include more detailed info (per-wiki graphs) but unfortunately I waited too long and some of the data is already gone thanks to our retention policy...
An A/A test is a nice idea; I haven't had the chance to do one, but there's possibly a way to do it from the data we have? Some wikis have a really low volume (from what I remember bi, gv and igk had fewer than 100 observations in a week), and for those we might want to double-check https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines and possibly start using thresholds rather than actual numbers.
Feb 9 2026
@RoyZuo thanks for raising this. Indeed the ranking strategy for commons is quite suboptimal and the fuzzy matches tend to hide interesting matches from other namespaces. This would require a change in how the suggestions coming from multiple namespaces are blended together.
While we ponder how to prioritize and solve this, you can apply a workaround by selecting either:
- Redirect mode
- Classic prefix search
from your Search User preferences.
This workaround might help in some cases but not for some others (hopefully rarer) where an "exact" category match is still a widely used prefix.
Jan 26 2026
Currently the indexed search tags have a hard-coded value named exists, but it could be something else; (if deemed useful at search time) tags related to image recommendations could be made more granular by including the kind as separate tags (a rough sketch of this encoding follows below).
At query time this would become (suggested syntax): hasrecommendation:image=istype-depicts.
- hasrecommendation:image>0.8 -hasrecommendation:image=istype-depicts could filter recommendations with a score above 0.8 and without an istype-depicts kind.
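A hypothetical sketch of that more granular encoding, reusing the existing tag|score string format (the helper and kind names are illustrative):
```
def recommendation_tags(score: int, kinds: list[str]) -> list[str]:
    # Keep the current hard-coded "exists" tag for compatibility and add
    # one extra tag per recommendation kind, all carrying the same score.
    return [f"exists|{score}"] + [f"{kind}|{score}" for kind in kinds]

print(recommendation_tags(970, ["istype-depicts"]))
# ['exists|970', 'istype-depicts|970']
```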
Jan 15 2026
Dumps are available as text files (raw ndjson) under /wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots/snapshot=$YYYYMMDD/project=${WIKI}_namespace_0 and will be updated weekly on Sundays.
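For example, reading one snapshot with Spark (the date and wiki here are placeholders for the $YYYYMMDD and ${WIKI} template variables above):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = (
    "/wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots/"
    "snapshot=20260111/project=enwiki_namespace_0"
)
docs = spark.read.json(path)   # raw ndjson: one JSON document per line
docs.printSchema()
```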
Jan 13 2026
@gmodena the list of dags seems correct to me; there'll be some parts of drop_old_data_daily.py that might be moved over as well (cleanups of rdf data from import_ttl and query analytics).