Wed, Oct 28
Yes, it definitely can support such queries, e.g. extract all API requests from mediawiki.apiaction, grouped by their action param and database, where the avg backend time is > 100ms over a 1-minute window.
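For illustration, a rough sketch of what such a query could look like with Flink SQL through the Java Table API; the table name and field names (action, database, backendTimeMs, ts) are assumptions, not the actual mediawiki.apiaction schema, and the connector setup is elided:

```
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SlowApiActions {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Registering the `apiaction` table (e.g. a Kafka connector reading
        // mediawiki.apiaction with a watermark on `ts`) is elided here.
        tEnv.executeSql(
            "SELECT action, `database`, " +
            "       AVG(backendTimeMs) AS avg_backend_ms, " +
            "       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start " +
            "FROM apiaction " +
            "GROUP BY action, `database`, TUMBLE(ts, INTERVAL '1' MINUTE) " +
            "HAVING AVG(backendTimeMs) > 100"
        ).print();
    }
}
```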
Tue, Oct 27
Mon, Oct 26
There's nothing to fix in the updater related to this ticket; the cause was a bad response from one MW machine.
@Aschroet thanks for the reply, closing as it seems you found a workaround.
Please feel free to re-open if you think there's still a fix to be made to Cirrus.
It would be great to make this process more robust to connection issues, but for that I think we should move away from the scroll API for fetching documents.
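For example, paging with search_after keeps no server-side iteration state, so a failed page fetch is easier to retry or resume. A minimal sketch with the low-level Java REST client; the host, index name and the page_id sort field are placeholders, not the actual CirrusSearch setup:

```
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SearchAfterDump {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200)).build()) {
            // Fetch the next page after the last page_id seen on the previous
            // page; the first page would simply omit "search_after".
            Request req = new Request("GET", "/my_index/_search");
            req.setJsonEntity(
                "{\n" +
                "  \"size\": 1000,\n" +
                "  \"sort\": [ { \"page_id\": \"asc\" } ],\n" +
                "  \"search_after\": [ 12345 ],\n" +
                "  \"query\": { \"match_all\": {} }\n" +
                "}");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```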
Fri, Oct 23
Thu, Oct 22
All the revisions I manually checked were created on that same day, 2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from one server, but I'm having a hard time getting blazegraph not to fail:
The revision reported in T266211 was created on 2020-06-12T06:36:58Z which also coincides with the date of problems identified in T264042.
Looking at logs, we seem to have had trouble with an MW machine at that time: T255282, which relates to the opcache issue and the RDF code in Wikibase.
Wed, Oct 21
resuming investigation, additional logs seem to suggest that the jetty http client (or the way we use it) is to blame.
- timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
- host: mw2252
[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?titles=Image%3ANorrlands_nation_Nya_entr%C3%A9n2.jpg&iiprop=url&iiurlwidth=120&iiurlheight=120&prop=imageinfo&format=json&action=query TypeError from line 395 of /srv/mediawiki/php-1.36.0-wmf.13/extensions/WikibaseCirrusSearch/src/Hooks.php: Return value of Wikibase\Search\Elastic\Hooks::getWBCSConfig() must be an instance of Wikibase\Search\Elastic\WikibaseSearchConfig, instance of Wikibase\Search\Elastic\WikibaseSearchConfig returned
The < at line 1 looks suspiciously like an HTML blob being returned while calling the recent changes API; could it be that the host hit by this request was not fully functional at the time?
Tue, Oct 20
The 246979 non-matching events are likely due to T265374
For the 7204 I could only find these two explanations:
- User clicks a search link that has a cirrusUserTesting=bucket attached to it
- User reopens their browser with several tabs open, one of which has a link with a cirrusUserTesting=bucket param attached to it
Mon, Oct 19
Wed, Oct 14
Tue, Oct 13
happened again today:
Mon, Oct 12
capturing some logs before they vanish:
@Aschroet could you append &cirrusDumpQuery to the search URL you obtain when the error occurs and paste the output on the ticket? Thanks!
Fri, Oct 9
Thu, Oct 8
For 691588 backend events matching a test bucket:
- 437764 match a SearchSatisfaction searchResultPage event
- 7204 are inconsistent with their corresponding SearchSatisfaction searchResultPage event (joining on the search token)
- 246979 have no matching SearchSatisfaction searchResultPage event; only 10 match a go action, the rest is unclear
Tue, Oct 6
I don't think there exists a formal process to change these values on wiki. My experience around these values has been:
- I disabled them on enwiki through wgCirrusSearchIgnoreOnWikiBoostTemplates because they were incompatible with the switch to BM25; at the time only the original author of CirrusSearch had set them there, so as a CirrusSearch maintainer I took the liberty of disabling them
- on wikitech these values are actively maintained by wiki admins
Mon, Oct 5
What's inside the elasticsearch index could allow some level of filtering/re-ranking based on the context provided.
Currently, statements that resolve to time values are not indexed, and they would be required here.
On the other hand, selecting (or ranking higher) entities with a proper P31 could be done if the list of P31 items can be inferred easily from the property itself (using P1629 perhaps?).
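To illustrate the P31-based selection, assuming statements are indexed as "P31=Qxx" strings in a statement_keywords-like field (the field name and value format are assumptions here), the filter itself is a simple term query that could be combined with the existing ranking:

```
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class P31Filter {
    public static void main(String[] args) {
        // Hypothetical filter keeping only entities declared as instances of Q5;
        // in practice this clause would be added to (or used to rescore) the
        // query already built for the request.
        BoolQueryBuilder filter = QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("statement_keywords", "P31=Q5"));
        System.out.println(filter);
    }
}
```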
If the problem to solve is pages being tagged with more than one of these templates, I'd suggest the simple approach you suggested (dismax), i.e. setting score_mode = max in includes/Search/Rescore/BoostTemplatesFunctionScoreBuilder.php. Template boosting is rarely used, and most of the time the boosts have probably been tuned with only one matching template in mind, so I'm sure this change would also benefit the rare other wikis using this feature.
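To illustrate the difference, a sketch assuming the elasticsearch 7.x Java client (template names and weights are made up): with score_mode = max the resulting function_score takes the best matching template's weight instead of summing them.

```
import org.elasticsearch.common.lucene.search.function.FunctionScoreQuery;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

public class TemplateBoostExample {
    public static void main(String[] args) {
        FunctionScoreQueryBuilder boost = QueryBuilders.functionScoreQuery(
            new FunctionScoreQueryBuilder.FilterFunctionBuilder[] {
                new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                    QueryBuilders.termQuery("template", "Template:Featured article"),
                    ScoreFunctionBuilders.weightFactorFunction(2.0f)),
                new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                    QueryBuilders.termQuery("template", "Template:Good article"),
                    ScoreFunctionBuilders.weightFactorFunction(1.5f))
            })
            // score_mode = max: a page tagged with both templates gets 2.0, not 3.5
            .scoreMode(FunctionScoreQuery.ScoreMode.MAX);
        System.out.println(boost);
    }
}
```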
If the problem is more about regaining control over template boosting, because the way the boosts are applied is not compatible with the ranking formula being implemented, I'd suggest setting up a dedicated rescore profile; this will give more flexibility to tune these settings. The issue is that wgCirrusSearchBoostTemplates and wgCirrusSearchIgnoreOnWikiBoostTemplates are global to all query builders.
Fri, Oct 2
Thu, Oct 1
The root cause of the problem is still unclear.
Added some more debug logs to continue investigating.
What I know so far is that only codfw was affected and that restarting blazegraph on an affected node fixed the issue. Some state is probably leaked, but it's unclear where yet; it could be in blazegraph itself or in the jetty http client (the additional logging should hopefully help to rule out one option or the other).
Sep 30 2020
For the record, here are some graphs taken over the same period (Jun 2020 to Sept 2020):
Sep 29 2020
Looking at existing solutions based on flink in this area, I don't think this is a good fit for the Table API and/or SQL unless the use case is relatively simple (i.e. does not require fine control over the state nor specific timers); see the sketch after the list below.
Most solutions I've seen describe a similar architecture:
- event ingestion (exactly what eventgate does)
- flink pipeline:
- read from existing event sources and possibly join multiple ones
- key (partitioning)
- feature extraction (time operation/aggregation/...)
- anomaly detection (applying rules/models)
- front-end (alerts/UI)
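As a point of comparison, here is a minimal, hypothetical sketch of the kind of per-key logic (keyed state plus timers) that the DataStream API allows and that is hard to express in the Table API/SQL; the event type, threshold and timeout are all made up, and a real implementation would also manage/clear its timers:

```
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical per-key anomaly detector: keeps the last observed value in
// keyed state and registers a timer to flag keys that stop emitting events.
// Assumes event-time timestamps/watermarks are assigned upstream.
public class SimpleAnomalyDetector extends KeyedProcessFunction<String, Double, String> {
    private transient ValueState<Double> lastValue;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastValue", Double.class));
    }

    @Override
    public void processElement(Double value, Context ctx, Collector<String> out) throws Exception {
        Double previous = lastValue.value();
        if (previous != null && Math.abs(value - previous) > 100.0) {
            out.collect(ctx.getCurrentKey() + ": sudden jump from " + previous + " to " + value);
        }
        lastValue.update(value);
        // fire if no new event arrives for this key within 5 minutes
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 5 * 60 * 1000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect(ctx.getCurrentKey() + ": no events seen for 5 minutes");
    }
}
```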
no objections to prefixing a letter or a couple of chars here; the query service munging process can easily be adapted to remove such prefixes when skolemizing the blank nodes.
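To make that concrete, a hypothetical sketch of the munging side (the "bn" prefix and the genid IRI base are assumptions, not the actual updater code):

```
public class BlankNodeSkolemizer {
    // Strips a hypothetical "bn" prefix from the blank node label before
    // building the skolem IRI.
    static String skolemize(String blankNodeLabel) {
        String label = blankNodeLabel.startsWith("bn")
            ? blankNodeLabel.substring(2)
            : blankNodeLabel;
        return "http://www.wikidata.org/.well-known/genid/" + label;
    }

    public static void main(String[] args) {
        System.out.println(skolemize("bn4f2a8c"));
    }
}
```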
Something seems to have happened around Jul 14th; it's particularly visible on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=elasticsearch&var-instance=All&var-datasource=thanos&from=now-90d&to=now (esp. the temperature & network graphs).
The search thread pool sizes started to rise more regularly after this date as well.
Sep 28 2020
Sep 25 2020
Sep 23 2020
I did not find anything obvious, but looking at the various classes involved in managing the writes I see excessive locking protection and object reuse, esp.:
- WriteCacheService, which keeps and reuses WriteCache instances.
- WriteCache, which wraps (protects?) access to a ByteBuffer.
- DirectBufferPool, which according to comments seems to have issues managing its references: "When DEBUG is true we do not permit a buffer which was not correctly released to be reused", which in other words means that when DEBUG is false we do permit a buffer which was not correctly released to be reused.
Sep 22 2020
Sep 21 2020
Sep 18 2020
I like the idea of using the wikidata graph (via SPARQL) to explore possibilities of pulling interesting data to feed a query expansion engine.
Using WDQS for serving real-time search traffic, on the other hand, is not an option I think (for perf reasons), but I believe it could make sense to create a dedicated dataset using the findings you've made here. This dataset could be used for two purposes:
- the initial concept lookup (replacing the need to use wikidata fulltext search)
- the expansion of the concepts by following certain paths of the graph, as you experimented with (a query sketch follows below)
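As a rough sketch of how such a dataset could be built offline against the SPARQL endpoint (the property path, depth and starting item below are just examples of "certain paths of the graph", not a recommendation):

```
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ConceptExpansion {
    public static void main(String[] args) throws Exception {
        // Hypothetical expansion query: fetch superclasses of a concept up to
        // two P279 hops away, with their English labels.
        String sparql =
            "SELECT ?expanded ?label WHERE { " +
            "  wd:Q146 wdt:P279/wdt:P279? ?expanded . " +
            "  ?expanded rdfs:label ?label . FILTER(LANG(?label) = \"en\") " +
            "}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://query.wikidata.org/sparql?format=json&query="
                + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
            .header("Accept", "application/sparql-results+json")
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```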
Sep 17 2020
https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-3h&to=now shows a restart/deploy during this spike so I guess it's not related to the train.