I think Thiemo has it pinned down: this is "working as intended". It would be convenient if the MediaWiki API had a more intuitive way to align limits across an API call, but that's not currently a thing.
Wed, Dec 3
Tue, Dec 2
The code fix itself ended up being pretty straightforward. We might use this opportunity to re-run the most recent dump and learn a bit more about how replacing an already-published dump would work.
Mon, Dec 1
In T408154#11408732, @TheDJ wrote:It seems worthwhile to let the test continue running for a second week
@EBernhardson it's now two weeks later. Just in case this dropped off the radar by accident.
Mon, Nov 24
A temporary 3-node cluster has been stood up in T410681. This is running OpenSearch 3.3.2 and is accessible from the analytics network (stat machines, Hadoop, etc.).
Fri, Nov 21
The initial parts of this are complete. The three-node cluster has been stood up and is accessible from the analytics networks. As a test instance we didn't set up DNS or TLS, but I suspect that is acceptable. The cluster can be accessed on relforge1008, relforge1009, and relforge1010 on port 9200.
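For anyone poking at the cluster from a stat machine, a minimal connectivity check might look like this (plain HTTP because the test cluster has no TLS; `_cluster/health` is the standard OpenSearch health endpoint, the bare hostname matches the no-DNS setup above):

```python
import json
from urllib.request import urlopen

def cluster_health(host="relforge1008", port=9200):
    """Fetch OpenSearch cluster health over plain HTTP (no TLS on this
    test cluster)."""
    with urlopen(f"http://{host}:{port}/_cluster/health") as resp:
        return json.load(resp)

def is_usable(health):
    """A cluster reporting green or yellow status is usable for testing."""
    return health.get("status") in ("green", "yellow")
```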
Thu, Nov 20
Hmm, that does seem likely. If we add &cirrusDumpQuery to one of the searches we can see it has timeout: 15s, when regex queries should indeed get a longer timeout. Not sure yet what changed to cause that.
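For anyone reproducing this, the debug flag is just a URL parameter added to a normal search; a minimal sketch (the wiki and query string here are placeholders):

```python
from urllib.parse import urlencode

# cirrusDumpQuery asks CirrusSearch to return the generated query (including
# its timeout setting) as JSON instead of executing the search
params = urlencode({"search": "insource:/foo.*bar/", "cirrusDumpQuery": "1"})
url = "https://de.wikipedia.org/w/index.php?" + params
```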
In T403212#11393825, @doctaxon wrote:@EBernhardson This seems to throw timeout reports, see this dewiki discussion: https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Zeit%C3%BCberschreitung_bei_Insource-Suche
Wed, Nov 19
In T408133#11370867, @phuedx wrote:In T408133#11366050, @pfischer wrote:Regarding CirrusSearch: Do you already have particular open questions/obstacles related to the A/B test scenarios covered in the backend?
- Session starting by following a link to blank Special:Search
- Session starting by following a link to Special:Search with a query
- Session starting at Special:Search with a 'go'
Thanks @pfischer. A couple of questions spring to mind:
- Is Scenario 1 treated the same as the other scenarios? In my mental model, a user searching or interacting with the search autocomplete starts a session. Is that correct?
Tue, Nov 18
This shouldn't have anything to do with the time between requests. "Deep category search SPARQL query failed" mostly means one of the backend services had an error and the frontend should retry. It looks like the frontend is instead treating an error as the end of results.
Mon, Nov 17
In T403212#11377377, @A_smart_kitten wrote:@EBernhardson apologies for bumping this, but do you think it might be worth me filing a follow-up feature request for this, given the similar usage/support in other regex flavours/libraries?
Fri, Nov 14
In T403212#11372531, @Bewfip wrote:I noticed that surrogate pairs can't be used inside a character class. For example, both insource:/😂/ and insource:/[😂]/ work, but insource:/[\uD83D\uDE02]/ returns nothing. Using it in character classes can be handy when searching for a range of Unicode code points. Shall I file a new ticket for this?
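For reference, a surrogate pair combines into a single code point like this (a quick sketch of the UTF-16 arithmetic, which is why supporting pairs inside character classes would let them stand in for the code point):

```python
def surrogate_pair_to_codepoint(high, low):
    """Combine a UTF-16 high/low surrogate pair into a Unicode code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = surrogate_pair_to_codepoint(0xD83D, 0xDE02)
print(hex(cp), chr(cp))  # 0x1f602 😂
```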
Thu, Nov 13
Test has been out for a week; I ran the notebook, but the results are curious. In particular we are seeing a significant change in ZRR (zero-results rate), even though the test treatment does not change the retrieval function. This suggests we could have some unbalanced effects in the bucketing. It seems worthwhile to let the test continue running for a second week and run the notebook against the second week to verify the results.
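One quick sanity check on whether a ZRR movement is larger than chance would allow is a two-proportion z-test on the zero-results counts per bucket. This is a hypothetical sketch, not the notebook's actual analysis, and the counts below are invented:

```python
import math

def two_proportion_ztest(zero1, n1, zero2, n2):
    """z-statistic for the difference between two buckets' zero-results
    rates; |z| above ~1.96 suggests the difference is unlikely by chance."""
    p1, p2 = zero1 / n1, zero2 / n2
    pooled = (zero1 + zero2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# invented counts: 200/1000 vs 150/1000 zero-result searches
z = two_proportion_ztest(200, 1000, 150, 1000)
```

The same test on total session counts per bucket would be one way to spot the suspected bucketing imbalance directly.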
Wed, Nov 12
Checking the last 2 weeks of logged Elastica HttpException entries, there were 5 total exceptions; fairly low volume.
Fri, Nov 7
In the communication we went with promising dumps through November, shutting off sometime in December:
Thu, Nov 6
Curl error 52 is "Empty reply from server."
In T366248#11343661, @brouberol wrote:@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?
There is still a remaining problem with the query-time highlighter. The Elasticsearch highlighter doesn't support the lucene_anchored flavor, so we would need to always request lucene on Elasticsearch. We are really trying to avoid extra round trips to the server to determine version information though, so I'm still pondering an appropriate solution. It may be that the only reasonable way forward is to remove anchored trigram support from REL1_45.
Nov 4 2025
Talked this over with @dcausse. We agreed we should continue to support Elasticsearch in REL1_45. We are adding a workaround for this bug in the regex support, and will add warnings, displayed when running scripts that manage search indexes, whenever the indexes live on an Elasticsearch instance. The intention is to only support OpenSearch in REL1_46 and beyond.
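As a sketch of how such a warning might detect the backend (an assumption about the approach, not the actual CirrusSearch code): OpenSearch reports `version.distribution: opensearch` in its `GET /` response, while Elasticsearch omits that field.

```python
def is_opensearch(root_info):
    """Return True when the server's `GET /` response identifies OpenSearch.

    OpenSearch sets version.distribution to "opensearch"; Elasticsearch
    has no distribution field, so the lookup falls through to False."""
    return root_info.get("version", {}).get("distribution") == "opensearch"
```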
With puppet deployed, we should expect to see these arrive at https://dumps.wikimedia.org/other/cirrus_search_index/ after 05:00 UTC tomorrow.
The first run of the updated DAG completed; dumps were formatted and moved to the exports path in HDFS. Reviewing the output, it all looks reasonable and as expected. Next up is enabling the public sync via the puppet patch.
Nov 3 2025
Tests are all green; I think this is ready for review.
Oct 28 2025
Oct 27 2025
Expecting to run this test after T404858
This needs a reindex before it can go forward.
Oct 24 2025
Thanks! With the reproduction it was pretty easy to work up a fix.
MR merged; CI built the new .deb and it should be ready to deploy:
Oct 23 2025
It looks like DE are going to move forward with the existing sync mechanisms (T405360#11277591), we should probably do the same.
Yea, that seems reasonable; then we can mark this portion as complete.
If I'm understanding correctly, this still needs:
- Release wmf-maven-tool-configs (expecting 1.41)
- Update wmf-jvm-parent-pom to use that newly released version
- Release wmf-jvm-parent-pom (expecting 1.98)
- That allows SUP (and other consumers of eventutilities) to update the parent pom (in a separate ticket?) and then configure the suggested enforce-flink-version enforcement.
I shipped a SUP update yesterday for T406205 which included this update to java 17. No surprises, everything rolled out cleanly on java 17.
AFAICT this is complete? It still has the tag for review, but that looks to be inaccurate. Checking the Airflow instance, I see that this is marked as having a successful run last week (on the 17th, well after the DAG merge and deploy). Being bold and declaring this done.
Deployed the additional logging yesterday. This morning I reviewed the WARN logs emitted by the jobmanagers in codfw since the restart, and none of the new log messages are being emitted. Based on the total volume of incorrectly indexed pages, it seems likely that whatever is causing these issues happened in the last 12 hours, meaning it's probably not in the parts of SUP where logging was added.
The discernatron dataset: https://people.wikimedia.org/~ebernhardson/esltr/discernatron.json
It's worth keeping in mind that this is a two-stage search system. Navigational queries which dominate normal search pipelines are not seen at nearly the same rate in the on-wiki fulltext search because the first stage of search, the autocomplete, sends users directly to the page and typically satisfies the navigational needs.
Oct 22 2025
Ran relforge reports for adjusting the near_match_weight on commonswiki, along with mediawikiwiki, to see if this has different effects in different places. I can't share the full reports as they contain user search queries, but the top-level stats are shareable. I'm only including commonswiki here as it was the only interesting one; on mediawiki.org increasing the near_match weights had almost no effect, suggesting this fix is specific to how commonswiki is organized.
Oct 20 2025
In T407521#11290416, @fkaelin wrote:Related question regarding flow of data, based on the comment from the thread you linked.
The wikitext -> HTML conversion happens inside the MediaWiki application using the default MediaWiki parser. I'm not sure exactly what happens under the hood; I expect it's a full PHP parser that runs in-process, but I haven't paid enough attention to exactly what it does. This is indeed quite expensive; we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is that we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16-week cycles.
An HTML dataset (T360794) is a request for Data Engineering with a number of use cases, and has been discussed in related phab tasks for years. The linked phab task is for an incremental HTML dataset, which is "the easier" part of an HTML dataset and will hopefully get prioritized soon. I have focused on that part to get something off the ground. The more challenging part is creating the HTML dataset of historical revisions (e.g. which MediaWiki version to render with, what to do with templates, etc.).
- Do I understand this right that the full re-render loop taking 16 weeks is for the "current" content of all pages (i.e. not historical)? That is indeed a long time.
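For a rough sense of scale (300 pages/second is an assumption here; the comment above only says "hundreds"):

```python
pages_per_second = 300               # assumed; "hundreds of pages a second"
cycle_seconds = 16 * 7 * 24 * 3600   # the 16-week re-render cycle
renders_per_cycle = pages_per_second * cycle_seconds
print(f"{renders_per_cycle:,} renders per cycle")  # about 2.9 billion
```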
related slack discussion: https://wikimedia.slack.com/archives/C0975D4NLQY/p1759903593949559
Oct 17 2025
Oct 16 2025
Updated .deb is available from GitLab. This should be ready for hand-off to SRE to upload the .deb to apt.wikimedia.org and restart the clusters. Once the .deb is available from apt.wikimedia.org we will also need to:
Built a new release and deployed it to Maven Central as 1.3.20-wmf4. For example: https://central.sonatype.com/artifact/org.wikimedia.search.highlighter/cirrus-highlighter-core
An initial, simple proposal would be to split the text field on section boundaries and retain the section title as a header. This would mean duplicating the headings (in both the headings and text fields), increasing the importance of the heading content, but that's probably not a big deal.
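A minimal sketch of such a split, assuming `== Heading ==`-style wikitext markers (real heading parsing has more edge cases: templates, mismatched markers, HTML headings):

```python
import re

# wikitext section headings look like "== Title ==" on their own line
HEADING = re.compile(r"(?m)^(={2,})\s*(.+?)\s*\1\s*$")

def split_sections(wikitext):
    """Split wikitext into (heading, body) pairs; the lead section gets a
    heading of None. The heading text is kept alongside its body, matching
    the proposal to retain section titles with the split text."""
    sections, title, last_end = [], None, 0
    for m in HEADING.finditer(wikitext):
        sections.append((title, wikitext[last_end:m.start()].strip()))
        title, last_end = m.group(2), m.end()
    sections.append((title, wikitext[last_end:].strip()))
    return sections
```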
It might be convenient if we had some tool that could walk the category graph on-wiki, then query the same thing out of Blazegraph and compare them: some quick way to identify where the issues might be.
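A sketch of the on-wiki half of such a tool, using the MediaWiki API's `list=categorymembers` module with a pluggable fetcher, so the same walk could be backed by a Blazegraph query function and the two result sets diffed (the target wiki and depth limit here are arbitrary):

```python
import json
from collections import deque
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://commons.wikimedia.org/w/api.php"  # target wiki is an assumption

def subcats_via_api(cat):
    """Direct subcategories of a category, via list=categorymembers."""
    params = urlencode({
        "action": "query", "list": "categorymembers",
        "cmtitle": "Category:" + cat, "cmtype": "subcat",
        "cmlimit": "max", "format": "json", "formatversion": "2",
    })
    with urlopen(API + "?" + params) as resp:
        data = json.load(resp)
    return [m["title"].split(":", 1)[1] for m in data["query"]["categorymembers"]]

def walk(root, children, max_depth=5):
    """BFS the category graph from root, returning all reachable categories.

    `children` is pluggable so the walk can run against the API and against
    Blazegraph; cycles are handled by the seen set."""
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in children(cat):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen
```

Diffing `walk(root, subcats_via_api)` against the same walk backed by the category data in Blazegraph would point directly at where the two graphs disagree.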
Oct 15 2025
From our review of the initial reports there is also a bit of surprise around the opening_text language model performing worse than the default language model. One plausible explanation is that there are word patterns seen in queries that don't appear in the opening text, only in the title fields. As such it would be interesting to run a follow-up test comparing title+redirect.title vs title+redirect.title+opening_text. For that I've created T407432.
We reviewed the report in the Wednesday meeting, where a report against major spaceless languages was requested. A quick run-through of a report restricted to zhwiki and jawiki finds that the default_1v profile is significantly better than the others, suggesting the variant does potentially have benefits, but it may depend on the language. As such I've run a batch of reports against the top few wikis by size, plus a few selected languages that have unique language features:
I mistakenly posted this to the parent ticket, but it belongs here:
Oct 14 2025
Preliminary reports. They might become final, but they haven't been reviewed by anyone else yet:
We experimented with this, and a model is available in production (example query), but the results just aren't good enough. Calling this complete without implementing it in mjolnir.
Oct 10 2025
Oct 6 2025
Copying comment from merged task: