User Details
- User Since
- Oct 7 2014, 4:49 PM (592 w, 6 d)
- Availability
- Available
- LDAP User
- EBernhardson
- MediaWiki User
- EBernhardson (WMF) [ Global Accounts ]
Thu, Feb 12
Tue, Feb 10
I think the existing lead section should be short enough, something like:
Mon, Feb 2
Not entirely sure, but this might be reasonable to include in tech news.
Fri, Jan 30
Mon, Jan 26
Fri, Jan 23
Thu, Jan 22
This is now running again. It has completed the generation stage and is currently creating fresh glent indices in OpenSearch. I don't have old enough data to verify 100%, but at a high level it looks like this was caused by a change in meta.dt timestamps to now include millisecond precision, likely as part of T376026.
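To illustrate the kind of failure described above, here is a minimal sketch of a parser that accepts timestamps both with and without fractional seconds. It assumes meta.dt is an ISO 8601 UTC string ending in `Z`; the function name `parse_meta_dt` is hypothetical and this is not the actual glent code.

```python
from datetime import datetime, timezone


def parse_meta_dt(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp, with or without fractional seconds.

    Assumption: the value looks like 2026-01-22T10:00:00Z or
    2026-01-22T10:00:00.123Z. The real meta.dt format is not shown
    in the source comment.
    """
    # Try the more specific (fractional seconds) format first.
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized meta.dt format: {value!r}")
```

A parser that only knew the second format would reject every record once millisecond precision was introduced, which is consistent with a job that stops producing output.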
Wed, Jan 21
Tue, Jan 20
Example query from description now works as expected
Jan 15 2026
Jan 14 2026
This will probably be a couple parts:
Jan 13 2026
Documentation has been placed: https://www.mediawiki.org/wiki/Help:CirrusSearch/Debug
The latency buckets might be a little harder to upstream, but still possible. The latency buckets require special data collection implemented in our opensearch-extra plugin. Totally possible, but it means upstreaming to both the main OpenSearch project and then the exporters once the data is available. It might require a slight re-architecture of the latency collection depending on what upstream thinks of the current implementation. We did propose this to Elasticsearch back in the day and they gave advice for implementing the latency bucket collection, but weren't interested in upstreaming.
Jan 12 2026
This now appears to support Elasticsearch in REL1_45.
Turns out the problem with the content was a cache on my end; issuing a force-refresh loaded the new content. The link is now pointing to the new content.
Jan 8 2026
Working on the documentation at P86859. This is still very preliminary, but once it's all pinned down better it will end up somewhere on mediawiki.org.
Jan 7 2026
Deprecation doc has been placed; this should be complete.
Patches shipped, 20260104 dump was rerun and looks reasonable. I imported the simplewiki dump into a local instance and it loaded without issues.
I went through today to verify if everything is ready to go:
Jan 6 2026
Something can be done to improve some of the use cases around redirects, but we would need to narrow in on things that can be done. The fundamental limitation here is that in the search data model redirects are not their own pages. The only metadata about redirects stored in the search index is the namespace and title, and that is attached to the page that is redirected to. Additionally, search results are always at the granularity of the indexed documents. This means that if two redirects to the same page match, that cannot be represented in the output as two matches. It will always be a match against the document that was redirected to, with a scoring bump for matching twice.
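The limitation described above can be sketched with a toy model. The document shape (a `redirect` list holding only namespace and title) follows the description in the comment; the substring matching and the match count standing in for a score are simplifications and not how CirrusSearch actually scores results.

```python
from typing import Dict, List

# Toy index: redirects are not documents of their own. They are stored
# as namespace/title pairs on the document they point to.
DOCS: List[Dict] = [
    {
        "namespace": 0,
        "title": "Barack Obama",
        "redirect": [
            {"namespace": 0, "title": "Obama"},
            {"namespace": 0, "title": "Barack H. Obama"},
        ],
    },
    {"namespace": 0, "title": "Honolulu", "redirect": []},
]


def search_documents(docs: List[Dict], term: str) -> List[Dict]:
    """Return one result per matching document, never one per redirect."""
    needle = term.lower()
    results = []
    for doc in docs:
        # Count matches in the title and in every redirect title.
        matches = int(needle in doc["title"].lower())
        matches += sum(needle in r["title"].lower() for r in doc["redirect"])
        if matches:
            # Extra redirect matches only raise the count (a stand-in
            # for a scoring bump); they never add result rows.
            results.append({"title": doc["title"], "match_count": matches})
    return results
```

Searching this toy index for "obama" returns a single row for the target page even though two redirects matched, which mirrors why two matching redirects cannot appear as two separate results.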
Jan 5 2026
After pondering this one, we think it's best to leave the timestamps as-is. The only "correct" timestamp is the publicly facing dump, changing and re-aligning all the internal timestamps is a good bit of work that seems helpful but ultimately unnecessary.
Dec 5 2025
Dec 3 2025
Dec 2 2025
I think thiemo has got it pinned down, this is "working as intended". It would be convenient if the mediawiki api had a more intuitive method to align limits across the api call, but that's not currently a thing.
The code fix itself ended up being pretty straightforward. We might use this opportunity to re-run the most recent dump and learn a bit more about how replacing an already published dump would work.
Dec 1 2025
Nov 24 2025
A temporary 3-node cluster has been stood up in T410681. This is running opensearch 3.3.2 and is accessible from the analytics network (stat machines, hadoop, etc.).
Nov 21 2025
The initial parts of this are complete. The three-node cluster has been stood up, and it is accessible from the analytics networks. As this is a test instance we didn't set up a DNS service or TLS, but I suspect that is acceptable. The cluster can be accessed on relforge1008, relforge1009, and relforge1010 on port 9200.
Nov 20 2025
Hmm, that does seem likely. If we add &cirrusDumpQuery to one of the searches we can see it has timeout: 15s, when a regex search should indeed get a longer timeout. Not sure yet what changed to cause that.
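As a rough sketch of the check described above, the helpers below build a search URL with the `cirrusDumpQuery` parameter appended and then collect every `timeout` value from the decoded JSON. The base URL and search string in the usage note are illustrative assumptions, and the walker makes no assumption about the dump's layout; it simply finds any key named `timeout` at any depth.

```python
from typing import Any, List
from urllib.parse import urlencode


def dump_query_url(base_url: str, search: str) -> str:
    """Build a search URL that asks CirrusSearch to dump the query."""
    # cirrusDumpQuery is a flag-style parameter, so its value is left empty.
    params = {"search": search, "cirrusDumpQuery": ""}
    return f"{base_url}?{urlencode(params)}"


def find_timeouts(obj: Any) -> List[Any]:
    """Collect every value stored under a 'timeout' key, at any depth."""
    found: List[Any] = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "timeout":
                found.append(value)
            found.extend(find_timeouts(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_timeouts(item))
    return found
```

For example, `dump_query_url("https://en.wikipedia.org/w/index.php", "insource:/foo/")` produces a URL whose JSON response can be decoded and passed to `find_timeouts` to see which timeout the regex query was actually given.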
Nov 19 2025
Nov 18 2025
This shouldn't have anything to do with the time between requests. "Deep category search SPARQL query failed" mostly means one of the backend services had an error and the frontend should retry. It looks like the frontend is instead treating an error as the end of results.
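The behaviour suggested above, retrying on a backend error instead of treating it as the end of results, could look roughly like the following. Everything here is hypothetical: `BackendError`, `fetch_page`, and the offset-based continuation are stand-ins for whatever the real frontend uses.

```python
import time
from typing import Any, Callable, Iterator, List, Optional, Tuple


class BackendError(Exception):
    """Hypothetical error raised when a backend service fails."""


# fetch_page(offset) returns (results, next_offset); next_offset is None
# only when the backend says there are no more results.
FetchPage = Callable[[Optional[int]], Tuple[List[Any], Optional[int]]]


def fetch_all_results(
    fetch_page: FetchPage, max_retries: int = 3, base_delay: float = 1.0
) -> Iterator[Any]:
    """Yield all results, retrying a failed page rather than stopping."""
    offset: Optional[int] = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                results, next_offset = fetch_page(offset)
                break
            except BackendError:
                if attempt == max_retries:
                    # Surface the error; do not silently end the listing.
                    raise
                time.sleep(base_delay * (2 ** attempt))
        yield from results
        if next_offset is None:
            return  # Only an explicit "no continuation" ends the results.
        offset = next_offset
```

The key design point is that an exception and an empty continuation are handled on separate paths, so a transient failure can never be mistaken for the last page.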
Nov 17 2025
Nov 14 2025
Nov 13 2025
Test has been out for a week, ran the notebook but results are curious. In particular we are seeing a significant change in ZRR, even though the test treatment does not change the retrieval function. This suggests we could have some unbalanced effects in the bucketing. It seems worthwhile to let the test continue running for a second week and run the notebook against the second week to verify the results.
Nov 12 2025
Checking the last 2 weeks of logged Elastica HttpException, there were 5 exceptions in total, which is fairly low volume.
Nov 7 2025
In the communication we went with promising dumps through november, shutting off sometime in december:
Nov 6 2025
Curl error 52 is "Empty reply from server."
There is still a remaining problem with the query-time highlighter. The Elasticsearch highlighter doesn't support the lucene_anchored flavor, so we would need to always request lucene on Elasticsearch. We are really trying to avoid extra query-time round trips to the server to determine the version information though, and are still pondering an appropriate solution. It might be that the only reasonable way is to remove anchored trigram support from REL1_45.
Nov 4 2025
Talked this over with @dcausse. We agreed we should continue to support Elasticsearch in REL1_45. We are adding a workaround for this bug with the regex support, and will add warnings that will be displayed when running scripts that manage search indexes whenever the indexes exist on an elasticsearch instance. The intention is to only support OpenSearch in REL1_46 and beyond.
With puppet deployed should expect to see these arrive at https://dumps.wikimedia.org/other/cirrus_search_index/ after 05:00 UTC tomorrow.
First run of the updated dag completed, dumps were formatted and moved to the exports path in hdfs. Reviewing the output it all looks reasonable and as expected. Next up is to enable the public sync via the puppet patch.
