
Wed, Dec 3

EBernhardson moved T408154: AB Test doubling near match field weights on commonswiki from Needs Review to Done on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Wed, Dec 3, 3:12 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Tue, Dec 2

EBernhardson closed T410886: GeoData wikivoyage queries return some results without coordinates as Declined.

I think thiemo has it pinned down: this is "working as intended". It would be convenient if the MediaWiki API had a more intuitive way to align limits across the API call, but that's not currently a thing.

Tue, Dec 2, 8:26 PM · Discovery-Search (2025.10.20 - 2025.12.31), GeoData
EBernhardson claimed T411347: New CirrusSearch dumps are not properly formatted.

The code fix itself ended up being pretty straightforward. We might use this opportunity to re-run the most recent dump and learn a bit more about how replacing an already published dump would work.

Tue, Dec 2, 8:19 PM · Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch

Mon, Dec 1

EBernhardson moved T409218: Elastica\Exception\Connection\HttpException: Unknown error:52 from Needs Review to To be Deployed on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Mon, Dec 1, 9:50 PM · MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), Discovery-Search (2025.10.20 - 2025.12.31), MediaWiki-extensions-Translate, Wikimedia-production-error
EBernhardson moved T408154: AB Test doubling near match field weights on commonswiki from In Progress to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.

It seems worthwhile to let the test continue running for a second week

@EBernhardson it's now two weeks later. Just in case this dropped off the radar by accident.

Mon, Dec 1, 8:01 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Mon, Nov 24

EBernhardson added a comment to T409898: Set up OpenSearch instance supporting vector search.

A temporary 3-node cluster has been stood up in T410681. This is running OpenSearch 3.3.2 and is accessible from the analytics network (stat machines, Hadoop, etc.).

Mon, Nov 24, 5:09 PM · Essential-Work, Discovery-Search, Research, Data-Platform-SRE (2025.11.07 - 2025.11.28)

Fri, Nov 21

EBernhardson moved T410681: Setup opensearch 3 on relforge servers from Incoming to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.

The initial parts of this are complete. The three-node cluster has been stood up, and it is accessible from the analytics networks. As this is a test instance we didn't set up a DNS service or TLS, but I suspect that is acceptable. The cluster can be accessed on relforge1008, relforge1009, and relforge1010 on port 9200.
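
A minimal sketch of how one might check that the three nodes respond, assuming plain HTTP (the comment says TLS was not set up) and assuming the short hostnames resolve from wherever this runs. The fully qualified names are not given above, so the hosts below may need a domain suffix. `_cluster/health` is the standard OpenSearch cluster health endpoint.

```python
import json
from urllib.request import urlopen

# Hosts and port as listed in the comment; the short names are an assumption
# about DNS resolution from the analytics network.
HOSTS = ["relforge1008", "relforge1009", "relforge1010"]
PORT = 9200


def health_url(host: str) -> str:
    """Build the cluster health URL for one node (plain HTTP, no TLS)."""
    return f"http://{host}:{PORT}/_cluster/health"


def cluster_health(host: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the cluster health document from one node."""
    with urlopen(health_url(host), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for host in HOSTS:
        try:
            print(host, cluster_health(host).get("status"))
        except OSError as exc:
            # Unreachable host, refused connection, timeout, etc.
            print(host, "unreachable:", exc)
```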

Fri, Nov 21, 9:10 PM · Discovery-Search (2025.10.20 - 2025.12.31)
EBernhardson created P85447 relforge opensearch 3 knn test docker-compose.yml.
Fri, Nov 21, 9:09 PM

Thu, Nov 20

EBernhardson created P85427 validate articletopic terms.
Thu, Nov 20, 10:32 PM
EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.

Hmm, that does seem likely. If we add a &cirrusDumpQuery to one of the searches we can see it has timeout: 15s, when indeed regex should get a longer timeout. Not sure yet what changed to cause that.
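
As a rough illustration of the check described above, the sketch below appends `cirrusDumpQuery` to a search URL and then looks for any `timeout` values in the decoded JSON. The base URL, the example query, and the shape of the sample document are assumptions for illustration only; the helper walks arbitrary nesting precisely because the real dump structure is not shown here.

```python
from urllib.parse import urlencode


def dump_query_url(base: str, search: str) -> str:
    """Return a search URL with the cirrusDumpQuery flag appended."""
    params = {"search": search, "cirrusDumpQuery": ""}
    return f"{base}?{urlencode(params)}"


def find_timeouts(obj) -> list:
    """Collect every value stored under a 'timeout' key, at any depth."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "timeout":
                found.append(value)
            else:
                found.extend(find_timeouts(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_timeouts(item))
    return found


# Hypothetical base URL and query, chosen only to show the URL shape.
url = dump_query_url("https://en.wikipedia.org/w/index.php", "insource:/foo/")
print(url)

# Hypothetical dump shape; only the "15s" value comes from the comment above.
sample_dump = {"example": {"query": {"timeout": "15s"}}}
print(find_timeouts(sample_dump))
```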

Thu, Nov 20, 9:33 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.
Thu, Nov 20, 8:09 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch
EBernhardson created T410681: Setup opensearch 3 on relforge servers.
Thu, Nov 20, 6:02 PM · Discovery-Search (2025.10.20 - 2025.12.31)

Wed, Nov 19

EBernhardson added a comment to T408133: [Spike] Explore Generalizing Enrollment Authorities.

Regarding CirrusSearch: Do you already have particular open questions/obstacles related to the A/B test scenarios covered in the backend?

  • Session starting by following a link to blank Special:Search
  • Session starting by following a link to Special:Search with a query
  • Session starting at Special:Search with a 'go'

Thanks @pfischer. A couple of questions spring to mind:

  1. Is Scenario 1 treated the same as the other scenarios? In my mental model, a user searching or interacting with the search autocomplete starts a session. Is that correct?
Wed, Nov 19, 1:40 PM · Test Kitchen (Experiment Platform Sprint 16), MW-1.46-notes (1.46.0-wmf.4; 2025-11-25), OKR-Work

Tue, Nov 18

EBernhardson added a comment to T410440: Deepcat search stops loading additional results.

This shouldn't have anything to do with time between requests. "Deep category search SPARQL query failed" mostly means one of the backend services had an error and the frontend should retry. It looks like the frontend is instead treating an error as the end of results.

Tue, Nov 18, 7:32 PM · Discovery-Search, CirrusSearch, Commons

Mon, Nov 17

EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.

@EBernhardson apologies for bumping this, but do you think it might be worth me filing a follow-up feature request for this, given the similar usage/support in other regex flavours/libraries?

Mon, Nov 17, 7:31 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Fri, Nov 14

EBernhardson added a comment to T403212: Support \r, \n, \t, and \uNNNN in insource and intitle queries.

I noticed that surrogate pairs can't be used inside a character class. For example, both insource:/😂/ and insource:/[😂]/ work, but insource:/[\uD83D\uDE02]/ returns nothing. Using it in character classes can be handy when searching for a range of Unicode code points. Shall I file a new ticket for this?
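
For context, a small sketch of why the two escapes in `[\uD83D\uDE02]` denote one character rather than two: a UTF-16 high/low surrogate pair combines into a single code point. This only illustrates the encoding arithmetic and makes no claim about how the regex parser in CirrusSearch handles it.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE02)
print(hex(code_point))  # 0x1f602
print(chr(code_point))  # the same emoji used in the comment above
```

A character class that treats each `\uNNNN` escape as its own member would see two unpaired surrogates instead of the single code point U+1F602, which could explain an empty result.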

Fri, Nov 14, 2:40 PM · User-notice-archive, Discovery-Search (2025.09.05 - 2025.09.26), CirrusSearch

Thu, Nov 13

EBernhardson added a comment to T408154: AB Test doubling near match field weights on commonswiki.

The test has been out for a week. I ran the notebook, but the results are curious. In particular we are seeing a significant change in the zero results rate (ZRR), even though the test treatment does not change the retrieval function. This suggests we could have some unbalanced effects in the bucketing. It seems worthwhile to let the test continue running for a second week and run the notebook against the second week to verify the results.
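
One generic way to judge whether a ZRR difference between two buckets is larger than chance is a two-proportion z-test. The sketch below is not the notebook's method, which is not shown here, and the counts in the usage lines are made up purely to exercise the function.

```python
from math import erfc, sqrt


def two_proportion_z(zero_a: int, total_a: int, zero_b: int, total_b: int):
    """Return (z, two-sided p-value) for the difference in zero-result rates."""
    p_a = zero_a / total_a
    p_b = zero_b / total_b
    # Pooled rate under the null hypothesis that both buckets share one ZRR.
    pooled = (zero_a + zero_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: searches with zero results / total searches per bucket.
z, p = two_proportion_z(zero_a=1200, total_a=10000, zero_b=1100, total_b=10000)
print(f"z={z:.2f} p={p:.4f}")
```

Because the treatment should not change retrieval, a small p-value here would point toward bucketing imbalance rather than a real treatment effect, which matches the suspicion voiced above.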

Thu, Nov 13, 8:34 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Wed, Nov 12

EBernhardson merged T317442: Sanity-check indices before promotion into T363521: Completion suggester can promote a bad build.
Wed, Nov 12, 8:16 PM · Essential-Work, Discovery-Search (2025.08.15 - 2025.09.05), MW-1.45-notes (1.45.0-wmf.15; 2025-08-19), Sustainability (Incident Followup), CirrusSearch
EBernhardson merged task T317442: Sanity-check indices before promotion into T363521: Completion suggester can promote a bad build.
Wed, Nov 12, 8:16 PM · Discovery-Search
EBernhardson added a comment to T409218: Elastica\Exception\Connection\HttpException: Unknown error:52.

Checking the last 2 weeks of logged Elastica HttpException entries, there were 5 exceptions in total, which is fairly low volume.

Wed, Nov 12, 7:43 PM · MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), Discovery-Search (2025.10.20 - 2025.12.31), MediaWiki-extensions-Translate, Wikimedia-production-error

Fri, Nov 7

EBernhardson updated the title for P85094 The CirrusSearch index dumps are moving from The cirrus index dumps are moving! to The CirrusSearch index dumps are moving.
Fri, Nov 7, 4:37 PM
EBernhardson edited P85094 The CirrusSearch index dumps are moving.
Fri, Nov 7, 4:36 PM
EBernhardson added a comment to T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.

In the communication we went with promising dumps through November, shutting off sometime in December:

Fri, Nov 7, 4:36 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson edited P85094 The CirrusSearch index dumps are moving.
Fri, Nov 7, 4:27 PM
EBernhardson edited P85094 The CirrusSearch index dumps are moving.
Fri, Nov 7, 4:23 PM
EBernhardson edited P85094 The CirrusSearch index dumps are moving.
Fri, Nov 7, 4:18 PM
EBernhardson created P85094 The CirrusSearch index dumps are moving.
Fri, Nov 7, 2:49 PM

Thu, Nov 6

EBernhardson added a comment to T409218: Elastica\Exception\Connection\HttpException: Unknown error:52.

Curl error 52 is "Empty reply from server."

Thu, Nov 6, 7:02 PM · MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), Discovery-Search (2025.10.20 - 2025.12.31), MediaWiki-extensions-Translate, Wikimedia-production-error
EBernhardson added a comment to T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.

@EBernhardson if you're happy with these new dumps, do you still want the "old" cirrussearch dumps to run on Airflow?

Thu, Nov 6, 7:00 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson added a comment to T409070: Latest CirrusSearch is incompatible with ES7.10 and the corresponding WMF extra plugin.

There is still a remaining problem with the query-time highlighter. The Elasticsearch highlighter doesn't support the lucene_anchored flavor, so we would need to always request lucene on Elasticsearch. We are really trying to avoid extra round trips to the server to determine the version information though, so we are still pondering an appropriate solution. It might be that the only reasonable way is to remove anchored trigram support from REL1_45.

Thu, Nov 6, 5:38 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson moved T405466: Upgrade WebdriverIO to v9 in CirrusSearch from Needs Review to Done on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Thu, Nov 6, 5:28 PM · Essential-Work, MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch

Nov 4 2025

EBernhardson claimed T409070: Latest CirrusSearch is incompatible with ES7.10 and the corresponding WMF extra plugin.

Talked this over with @dcausse. We agreed we should continue to support Elasticsearch in REL1_45. We are adding a workaround for this bug with the regex support, and will add warnings that are displayed when running scripts that manage search indexes whenever the indexes exist on an Elasticsearch instance. The intention is to only support OpenSearch in REL1_46 and beyond.

Nov 4 2025, 10:04 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson moved T408678: `ForceSearchIndex` maintenance script falsly reports indexed pages when indexing jobs are skipped from Needs Review to Done on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Nov 4 2025, 10:02 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson added a comment to T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.

With puppet deployed, we should expect to see these arrive at https://dumps.wikimedia.org/other/cirrus_search_index/ after 05:00 UTC tomorrow.

Nov 4 2025, 5:22 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T408909: The cirrus config dump API may produce unexpected json output from Needs Review to To be Deployed on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Nov 4 2025, 4:46 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson added a comment to T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script.

The first run of the updated DAG completed; dumps were formatted and moved to the exports path in HDFS. Reviewing the output, it all looks reasonable and as expected. Next up is to enable the public sync via the puppet patch.

Nov 4 2025, 4:19 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from In Progress to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Nov 4 2025, 4:18 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch

Nov 3 2025

EBernhardson moved T405466: Upgrade WebdriverIO to v9 in CirrusSearch from In Progress to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.

Tests are all green; I think this is ready for review.

Nov 3 2025, 7:45 PM · Essential-Work, MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson claimed T408909: The cirrus config dump API may produce unexpected json output.
Nov 3 2025, 7:45 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson claimed T408678: `ForceSearchIndex` maintenance script falsly reports indexed pages when indexing jobs are skipped.
Nov 3 2025, 7:14 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Blocked / Waiting to In Progress on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Nov 3 2025, 4:12 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson claimed T405466: Upgrade WebdriverIO to v9 in CirrusSearch.
Nov 3 2025, 4:11 PM · Essential-Work, MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson moved T405466: Upgrade WebdriverIO to v9 in CirrusSearch from Blocked / Waiting to In Progress on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Nov 3 2025, 4:11 PM · Essential-Work, MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch

Oct 28 2025

EBernhardson updated the task description for T408431: Reindex all wikis.
Oct 28 2025, 3:21 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson moved T408399: Truncate labels.*.near_match fields from Needs Review to To be Deployed on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 28 2025, 3:12 PM · Essential-Work, Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch

Oct 27 2025

EBernhardson added a comment to T408154: AB Test doubling near match field weights on commonswiki.

Expecting to run this test after T404858

Oct 27 2025, 7:25 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson moved T408399: Truncate labels.*.near_match fields from In Progress to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 27 2025, 7:14 PM · Essential-Work, Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Blocked / Waiting to Needs Review on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 27 2025, 7:11 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Needs Review to Blocked / Waiting on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 27 2025, 7:11 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T407432: Follow-up AB test of dym language model variants from Ready for Dev to Blocked / Waiting on the Discovery-Search (2025.10.20 - 2025.12.31) board.

This needs a reindex before it can go forward.

Oct 27 2025, 7:11 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), CirrusSearch
EBernhardson updated the task description for T408431: Reindex all wikis.
Oct 27 2025, 7:02 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson claimed T408399: Truncate labels.*.near_match fields.
Oct 27 2025, 6:19 PM · Essential-Work, Discovery-Search (2025.10.20 - 2025.12.31), CirrusSearch
EBernhardson updated the task description for T407520: Deploy various plugins to fix various things.
Oct 27 2025, 6:14 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson updated the task description for T408431: Reindex all wikis.
Oct 27 2025, 5:44 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson updated the task description for T408431: Reindex all wikis.
Oct 27 2025, 5:35 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson created T408431: Reindex all wikis.
Oct 27 2025, 5:34 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson moved T406205: Investigate and cleanup broken weighted_tags in cirrus indices from Needs Review to To be Deployed on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 27 2025, 4:15 PM · Discovery-Search (2025.10.20 - 2025.12.31), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Patch-For-Review, Essential-Work, CirrusSearch
EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Needs Review to Blocked / Waiting on the Discovery-Search (2025.10.20 - 2025.12.31) board.
Oct 27 2025, 4:14 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch

Oct 24 2025

EBernhardson moved T406205: Investigate and cleanup broken weighted_tags in cirrus indices from In Progress to Needs Review on the Discovery-Search (2025.09.26 - 2025.10.17) board.

Thanks! With the reproduction it was pretty easy to work up a fix.

Oct 24 2025, 5:14 PM · Discovery-Search (2025.10.20 - 2025.12.31), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Patch-For-Review, Essential-Work, CirrusSearch
EBernhardson added a comment to T407520: Deploy various plugins to fix various things.

MR merged; CI built the new .deb, which should be ready to deploy:

Oct 24 2025, 2:00 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Oct 23 2025

EBernhardson moved T366248: Source the CirrusSearch index dumps from hadoop instead of a MW maintenance script from Blocked / Waiting to Needs Review on the Discovery-Search (2025.09.26 - 2025.10.17) board.

It looks like DE are going to move forward with the existing sync mechanisms (T405360#11277591), we should probably do the same.

Oct 23 2025, 8:20 PM · MediaWiki-Page-derived-data, Discovery-Search (2025.10.20 - 2025.12.31), Data-Platform-SRE, Essential-Work, Patch-For-Review, DPE-Mediawiki-Content, Data-Engineering, CirrusSearch
EBernhardson moved T406020: Tool for testing different weightings in search results from Needs Review to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.

Yeah, that seems reasonable; then we can mark this portion as complete.

Oct 23 2025, 8:10 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson created T408154: AB Test doubling near match field weights on commonswiki.
Oct 23 2025, 8:10 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson updated subscribers of T407520: Deploy various plugins to fix various things.

Yes, we should ship that together with the new normalizers @dcausse is implementing for T40403. They don't depend on each other; this is mostly to save the time of running the deploy multiple times.

Oct 23 2025, 7:06 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson added a comment to T405591: SUP: Enforce coherent flink version in transitive dependencies.

If I'm understanding correctly, this still needs:

  • Release wmf-maven-tool-configs (expecting 1.41)
  • Update wmf-jvm-parent-pom to use that newly released version
  • Release wmf-jvm-parent-pom (expecting 1.98)
  • That allows SUP (and other consumers of eventutilities) to update (in a separate ticket?) the parent pom and then configure the suggested enforce-flink-version enforcement.
Oct 23 2025, 6:16 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work
EBernhardson moved T404417: Test & upgrade the cirrus-streaming-updater with java 17 from To be Deployed to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.

I shipped a SUP update yesterday for T406205 which included this update to java 17. No surprises, everything rolled out cleanly on java 17.

Oct 23 2025, 5:43 PM · Discovery-Search (2025.09.26 - 2025.10.17), Essential-Work
EBernhardson removed a project from T405712: image_suggestions_weekly fails with event validation errors: Patch-For-Review.

AFAICT this is complete? It still has the tag for review, but that looks to be inaccurate. Checking the airflow instance I see that this is marked as having a successful run last week (on the 17th, well after DAG merge+deploy). Being bold and declaring this done.

Oct 23 2025, 5:40 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T406205: Investigate and cleanup broken weighted_tags in cirrus indices.

Deployed the additional logging yesterday. This morning I reviewed the WARN logs emitted by the jobmanagers in codfw since the restart, and none of the new log messages are being emitted. Based on the total volume of incorrectly indexed pages, it seems likely that whatever is causing these issues has happened in the last 12 hours, meaning the cause is probably not in the parts of SUP where logging was added.

Oct 23 2025, 4:43 PM · Discovery-Search (2025.10.20 - 2025.12.31), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Patch-For-Review, Essential-Work, CirrusSearch
EBernhardson added a comment to T408121: Collect a set of representative queries for the benchmark dataset.

The discernatron dataset: https://people.wikimedia.org/~ebernhardson/esltr/discernatron.json

Oct 23 2025, 2:38 PM · Research (FY2025-26-Research-October-December)
EBernhardson added a comment to T407603: Identify a set of relevant query types.

It's worth keeping in mind that this is a two-stage search system. Navigational queries which dominate normal search pipelines are not seen at nearly the same rate in the on-wiki fulltext search because the first stage of search, the autocomplete, sends users directly to the page and typically satisfies the navigational needs.

Oct 23 2025, 2:36 PM · Research (FY2025-26-Research-October-December)

Oct 22 2025

EBernhardson added a comment to T406020: Tool for testing different weightings in search results.

Ran relforge reports for adjusting the near_match_weight on commonswiki, along with mediawikiwiki to see if this has different effects in different places. I can't share the full reports as they contain user search queries, but the top level stats are shareable. I'm only including commonswiki here as it was the only interesting one. On mediawiki.org, increasing the near_match weights had almost no effect, suggesting this fix is specific to how commonswiki is organized.

Oct 22 2025, 6:57 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Oct 20 2025

EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

Related question regarding flow of data, based on the comment from the thread you linked.

The wikitext -> HTML conversion happens inside the MediaWiki application using the default MediaWiki parser. I'm not sure what exactly happens under the hood; I expect it's a full PHP parser that runs in-process, but I haven't paid enough attention to exactly what it does. This is indeed quite expensive: we are running hundreds of pages a second through the parser. Part of the reason I suggest we could do this is that we already parse this flow of data. Even at this high rate, it still takes a long time to get through everything. We have a loop that re-renders everything even if not edited, but it works on 16 week cycles.

An HTML dataset (T360794) is a request for data engineering with a number of use cases, and has been discussed in related phab tasks for years. The linked phab task is for an incremental HTML dataset, which is "the easier" part of an HTML dataset and will hopefully get prioritized soon. I have focused on that part to get something off the ground. The more challenging part is creating the HTML dataset of historical revisions (e.g. render with what MediaWiki version, what to do with templates, etc.).

  • Do I understand this right that the full re-render loop taking 16 weeks is for the "current" content of all pages (i.e. not historical)? That is indeed a long time.
Oct 20 2025, 4:55 PM · Epic, Discovery-Search
EBernhardson claimed T407514: Ignore MacOS .DS_Store in parent pom.
Oct 20 2025, 3:27 PM · Discovery-Search (2025.10.20 - 2025.12.31), Data-Engineering (Q2 FY25/26 October 1st - December 31th), Java-Scala-Standardization, Essential-Work
EBernhardson added a comment to T407514: Ignore MacOS .DS_Store in parent pom.

MR: https://gitlab.wikimedia.org/repos/maven/wmf-jvm-parent-pom/-/merge_requests/27

Oct 20 2025, 3:27 PM · Discovery-Search (2025.10.20 - 2025.12.31), Data-Engineering (Q2 FY25/26 October 1st - December 31th), Java-Scala-Standardization, Essential-Work
EBernhardson updated subscribers of T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Oct 20 2025, 3:20 PM · Epic, Discovery-Search
EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

related slack discussion: https://wikimedia.slack.com/archives/C0975D4NLQY/p1759903593949559

Oct 20 2025, 3:17 PM · Epic, Discovery-Search

Oct 17 2025

EBernhardson claimed T406205: Investigate and cleanup broken weighted_tags in cirrus indices.
Oct 17 2025, 4:06 PM · Discovery-Search (2025.10.20 - 2025.12.31), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Patch-For-Review, Essential-Work, CirrusSearch

Oct 16 2025

EBernhardson moved T407520: Deploy various plugins to fix various things from Incoming to Blocked / Waiting on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 16 2025, 7:40 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson edited projects for T407520: Deploy various plugins to fix various things, added: Discovery-Search (2025.09.26 - 2025.10.17); removed Discovery-Search.
Oct 16 2025, 7:39 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson updated the task description for T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Oct 16 2025, 7:31 PM · Epic, Discovery-Search
EBernhardson updated the task description for T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Oct 16 2025, 7:31 PM · Epic, Discovery-Search
EBernhardson edited projects for T407520: Deploy various plugins to fix various things, added: Data-Platform-SRE; removed Patch-For-Review.

Updated .deb is available from gitlab. This should be ready for hand-off to SRE to upload the deb to apt.wikimedia.org and restart the clusters. Once the .deb is available from apt.wikimedia.org we will also need to:

Oct 16 2025, 6:49 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson added a comment to T407520: Deploy various plugins to fix various things.

Built a new release and deployed it to Maven Central as 1.3.20-wmf4. For example: https://central.sonatype.com/artifact/org.wikimedia.search.highlighter/cirrus-highlighter-core

Oct 16 2025, 5:28 PM · Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson added a comment to T407521: Represent text in cirrus as an array of sections, rather than a flat string.

An initial, simple, proposal would be to split the text field on section boundaries, and retain the section title as a header. This would mean having duplicates of the headings (in both the headings and text fields), increasing the importance of the heading content, but probably not a big deal.
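
A minimal sketch of the proposed split, assuming for illustration that section boundaries appear as wikitext-style `== Heading ==` lines. The real text field is derived from rendered content, so the actual boundary detection would differ; the function and pattern names here are hypothetical.

```python
import re

# Matches a heading line such as "== History ==" or "=== Early ===".
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def _append(sections: list, title, body: str) -> None:
    """Append one section, with its title retained as the first line."""
    body = body.strip()
    parts = [part for part in (title, body) if part]
    if parts:
        sections.append("\n".join(parts))


def split_sections(text: str) -> list:
    """Split flat text into an array of sections, one string per section."""
    sections = []
    title = None  # the lead section has no heading
    start = 0
    for match in HEADING.finditer(text):
        _append(sections, title, text[start:match.start()])
        title = match.group(2)
        start = match.end()
    _append(sections, title, text[start:])
    return sections


doc = "Lead text.\n== History ==\nOld stuff.\n=== Early ===\nVery old.\n"
print(split_sections(doc))
# ['Lead text.', 'History\nOld stuff.', 'Early\nVery old.']
```

Keeping the title as the first line of each section is what produces the duplication with the headings field mentioned above.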

Oct 16 2025, 5:19 PM · Epic, Discovery-Search
EBernhardson created T407521: Represent text in cirrus as an array of sections, rather than a flat string.
Oct 16 2025, 5:05 PM · Epic, Discovery-Search
EBernhardson moved T405059: Adapt hasrecommendation to filter by score and possibly rank by score from To be Deployed to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 16 2025, 3:15 PM · Growth-Team, Revise-Tone-Structured-Task, Essential-Work, MW-1.45-notes (1.45.0-wmf.22; 2025-10-07), Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T406920: deepcategory search fails to show all expected results.

It might be convenient if we had some tool that could walk the category graph on wiki, then query the same thing out of blazegraph and compare them. That would be a quick way to identify where the issues might be.
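
The core of such a tool could be sketched as below. This assumes both category graphs have already been fetched into in-memory adjacency maps (parent to list of subcategories); fetching them from the wiki API and from blazegraph is left out, and all names and the toy data are hypothetical.

```python
from collections import deque


def walk_subcategories(graph: dict, root: str, max_depth: int) -> set:
    """Breadth-first walk returning every category reachable from root."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in graph.get(category, ()):
            if child not in seen:  # guards against category cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def compare(wiki: set, blazegraph: set) -> dict:
    """Report categories present on one side but not the other."""
    return {
        "missing_from_blazegraph": wiki - blazegraph,
        "missing_from_wiki": blazegraph - wiki,
    }


# Toy graphs standing in for the two sources.
wiki_graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
blaze_graph = {"A": ["B"], "B": ["D"]}

diff = compare(
    walk_subcategories(wiki_graph, "A", max_depth=5),
    walk_subcategories(blaze_graph, "A", max_depth=5),
)
print(diff)
```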

Oct 16 2025, 2:59 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, Data-Platform-SRE (2025.10.17 - 2025.11.07), CirrusSearch, Commons

Oct 15 2025

EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

From our review of the initial reports there is also a bit of surprise around the opening_text language model performing worse than the default language model. One plausible explanation is that there are word patterns seen in queries but not the opening text, only in the title fields. As such it would be interesting to run a follow-up test comparing title+redirect.title vs title+redirect.title+opening_text. For that I've created T407432.

Oct 15 2025, 8:54 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson created T407432: Follow-up AB test of dym language model variants.
Oct 15 2025, 8:52 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, MW-1.45-notes (1.45.0-wmf.24; 2025-10-21), CirrusSearch
EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

We reviewed the report in the Wednesday meeting, where a report against major spaceless languages was requested. A quick runthrough of a report restricted to zhwiki and jawiki finds that the default_1v profile is significantly better than the others, suggesting that the variant does potentially have benefits, but it may depend on the language. As such I've run a batch of reports against the top few wikis by size, and a few selected languages that have unique language features:

Oct 15 2025, 8:49 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch
EBernhardson added a comment to T404647: AB test did-you-mean query suggester variations.

I mistakenly posted this to the parent ticket, but it belongs here:

Oct 15 2025, 6:39 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Oct 14 2025

EBernhardson added a comment to T390858: Improve CirrusSearch DYM suggestions using the phrase suggester on more content.

Preliminary reports. They might become final, but they haven't been reviewed by anyone else yet:

Oct 14 2025, 9:46 PM · MW-1.45-notes (1.45.0-wmf.19; 2025-09-16), Epic, Discovery-Search, CirrusSearch
EBernhardson moved T376026: Update event-producing tools to overwrite `meta.dt` from Needs Review to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 14 2025, 5:06 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), Data-Engineering (Q1 FY25/26 July 1st - September 30th), Event-Platform
EBernhardson moved T397367: Drop unneeded empty tables from wikis from Blocked / Waiting to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 14 2025, 5:06 PM · Discovery-Search (2025.09.26 - 2025.10.17), MW-1.45-notes (1.45.0-wmf.20; 2025-09-23), DBA
EBernhardson added a comment to T405867: MLR: Mine and use negative samples.

We experimented with this, and a model is available in production (example query), but the results just aren't good enough. Calling this complete without implementing it into mjolnir.

Oct 14 2025, 5:05 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson moved T405867: MLR: Mine and use negative samples from Blocked / Waiting to Done on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 14 2025, 5:04 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Oct 10 2025

EBernhardson claimed T406020: Tool for testing different weightings in search results.
Oct 10 2025, 7:07 PM · Essential-Work, Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch

Oct 6 2025

EBernhardson moved T405867: MLR: Mine and use negative samples from In Progress to Blocked / Waiting on the Discovery-Search (2025.09.26 - 2025.10.17) board.
Oct 6 2025, 5:49 PM · Discovery-Search (2025.09.26 - 2025.10.17), CirrusSearch
EBernhardson added a comment to T40403: Sortable search results.

Copying comment from merged task:

Oct 6 2025, 4:32 PM · MW-1.46-notes (1.46.0-wmf.2; 2025-11-12), Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch

Oct 3 2025

EBernhardson claimed T404647: AB test did-you-mean query suggester variations.
Oct 3 2025, 4:47 PM · Discovery-Search (2025.10.20 - 2025.12.31), Essential-Work, CirrusSearch