    • Task
    ``` 23:56:58 1) Wikibase\Lexeme\Search\Elastic\Tests\LexemeFullTextQueryBuilderTest::testSearchElastic with data set "work" ('duck', '/workspace/src/extensions/Wik...pected') 23:56:58 Failed asserting that two strings are equal. 23:56:58 --- Expected 23:56:58 +++ Actual 23:56:58 @@ @@ 23:56:58 "minimum_should_match": 1,\n 23:56:58 "filter": [\n 23:56:58 {\n 23:56:58 - "terms": {\n 23:56:58 - "namespace": [\n 23:56:58 - 146\n 23:56:58 + "bool": {\n 23:56:58 + "must": [\n 23:56:58 + {\n 23:56:58 + "terms": {\n 23:56:58 + "namespace": [\n 23:56:58 + 146\n 23:56:58 + ]\n 23:56:58 + }\n 23:56:58 + }\n 23:56:58 + ],\n 23:56:58 + "must_not": [\n 23:56:58 + {\n 23:56:58 + "term": {\n 23:56:58 + "page_type": "redirect"\n 23:56:58 + }\n 23:56:58 + }\n 23:56:58 ]\n 23:56:58 }\n 23:56:58 }\n 23:56:58 23:56:58 /workspace/src/tests/phpunit/MediaWikiTestCaseTrait.php:227 23:56:58 /workspace/src/extensions/WikibaseLexemeCirrusSearch/tests/phpunit/LexemeFullTextQueryBuilderTest.php:70 23:56:58 23:56:58 2) Wikibase\Lexeme\Search\Elastic\Tests\LexemeFullTextQueryBuilderTest::testSearchElastic with data set "id" (' L2-F1 ', '/workspace/src/extensions/Wik...pected') 23:56:58 Failed asserting that two strings are equal. 23:56:58 --- Expected 23:56:58 +++ Actual 23:56:58 @@ @@ 23:56:58 "minimum_should_match": 1,\n 23:56:58 "filter": [\n 23:56:58 {\n 23:56:58 - "terms": {\n 23:56:58 - "namespace": [\n 23:56:58 - 146\n 23:56:58 + "bool": {\n 23:56:58 + "must": [\n 23:56:58 + {\n 23:56:58 + "terms": {\n 23:56:58 + "namespace": [\n 23:56:58 + 146\n 23:56:58 + ]\n 23:56:58 + }\n 23:56:58 + }\n 23:56:58 + ],\n 23:56:58 + "must_not": [\n 23:56:58 + {\n 23:56:58 + "term": {\n 23:56:58 + "page_type": "redirect"\n 23:56:58 + }\n 23:56:58 + }\n 23:56:58 ]\n 23:56:58 }\n 23:56:58 }\n 23:56:58 23:56:58 /workspace/src/tests/phpunit/MediaWikiTestCaseTrait.php:227 23:56:58 /workspace/src/extensions/WikibaseLexemeCirrusSearch/tests/phpunit/LexemeFullTextQueryBuilderTest.php:70 ```
    • Task
    On `2026-11-06T00:40:00` the cirrus producer job failed with: ``` java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: FAILED at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$7(ApplicationDispatcherBootstrap.java:403) [...] at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) Caused by: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: FAILED at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71) ... 51 more Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60) ... 51 more Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=120000,backoffTimeMS=60000,maxFailuresPerInterval=10) at org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219) [...] at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) [...] at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:233) at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:245) ... 5 more Caused by: org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'Source: mediawiki.page_change.v1-source -> mediawiki.page_change.v1-source-convert-filter_by_wiki-validate -> mediawiki.page_change.v1-source-assign-ingestion-time' (operator 91c710d8d475a802c60b103f3b504f2d). 
at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:651) [...] at java.base/java.lang.Thread.run(Thread.java:840) Caused by: org.apache.kafka.common.KafkaException: Failed to create new KafkaAdminClient at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:546) [...] at org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$runInEventLoop$10(SourceCoordinator.java:530) ... 7 more Caused by: org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:89) at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:48) at org.apache.kafka.clients.admin.KafkaAdminClient.createInternal(KafkaAdminClient.java:498) ... 14 more ``` The root cause seems to be that flink could not resolve any of the hosts listed in `bootstrap.servers`, which could have been caused by some maintenance on the kafka cluster. Having the job fail because of transient issues on the kafka cluster is expected, but here the job gave up, which is not what we want. My understanding is that the job has two layers of restart strategy: * flink itself, with its restart strategy currently set to: ** restart-strategy.type: failure-rate ** restart-strategy.failure-rate.delay: "1" ** restart-strategy.failure-rate.failure-rate-interval: "2" ** restart-strategy.failure-rate.max-failures-per-interval: "10" * the operator may attempt to restart failed deployments as well AC: * investigate restart strategies and understand why the flink deployment entered a failed state
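To reason about when a failure-rate strategy gives up, a simplified model can help. The sketch below is not Flink's implementation; it only assumes that restarts are suppressed once `maxFailuresPerInterval` failures fall within `failuresIntervalMS`, using the values printed in the stack trace (10 failures, 120000 ms). The class and function names are hypothetical.

```python
from collections import deque


class FailureRateModel:
    """Simplified sliding-window model of a failure-rate restart strategy.

    This is an illustration only, not Flink's code.
    """

    def __init__(self, max_failures_per_interval: int, interval_ms: int):
        self.max_failures = max_failures_per_interval
        self.interval_ms = interval_ms
        # Keep only the most recent `max_failures` failure timestamps.
        self.failures = deque(maxlen=max_failures_per_interval)

    def notify_failure(self, now_ms: int) -> None:
        self.failures.append(now_ms)

    def can_restart(self, now_ms: int) -> bool:
        if len(self.failures) < self.max_failures:
            return True
        # Window is full: restart only if the oldest failure is outside it.
        return now_ms - self.failures[0] > self.interval_ms


def survives(failure_times_ms, max_failures=10, interval_ms=120_000) -> bool:
    """Return False as soon as the model would suppress recovery."""
    model = FailureRateModel(max_failures, interval_ms)
    for t in failure_times_ms:
        model.notify_failure(t)
        if not model.can_restart(t):
            return False
    return True
```

Under this model, failures spaced one per minute never exhaust the budget, while ten failures reported within a few seconds do. One thing worth checking during the investigation is therefore whether the failures were reported in a burst.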
    • Task
    While doing {T428104}... ``` User CirrusSearch Streaming Updater does not have two-factor authentication enabled, so notification has been sent! ``` ```lang=php if ( $this->getUser()->getName() === $engine->getConfig()->get( "CirrusSearchStreamingUpdaterUsername" ) ) { // Bypass poolcounter protection for the internal cirrus user $this->doExecute( $engine ); } ```
    • Task
    Currently the cirrus integration tests use `BeforeOnce` hooks and a non-trivial tag tracker to push data to mediawiki. The reasoning is that the tests should be isolated with a specific set of documents, but in reality we have a test corpus and the dependency between the required docs and the tests themselves is likely lost. Isolation might have been a concern at some point, but since we do not reset the MW state on every test, isolation is de facto lost. Given that populating the data takes most of the test time, we might consider revisiting how the data is populated and introducing a specific phase during the environment building to push the data quicker and remove the tag dependency tracking system. For comparison, a full run using an empty state takes 13m; it goes down to 4m (2m15 with concurrency=2) when the corpus is pre-populated. It seems likely that we could find a way to import the data in less than 9min. The open question is how this data preparation process should be handled: * are there existing dump formats we could re-use (without impacting too much the usability of adding/removing docs)? * write a custom maint script and own the test data corpus format? AC: * cirrus integration tests no longer use tags to populate data * concurrency can be set to 2 for all runs and not only the second opensearch version tested * test runtime is drastically reduced, target should be ** 2m30 for setting up the docker env ** 3m(?) for importing the data ** 2m30 for running the test ** 1m30 for switching env to another opensearch version ** 2m30 for a second test run ** total ~12m
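The time budget above can be checked with a few lines of arithmetic. This is only a sketch: the phase labels are made up for illustration, and the durations are the targets listed in the task.

```python
def to_seconds(label: str) -> int:
    """Convert a duration like '2m30' or '3m' to seconds."""
    minutes, _, seconds = label.partition("m")
    return int(minutes) * 60 + int(seconds or 0)


# Phase names are illustrative; durations are the targets from the task.
TARGETS = {
    "docker_env_setup": "2m30",
    "data_import": "3m",
    "first_test_run": "2m30",
    "switch_opensearch_version": "1m30",
    "second_test_run": "2m30",
}

total_seconds = sum(to_seconds(v) for v in TARGETS.values())

# Time currently spent populating data: empty-state run minus pre-populated run.
populate_seconds = to_seconds("13m") - to_seconds("4m")
```

The targets add up to 12 minutes, and the 9 minutes currently attributable to data population is the amount the 3-minute import target has to beat.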
    • Task
    **Steps to replicate the issue** (include links if applicable): * Click on https://en.wikipedia.org/w/index.php?search=%22Student+newspapers%22+%22reliable%22+-prefix%3Aarticles&title=Special%3ASearch&profile=advanced&fulltext=1&ns4=1 * Notice that it says it's searching only in the Wikipedia: (Project:) namespace and **not** in the mainspace. **What happens?**: * The results are in multiple namespaces. * The results do not respect the `-prefix:` command within the specified namespace. **What should have happened instead?**: * The results should not include the mainspace. * The results should respect the `-prefix:` command within the specified namespace.
    • Task
    We are currently upgrading to opensearch 2. It would be useful to have the cirrus integration tests check both opensearch 1 and opensearch 2. The tests are mainly orchestrated by barry, a python script; it sounds possible to introduce a loop there that runs the tests against two different opensearch images. AC: * barry is able to loop and run the test suite against different opensearch images. * new patches in cirrus are tested against opensearch 1.3 and 2.19 ** pre-requisites: cirrus should accept opensearch 2.19 (or 2.x?)
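The loop could look roughly like the sketch below. It does not reflect barry's actual code: `run_against_images`, the `run_suite` callable and the image tags are all hypothetical, and the real script would need to handle environment teardown between iterations.

```python
from typing import Callable, Dict, Iterable


def run_against_images(
    images: Iterable[str],
    run_suite: Callable[[str], bool],
) -> Dict[str, bool]:
    """Run the suite once per image and record pass/fail per image.

    `run_suite` is a hypothetical callable that switches the environment to
    the given image, runs the tests and returns True on success.
    """
    results: Dict[str, bool] = {}
    for image in images:
        results[image] = run_suite(image)
    return results


def all_passed(results: Dict[str, bool]) -> bool:
    """The overall build passes only if every image passed."""
    return all(results.values())
```

Running every image even after a failure, rather than stopping at the first one, makes it easier to tell whether a regression is specific to one opensearch version.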
    • Task
    Flink 2 is available and we even have images for it. DPE is starting to use it for T425624. I think we should be ready to upgrade as well. AC: * the cirrus updater job runs on flink2 and java25
    • Task
    Once the migration to 2.19 has been completed, continue with the second upgrade step to 3.5+
    • Task
    # Search Quality Monitoring Metrics Dashboard As a member of the Search team, I want to quickly detect unexpected changes in search result quality, so that we can identify regressions early and investigate potential retrieval or ranking issues before they significantly impact users. ## Metrics The following metrics should be available and explorable per wiki: - [ ] Zero results rate - Percentage of search sessions/queries returning no results - Broken down by wiki - [ ] Click position - Distribution and/or average position of clicked results - Lower click positions indicate users find relevant results near the top - Broken down by wiki ## Acceptance Criteria - Both metrics are available in Grafana and/or Superset - Definitions for both metrics are documented - Alerting creates Phabricator tasks ## Notes / Considerations - Clarify whether metrics are computed from: - raw search requests, - search sessions, - or unique users - Define expected handling for: - bots (?), - desktop/mobile - autocomplete/search-as-you-type, - and anonymous vs logged-in traffic - For click position: - consider median and percentile views in addition to averages - consider separating desktop/mobile if behavior differs significantly - For zero results rate: - consider alert thresholds for significant regressions
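As a starting point for the metric definitions, both metrics can be sketched over a list of per-query events. The event shape (`wiki`, `hits`, `click_position`) is an assumption made for illustration and does not correspond to an existing schema; whether the unit is a request, a session or a user is left open, as in the notes above.

```python
from collections import defaultdict
from statistics import median


def zero_results_rate(events):
    """Fraction of events with no hits, per wiki."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for event in events:
        totals[event["wiki"]] += 1
        if event["hits"] == 0:
            zeros[event["wiki"]] += 1
    return {wiki: zeros[wiki] / totals[wiki] for wiki in totals}


def median_click_position(events):
    """Median position of clicked results, per wiki (events without clicks are skipped)."""
    positions = defaultdict(list)
    for event in events:
        if event.get("click_position") is not None:
            positions[event["wiki"]].append(event["click_position"])
    return {wiki: median(values) for wiki, values in positions.items()}
```

The median is used here instead of the average because a few clicks deep in the result list can skew an average considerably.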
    • Task
    There's a [[https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#c-Certes-20260426202400-Cirrus_search_query|discussion at English Wikipedia]] about why `intitle:"Clover" intitle:/Clover.*West Virginia/` produces no results, while, for example `intitle:"Clover" intitle:/Clover.*West /` and `intitle:"Clover" intitle:/Clover.*\sWest Virginia/` currently find three pages. This looks like a bug, but it's hard to tell where it's coming from.
    • Task
    Per [[ https://wikimedia.slack.com/archives/C0975D4NLQY/p1776179549641299 | this Slack announcement ]] , the Semantic Search experiment has concluded as of 14 April 2026. That means the OpenSearch clusters backing the experiment, which are currently taking [[ https://docs.google.com/spreadsheets/d/17Ipli-b1Mlrqx22cihgsiJOKUFDQSREFYQKVQPC5zPo/edit?gid=1128647813#gid=1128647813 | ~22% of total capacity in the dse-k8s clusters ]] , can be undeployed and the space reclaimed. However, before we do this, we need to alert the stakeholders and make sure they have collected all the data they need, as the OpenSearch data will be **irrevocably destroyed** when we undeploy. Creating this ticket to: [x] [[ https://wikimedia.slack.com/archives/C0975D4NLQY/p1776803373683989 | Inform stakeholders of tentative undeployment date ]] (21 May 2026 @ 2100 UTC) [] Change timetable per feedback if necessary [] Undeploy `opensearch-semantic-search` and `opensearch-semantic-search-test` in both dse-k8s clusters.
    • Task
    In [[ https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch | Wikibase CirrusSearch ]] it would be amazing to allow a condition like "Instance of OR subclasses of". Example use case: find **cities** 🏰 Example SPARQL to find cities: ```lang=sparql # Instance of or subclasses of Q515 (city) ?city wdt:P31/wdt:P279* wd:Q515 . ``` Yes, there are many types of cities :3 ## Condition Popularity The condition is «recommended» by the Wikidata Query Builder: > {F75858643} > "Include related values in the search (recommended)" > https://query.wikidata.org/querybuilder/ The condition is also mentioned ~70 times in the public examples in the Wikidata SPARQL Query Service https://query.wikidata.org/ For example it's the 2nd [[ https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Horses_(showing_some_info_about_them) | SPARQL example about finding horses ]]. ...and is the 4th [[ https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Map_of_hospitals | SPARQL example about finding hospitals ]]. ...and is the [[ http://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Recent_events | SPARQL example about finding recent events ]]. ...and is the [[ https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Properties_connecting_items_of_type_zoo_(Q43501)_with_items_of_type_Animalia_(Q729) | SPARQL example about properties connecting items of type zoo ... ]]. etc. etc. Moreover, I noticed that [[ https://www.wikidata.org/wiki/Wikidata_talk:Wikibase_GraphQL#c-Ifrahkhanyaree_WMDE-20260310153900-Niryhpr-20260310030000 | Wikibase GraphQL is based on CirrusSearch]], so having this condition sounds very strategic for WMDE GraphQL too. → The condition "and subclasses" is basic, useful and popular. ## Workaround At the moment, to find cities in CirrusSearch we need to know that there exist ~190 subclasses of "city", plus the city itself. 
https://w.wiki/KzJ3 So I guess the workaround is to build a search condition like this: ``` haswbstatement:P31=Q515 #city OR haswbstatement:P31=Q12102963 #city specifically designated in the state plan OR haswbstatement:P31=Q12131624 #Ukraine city OR haswbstatement:P31=Q15092400 #independent city OR haswbstatement:P31=Q15661340 #ancient city OR haswbstatement:P31=Q21550633 #garden city OR haswbstatement:P31=Q51929311 #largest city OR haswbstatement:P31=Q108454687 #half million city OR haswbstatement:P31=Q7375052 #royal city (this for ~196 times lol) ``` But this workaround is not really feasible, since it requires the user to have unreasonably deep knowledge of the graph. Plus, the user will quickly reach webserver `GET` limits or hit other limits, since I don't expect that we can really submit queries with ~200 OR conditions. So there is no feasible workaround in CirrusSearch. ## Proposed Syntax Proposed CirrusSearch syntax, something like this, to find "instance of all cities": ``` haswbstatement_or_subclasses:P31=Q515 ``` Another example, to find "depicts all cats": ``` haswbstatement_or_subclasses:P180=Q146 ``` Etc. - so the search argument should accept both: - a Wikidata P-property - a Wikidata Q-item ## Code Exploration ... none ... ----
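To get a feel for how large the workaround query becomes, it can be generated programmatically. The helper below is a hypothetical sketch: it only joins `haswbstatement:` terms with `OR` as in the example above, and the list of subclass item IDs would have to come from a separate SPARQL query.

```python
from typing import List


def build_or_query(prop: str, items: List[str]) -> str:
    """Join one haswbstatement term per item with OR (illustrative helper)."""
    return " OR ".join(f"haswbstatement:{prop}={qid}" for qid in items)


# A few of the item IDs quoted in the task; the full list has ~196 entries.
sample = build_or_query("P31", ["Q515", "Q15092400", "Q15661340"])

# Placeholder IDs, only used to estimate the length of a 196-term query.
estimated_length = len(build_or_query("P31", [f"Q{i}" for i in range(1, 197)]))
```

Even with the shortest possible item IDs the generated string is several thousand characters long, which supports the concern about URL length limits.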
    • Task
    On the path to opensearch 3.x we first need to upgrade to 2.19.5. Review the set of plugins we install from gitlab repos/search-platform/opensearch-plugins-deb and update them to be available in 2.19.5.
    • Task
    ### Compatibility assessment * 1.3 → 2.19.x → 3.x path * breaking-change review * plugin/client/index compatibility matrix #### Other Stakeholders * **Extension:Translate** uses [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Translate/+/e890d53825c7bcfde5eb6d45c292e13f720fd46d/src/TtmServer/ElasticSearchTtmServer.php#80 | ElasticSearchTtmServer ]], which claims `ElasticTTM is currently not compatible with elasticsearch 2.x/5.x it needs FuzzyLikeThis ported via the wmf extra plugin`; it uses the chi cluster, see [[ https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/3ff486473879abe7eafcf61e60f09784ee5071eb/wmf-config/CommonSettings.php#3354 | Config ]]
    • Task
    When using the [[ https://fr.wikipedia.org/w/index.php?search=what+is+the+capital+of+France%3F&title=Sp%C3%A9cial%3ARecherche&profile=advanced&fulltext=1&ns0=1&cirrusSemanticSearch | Special:Search ]] UI semantic search queries fail. It looks to be due to interwiki interactions: ``` from /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/Profile/SearchProfileService.php(300) #0 /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/Search/SearchContext.php(706): CirrusSearch\Profile\SearchProfileService->getProfileName(string, string) #1 /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/Searcher.php(323): CirrusSearch\Search\SearchContext->getFulltextQueryBuilderProfile() #2 /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/InterwikiSearcher.php(87): CirrusSearch\Searcher->buildFullTextSearch(string) #3 /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/CirrusSearch.php(326): CirrusSearch\InterwikiSearcher->getInterwikiResults(CirrusSearch\Search\SearchQuery) #4 /srv/mediawiki/php-1.46.0-wmf.18/extensions/CirrusSearch/includes/CirrusSearch.php(279): CirrusSearch\CirrusSearch->searchTextReal(CirrusSearch\Search\SearchQuery) #5 /srv/mediawiki/php-1.46.0-wmf.18/includes/Search/SearchEngine.php(95): CirrusSearch\CirrusSearch->doSearchText(string) #6 /srv/mediawiki/php-1.46.0-wmf.18/includes/Search/SearchEngine.php(187): MediaWiki\Search\SearchEngine->MediaWiki\Search\{closure}() #7 /srv/mediawiki/php-1.46.0-wmf.18/includes/Search/SearchEngine.php(94): MediaWiki\Search\SearchEngine->maybePaginate(Closure) #8 /srv/mediawiki/php-1.46.0-wmf.18/includes/Specials/SpecialSearch.php(387): MediaWiki\Search\SearchEngine->searchText(string) #9 /srv/mediawiki/php-1.46.0-wmf.18/includes/Specials/SpecialSearch.php(198): MediaWiki\Specials\SpecialSearch->showResults(string) #10 /srv/mediawiki/php-1.46.0-wmf.18/includes/SpecialPage/SpecialPage.php(711): MediaWiki\Specials\SpecialSearch->execute(null) #11 
/srv/mediawiki/php-1.46.0-wmf.18/includes/SpecialPage/SpecialPageFactory.php(1712): MediaWiki\SpecialPage\SpecialPage->run(null) #12 /srv/mediawiki/php-1.46.0-wmf.18/includes/Actions/ActionEntryPoint.php(504): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, MediaWiki\Context\RequestContext) #13 /srv/mediawiki/php-1.46.0-wmf.18/includes/Actions/ActionEntryPoint.php(144): MediaWiki\Actions\ActionEntryPoint->performRequest() #14 /srv/mediawiki/php-1.46.0-wmf.18/includes/MediaWikiEntryPoint.php(180): MediaWiki\Actions\ActionEntryPoint->execute() #15 /srv/mediawiki/php-1.46.0-wmf.18/index.php(44): MediaWiki\MediaWikiEntryPoint->run() #16 /srv/mediawiki/w/index.php(3): require(string) #17 {main} ```
    • Task
    **Steps to replicate the issue** (include links if applicable): * [[ https://commons.wikimedia.org/w/index.php?title=Special%3AMediaSearch&fulltext=Search&ns6=1&search=-deepcategory%3A%22Maps+of+the+world+by+language%22+deepcategory%3A%222020s+maps+of+the+world%22+-deepcategory%3A%22Wikimania+Map+of+the+world%22&type=image |-deepcategory:"Maps of the world by language" deepcategory:"2020s maps of the world" -deepcategory:"Wikimania Map of the world" ]] shows 231 results with no warning at the top about incomplete results * [[ https://commons.wikimedia.org/w/index.php?title=Special%3AMediaSearch&fulltext=Search&ns6=1&search=-deepcategory%3A%22Maps+of+the+world+by+language%22+deepcategory%3A%222020s+maps+of+the+world%22+-deepcategory%3A%22Wikimania+Map+of+the+world%22+-deepcategory%3A%22SVG+maps+of+the+world+by+language%22+-deepcategory%3A%22English-language+SVG+maps%22+-deepcategory%3A%22Our+World+in+Data+food+and+agriculture+maps+of+the+world%22&type=image | the same after appending -deepcategory:"SVG maps of the world by language" -deepcategory:"English-language SVG maps" -deepcategory:"Our World in Data food and agriculture maps of the world" ]] shows only 58 results (also without warning) but all of the added cats are subcats of the cats in the prior scan **What happens?**: The result count differs and there is no warning. **What should have happened instead?**: The result count should be the same and if it really doesn't work, a warning about incomplete results should be displayed. **Software version** (on `Special:Version` page; skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): I mentioned this in {T414763} but it looks like a separate problem. 
(The example is what's used to populate [[ https://commons.wikimedia.org/wiki/Category:2020s_maps_of_the_world_in_unidentified_languages | 2020s maps of the world in unidentified languages ]] which is how at least / starting with the most relevant world maps are categorized by language to e.g. better enable translations and hopefully eventually better search results that doesn't show maps in some niche language I can't read at the top when that's not in my configured language(s).)
    • Task
    **Steps to replicate the issue** (include links if applicable): I have a number of search URLs bookmarked to assist in patrolling pending AfC drafts. Up until today they were working fine. For example: https://en.wikipedia.org/w/index.php?title=Special%3ASearch&limit=500&offset=0&ns118=1&sort=create_timestamp_desc&search=coverage+-%22routine+coverage%22+-%22significant+coverage%22+deepcat%3A%22Pending+AfC+submissions%22&sort=create_timestamp_desc&advancedSearch-current={%22fields%22%3A{%22deepcategory%22%3A[%22Pending+AfC+submissions%22]}}&searchToken=1gp5lziima6jhyyrq24ko37gj Alternatively, enter the search term `coverage -"routine coverage" -"significant coverage"` and select the category "Pending AfC submissions" in Advanced Search -> Pages in these categories. **What happens?**: No results. For some queries, it will show results for an unrelated word that shares some letters and with deepcat removed (e.g. `draft deepcat:"Pending AfC submissions"` -> `draw`) **What should have happened instead?**: I should get multiple pages containing the word "coverage". To be clear, I have tried this with other search queries as well, such as the "draft" example I mentioned above (which should *definitely* return results).
    • Task
    To better understand the resource consumption of vector search, we would like to run some tests: * increasing QPS (after warmup) over 20-30 min, sweeping from 5 to 100 QPS, while measuring percentiles for CPU/mem/disk/GC * constant QPS (after warmup) over 20-30 min with increasing parallelism, while measuring percentiles for CPU/mem/disk/GC * constant QPS (after warmup) over 20-30 min against different index sizes, while measuring percentiles for CPU/mem/disk/GC The assumption is that p99 reveals knee points. To be discussed: * We need a pool of queries, maybe bucketed by complexity. Should we source them from the logs? Alternatively, we can reuse the [[ https://docs.google.com/spreadsheets/d/1FMVsCm7AEYw5BsN7u-afMxmaCHkm_nZ1yzKoLb9ygNE/edit?gid=1925455549#gid=1925455549 | ~1.4k queries sampled by Research ]] for the Golden Set. * If possible we should also monitor the quality of results, but maybe that's hard without a golden set ready to use.
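The QPS sweep and the percentile measurement can be sketched as two small helpers. This is a minimal illustration under assumptions: a linear ramp with a fixed number of steps, and p99 computed with the standard library's default (exclusive) quantile method; the actual load-testing tool and its sampling are not decided here.

```python
from statistics import quantiles
from typing import List, Sequence


def qps_ramp(start_qps: float, end_qps: float, steps: int) -> List[float]:
    """Evenly spaced QPS targets from start to end, both included."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    delta = (end_qps - start_qps) / (steps - 1)
    return [start_qps + i * delta for i in range(steps)]


def p99(samples: Sequence[float]) -> float:
    """99th percentile of the samples (last of the 99 cut points)."""
    return quantiles(samples, n=100)[98]
```

For example, `qps_ramp(5, 100, 20)` gives twenty targets; holding each for 60 to 90 seconds covers the 20-30 minute window, and `p99` can be computed per step to look for the knee point.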
    • Task
    For the semantic search MVP on Android, we need an HTTP endpoint that the app can consume to retrieve semantic search results. The Android app [[ https://github.com/wikimedia/apps-android-wikipedia/blob/19e510993b8afdf02ee86ac78765efd1e08bd070/app/src/main/java/org/wikipedia/dataclient/Service.kt#L47 | uses the Action API ]] for both prefix and full text search. For convenience, we would reuse the Action API. Apparently, there are multiple discriminators already: * *gps* `gps(search|limit|offset)` = prefix search + `generator=prefix` * *gsr* `gsr(search|limit|offset)` = full text search + `generator=fulltext` TBD: Do we need another param family or can we expand the possible values, for example, `generator=semantic`? Alternatively, this might be solved via a search profile, see the fulltext query dependent profile. Corresponding Android app task: T412986 AC: * w/api.php? can be called to fetch semantic results
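One way the request could look, if the existing full text generator is reused with an extra flag, is sketched below. This is an assumption, not a decided design: it uses the standard `generator=search` module (the `gsr` parameter family) and the `cirrusSemanticSearch` parameter that appears in other tasks of this listing; the function name is hypothetical.

```python
from urllib.parse import urlencode


def semantic_search_url(api_base: str, query: str, limit: int = 20) -> str:
    """Build an Action API URL for a full text search with the semantic flag set."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": query,
        "gsrlimit": limit,
        # Assumed flag; the final parameter name or profile is still TBD.
        "cirrusSemanticSearch": "true",
    }
    return f"{api_base}?{urlencode(params)}"


url = semantic_search_url(
    "https://fr.wikipedia.org/w/api.php", "what is the capital of France?"
)
```

Reusing the existing generator keeps the response shape the app already parses; a dedicated `generator=semantic` would need a new parameter family on both sides.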
    • Task
    **Feature summary** MediaWiki search allows keywords like `insource:` and `intitle:`, but these are case sensitive: `intitle:banana` returns pages that contain the word "banana" in the title, whereas `Intitle:banana` returns pages that have the words "intitle", "intles", or "intitled", etc. anywhere in the page text, and everything after the colon is apparently ignored. This should be changed so that either the whole keyword or at least the first character is case insensitive. **Use case(s) and benefits** Mobile phones frequently autocapitalise the first letter of an input, and it can be fiddly to change. It is frustrating to have to remember to do that, and a waste of time (and potentially data) if you don't know or forget. **See also** Original request at enwiki, which contains more examples: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(idea_lab)#insource_and_similar_search_terms_allowed_to_be_capitalized?
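The requested behaviour can be illustrated with a small normalisation step applied before parsing. This is a sketch only and not how the CirrusSearch parser works: the keyword list is a hypothetical subset, and the regular expression would also rewrite keywords that appear inside quoted phrases.

```python
import re

# Hypothetical subset of keywords; the real list is much longer.
KEYWORDS = ("intitle", "insource")

_KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r"):", re.IGNORECASE)


def normalize_keywords(query: str) -> str:
    """Lowercase known search keywords, leaving the rest of the query untouched."""
    return _KEYWORD_RE.sub(lambda m: m.group(1).lower() + ":", query)
```

With this step, `Intitle:banana` and `intitle:banana` would be parsed identically, while the search terms after the colon keep their original case.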
    • Task
    ### User Story As a reader using Hybrid Search, when my results relate to a biography, I want to see an example semantic-style query (e.g., “Who is Beyonce?"), so that I understand how to ask similar questions and explore topics by meaning. ### Description When Hybrid Search semantic results are primarily sourced from a biography article, display a lightweight, contextual prompt in the semantic results section that surfaces the query pattern: Who is <Article Title>? This prompt is intended to model natural-language and semantic-style queries, helping readers learn how to interact with Hybrid Search without auto-generating or rewriting their query. ### Requirements The Android app uses the action API `action=query` with the parameter `cirrusSemanticSearch=true` to retrieve semantic results. To determine whether the article is a biography of a living person, the apps team will need to have a variable in the action API response that indicates `blp` is `true` or `false`.
    • Task
    Glent seems to be broken since nov 22: ``` Exception in thread "main" java.lang.NullPointerException at org.wikimedia.search.glent.fst.SerializableFST.writeObject(SerializableFST.java:33) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:319) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:321) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:138) at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1509) at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:546) at org.wikimedia.search.glent.fst.AllPairsLevenshtein.toLookup(AllPairsLevenshtein.java:147) at org.wikimedia.search.glent.fst.AllPairsLevenshtein.getForwardLookup(AllPairsLevenshtein.java:180) at org.wikimedia.search.glent.fst.AllPairsLevenshtein.apply(AllPairsLevenshtein.java:70) 
at org.wikimedia.search.glent.fst.AllPairsLevenshtein.apply(AllPairsLevenshtein.java:54) at org.wikimedia.search.glent.SimilarQueriesSuggester.generateCandidatesFromStructs(SimilarQueriesSuggester.java:126) at org.wikimedia.search.glent.SimilarQueriesSuggester.generateCandidates(SimilarQueriesSuggester.java:96) at org.wikimedia.search.glent.Method$M1RunCandidates.accept(Method.java:177) at org.wikimedia.search.glent.Method$M1RunCandidates.accept(Method.java:162) at org.wikimedia.search.glent.GlentControl.main(GlentControl.java:63) ```
    • Task
    The reasons for not implementing a namespace filter in LinkSearch are no longer valid: moved to T12593 Without a namespace filter, the web link search is simply unusable for many large domains, as you often have to skip thousands of hits on discussion pages. Many users circumvent the restriction by using insource search, but that's not a good solution either. A solution with good filtering options, e.g. glob or regex, would be desirable. UPDATE 15/12/2025 (Title changed) Add Elasticsearch to Special:LinkSearch 1. Split URL into domain, path, query and fragment 2. Split domain into TLD, domain, subdomains 3. Split path into folders 4. Split query into keys and values 5. Build functions like indomain, infolder, inkey, invalue ... 6. Investigate whether URL-specific functions can be used as useful filter options for general insource searches.
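Steps 1 to 4 above can be sketched with the standard library. The field names are hypothetical and the domain split is deliberately naive: it treats the last label as the TLD, so multi-label suffixes such as `co.uk` would need a Public Suffix List lookup instead.

```python
from urllib.parse import parse_qsl, urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the parts a link index could store as separate fields."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        "tld": labels[-1],
        "domain": labels[-2] if len(labels) >= 2 else "",
        "subdomains": labels[:-2],
        "folders": [segment for segment in parts.path.split("/") if segment],
        "query": dict(parse_qsl(parts.query)),
        "fragment": parts.fragment,
    }


example = split_url(
    "https://en.wikipedia.org/wiki/Main_Page?action=history&limit=50#top"
)
```

Keywords such as `indomain` or `inkey` from step 5 would then map to filters on these individual fields.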
    • Task
    Currently when you search for Chinese terms in Chinese Wikipedia, you also get results matching the search terms converted to another language variant (see https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/8902802cb92ed0106c3727a88e6644f9b8ccc63d/includes/SecondTry/SecondTryLanguageConverter.php). Multilingual wikis like Commons have many pages with Chinese in titles or contents (https://commons.wikimedia.org/wiki/Category:Chinese-language_books_by_title), and it would be nice if we could search using simplified Chinese to find titles in traditional Chinese. Note: 1. It would not be strange to enable this in all Wikimedia wikis, e.g. many articles in English Wikipedia contain Chinese characters (proper names), and if they are in simplified Chinese we currently cannot find them using traditional Chinese 2. Terms in Chinese characters are not necessarily Chinese (e.g. they can be Japanese kanji), but it is harmless if we also try a conversion as if they were Chinese.
    • Task
    The cirrus-reindex-orchestrator is a tool that is able to run multiple reindexes of wiki indices in parallel. It is limited to 8 shards/cluster in parallel, which means that a single reindex runs at a time on large wikis (commons) but up to 8 mwscript could run in parallel for small wikis. Unfortunately the deployment of multiple mwscript-k8s is causing some impact on the k8s api response times: {F70871002} We can see the timing degrading: big wikis get reindexed first, and as more small wikis are processed concurrently the pressure on the k8s resources increases. We could investigate ways to make this process less impactful on the k8s APIs: * investigate using [[https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Running_on_multiple_wikis_(the_safe_way)|--local_dblist]], it's possibly acceptable for small wikis? * complete refactor and prefer using the mediawiki API to return the mapping/index config and schedule the reindex from python instead of the maint script * workaround: review the concurrency limits and make the process slower overall * possible small optimizations: the cleanup of helm deployments is not batched, perhaps it could help a bit to batch the cleanups (if `helmfile destroy` on multiple releases at once can help) * other ideas? AC: * running a full reindex does not cause the k8s API response times to increase
    • Task
    In T410602 a contributor relied on some of the cirrus debug APIs to troubleshoot an issue with the search index. This was particularly useful: it allowed giving rapid feedback to other contributors about what was possibly happening, and also writing a detailed phab bug report that greatly sped up the troubleshooting done by the cirrus maintainers. We should better document these APIs so that it becomes easier to understand & explain search behaviors. Missing debug APIs: * dump the completion index document (i.e. `action=cirrusSuggestDump`) * possibly allow to specify the cluster with `action=cirrusDump` Document existing APIs: * indexed documents: `action=cirrusDump` (and future `action=cirrusSuggestDump`) * search explainability: cirrusDumpQuery, cirrusDumpResult, cirrusExplain * document building: [[https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcirrusbuilddoc|cirrusbuilddoc]] & [[https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcirruscompsuggestbuilddoc|cirruscompsuggestbuilddoc]] AC: * missing APIs are implemented * all APIs are documented in mw.org
    • Task
    The `CirrusSearch inconsistencies` superset dashboard tracks some inconsistencies detected by comparing the cirrus dumps extracted from the codfw search cluster vs the sqooped mysql tables available in the datalake. (Note: some aggregated data is also available in prometheus under the `cirrussearch_content_inconsistency` metric.) In T410602 a contributor discovered that the search index kept some stale data. This is particularly dangerous because this stale data can come from a vandalized edit, which may in turn pollute search behaviors with highly problematic responses/suggestions. The current consistency checks failed to uncover such problems early enough; we should evaluate how to improve them so that we can detect these issues more proactively. Problems of the current checks: * they only capture //simple// problems: ** `redirect_in_cirrus`: redirects indexed as plain pages in cirrus ** `in_mysql_but_not_in_cirrus`: page present in the database but not in the cirrus index ** `in_cirrus_but_not_in_mysql`: page in the search index but not in the database ** `revision_mismatch`: indexed page but with the wrong revision * the check compares weekly cirrus dumps vs monthly sqoop db snapshots, and includes some tedious logic to account for the time difference between these two datasets * the checks only compare very high level metadata (page_id, revision_id) Suggested improvements: * increase the frequency of the checks from monthly to weekly (can we use other datasources than sqooped tables?) * also check the eqiad index by ingesting it into the datalake as well * include more granular checks of important metadata: ** redirects array in cirrus ** defaultsort ** others
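To make the four existing check names concrete, here is a minimal sketch of how one page could be classified from its two records. The record shape (`revision_id`, `is_redirect`) and the function name are hypothetical, and the sketch assumes redirects are not expected to have their own document in the index; the real pipeline also has to handle the time difference between the datasets, which is ignored here.

```python
from typing import Optional


def classify(cirrus: Optional[dict], mysql: Optional[dict]) -> Optional[str]:
    """Return the inconsistency label for one page, or None if consistent.

    `cirrus` is a hypothetical record {"revision_id": int}; `mysql` is a
    hypothetical record {"revision_id": int, "is_redirect": bool}.
    Either one is None when the page is absent from that side.
    """
    if cirrus is None and mysql is None:
        return None
    if cirrus is None:
        # Assumption: a redirect missing from the index is expected.
        return None if mysql["is_redirect"] else "in_mysql_but_not_in_cirrus"
    if mysql is None:
        return "in_cirrus_but_not_in_mysql"
    if mysql["is_redirect"]:
        return "redirect_in_cirrus"
    if cirrus["revision_id"] != mysql["revision_id"]:
        return "revision_mismatch"
    return None
```

More granular checks (redirects array, defaultsort) would add further branches comparing those fields once both sides expose them.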
    • Task
    In support of T407521, we want to expose an array of sections for the `cirrusbuilddoc` prop. To support further processing (extraction of vectors), sections should preserve new line markers. AC: * the current structure is kept as is * only one field is added, that contains an array of sections * sections preserve new line markers. * optionally control exposing this field via query parameter
    • Task
    **Steps to replicate the issue** (include links if applicable): * Enter a deepcat query at Commons with enough results to show the "Load more" button at the bottom of the page ([[ https://w.wiki/FzjY | example]]) * Scroll to the bottom of the page to load more results **What happens?**: Eventually additional results stop being loaded: the loading icon at the bottom of the page changes into the false "No more results found" message and a "Deep category search SPARQL query failed" error appears underneath the search box. This can happen on the very first attempt to load more results or after several successful ones. The behavior is unpredictable, but failure seems to be more likely the longer you wait between attempts - you'll get more results if you immediately start scrolling to the bottom of the page than if you behave "naturally". **What should have happened instead?**: All results should have been loaded by the time the "No more results found" message is shown. **Other information** (browser name/version, screenshots, etc.): {F70278651} {F70278655}
    • Task
    On loginwiki the configdump may produce a plain string array instead of a map with namespace ids as keys: https://login.wikimedia.org/w/api.php?action=cirrus-config-dump&prop=replicagroup|namespacemap&format=json&formatversion=2 This is causing the search update to fail with: ``` java.lang.IllegalArgumentException: Cannot deserialize value of type `java.util.LinkedHashMap<java.lang.Long,java.lang.String>` from Array value (token `JsonToken.START_ARRAY`) at [Source: UNKNOWN; byte offset: #UNKNOWN] at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:4544) at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:4485) at org.wikimedia.discovery.cirrus.updater.producer.graph.CirrusNamespaceIndexMap.lambda$load$3(CirrusNamespaceIndexMap.java:144) ``` The reason we discover this issue just now is that loginwiki events got enabled just recently (T408701). The pipeline has been unblocked by temporarily excluding loginwiki (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1200294). AC: - the cirrus config dump with namespacemap should produce a value map of namespace ids -> index_name
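The shape mismatch can be illustrated with a small consumer-side sketch that accepts both forms. This is not the fix requested in the AC (which is to make the dump always emit a map); the function name is hypothetical, and it assumes that when a list is received, the list position is the namespace id.

```python
def normalize_namespace_map(value):
    """Coerce a namespacemap payload into {namespace_id: index_name}.

    Accepts the expected JSON object (string keys) as well as the plain
    array form described in this task.
    """
    if isinstance(value, dict):
        return {int(ns): name for ns, name in value.items()}
    if isinstance(value, list):
        # Assumption: position in the array equals the namespace id.
        return {ns: name for ns, name in enumerate(value)}
    raise TypeError(f"unsupported namespacemap type: {type(value).__name__}")
```

For example, `normalize_namespace_map(["idx_a", "idx_b"])` and `normalize_namespace_map({"0": "idx_a", "1": "idx_b"})` both yield `{0: "idx_a", 1: "idx_b"}` (the index names are placeholders).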
    • Task
    **Steps to replicate the issue** (include links if applicable): * set `$wgCirrusSearchDisableUpdate` to `true` in your `LocalSettings.php` file * run `mw dev mw mwscript CirrusSearch:ForceSearchIndex` **What happens?**: * The script reports that all pages were indexed, even though newly indexed pages will be missing due to the configuration being set **What should have happened instead?**: * The script does not report on indexed pages if they are not indexed
    • Task
    via T406020#11299905: Based on this test it seems we should setup a dedicated search profile for commonswiki with the doubled near match weight and run an AB test. There is already an AB test that is about to start (related to completion suggester) and our infra can only run one at a time, so this might be delayed a week or two.
    • Task
    While working on T404632, I noticed that `insource:/𐌀𐌁𐌂𐌃/` fails with an error, regardless of whether it matches anything. Every regex I've tried with 4 or more four-byte characters fails. They also fail on my local wiki, which has a fix for the highlighting issue in T404632. I tried the regexes on enwiki, frwiki, dewiki, jawik**t**, and commons Special Search. Erik pulled out this message: > Provided analyzer generated more than one token, if using 3grams make sure to use a 3grams analyzer, for input [\uDF00\uD800\uDF01\uD800\uDF02\uD800\uDF03] first is [\uDF00\uD800\uDF01\uD800\uDF02] but [\uD800\uDF01\uD800\uDF02\uD800\uDF03] was generated. A little more digging reveals that a regex with a four-byte character followed by three or more characters of //any kind// causes the error. (e.g., `insource:/𐌀aaa/`) (Based on the message Erik found and the behavior I'm seeing, my guess is that the trigram extractor is stepping one //codepoint// forward and not one character when it moves to get the "next" trigram... but we will have to see.)
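The surrogate escapes in the error message can be reproduced with a short sketch. It is only one reading of the message, not a confirmed diagnosis: it shows where a window that advances by one UTF-16 unit would start in the middle of a four-byte character, next to trigrams taken over whole code points. All function names are hypothetical.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def mid_character_offsets(text):
    """UTF-16 offsets that point at a low surrogate, i.e. inside a character."""
    return [i for i, unit in enumerate(utf16_units(text)) if 0xDC00 <= unit <= 0xDFFF]


def trigrams_by_code_point(text):
    """Trigrams over whole code points (Python str indexes by code point)."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


sample = "\U00010300\U00010301\U00010302\U00010303"  # the four letters from the task
print([hex(u) for u in utf16_units(sample)])
print(mid_character_offsets(sample))
print(trigrams_by_code_point(sample))
```

The second window in the error message starts at offset 1, which is one of the offsets reported by `mid_character_offsets`.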
    • Task
    We ran an AB test in T404647 that compared the default language model (title + redirect.title) vs a variant (opening_text). We were surprised to find that the variant field did not improve performance vs the default language model. One possibility is that there are patterns in the titles that are not found in the opening_text, and we need both. Run a test that compares title+redirect.title vs title+redirect.title+opening_text.
    • Task
    **Steps to reproduce the issue** (include links if applicable): * Go to any page, e.g. https://pl.wikipedia.org/wiki/Wikipedia:Kawiarenka/Kwestie_techniczne * Type in (or paste), e.g. `insource:/diaboli/` **What happens:** Autocomplete suggestions are shown. I think this is a recent development, so probably a regression. **What should have happened instead:** No suggestions should appear when using special keywords, such as when the search box contains `insource:`, `intitle:`, etc. On enwiki, you might want to check for `...:[/"]`, but I guess it would be unlikely for the vast majority of users to randomly type in those keywords, even in English. **Software version**: plwiki **Other information** (browser name/version, screenshots, etc.): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:143.0) Gecko/20100101 Firefox/143.0 {F66748462} Reported also by a fellow wikimedian here: https://pl.wikipedia.org/wiki/Wikipedia:Kawiarenka/Kwestie_techniczne#Podpowiedzi_na_Special:Search
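The expected behavior can be sketched as a simple check run before requesting suggestions. This is an illustration only: the keyword list contains just the two keywords named in this report, the function name is hypothetical, and the real search box logic lives in front-end code rather than Python.

```python
import re

# Only the keywords mentioned in the report; a real list would be longer.
KEYWORDS = ("insource", "intitle")

# A keyword counts when it starts the query or follows whitespace.
KEYWORD_PATTERN = re.compile(r"(?:^|\s)(?:" + "|".join(KEYWORDS) + r"):")


def should_suggest(query: str) -> bool:
    """Return False when the query contains a special search keyword."""
    return KEYWORD_PATTERN.search(query) is None
```

With this check, `should_suggest("insource:/diaboli/")` is false while `should_suggest("diaboli")` is true, so plain queries keep their autocomplete.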
    • Task
    Hi. Now that CirrusSearch has filtering by date, could you please add simple date selectors to the GUI? It would be very helpful and much better than inserting dates manually. It could, for example, look like this: 1) Dropdown selector between creationdate and lasteditdate. 2) Dropdown selector between =, >, >=, <=, <, probably better expressed in words. 3) Date selector for a specific day. 4) Optionally, a button in the bottom part of the selector: "Use whole month". 5) A button "Add one more set of date selectors". Of course, this does not cover everything, for example `today-35d` or a "use whole year" option. But even most of this functionality would be very helpful.
    • Task
    When a regex search times out, it displays a message saying "A warning has occurred while searching: The regex search timed out, so only partial results are available. Try simplifying your regular expression to get complete results." That's fine, but it would be helpful to also provide some estimate of how complete the search was. For example, right now there are 7,066,411 articles in English Wikipedia. If the search was in article space and managed to get through 1,234,567 articles before timing out, it could report something like "1,234,567 of 7,066,411 pages (17.47%) searched before timeout." Knowing how close to completion was achieved is helpful in various ways. For example, If the results are errors that the user plans to fix, it helps the user estimate how many errors need to be fixed in order to get the search to complete.
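The proposed message is straightforward to derive once the number of pages scanned is known. A minimal sketch, assuming the backend can report that count (the function and parameter names are hypothetical):

```python
def timeout_progress(searched: int, total: int) -> str:
    """Format how much of the corpus was scanned before the timeout."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = 100 * searched / total
    return f"{searched:,} of {total:,} pages ({percent:.2f}%) searched before timeout."


print(timeout_progress(1_234_567, 7_066_411))
# 1,234,567 of 7,066,411 pages (17.47%) searched before timeout.
```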
    • Task
    Newer versions (>= 8) of ruflin/Elastica might no longer be compatible with OpenSearch that CirrusSearch relies on and we might need to fork/port this library to support it. https://packagist.org/packages/ruflin/elastica#8.0.0 https://packagist.org/packages/ruflin/elastica#8.1.0 https://packagist.org/packages/ruflin/elastica#8.2.0 https://github.com/ruflin/Elastica/releases/tag/9.0.0
    • Task
    Run an A/B test to see whether or not `defaultsort` can improve the user experience of search-as-you-type. **Context** The use of the //defaultsort// data in completion searches is believed to help in these cases: * search for last-name first (previously enabled on the mongolian wikipedia: T327878) * search for concepts that are often used with popular prefixes (search //XYZ// finds //List of XYZ//, see T386655) The list of wikipedias where we could enable this feature is below (filtered on wikipedias that have more than 50% of the pages with a //defaultsort// that matches the pattern we expect and where it would increase recall on more than 20% of the pages).
    |wiki|recall improvement|
    |enwiki|26%|
    |dewiki|28%|
    |frwiki|23%|
    |eswiki|21%|
    |plwiki|23%|
    |fiwiki|28%|
    |nowiki|23%|
    |cswiki|23%|
    |hewiki|28%|
    |dawiki|26%|
    |simplewiki|25%|
    |bgwiki|20%|
    |etwiki|21%|
    |glwiki|20%|
    |afwiki|33%|
    |slwiki|20%|
    |lvwiki|20%|
    |vowiki|47%|
    |fywiki|21%|
    |mnwiki (x)|20%|
    |gvwiki|25%|
    |mtwiki|21%|
    |biwiki|37%|
    |iglwiki|26%|
    (x): already enabled in T327878 We could start an A/B test on 3 wikis first: enwiki, frwiki and hewiki. AC: * [*] determine the set of wikis to test * [x] start an A/B test on these wikis * [x] analyse the results ** https://people.wikimedia.org/~dcausse/T404858-completion-default-sort-en-fr-he.html ** https://people.wikimedia.org/~dcausse/T404858-completion-default-sort-2.html * [] possibly enable this by default on these wikis and possibly infer if it can be enabled on more wikis
    • Task
    As part of T390858 we added a new suggest_variant field to the production search clusters that contains the opening_text content, rather than the titles and redirects. We suspect this will produce a language model that more accurately reflects the queries users will type. Run an ABC test with the following variants: * control: existing did-you-mean suggester * d1: existing configuration, but with prefix_length = 1 * d1v: existing configuration, but with prefix_length = 1, field = suggest_variant Test should run for one week, as is typical for search tests. Test should be analyzed with a variant of the [[ https://gitlab.wikimedia.org/repos/search-platform/notebooks/-/tree/main/ab-test/did-you-mean? | dym-ab-analysis notebook ]]. We expect the d1v bucket to see an increased latency, the analysis likely needs to be updated to report on changes in latency between buckets. It may also require updates to properly report A/B/C.
    • Task
    It's really nice being able to easily search for entity schemas via the search box now, thanks to the dropdown for entity types. And I am far more likely to remember that entity schemas are a thing... {F65938451} However, the display of search results is really not ideal: {F65938454} A) Results show some raw content from the entity schema itself {F65938463} B) The description is not shown even when there is one, including in my language {F65938492} C) Descriptions are shown in a non-ideal order, such as language AST first? {F65938562} {F65938579} Generally speaking, it would be great to get this improved
    • Task
    When T402858 is merged, we can work on integrating the RU/QWERTY and HE/QWERTY mappings into autocomplete.
    • Task
    **Feature summary** (what you would like to be able to do and where): For example if I'd be typing "Tem" into search, it would suggest "Template:" **Benefits** (why should this be implemented?): If you edit templates and modules a lot, it's tedious to write the namespace every time
    • Task
    [] Set up for running and testing locally [] Test CirrusSearch\ChangeListener::onPageMoveComplete [] CirrusSearch\ChangeListener make public accessors (getters and setters), and mark them as @internal (not strictly needed since in most extensions "internal" is the default, but I'd do it anyway, for good measure). [] CirrusSearch\ArchiveChangeListener Once T389593 is done, PageUndeleteComplete can be replaced by a listener for the PageCreated event, plus a check for whether the creation was caused by undeletion. Implementation notes * Docker overview: https://www.mediawiki.org/wiki/MediaWiki-Docker/Extension/CirrusSearch
    • Task
    **Feature summary** (what you would like to be able to do and where): For large categories and/or categories with long chains of nested subcategories, using the deepcategory view or searching the category using deepcategory can show only partial results with this warning message above the search results ([[ https://commons.wikimedia.org/w/index.php?search=deepcategory%3A%22Microscopic+images+relating+to+biology%22&title=Special:MediaSearch&type=image | example ]]): >Deep category query returned too many categories. Only a subset of categories has been applied. {F60693053} Please add a way of loading further results for those cats where deepcat fails to show all results. * This could be a button saying [click here to load more] * Note that the default-algorithmic and the user-specified sorting and filter functionalities are relevant here, so an infinite scroll feature where more files are only loaded once the user has scrolled to the end of the results wouldn't be sufficient * It could also display the categories whose subcategories were too deep, so that the user can then fetch results from them with the click of a button ** For example, in the example it could display this at the top (possibly within that warning box): *** [click here to load all] *** Subcategories of //Scanning electron microscopic images of pollen// have not been included [click here to include these] [click here to exclude these from //load all//] *** Subcategories of //Scanning electron microscopic images of insects// have not been included [click here to include these] [click here to exclude these from //load all//] **Use case(s)** (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. 
Do not describe only a solution): For the example (and there are other possible examples with further use cases), the main real-world use cases include: # contributors using it to fix miscategorizations, i.e. to spot files that are not microscopic images and subcategorize things accordingly (eventually this view should really only show microscopic images, not e.g. diagrams or videos or offtopic files) and # Internet users interested in good-quality freely-licensed microscopic images looking for a way to conveniently browse them without having to dig through countless unsortable nested subcategories. The problem is that it does not load all files, which is for example an issue when you'd like to sort by recency, when you'd like to further search the results (e.g. with an extra search term), when one has scrolled to the end of the search results, or just in general when some arbitrary subset of relevant files is not included. **Benefits** (why should this be implemented?): See above; this has huge benefits, at least once deepcategory is used more, which it could be if for example the FastCCI gadget button is fixed (T367652). See also [[ https://commons.wikimedia.org/wiki/Commons:Requests_for_comment/Technical_needs_survey/Wall_of_images_view_for_category_pages_including_images_in_subcats | Wall of images view for category pages including images in subcats ]] which got lots of support. So benefits include better categorized files, a much better way to browse them, and more usefulness of all the files on Commons and of the things people do there, like organizing these by subject. **Other info**: Related issue T391876
    • Task
    **Steps to replicate the issue** (include links if applicable): See T376440 (it was closed again and I was asked to instead create a new issue). The deepcategory search does show partial results now so the other issue seemed done. However, apparently it can often fail. The example category did show many results sometimes but sometimes no results again. I thought it was a temporary problem because when I checked the search again, it did show results again but now it failed again to show results * Open/search for `deepcategory:"Manufacturing by product"`([[ https://commons.wikimedia.org/w/index.php?search=deepcategory%3A%22Manufacturing+by+product%22&title=Special:MediaSearch&type=image | link ]]) in the Wikimedia Commons search **What happens?**: It shows no results and the error >Deep category search timed out. Most likely the category has too many subcategories In case it works for you at the time when you look at this issue, here is a screenshot: {F60691898} **What should have happened instead?**: It should always show those partial results (as it does only sometimes). Here's how it looks like when it works: {F60692107} **Software version** (on `Special:Version` page; skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.):
    • Task
    ### Description CirrusSearch is an internal team, built on top of ElasticSearch. ### Conditions of acceptance 1. Investigate the CirrusSearch extension to determine the level of effort for transitioning to Domain Events * [x] Investigate the complexity of the code depending on the hooks * [x] Investigate test coverage * [x] Determine what is required for local infrastructure for testing * [] Identify risks of the migration * [x] Determine if all related hooks can be replaced with Domain Events (PageUpdated, PageDeleted, PageMoved) 1. Engage with the CirrusSearch team to socialize Domain Events work. * Provide context of the importance and impact of Domain Events (Halley & Daniel can help with this) * Give the team a heads up that we are doing this * Request code review & test support [x] Create follow up tickets for implementation.
    • Task
    [ ] Avoid use of "sanity"/"insanity" [ ] Classes and namespaces [ ] Jobs [ ] Config [ ] …
    • Task
    The phrase suggester is a feature used by CirrusSearch to provide //Did You Mean// suggestions. For perf/size reasons the field used by this suggester is populated with title & redirect texts. It is believed that this type of suggester works better on a relatively large corpus containing more than just titles. We added the option to feed this suggest field with the `opening_text` as well; unfortunately we haven't been able to test this behavior because, like all features depending on index-time config, it is very hard to A/B test. Additionally, the `suggest` field is part of the MLR features and changing it could possibly have negative consequences if the models are not re-trained appropriately. To ease flexibility & testing we could consider creating a dedicated index per language that would be fed from the various text fields available from the cirrus dump in hive. CirrusSearch would have to be adapted to allow sending a separate suggest query to this index. The nature of the text that has to be pulled is up for discussion, but using a separate index can certainly increase our ability to iterate a lot quicker. A proof-of-concept could perhaps be tested before automating this pipeline by manually creating an index. We could consider re-using the glent pipeline to automate it. AC: - Glent is able to construct a dataset fit to build an index dedicated to run suggest queries with the phrase suggester - Quick study about what content is appropriate (e.g. title+opening_text, title+redirects+opening_text, ...) - Create an index fit for the phrase_suggester for a couple of languages - Adapt CirrusSearch to be able to use a separate index to fetch its DYM suggestions from the phrase suggester - Run an A/B test on a set of wikis - Depending on the outcome automate the pipeline with glent (or something else) - Test & expand the feature to more languages/wikis
    • Task
    **Steps to replicate the issue** (include links if applicable): * https://pl.wikipedia.org/ * Go to search. * Type in "muzeum narodowe kraków" and go to the search page. **What happens?**: The main skip link goes to the search input. There is also a problem if the user skips to the H1 and then tries to find the search results. **What should have happened instead?** The skip link should either point to the first result or there should be an additional skip link or skip links. An example with two skip links: https://libra.lib.mol.pl/search/any?q=test - The first skip link (first tab on page load) is generic. - The second skip link skips over filters etc (so to the first search result). This skip link was added after WCAG tests by a certified tester team. The final tests weren't very extensive, so take that with a grain of salt. **Software version**: plwiki. **Other information** (browser name/version, screenshots, etc.): Full test (almost complete) by a tester on behalf of the National Museum in Kraków (in Polish): https://www.youtube.com/watch?v=fmoFGMmkbag&list=RQKEwBay1GvpiPDdkTZbEvH_e6VbI&t=11023s Search test starts here: https://www.youtube.com/watch?v=fmoFGMmkbag&list=RQKEwBay1GvpiPDdkTZbEvH_e6VbI&t=11257s The browser was said to be Safari. **Related problems and comments**: Note that some of the issues with reading articles stem from the fact that the navigation is located below the H1. This has been discussed before (I can probably find that task if it's important). The issue is that this specific user prefers to skip to the H1 to understand where he is, and then navigate to the content from there. So an additional skip link after the H1 might also be a good idea. It was mentioned that most of the time the tester actually uses Wikipedia by: 1. Searching via Google, to skip the issues with the search page :( 2. Copying the article text to a different window to read it there, or using "Print PDF" and reading that. 
This is partially a Safari issue; Firefox has a mode that works much better for just reading the article... Separate problem though. **Stats of screen readers in Poland** BTW, here is a questionnaire of Polish screen reader users: https://ahdar.com.pl/badania/czytniki-ekranow-2024 It wasn't a huge study, but I think it's pretty credible.
| Reader | Ans. Count | Percentage |
| NVDA | 30 | 58.8% |
| JAWS | 2 | 3.9% |
| VoiceOver | 13 | 25.5% |
| Narrator | 1 | 2% |
| //no answer// | 5 | 9.8% |

| Browser | Ans. Count | Percentage |
| Firefox | 18 | 35.3% |
| Chrome | 17 | 33.3% |
| Safari | 10 | 19.6% |
| Edge | 2 | 3.9% |
| Opera | 1 | 2% |
| //no answer// | 3 | 5.9% |
Safari/VoiceOver usage is obviously higher on Apple devices. It is also worth noting that most users use Apple phones even if they use Windows as their desktop/laptop system.
    • Task
    CirrusSearch can be used in various ways to provide ranked list of pages. To identify some of the use-cases we rely on crude heuristics that are often error prone and hard to replicate. It could be interesting to see whether it is possible to have a more explicit way to track the various use-cases where search is involved. The benefits could be: - have a better understanding of how the various features are using the search APIs - possibly let the engine tune the profiles based on a provenance, some users sometimes have to explicitly select rescore profiles, knowing the provenance the engine could select the best profiles automatically based on a centralized config - possibly be more "generous" in some ways for specific use-cases by relaxing some of the limits imposed by the search APIs The way to achieve this is yet unclear but the main idea would be to propagate a `tag` identifying the feature all the way down to CirrusSearch and all the search log events. It has some similarities with [[https://wikitech.wikimedia.org/wiki/Provenance|wprov]]. Knowing the features that are using search might take time but here is a quick list of possible candidates to get an idea: - the various search boxes of the visual editor (templates, images...) - Special:ContentTranslation - [[https://www.mediawiki.org/wiki/Extension:InputBox InputBox]] search mode AC: - design how this can be done, from the backend SearchEngine up to the UI - implement the logic - target a couple use-cases and implement them - seek other use-cases and promote the approach
    • Task
    CirrusSearch can use several highlighters: - Custom cirrus-highlighter - fvh OpenSearch provides the unified highlighter, which was designed and built after we wrote the cirrus-highlighter. We could evaluate whether the unified highlighter has the features we need and whether it provides better highlights than our custom highlighter. We should list all the custom features the cirrus-highlighter adds and evaluate whether they are required: - highlight of regex queries - `return_snippets_and_offsets` used by WikibaseCirrusSearch to distinguish between label & alias matches - `skip_if_last_matched` (could still be a valid optimization on wikibase which highlights labels.*.plain?) - [TODO] AC: - Add support for the unified highlighter - Evaluate if an A/B test between two highlighters is feasible/worthwhile -- If yes create a task to schedule an A/B test - If proven successful consider simplifying the cirrus highlighter to only provide the features we need
    • Task
    Currently the table image_suggestions_search_index_delta is shaped to have a line per wiki_id, page_id, tag tuples. For articles where multiple tags are updated we should ideally schedule a single update not multiple ones. The way to achieve this is unclear, it could be done upstream by changing the schema of image_suggestions_search_index_delta to have for instance `map<string, array<string>>` where the key is the tag and the value is the array of tag values. It could be an extra transformation step on the search side too but given that we would like to adapt this data-pipeline to use the unified weighted_tags stream (T372912) it might be preferable to do the grouping early on the image_suggestions pipeline side. AC: - image_suggestions tag updates are grouped per page not per page, tag
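The grouping described above (one `map<string, array<string>>` per page) can be sketched as follows. The row shape `(wiki_id, page_id, tag, values)` is an assumption about how the delta table would be read, and the function name is hypothetical; the real pipeline would do this in its own data-processing framework.

```python
from collections import defaultdict


def group_tags_by_page(rows):
    """Group (wiki_id, page_id, tag, values) rows into one entry per page.

    Returns {(wiki_id, page_id): {tag: [values...]}} so that a single
    update can be scheduled per page.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for wiki_id, page_id, tag, values in rows:
        grouped[(wiki_id, page_id)][tag].extend(values)
    # Convert nested defaultdicts to plain dicts for a stable result type.
    return {page: dict(tags) for page, tags in grouped.items()}
```

Three rows touching two tags of the same page then collapse into one entry keyed by `(wiki_id, page_id)`, which maps directly onto one update event.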
    • Task
    **Steps to replicate the issue** (include links if applicable): * From the Wikipedia home page, using the old Vector skin (or any page with search accessible) via https://en.wikipedia.org/wiki/Main_Page?useskin=vector * Start typing in the search bar **What happens?**: * Typing "Trigonometric Iden" offers "Trigonometric Identies" as a suggestion. **What should have happened instead?**: * The search suggestion should say "Trigonometric Identities" **Software version** (on `Special:Version` page; skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): Firefox {F58413348}
    • Task
    **Feature summary (what you would like to be able to do and where):** Support for searching specifically by page titles within CirrusSearch. This feature implements the `doSearchTitle` method of the abstract `SearchEngine` class for CirrusSearch, which allows users to perform title-only searches through the CirrusSearch engine (by calling `searchTitle` on the CirrusSearch instance), enhancing the search capabilities of MediaWiki, particularly in the context of the Nuke extension. Currently, calling `searchTitle` (a valid method defined in the `SearchEngine` class in MediaWiki Core) on the CirrusSearch search engine just returns null, because at present CirrusSearch only implements the `doSearchFullText` method; implementing `doSearchTitle` makes it work. **Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying):** I'm working on developing enhanced searching for the Nuke extension in collaboration with the moderator tools team, where users can find pages to delete based on the user that created them and terms in the title, amongst other fields. Currently the title terms field uses a LIKE query, which is inefficient, so I was migrating it to CirrusSearch (a solution that received a positive response from the moderator tools team developers), but noticed that CirrusSearch doesn't actually support the `searchTitle` method, so I added support for it. **Benefits (why should this be implemented?):** Efficiency: More precise searches when users only need to search by page titles, rather than full content. Enhanced moderation tools: It supports improvements to the Nuke extension, making it easier for admins to identify and manage pages based on titles, streamlining their workflow. Compliance with the `SearchEngine` class: Users would expect CirrusSearch, the full-featured search engine for MediaWiki, to support all methods defined in SearchEngine, including `searchTitle`.
    • Task
    WeightedTags can be sourced from different mechanisms using a dedicated stream. Some producers might actually be MediaWiki itself, and for such producers we might offer the ability to refresh (this refers to the "oldDocument" semantic in the Saneitizer). This would ensure that the tags are "recomputed" once in a while even if no source events triggered a change to the tags of this page. A concrete example is the PageAssessments extension and {T378868}, which made use of the search weighted tags. The initial approach taken was to send a weighted tag update on every LinksUpdate, even if there are no changes in the underlying source data of the weighted tags. This had the advantage of allowing us to populate the search index; the main drawback is that we now emit events that lead to NOOPs. If we exposed a hook triggered when an "oldDocument" is found by the Saneitizer, we could optimize such producers by letting them only send tag updates when the data has actually changed, while also allowing them to fix the search index at the same rate the Saneitizer is running, hopefully leading to a meaningful decrease in the number of events. AC: - A hook is exposed to let weighted_tags producers compute the set of weighted_tags of a particular page - The Saneitizer would call this hook when fixing and/or refreshing an old document - The SUP consumer is updated to understand the new response of the Saneitizer when used from its API endpoints (needs to figure out where to fit the weighted_tags in this response) - The PageAssessments extension is updated to benefit from this new hook - The PageAssessments extension is updated to only emit actual changes on LinksUpdate: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageAssessments/+/1088592
    • Task
    **Steps to replicate the issue** (include links if applicable): * Go to https://de.wikipedia.org * Enter `en:Wikipedia#History` in the search box. **What happens?**: It goes to https://en.wikipedia.org/wiki/Wikipedia The same happens if you start at https://commons.wikimedia.org and search `w:Wikipedia#History` **What should have happened instead?**: It should remember the anchor and go to https://en.wikipedia.org/wiki/Wikipedia#History A wikilink `[[:en:Wikipedia#History]]` goes to the section as expected. If you enter `Wikipedia#History` in the search box at enwiki itself then the anchor is remembered and it goes to the section as expected. It's unexpected that the behaviour is different in searches with a prefix.
    • Task
    **Steps to replicate the issue** (include links if applicable): * Go to https://ru.wikipedia.org/w/index.php?search=intitle%3A"Иванов"+intitle%3A"Иван"&sort=create_timestamp_desc&title=Служебная:Поиск&ns0=1&uselang=en **What happens?**: The dates in search results display the last edit date, not the page creation date. This is very confusing, since it makes it seem that the search doesn’t actually sort by page creation date: the dates all look random. The search should display page creation dates when sorting results in `create_timestamp_` modes. I encountered this while adding the title search link to the default output of the `{{disambiguation}}` template in Russian Wikipedia, see https://ru.wikipedia.org/wiki/Иванов,_Иван#footer Because of the dates, even I didn’t understand at first that the list is actually sorted; I don’t think other editors would be able to either.
    • Task
    **Feature summary**: * After the user intuitively used `intitle:` on Wikidata to search text inside entry labels, without success, the UI should show them a warning that they should use `inlabel:` instead. * That should happen in the content namespaces only. I assume `intitle:` is a no-op in content namespaces. It works as usual, though, in non-content namespaces like "Wikidata:". ** Alternatively, `intitle:` could be made to search in labels, //excluding// aliases. The benefit could be tangible: in an [example search of "organism"](https://www.wikidata.org/w/index.php?title=Special:Search&limit=500&offset=0&ns0=1&search=organism), only 174 links out of 500 contained it in the label as opposed to the alias / the label in other languages (`$('.wb-itemlink-label:contains("organism")').length`). (This is unstable though, as search results are nondeterministic: in another instance, 425 out of 500 labels contained it). **Use case**: # I don't have enough experience on Wikidata to know that Wikidata's search is anyhow different from how CirrusSearch works on other WMF wikis. # I want to find an item that, I remember, has "organism" in the title. # I search for "intitle:organism" in the main namespace, like I do it with page titles on Wikipedia. # I get no results and no indication that, in fact, I need to use `inlabel:` instead: {F57650395} # There is no indication in the UI that Wikidata's search works differently, e.g. that this link in the top right corner {F57650390} leads to some place other than where this link leads to on other wikis – it looks exactly the same, as does the rest of the search UI. # There is no way I can discover the right feature. (In fact, I had to create this Phabricator task for @Bugreporter to hint me about that.) **Thoughts**: * The fact that `intitle:` //doesn't// search for labels and instead searches only for regular wiki pages (of which none are in the main namespace) is counterintuitive.
Labels are displayed as titles in the search UI: {F57649907} So it makes total sense to search for them using `intitle:`. If that doesn't work, the user should be given a hint on how to do it right. * I believe a warning makes more sense than a notice, since `intitle:` searches are a no-op in the content namespaces anyway. * There already is a notice above the search input that says "To search for Wikidata items by their title on a given site, use Special:ItemByTitle." It is unrelated to `intitle`.
    • Task
    **Steps to replicate the issue** (include links if applicable): Unfortunately, I don't know how to reproduce this exactly. I created a bunch of subpages of my user sandbox, messed up a bit and had to fix it with a sequence of page moves, deletes, and undeletes. When I ran some searches, I was getting bizarre results. The attached screenshot illustrates one example; it claims it is showing "Results 1 – 6 of 6", but it only lists 5 results. {F57622298} If I re-execute the search now (https://en.wikipedia.org/w/index.php?search=User%3A+subpageof%3ARoySmith%2Fsandbox&title=Special%3ASearch&profile=advanced&fulltext=1&ns2=1&searchToken=2axmqdm7tg5njl1mx6nwho86n) it says "Results 1 – 5 of 5", so it is not easily reproducible.
    • Task
    **Normalizing Orthographic Re-Mapper (aka N.O.R.M.)** Build out the necessary infrastructure to support various kinds of text-mapping "second-try" searches, including "DWIM"-style wrong-keyboard searches (i.e., accidentally typing Russian/Cyrillic on a US/Latin keyboard) and transliterated searches (i.e., typing Georgian or Hindi in Latin script). A good place to start is replicating the Russian and Hebrew DWIM gadget's autocomplete results enhancement, and then extending that breadth-first to Georgian and Hindi transliteration in autocomplete, or depth-first into full-text results. wrong keyboard tickets: * {T138958} * {T155104} transliteration tickets: * {T297761} * {T127003} **Note:** Naming is hard. //DWIM// ("do what I mean") is/was an on-wiki gadget that supported wrong-keyboard searches on Russian and Hebrew wikis. However, it sounds a little too much like //DYM// ("did you mean"), our query reformulation suggestion feature. We've used //second-chance// and //second-try// in the past to refer to a number of related approaches that are a superset of what is under consideration here. Hence "**N.O.R.M.**", the //Normalizing Orthographic Re-Mapper,// which would be a shared infrastructure that would allow us to convert both //Fhbcnjntkm// to Аристотель ("Aristotle") on Russian wikis and //devanagari ka itihas// to देवनागरी का इतिहास ("history of Devanagari") on Hindi wikis in a variety of useful ways. Previous on-wiki write-ups: * [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English | Typing on the Wrong Keyboard—Russian and English ]] * [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/DWIM_as_API | DWIM as API ]] * [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Unpacking_Notes#Hindi_Wikipedia_Zero_Results_Queries | Hindi Wikipedia Zero Results Queries ]] (includes unsuccessful transliterated queries)
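The wrong-keyboard case described above can be illustrated with a small sketch. It assumes a plain key-position mapping from the US QWERTY layout to the standard Russian ЙЦУКЕН layout; the names `LATIN_KEYS`, `CYRILLIC_KEYS` and `remap_wrong_keyboard` are hypothetical and are not taken from the DWIM gadget or from CirrusSearch.

```python
# Keys of the three letter rows on a US QWERTY keyboard, and the characters
# found at the same key positions on the standard Russian layout.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"


def build_table():
    """Build a str.translate() table for lower-case keys and upper-case letters."""
    table = {}
    for latin, cyrillic in zip(LATIN_KEYS, CYRILLIC_KEYS):
        table[ord(latin)] = cyrillic
        if latin.isalpha():
            # Shifted punctuation keys are deliberately not handled in this sketch.
            table[ord(latin.upper())] = cyrillic.upper()
    return table


QWERTY_TO_RUSSIAN = build_table()


def remap_wrong_keyboard(text: str) -> str:
    """Return the text as it would read had the Russian layout been active."""
    return text.translate(QWERTY_TO_RUSSIAN)


print(remap_wrong_keyboard("Fhbcnjntkm"))  # Аристотель
```

A second-try search would presumably run the remapped string only when the original query performs poorly, since punctuation such as `.` is also remapped here.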
    • Task
    In the parent ticket we are debugging a problem with missing events from edit requests. The source of this problem appears to be that a request ran for multiple hours in post-send, and many (all?) of the deferreds timed out. To see how common this is, I wrote a script to query reqIds that logged `EmergencyTimeoutException`, and then did an aggregation query that filtered logs for the same (host, reqId) combo (to exclude jobs that reuse reqId) and reported the delta between the earliest and latest log message. This reports requests with > 10 minutes between start and end. Script: P69109 Results for Sep 1 - 11: P69110 It's not a crazy number of requests per day, on average < 10 with 20 on the worst day, but we have multiple requests per day that run for 2+ hours. The longest request runs for 173 minutes. I should note that this depends on the initial request logging something. If the initial request didn't log anything and the logs start at the timeout, they are likely not included here. Perhaps `EmergencyTimeoutException` could be adjusted to report the current request runtime in the error message to give more concrete information.
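The aggregation described above can be sketched as follows. This is not the script in P69109; it is a minimal illustration that assumes the log entries are already available as `(host, req_id, timestamp)` tuples, and the function name `long_running_requests` is hypothetical.

```python
from datetime import datetime, timedelta


def long_running_requests(log_entries, threshold=timedelta(minutes=10)):
    """Return {(host, req_id): duration} for requests exceeding the threshold.

    Entries are grouped by (host, req_id) so that jobs reusing a reqId on
    another host are not merged into the original request.
    """
    bounds = {}
    for host, req_id, timestamp in log_entries:
        key = (host, req_id)
        first, last = bounds.get(key, (timestamp, timestamp))
        bounds[key] = (min(first, timestamp), max(last, timestamp))
    return {
        key: last - first
        for key, (first, last) in bounds.items()
        if last - first > threshold
    }


# Made-up sample data, for illustration only.
entries = [
    ("mw1", "req-a", datetime(2024, 9, 1, 10, 0)),
    ("mw1", "req-a", datetime(2024, 9, 1, 12, 53)),
    ("mw2", "req-a", datetime(2024, 9, 1, 10, 1)),  # same reqId, other host
    ("mw1", "req-b", datetime(2024, 9, 1, 10, 0)),
    ("mw1", "req-b", datetime(2024, 9, 1, 10, 5)),
]
slow = long_running_requests(entries)
```

As the task notes, a request whose first log line is the timeout itself has a duration of zero under this approach and is not reported.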
    • Task
    **Steps to replicate the issue** (include links if applicable): * I'd like to add a parameter for [[ https://www.mediawiki.org/wiki/Help:CirrusSearch#filetype | filetype ]] to an input box template with which people can easily search a category using the deepcategory search operator. * When going to the "Images" tab in the MediaSearch it shows all images of all kinds – SVG, PNG, and so on **What happens?**: * When searching via search operator for filetype:image or filetype:bitmap SVG files are excluded * SVG files are also excluded when selecting filetype Images in SpecialSearch which excludes images like datagraphics by Our World in Data despite that these are also images **What should have happened instead?**: * Some search operator setting should show all images like it's the case with MediaSearch so this can be used e.g. in the template * The SpecialSearch is misleading and should be fixed by merging "Drawings" into "Images" since most of these aren't actually drawings or simply to use the new search operator variable of the prior point (e.g. "All images" which one would think is what "Images" is about) **Software version** (on `Special:Version` page; skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): When viewing the Images tab of MediaSearch `&type=image` is in the url.
    • Task
    **Steps to replicate the issue** (include links if applicable): * Search for a word which appears on a translatable page which has lots of translations **What happens?**: The search results are filled with translations of the same page, which makes it hard to find what you are looking for. Examples: https://www.wikidata.org/w/index.php?search=merge&ns12=1 has a link to Help:Merge and 19 translations of it. The second page of results has another 20 translations of it. The third page of results has another 18 translations and only two other pages. https://www.wikidata.org/w/index.php?search=redirect&ns12=1 has a link to Help:Redirects, 18 translations of it and only one other page. **What should have happened instead?**: Translatable pages should be collapsed into a single result (the link could use `Special:MyLanguage/`) and/or translations in languages which are not the interface language (or one of its fallback languages) should be ranked much lower. **Software version** (on `Special:Version` page; skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.):
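The collapsing behaviour requested above could look roughly like the sketch below. It relies on a simplifying assumption that is not in the task: a translation is recognised purely by a `/<language code>` title suffix, whereas a real implementation would more likely consult the Translate extension's own metadata. `collapse_translations` is a hypothetical name.

```python
def collapse_translations(titles, language_codes):
    """Collapse 'Base/xx' translation subpages into a single 'Base' result.

    The order of first appearance is preserved, so ranking is otherwise kept.
    """
    seen = set()
    collapsed = []
    for title in titles:
        base, separator, suffix = title.rpartition("/")
        is_translation = bool(separator) and suffix in language_codes
        key = base if is_translation else title
        if key not in seen:
            seen.add(key)
            collapsed.append(key)
    return collapsed


results = [
    "Help:Merge",
    "Help:Merge/de",
    "Help:Merge/fr",
    "Help:Redirects/ja",
    "Help:Sources",
]
print(collapse_translations(results, {"de", "fr", "ja"}))
# ['Help:Merge', 'Help:Redirects', 'Help:Sources']
```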
    • Task
    **Feature summary** --- Please implement a configuration variable to set the default method of how search results are sorted: "relevance" (current hardcoded default), "title_asc", "title_desc" **Use case(s)** --- Within the source, I found the following function: ``` /** * Set the type of sort to perform. Must be 'relevance', 'title_asc', 'title_desc'. * @param string $sort sort type */ public function setSort( $sort ) { $this->sort = $sort; } ``` And, I see the variable used within the source as well: ``` /** * @var string sort type */ private $sort = 'relevance'; ``` However, in looking through the extension's configuration variables [[ https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/docs/settings.txt | HERE ]], I do not see an existing configuration variable for changing the default behavior of the extension. **Benefits** --- I think the benefits are pretty obvious, and it seems a pretty straightforward feature to implement. The main benefit I see would be to save users from needing to manually edit the source files if they do not want the default to be "relevance". For what it's worth, my users are requesting that the sorting be "by title" because I believe that's what they were used to with the default MediaWiki search (before I added the CirrusSearch extension). In other words, I'm making this request because my users want it -- not just for the sake of asking for something new. :)
    • Task
    The article https://en.wikipedia.org/wiki/Tilde_Tilde_Tilde is not found by a search for "~~~" even though it clearly contains three tildes. The issue isn't just that symbol-only searches aren't supported, because the article https://en.wikipedia.org/wiki/Double_tilde oddly is
    • Task
    Hi. I tried to find the page "שמש:קיפודנחש" (a talk page of one of the users) in the hewiki search box, and started with "שמש:קיפודנ", waiting for autocomplete. I get a list of the talk page's subpages with archives (<name>/archive/1 and so on), but not the master page (<name>). I believe it should be prioritised over the subpages. It happened a couple of times last week, on different master pages, and looks like an exact continuation of T156840. * Enter "שמש:קיפודנ" in the hewiki wikisearch field. * Expected "שמש:קיפודנחש" in the autocomplete suggestions list. * Got: it wasn't. Thank you.
    • Task
    Enter phone in the "Search the archives" box at https://en.wikipedia.org/wiki/Wikipedia:Reference_desk It gives this search: https://en.wikipedia.org/wiki/Special:Search?fulltext=Search+the+archives&fulltext=Search&prefix=Wikipedia%3AReference+desk%2FArchives&search=phone&ns0=1 Two hits on the first page are currently to section headings ending in a space followed by a question mark " ?". Both hits display the section heading in red and make an apparent redlink to the full page instead of a blue link to the section on the page. For example, there is a hit saying: Wikipedia:Reference desk/Archives/Miscellaneous/2018 November 6 (section Where is this phone number ?) "Where is this phone number ?" links to: https://en.wikipedia.org/w/index.php?title=Wikipedia:Reference_desk/Archives/Miscellaneous/2018_November_6&action=edit&redlink=1 The page does exist so the link redirects to: https://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Miscellaneous/2018_November_6 The link in the search hit should have been blue and included the heading: https://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Miscellaneous/2018_November_6#Where_is_this_phone_number_%3F If a section heading ends in a question mark without a space before it then there is no problem, for example: Wikipedia:Reference desk/Archives/Computing/2016 January 18 (section Does having a phone service reduce available ADSL bandwidth?) "Does having a phone service reduce available ADSL bandwidth?" links correctly to: https://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Computing/2016_January_18#Does_having_a_phone_service_reduce_available_ADSL_bandwidth%3F
    • Task
    As of now, the SUP does not leverage fat update events. Hence, there is no need for a strict [[ https://schema.wikimedia.org/repositories/primary/jsonschema/development/cirrussearch/update_pipeline/update/latest.yaml | `fields` ]] property. AC: * [ ] replace fields property with a less explicit definition * [ ] make sure `org.wikimedia.discovery.cirrus.updater.common.model.UpdateEventTypeInfo` works with the generic schema * [ ] make sure `org.wikimedia.discovery.cirrus.updater.common.model.FetchFailureEncoder` works with the generic schema
    • Task
    To save us from typo-fused misconfigurations (wikis != wikiids), we should fail if there are any unknown properties left. Flink's configuration abstraction does not seem to support this out of the box, so we should discuss if this is worth the effort.
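The check proposed above can be illustrated independently of Flink. The sketch below is plain Python rather than Flink's configuration API, and every name in it is an assumption except the `wikiids` / `wikis` pair taken from the task text.

```python
import difflib

# Hypothetical set of supported option names; only "wikiids" comes from the task.
KNOWN_KEYS = {"wikiids", "hypothetical-option"}


def reject_unknown_keys(config, known_keys):
    """Raise ValueError if the configuration contains any unrecognised key."""
    unknown = sorted(set(config) - set(known_keys))
    if not unknown:
        return
    hints = []
    for key in unknown:
        # Offer the closest known key, if any, to make typos easy to spot.
        close = difflib.get_close_matches(key, sorted(known_keys), n=1)
        hints.append(f"{key!r} (did you mean {close[0]!r}?)" if close else repr(key))
    raise ValueError("Unknown configuration properties: " + ", ".join(hints))
```

Calling `reject_unknown_keys({"wikis": "enwiki"}, KNOWN_KEYS)` raises, while the correctly spelled `wikiids` passes silently. Whether failing hard at startup is worth the effort is the open question in the task.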
    • Task
    **Feature summary** (what you would like to be able to do and where): Commons picture search filters by dimensions: portrait, landscape, square **Use case(s)** (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): When searching for pictures on Commons, and for specific dimensions **Benefits** (why should this be implemented?): It helps to get pictures of the same dimensions for a better page style or gallery. {F46824866}
    • Task
    How to reproduce: Run a wiki search, so that some of the pages are less than 1 kio ([[ https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Recherche&limit=50&offset=0&ns2=1&search=intitle%3A%22vector.js%22&uselang=fr | example of search]]). Current: Size is displayed as « 200 octet ». Expected: Size should be displayed as « 200 octets ». As a reminder, in French, zero is singular. So the « 0 octet » texts are correct. The message is [[ https://translatewiki.net/wiki/MediaWiki:Search-result-size/fr | Search-result-size ]], and it should be accompanied with [[ https://translatewiki.net/wiki/MediaWiki:Size-bytes/fr | Size-bytes ]]. The messages seem to be correct… so maybe the error is somewhere in determining whether the value is plural. Also note the issue happens only in French. There is no such issue when displaying in English or Spanish, for instance. Code searches: * The "search-result-size" message is called in [[ https://gerrit.wikimedia.org/g/mediawiki/core/+/6cf7a4e4e220655493a8c96108adf6018786271c/includes/search/searchwidgets/FullSearchResultWidget.php#270 | FullSearchResultWidget.php#270 ]]. * [[ https://codesearch.wmcloud.org/search/?q=getByteSize&files=&excludeFiles=&repos=Extension%3ACirrusSearch%2CMediaWiki+core | Search for "getByteSize" ]]
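The expected behaviour can be stated as a small sketch. It uses a simplified version of the French rule (0 and 1 take the singular, everything else the plural) for whole byte counts only; it is not MediaWiki's plural implementation, and `french_byte_label` is a hypothetical name.

```python
def french_byte_label(count: int) -> str:
    """Format a whole byte count in French: 0 and 1 are singular."""
    unit = "octet" if count in (0, 1) else "octets"
    return f"{count} {unit}"


for size in (0, 1, 2, 200):
    print(french_byte_label(size))
# 0 octet
# 1 octet
# 2 octets
# 200 octets
```

If the value reaching the plural logic were a formatted string or a size already converted to another unit rather than the raw integer, the singular could be selected by mistake. That is one possible explanation, not a confirmed cause.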
    • Task
    **Feature summary** (what you would like to be able to do and where): Allow excluding bot edits in CirrusSearch's "last date modified" filter. **Use case(s)** (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): Bots such as MalnadachBot (https://en.wikipedia.org/wiki/User:MalnadachBot) often run maintenance tasks that edit old pages, such as archives. When searching, pages that were recently edited by the bot (e.g. to fix lint errors) are treated as new, but they have not received actual content edits. **Benefits** (why should this be implemented?): This restores the conventional purpose of the "last date modified" filter, which is to try to find newer/older pages. When bot edits are included, pages can be erroneously marked as newer. **Additional idea**: Perhaps there can be some settings to exclude certain bots (when inputted)? For example, MalnadachBot doesn't make substantive edits, but other bots might (e.g. at AIV), and this distinction might be significant for some contributors.
    • Task
    Using `getWithSetCallback` is generally the preferred approach to using WANObjectCache, but CirrusSearch uses separate get and set methods. Using separate get/set was probably done to help track custom hit/miss statistics, but we might ponder refactoring the code to benefit from the features provided by `getWithSetCallback()`: - Stats https://grafana-rw.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group - cache slam avoidance and key versioning features AC: - ponder if the refactor is worth the effort - refactor if yes
    • Task
    **Feature summary**: The disambiguation page for a search term should be at the top of the result list and the suggestion list **Use case(s)**: If I type a search term, the shortest string is usually the most general and will lead to a disambiguation page. E.g. if I look for something or someone called "Goethe", the algorithm suggests "Johann Wolfgang von Goethe" on top, which makes sense. But "Goethe (disambiguation)" doesn't even make the top ten, so it is never shown on the mobile page, where there is no result page. There might be the "Template:Redirect" as a workaround like in this case, but not in all cases. It's not reliable and not a good user experience. **Benefits**: Especially casual and mobile users will have a better search experience with more successful search results
    • Task
    **Steps to replicate the issue** (include links if applicable): On a site with simple CirrusSearch installation (without Wikimedia fancy ES plugins): * Create a page with ``` <noinclude>12🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶🎶</noinclude> ``` * Search for `insource:noinclude` **What happens?**: An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later. I dived into it and saw the ES backend gives the following: ``` "{"took":6,"responses":[{"took":6,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.47712126,"hits":[{"_index":"zhwiki_content_first","_type":"_doc","_id":"2","_score":0.47712126,"_source":{"namespace_text":"","wiki":"zhwiki","namespace":0,"text_bytes":317,"title":"Heart connect","timestamp":"2023-10-10T15:50:53Z"},"fields":{"text.word_count":[74]},"highlight":{"source_text.plain":["<noinclude>12\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C"],"text":["12\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uD
FB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6\uD83C\uDFB6"]}}]},"status":200}]}" ``` Note there is an unpaired UTF-16 surrogate `\uD83C` at the end of the `"source_text.plain"` field. So the `ruflin/elastica` library failed to decode and parse this piece of JSON and returned the whole response as a string in `$response->getResponse()->getData()['message']`. **What should have happened instead?**: Give the search result without error. **Software version** (skip for WMF-hosted wikis like Wikipedia): master branch **Other information** (browser name/version, screenshots, etc.): I can not reproduce it in wmf-hosted wikis ([[ https://zh.wikipedia.org/wiki/User:Func86/Sandbox|my sandbox ]] and [[ https://zh.wikipedia.org/w/index.php?search=insource%3Anoinclude+prefix%3AUser%3AFunc86%2F&title=Special:%E6%90%9C%E7%B4%A2&profile=advanced&fulltext=1&ns2=1 | search ]]), so I guess the issue was somehow bypassed with the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/search/highlighter | experimental highlighter ]] plugin for ES.
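The unpaired surrogate described above can be demonstrated with a short sketch. It is written in Python, whose `json` module tolerates a lone `\uD83C` escape and keeps it as a single surrogate code point, which makes it convenient for detecting and replacing the stray value. The helper names are hypothetical; this is not how `ruflin/elastica` or the highlighter plugin handles the response.

```python
import json


def is_surrogate(char: str) -> bool:
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(char) <= 0xDFFF


def has_lone_surrogate(text: str) -> bool:
    # Properly paired escapes are decoded by json.loads into one code point,
    # so any surrogate still present after decoding is unpaired.
    return any(is_surrogate(char) for char in text)


def replace_lone_surrogates(text: str) -> str:
    """Replace each unpaired surrogate with U+FFFD (replacement character)."""
    return "".join("\ufffd" if is_surrogate(char) else char for char in text)


# A shortened stand-in for the truncated highlight fragment in the report.
fragment = json.loads('"12\\ud83c\\udfb6\\ud83c"')
print(has_lone_surrogate(fragment))                            # True
print(has_lone_surrogate(replace_lone_surrogates(fragment)))   # False
```

The fragment ends in half of a surrogate pair because the highlighter apparently cut the snippet at a UTF-16 code unit boundary instead of a code point boundary.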
    • Task
    **Feature summary** (what you would like to be able to do and where): Expose [[ https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after | search_after ]] parameter in SearchEngine. **Use case(s)** (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): {T345713} identified the reach of the maximum value for the offset parameter in a GrowthExperiments maintenance script. The error is fair but it should be possible to request further results of a search call. ES supports it via [[ https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after | search_after ]] parameter. Per T345713#9175219 it seems the parameter is exposed in Cirrus raw queries but not for SearchEngine queries. **Benefits** (why should this be implemented?): Navigating through result sets of 10K+ records should be possible
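For context, `search_after` pagination works by repeating the same sorted query and passing the sort values of the last hit of the previous page. The sketch below only builds request bodies as plain dictionaries; the field names `timestamp` and `page_id` and the helper `next_page_body` are assumptions for illustration, and nothing here reflects how CirrusSearch or SearchEngine would expose the parameter.

```python
def next_page_body(previous_body: dict, last_hit: dict) -> dict:
    """Build the request body for the page following `last_hit`.

    Elasticsearch returns a "sort" array on each hit when the request
    specifies a sort; those values become the next "search_after".
    """
    body = dict(previous_body)
    body["search_after"] = last_hit["sort"]
    # search_after replaces offset-based paging, so drop any "from" value.
    body.pop("from", None)
    return body


first_page = {
    "size": 100,
    "query": {"match_all": {}},
    # A unique tiebreaker field keeps the ordering stable between pages.
    "sort": [{"timestamp": "asc"}, {"page_id": "asc"}],
}
last_hit_of_first_page = {"_id": "42", "sort": [1694000000000, 42]}
second_page = next_page_body(first_page, last_hit_of_first_page)
```

Because each page is addressed by sort values rather than by offset, the 10K limit on the offset parameter does not apply to this style of paging.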
    • Task
    Implement the Hook system added in MediaWiki 1.35 in the extensions, see [[ https://phabricator.wikimedia.org/source/mediawiki/browse/master/docs/Hooks.md | Hooks.md ]] for documentation. [x] Use hook handlers for core hooks [ ] Use of hook handlers for own hooks (not good practice, avoid this and call own code without hook handlers)
    • Task
    Implement the Hook system added in MediaWiki 1.35 in the extensions, see [[ https://phabricator.wikimedia.org/source/mediawiki/browse/master/docs/Hooks.md | Hooks.md ]] for documentation. [ ] Use hook handlers for core hooks [ ] Use hook handlers for CirrusSearch hooks [ ] Use hook handlers for Wikibase hooks (T391445)
    • Task
    **Steps to replicate the issue**: * go to https://yi.hamichlol.org.il/?safemode=1 * type `תפילת` into the search input box **What happens?**: No suggestion is turning up. **What should have happened instead?**: It should suggest the redirect page `תפילת רבי אלעזר בן ערך`, which [[ https://yi.hamichlol.org.il/w/index.php?title=תפילת_רבי_אלעזר_בן_ערך&redirect=no | does exist ]], and which comes up when you type in the whole name **Software version**: MediaWiki 1.39.4 Elasticsearch 7.10.2 CirrusSearch 6.5.4 **Other information** (browser name/version, screenshots, etc.): What is happening is that our wiki has set `$wgCirrusSearchPrefixSearchStartsWithAnyWord` to true, which has directly caused redirects to stop being auto-completed. As can be seen [[ https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/207d36311b0b8a7291a63413c4ba2d59daaae29e/includes/Query/PrefixSearchQueryBuilder.php#L30 | in the code ]] that runs prefix searches, there are two main code paths, one when `$wgCirrusSearchPrefixSearchStartsWithAnyWord` is true, and the other when it's false. The false condition queries the redirects, but it looks like the true condition does not. Any reason why this should be? or can it be fixed? Thanks!
    • Task
    **Issue**: a single missing word can make the search fail. **Use cases**: typo in a word, use of a synonym, unsupported derived form (stemming not powerful enough)… **Requested solution**: relax the 'AND' operator and use 'OR' between query terms by default. This way, even if one wrongly spelled word is present, the other query words may help to find better results. Examples: * [[https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Recherche?search=Je+suis+venir+te+dire+que+je+m%27en+vais | Je suis *venir* te dire que je m’en vais]] vs [[https://fr.wikipedia.org/wiki/Spécial:Recherche?search=Je+OR+suis+OR+venir+OR+te+OR+dire+OR+que+OR+je+OR+m’en+OR+vais | Je OR suis OR venir OR te OR dire OR que OR je OR m’en OR vais]] (T241265) * [[https://en.wikipedia.org/w/index.php?search=John+Fitgerald+Kennedy&title=Special:Search&profile=advanced&fulltext=1&ns0=1 | John *Fitgerald* Kennedy]] vs [[https://en.wikipedia.org/w/index.php?search=John+OR+Fitgerald+OR+Kennedy | John OR Fitgerald OR Kennedy]] Anyway, according to the [[https://en.wikipedia.org/wiki/MediaWiki:Advancedsearch-help-plain | inline doc]] (“There may be results that do not contain one or more of your search terms.”), relaxing 'AND' is expected. Follow-up to {T112178} (T112583 analysis may help)
    • Task
    The query `intitle:/.*lav/` https://cs.wikipedia.org/w/index.php?title=Speci%C3%A1ln%C3%AD:Hled%C3%A1n%C3%AD&limit=500&offset=0&ns0=1&ns100=1&ns102=1&search=intitle%3A%2F.%2Alav%2F seems to fail, producing `Search backend error during regex search for 'intitle:/.*lav/' after 2525: runtime_exception: runtime_exception: Unreachable` in the logs. A quick look suggests that it might be caused by https://gerrit.wikimedia.org/r/plugins/gitiles/search/highlighter/+/refs/heads/master/experimental-highlighter-lucene/src/main/java/org/wikimedia/highlighter/experimental/lucene/hit/AutomatonHitEnum.java#189 It fails only when setting `limit=500` in the example query above, which might indicate that the bug depends on the characteristics of the content being highlighted.
    • Task
    compare * https://commons.wikimedia.org/w/index.php?search=%22%E6%A2%81%E5%A4%A9%E7%90%A6%22 ("梁天琦") * https://commons.wikimedia.org/w/index.php?search=%E6%A2%81%E5%A4%A9%E7%90%A6 (梁天琦) **What happens?**: out of 19,193 results, only 24 (maybe a few more with only english aliases are missing but definitely <100 overall) match the intended keywords 梁天琦 (a politician's full name). and pay close attention to where these 24 appear in the 19,193 results. only 10 appear in top 30. the other 20 in top 30 are results that only match the chinese characters separately (which means, they often are mismatches. it's like if you search "pineapple" but find "an apple was found next to a bag of pine nuts".) i only tested chinese chars. i'm not sure about japanese kanas and korean, but i guess they probably have the same problem. **What should have happened instead?**: when searching a string of uninterrupted CJK chars: - results that match the entire string, should appear top. they are most likely the intended results, especially if the string consists of 4 chars or more. it's quite uncommon to have proper names that have the same 4 chars in the same order. - results that match a longer substring (more consecutive chars) should appear closer to the top. suppose a 4 char word ABCD is searched, results matching ABC or BCD are more likely matches than AB BC CD. **Software version** commons. probably other wikis too. (technically this is not a bug but bad search algorithm.)
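The ordering proposed above (full-string matches first, then longer consecutive substrings) can be sketched as a re-ranking step. This is only an illustration of the idea using the task's own ABCD example; it is unrelated to how CirrusSearch actually scores CJK text, and `longest_run` and `rank_by_longest_run` are hypothetical names.

```python
def longest_run(query: str, text: str) -> int:
    """Length of the longest consecutive substring of `query` found in `text`."""
    for length in range(len(query), 0, -1):
        for start in range(len(query) - length + 1):
            if query[start:start + length] in text:
                return length
    return 0


def rank_by_longest_run(query: str, documents: list) -> list:
    # sorted() is stable, so documents with equal runs keep their prior order.
    return sorted(documents, key=lambda doc: longest_run(query, doc), reverse=True)


documents = ["AB x CD", "xABCDx", "ABC y", "A B C D"]
print(rank_by_longest_run("ABCD", documents))
# ['xABCDx', 'ABC y', 'AB x CD', 'A B C D']
```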
    • Task
    I recently opened up Special:Search on enwiki, set some things, then clicked search. This was the resulting URL: https://en.wikipedia.org/w/index.php?search=bundle+-intitle%3Abundle+subpageof%3A%22Articles+for+deletion%22&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%22fields%22%3A%7B%22subpageof%22%3A%22Articles+for+deletion%22%7D%7D&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns118=1&ns119=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1 `&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns118=1&ns119=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1` sticks out to my eye as too verbose. There's opportunity for improvement here. I propose the following improvements: * add URL support for... * [ ] ?ns1=1&ns2=1 should be consolidated into ?ns=1,2 * [ ] There should be a ?ns=all shortcut added for searches that should check every namespace * change the default search URLs emitted from Special:Search to use the new format * [ ] ?ns1=1&ns2=1 should be consolidated into ?ns=1,2 * [ ] Detect when the user has clicked the "All" checkbox or has selected every namespace, and emit ?ns=all instead of ?ns=1,2,3,4,5 etc. * [ ] keep the old URL format for backwards compatibility I guess
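A sketch of the proposed URL handling follows, assuming the new parameter is called `ns` as suggested in the task. It accepts both the consolidated form and the existing `nsN=1` form for backwards compatibility; the function names and the sample namespace list are hypothetical.

```python
from urllib.parse import parse_qs


def parse_namespaces(query_string: str, all_namespaces: list) -> list:
    """Read namespaces from ?ns=1,2 / ?ns=all, falling back to ?ns1=1&ns2=1."""
    params = parse_qs(query_string)
    if "ns" in params:
        value = params["ns"][0]
        if value == "all":
            return sorted(all_namespaces)
        return sorted(int(part) for part in value.split(",") if part)
    # Legacy format: one nsN=1 parameter per selected namespace.
    return sorted(
        int(name[2:])
        for name, values in params.items()
        if name.startswith("ns") and name[2:].isdigit() and values[0] == "1"
    )


def emit_namespaces(selected: list, all_namespaces: list) -> str:
    """Emit the compact parameter, using ns=all when everything is selected."""
    if set(selected) == set(all_namespaces):
        return "ns=all"
    return "ns=" + ",".join(str(ns) for ns in sorted(selected))


ALL = [0, 1, 2, 3]  # made-up namespace list for the example
print(parse_namespaces("search=bundle&ns1=1&ns2=1", ALL))  # [1, 2]
print(parse_namespaces("search=bundle&ns=1,2", ALL))       # [1, 2]
print(emit_namespaces([0, 1, 2, 3], ALL))                  # ns=all
```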
    • Task
    Incident followup: https://wikitech.wikimedia.org/w/index.php?title=Incidents/2023-06-18_search_broken_on_wikidata_and_commons AC: - The system should alert us when almost all searches are failing on a wiki
    • Task
    ==== Error ==== * mwversion: 1.41.0-wmf.13 * reqId: a78484f3-32b5-4b95-9f9b-99abd5a87eba * [[ https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=(time:(from:'2023-06-14T20:00:27.627Z',to:'2023-06-15T20:07:38.574Z'))&_a=(query:(query_string:(query:'reqId:%22a78484f3-32b5-4b95-9f9b-99abd5a87eba%22'))) | Find reqId in Logstash ]] ```name=normalized_message,lines=10 [{reqId}] {exception_url} Wikimedia\Assert\PostconditionException: Postcondition failed: Regex failed: 4 ``` ```name=exception.trace,lines=10 from /srv/mediawiki/php-1.41.0-wmf.13/vendor/wikimedia/assert/src/Assert.php(203) #0 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Parser/QueryStringRegex/NonPhraseParser.php(99): Wikimedia\Assert\Assert::postcondition(boolean, string) #1 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Parser/QueryStringRegex/QueryStringRegexParser.php(693): CirrusSearch\Parser\QueryStringRegex\NonPhraseParser->parse(string, integer, integer) #2 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Parser/QueryStringRegex/QueryStringRegexParser.php(629): CirrusSearch\Parser\QueryStringRegex\QueryStringRegexParser->consumeWord(integer) #3 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Parser/QueryStringRegex/QueryStringRegexParser.php(357): CirrusSearch\Parser\QueryStringRegex\QueryStringRegexParser->nextToken() #4 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Parser/QueryStringRegex/QueryStringRegexParser.php(317): CirrusSearch\Parser\QueryStringRegex\QueryStringRegexParser->expression() #5 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/Search/SearchQueryBuilder.php(141): CirrusSearch\Parser\QueryStringRegex\QueryStringRegexParser->parse(string) #6 /srv/mediawiki/php-1.41.0-wmf.13/extensions/CirrusSearch/includes/CirrusSearch.php(245): CirrusSearch\Search\SearchQueryBuilder::newFTSearchQueryBuilder(CirrusSearch\SearchConfig, string, class@anonymous 
``` ==== Notes ==== * Superficially similar to T334681, but the stack trace is slightly different * Reproducible via GET request * Volume of errors is much higher than for T334681, [[ https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#Logspam | obscuring legitimate errors ]] * If the error is unimportant, could we remove it and return a message to users instead?
    • Task
    Building the CirrusSearch document seems to trigger an exception: `Caught exception of type Wikibase\\DataModel\\Services\\Lookup\\EntityLookupException`. https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusbuilddoc&titles=Lexeme%3AL1065304&formatversion=2
    • Task
    When searching Wikidata, the search results when the interface language is set to English versus when it is set to British English differ far more than expected. Example: [[https://www.wikidata.org/w/index.php?limit=500&ns0=1&search=wh&uselang=en|"wh" using English]] vs [[https://www.wikidata.org/w/index.php?limit=500&ns0=1&search=wh&uselang=en-gb|"wh" using British English]]. Top ten results on Special:Search, for `en` and `en-gb`: {F36948062, layout=inline} {F36948063, layout=inline} There are no results in common. The top result for `en` is the 28th result for `en-gb`. The second result for `en` is the 128th result for `en-gb`, after many other results which have no obvious connection to the search term. Top ten results when doing an entity search, for `en` and `en-gb`: {F36948071, layout=inline} {F36948070, layout=inline} There is only one result in common. 5 of the 7 results for `en` show a "wh" label or alias. Only 1 of the 7 results for `en-gb` shows a "wh" label or alias. [[https://www.wikidata.org/wiki/Q12874593|Q12874593 (watt hour)]] as displayed on Special:Search, in `en` and `en-gb`: {F36948067, layout=inline} {F36948066, layout=inline} `en` shows the label with the matched alias after the link. `en-gb` replaces the label with the matched alias.
    • Task
    **Feature summary** Try https://commons.wikimedia.org/wiki/Special:Search or https://commons.wikimedia.org/wiki/Special:MediaSearch ; both have the same problem. Try searching for " typewriter " or " nurse ", etc. For Special:Search , untick the file namespace and/or gallery. For MediaSearch , click "Categories and Pages". You will have a hard time finding Category:Typewriters or Category:Nurses ; many other categories appear before them. **Solution** Whatever the keyword is, " Category:<plural form> " is usually where files about that keyword are categorised. The plural form is often the -s form. I guess "search results sorted by relevance" are sorted by some kind of weight? Increase the weight of " Category:<search keyword>+s " if it exists, so that it appears as close to the top as possible. **Benefits** Users want to go to the category. Give them that. Don't waste users' time.
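The proposed weighting could be sketched roughly as follows. This is a minimal illustration in Python, not how CirrusSearch actually scores results: the `Hit` type, the `boost` factor, the naive "+s" plural rule, and the sample titles and scores are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    score: float


def boost_plural_category(hits, keyword, boost=10.0):
    """Re-rank hits so that "Category:<keyword>s" floats toward the top.

    Assumptions (illustrative only): the plural is formed by appending "s",
    and the boost is a simple multiplication of the relevance score.
    """
    target = f"Category:{keyword.strip()}s".casefold()
    rescored = [
        Hit(h.title, h.score * boost if h.title.casefold() == target else h.score)
        for h in hits
    ]
    # Highest score first, as in a relevance-sorted result list.
    return sorted(rescored, key=lambda h: h.score, reverse=True)


# Hypothetical titles and scores, only to exercise the function.
hits = [
    Hit("Category:Typewriter museums", 4.2),
    Hit("Category:Typewriters", 1.5),
    Hit("Category:Typewriter keys", 3.9),
]
ranked = boost_plural_category(hits, " typewriter ")
print([h.title for h in ranked])
```

A real implementation would also need proper pluralisation (for example "-es" or irregular plurals) and a check that the category exists.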
    • Task
    **Steps to replicate the issue** (include links if applicable): * Search for "Kementrian Perhubungan" (Department of Transportation) in Indonesian Wikipedia: https://id.wikipedia.org/w/index.php?title=Istimewa:Pencarian&search=Kementrian+Perhubungan . Note that `Kementrian` is misspelled and should be `Kementerian` * The top 500 results don't contain what I'm looking for, which is https://id.wikipedia.org/wiki/Kementerian_Perhubungan_Republik_Indonesia (the full name) or the disambiguation page at https://id.wikipedia.org/wiki/Kementerian_Perhubungan . Notice the slight variation between `Kementrian` and `Kementerian` {F36921344} * I could understand if it weren't the first result, but it should at least be in the top 3 or 5. Not even in the top 500? What's going on here? * Meanwhile, the autocomplete in the search box itself returned at least five articles with the prefix "Kementerian Perhubungan" {F36921347} * I've tested it in logged-out mode, with the same result. **What happens?**: When searching for the string "kementrian perhubungan" without quotes or "intitle", neither https://id.wikipedia.org/wiki/Kementerian_Perhubungan_Republik_Indonesia nor https://id.wikipedia.org/wiki/Kementerian_Perhubungan is in the first 500 results, although the search dropdown easily finds the article I'm looking for. **What should have happened instead?**: The search should be clever enough to see that the article I want differs by only 1 letter, as the autocomplete (Img 2.) can automatically guess. **Software version** (skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): I tried other words starting with "Kementrian" and got the same weird result. I suppose the autocomplete employs Levenshtein distance, while the search does not.
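To make the "1 letter difference" concrete, here is a minimal Python sketch of the classic Levenshtein (edit) distance applied to the two spellings. It only illustrates the metric the report guesses at. Whether the autocomplete really uses it is the reporter's supposition, and the function below is a textbook implementation, not code from CirrusSearch.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b (row-by-row dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


distance = levenshtein("kementrian", "kementerian")
print(distance)
```

A fuzzy full-text query that tolerates an edit distance of 1 per term would therefore match the correctly spelled title.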
    • Task
    **Steps to replicate the issue** (include links if applicable): * https://commons.wikimedia.org/w/index.php?search=category%3Awet&title=Special:MediaSearch&type=page **What happens?**: I see these repeated patterns. **What should have happened instead?**: I should see no patterns. **Software version** (skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): When I click the items to edit them, the patterns aren't present in the wikitext. So they must not exist. Or, OK, if they do exist, they look like a bug. {F36895758}
    • Task
    This is the result of a search for "finger" on Commons, requesting file type GIF; the user was apparently hoping to find an image of someone "giving the finger": https://commons.wikimedia.org/w/index.php?search=finger&title=Special:MediaSearch&go=Go&type=image&filemime=gif (WARNING: THE ABOVE LINK GIVES A HIGHLY NSFW RESULT) I don't know what is going on behind the scenes here, but an innocuous search term should not result in displaying a set of sexually explicit videos. It seems to me that this is an inappropriate enough result that it should be treated on the same priority as a bug.
    • Task
    **Steps to replicate the issue** (include links if applicable): * Go to Special:Search * Search for an article that has very little text (less than the 4 lines of text shown in the snippet) **What happens?**: The snippet shows ... at the end of the snippet **What should have happened instead?**: The ellipsis should not be shown, as there is nothing more to show **Other information** (browser name/version, screenshots, etc.): {F36889700} NOTE: From @SimoneThisDot: This is in PHP MediaWiki, not in SearchVue. I am not sure what the best way is to find out the actual article size without having to fetch it. The effort required on the backend to find out whether the snippet is small may be too big for this edge case.
    • Task
    Hi! I think wikiprojects, maintenance pages, admin boards, backlogs and all sorts of other background pages and communities would benefit a lot from a way to embed search results. With all the new filters and features such as "linksto", "subpageof", "incategory", "insource", etc, search could become a VERY powerful way to create dynamic reports. Crucially, there should be some control over the format of the output, so the end result can be further customized via Lua modules or templates. Some days ago I created [[ https://www.mediawiki.org/wiki/Extension:SearchParserFunction | Extension:SearchParserFunction ]] to do just that, and it's already revolutionizing the wiki where I work (appropedia.org). I figured you may want to consider something similar for Wikimedia sites. It would also be a great way to multiply the impact of all the work you've been doing with search. BTW, thanks for that!
    • Task
    **Steps to replicate the issue** (include links if applicable): * Search for book title "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀ꦭꦩꦶ" in Commons works (the full title), found [[ https://commons.wikimedia.org/wiki/File:%EA%A6%A5%EA%A6%BC%EA%A6%A5%EA%A6%BC%EA%A6%9B%EA%A6%B6%EA%A6%8F%EA%A7%80%EA%A6%8F%EA%A6%A4%EA%A7%80%EA%A6%B1%EA%A6%8F%EA%A6%B6%EA%A6%81%EA%A6%A5%EA%A6%BF%EA%A6%97%EA%A6%9A%EA%A7%80%EA%A6%97%EA%A6%B6%EA%A6%AA%EA%A6%A4%EA%A7%80%EA%A6%AD%EA%A6%A9%EA%A6%B6.pdf | the PDF ]] * Search partial "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀" or "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ" or "ꦥꦼꦥꦼꦛꦶꦏ꧀" or any parts of the full title won't work **What happens?**: * First identified 10 years ago, T46350, marked wont fix, since last time was during migration from Lucene to Cirrus Search. * I identified there was a problem with scriptio continua nature of Javanese script (no word marker) * T58505 Cirrus Search ticket was closed as solved **What should have happened instead?**: The Commons search et. al should be able to find parts of the full title. **Software version** (skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): A bit background on the way the script is written: * Scriptio continua is a hassle to display in web, because there's no obvious line break. Therefore, in projects such as Wikisource, if not handled properly, would break the page view of the transcribed documents. (become too wide) * Using certain keyboards, including the way we handle the problem is, to automatically insert ZWS (zero-width space) after certain character (e.g. comma, period, etc. ) I can gave you the full list. Therefore, the line break would still works, except in very rare cases where there's no occurrences of those characters (thus the ZWS not auto-inserted). [the zws doesn't always equal to the Latin space] * AFAIK ZWS is not supported in page titles, (e.g. 
when I upload books with Javanese script titles that contain ZWS), so none of the titles in jv wiki projects contain ZWS, and thus CirrusSearch has no way to know where the word delimiters are.
    • Task
    Reproduce: Copy some text that begins with no-break spaces and paste it into the main search box or value selector, such as "<nbsp>Wikipedia" Expected: no-break spaces (and other [[ https://en.wikipedia.org/wiki/Whitespace_character | whitespace characters]]) before and after search terms are stripped, similar to the behavior of Wikipedia opensearch (which uses the opensearch endpoint) Actual: You do not get the item for Wikipedia: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%C2%A0Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item This is inconvenient if you are copying search terms from other websites that contain such whitespace characters, as users may be unaware of the existence of such characters. Note: Space characters that are stripped: [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%20Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | space ]], [[https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%09Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item|tab]], [[https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%0aWikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item|linebreak]] Space characters that are not stripped (but should be stripped): [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%E2%80%82Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | en-space ]], [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%E3%80%80Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | ideographic space ]]
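The expected behavior can be illustrated with a small Python sketch. It is not the Wikibase implementation (which is PHP), and the helper name `normalize_search_term` is made up for this example. It relies on the fact that Python's `str.strip()` with no arguments removes all leading and trailing characters that Unicode classifies as whitespace, including the no-break space, en space, and ideographic space listed above.

```python
def normalize_search_term(term: str) -> str:
    """Strip leading/trailing Unicode whitespace from a search term.

    Hypothetical helper for illustration: str.strip() without arguments
    removes every character for which str.isspace() is true, not just
    ASCII space, tab, and newline.
    """
    return term.strip()


samples = {
    "space": " Wikipedia",
    "tab": "\tWikipedia",
    "linebreak": "\nWikipedia",
    "no-break space": "\u00a0Wikipedia",
    "en space": "\u2002Wikipedia",
    "ideographic space": "\u3000Wikipedia",
}

for name, raw in samples.items():
    print(f"{name}: {normalize_search_term(raw)!r}")
```

Note that the zero-width space (U+200B) is a format character rather than whitespace in Unicode, so this approach would leave it in place.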
    • Task
    Reproduce: 1. https://www.wikidata.org/w/index.php?search=Lexeme%3Aweek Expected: The search result will also display lexemes with "week" in glosses of some senses, such as https://www.wikidata.org/wiki/Lexeme:L778137
    • Task
    We're starting to see this problem on the search airflow instance (an-airflow1001). A simple sensor task might fail with the following context (mail date: `Wed, 14 Dec 2022 17:00:19 +0000`): ``` Try 0 out of 5 Exception: Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally? Log: Link Host: an-airflow1001.eqiad.wmnet Log file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00.log Mark success: Link ``` When looking at the actual state for this task it has succeeded: ``` *** Reading local file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00/1.log [2022-12-14 16:28:13,261] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:28:13,309] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:28:13,310] {taskinstance.py:841} INFO - -------------------------------------------------------------------------------- [2022-12-14 16:28:13,310] {taskinstance.py:842} INFO - Starting attempt 1 of 5 [2022-12-14 16:28:13,310] {taskinstance.py:843} INFO - -------------------------------------------------------------------------------- [2022-12-14 16:28:13,361] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 16:28:13,365] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310416', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpzo8pxyn_'] [2022-12-14 16:28:14,038] 
{base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,038] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=32615 [2022-12-14 16:28:14,790] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,788] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 16:28:14,791] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,790] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 16:28:14,925] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,923] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 16:28:15,203] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,202] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:28:15,205] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,205] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:28:15,209] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 16:28:15,262] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 16:28:15,387] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE [2022-12-14 16:28:18,218] {logging_mixin.py:112} INFO - [2022-12-14 16:28:18,217] {local_task_job.py:124} WARNING - Time since last heartbeat(0.03 s) < heartrate(5.0 s), sleeping for 4.97169 s [2022-12-14 16:28:23,199] {logging_mixin.py:112} INFO - 
[2022-12-14 16:28:23,196] {local_task_job.py:103} INFO - Task exited with return code 0 [2022-12-14 16:31:27,704] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:31:27,765] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:31:27,765] {taskinstance.py:841} INFO - [...] -------------------------------------------------------------------------------- [2022-12-14 16:56:59,970] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 16:56:59,970] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310665', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmp03vdcbvm'] [2022-12-14 16:57:00,662] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:00,661] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. 
pool_size=5, max_overflow=10, pool_recycle=1800, pid=25095 [2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,595] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,596] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 16:57:01,696] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,695] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 16:57:01,987] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,986] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:57:01,989] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,988] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:57:01,993] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 16:57:02,064] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 16:57:02,219] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE [2022-12-14 16:57:04,388] {logging_mixin.py:112} INFO - [2022-12-14 16:57:04,387] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960811 s [2022-12-14 16:57:09,356] {logging_mixin.py:112} INFO - [2022-12-14 16:57:09,354] {local_task_job.py:103} INFO - Task exited with return code 0 [suspicious hole here, time when the mail was sent] [2022-12-14 17:15:26,675] 
{taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 17:15:26,724] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 17:15:26,724] {taskinstance.py:841} INFO - -------------------------------------------------------------------------------- [...] -------------------------------------------------------------------------------- [2022-12-14 18:25:45,431] {taskinstance.py:842} INFO - Starting attempt 1 of 5 [2022-12-14 18:25:45,431] {taskinstance.py:843} INFO - -------------------------------------------------------------------------------- [2022-12-14 18:25:45,479] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 18:25:45,479] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2311270', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpymzjynuf'] [2022-12-14 18:25:46,170] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:46,169] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. 
pool_size=5, max_overflow=10, pool_recycle=1800, pid=32566 [2022-12-14 18:25:47,164] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,163] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 18:25:47,165] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,164] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 18:25:47,282] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,281] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 18:25:47,558] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,557] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 18:25:47,559] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,559] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 18:25:47,563] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 18:25:47,641] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 18:25:47,712] {base_sensor_operator.py:123} INFO - Success criteria met. Exiting. 
[2022-12-14 18:25:50,354] {logging_mixin.py:112} INFO - [2022-12-14 18:25:50,353] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.955724 s [2022-12-14 18:25:55,320] {logging_mixin.py:112} INFO - [2022-12-14 18:25:55,316] {local_task_job.py:103} INFO - Task exited with return code 0 ``` Note the hole between `2022-12-14 16:57:09` and `2022-12-14 17:15:26`, which covers the time the email was sent. With a poke interval of 3 minutes, the sensor should have been queued at `2022-12-14 17:00:09`, so something prevented it from running at the expected time. Full logs: P42712 Possible cause: https://github.com/apache/airflow/issues/10790 Possible workarounds: - increase poke_interval to 5 minutes or more - reduce the load on the machine (provision a bigger instance?)
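The timing argument can be checked with a few lines of Python. This is only arithmetic on the timestamps quoted from the log and the 3-minute poke interval stated in the note; the variable names are made up for this sketch.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps taken from the log excerpt above.
last_exit = datetime.strptime("2022-12-14 16:57:09", FMT)   # last poke exited
next_start = datetime.strptime("2022-12-14 17:15:26", FMT)  # next attempt seen
poke_interval = timedelta(minutes=3)                        # from the note

expected_start = last_exit + poke_interval  # when the sensor should be re-queued
gap = next_start - last_exit                # total hole in the log
delay = next_start - expected_start         # how late the reschedule was

print("expected:", expected_start.strftime(FMT))
print("gap:", gap)
print("delay:", delay)
```

Under these assumptions the reschedule happened a little over 15 minutes later than the poke interval alone would predict.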