    • Task
    When the elasticsearch servers run out of heap memory they start intermittently triggering our current latency and old gc/hr limits, but those alerts don't end up being particularly actionable because they can trigger for lots of other reasons. On review of our [[ https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&from=now-90d&to=now&var-datasource=codfw%20prometheus%2Fops&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2053&var-top5_fielddata=All&var-top10_terms=All&var-top5_completion=All | Elasticsearch Memory ]] dashboards for instances that have been having trouble recently, there are some metrics that might more clearly distinguish an instance that needs to be rebooted, and possibly temp-banned from the cluster to rebalance shards.

    Current alerts:
    * p95 latency for all requests made between cirrus and elastic. When this alerts it says something might be wrong, but nothing about what it might be.
    * > 100 old gc/hour. Current problem servers hold a steady state at around 20-25 and don't trigger the alert. Should it be lower?

    Some metrics we could think about using instead (see the sketch below):
    * `JVM Heap - survivor pool` goes from varying up to a couple hundred MB to holding a solid value of 0. A possible alert could be if the survivor pool has held close to 0 for the last N hours.
    * `JVM Heap - old pool` goes from a 1+ GB sawtooth to a ~10 MB sawtooth, almost flatlining against the max value. A possible alert could be if max-min over the last N hours is less than X MB.
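    A minimal sketch of what the proposed "old pool flatline" check could look like, querying the Prometheus HTTP API directly. The Prometheus URL and the metric/label names (`elasticsearch_jvm_memory_pool_used_bytes`, `pool="old"`) are assumptions for illustration and would need to match whatever the elasticsearch exporter actually publishes.

```lang=python
import requests

# Assumptions: Prometheus endpoint, instance label, and metric/label names are illustrative only.
PROMETHEUS = "http://prometheus.example.org/api/v1/query"
INSTANCE = "elastic2053:9108"
WINDOW = "6h"                         # the "last N hours" from the task description
THRESHOLD_BYTES = 50 * 1024 * 1024    # the "X MB" threshold

# max - min of the old gen pool over the window; a tiny range means the
# sawtooth has flatlined against the max heap value.
query = (
    f'max_over_time(elasticsearch_jvm_memory_pool_used_bytes'
    f'{{pool="old",instance="{INSTANCE}"}}[{WINDOW}])'
    f' - '
    f'min_over_time(elasticsearch_jvm_memory_pool_used_bytes'
    f'{{pool="old",instance="{INSTANCE}"}}[{WINDOW}])'
)

result = requests.get(PROMETHEUS, params={"query": query}).json()
for sample in result["data"]["result"]:
    value = float(sample["value"][1])
    if value < THRESHOLD_BYTES:
        print(f"{sample['metric'].get('instance')}: old pool range {value / 1e6:.1f} MB "
              f"over {WINDOW} -- likely needs a restart")
```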
    • Task
    For example, searching Commons for "wikimedia.org -commons.wikimedia.org" in the [[https://commons.wikimedia.org/w/index.php?search=wikimedia.org+-commons.wikimedia.org&ns0=1&ns6=1|main and file namespaces]] finds 168,000 results. Searching [[https://commons.wikimedia.org/w/index.php?search=wikimedia.org+-commons.wikimedia.org&ns6=1|only the file namespace]] finds 37.5 million results. Searching for [[https://commons.wikimedia.org/w/index.php?search=wikimedia.org&ns6=1|"wikimedia.org" in only the file namespace]] also finds 37.5 million results, so it appears to be ignoring the "-commons.wikimedia.org" part of the search when only the file namespace is selected, counterintuitively resulting in more results when fewer namespaces are selected.
    • Task
    (Lydia asked that I write this up, just in case)

    I thought that the "," comma was already added to the Elasticsearch standard tokenizer and would be excluded from simple search? But it seems that there is some overriding decision to have the default config this way on Wikidata? Perhaps the word_delimiter filter is being used, and incorrectly?

    > Avoid using the word_delimiter filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.

    As seen in my screenshot below, I was looking for entities that contained all 3 words, but it seemed that if I did NOT include the comma, then the entity was not found. The only way it was displayed was if I did include the comma. {F34615713}

    I noticed that the string "foot locker inc" will not show the entity in the dropdown, but only "foot locker, inc.", which includes the comma. Exact matching should only happen by default if a user wraps the query in double quotes, such as ``` "Foot Locker, Inc." ```, whereas in my example screenshot I have to include the comma to find the entity. My expectation was that any U+002C comma in the search string would not be included in the search query. (On that entity, I have since added the full legal name to the alias field to help improve searchability, but I would still like to know why it was decided that the U+002C comma is not excluded.)

    Why was it decided to include the U+002C comma in simple search? Must users use the Advanced Search on Wikidata or the API if they want to do simple searches that are not exact match phrases? Doing something advanced in order to do something simple seems counter-intuitive and the reverse of most users' expectations.
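    To illustrate why the comma was expected to be irrelevant, a quick check against Elasticsearch's `_analyze` API (a sketch; the host and tokenizer choice here are assumptions, not Wikidata's actual analysis chain) shows that the standard tokenizer on its own drops the comma entirely:

```lang=python
import json
import requests

# Assumption: a local Elasticsearch instance; Wikidata's real analysis chain differs.
ES = "http://localhost:9200"

resp = requests.post(
    f"{ES}/_analyze",
    json={"tokenizer": "standard", "text": "Foot Locker, Inc."},
)
print(json.dumps([t["token"] for t in resp.json()["tokens"]]))
# Expected output: ["Foot", "Locker", "Inc"] -- no comma survives tokenization,
# so an exact-match requirement on "," points at a different part of the config.
```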
    • Task
    # Status
    - The current documentation does not have sufficient information for locally setting up CirrusSearch.
    - Debugging is challenging without having the repo locally.
    - Patch: [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/708516 | 708516 ]]
    • Task
    [2021-08-12T18:00:46,822][WARN ][o.e.d.i.q.BoolQueryBuilder] [gXPW_Qb] Should clauses in the filter context will no longer automatically set the minimum should match to 1 in the next major version. You should group them in a [filter] clause or explicitly set [minimum_should_match] to 1 to restore this behavior in the next major version.
    • Task
    [2021-08-12T18:00:46,783][WARN ][o.e.d.r.a.a.i.RestGetIndicesAction] [gXPW_Qb] [types removal] The parameter include_type_name should be explicitly specified in get indices requests to prepare for 7.0. In 7.0 include_type_name will default to 'false', which means responses will omit the type name in mapping definitions.
    • Task
    **List of steps to reproduce**
    * Open Turkish Wikipedia (or any other Turkish wiki).
    * Search for "special:import" in the search bar.
    * It sends you to the "Special:İmport" page instead of "Special:Import".

    **What should have happened instead?**: For some reason it changes the letter "i" to "İ" instead of "I" (see the illustration below). {F34587372} {F34587732}
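    For context, this is the classic Turkish dotted/dotless-I casing rule: under a Turkish locale, uppercasing "i" yields "İ" rather than "I". A small illustration using PyICU, purely for demonstration (an assumption; MediaWiki's own title normalization code path is what actually matters here):

```lang=python
from icu import Locale, UnicodeString

text = "special:import"

# Locale-agnostic uppercasing: "i" -> "I"
print(str(UnicodeString(text).toUpper(Locale("en"))))   # SPECIAL:IMPORT

# Turkish locale: dotted "i" uppercases to "İ" (U+0130), which breaks
# matching against the canonical English namespace name "Special:Import".
print(str(UnicodeString(text).toUpper(Locale("tr"))))   # SPECİAL:İMPORT
```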
    • Task
    The CirrusSearch README recommends setting `$wgJobQueueAggregator` in the Job Queue section. This should probably be removed since that variable was completely removed in MediaWiki 1.33.0.
    • Task
    As a search user, I want to search up-to-date documents. In case of stalled indices, I want those to be fixed promptly. As a maintainer of the search data pipeline, I want to be alerted if some DAGs are not being scheduled so that I can re-enable them.

    When doing hadoop maintenance it might happen that the DAGs are disabled (and/or the airflow scheduler is stopped) to help drain the YARN cluster. If we forget to re-enable those after the maintenance is over we get no alerts: airflow SLAs are not being checked, since they depend on the fact that the DAG can be executed. There should be external monitoring making sure that airflow is not in this "maintenance mode" for too long (48h?), along the lines of the sketch below.

    AC:
    - alert when active DAGs are off for too long
    - alert when the airflow scheduler is off for too long
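    A minimal sketch of the kind of external check this asks for, assuming the Airflow 2.x stable REST API is reachable; the URL and credentials are placeholders:

```lang=python
import requests

# Assumptions: Airflow 2.x REST API, basic auth, and an illustrative endpoint URL.
AIRFLOW = "https://airflow.example.org/api/v1"
AUTH = ("monitor", "secret")

def paused_active_dags():
    """Return dag_ids that are defined and active but currently paused."""
    resp = requests.get(f"{AIRFLOW}/dags", params={"only_active": "true", "limit": 100}, auth=AUTH)
    resp.raise_for_status()
    return [d["dag_id"] for d in resp.json()["dags"] if d["is_paused"]]

def scheduler_healthy():
    """Check the scheduler status reported by the /health endpoint."""
    health = requests.get(f"{AIRFLOW}/health", auth=AUTH).json()
    return health["scheduler"]["status"] == "healthy"

if __name__ == "__main__":
    paused = paused_active_dags()
    if paused:
        print(f"WARNING: paused DAGs (possible leftover maintenance mode): {paused}")
    if not scheduler_healthy():
        print("CRITICAL: airflow scheduler heartbeat is stale")
```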
    • Task
    As a user of the Search REST api I want to be able to use the same parameters that I used to pass when using opensearch or the action API search modules so that I can tune, instrument and debug CirrusSearch.

    How to reproduce: pass `cirrusUseCompletionSuggester=yes` or `cirrusUseCompletionSuggester=no` to enable/disable the completion suggester.

    Opensearch varies its results:
    - https://en.wikipedia.org/w/api.php?action=opensearch&search=Test%20crocket&cirrusUseCompletionSuggester=yes
    - https://en.wikipedia.org/w/api.php?action=opensearch&search=Test%20crocket&cirrusUseCompletionSuggester=no

    The REST api does not vary its results:
    - https://en.wikipedia.org/w/rest.php/v1/search/title?q=Test%20crocket&cirrusUseCompletionSuggester=yes
    - https://en.wikipedia.org/w/rest.php/v1/search/title?q=Test%20crocket&cirrusUseCompletionSuggester=no

    This is particularly problematic as Cirrus relies on these kinds of parameters for configuring its instrumentation and A/B test infrastructure. Seen on tr.wikipedia.org where the search widget is relying on the REST api instead of opensearch. Relates to T281578

    AC:
    - discuss and decide what approach to take
    - instrumentation and debugging options should still be possible when CirrusSearch is called from the REST API
    • Task
    I’ve noticed that in some search results on mediawiki.org, the link to the section containing the result is a redlink. [standalone group](https://www.mediawiki.org/w/index.php?title=Special:Search&search=standalone+group&ns0=1&ns12=1&ns100=1&ns102=1&ns104=1&ns106=1): {F34534400} [phpunit testing list groups](https://www.mediawiki.org/w/index.php?search=phpunit+testing+list+groups&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns12=1&ns100=1&ns102=1&ns104=1&ns106=1): {F34534405} (Note that the first result has a blue section link.) The links point to the target page without the section, e.g. https://www.mediawiki.org/w/index.php?title=Manual:PHP_unit_testing/Writing_unit_tests_for_extensions&action=edit&redlink=1 for the first example. So far, I haven’t been able to reproduce this on another wiki.
    • Task
    As a user who wants to add an image as a structured task, I want to see a feed of articles that are unillustrated, have image matches, and fit my topics of interest, so that I can find interesting and valuable tasks for me to do.
    * I will not need to filter on image source.
    * I will not need to filter on a confidence or difficulty level.
    * I will not need the suggestions to be kept up to date or refreshed.

    The first iteration of "add an image" as built by the Growth team will operate off a static file of image suggestions that will be generated once. After it is generated in {T285816}, it needs to be loaded to the Search index so that the attribute of "having an image match" can be used for searching alongside other criteria, like topics, categories, and templates. Only articles that are unillustrated //and// that have image matches should be tagged in Search.

    Though the dataset coming from Platform will contain other information, like the image source and confidence, those things will not need to be searchable in Iteration 1 of the Growth team's feature. Though they may need to be searchable in subsequent iterations, the Growth and Search teams have decided not to index those additional attributes yet.

    It is also important to mention that when users accept or reject an image suggestion, the Growth team will use [[ https://wikitech.wikimedia.org/wiki/Search/WeightedTags#Resetting_the_data_from_MediaWiki | this existing Mediawiki functionality ]] to invalidate the suggestions in the Search index. No additional work from the Search team is required to make this possible.
    • Task
    As a maintainer of the search infrastructure I want the long running maintenance tasks to be resilient to node restarts so that such processes do not fail regularly.

    The scroll API relies on non-persisted state maintained on the elasticsearch nodes; that state may disappear if a node restarts, which causes the underlying maintenance task to fail. This problem currently affects:
    - dump generation (T265056)
    - title completion index rebuild
    - ttmserver
    - reindex? (might be solved upstream https://github.com/elastic/elasticsearch/issues/42612)

    One solution is to move the state to the client performing the long running task using `search_after` on a stable field (the page id), as sketched below.

    AC:
    - the scroll API is no longer used by long running tasks
    - a node crash does not cause a long running task to fail
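    A minimal sketch of client-side pagination with `search_after` (the host, index name, and exact sort field are assumptions; the key point is that the cursor lives in the client, so a restarted node only costs one retried request instead of a failed run):

```lang=python
import requests

# Assumptions: local cluster, index name, and a stable "page_id" sort field are illustrative.
ES = "http://localhost:9200"
INDEX = "mywiki_content"

def scan_all(batch_size=1000):
    """Iterate over every document, keeping the pagination cursor on the client side."""
    last_sort = None
    while True:
        body = {
            "size": batch_size,
            "query": {"match_all": {}},
            "sort": [{"page_id": "asc"}],   # stable field, e.g. the page id
        }
        if last_sort is not None:
            body["search_after"] = last_sort
        hits = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield hit["_source"]
        # the resume point survives node restarts because it is not server-side state
        last_sort = hits[-1]["sort"]

if __name__ == "__main__":
    for doc in scan_all():
        pass  # write to the dump, rebuild the suggester, etc.
```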
    • Task
    As a maintainer of the search cluster I want to easily know what the state of the various indices is on all the clusters with regard to analysis settings/mappings, so that I can more confidently enable/deprecate features without breaking existing usecases.

    Changing the analysis settings and mappings in CirrusSearch requires reindexing the affected wikis; this process is generally slow and is often delayed so that more changes are packed together. The drawback is that it's prone to mistakes, as the maintainer has to remember what was done and what is left to be done.

    AC:
    - verify that UpdateSearchIndexConfig is still able to properly detect discrepancies between the actual and expected settings
    - a small tool is available in `CirrusSearch/scripts` that produces a list of wiki-cluster pairs to verify
    • Task
    An i18n message needs to be created for `apihelp-cirrus-config-dump-param-prop` {F34527353}
    • Task
    https://wikitech.wikimedia.org/wiki/Switch_Datacenter#ElasticSearch

    Currently, before a datacenter switchover, we hardcode the more_like queries to go to the active datacenter so that after the switchover, when caches are cold, it still has a hot cache. After ~24h the caches are warmed up enough and the hardcoding is removed. Ideally this process would be automated, potentially by:
    * Replicating the cache
    * A cookbook to warm up the cache ahead of time (see the sketch below)
    * Deciding that it's fast enough with the cold cache (likely what would happen in an emergency) and no longer requires manual intervention
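    A rough sketch of what a cache-warming cookbook could do: replay `morelike:` searches for a set of popular titles before the switchover. The endpoint and title list here are placeholders purely for illustration; a real cookbook would target the new primary datacenter's internal endpoints with a representative title list.

```lang=python
import requests

# Assumptions: the endpoint and title list are placeholders for illustration.
API = "https://en.wikipedia.org/w/api.php"
TITLES = ["Transistor", "Albert Einstein", "Python (programming language)"]

def warm_more_like(titles):
    """Issue morelike: searches so their results are cached before user traffic arrives."""
    for title in titles:
        requests.get(API, params={
            "action": "query",
            "list": "search",
            "srsearch": f"morelike:{title}",
            "srlimit": 20,
            "format": "json",
        }).raise_for_status()

warm_more_like(TITLES)
```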
    • Task
    From IRC in `#wikimedia-operations`: ```lang=irc 11:55:23 <Krinkle> If there is code storing data directly in memc bypassing getWithSet(), then that would be a problem. 11:55:26 <Krinkle> I don't know if that's the case. 11:55:49 <Krinkle> It being called out here suggests that maybe it is doing something like that, as otherwise why is it called out at all? 11:56:24 <legoktm> I think the problem is that the CirrusSearch cache is too cold to use after the switchover 11:57:32 <Krinkle> well, if the procedure laid out here is what was done in the past, then I suppose we can do it again, I mean, nothing has changed in terms of wan cache 11:57:39 <legoktm> https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/3df4a9b30707a2ef9ba1ebfcc84f09b915c78e15/includes/Searcher.php#642 11:57:53 <Krinkle> if its use of wan cache is new, then we can re-evaluate it indeed 11:58:37 <legoktm> it's probably not new, I'm just checking to make sure the docs are still up to date, and it seems like they are 11:59:00 <Krinkle> aye, yeah, but this does seem a bit of an anti-pattern. 11:59:30 <Krinkle> it's bypassing virtually all scale and performance levers and automation in the wanobjectcache class by not using getWithSet, I think. 12:05:16 <legoktm> Krinkle: is there a link somewhere that explains why getWithSet is better than get/set? 12:05:20 -*- legoktm is filing a bug 12:08:11 <Krinkle> legoktm: the docs for get() and set() say to consider using getWithSet, and the raw get()/set() enumerate a lot of things to consider if you call them directly. https://doc.wikimedia.org/mediawiki-core/master/php/classWANObjectCache.html 12:08:23 <Krinkle> but more generally, if you ask me, these methods just shouldn' be public in the first place. 12:09:00 <Krinkle> They probably are only public to allow for an optimisation in one or two places somewhere where we haven't bothered to accept or accomodate it in a way that is less damanging to the public API 12:09:17 <Krinkle> and they probably are only called here because someone migrated the code from wgMemc to wanCache 12:09:24 <Krinkle> which is a step in the right direction I guess. ```
    • Task
    Proposed syntax: creator:Foo and creator:#123

    Use cases:
    1. A bot created a large number of articles, and we want to find search results not created by this bot.
    2. I want to work on articles in a category, but I want to skip articles created by User:Example.
    3. I want to find articles created by some user that are less recently edited.

    Notes:
    1. Internally we only store the user ID for registered users, so there will be no disruption on user rename.
    2. Delete/Undelete/Revdel/Import may affect the data (no creator may be exposed if it is revdeleted).
    3. UserMerge may disrupt existing data, but it is not enabled on Wikimedia wikis.
    • Task
    As a user, I want to order search results by the page size of each result, so that I can prioritize articles to work on. Split from T11519. Use case: finding the longest pages in a specific category, such as "stubs".
    • Task
    Split from T11519: Introduce a new keyword "pagesize" to search for pages of a given size, e.g. `incategory:Stubs pagesize:>3000` will give stubs with more than 3000 bytes. A similar keyword, `filesize`, already exists.
    • Task
    CirrusSearch adds a total word count of the wiki to Special:Statistics. As far as I can tell, this word count is not available anywhere else (`countContentWords()` is only called in `Hooks::onSpecialStatsAddExtra()`); it would be useful to also have it in the API, e.g. in `meta=siteinfo`.
    • Task
    **List of steps to reproduce** (step by step, including full links if applicable):
    * [[ https://en.m.wikipedia.org/w/index.php?title=Special:Search&limit=500&offset=0&profile=default&search=%22Azerbaijan%3A+MTN+%28until+2015%29%22&ns0=1 | Perform this mobile search ]] for "Azerbaijan: MTN (until 2015)" on the English Wikipedia

    **What happens?**:
    * You see about 300 articles in the search results.
    * Click on any of the articles in mobile, and observe that the text "Azerbaijan: MTN (until 2015)" does not appear in the rendered article or in its source text.
    * Switch to desktop view of any of the articles, and you can see that text inside one of the navboxes.

    **What should have happened instead?**:
    * In the mobile view, navboxes are not displayed, so search results on mobile should exclude text that appears in navboxes.

    **Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc**: Mobile version of English Wikipedia
    • Task
    ==== Error ==== * mwversion: `1.37.0-wmf.4` * reqId: `ce75c6aebfd3363649ed46e8` * [[ https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=(time:(from:'2021-05-04T21:55:42.000Z',to:'2021-05-06T14:24:57.993Z'))&_a=(query:(query_string:(query:'reqId:%22ce75c6aebfd3363649ed46e8%22'))) | Find reqId in Logstash ]] * [[ https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=(time:(from:now-30d,to:now))&_a=(query:(query_string:(query:'normalized_message:%22Exception%20thrown%20while%20running%20DataSender::%7Bmethod%7D%20in%20cluster%20%7Bcluster%7D:%20%7BerrorMessage%7D%22'))) | Find normalized_message in Logstash ]] ```name=normalized_message Exception thrown while running DataSender::{method} in cluster {cluster}: {errorMessage} ``` ```name=exception.trace,lines=10 from /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Model/URLReference.php(27) #0 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Model/URLReference.php(64): Flow\Model\URLReference->__construct(Flow\Model\UUID, string, Flow\Model\UUID, Title, string, Flow\Model\UUID, string, string) #1 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/Mapper/BasicObjectMapper.php(40): Flow\Model\URLReference::fromStorageRow(array, NULL) #2 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ObjectLocator.php(315): Flow\Data\Mapper\BasicObjectMapper->fromStorageRow(array) #3 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ObjectManager.php(307): Flow\Data\ObjectLocator->load(array) #4 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ObjectLocator.php(119): Flow\Data\ObjectManager->load(array) #5 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ObjectLocator.php(70): Flow\Data\ObjectLocator->findMulti(array, array) #6 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ManagerGroup.php(127): Flow\Data\ObjectLocator->find(array) #7 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Data/ManagerGroup.php(139): Flow\Data\ManagerGroup->call(string, array) #8 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/LinksTableUpdater.php(131): Flow\Data\ManagerGroup->find(string, array) #9 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/LinksTableUpdater.php(51): Flow\LinksTableUpdater->getReferencesForTitle(Title) #10 /srv/mediawiki/php-1.37.0-wmf.4/extensions/Flow/includes/Content/BoardContent.php(196): Flow\LinksTableUpdater->mutateParserOutput(Title, ParserOutput) #11 /srv/mediawiki/php-1.37.0-wmf.4/includes/Revision/RenderedRevision.php(266): Flow\Content\BoardContent->getParserOutput(Title, integer, ParserOptions, boolean) #12 /srv/mediawiki/php-1.37.0-wmf.4/includes/Revision/RenderedRevision.php(235): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(Flow\Content\BoardContent, boolean) #13 /srv/mediawiki/php-1.37.0-wmf.4/includes/Revision/RevisionRenderer.php(217): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array) #14 /srv/mediawiki/php-1.37.0-wmf.4/includes/Revision/RevisionRenderer.php(154): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array) #15 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array) #16 /srv/mediawiki/php-1.37.0-wmf.4/includes/Revision/RenderedRevision.php(197): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array) #17 /srv/mediawiki/php-1.37.0-wmf.4/includes/content/ContentHandler.php(1443): 
MediaWiki\Revision\RenderedRevision->getRevisionParserOutput(array) #18 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(85): ContentHandler->getParserOutputForIndexing(WikiPage, ParserCache) #19 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(68): CirrusSearch\BuildDocument\ParserOutputPageProperties->finalizeReal(Elastica\Document, WikiPage, ParserCache, CirrusSearch\CirrusSearch) #20 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/BuildDocument/BuildDocument.php(165): CirrusSearch\BuildDocument\ParserOutputPageProperties->finalize(Elastica\Document, Title) #21 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/DataSender.php(316): CirrusSearch\BuildDocument\BuildDocument->finalize(Elastica\Document) #22 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(136): CirrusSearch\DataSender->sendData(string, array) #23 /srv/mediawiki/php-1.37.0-wmf.4/extensions/CirrusSearch/includes/Job/JobTraits.php(136): CirrusSearch\Job\ElasticaWrite->doJob() #24 /srv/mediawiki/php-1.37.0-wmf.4/extensions/EventBus/includes/JobExecutor.php(79): CirrusSearch\Job\CirrusGenericJob->run() #25 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array) #26 {main} ``` ==== Impact ==== 15 instances today. Presumably breaks search indexing (but then, does that even work for Flow pages?) ==== Notes ==== The URL for which UrlReference triggers this error is `http://?????????/mediawiki/load.php...` which looks weird. Some kind of internal hack?
    • Task
    As a search engineer I want to know the set of available tools so that I can decide which ones are better adapted to my needs and possibly deprecate some of the tools written in relevanceForge.

    Family of tools and a few examples:
    - judgement list creation & management (grading query sets)
    -- [[https://gerrit.wikimedia.org/g/wikimedia/discovery/discernatron|discernatron]]
    -- [[https://github.com/o19s/quepid/|quepid]]
    -- [[https://github.com/cormacparle/media-search-signal-test|media-search-signal-test]]
    -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/search/MjoLniR/+/refs/heads/master|mjolnir query grouping & DBN click model]]
    - evaluation engines
    -- [[https://github.com/o19s/quepid/|quepid]]
    -- [[https://github.com/SeaseLtd/rated-ranking-evaluator|rated-ranking-evaluator]]
    -- [[https://github.com/cormacparle/media-search-signal-test|media-search-signal-test AnalyzeResults.php]]
    -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge|relevance forge engine scorer]]
    -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge|relevance forge engine scorer & diff tools]]

    Aspects to evaluate:
    - ability to customize search integration: how hard is it to integrate a new search subsystem?
    - ability to store data and manage history (compare performance over time)
    - UX: UI, multi-tenancy

    AC:
    - produce a comprehensive list of tools with a description of their features
    • Task
    Steps to reproduce:
    * Upload images and PDFs with similar names, e.g. mycat.png, category.jpg, cat1.pdf
    * Edit a page
    * Insert a gallery
    * Enter a term in the search box that matches a known image name and a known PDF filename, e.g. cat

    Expected result:
    * 2 results with thumbnails for mycat.png and category.jpg

    Actual result:
    * No results returned. The Ajax result contains a message like 'Could not normalize image parameters for cat1.pdf'.

    If I change the search query to `-filemime:pdf cat` I get results correctly.
    • Task
    Currently CirrusSearch jobs are configured to not do any retries because cirrus jobs manage retries internally. Instead, cirrus jobs should report success in case they failed and have scheduled a retry, and let change-prop do overall retries in case of a catastrophic job failure.
    • Task
    As a search engineer I want a dedicated dataset with the wikidata entities referenced from commons so that requests do not have to be made to wikidata directly.

    Commons and wikidata RDF data are available in a hive table. Create a spark job in wikidata/query/rdf/rdf-spark-tools that pulls all wikidata items linked from a mediainfo item using the property `P180` or `P6243` with the following data (a rough sketch follows this task):
    - item
    - labels
    - aliases
    - descriptions
    - P31 (instance of)
    - P171 (taxon)

    Example for Q42:
```lang=json
{ TODO }
```

    The resulting dataset should be available in a hive table for downstream operators.
    Hive table: `discovery.mediasearch_entities`
    HDFS folder: `hdfs:///wmf/data/discovery/mediasearch_entities`
    Schedule: should probably be rebuilt as soon as the commons mediainfo RDF dump is processed

    AC:
    - a new spark job in `wikidata/query/rdf/rdf-spark-tools`
    - a new dag in airflow to schedule this new job
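    A rough PySpark sketch of the join described above. The actual job would live in `wikidata/query/rdf/rdf-spark-tools`; the table and column names here (`commons_rdf`, `wikidata_rdf` with subject/predicate/object columns) are hypothetical placeholders, not the real Hive schemas.

```lang=python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mediasearch_entities_sketch").getOrCreate()

# Hypothetical triple-shaped tables: (subject, predicate, object)
commons = spark.table("commons_rdf")
wikidata = spark.table("wikidata_rdf")

# Items depicted (P180) or "digital representation of" (P6243) on mediainfo entities.
linked_items = (
    commons
    .where(F.col("predicate").isin(
        "<http://www.wikidata.org/prop/direct/P180>",
        "<http://www.wikidata.org/prop/direct/P6243>",
    ))
    .select(F.col("object").alias("item"))
    .distinct()
)

# Pull labels for those items; aliases, descriptions, P31 and P171 would follow the same pattern.
labels = (
    wikidata
    .where(F.col("predicate") == "<http://www.w3.org/2000/01/rdf-schema#label>")
    .select(F.col("subject").alias("item"), F.col("object").alias("label"))
)

result = linked_items.join(labels, "item", "left")
result.write.mode("overwrite").saveAsTable("discovery.mediasearch_entities")
```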
    • Task
    # Context
    As part of a [[ https://fr.wiktionary.org/wiki/Wiktionnaire:Pages_propos%C3%A9es_%C3%A0_la_suppression/f%C3%A9vrier_2021#Discussion | discussion about the integration of a rare word within the French Wiktionary]], it was pointed out that currently searching with the default internal engine of the Wiktionnaire for a term like [[https://fr.wiktionary.org/w/index.php?search=mqmqn&title=Sp%C3%A9cial%3ARecherche&profile=advanced&fulltext=1&searchengineselect=mediawiki&advancedSearch-current=%7B%7D&ns0=1&ns100=1&ns106=1&ns110=1|mqmqn]] will return no result. Popular general purpose web search engines will rightfully suggest "did you mean “maman”?" Indeed, in this case, it's obvious to any knowledgeable person that someone most likely typed on a qwerty keyboard layout as if it were an azerty one.

    # Desired behaviour
    The minimum improvement would be that the internal search engine could provide a good suggestion for cases like this one. I'm not aware of the actual algorithms behind the search engine, but the Levenshtein distance (LD) between `mqmqn` and `maman` is only 2. It should certainly be taken into account rather than providing not even a single result. Compare for example with how [[https://fr.wiktionary.org/w/index.php?search=mamqn&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&searchengineselect=mediawiki&advancedSearch-current=%7B%7D&ns0=1&ns100=1&ns106=1&ns110=1|searching for mamqn]] can suggest something like `brandwerend maken`, which has an LD of 15 from the provided input.

    It would be even better if it were possible to feed the engine with a list of common misspellings with a comment on the causes of each misspelling. Such a facility could accept both exhaustive and comprehensive specification lists, that is both something like `mqmqn -> maman : "…qwerty [on] azerty…"` and `p.p. -> papa : "The community decided to abuse the regexp facility to suggest papa as result to pépé, pipi, popo, pypy and so on."` – although this latter example would be a defective use of the feature. The result could then generate a search result page with a leading text such as `Did you mean “[[maman]]”? This is a common misspelling of a French word resulting from typing the word on a [[w:qwerty|]] keyboard layout as if it was an [[w:azerty| ]] one.`
    • Task
    How to reproduce: install MediaWiki 1.35.1 with ElasticSearch extension from git, run composer update and then run `php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php` you would get an error: ``` php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php indexing namespaces... Indexing namespaces...done content index... Fetching Elasticsearch version...6.8.13...ok Scanning available plugins...none Picking analyzer...english Inferring index identifier...mediawiki_content_first Creating index...ok Validating number of shards...ok Validating replica range...ok Validating shard allocation settings...done Validating max shards per node...ok Validating analyzers...ok Validating mappings... Validating mapping...different...corrected Validating aliases... Validating mediawiki_content alias...alias is free...[6e4df9b2173aee354e984b31] [no req] Error from line 451 of /var/www/html/w/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php: Class 'Elasticsearch\Endpoints\Indices\Aliases\Update' not found Backtrace: #0 /var/www/html/w/extensions/CirrusSearch/includes/Maintenance/Validators/SpecificAliasValidator.php(137): Elastica\Index->addAlias() #1 /var/www/html/w/extensions/CirrusSearch/includes/Maintenance/Validators/SpecificAliasValidator.php(79): CirrusSearch\Maintenance\Validators\SpecificAliasValidator->updateFreeIndices() #2 /var/www/html/w/extensions/CirrusSearch/includes/Maintenance/Validators/IndexAliasValidator.php(98): CirrusSearch\Maintenance\Validators\SpecificAliasValidator->updateIndices() #3 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php(458): CirrusSearch\Maintenance\Validators\IndexAliasValidator->validate() #4 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php(411): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->validateSpecificAlias() #5 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php(269): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->validateAlias() #6 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(61): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->execute() #7 /var/www/html/w/maintenance/doMaintenance.php(107): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute() #8 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(70): require_once(string) #9 {main} ``` It is because `composer update` installs [[ https://packagist.org/packages/elasticsearch/elasticsearch#v6.8.0 | elasticsearch/elasticsearch ]] version 6.8.0 (since 2021-03-01 18:54 UTC) and compared to previous version 6.7.2 some classes were removed, see http://comparabl.com/upgrade/elasticsearch-elasticsearch/v6.7.2/v6.8.0 [[ https://www.mediawiki.org/wiki/Extension:Elastica | Elastica ]] extension requires "ruflin/elastica": "6.1.1" which requires elasticsearch/elasticsearch: ^6.0 and composer installs elasticsearch-elasticsearch version 6.8.0. I fixed it adding to composer.local.json file: ``` "require": { "elasticsearch/elasticsearch": "6.7.2" } ``` It seems without the fix CirrusSearch works fine but you face the problem when you run the maintenance scripts.
    • Task
```lang=php
public function allowRetries() {
    return false;
}
```

    This override has a [doc comment](https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/a221008bd90151c60f0e2db568e83da552deb813/includes/Job/ElasticaWrite.php#88) which explains why it is disabled, namely that the job has its own retry logic. However, if I understand correctly, this also means that the job is lost without recovery if the PHP process is killed, e.g. during a deployment when we roll over php-fpm across the fleet, due to any other php-fpm restart (such as opcache capacity being reached, which is currently handled by a cronjob), or during switchovers when we kill running processes, etc.

    Disabling retries is relatively rare, and when done it is typically for jobs that exist only as an optimisation or that can self-correct relatively quickly (e.g. warming up thumbnail or parser cache), or for unsafe/complex code that isn't atomic and cannot restart, with e.g. a user asynchronously waiting on the other end who would notice and know to re-try at the higher level (such as upload chunk assembly).

    I don't know if there is a regular and automated way by which this would self-correct for CirrusSearch. If not, then it might be worth turning this back on. Given that the code is already wrapped in a try-catch, it should be impossible for blind job queue growth to happen. In the cases where a runtime error of the kind that you don't want to retry is found, the existing code will kick in as usual and signal that it should be counted as a success. It's only when the process is aborted from the outside, and thus the job runner never gets a response or gets HTTP 500, that it will have permission to retry/requeue up to 3 times.
    • Task
    In T265894 @Tgr suggested the idea of a maintenance script for CirrusSearch to allow setting arbitrary field data in the ES index, for local development. In our suggested edits feature, we rely on ORES topics which are not populated in our local wiki; @Tgr wrote a script P10461 to populate this data. We would also like to set the `hasrecommendation` field (T269493) for articles. Having a maintenance script provided by the extension would be convenient.
    • Task
    According to their blog post (https://www.elastic.co/blog/licensing-change):

    > Starting with the upcoming Elastic 7.11 release, we will be moving the Apache 2.0-licensed code of Elasticsearch and Kibana to be dual licensed under SSPL and the Elastic License, giving users the choice of which license to apply.

    Their //FAQ on 2021 License Change//: https://www.elastic.co/pricing/faq/licensing

    Considering this is happening:
    1. Will CirrusSearch rely on Elasticsearch in the near future (say on version 7.x)?
    2. Will CirrusSearch rely on Elasticsearch in the long term?
    3. According to task T213996, the switch from MongoDB happened because it was removed from Debian and was therefore unsuitable for long-term use, so it is unclear whether a precedent was set. Is there any policy change or clarification for when another dependency announces a switch to a license that may not be suitable for MediaWiki?

    ----
    Context on proprietary relicensing: https://sfconservancy.org/blog/2020/jan/06/copyleft-equality/
    Existing alternative venues: https://opendistro.github.io/for-elasticsearch/contribute.html ("distribution" or "fork" depending who you ask, no CLA)
    Some announced forks:
    https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/
    https://logz.io/blog/open-source-elasticsearch-doubling-down/
    • Task
    Right now we use:
    * `ruflin/elastica` on `6.1.5`; `>=7.1.0` supports PHP 8.0
    * `elasticsearch/elasticsearch` on `6.5.1`; `>=7.11.0` supports PHP 8.0

    Upgrades are blocked by {T263142}.

    For `ruflin/Elastica`, the major version of the release matches the ES major release: {F34106476}
    For `elasticsearch/elasticsearch`: {F34106478}
    • Task
    **Problem:**
    Special:Search right now only allows searching across all languages in the Lexeme namespace. It would be useful to allow restricting the search to a specific language in order to make finding the right Lexeme easier. In order to do this we should introduce new CirrusSearch keywords. These could be `haslemma:en` and `haslang:Q1860`.

    **Example:** A search for "[[https://www.wikidata.org/w/index.php?search=a&title=Special:Search&profile=advanced&fulltext=1&ns146=1|a]]" to find the English indefinite article. It is currently the 17th result.

    **BDD**
    GIVEN a Lexeme search
    AND a keyword "haslang:Q1860"
    THEN the results only contain Lexemes with English as the Lexeme language

    GIVEN a Lexeme search
    AND a keyword "haslemma:en"
    THEN the results only contain Lexemes with English as one Lemma's spelling variant

    **Acceptance criteria:**
    * Results on Special:Search can be restricted by language via 2 new keywords

    **Notes:**
    * existing keywords specific to Wikibase: https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
    • Task
    **User Story:** As a search user, I want to get the same results for cross-language suggestions regardless of the case of the query, because that usually doesn't matter to me. As noted below, searching for транзистор on English Wikipedia generates Russian cross-language suggestions, while searching for Транзистор does not (they only differ by the case of the first letter). Language identification via TextCat is currently case-sensitive because the n-gram models were generated without case folding. This makes sense as a model because word-initial caps are different from word-final caps in many cases, and some languages, like German, have different patterns of capitalization that can help identification. However, a side effect of that is that words that differ only by case can get different detection results—usually in the form of "no result" because one string is "too ambiguous" (i.e., there is more than one viable candidate). It would be mostly straightforward to case-fold the existing models (merging n-gram counts) to generate case-insensitive models, but we would have to re-evaluate the models' effectiveness. **Acceptance Criteria:** * Survey of how often differently-cased versions of the same query (original, all lower, all upper, capitalized words) get different language ID results, using the current TextCat params, to get a sense of the scope of the problem. * A review of any accuracy changes for case-folded TextCat models, using the currently optimized parameters. * If the problem is large enough and the accuracy of case-folded models drops too much, we need a plan (i.e., a new sub-ticket) to re-optimize the TextCat params for the case-folded and slightly lower-resolution but more consistent models. _____ **Original Description:** It's an issue I found as I was reporting T270847 :) If I [[ https://en.wikipedia.org/w/index.php?search=%D0%A2%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace of the English Wikipedia for "Транзистор" ]], I find zero results in the main screen, and one result in the right-hand sister project sidebar: "транзистор" in the English Wiktionary. The word means "transistor" in several languages that are written in the Cyrillic alphabet, and note that the search string begins with an uppercase Cyrillic letter. The title of the Wiktionary result, which //is// found, is written with a lowercase letter. If I [[ https://en.wikipedia.org/w/index.php?search=%D1%82%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace of the English Wikipedia for "транзистор" ]], which is the same word, but in all lowercase letters, then I get the same Wiktionary result in the sidebar, and also many results from the Russian Wikipedia (I'd also expect other languages, but that's another issue, T270847). Searching probably shouldn't be case-sensitive, at least not in a case like this.
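    The "case-fold the existing models (merging n-gram counts)" step mentioned above is mechanically simple; here is a sketch of the idea. The TextCat model file format is assumed to be a plain `ngram<TAB>count` listing for illustration, which may not match the real on-disk format.

```lang=python
from collections import Counter

def case_fold_model(path_in, path_out):
    """Merge n-gram counts for n-grams that differ only by case."""
    folded = Counter()
    with open(path_in, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            folded[ngram.lower()] += int(count)   # merge "Tr", "tr", "TR", ...
    with open(path_out, "w", encoding="utf-8") as f:
        # TextCat ranks n-grams by frequency, so keep them sorted by count.
        for ngram, count in folded.most_common():
            f.write(f"{ngram}\t{count}\n")

# case_fold_model("ru.lm", "ru.folded.lm")  # hypothetical file names
```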
    • Task
    **User Story:** As a Wikipedia user and speaker of a given language, I would like to know that results are available in my language when searching on a Wikipedia in a different language, so I can read articles in my own language. Our current language identification process chooses the //one// most likely language to show results from. There may be other languages with exact title matches or other reasonable results. It would be useful to let users know that articles/results in those languages exist if possible. Potential pitfalls include the expense of searching more than one additional language and increased potential for poor relevance for general results. Some possible approaches: * Allow more than one language to be used for cross-language searching; possibly based on one or more of geolocation, user language preferences, browser language(s), and language ID results. * Search multiple languages for results; could limit additional languages to title matches or exact title matches. * Update UI: Display multiple results sets, or provide a language selector, or provide links to results/exact title matches. **Acceptance Criteria:** * An assessment of how many languages we can realistically search * If n == 1, give up. `:(` * Best option for how to choose which languages to search * Best option for how to search additional languages * A plan for updating the UI (may require help from outside the team) If/When we move this to current work, this ticket may need to be upgraded to an EPIC to support all those different tasks. ______ **Original Description:** Ukrainian Wikipedia is only sometimes shown in cross-wiki search results even if a relevant result is available. For example, if I [[ https://en.wikipedia.org/w/index.php?search=%D1%82%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace English Wikipedia for the string "транзистор" ]] (it means "transistor"), it finds nothing in the English Wikipedia, as you would expect, and shows results from the Russian Wikipedia. An [[ https://uk.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80 | article with the exact same title ]] exists in the Ukrainian Wikipedia, but it's not shown in the results. Evidently, the showing of results from the Ukrainian Wikipedia works in cross-wiki search for some searches, but not for all. If I [[ https://en.wikipedia.org/w/index.php?search=%D0%BF%D0%B5%D1%82%D1%80%D0%BE%D0%BF%D0%B0%D0%B2%D0%BB%D1%96%D0%B2%D1%81%D1%8C%D0%BA%D0%B0+%D0%B1%D0%BE%D1%80%D1%89%D0%B0%D0%B3%D1%96%D0%B2%D0%BA%D0%B0&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search for "петропавлівська борщагівка" ]] (Petropavlivska Borshchahivka, a name of a village), I get one result from English and multiple results from Ukrainian. I'd expect to see a result from Ukrainian also for "транзистор" (transistor), and not only from Russian. There are also [[ https://www.wikidata.org/wiki/Q5339 | several more wikis ]] where there's an article with the exact same title: Bashkir, Bulgarian, Chechen, Kazakh, and more, and I'd expect to see all of them. It would also be OK to prioritize results for languages that I have configured in my browser, but I tried configuring Ukrainian, and I still see only Russian results. (And even if my browser language is //prioritized//, other languages should be //available//.)
    • Task
    There is no way for users to opt for a default search option to exclude documents. Adding "-filemime:pdf -filemime:djvu" verges on being incomprehensible for most non-tech users. This could be usefully added as a site user preference, or made another field in the Commons search UI. This is an issue made more significant recently, with the IA books project adding a million PDFs to the collections on Commons. Consequently even simple (non-document type) searches like "cats with flowers" are returning lots of uninteresting looking PDFs in the top search returns, unless you happen to be very interested in Seed Trade Catalogs.
    • Task
    It was brought to my attention that [[ https://fa.wikipedia.org/w/index.php?search=data+science&title=%D9%88%DB%8C%DA%98%D9%87%3A%D8%AC%D8%B3%D8%AA%D8%AC%D9%88&go=%D8%A8%D8%B1%D9%88&ns0=1 | the third search result for "data science" in fa.wikipedia.org ]] is the suicide methods article ([[ https://www.wikidata.org/wiki/Q2485083 | Q-ID ]]). I'd appreciate if you can look into it and fix it. See a screenshot at F33941442.
    • Task
    **User story:** As an Elasticsearch developer, I want to be able to add useful filters in a logical order without having to worry about how they might interact to create an invalid token order. **Notes:** As outlined in the parent task (T268730) and related comments, because `homoglyph_norm` creates multiple overlapping tokens and `aggressive_splitting` splits tokens, the two can interact to create tokens in an invalid order if `homoglyph_norm` comes before `aggressive_splitting`. For example, a stream of tokens with offsets (0-5, 6-7, 0-5, 6-7), which should be properly ordered as (0-5, 0-5, 6-7, 6-7). The short-term solution is to swap their order, but that is not the logical order they should be applied—though the outcome is the same in the majority of cases (but not all). There is a specific and a generic approach to solving the problem: * Specific: recreate either `aggressive_splitting` or its component `word_delimiter` in such a way that it doesn't create out-of-order tokens. This would require caching incoming tokens to make sure that none that come immediately after would be out of order. * Generic: create a general-purpose reordering filter that would take a stream and reorder tokens in an invalid order (up to some reasonable limit—it shouldn't have to handle a thousand tokens in reverse order, for example). ** Alternatively, it could clobberize highlighting and possibly some other features by simply changing the offset information to be "acceptable", as `word_delimiter_graph` does. So, (0-5, 6-7, 0-5, 6-7) would become (0-5, 6-7, 6-6, 6-7)—it's not right, but at least it isn't broken. The generic case would allow us to reorder tokens for the existing `aggressive_splitting` and could be useful in future situations, but is probably more difficult to code and possibly noticeably slower. **Acceptance Criteria:** * We can order `homoglyph_norm` before `aggressive_splitting` without causing errors on known-troublesome tokens such as `Tolstoу's` (with Cyrillic у).
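    To make the "generic reordering filter" idea concrete, here is a language-agnostic sketch of the buffering logic (in Python for brevity; a real implementation would be a Lucene TokenFilter in Java). Tokens are represented as (start_offset, end_offset, term) tuples and reordered within a bounded window:

```lang=python
def reorder_tokens(tokens, max_window=16):
    """Yield tokens sorted by (start, end) offsets, buffering at most max_window of them.

    tokens: iterable of (start_offset, end_offset, term) tuples, e.g. the
    out-of-order stream (0-5, 6-7, 0-5, 6-7) described above.
    """
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) >= max_window:
            buffer.sort(key=lambda t: (t[0], t[1]))
            # emit the earliest buffered token; tokens arriving more than
            # max_window positions out of order would still be mis-ordered,
            # which is the "reasonable limit" caveat from the description
            yield buffer.pop(0)
    for tok in sorted(buffer, key=lambda t: (t[0], t[1])):
        yield tok

stream = [(0, 5, "word1a"), (6, 7, "word2a"), (0, 5, "word1b"), (6, 7, "word2b")]
print(list(reorder_tokens(stream)))
# [(0, 5, 'word1a'), (0, 5, 'word1b'), (6, 7, 'word2a'), (6, 7, 'word2b')]
```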
    • Task
    As a search engineer I want most of the tools provided by relevance forge to be compatible with, and friendlier to, image search so that I can assess/anticipate changes to MediaSearch.

    MediaSearch presents its image results in a grid, but relevance forge has been designed for working on text-based search results. We should adapt some of the tooling for grid layouts:
    - determine if it makes sense to have a good way to present diffs for grid results
    - research what metrics would make more sense to evaluate grid results
    - possibly start collecting and grading a set of query -> results pairs

    AC:
    - Have a tool that allows assessing the impact of a change on MediaSearch
    - Have a better understanding of what metric could be used to evaluate grid-based results
    - Decide if collecting and grading a query set is worth the effort
    • Task
    As CirrusSearch maintainer I want MediaSearch to use a dedicated dataset built from wikidata that does not rely on the existing wikidata search APIs, so that I can improve one without impacting the other.

    Sub-tickets will be created as needed but the plan is roughly:
    - import the commons mediainfo dump to hdfs
    - a spark job that joins commons & wikidata and outputs a dedicated dataset for concept lookups
    - determine the mapping, possibly experimenting with better techniques (not one field per language) to support multiple languages
    - a custom elasticsearch query to do query expansion & rewrite
    - adapt mediasearch and replace the wikidata search API using query expansion
    - optional but would be good to have: provide completion for wikidata items using this same dataset instead of using the wikidata completion API

    AC:
    - The MediaSearch query builder is no longer using the wikidata search API
    - A single request is made to elastic
    • Task
    Searching for several words returns the results in seemingly random order instead of sorted by relevance. For example:

    # https://ru.wikipedia.org/w/index.php?search=Орден+"Святого+Марка"&title=Служебная:Поиск&profile=advanced&fulltext=1&advancedSearch-current={}&ns0=1 Searching for `Орден "Святого Марка"` returns the page `Орден Святого Марка` third instead of first, as I would've expected since it is the most relevant result. It also returns the page `Награды Египта` (which also contains this exact sequence -- `Орден Святого Марка`) close to the end of the 1st page.
    # https://ru.wikipedia.org/w/index.php?search=Орден+Святого+Марка&title=Служебная:Поиск&profile=advanced&fulltext=1&advancedSearch-current={}&ns0=1 Searching for `Орден Святого Марка` (no quotation marks) does return the page `Орден Святого Марка` first, but it doesn't return `Награды Египта` on the 1st page at all -- again, despite it containing the very sequence that is searched for.

    My search settings are set to default, so this is what most users get.
    • Task
    **User story:** As a non-WMF user of MediaWiki full-text search, I want to be able to configure custom analysis chains that are more appropriate for my use case. This issue came up in a discussion with @Svrl on the [[ https://www.mediawiki.org/wiki/Topic:Vxh0rmkyfef0pm70 | Cirrus help talk page ]]. For example: while you can specify `$wgLanguageCode = 'cs';`, that only allows you to enable the same specific analysis chain as used on cswiki. If we change that analysis on our end, it also changes for external users when they upgrade MediaWiki. If you want to do something different (like using the Czech stemmer + ICU folding), you can't easily do so (it may be possible with lots of hacking and manual maintenance, but that's sub-optimal). @dcausse & @TJones discussed this some, and @dcausse found a way to inject config to update or replace a language-specific configuration. An example config is done for Czech P13907. This should be documented in our on-wiki docs. **Acceptance Criteria:** * Update appropriate documentation page(s) on-wiki with general method for doing this, and at least one specific example. Should be reviewed by another search developer.
    • Task
    **While performing this command:**
    php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php

    **The following error pops up:**
    PHP Fatal error: Declaration of Elasticsearch\Endpoints\Indices\Exists::getParamWhitelist() must be compatible with Elasticsearch\Endpoints\AbstractEndpoint::getParamWhitelist(): array in /var/www/html/mwtest/extensions/Elastica/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Endpoints/Indices/Exists.php on line 60

    **Observed on:**
    MediaWiki 1.31.10
    Elasticsearch 5.6.16
    Ubuntu 18
    PHP 7.2.24
    • Task
    Add a new CirrusSearch keyword itemquality, e.g. itemquality:A will return all class A items. (In this task, it is not required to update the class automatically; it may be periodically updated, e.g. once every week. This may be decided later.)
    • Task
    Problem: The Cirrus Search dump script is not resilient enough to failures (elasticsearch restarts), causing cirrus dumps for some wikis to be missing for some weeks. The solution probably involves stopping use of the elastic "scroll" API.

    AC:
    * Make the dump script more resilient, so that cirrus dumps no longer fail when restarting elasticsearch

    ***
    Original log below, and additional logs in comments. For this week's run:
```
<13>Oct 7 08:09:36 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201005/eswiki-20201005-cirrussearch-general.json.gz
<13>Oct 7 08:09:37 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201005/eswikibooks-20201005-cirrussearch-content.json.gz
<13>Oct 7 08:09:37 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201005/eswikibooks-20201005-cirrussearch-general.json.gz
```
    • Task
    I searched
```
incategory:"Files with no machine-readable license" insource:/eview/ -FlickreviewR
```
    on Commons, in an effort to find files in that category that have a review template in the source wikitext. This query returns some old files. By sorting by edit date, I found for example this: https://commons.wikimedia.org/wiki/File:Korg_Electribe_MX_(EMX-1)_Valve_Force.jpg

    It was in that category for **less than a minute** when it was uploaded in **2010**! As soon as this edit https://commons.wikimedia.org/w/index.php?diff=43801464, made in the same minute as its upload, was saved, it was already out of that category. Yet it still shows up in my search query 10 years later! The file has been **edited more than 10 times** over the decade, and was last edited in 2017, so your database should have been updated, right?!

    I don't know whether this kind of false positive is solely related to the incategory command or not. Please investigate.
    • Task
    Steps to Reproduce: Submit the query `insource:/\//` or `intitle:/\//` in CirrusSearch. Practical example link with narrowed search domain: [[https://en.wikipedia.org/w/index.php?search=%3A+slash+insource%3A%2F%5C%2F%2F&title=Special:Search | Search results for “: slash insource:/\//”]] (English Wikipedia). Actual Results: Error message: `An error has occurred while searching: Regular expression syntax error at unknown: unknown`. Apparently there is some kind of parsing error. Expected Results: Should search for pages containing `/` or titles in the given namespaces containing `/`. The query also fails if more valid regular expression characters are added before `\/`: `insource:/word\//` or `intitle:/word\//`. Search succeeds when `\/` is not the last thing in the regex: `insource:/\/./` or `intitle:/\/./`. Again, example link for English Wikipedia: [[https://en.wikipedia.org/w/index.php?search=%3A+slash+insource%3A%2F%5C%2F.%2F&title=Special:Search | Search results for “: slash insource:/\/./”]] ---- This is probably a separate issue, but `insource://` and `intitle://`, and `insource:///` and `insource:////`, etc., have an odd error: `An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later.` I would've expected something like "You can't search for an empty regular expression" and "Invalid syntax: `/` found after `insource://` syntax, expected space character or end of query".
    • Task
    Search is often used for finding articles to edit; the ability to exclude protected articles would make that more effective. #growthexperiments, which offers articles with simple editing tasks to newcomers (and thus needs to avoid recommending protected articles), currently filters out protected articles on the client side (via `action=query&prop=info&inprop=protection`) which is far from ideal and makes proper handling of result sizes and offsets impossible. It would be nice to have a CirrusSearch keyword (maybe `hasprotection:edit` / `hasprotection:move`?) for filtering for protection status. Page protection changes are accompanied by a null edit, which pushes status changes to the search index, so AIUI all that would be needed is to add a protection field to the ElasticSearch index, add it to the EventBus event for new revisions, and register the appropriate search feature.
    • Task
    # Status
    - v4 - running smoke test in CI ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/617416 | 617416 ]]) fails
    - v5 - Cindy-the-browser-test-bot fails when running [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/610396 | 610396 ]] without saying which tests failed
    - Vidhi still doesn't have CirrusSearch running locally, [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]] might help
    - Vidhi [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007/5#message-282ce3d4e62b3807002d2fdf76f42da08adf25f7 | can't log in ]] to horizon.wikimedia.org

    ---

    # TODO
    [x] add a separate patch renaming `@selenium-test` to `selenium-test` to check if webdriverio v4 tests pass in CI: [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/617416 | 617416 ]]:
    [] update T253869
    [] count lines of code in `tests/selenium` and `tests/integration`
    [] count tests in feature files (scenario and scenario outline)
    [] local development environment
    [x] [[ https://www.mediawiki.org/wiki/MediaWiki-Docker/Extension/CirrusSearch | MediaWiki-Docker/Extension/CirrusSearch ]]: P11874
    [x] run the tests targeting the beta cluster ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/616734 | 616734 ]]): v4 P12120, v5 P12130
    [] try [[ https://github.com/montehurd/mediawiki-docker-dev-sdc | montehurd/mediawiki-docker-dev-sdc ]] as local development environment: P12314
    [] run the tests targeting mediawiki-vagrant ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]]): paste TODO
    [] run the tests targeting Wikimedia Cloud Services ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]]): Not authorized for any projects or domains
    [x] Update to v5
    [] Update to v6
    [] Update to v7
    • Task
    MW search suggestions appear when entering some text in the search input in the skin, but if you click inside the suggestion area and then release the mouse outside of it (so you are not taken to the next page), the suggestions get stuck: clicking outside of the search focus area no longer forces them to hide.

    Steps to reproduce:
    * go to [[ https://en.wikipedia.org/wiki/Main_Page | en.wikipedia.org ]]
    * type //math// in the upper right corner search box
    * click on the first result, but do not release the mouse
    * move the mouse outside the suggestions area and release

    The suggestions list now will not go away unless you click one of the suggestions!
    • Task
    Elastica is an extension created in {T56049} (predating the librarization project) that does nothing other than provide a connection between Elasticsearch and MediaWiki. It contains two parts: one is the elastica PHP library; the other is ElasticsearchConnection, which could be converted to a library as well, perhaps becoming part of CirrusSearch.
    • Task
    cirrus_build_completion_indices.sh should not ignore failures of the underlying script `extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php`. Prior to adding `| ts '{}'` it was failing with a 123 status, probably because of wikidata, where it quits with `Completion suggester disabled, quitting...`. Since we expect this error, UpdateSuggesterIndex could perhaps return a success in this case. To sum up, this task is:
    - make sure cirrus_build_completion_indices.sh in puppet reports a failure when UpdateSuggesterIndex.php fails
    - make sure that UpdateSuggesterIndex.php does not return with an error when quitting with this message: `Completion suggester disabled, quitting...` (see the sketch below)
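    A minimal sketch of the second point, as it might look inside UpdateSuggesterIndex.php's `execute()`; the config key and accessor shown are assumptions about how the "disabled" case is detected, not the script's actual code:
    ```lang=php
    <?php
    // Sketch: treat a disabled completion suggester as an expected no-op rather
    // than a failure, so wrappers like cirrus_build_completion_indices.sh do not
    // see a non-zero exit status for this wiki.
    if ( $this->getSearchConfig()->get( 'CirrusSearchUseCompletionSuggester' ) === 'no' ) {
        $this->output( "Completion suggester disabled, quitting...\n" );
        return true; // success: nothing to build for this wiki
    }
    ```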
    • Task
    See [[https://translatewiki.net/w/i.php?title=MediaWiki:Searchresults-title/wa&diff=9392337&oldid=1189285|here]] and the result [[https://wa.wikipedia.org/w/index.php?search=ok&ns0=0|here]]. The HTML entity (hex) is not correctly rendered in the title tag: ``` &amp;#x202F; ``` {F31842466} This is what I see in the title bar of my browser: {F31842489} ``` <title>Rizultats des rcwerances po «&amp;#x202F;ok&amp;#x202F;» — Wikipedia</title> ``` This is what is expected: {F31842491} ``` <title>Rizultats des rcwerances po «&#x202F;ok&#x202F;» — Wikipedia</title> ``` The ampersand char shouldn't be encoded if it is part of an HTML entity.
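    For reference, PHP's `htmlspecialchars()` has a `$double_encode` parameter that leaves an ampersand alone when it already starts an entity; a minimal illustration (not the actual MediaWiki escaping path) is:
    ```lang=php
    <?php
    $title = 'Rizultats des rcwerances po «&#x202F;ok&#x202F;» — Wikipedia';

    // Default behaviour re-encodes the existing entity, producing &amp;#x202F;
    echo htmlspecialchars( $title, ENT_QUOTES, 'UTF-8' ), "\n";

    // With $double_encode = false the existing &#x202F; entity is left intact.
    echo htmlspecialchars( $title, ENT_QUOTES, 'UTF-8', false ), "\n";
    ```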
    • Task
    **Steps to reproduce**
    # Try `pywikibot.Site('wikimania', 'wikimania').search('hello')`
    **Expected behavior**
    Should work fine.
    **Current behavior**
    Fails with the following error: ```WARNING: API error toomanyvalues: Too many values supplied for parameter "gsrnamespace". The limit is 50.```
    Since Pywikibot asks the API for the wiki's namespaces automatically and then searches wikimania using API:Search, hitting this limit is not expected.
    • Task
    For example you should be able to use https://en.wikipedia.org/w/index.php?search=linksto%3Am%3A to search for all pages that link to the Meta-Wiki main page. One use case is T249688#6055456, though {T239628} will be a more proper solution for that. See also {T68293}.
    • Task
    See {T248363} and T208425#5992965 for background.
    • Task
    For example see https://www.wikidata.org/w/index.php?title=Special:Search&limit=20&offset=20&profile=default&search=haslabel%3Akhw&advancedSearch-current={}&ns0=1&ns120=1; this returns a lot of items with a khw alias instead of a khw label. See also https://www.wikidata.org/wiki/Q13610143?action=cirrusdump. I proposed to introduce some new keywords:
    * hasalias
    * haslabeloralias
    * haslabel (existing, will search labels only)
    * inalias
    * inlabeloralias
    * inlabel (existing, will search labels only)
    Changing the behavior of the existing haslabel and inlabel keywords may be considered a breaking change.
    • Task
    restify, used by CirrusSearch, has a vulnerability. Please update to a newer version.
    #1171: csv-parse
    Severity: high
    Versions of csv-parse prior to 4.4.6 are vulnerable to Regular Expression Denial of Service. The __isInt() function contains a malformed regular expression that processes large specially-crafted input very slowly, leading to a Denial of Service. This is triggered when using the cast option. (npm advisory)
    • Task
    Once {T240559} lands, add the new `articletopic` search keyword to AdvancedSearch and provide a nice interface for selecting topics (a fixed list of 64 keywords, one or more of which can be used in the query).
    • Task
    This occurs when called from CirrusSearch's forceSearchIndex.php. The result, after indexing some pages but before finishing the job, is:
    ```
    MWException from line 348 of /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php: Bad parser output text.
    #0 [internal function]: ParserOutput->{closure}(Array)
    #1 /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php(359): preg_replace_callback('#<(?:mw:)?edits...', Object(Closure), '<div class="mw-...')
    #2 /var/www/mediawiki-1.34.0/includes/content/WikiTextStructure.php(154): ParserOutput->getText(Array)
    #3 /var/www/mediawiki-1.34.0/includes/content/WikiTextStructure.php(223): WikiTextStructure->extractWikitextParts()
    #4 /var/www/mediawiki-1.34.0/includes/content/WikitextContentHandler.php(152): WikiTextStructure->getOpeningText()
    #5 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(380): WikitextContentHandler->getDataForSearchIndex(Object(WikiPage), Object(ParserOutput), Object(CirrusSearch\CirrusSearch))
    #6 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(458): CirrusSearch\Updater::buildDocument(Object(CirrusSearch\CirrusSearch), Object(WikiPage), Object(CirrusSearch\Connection), 0, 0, 0)
    #7 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(236): CirrusSearch\Updater->buildDocumentsForPages(Array, 0)
    #8 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php(219): CirrusSearch\Updater->updatePages(Array, 0)
    #9 /var/www/mediawiki-1.34.0/maintenance/doMaintenance.php(99): CirrusSearch\ForceSearchIndex->execute()
    #10 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php(689): require_once('/var/www/mediaw...')
    #11 {main}
    ```
    This is with CirrusSearch-REL1_34-a86e0a5.tar.gz. I subsequently added an "echo" to see what it was choking on, which looks to be this:
    ```lang=html
    <h2><span class="mw-headline" id="Links">Links</span><mw:editsection page="File::Spec" section="1">Links</mw:editsection></h2>
    <ul><li><a rel="nofollow" class="external free" href="http://perldoc.perl.org/File/Spec.html">http://perldoc.perl.org/File/Spec.html</a></li></ul>
    ```
    On the older version of the wiki, searching for perldoc (in SphinxSearch in this case) brings up just this: {F31554958} Clicking on which shows this error: {F31554962} In other words this appears to be about something that can only be inherited from really old content, but which might be more gracefully skipped over when encountered. Adding the middle workaround lines (commented here) allowed indexing to run to completion:
    ```lang=php
    if ( $options['enableSectionEditLinks'] ) {
        // Workaround: escape "::" (as in page="File::Spec") so the
        // editsection regex below no longer chokes on it.
        if ( preg_match( "|::|", $text ) ) {
            $text = preg_replace( "|::|", "\:\:", $text );
        }
        $text = preg_replace_callback(
    ```
    Most likely there's a better way to fix this though.
    • Task
    Would have prevented T244479, which made its way all the way to production without being caught.
    • Task
    Similar to {T221135} but for WikibaseLexemeCirrusSearch
    • Task
    Example: https://en.wikipedia.org/w/index.php?search=User%3ARobin+Patterson&title=Special:Search&fulltext=Search&ns0=1 The current "Results from sister projects" is misleading (people may think the results are in the main namespace). Compare the local results on the left side.
    • Task
    In [[https://www.mediawiki.org/wiki/Help:CirrusSearch|CirrusSearch help]] all parameters are described in detail, but there is no summary (list or table) presenting all the parameters clearly one under another.
    • Task
    [[https://www.mediawiki.org/wiki/Help:CirrusSearch#Explicit_sort_orders|Explicit sort orders]] can only be accessed using the new AdvancedSearch interface or the URL. In the old search options or in the top search bar there is no way to use them other than modifying the URL of the results page.
    • Task
    Per @CDanis. Off the top of my head, GlobalUserPage needs updating; there are definitely others. Rough codesearch: https://codesearch.wmflabs.org/operations/?q=https%3A%2F%2F&i=nope&files=php%24&repos=Wikimedia%20MediaWiki%20config
    • Task
    Certain characters[1] are lost when highlighted in titles and text snippets. To reproduce, search for [[ https://en.wiktionary.org/w/index.php?search=intitle%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special%3ASearch&profile=default&fulltext=1&searchengineselect=mediawiki | `intitle:/[𔐀-𔙆]/ anatolian` ]] on English Wiktionary. The three results are 𔐱𔕬𔗬𔑰𔖱, 𔖪𔖱𔖪, and 𔑮𔐓𔗵𔗬. However, they are displayed as 𔖱, 𔖪, and 𔗬; see screenshot: {F31505303} Looking at the underlying HTML, the title of the first result (𔖱) contains several empty `searchmatch` spans: `<span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch">𔖱</span>` I //think// this may have something to do with the characters being lost during tokenization (or being the kinds of characters that are lost during tokenization—maybe they are treated as punctuation?). If you search for 𔑮𔐓𔗵𔗬 (no quotes), the only hit is the exact title match. Searching for "𔑮𔐓𔗵𔗬" (with quotes) gives zero results. I verified that the English `text` analyzer returns no tokens for the string 𔑮𔐓𔗵𔗬. Another example: [[ https://en.wiktionary.org/w/index.php?search=insource%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special:Search&profile=advanced&fulltext=1&searchengineselect=mediawiki&ns828=1 | `insource:/[𔐀-𔙆]/ anatolian` ]] restricted to the `Module` namespace gives a snippet with this: > canonicalName = "Anatolian Hieroglyphs", characters = "-", //characters = "-"// is //characters = "𔐀-𔙆"// in the original. The underlying HTML is `&quot;<span class="searchmatch"></span>-<span class="searchmatch"></span>&quot;`, again with empty `searchmatch` spans. __ __ __ [1] I first discovered this when looking into T237332, so the examples so far are Anatolian Hieroglyphs, though other characters may be affected.
    • Task
    1) go to the wikisource page of a certain language, for example English https://en.wikisource.org
    2) search for the //exact// title of something in a different language. For example: "Les Enfants du capitaine Grant" https://en.wikisource.org/w/index.php?search=Les+Enfants+du+capitaine+Grant
    3) See some results. You have to click the author's page, find the book in the list again and click on it.
    This is a poor example because the word "Grant" appears in the text of the page for both English and French, so the book in English is the second result but I think you get my point. It should just send me to the English translation (or whichever language Wikisource I'm on) directly since I typed the exact match of a book title that appears on a different language Wikisource. This could work by automatically creating redirects when articles are created or moved, or by changing the code of the search page. It should also be like on Wikipedia where it'll show you "We found the following results from French Wikisource"
    • Task
    [[https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Recherche?search=Je+suis+venir+te+dire+que+je+m%27en+vais&sourceid=Mozilla-search&ns0=1|Looking for “Je suis venir te dire que je m'en vais” on fr.wp]] finds the “#Je_suis_venue_te_dire_que_je_m'en_vais” section as its second result but does not find the following pages:
    * Je suis venu te dire que je m'en vais
    * Je suis venue te dire que je m'en vais…
    * Je suis venue te dire que je m'en vais - Sheila live à l'Olympia 89
    which are the three top results when [[https://fr.wikipedia.org/w/index.php?sort=relevance&search=Je+suis+venu+te+dire+que+je+m%27en+vais&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1|searching the correct “Je suis venu te dire que je m'en vais” phrase]]. Note that the Wdsearch gadget results already include the “Je suis venu te dire que je m'en vais” pages, but that's probably T219108.
    • Task
    Completion search is an important component of the editing workflow on Wikidata. We should collect comments and recommendations from the community to optimize the current ranking algorithm. The recommendations are currently collected on this page: https://www.wikidata.org/wiki/Wikidata:Suggester_ranking_input See also T193701 for how the various signals were optimized last time.
    • Task
    As a user of Wikidata search, I want recall to be improved so that I can find what I'm looking for. In T163642 we made all strings of `indexed statements` part of the all field, allowing them to be searchable by plain search queries. Unfortunately only a subset of the statements are being indexed. The reason is that indexing a statement today means that we populate the `statement_keyword` field. This is something we do not want to do for long text: textual content (phrases and long text that need tokenization) is not suited for keyword matching. If we want to increase recall on Wikidata using textual properties we need to come up with a new solution to populate extra text content into the existing CirrusSearch text fields. Currently the text fields are:
    - text: populated using \Wikibase\EntityContent::getTextForSearchIndex
    - auxiliary_text: not used by EntityHandler
    We should evaluate the impact on the size of the index to know if we can feed in all the textual properties or only a subset.
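    One possible direction, sketched as a standalone helper: split the values extracted from indexed statements into keyword-sized values (kept in the keyword field) and free text that needs tokenization (fed into the existing `auxiliary_text` field). The function name, field routing and length heuristic are all illustrative assumptions, not existing CirrusSearch/Wikibase code:
    ```lang=php
    <?php
    // Sketch only (hypothetical helper): route tokenizable statement values into
    // auxiliary_text instead of the keyword field.
    function splitStatementValuesForIndexing( array $statementValues ): array {
        $fields = [ 'statement_keywords' => [], 'auxiliary_text' => [] ];
        foreach ( $statementValues as $value ) {
            // Heuristic stand-in: short, space-free values behave like keywords;
            // anything longer is textual content that needs tokenization.
            if ( mb_strlen( $value ) <= 64 && strpos( $value, ' ' ) === false ) {
                $fields['statement_keywords'][] = $value;
            } else {
                $fields['auxiliary_text'][] = $value;
            }
        }
        return $fields;
    }

    // Example: the long textual value ends up in auxiliary_text.
    print_r( splitStatementValuesForIndexing( [ 'P31=Q5', 'A long descriptive statement value' ] ) );
    ```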
    • Task
    When loading CirrusSearch and providing a value for $wgCirrusSearchClusters, extension.json goes ahead and adds the default cluster to those defined. No defaults should ever be applied here; whatever is configured should be the only value used. Found in T237560.
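    To illustrate the kind of configuration affected (cluster name and hosts are placeholders), the expectation is that only the clusters listed in LocalSettings.php end up in the effective config, without the default entry from extension.json being merged in:
    ```lang=php
    <?php
    // LocalSettings.php sketch; hostnames and cluster name are placeholders.
    $wgCirrusSearchClusters = [
        'mycluster' => [ 'search1.example.org', 'search2.example.org' ],
    ];
    // Expected: only 'mycluster' is known to CirrusSearch.
    // Observed (per this task): the default cluster defined in extension.json is
    // merged in on top of this setting as well.
    ```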
    • Task
    CirrusSearch can be indexed with or without the "all field". The default is to index it. This field is extremely useful for filtering and retrieval purposes, and it is becoming burdensome to always have to consider the fact that it can be absent. For simplicity we should assume that it is always there and remove the ability to configure it.
    • Task
    Moving away from query_string we need to provide primitive queries for wildcard, prefix and fuzzy queries. The reason we can't use the ones provided by elastic is that unlike query_string they work like a term query. Some features we need to reproduce:
    - best effort analysis of the wildcard so that we can normalize the wildcard: https://github.com/apache/lucene-solr/blob/master/lucene/queryparser/src/java/org/apache/lucene/queryparser/classic/QueryParserBase.java#L708 (see the sketch below)
    - normalization for all three
    - max_determinized_states for wildcard
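    A rough sketch of the best-effort wildcard normalization mentioned in the first item; the simple lowercasing here is only a stand-in for running each literal chunk through the target field's analyzer:
    ```lang=php
    <?php
    // Split the pattern on the wildcard operators, normalize the literal chunks,
    // and keep * and ? untouched (mirroring what Lucene's classic query parser
    // does for wildcards, linked above).
    function normalizeWildcard( string $pattern ): string {
        $parts = preg_split( '/([*?])/u', $pattern, -1, PREG_SPLIT_DELIM_CAPTURE );
        $out = '';
        foreach ( $parts as $part ) {
            if ( $part === '*' || $part === '?' ) {
                $out .= $part;
            } else {
                // Stand-in normalization; the real implementation would send
                // $part through the field's analysis chain.
                $out .= mb_strtolower( $part );
            }
        }
        return $out;
    }

    echo normalizeWildcard( 'FooBar*Baz?' ), "\n"; // foobar*baz?
    ```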
    • Task
    After parsing the user query string into a SearchQuery we should apply a set of transformations to generate the Elasticsearch query.
    • Task
    Use cases:
    - Find items with the most sitelinks but no sitelink to a specific site
    - Find items with the most statements/sitelinks but without a specific label
    Although there are 3rd-party tools, they are not updated continuously.
    • Task
    In these edits https://commons.wikimedia.org/w/index.php?title=File%3AChiesa_di_San_Francesco_-_Trevi_21.jpg&type=revision&diff=371019235&oldid=366733414 I added some structured information to a file. This is visible at https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M82275323 . For the creator I use "somevalue" and that doesn't show up in the search index, see https://commons.wikimedia.org/w/index.php?title=File:Chiesa_di_San_Francesco_-_Trevi_21.jpg&action=cirrusdump : ``` statement_keywords 0 "P6216=Q50423863" 1 "P275=Q18199165" ``` When I look at https://commons.wikimedia.org/w/index.php?title=File:Betsey_Johnson_dress_other_cardigan.jpg&action=cirrusdump I see that qualifiers are supported: ``` statement_keywords 0 "P180=Q467" 1 "P180=Q467[P3828=Q200539]" 2 "P180=Q467[P3828=Q877140]" 3 "P180=Q467[P3828=Q37501]" 4 "P180=Q11442" ``` For the example file I would expect something like ``` statement_keywords 0 "P170=somevalue" 1 "P170=somevalue[P3831=Q33231]" 2 "P170=somevalue[P3831=Q33231]" 2 "P170=somevalue[P2093=Diego Baglieri]" (etc.) ``` So please modify the search to also index these. You probably want to tackle novalue while you're at it. Not sure how to make the distinction between the string and the keywords. I see that Wikidata has the same problem (for example https://www.wikidata.org/w/index.php?title=Q29569412&action=cirrusdump ), but on Wikidata it's less pressing because we mainly use SPARQL.
    • Task
    Explicitly enabling and provisioning the cirrussearch role in my local (LXC-based) setup leads to Vagrant no longer being able to boot the instance, and the instance appearing as "not created" with `vagrant status`: ``` mholloway@mholloway:~/code/wikimedia$ git clone ssh://gerrit.wikimedia.org:29418/mediawiki/vagrant cirrus2-vagrant Cloning into 'cirrus2-vagrant'... remote: Counting objects: 2147, done remote: Finding sources: 100% (13/13) remote: Getting sizes: 100% (11/11) remote: Compressing objects: 100% (33408/33408) remote: Total 24635 (delta 4), reused 24624 (delta 0) Receiving objects: 100% (24635/24635), 3.59 MiB | 7.71 MiB/s, done. Resolving deltas: 100% (17863/17863), done. mholloway@mholloway:~/code/wikimedia$ cd cirrus2-vagrant/ mholloway@mholloway:~/code/wikimedia/cirrus2-vagrant$ ./setup.sh Your git/Gerrit username Enter 'anonymous' for anonymous access, leave blank to manage it yourself git_user: mholloway You're all set! Simply run `vagrant up` to boot your new environment. (Or try `vagrant config --list` to see what else you can tweak.) mholloway@mholloway:~/code/wikimedia/cirrus2-vagrant$ vagrant roles enable cirrussearch Ok. Run `vagrant provision` to apply your changes. Note the following settings have changed and your environment will be reloaded. vagrant_ram: 1536 -> 2048 mholloway@mholloway:~/code/wikimedia/cirrus2-vagrant$ vagrant up Bringing machine 'default' up with 'lxc' provider... ==> default: Importing base box 'debian/stretch64'... ==> default: Checking if box 'debian/stretch64' version '9.1.0' is up to date... ==> default: Pruning invalid NFS exports. Administrator privileges will be required... [sudo] password for mholloway: ==> default: Starting container... There was an error executing ["sudo", "/usr/local/bin/vagrant-lxc-wrapper", "lxc-start", "-d", "--name", "cirrus2-vagrant_default_1571265688111_82030"] For more information on the failure, enable detailed logging by setting the environment variable VAGRANT_LOG to DEBUG. mholloway@mholloway:~/code/wikimedia/cirrus2-vagrant$ vagrant status Current machine states: default not created (lxc) The environment has not yet been created. Run `vagrant up` to create the environment. If a machine is not created, only the default provider will be shown. So if a provider is not listed, then the machine is not created for that environment. ``` This occurs regardless of whether the role was enabled before or after the initial `vagrant up`, and regardless of whether other roles have been applied first. Interestingly, if the cirrussearch role is pulled in indirectly, as a dependency of another role (for example, wikibasecirrussearch), there is no problem. ==== Misc. data: Vagrant version: 2.2.5 Provider: LXC Vagrant-LXC plugin version: 1.4.3
    • Task
    Currently the language model used by the phrase suggester is only populated using title and redirect strings. This is completely useless for wikibase items, and we even disabled the feature completely in T143260. We should investigate how to build the language model on top of the labels and aliases of the entities. This will require some refactoring in CirrusSearch as this component was not designed to be extensible.
    • Task
    Some Wikibase classes are still dependent on CirrusSearch:
    - TermLookupSearcher
    - ElasticTermLookup
    - MoreLikeWikibase
    We should try to remove them so that Wikibase alone does not depend on CirrusSearch. That way we need neither stubs for phan (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/541230) nor to import CirrusSearch when testing Wikibase (https://gerrit.wikimedia.org/r/c/integration/config/+/541227).
    • Task
    I noticed on Commons that some files like https://commons.wikimedia.org/wiki/File:Pesenbach_Kirche_Leonhardialtar_Schrein_01.jpg have a link to https://www.wikidata.org/wiki/special:search?search=haswbstatement%3AP2951%3D34 . This search has only one result: https://www.wikidata.org/wiki/Q37963045 It would be nice to be able to pass a parameter to the search so that, if there is exactly one result, it redirects to the item right away. One click less for the user.
    • Task
    API URL: https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=intitle%3A%22donkeys%22&srlimit=500&srnamespace=0&srprop=size%7Cwordcount%7Ctimestamp%7Credirecttitle&format=json&origin=*&srsort=create_timestamp_asc For the query [intitle:"donkeys"], the page "Democratic Party (United States)" is one of the results. It should not be listed, since no redirects to that page include plural "donkeys" ([[ https://en.wikipedia.org/wiki/Democratic_Party_(United_States)?action=cirrusDump | list of redirects ]]), and according to [[ https://www.mediawiki.org/wiki/Help:CirrusSearch#Intitle_and_incategory | Help:CirrusSearch#Intitle_and_incategory ]], quotes turn off stemming. Another potential issue: the result item is missing a `redirecttitle:` field. I suspect the result is listed because of redirects with "donkey" in the title, but the lack of `redirecttitle:` makes it look (to [[ https://gitlab.com/falsifian/wikipedia-title-search | a tool we're developing ]]) like it's a non-redirect result. This isn't an isolated example: that same query includes "John Simpson Kirkpatrick", "Buridan's ass", "Evolution of the horse", which similarly have no `redirecttitle:` field (though I haven't verified they don't have redirects with "donkeys" in the title). Another example: "Prometheus" as first result for intitle:"eagles".
    • Task
    ``` ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: Wikimedia\Rdbms\DBQueryError from line 1591 of /vagrant/mediawiki/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: Query: SELECT ips_item_id FROM `wb_items_per_site` WHERE ips_site_id = 'enwiki' AND ips_site_page = 'Main Page' LIMIT 1 ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: Function: Wikimedia\Rdbms\Database::selectRow ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: Error: 1146 Table 'wikidatawiki.wb_items_per_site' doesn't exist (127.0.0.1) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #0 /vagrant/mediawiki/includes/libs/rdbms/database/Database.php(1562): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Table 'wikidata...', 1146, 'SELECT ips_ite...', 'Wikimedia\\Rdbms...') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #1 /vagrant/mediawiki/includes/libs/rdbms/database/Database.php(1150): Wikimedia\Rdbms\Database->reportQueryError('Table 'wikidata...', 1146, 'SELECT ips_ite...', 'Wikimedia\\Rdbms...', false) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #2 /vagrant/mediawiki/includes/libs/rdbms/database/Database.php(1794): Wikimedia\Rdbms\Database->query('SELECT ips_ite...', 'Wikimedia\\Rdbms...') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #3 /vagrant/mediawiki/includes/libs/rdbms/database/Database.php(1886): Wikimedia\Rdbms\Database->select('wb_items_per_si...', Array, Array, 'Wikimedia\\Rdbms...', Array, Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #4 /vagrant/mediawiki/extensions/Wikibase/lib/includes/Store/Sql/SiteLinkTable.php(266): Wikimedia\Rdbms\Database->selectRow('wb_items_per_si...', Array, Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #5 /vagrant/mediawiki/extensions/Wikibase/lib/includes/Store/CachingSiteLinkLookup.php(147): Wikibase\Lib\Store\Sql\SiteLinkTable->getItemIdForLink('enwiki', 'Main Page') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #6 /vagrant/mediawiki/extensions/Wikibase/lib/includes/Store/CachingSiteLinkLookup.php(75): Wikibase\Lib\Store\CachingSiteLinkLookup->getAndCacheItemIdForLink('enwiki', 'Main Page') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #7 /vagrant/mediawiki/extensions/Wikibase/client/includes/LangLinkHandler.php(101): Wikibase\Lib\Store\CachingSiteLinkLookup->getItemIdForLink('enwiki', 'Main Page') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #8 /vagrant/mediawiki/extensions/Wikibase/client/includes/LangLinkHandler.php(331): Wikibase\Client\LangLinkHandler->getEntityLinks(Object(Title)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #9 /vagrant/mediawiki/extensions/Wikibase/client/includes/LangLinkHandler.php(352): Wikibase\Client\LangLinkHandler->getEffectiveRepoLinks(Object(Title), Object(ParserOutput)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #10 
/vagrant/mediawiki/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php(97): Wikibase\Client\LangLinkHandler->addLinksFromRepository(Object(Title), Object(ParserOutput)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #11 /vagrant/mediawiki/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php(65): Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers->doContentAlterParserOutput(Object(Title), Object(ParserOutput)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #12 /vagrant/mediawiki/includes/Hooks.php(174): Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers::onContentAlterParserOutput(Object(WikitextContent), Object(Title), Object(ParserOutput)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #13 /vagrant/mediawiki/includes/Hooks.php(202): Hooks::callHook('ContentAlterPar...', Array, Array, NULL) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #14 /vagrant/mediawiki/includes/content/AbstractContent.php(559): Hooks::run('ContentAlterPar...', Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #15 /vagrant/mediawiki/includes/Revision/RenderedRevision.php(265): AbstractContent->getParserOutput(Object(Title), 1, Object(ParserOptions), true) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #16 /vagrant/mediawiki/includes/Revision/RenderedRevision.php(234): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(Object(WikitextContent), true) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #17 /vagrant/mediawiki/includes/Revision/RevisionRenderer.php(214): MediaWiki\Revision\RenderedRevision->getSlotParserOutput('main') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #18 /vagrant/mediawiki/includes/Revision/RevisionRenderer.php(151): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(Object(MediaWiki\Revision\RenderedRevision), Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #19 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(Object(MediaWiki\Revision\RenderedRevision), Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #20 /vagrant/mediawiki/includes/Revision/RenderedRevision.php(197): call_user_func(Object(Closure), Object(MediaWiki\Revision\RenderedRevision), Array) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #21 /vagrant/mediawiki/includes/content/ContentHandler.php(1367): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput() ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #22 /vagrant/mediawiki/extensions/CirrusSearch/includes/Updater.php(377): ContentHandler->getParserOutputForIndexing(Object(WikiPage), Object(ParserCache)) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #23 /vagrant/mediawiki/extensions/CirrusSearch/includes/Updater.php(458): CirrusSearch\Updater::buildDocument(Object(CirrusSearch\CirrusSearch), Object(WikiPage), Object(CirrusSearch\Connection), 0, 0, 4) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #24 /vagrant/mediawiki/extensions/CirrusSearch/includes/Updater.php(236): CirrusSearch\Updater->buildDocumentsForPages(Array, 5) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: 
#25 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/forceSearchIndex.php(219): CirrusSearch\Updater->updatePages(Array, 5) ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #26 /vagrant/mediawiki/extensions/CirrusSearch/tests/jenkins/cleanSetup.php(45): CirrusSearch\ForceSearchIndex->execute() ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #27 /vagrant/mediawiki/maintenance/update.php(217): CirrusSearch\Jenkins\CleanSetup->execute() ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #28 /vagrant/mediawiki/maintenance/doMaintenance.php(99): UpdateMediaWiki->execute() ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #29 /vagrant/mediawiki/maintenance/update.php(277): require_once('/vagrant/mediaw...') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #30 /var/www/w/MWScript.php(98): require_once('/vagrant/mediaw...') ==> default: Notice: /Stage[main]/Mediawiki/Exec[update_all_databases]/returns: #31 {main} ``` Workaround: run vagrant provision a second time
    • Task
    Reindexing large wikis is becoming very difficult (c.f. T227136). It seems that the current reindexing process, which is based on the internal mechanism provided by elastic, is not able to retry any failed query. The reason is that the scrolled queries are not retriable (ref https://github.com/elastic/elasticsearch/issues/26153). Fixing this is tracked upstream by https://github.com/elastic/elasticsearch/pull/25797 where they suggest that an API be added to create and maintain a reference to a lucene IndexReader, giving the possibility to sort on `_doc` (lucene internal ids) and use `searchAfter`. This has been marked as a high hanging fruit. It's likely that if this feature is implemented the reindex process will rely on it. We could alternatively re-implement our own reindex mechanism. We don't strictly need an immutable IndexReader; we just need a stable sort field that we could use with searchAfter. Currently `_id` does not have `doc_values` enabled, making it hard to use as a sort criterion; we'd have to duplicate it into a new field.
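    For reference, a sketch of what such a home-grown scan loop could look like, written as raw query DSL arrays; `page_id` stands in for whatever doc-values-enabled duplicate of `_id` we would add, and the actual request/bulk-index plumbing is elided:
    ```lang=php
    <?php
    // Sketch only: page through the source index with search_after on a stable,
    // doc-values-enabled sort field, so each page request can be retried
    // independently (unlike a scroll).
    $searchAfter = null;
    do {
        $body = [
            'size' => 1000,
            'query' => [ 'match_all' => (object)[] ],
            'sort' => [ [ 'page_id' => 'asc' ] ],
        ];
        if ( $searchAfter !== null ) {
            $body['search_after'] = $searchAfter;
        }
        // Run $body against the source index (retrying on failure) and bulk-index
        // the returned documents into the destination index.
        $hits = []; // placeholder so the sketch stands alone
        $last = end( $hits );
        $searchAfter = $last ? $last['sort'] : null;
    } while ( $searchAfter !== null );
    ```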
    • Task
    While looking into {T223787}, I discovered that some seemingly reasonable and reasonably common suffixes are not handled by the Slovak stemmer. These were not discovered during the earlier review because we usually focus on looking for false positives rather than false negatives. We can probably improve the stemmer, so we should! Some examples that I found are [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Folding_Diacritics_in_Slovak/Stemmer_Struggles#Slovak_Stemmer_Struggles | documented here ]].
    • Task
    There are still a few config variables that were only needed during the transition of the CirrusSearch integration code out of Wikibase into the WikibaseCirrusSearch extension (e.g. wmgNewWikibaseCirrusSearch).
    • Task
    WikibaseCirrusSearch has a `UseCirrus` config variable that currently defaults to false, disabling its functionality. This is a confusing default, and can create hassles for developers trying to set it up in a testing environment. It should default to true. The trouble is that changing the default to true breaks several Selenium tests for the Wikibase and WikibaseLexeme extensions, as shown in the failing test jobs on https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseCirrusSearch/+/507597/. Adding a `$wgWBCSUseCirrus = false;` directive to `tests/selenium/LocalSettings.php` in WikibaseCirrusSearch does not resolve this problem, and based on my local testing, doing the same directly in Wikibase does not help, either. Ultimately, we shouldn't be trying to work around the problem; the tests and CI environment should be able to accommodate WikibaseCirrusSearch.
    • Task
    Files on Commons get structured data added to them. We started with captions and depicts. This will be expanded over time. When the image is an exact digital representation of the artwork, most of the structured data lives on Wikidata. For example for https://commons.wikimedia.org/wiki/File:Christoph_Unterberger_-_Der_heilige_Johannes_von_Nepomuk_empf%C3%A4ngt_von_Maria_den_Sternenkranz_-_2173_-_%C3%96sterreichische_Galerie_Belvedere.jpg , the structured data lives on https://www.wikidata.org/wiki/Q28008188 . The file and the item are linked using the property digital representation of (P6243), which is added to the file (https://commons.wikimedia.org/w/index.php?title=File:Christoph_Unterberger_-_Der_heilige_Johannes_von_Nepomuk_empf%C3%A4ngt_von_Maria_den_Sternenkranz_-_2173_-_%C3%96sterreichische_Galerie_Belvedere.jpg&diff=prev&oldid=351075792 ). In normal cases the search engine grabs all the local structured data; if P6243 is encountered, the data from the linked item should be included too.
    • Task
    Currently WikibaseCirrusSearch has zero browser tests. While it has unit tests for most of the functionality, unit tests do not guarantee that the whole roundtrip for search works, and there have been instances where all unit tests pass but search functionality is broken. It would be nice to have a functional test suite for the search, ensuring that basic functionality and keywords work as expected. Since CirrusSearch tests right now run on cindy, it could probably make sense to run this part on cindy too. Or make our standard CI Selenium tests able to support CirrusSearch tests.
    • Task
    Slightly difficult to explain this one. When searching in Wikidata, the search box proffers a set of matches as one types in the search term. Consider the search for Anna Taylor. When searching for this string, the dropdown is populated up to the point that the full Anna Taylor string is typed in, whereupon it becomes blank. Add another space, and it reappears.
    Steps to reproduce
    - On wikidata
    - Enter Anna Taylo
    - Observe that the search provides dropdown suggestions
    - Add the final r
    - Observe that the search dropdown disappears
    - Add a space
    - Observe that the search dropdown reappears
    Expected result: The search dropdown should not disappear on typing in the final r.
    This seems to be a general problem with Wikidata (in Firefox and Safari from a Mac). Does not occur on e.g. en.wiki.
    • Task
    Rationale: when I run a search for something in Wikisource, with high probability a convenient way to read, or even edit, the Wikisource text is very useful. Not a monumental improvement, but it will be very useful to a small number of people, and hopefully it will be a 5-liner... :) peace
    • Task
    zhwiki and jawiki because of T219533, and kowiki because of T216738.
    Per @EBernhardson, new models are automatically built and shipped to prod; they just need to be tested. Testing consists of running A/B tests.
    First, deploy a config patch; see the example here: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/537637/4/wmf-config/InitialiseSettings.php. This will wrap the config object with our custom values. The new models also need to be added to `wgCirrusSearchMLRModel` in that same `InitialiseSettings.php` file.
    Then, turn on the A/B tests; example here: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/537639/3/modules/all/ext.wikimediaEvents.searchSatisfaction.js
    • Task
    English-language wikis use `aggressive_splitting`, which is a language analysis filter (a version of Elasticsearch's [[ https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html#analysis-word-delimiter-tokenfilter | Word Delimiter Token Filter ]]) that splits words on case changes (as was the original issue in this ticket) and in other circumstances. Investigate applying it everywhere, or at least for many more languages.
    ---
    Original task title & description: **Cross-wiki search tokenizer is better than local search one**
    [[https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Recherche?search=FilesystemHierarchyStandard&sourceid=Mozilla-search | Searching for “FilesystemHierarchyStandard” in fr.wp]] gives me no local result but several results from en.wp, including [en:Filesystem Hierarchy Standard], whereas the equivalent [fr:Filesystem Hierarchy Standard] exists. I’ve already encountered this strange issue: global search is sometimes better than local search, especially in phrase tokenization (when I missed spaces). Maybe it’s because I use English phrasing on a French wiki?
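    For context on the filter mentioned above, the settings involved look roughly like the PHP-array form of the index analysis config below; the option names are standard Elasticsearch `word_delimiter` parameters, but the exact configuration CirrusSearch ships may differ:
    ```lang=php
    <?php
    // Illustrative analysis settings: a word_delimiter-based filter that splits
    // on case changes, so "FilesystemHierarchyStandard" also produces the tokens
    // "filesystem", "hierarchy" and "standard".
    $analysisSettings = [
        'analysis' => [
            'filter' => [
                'aggressive_splitting' => [
                    'type' => 'word_delimiter',
                    'split_on_case_change' => true,
                    'preserve_original' => false,
                    'stem_english_possessive' => false,
                ],
            ],
            'analyzer' => [
                'text_split' => [
                    'type' => 'custom',
                    'tokenizer' => 'standard',
                    'filter' => [ 'aggressive_splitting', 'lowercase' ],
                ],
            ],
        ],
    ];
    ```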
    • Task
    A large number of `[2019-03-22T14:19:52,864][WARN ][org.elasticsearch.deprecation.common.ParseField] Deprecated field [_retry_on_conflict] used, expected [retry_on_conflict] instead` messages are seen on the upgraded elasticsearch 6 nodes. * [] T209859 Setting a negative [weight] in Function Score Query is deprecated and will throw an error in the next major version * [] T219265: _retry_on_conflict -> retry_on_conflict * [] T219266: nested_path/nested_filter has been deprecated in favour of the [nested] parameter * [] T219267: Deprecated field [auto_generate_phrase_queries] used, replaced by [This setting is ignored, use [type=phrase] instead to make phrase queries out of all text that is within query operators, or use explicitly quoted strings if you need finer-grained control] * [] T219268: The [classic] similarity is now deprecated in favour of BM25, which is generally accepted as a better alternative. Use the [BM25] similarity or build a custom [scripted] similarity instead.
    • Task
    The [HtmlFormatter project](https://www.mediawiki.org/wiki/HtmlFormatter) is used in a few (not that many) places: https://codesearch.wmflabs.org/deployed/?q=use%20HtmlFormatter%5C%5C&i=nope It is built on libxml and xpath with a bunch of hacks to avoid bugs, and a partial CSS-selector-to-xpath translator. We should rebase this on [Remex](https://www.mediawiki.org/wiki/RemexHtml) (to parse HTML) and [zest.php](https://github.com/cscott/zest.php) (to match selectors). This will allow us to reduce our dependence on libxml, increase code coverage and usage of Remex, improve corner-case parsing of HTML and selectors, and generally put our eggs in fewer baskets. (It's possible we shouldn't use zest, but should instead just use a slightly better version of CSS-selector-to-xpath, which can be shared with Parsoid.)
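    A very rough sketch of what the Remex + zest.php combination could look like; the class and method names are assumptions drawn from the two libraries' documentation, not verified API:
    ```lang=php
    <?php
    // Sketch: parse HTML with RemexHtml into a DOM, then match a CSS selector
    // with zest.php, instead of going through libxml plus a selector-to-xpath layer.
    use RemexHtml\DOM\DOMBuilder;
    use RemexHtml\Tokenizer\Tokenizer;
    use RemexHtml\TreeBuilder\Dispatcher;
    use RemexHtml\TreeBuilder\TreeBuilder;
    use Wikimedia\Zest\Zest;

    $html = '<div class="mw-parser-output"><p>Hello <b>world</b></p></div>';

    $domBuilder = new DOMBuilder();
    $treeBuilder = new TreeBuilder( $domBuilder );
    $dispatcher = new Dispatcher( $treeBuilder );
    $tokenizer = new Tokenizer( $dispatcher, $html, [] );
    $tokenizer->execute();
    $doc = $domBuilder->getFragment();

    // Find elements by CSS selector without any xpath translation.
    $matches = Zest::find( 'div.mw-parser-output > p', $doc );
    ```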
    • Task
    **Problem**
    Performing a search like this: https://it.wikipedia.org/w/api.php?action=query&prop=info%7Cpageprops&generator=prefixsearch&gpssearch=1%20dicembre&gpslimit=10&ppprop=disambiguation does not include the page with the exact title: https://it.wikipedia.org/w/api.php?action=query&prop=info%7Cpageprops&titles=1%20dicembre
    • Task
    As we plan to work more on autocomplete we should verify that all the data we need is available and collected properly.