    • Task
    The ML team plans to deprecate the `mediawiki.revision-score` stream in favor of a stream per model (T317768). CirrusSearch data pipelines should use `mediawiki.revision_score_drafttopic` instead of `mediawiki.revision-score` (see T328576). AC: - mediawiki.revision-score (hive tables and/or kafka topics) is no longer used by CirrusSearch data-pipelines
    • Task
    Cindy currently runs on a hacked-up cloud instance. Ideally we should migrate this integration test runner to a docker image that can easily be built and run anywhere. This would address a few issues: * The current instance fills up its disk at times and requires manual intervention to pass tests again. * Most devs can't easily reproduce these test failures locally; a docker-based instance should make them reproducible. * The current instance only gets to a runnable state after a few days of manual attention. * Cindy runs through vagrant, which is mostly unsupported these days; it should ideally run through a development system that has more support.
    • Task
    **Feature summary** Try https://commons.wikimedia.org/wiki/Special:Search or https://commons.wikimedia.org/wiki/Special:MediaSearch ; both have the same problem. Try searching "typewriter" or "nurse", etc. For Special:Search, untick the file namespace and/or gallery; for MediaSearch, click "Categories and Pages". You will have a hard time finding Category:Typewriters or Category:Nurses; lots of other categories appear before these. **Solution** Whatever the keyword is, "Category:<plural form>" is usually where files about that keyword are categorised. The plural form is often the -s form. I guess "search results sorted by relevance" are sorted by some kind of weight? Increase the category's weight if "Category:<search keyword>+s" exists and make it appear as close to the top as possible. **Benefits** Users want to go to the category. Give them that; don't waste users' time.
    • Task
    From logstash. Error started around 13th March 2023. Does anyone know of any changes that could have occurred around that date? * 730 errors since then * Only occur on mobile site and Chrome iOS * Seems to correspond with pages with ?searchToken query string parameter Nothing useful in stack trace ``` at https://en.m.wikipedia.org/wiki/Operating_theater#/search:20:44 at https://en.m.wikipedia.org/wiki/Operating_theater#/search:20:326 at rm https://en.m.wikipedia.org/wiki/Operating_theater:171:370 at Ho https://en.m.wikipedia.org/wiki/Operating_theater:214:400 at https://en.m.wikipedia.org/wiki/Operating_theater:211:29 at https://en.m.wikipedia.org/wiki/Operating_theater:205:345 at Rd https://en.m.wikipedia.org/wiki/Operating_theater:515:108 at b https://en.m.wikipedia.org/wiki/Operating_theater:512:477 ``` https://logstash.wikimedia.org/goto/59a4ab5c653c74220a567581637869bf
    • Task
    The current implementation of the search update pipeline supports an early version of the page-change schema (`/development/mediawiki/page/change/1.0.0`). That schema has since been removed and replaced by the stable version `/mediawiki/page/change/1.0.0`. AC: - the pipeline is able to read events from the page-state MW stream using the `/mediawiki/page/change/1.0.0` schema.
    • Task
    **Steps to replicate the issue** (include links if applicable): * Search for "Kementrian Perhubungan" (Department of Transportation) in Indonesian Wikipedia: https://id.wikipedia.org/w/index.php?title=Istimewa:Pencarian&search=Kementrian+Perhubungan . Note that `Kementrian` is misspelled and should be `Kementerian` * The top 500 results don't contain what I'm looking for: https://id.wikipedia.org/wiki/Kementerian_Perhubungan_Republik_Indonesia (the full name) or the disambig page at https://id.wikipedia.org/wiki/Kementerian_Perhubungan . Notice there's a slight variation in the word `Kementrian` vs. `Kementerian` {F36921344} * I could still understand if it's not the first result, but it should at least be in the top 3 or 5. But not even in the top 500?? What's going on here?? * Meanwhile, the autocomplete in the search box itself returned at least five articles with the prefix "Kementerian Perhubungan" {F36921347} * I've tested it in logged-out mode, with the same result. **What happens?**: Searching the string "kementrian perhubungan" without quotes or "intitle", neither https://id.wikipedia.org/wiki/Kementerian_Perhubungan_Republik_Indonesia nor https://id.wikipedia.org/wiki/Kementerian_Perhubungan is in the first 500 results, although the search dropdown could easily find the article I'm looking for. **What should have happened instead?**: The search should be clever enough to see that the article I want has only a 1-letter difference, as the autocomplete (Img 2.) can automatically guess. **Software version** (skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): I tried other words starting with "Kementrian" and got the same weird results. I suppose the autocomplete employs Levenshtein distance, while the search does not.
    • Task
    • Closed
    It looks like at some point I created files in this subpath as `ebernhardson`, rather than as `analytics-search` as expected. I noticed today, while deploying a script that cleans up (deletes) old data in hadoop, that it can't do the cleanup due to permissions issues. Requested fix: ``` hdfs dfs -chown -R analytics-search:analytics-search-users hdfs://analytics-hadoop/wmf/data/discovery ``` Also curious whether there is some way we can prevent this in the future, perhaps by having the directories owned by a group that only analytics-search is in?
    • Task
    This is the result of a search for "finger" on Commons, requesting file type GIF; the user was apparently hoping to find an image of someone "giving the finger": https://commons.wikimedia.org/w/index.php?search=finger&title=Special:MediaSearch&go=Go&type=image&filemime=gif (WARNING: THE ABOVE LINK GIVES A HIGHLY NSFW RESULT) I don't know what is going on behind the scenes here, but an innocuous search term should not result in displaying a set of sexually explicit videos. It seems to me that this is an inappropriate enough result that it should be treated with the same priority as a bug.
    • Task
    Hi! I was surprised to find out that it's not possible to sort search results alphabetically with the [[ https://www.mediawiki.org/wiki/API:Search | Search API ]]. Sorting alphabetically is quite often what you want when using the Search API to generate lists of pages that match certain criteria, for example for a report. I'm sure there are other queries that would benefit from it.
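    Until the API supports it, a caller can sort alphabetically client-side after fetching the results. A minimal sketch, assuming the English Wikipedia endpoint and the `requests` library (the `srsearch` criteria here are just an example):
```lang=python
import requests

# Hypothetical example query; any criteria usable with list=search works here.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "insource:foo",
    "srlimit": 50,
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
resp.raise_for_status()
hits = resp.json()["query"]["search"]

# Client-side workaround: sort the returned page titles alphabetically.
for hit in sorted(hits, key=lambda h: h["title"]):
    print(hit["title"])
```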
    • Task
    Hi! I think wikiprojects, maintenance pages, admin boards, backlogs and all sorts of other background pages and communities would benefit a lot from a way to embed search results. With all the new filters and features such as "linksto", "subpageof", "incategory", "insource", etc., search could become a VERY powerful way to create dynamic reports. Crucially, there should be some control over the format of the output, so the end result can be further customized via Lua modules or templates. A few days ago I created [[ https://www.mediawiki.org/wiki/Extension:SearchParserFunction | Extension:SearchParserFunction ]] to do just that, and it's already revolutionizing the wiki where I work (appropedia.org). I figured you may want to consider something similar for Wikimedia sites. It would also be a great way to multiply the impact of all the work you've been doing with search. BTW, thanks for that!
    • Task
    Hello @EBernhardson, the script failed yesterday; here is the email we got and the stack trace; ``` name=email Systemd timer ran the following command:     `/usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s1.dblist` its return value was 1 and emitted the following output: <13>Feb 27 19:03:59 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20230227/enwiki-20230227-cirrussearch-content.json.gz ``` ``` name=stack.trace,lines=10 Dumping 6624383 documents (6624383 in the index) 6% done... 8% done... ...... 58% done... Elastica\Exception\ClientException from line 26 of /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Connection/Strategy/Simple.php: No enabled connection #0 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Connection/ConnectionPool.php(86): Elastica\Connection\Strategy\Simple->getConnection(Array) #1 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Client.php(396): Elastica\Connection\ConnectionPool->getConnection() #2 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Client.php(512): Elastica\Client->getConnection() #3 /srv/mediawiki/php-1.40.0-wmf.24/vendor/ruflin/elastica/src/Search.php(348): Elastica\Client->request('enwiki_content/...', 'POST', Array, Array) #4 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/includes/Elastica/SearchAfter.php(90): Elastica\Search->search() #5 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/includes/Elastica/SearchAfter.php(70): CirrusSearch\Elastica\SearchAfter->runSearch() #6 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/maintenance/DumpIndex.php(163): CirrusSearch\Elastica\SearchAfter->next() #7 /srv/mediawiki/php-1.40.0-wmf.24/maintenance/includes/MaintenanceRunner.php(609): CirrusSearch\Maintenance\DumpIndex->execute() #8 /srv/mediawiki/php-1.40.0-wmf.24/maintenance/doMaintenance.php(99): MediaWiki\Maintenance\MaintenanceRunner->run() #9 /srv/mediawiki/php-1.40.0-wmf.24/extensions/CirrusSearch/maintenance/DumpIndex.php(288): require_once('/srv/mediawiki/...') #10 /srv/mediawiki/multiversion/MWScript.php(118): require_once('/srv/mediawiki/...') #11 {main} ``` We checked to see if enwiki-20230227-cirrussearch-content.json.gz was later completed; sadly, the file is still missing. Additionally, the wikidata files for the last run are missing; we didn’t get any errors or anything related to why the wikidata files weren’t completed. The cirrussearch wikidata log on that day(20230220) looks like this; It shows the dump didn’t complete ``` Dumping 101983781 documents (101983781 in the index) 2% done... ...... 30% done... 32% done... 34% done... 36% done...
    • Task
    **Steps to replicate the issue** (include links if applicable): * Search for book title "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀ꦭꦩꦶ" in Commons works (the full title), found [[ https://commons.wikimedia.org/wiki/File:%EA%A6%A5%EA%A6%BC%EA%A6%A5%EA%A6%BC%EA%A6%9B%EA%A6%B6%EA%A6%8F%EA%A7%80%EA%A6%8F%EA%A6%A4%EA%A7%80%EA%A6%B1%EA%A6%8F%EA%A6%B6%EA%A6%81%EA%A6%A5%EA%A6%BF%EA%A6%97%EA%A6%9A%EA%A7%80%EA%A6%97%EA%A6%B6%EA%A6%AA%EA%A6%A4%EA%A7%80%EA%A6%AD%EA%A6%A9%EA%A6%B6.pdf | the PDF ]] * Search partial "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀" or "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ" or "ꦥꦼꦥꦼꦛꦶꦏ꧀" or any parts of the full title won't work **What happens?**: * First identified 10 years ago, T46350, marked wont fix, since last time was during migration from Lucene to Cirrus Search. * I identified there was a problem with scriptio continua nature of Javanese script (no word marker) * T58505 Cirrus Search ticket was closed as solved **What should have happened instead?**: The Commons search et. al should be able to find parts of the full title. **Software version** (skip for WMF-hosted wikis like Wikipedia): **Other information** (browser name/version, screenshots, etc.): A bit background on the way the script is written: * Scriptio continua is a hassle to display in web, because there's no obvious line break. Therefore, in projects such as Wikisource, if not handled properly, would break the page view of the transcribed documents. (become too wide) * Using certain keyboards, including the way we handle the problem is, to automatically insert ZWS (zero-width space) after certain character (e.g. comma, period, etc. ) I can gave you the full list. Therefore, the line break would still works, except in very rare cases where there's no occurrences of those characters (thus the ZWS not auto-inserted). [the zws doesn't always equal to the Latin space] * AFAIK ZWS is not supported in page titles, (e.g. when I upload books with Javanese script titles that contain ZWS), so none of the titles in jv wiki projects contain ZWS, and thus the cirrus search won't be able to know the word delimiter.
    • Task
    NOTE: Feel free to untag your project and create subtasks if needed. In https://gerrit.wikimedia.org/r/c/mediawiki/core/+/865811 we updated the default for targets to be `["desktop", "mobile"]`. This means that any extensions/skins defining this are now redundant and can remove these definitions. The following query marks some of the offenders (regex to be improved later): https://codesearch.wmcloud.org/search/?q=%22targets%22&i=nope&files=(extension%7Cskin).json&excludeFiles=&repos= # TODO [] Remove the lines [] Bump the MediaWiki required version to 1.40
    • Task
    A new keyword `textbytes` should be added to allow filtering pages based on the value of the `text_bytes` field. The `text_bytes` field is populated from `Content::getSize()`, which describes itself as > Returns the content's nominal size in "bogo-bytes". What is behind `bogo-bytes` might remain mysterious, but for a wikitext page this is the number of bytes of the wikitext source encoded in `UTF-8`. The keyword will be usable the same way as the other numeric keywords we support ([[https://www.mediawiki.org/wiki/Help:CirrusSearch#File_measures|File measures]]). - comparison: `textbytes:<1500` or `textbytes:>1500` for all pages with `text_bytes` less than or greater than 1500, respectively - ranges: `textbytes:1500,10000` for all pages with `text_bytes` between 1500 and 10000 - exact matches are possible but probably useless, e.g. `textbytes:10` AC: A search query can be issued that filters based on the number of bytes in the source text (the text_bytes field of documents in elasticsearch)
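    For illustration, a sketch of the elasticsearch range filter such a keyword would presumably translate to, assuming the existing `text_bytes` integer field in the cirrus documents (the helper name and the exact DSL shape are assumptions, not the final implementation):
```lang=python
# A minimal sketch of the filter the proposed keyword could produce,
# shown as the elasticsearch query DSL built in Python.

def textbytes_filter(expr: str) -> dict:
    """Translate a textbytes:<N / >N / A,B expression into a range filter."""
    if expr.startswith(">"):
        return {"range": {"text_bytes": {"gt": int(expr[1:])}}}
    if expr.startswith("<"):
        return {"range": {"text_bytes": {"lt": int(expr[1:])}}}
    if "," in expr:
        lo, hi = expr.split(",", 1)
        return {"range": {"text_bytes": {"gte": int(lo), "lte": int(hi)}}}
    return {"term": {"text_bytes": int(expr)}}

print(textbytes_filter(">1500"))
# {'range': {'text_bytes': {'gt': 1500}}}
```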
    • Task
    The `outlink` topic model is able to do topic detection in all languages; we should use this data to possibly drop the need to import ORES `articletopics` model predictions. The search jobs should push `outlink` predictions to the existing `articletopics` weighted_tags prefix (the set of topics is the same): - the search jobs should consume events from the dedicated `mediawiki.revision_score_$model` stream (blocked on T328576) - there should be no impact on the `articletopics` search keyword - the predictions made by the ORES `articletopics` model will be slowly replaced by the `outlink` model ones as new edits are made to existing pages - thresholds will be set statically in the code-base to `0.5` instead of being fetched from the ORES API AC: - outlink topic model predictions are pushed to the CirrusSearch indices and are queryable (via the existing articletopics keyword or a new one) - crosswiki propagation is no longer required and can be removed from the discolytics codebase. -- The ORES `drafttopic` model will still be consumed as is, but that should not be a reason to keep the `crosswiki` propagation if it uses it (we could even consider only using it to populate the draft namespace?).
    • Task
    New articles not appearing in search bar autocomplete. When typing a title into the search box (enwiki), relatively new articles do not appear in the list of suggestions. For example, https://en.wikipedia.org/wiki/Statue_of_Queen_Victoria,_Hove was created 2022-12-23. When I begin searching the page title, the article is not suggested until I type the complete page title perfectly (e.g., start typing "Statue of Queen Victoria, Hov" and nothing appears). Reported by multiple users at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_202#Articles_created_not_appearing_in_search_results Steps to reproduce: 1) Create a page with a new title 2) (optional, but done to rule out problems) mark the page as patrolled 3) Wait some time (in this example, weeks) 4) Try to search for the title on wiki What is being reported: 1) The search autocomplete will not complete the title 2) When searching for the title by a substring of the title name (in this case just leaving off some characters from the end), the title is not included in the search results What is expected: Both the autocomplete and the search results should include the page. The search documentation at https://www.mediawiki.org/wiki/Help:CirrusSearch#How_frequently_is_the_search_index_updated? suggests indexing occurs in "near real time".
    • Task
    Reproduce: Copy some text that begins with no-break spaces and paste it into the main search box or value selector, such as "<nbsp>Wikipedia" Expected: no-break spaces (and other [[ https://en.wikipedia.org/wiki/Whitespace_character | whitespace characters]]) before and after search terms are stripped, similar to Wikipedia opensearch (which uses the opensearch endpoint) Actual: You do not get the item for Wikipedia: https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%C2%A0Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item This is inconvenient if you are copying search terms from other websites that contain such whitespace characters, as users may be unaware of the existence of such characters. Note: Space characters that are stripped: [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%20Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | space ]], [[https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%09Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item|tab]], [[https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%0aWikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item|linebreak]] Space characters that are not stripped (but should be stripped): [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%E2%80%82Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | en-space ]], [[ https://www.wikidata.org/w/api.php?action=wbsearchentities&search=%E3%80%80Wikipedia&format=json&errorformat=plaintext&language=en&uselang=en&type=item | ideographic space ]]
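    A minimal sketch of the kind of normalization the endpoint could apply before matching; Python's default `str.strip()` already covers NBSP, en-space and ideographic space, so it is used here purely as an illustration of the expected behaviour (the query parameters mirror the links above):
```lang=python
import requests

def normalize_term(term: str) -> str:
    # str.strip() with no arguments removes all Unicode whitespace,
    # including U+00A0 (no-break space), U+2002 (en space) and U+3000
    # (ideographic space), not just ASCII spaces/tabs/newlines.
    return term.strip()

term = "\u00a0Wikipedia"
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": normalize_term(term),  # search for "Wikipedia", not "\xa0Wikipedia"
        "language": "en",
        "uselang": "en",
        "type": "item",
        "format": "json",
    },
)
print([m.get("label") for m in resp.json().get("search", [])])
```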
    • Task
    Following up on the work done in T324576 we should create helm charts in the deployment-charts repo to deploy the cirrus-streaming-updater flink jobs to the wikikube k8s cluster. The chart should use the features provided by the flink operator (i.e. the FlinkDeployment CRD). The service will need access to: - thanos-swift, using the S3 compat layer - the kafka-main and/or kafka-test clusters - an elasticsearch cluster for testing - the mediawiki API (api-ro can be used for testing low-volume wikis, but an async MW cluster should be used instead; existing jobrunners might be usable, ref: T317283#8220470). AC: - a chart using the flink-k8s-operator is added to deployment-charts - the flink jobs are deployed to the wikikube staging cluster - updates from testwiki are propagated to a test elasticsearch cluster
    • Task
    Following up on the work done in T316519, we should create two docker images for the two flink jobs powering the cirrus streaming updater. Open questions: - should we use blubber to generate these images? - should we use a single repo and have a new image generated per patch, or a separate repo with a release process? AC: - images of the 2 flink jobs are available in docker-registry.wikimedia.org
    • Task
    Reproduce: 1. https://www.wikidata.org/w/index.php?search=Lexeme%3Aweek Expected: The search results should also display lexemes with "week" in the glosses of some senses, such as https://www.wikidata.org/wiki/Lexeme:L778137
    • Task
    When processing change-events the update pipeline must make sure that these changes are processed in order by the ingestion pipeline. Reordering implies some buffering, which might be done either using a flink window or a custom process function using timers; there are pros and cons to both: - windowing might help to have fewer timers but will hold a bigger state - a custom process function might have to trigger more timers but might allow optimizing (merging) the events on the fly Reordering should mainly use the rev_id to sort events, and the mediawiki timestamp when the rev_id does not change (delete vs undelete). To efficiently store the re-ordering state, the events will have to be partitioned using the key `[wiki_id, page_id]`. The optimization step should merge events related to the same page with a set of rules that will have to be specified and documented (possibly in this ticket). Having an idea of how many events we could merge (deduplicate) would be interesting (trade-off between state size, latency and de-duplication efficiency). This should be doable with a quick analysis of the `cirrusSearchLinksUpdate` job backlog (from kafka jumbo) over time windows of 1, 2, 5 and 10 minutes, counting how many events can be de-duplicated. AC: - the preparation job has a buffering operator to properly re-order events - the preparation job should merge events related to the same page into a single event using rules that will have to be documented - the buffering operator should have a tunable buffer delay - the buffering operator should expose metrics about how many events are de-duplicated
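    A minimal sketch of the intended per-key behaviour, written as plain Python rather than flink code: key on `[wiki_id, page_id]`, sort by `rev_id` with the mediawiki timestamp as tie-breaker, and collapse the buffer. The event shape and the last-event-wins merge rule are illustrative assumptions:
```lang=python
from collections import defaultdict

# Toy change-events; the real schema is richer, this only keeps the keys
# needed to demonstrate re-ordering and de-duplication.
events = [
    {"wiki_id": "enwiki", "page_id": 42, "rev_id": 1002, "dt": "2023-01-01T10:01:00Z", "type": "edit"},
    {"wiki_id": "enwiki", "page_id": 42, "rev_id": 1001, "dt": "2023-01-01T10:00:00Z", "type": "edit"},
    {"wiki_id": "enwiki", "page_id": 42, "rev_id": 1002, "dt": "2023-01-01T10:02:00Z", "type": "delete"},
]

# Partition by the proposed key so each buffer only holds one page's events.
buffers = defaultdict(list)
for ev in events:
    buffers[(ev["wiki_id"], ev["page_id"])].append(ev)

for key, buf in buffers.items():
    # Re-order: rev_id first, mediawiki timestamp as tie-breaker
    # (covers delete vs undelete on the same revision).
    buf.sort(key=lambda ev: (ev["rev_id"], ev["dt"]))
    # Naive merge rule for the sketch: the last event in order wins.
    merged = buf[-1]
    print(key, "->", merged["type"], merged["rev_id"], f"(deduplicated {len(buf) - 1})")
```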
    • Task
    CirrusSearch should track page re-renders to update its index whenever a change external to the page itself (template, page properties, lua...) might affect its rendered version. As of today CirrusSearch tracks this using the [[https://www.mediawiki.org/wiki/Manual:Hooks/LinksUpdateComplete|LinksUpdateComplete]] mediawiki hook. For the rewrite of the update-pipeline we should consider using similar events to trigger page updates that are not revision based. When the LinksUpdateComplete hook is triggered CirrusSearch should emit a change-event, and ideally should avoid emitting an event if this change relates to a change already captured by the page-state stream. The content of the event should contain everything required to enrich it: - domain - wiki_id - page_id - page_namespace - page_title (not strictly required but perhaps useful for debug purposes?) - timestamp (probably the current time at which the MW hook is executed?) Ideally the index_name and cluster_group should be part of these events so that we save a call to the mw API. Open question: Should we enrich during the preparation job or the ingestion job? Enriching during the preparation job might require some non-negligible space on the target kafka cluster to store this: `kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention` If we take: * re_renders_rate: 400 re-renders/s ([[https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchLinksUpdate|estimated from the current cirrusSearchLinksUpdate insertion rate]]). * avg_doc_size: 20KiB * compression_ratio: 2:1 * kafka_retention: 604800 secs (7 days) `kafka_log_size = 400 * (20KiB/2) * 604800 = 2.25TiB` In addition to the kafka log size we also need to estimate the size of the flink state holding the window used for event re-ordering and optimizations. Assuming a 10-minute window it would be: `flink_state_size = 400 * 20KiB * 600 = 4.6GiB` (at least) Having page re-render content in kafka might allow us to replay these updates during an in-place re-index and save one API call for cloudelastic, but it's not clear that the space cost is worth it. Another approach is doing the enrichment of page re-renders during the ingestion job: - will help to keep the kafka backlog and the flink state smaller - we probably won't want to replay such updates after an in-place reindex (we don't replay those today anyway) - this content is not addressable (not bound to a specific revision) so there's no strong reason to capture and store it - unsure we want to track an error side-output for this kind of update - will be a natural throttling mechanism to ensure that revision-based updates are prioritized - between 65% and 80% of these updates are discarded when hitting elasticsearch AC: - write a schema that supports such update events - emit these events from CirrusSearch (using EventBus?) - consume these events from the producer job - enrich the events (from the preparation or the ingestion job, see the open question)
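    A quick check of the two back-of-the-envelope estimates above, using the same inputs expressed in Python:
```lang=python
KiB = 1024
GiB = 1024 ** 3
TiB = 1024 ** 4

re_renders_rate = 400          # re-renders/s
avg_doc_size = 20 * KiB        # bytes
compression_ratio = 2          # 2:1
kafka_retention = 604800       # seconds (7 days)

kafka_log_size = re_renders_rate * (avg_doc_size / compression_ratio) * kafka_retention
print(f"kafka_log_size   = {kafka_log_size / TiB:.2f} TiB")   # ~2.25 TiB

window = 600                   # 10-minute re-ordering window, in seconds
flink_state_size = re_renders_rate * avg_doc_size * window
print(f"flink_state_size = {flink_state_size / GiB:.2f} GiB")  # ~4.58 GiB
```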
    • Task
    We're starting to see this problem on the search airflow instance (an-airflow1001). A simple sensor task might fail with the following context (mail date: `Wed, 14 Dec 2022 17:00:19 +0000`): ``` Try 0 out of 5 Exception: Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally? Log: Link Host: an-airflow1001.eqiad.wmnet Log file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00.log Mark success: Link ``` When looking at the actual state for this task it has succeeded: ``` *** Reading local file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00/1.log [2022-12-14 16:28:13,261] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:28:13,309] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:28:13,310] {taskinstance.py:841} INFO - -------------------------------------------------------------------------------- [2022-12-14 16:28:13,310] {taskinstance.py:842} INFO - Starting attempt 1 of 5 [2022-12-14 16:28:13,310] {taskinstance.py:843} INFO - -------------------------------------------------------------------------------- [2022-12-14 16:28:13,361] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 16:28:13,365] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310416', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpzo8pxyn_'] [2022-12-14 16:28:14,038] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,038] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. 
pool_size=5, max_overflow=10, pool_recycle=1800, pid=32615 [2022-12-14 16:28:14,790] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,788] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 16:28:14,791] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,790] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 16:28:14,925] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,923] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 16:28:15,203] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,202] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:28:15,205] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,205] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:28:15,209] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 16:28:15,262] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 16:28:15,387] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE [2022-12-14 16:28:18,218] {logging_mixin.py:112} INFO - [2022-12-14 16:28:18,217] {local_task_job.py:124} WARNING - Time since last heartbeat(0.03 s) < heartrate(5.0 s), sleeping for 4.97169 s [2022-12-14 16:28:23,199] {logging_mixin.py:112} INFO - [2022-12-14 16:28:23,196] {local_task_job.py:103} INFO - Task exited with return code 0 [2022-12-14 16:31:27,704] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:31:27,765] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 16:31:27,765] {taskinstance.py:841} INFO - [...] -------------------------------------------------------------------------------- [2022-12-14 16:56:59,970] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 16:56:59,970] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310665', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmp03vdcbvm'] [2022-12-14 16:57:00,662] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:00,661] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. 
pool_size=5, max_overflow=10, pool_recycle=1800, pid=25095 [2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,595] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,596] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 16:57:01,696] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,695] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 16:57:01,987] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,986] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:57:01,989] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,988] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 16:57:01,993] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 16:57:02,064] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 16:57:02,219] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE [2022-12-14 16:57:04,388] {logging_mixin.py:112} INFO - [2022-12-14 16:57:04,387] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960811 s [2022-12-14 16:57:09,356] {logging_mixin.py:112} INFO - [2022-12-14 16:57:09,354] {local_task_job.py:103} INFO - Task exited with return code 0 [suspicious hole here, time when the mail was sent] [2022-12-14 17:15:26,675] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 17:15:26,724] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]> [2022-12-14 17:15:26,724] {taskinstance.py:841} INFO - -------------------------------------------------------------------------------- [...] -------------------------------------------------------------------------------- [2022-12-14 18:25:45,431] {taskinstance.py:842} INFO - Starting attempt 1 of 5 [2022-12-14 18:25:45,431] {taskinstance.py:843} INFO - -------------------------------------------------------------------------------- [2022-12-14 18:25:45,479] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00 [2022-12-14 18:25:45,479] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2311270', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpymzjynuf'] [2022-12-14 18:25:46,170] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:46,169] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. 
pool_size=5, max_overflow=10, pool_recycle=1800, pid=32566 [2022-12-14 18:25:47,164] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,163] {__init__.py:51} INFO - Using executor LocalExecutor [2022-12-14 18:25:47,165] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,164] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py [2022-12-14 18:25:47,282] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,281] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet [2022-12-14 18:25:47,558] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,557] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083 [2022-12-14 18:25:47,559] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,559] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083 [2022-12-14 18:25:47,563] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15 [2022-12-14 18:25:47,641] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15 [2022-12-14 18:25:47,712] {base_sensor_operator.py:123} INFO - Success criteria met. Exiting. [2022-12-14 18:25:50,354] {logging_mixin.py:112} INFO - [2022-12-14 18:25:50,353] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.955724 s [2022-12-14 18:25:55,320] {logging_mixin.py:112} INFO - [2022-12-14 18:25:55,316] {local_task_job.py:103} INFO - Task exited with return code 0 ``` Note the hole between `2022-12-14 16:57:09` and `2022-12-14 17:15:26` (when the email was sent), with a poke interval at 3minutes the sensor should have been queued at `2022-12-14 17:00:09` so something prevented it from running at the expected time. Full logs: P42712 Possible cause: https://github.com/apache/airflow/issues/10790 Possible workarounds: - increase poke_interval to 5mins or more - reduce the load on the machine (provision a bigger instance?)
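    If we go with the first workaround, the change is limited to the sensor arguments. A minimal sketch assuming the Airflow 1.x sensor used by this instance (the DAG boilerplate and the static partition spec are illustrative, not the actual DAG definition):
```lang=python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.named_hive_partition_sensor import NamedHivePartitionSensor

with DAG(
    dag_id="mediawiki_revision_recommendation_create_hourly",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@hourly",
) as dag:
    wait_for_data = NamedHivePartitionSensor(
        task_id="wait_for_data",
        partition_names=[
            # Static partition spec copied from the log above, for illustration only.
            "event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15",
        ],
        poke_interval=int(timedelta(minutes=5).total_seconds()),  # was 3 minutes
        mode="reschedule",  # keep releasing the worker slot between pokes
        retries=5,
    )
```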
    • Task
    **Problem Description** Following our recent upgrades of MediaWiki (1.36.4 to 1.38.2) and Semantic MediaWiki (3.2.3 to 4.0.2), we started seeing pages often showing unparsed SMW inline queries as wikitext in the HTML output that is ultimately shown to users, as well as SMW queries sometimes returning empty or incomplete results. These SMW queries are typically defined in templates. Extensive further details can be found here: - [[ https://wiki.guildwars2.com/wiki/Guild_Wars_2_Wiki:Reporting_wiki_bugs#Empty_or_incomplete_SMW_queries | Empty or incomplete SMW queries ]] - [[ https://wiki.guildwars2.com/wiki/Guild_Wars_2_Wiki:Reporting_wiki_bugs#Unparsed_pages_after_null_edit | Unparsed pages after null edit ]] **Troubleshooting** Initial troubleshooting included removing most extensions by commenting out their loading from LocalSettings.php. Removing CirrusSearch and Elastica, which seemed related to timing issues, caused the unparsed query issue to disappear. However, through further testing we found a way to mitigate that problem without having to remove CirrusSearch and Elastica. Our mitigation involved changing how our MediaWiki job runners work. We have four SMW-enabled wikis, each with its own job runner systemd service, where the service is a shell script that runs `runJobs.php` with `--wait --maxjobs=1000 --procs=3` to parallelize the processing and limit the script's runtime to avoid memory leaks, etc. This is basically as recommended by https://www.mediawiki.org/wiki/Manual:Job_queue. The changes made involved removing the use of `--procs` so that jobs run completely sequentially, as well as grouping together jobs in the queue by type for each iteration of the infinite loop.
```
job_types="
  enotifNotify
  cirrusSearchLinksUpdatePrioritized
  cirrusSearchElasticaWrite
  cirrusSearchLinksUpdate
  cirrusSearchIncomingLinkCount
  recentChangesUpdate
  htmlCacheUpdate
  refreshLinks
"
maxjobs=1000
memlimit="512M"

while true; do
    for job_type in ${job_types}; do
        while [[ $(/usr/bin/php ./maintenance/showJobs.php --type ${job_type} --list | wc -l) -gt 0 ]]; do
            /usr/bin/php ./maintenance/runJobs.php --maxjobs=${maxjobs} --memory-limit=${memlimit} --type ${job_type}
        done
    done
    /usr/bin/php ./maintenance/runJobs.php --maxjobs=${maxjobs} --memory-limit=${memlimit} --wait
done
```
Our testing only showed this to "fix" the display of unparsed queries but not the problem of empty/incomplete query results. I have deployed the above runner logic change to our production wiki environment and continue to work with our editors to monitor the wikis for these problems. However, it is quite unclear why this problem suddenly started after our aforementioned upgrades and why the runner logic change seems to fix part of the problem.
    • Task
    The way WikibaseCirrusSearch generates its elasticsearch mapping is sub-optimal for non-english users. The approach taken by WikibaseCirrusSearch to deal with the multilingual nature of the wikidata content is to create a field per language for the `labels` field. Having a subfield per language is costly in term elasticsearch resources but does allow a great level of customization, sadly WikibaseCirrusSearch does not take full benefits of it. As of today the mapping for the `labels.ko` field is: ```lang=json "labels": { "properties": { "ko": { "type": "text", "fields": { "near_match": { "type": "text", "index_options": "docs", "analyzer": "near_match" }, "near_match_folded": { "type": "text", "index_options": "docs", "analyzer": "near_match_asciifolding" }, "plain": { "type": "text", "analyzer": "ko_plain", "search_analyzer": "ko_plain_search", "similarity": "bm25", "position_increment_gap": 10 }, "prefix": { "type": "text", "index_options": "docs", "analyzer": "prefix_asciifolding", "search_analyzer": "near_match_asciifolding" } }, "copy_to": [ "labels_all" ] } } ``` This does index a field name `labels.ko` using the elasticsearch default text analyzer. The mapping should look like: ```lang=json "labels": { "properties": { "ko": { "type": "text", "fields": { "near_match": { "type": "text", "index_options": "docs", "analyzer": "near_match" }, "near_match_folded": { "type": "text", "index_options": "docs", "analyzer": "near_match_asciifolding" }, "plain": { "type": "text", "analyzer": "ko_plain", "search_analyzer": "ko_plain_search", "similarity": "bm25", "position_increment_gap": 10 }, "prefix": { "type": "text", "index_options": "docs", "analyzer": "prefix_asciifolding", "search_analyzer": "near_match_asciifolding" } }, "copy_to": [ "labels_all" ], "analyzer": "ko_text", "search_analyzer": "ko_text_search", "similarity": "bm25", "position_increment_gap": 10 } } ``` And then when searching in korean the filter should be adapted to also query the `labels.ko` field and its language fallbacks like what's done for the `descriptions.$lang` field. AC: - searching for a word in a label should use the language specific analyzers: e.g. searching for a korean word part of a label labelled as korean should yield search results (TODO: add a specific example here)
    • Task
    As a maintainer of the search infrastructure I want the [[https://integration.wikimedia.org/ci/job/selenium-daily-beta-CirrusSearch/|selenium-daily-beta-CirrusSearch]] job to inform me of a real problem when it fails so that I can react quickly to fix the CirrusSearch code-base. The problem is that it is a browser test, but the Search team no longer owns the search UI, only the search backend and its APIs. In recent months this test has been failing randomly, causing noise for the team. We should decide whether we want to keep this job because we believe it has value: - know when elasticsearch on the beta cluster is malfunctioning - know when CirrusSearch is broken and no longer responds to basic API requests (likely caught by cindy) - anything else we believe it has value for? or simply drop it because we believe that it does not have any value. AC: - make a decision: keep or drop - if we decide to keep it we should fix it to make it useful (rewrite this ticket with a proposed solution, e.g. only do API testing)
    • Task
    **Feature summary** (what you would like to be able to do and where): Search should show the actual result for the page if it exists and not just sub-pages **Use case(s)** (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): {F35660063} The Community Wishlist Survey 2022 page exists but it does not show up anywhere in the results. It's all just subpages. **Benefits** (why should this be implemented?): People can find what they are looking for
    • Task
    As a maintainer of the search infrastructure I want to monitor the update lag of the search indices so that I can evaluate whether the system performance matches our expectations. What we want to track here is the time needed for a change to propagate to elasticsearch, in other words the time spent in the update pipeline. We do not want to track the time elasticsearch takes to refresh its data structures to make these changes visible to users (index refresh setting); that value can be extracted using the script at P17040. The data should be aggregated on: - the kind of update: revision-based or page refresh (i.e. in cirrus world: LinksUpdatePrioritized vs LinksUpdate) - the target elasticsearch cluster Out of scope for this ticket is the update lag of the mjolnir batch update pipeline. AC: - a new set of metrics is available in graphite - a new grafana dashboard is created to show the values of these metrics
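    A minimal sketch of how such a lag metric could be computed and reported, assuming the event carries its MediaWiki timestamp and using the `statsd` client library (metric names and aggregation dimensions are illustrative):
```lang=python
from datetime import datetime, timezone

import statsd  # pip install statsd

client = statsd.StatsClient("localhost", 8125, prefix="cirrus_update_lag")

def report_lag(event_dt_iso: str, update_kind: str, cluster: str) -> None:
    """Report how long a change took to reach elasticsearch.

    event_dt_iso: the MediaWiki event timestamp (ISO 8601, UTC).
    update_kind: "revision_based" or "page_refresh".
    cluster: target elasticsearch cluster, e.g. "eqiad".
    """
    event_dt = datetime.fromisoformat(event_dt_iso.replace("Z", "+00:00"))
    lag_ms = (datetime.now(timezone.utc) - event_dt).total_seconds() * 1000
    # One timer per (kind, cluster) so a dashboard can aggregate on both dimensions.
    client.timing(f"{update_kind}.{cluster}", lag_ms)

# Example: a revision-based update that just hit the eqiad cluster.
report_lag("2022-12-14T16:28:00Z", "revision_based", "eqiad")
```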
    • Task
    **Steps to replicate the issue** (include links if applicable): * [[https://en.wikipedia.org/w/index.php?title=Special:Search&search=insource%3A%2F%5C%3Cref%28+%5B%5E%5C%3E%5D%2A%29%3F%5C%3Ehttps%3F%3A%5C%2F%5C%2F%5B%5E+%5C%3C%5C%3E%5C%7B%5C%7D%5D%2B+%2A%5C%3C%5C%2Fref%2Fi&ns0=1&fulltext=Search | Make a search]] **What happens?**: Results are rendered with a DOM kind of like this: ```lang=html <ul class="mw-search-results"> <li class="mw-search-result mw-search-result-ns-0"> <table class="searchResultImage"> <tbody> <tr> <td class="searchResultImage-thumbnail"> <a href="..." class="image"> <img alt="" src="..." decoding="async" data-file-width="1920" data-file-height="2400" width="96" height="120"> </a> </td> <td class="searchResultImage-text"> <div class="mw-search-result-heading"> <a href="/wiki/Mecca" title="Mecca" data-serp-pos="0">Mecca</a> </div> <div class="searchresult"> many tunnels.<span class="searchmatch">&lt;ref&gt;https://www.constructionweekonline.com/projects-tenders/article-22689-makkah-building-eight-tunnels-to-ease-congestion&lt;/ref</span>&gt; ===Rapid... </div> <div class="mw-search-result-data">97 KB (10,753 words) - 09:06, 7 October 2022</div> </td> </tr> </tbody> </table> </li> ... </ul> ``` **What should have happened instead?**: CSS flex should be used. These are not HTML data tables; the tables are being used strictly for presentation. The DOM should look something like this: ```lang=html <ul class="mw-search-results"> <li class="mw-search-result mw-search-result-ns-0 searchResultImage"> <div class="searchResultImage-thumbnail"> <a href="..." class="image"> <img alt="" src="..." decoding="async" data-file-width="1920" data-file-height="2400" width="96" height="120"> </a> </div> <div class="searchResultImage-text"> <div class="mw-search-result-heading"> <a href="/wiki/Mecca" title="Mecca" data-serp-pos="0">Mecca</a> </div> <div class="searchresult"> many tunnels.<span class="searchmatch">&lt;ref&gt;https://www.constructionweekonline.com/projects-tenders/article-22689-makkah-building-eight-tunnels-to-ease-congestion&lt;/ref</span>&gt; ===Rapid... </div> <div class="mw-search-result-data">97 KB (10,753 words) - 09:06, 7 October 2022</div> </div> </li> ... </ul> ``` with appropriate CSS flex styling.
    • Task
    === Background A while ago (**in MW v1.35.0**) a number of [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/resources/src/mediawiki.less/mediawiki.ui/variables.less | mediwiki.ui variables ]] were deprecated to follow Wikimedia's [[ https://www.mediawiki.org/wiki/Manual:Coding_conventions/CSS#Variable_naming | stylesheet variable naming convention ]] **//5 minor versions later//** a lot of the extensions are still using the outdated variables, making things harder. `border-radius` is just [[ https://codesearch.wmcloud.org/search/?q=%40borderRadius&i=nope&files=&excludeFiles=&repos= | one example ]]. === Acceptance criteria for done [] Replace deprecated vars with current ones. === List of currently deprecated vars since 1.35 ```lang=less // Deprecated in MW v1.35.0 @colorProgressive: @color-primary; @colorProgressiveHighlight: @color-primary--hover; @colorProgressiveActive: @color-primary--active; @colorDestructive: @color-destructive; @colorDestructiveHighlight: @color-destructive--hover; @colorDestructiveActive: @color-destructive--active; // Orange; for contextual use of returning to a past action @colorRegressive: #ff5d00; @colorText: @color-base; @colorTextEmphasized: @color-base--emphasized; @colorTextLight: @color-base--subtle; @colorBaseInverted: @color-base--inverted; @colorNeutral: @color-base--subtle; @colorButtonText: @color-base; @colorButtonTextHighlight: @color-base--hover; @colorButtonTextActive: @color-base--active; @colorDisabledText: @color-base--disabled; @colorFieldBorder: @border-color-base; @colorPlaceholder: @color-placeholder; @colorShadow: @colorGray14; // Used in mixins to darken contextual colors by the same amount (eg. focus) @colorDarkenPercentage: 13.5%; // Used in mixins to lighten contextual colors by the same amount (eg. hover) @colorLightenPercentage: 13.5%; @iconSize: @size-icon; @iconGutterWidth: @width-icon-gutter; @backgroundColorError: @background-color-error; @colorError: @color-error; @borderColorError: @color-error; @backgroundColorWarning: @background-color-warning; @colorWarning: @color-base--emphasized; @borderColorWarning: @border-color-warning; @backgroundColorSuccess: @background-color-success; @colorSuccess: @color-success; @borderColorSuccess: @color-success; // Orange; for contextual use of a potentially negative action of medium severity @colorMediumSevere: #ff5d00; // Yellow; for contextual use of a potentially negative action of low severity @colorLowSevere: #fc3; @backgroundColorInputBinaryChecked: @background-color-input-binary--checked; @backgroundColorInputBinaryActive: @background-color-input-binary--active; @sizeInputBinary: @size-input-binary; @borderColorInputBinaryChecked: @border-color-input-binary--checked; @borderColorInputBinaryActive: @border-color-input-binary--active; @borderWidthRadioChecked: @border-width-radio--checked; @borderRadius: @border-radius-base; @boxShadowWidget: @box-shadow-base; @boxShadowWidgetFocus: @box-shadow-base--focus; @boxShadowProgressiveFocus: @box-shadow-primary--focus; @boxShadowInputBinaryActive: @box-shadow-input-binary--active; ```
    • Task
    It would be nice to be able to use `^` for the beginning of the title and `$` for the end of the title in regular expression searches of titles (`intitle://`). At the moment there's no way to search for titles ending with `gry` as was recently brought up [in a discussion on categories for words with suffixes that are not really suffixes on English Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2022/September#Category:English_words_ending_in_%22-gry%22_and_Category:English_words_ending_in_%22-yre%22). `intitle:/gry$/` doesn't work. Years ago @Dixtosa created https://dixtosa.toolforge.org to do searches like this. For prefix searches, `Special:PrefixIndex` works if you've got a literal prefix that narrows things down, but for anything more complicated you really need `insource:/^/`. My impression is that `^` and `$` were disabled in `insource://` searches because it's unclear whether they mean start of line and end of line, or start of text and end of text, and maybe for performance reasons, but neither thing would be a consideration in titles, which don't have newline characters and can only be 255 bytes long. So `intitle://` should be able to use `^` and `$`.
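    Until anchors are supported, one client-side workaround is to let `intitle://` find candidates and apply the anchors locally. A sketch assuming English Wiktionary and the `requests` library:
```lang=python
import re
import requests

# Candidate pages whose title contains "gry" anywhere...
params = {
    "action": "query",
    "list": "search",
    "srsearch": "intitle:/gry/",
    "srlimit": "max",
    "format": "json",
}
resp = requests.get("https://en.wiktionary.org/w/api.php", params=params)
titles = [hit["title"] for hit in resp.json()["query"]["search"]]

# ...then apply the anchor locally, since intitle:/gry$/ is not supported yet.
print([t for t in titles if re.search(r"gry$", t)])
```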
    • Task
    The update pipeline will have to construct an //update document// that will be used to carry the data to index. The fields to support are specified [[https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/844075/4/airflow/tests/fixtures/hive_operator_hql/import_cirrus_indexes_init_create_tables.expected | here]] (minus `extra_source`). CirrusSearch models this using a PHP array and uses other configuration resources to inform some update hints (super noop options). Ideally we would like to avoid having to replicate (or fetch) any configuration of the target wiki, and the model should carry all the information needed to manipulate itself; in other words, //super_detect_noop// options will have to be modeled as well. The pipeline aims to support three kinds of updates: - revision based updates - content refresh updates (re-renders) - update fragments (for side data such as weighted_tags, pageview-related signals) The model must support merging all these updates together given a set of rules: - a revision update can be merged with an update fragment (e.g. a page edit and the corresponding update fragment obtained when ORES does its topic detection) - two or more update fragments can be combined together - conflict resolution when e.g. two update fragments attempt to update the same field The model must support being serialized by flink (i.e. a versioned flink serializer might be wise to implement). The model must support being serialized to JSON using a dedicated schema that will have to be designed. The model must support being serialized as an elasticsearch update (super_detect_noop updates and delete operations). The model must support being constructed out of the response of the CirrusSearch API used to render its document (T317309). Note: multiple iterations are expected as we implement the search update pipeline itself and discover corner cases we did not anticipate. This ticket is about implementing a reasonable first iteration. Caveats: - CirrusSearch does not yet produce a document that perfectly matches the schema defined [[https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/844075/4/airflow/tests/fixtures/hive_operator_hql/import_cirrus_indexes_init_create_tables.expected | here]]. AC: - a first version of the model is defined and implemented
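    A minimal sketch of the merge rules described above, in plain Python outside of flink; the field names and the per-field last-write-wins conflict rule are illustrative assumptions, not the final model:
```lang=python
# Toy update fragments: each carries only the fields it wants to change.
revision_update = {
    "kind": "revision",
    "rev_id": 1002,
    "fields": {"title": "Mecca", "text_bytes": 99_000},
}
ores_fragment = {
    "kind": "fragment",
    "fields": {"weighted_tags": ["classification.ores.articletopic/Geography|500"]},
}
pageview_fragment = {
    "kind": "fragment",
    "fields": {"popularity_score": 0.0004},
}

def merge(base: dict, fragment: dict) -> dict:
    """Merge an update fragment into a base update.

    Conflict rule used in this sketch: the fragment (later event) wins on a
    per-field basis; list-valued fields are concatenated instead.
    """
    merged = dict(base)
    fields = dict(base["fields"])
    for name, value in fragment["fields"].items():
        if isinstance(value, list) and isinstance(fields.get(name), list):
            fields[name] = fields[name] + value
        else:
            fields[name] = value
    merged["fields"] = fields
    return merged

combined = merge(merge(revision_update, ores_fragment), pageview_fragment)
print(combined["fields"].keys())
# dict_keys(['title', 'text_bytes', 'weighted_tags', 'popularity_score'])
```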
    • Task
    Elasticsearch refuses to process this query because it contains `U+001F` which is reserved. `illegal_argument_exception: Term text cannot contain unit separator character U+001F; this character is reserved`. Per https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-suggesters.html#indexing we should escape \u0000, \u001f, \u001e both at index time and search time.
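    A minimal sketch of the kind of sanitization that could be applied on both paths; replacing the reserved characters with a space is an assumption, the reference above only says they must not reach the suggester:
```lang=python
# U+0000, U+001E and U+001F are reserved by the completion suggester and must
# be removed (or replaced) both when building suggest inputs and when querying.
_RESERVED = {0x0000: " ", 0x001E: " ", 0x001F: " "}

def sanitize_suggest_text(text: str) -> str:
    return text.translate(_RESERVED)

assert sanitize_suggest_text("foo\u001fbar") == "foo bar"
```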
    • Task
    **User Story:** As an on-wiki searcher, I want to be able to search for words that have apostrophes in them without having to know or worry about what apostrophe-like character is actually used. For example, at least seven different characters are used on various projects in the name of the city in Yemen: Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib. **Notes:** We have a new character filter, `apostrophe_norm`, currently configured for use only on Nias Wikipedia, which converts the other six options to the straight apostrophe. There is a lot of cross-wiki inconsistency in how these characters are treated, too. The table below shows how the characters are analyzed in English, Japanese, and French Wikis. The standard tokenizer splits on backticks (` U+0060) so that always gets split into two words (//ma// is a stop word in French, so it gets dropped). English has the `aggressive_splitting` filter enabled, which splits on three of the other characters (left and right curly apostrophes and the straight apostrophe). `icu_folding` removes the left and right half rings in English and French, though French has the "preserve" variant, which keeps the original, too. `icu_folding` also straightens the curly apostrophes in French, but `aggressive_splitting` has already split on them in English.
|**char**|**U+0027**|**U+02BF**|**U+02BE**|**U+02BC**|**U+0060**|**U+2019**|**U+2018**
|**input**|**Ma'rib**|**Maʿrib**|**Maʾrib**|**Maʼrib**|**Ma`rib**|**Ma’rib**|**Ma‘rib**
|**en**|ma, rib|marib|marib|marib|ma, rib|ma, rib|ma, rib
|**ja**|ma'rib|maʿrib|maʾrib|maʼrib|ma, rib|ma’rib|ma‘rib
|**fr**|ma'rib|marib/maʿrib|marib/maʾrib|ma'rib|(ma,) rib|ma'rib/ma’rib|ma'rib/ma‘rib
If we work on T219108, we should also consider removing apostrophes from `aggressive_splitting`. **Acceptance Criteria:** * `apostrophe_norm` is enabled everywhere (or at least by default, possibly with exceptions or customization for some languages for reasons as yet unknown) * All of //Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib// index to the same form in all or almost all wikis (i.e., with intentional exceptions). Note: this is a follow up to T311654, which looked at this issue for just one language (Nias).
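    For illustration, the mapping the `apostrophe_norm` character filter performs, expressed as a small Python sketch (the real filter is an elasticsearch char_filter; the helper here just demonstrates the seven variants collapsing to one form):
```lang=python
# Map the six apostrophe-like variants to the straight apostrophe U+0027.
APOSTROPHE_LIKE = "\u02bf\u02be\u02bc\u0060\u2019\u2018"  # ʿ ʾ ʼ ` ’ ‘
_NORM = str.maketrans({c: "'" for c in APOSTROPHE_LIKE})

def apostrophe_norm(text: str) -> str:
    return text.translate(_NORM)

variants = ["Ma'rib", "Maʿrib", "Maʾrib", "Maʼrib", "Ma`rib", "Ma’rib", "Ma‘rib"]
assert {apostrophe_norm(v) for v in variants} == {"Ma'rib"}
```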
    • Task
    **Steps to replicate the issue** (include links if applicable): * Go to https://en.wikipedia.org/wiki/Special:Search * Search for `wazuh` ([[ https://en.wikipedia.org/w/index.php?search=wazuh&title=Special:Search&go=Go&ns0=1 | repro search result ]]) **What happens?**: A result for a page is returned — the table is not rendered correctly. {F35354749} **What should have happened instead?**: A result for a page is returned — the table is //either// rendered correctly //or// a "plain text" table representation should be shown. I'm sure ```lang=html <div class="searchresult"> Network Logs Config Sane defaults Notes OSSEC 2019 No No Yes Yes Yes Yes <span class="searchmatch">Wazuh</span> 2022 No No Yes Yes Yes Yes Samhain 2021 Yes No Yes No Partial No Snort 2018 </div> ``` can be displayed a little nicer? **Other information** (browser name/version, screenshots, etc.): Screenshot below from the [[ https://twitter.com/gewt/status/1552808814453219328 | tweet I saw ]] (gewt) about this, so this affects both mobile & desktop. {F35354753}
    • Task
    We expect elasticsearch scores for mediasearch (when rescoring is turned off via `cirrusRescoreProfile=empty` in the url) to be in the 0-100 range, but they aren't. Here's [[ https://commons.wikimedia.org/w/index.php?title=Special:Search&cirrusDumpResult&cirrusRescoreProfile=empty&ns6=1&search=custommatch%3Adepicts_or_linked_from%3DQ146 | an example from custommatch ]] and [[ https://commons.wikimedia.org/w/index.php?title=Special:Search&cirrusDumpResult&cirrusRescoreProfile=empty&ns6=1&search=cat | an example from regular search ]]. In the [[ https://commons.wikimedia.org/w/index.php?title=Special:Search&cirrusDumpResult&cirrusRescoreProfile=empty&ns6=1&search=cat&cirrusExplain=pretty | cirrusExplain ]] it looks like the logistic regression might be getting applied separately to the score calculation somehow? This might not materially affect the search results, but the fact that the scores are not what we'd expect suggests that we're not doing what we think we're doing, so we ought to investigate to make sure.
    • Task
    The wikidata property `P854` 'reference URL' is currently used on over 62 million references on wikidata. This introduces challenges if one wants to find e.g. all statements which are referenced to a particular domain. (For example, the Library at the London School of Economics is considering changing its preferred form of URL for online theses, and wanted to find all statements referenced to a URL of the current form `etheses.lse.ac.uk`) 62 million is far too many for a SPARQL query to merely retrieve all uses of the property and then filter for a particular string: such a query has no hope of completing in 60 seconds. An alternative strategy is therefore to use Cirrus search to identify relevant item pages containing the string, and then SPARQL to identify the relevant statements within them. Trying this strategy, Cirrus search was found to retrieve 4121 pages https://w.wiki/5RC3 , in which query https://t.co/ovNL2bZov3 found 40951 statements with such a URL either as a direct value, as a qualifier value, or as a reference. However it was noticed that the approach was returning no reference URLs used as references that were not also present as statement or qualifier values. https://w.wiki/5RB2 This is despite such uses being widespread -- for example query https://w.wiki/5RqQ finds 11,500 further cases where LSE thesis URLs are being referenced as references from items for LSE staff or graduates, without the URLs appearing as statement or qualifier values -- yet the pages for none of these items were being returned by the Cirrus search. This was unexpected behaviour, and it makes it difficult to reliably find URLs from a particular domain being used as references. Notes. 1. When called from SPARQL the mwapi search call is limited to returning a maximum of 10,000 results ([[https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI | MWAPI manual ]]). However as the call is only actually returning 4121 pages, we should still be well within this limit. 2. There do seem to be at least some pages where Cirrus //can// find a URL used only in a wikidata reference -- for example this query successfully retrieves a reference to a Unesco URL https://t.co/Jfwg88oYId . But seemingly not one of the pages using an LSE URL only as a reference is found.
    • Task
    When targeting a local wiki, the `Search with accent yields result page with accent` Selenium test fails. It looks like the page does not exist. {P24645} {F35052899}
    • Task
    As a maintainer of the search infrastructure I want to have more precise metrics regarding errors that occur between CirrusSearch and Elasticsearch so that I can better understand the problems on the cluster. The CirrusSearch failures are currently categorized into 3 buckets: `rejected`, `failed` and `unknown`. The `unknown` bucket is currently seeing 1 error/minute so it would be interesting to know what these are, especially if they relate to indexing documents. Graph: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=9 AC: - the number of `unknown` errors should be exceptional (close to 0/day)
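A rough sketch, in plain PHP, of the kind of finer-grained classification this could lead to. The bucket names and the `$item` shape (modelled on an Elasticsearch bulk-response item) are assumptions for illustration only, not existing CirrusSearch code.

```lang=php
<?php
// Illustrative only: split the current catch-all "unknown" bucket into
// finer categories based on the response status and error type.
function classifyFailure( array $item ): string {
	$status = $item['status'] ?? 0;
	$reason = $item['error']['type'] ?? '';
	if ( $status === 429 || $reason === 'es_rejected_execution_exception' ) {
		return 'rejected'; // thread pool / queue full, retryable
	}
	if ( $reason === 'mapper_parsing_exception' ) {
		return 'failed-mapping'; // bad document, not retryable
	}
	if ( $status >= 500 ) {
		return 'failed-server'; // cluster-side error
	}
	return 'unknown-' . ( $reason !== '' ? $reason : 'http_' . $status );
}

// A throttled indexing request would now be counted as "rejected":
echo classifyFailure( [ 'status' => 429, 'error' => [ 'type' => 'es_rejected_execution_exception' ] ] );
```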
    • Task
    This includes: - new keyword in CirrusSearch to remove them from search results - new option in AdvancedSearch to do so
    • Task
    - Follow instructions from [[ https://www.mediawiki.org/wiki/CirrusSearch | CirrusSearch ]] (Quickstart) to get MediaWiki working locally. - Refactor WebdriverIO tests (in `tests/selenium`) from sync to async mode. - Push code to [[ https://www.mediawiki.org/wiki/Gerrit | Gerrit ]]. # TODO {P20436}
    • Task
    Shows up a handful of times in logs ([[https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-mediawiki-2022.01.27?id=4aWdm34ByXy9L9-Ms29Z|example]]). ``` Invalid URL http://?????????/mediawiki/load.php?debug=false&lang=en&modules=mediawiki.legacy.commonPrint,shared|skins.monobook&only=styles&skin=monobook&* specified for reference Flow\Model\URLReference from /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Model/URLReference.php(27) #0 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Model/URLReference.php(64): Flow\Model\URLReference->__construct(Flow\Model\UUID, string, Flow\Model\UUID, Title, string, Flow\Model\UUID, string, string) #1 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/Mapper/BasicObjectMapper.php(40): Flow\Model\URLReference::fromStorageRow(array, NULL) #2 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ObjectLocator.php(315): Flow\Data\Mapper\BasicObjectMapper->fromStorageRow(array) #3 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ObjectManager.php(307): Flow\Data\ObjectLocator->load(array) #4 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ObjectLocator.php(119): Flow\Data\ObjectManager->load(array) #5 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ObjectLocator.php(70): Flow\Data\ObjectLocator->findMulti(array, array) #6 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ManagerGroup.php(127): Flow\Data\ObjectLocator->find(array) #7 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Data/ManagerGroup.php(139): Flow\Data\ManagerGroup->call(string, array) #8 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/LinksTableUpdater.php(124): Flow\Data\ManagerGroup->find(string, array) #9 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/LinksTableUpdater.php(44): Flow\LinksTableUpdater->getReferencesForTitle(Title) #10 /srv/mediawiki/php-1.38.0-wmf.19/extensions/Flow/includes/Content/BoardContentHandler.php(217): Flow\LinksTableUpdater->mutateParserOutput(Title, ParserOutput) #11 /srv/mediawiki/php-1.38.0-wmf.19/includes/content/ContentHandler.php(1723): Flow\Content\BoardContentHandler->fillParserOutput(Flow\Content\BoardContent, MediaWiki\Content\Renderer\ContentParseParams, ParserOutput) #12 /srv/mediawiki/php-1.38.0-wmf.19/includes/content/Renderer/ContentRenderer.php(47): ContentHandler->getParserOutput(Flow\Content\BoardContent, MediaWiki\Content\Renderer\ContentParseParams) #13 /srv/mediawiki/php-1.38.0-wmf.19/includes/Revision/RenderedRevision.php(271): MediaWiki\Content\Renderer\ContentRenderer->getParserOutput(Flow\Content\BoardContent, Title, integer, ParserOptions, boolean) #14 /srv/mediawiki/php-1.38.0-wmf.19/includes/Revision/RenderedRevision.php(238): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(Flow\Content\BoardContent, boolean) #15 /srv/mediawiki/php-1.38.0-wmf.19/includes/Revision/RevisionRenderer.php(221): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array) #16 /srv/mediawiki/php-1.38.0-wmf.19/includes/Revision/RevisionRenderer.php(158): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array) #17 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array) #18 /srv/mediawiki/php-1.38.0-wmf.19/includes/Revision/RenderedRevision.php(200): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array) #19 
/srv/mediawiki/php-1.38.0-wmf.19/includes/content/ContentHandler.php(1443): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput(array) #20 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(85): ContentHandler->getParserOutputForIndexing(WikiPage, ParserCache) #21 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(68): CirrusSearch\BuildDocument\ParserOutputPageProperties->finalizeReal(Elastica\Document, WikiPage, ParserCache, CirrusSearch\CirrusSearch) #22 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/BuildDocument/BuildDocument.php(171): CirrusSearch\BuildDocument\ParserOutputPageProperties->finalize(Elastica\Document, Title) #23 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/DataSender.php(317): CirrusSearch\BuildDocument\BuildDocument->finalize(Elastica\Document) #24 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(136): CirrusSearch\DataSender->sendData(string, array) #25 /srv/mediawiki/php-1.38.0-wmf.19/extensions/CirrusSearch/includes/Job/JobTraits.php(136): CirrusSearch\Job\ElasticaWrite->doJob() #26 /srv/mediawiki/php-1.38.0-wmf.19/extensions/EventBus/includes/JobExecutor.php(79): CirrusSearch\Job\CirrusGenericJob->run() #27 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array) #28 {main} ```
    • Task
    ``` php tests/phpunit/phpunit.php tests/phpunit/includes/preferences/DefaultPreferencesFactoryTest.php Using PHP 7.4.26 PHPUnit 8.5.21 by Sebastian Bergmann and contributors. .....EE..EEI. 13 / 13 (100%) Time: 2.21 seconds, Memory: 79.00 MB There were 4 errors: 1) DefaultPreferencesFactoryTest::testShowRollbackConfIsHiddenForUsersWithoutRollbackRights Error: Call to a member function getSession() on null /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1429 /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1369 /Users/kostajh/src/mediawiki/w/extensions/Echo/includes/EchoHooks.php:309 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:338 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:137 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookRunner.php:1909 /Users/kostajh/src/mediawiki/w/includes/preferences/DefaultPreferencesFactory.php:248 /Users/kostajh/src/mediawiki/w/tests/phpunit/includes/preferences/DefaultPreferencesFactoryTest.php:211 /Users/kostajh/src/mediawiki/w/tests/phpunit/MediaWikiIntegrationTestCase.php:452 === Logs generated by test case [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [localisation] [debug] LocalisationCache using store LCStoreNull [] [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [localisation] [debug] LocalisationCache using store LCStoreNull [] [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [objectcache] [debug] MainObjectStash using store {class} {"class":"HashBagOStuff"} [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [MessageCache] [debug] MessageCache using store {class} {"class":"HashBagOStuff"} [Parser] [debug] Parser::setOutputFlag: set user-signature flag on 'DefaultPreferencesFactoryTest'; User signature detected [] === 2) DefaultPreferencesFactoryTest::testShowRollbackConfIsShownForUsersWithRollbackRights Error: Call to a member function getSession() on null /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1429 /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1369 /Users/kostajh/src/mediawiki/w/extensions/Echo/includes/EchoHooks.php:309 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:338 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:137 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookRunner.php:1909 /Users/kostajh/src/mediawiki/w/includes/preferences/DefaultPreferencesFactory.php:248 /Users/kostajh/src/mediawiki/w/tests/phpunit/includes/preferences/DefaultPreferencesFactoryTest.php:231 /Users/kostajh/src/mediawiki/w/tests/phpunit/MediaWikiIntegrationTestCase.php:452 === Logs generated by test case [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [localisation] [debug] LocalisationCache using store LCStoreNull [] [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [localisation] [debug] LocalisationCache using store LCStoreNull [] [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [objectcache] [debug] MainObjectStash using 
store {class} {"class":"HashBagOStuff"} [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [MessageCache] [debug] MessageCache using store {class} {"class":"HashBagOStuff"} [Parser] [debug] Parser::setOutputFlag: set user-signature flag on 'DefaultPreferencesFactoryTest'; User signature detected [] === 3) DefaultPreferencesFactoryTest::testVariantsSupport Error: Call to a member function getSession() on null /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1429 /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1369 /Users/kostajh/src/mediawiki/w/extensions/Echo/includes/EchoHooks.php:309 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:338 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:137 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookRunner.php:1909 /Users/kostajh/src/mediawiki/w/includes/preferences/DefaultPreferencesFactory.php:248 /Users/kostajh/src/mediawiki/w/tests/phpunit/includes/preferences/DefaultPreferencesFactoryTest.php:371 /Users/kostajh/src/mediawiki/w/tests/phpunit/MediaWikiIntegrationTestCase.php:452 === Logs generated by test case [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [localisation] [debug] LocalisationCache using store LCStoreNull [] [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [localisation] [debug] LocalisationCache using store LCStoreNull [] [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [objectcache] [debug] MainObjectStash using store {class} {"class":"HashBagOStuff"} [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [MessageCache] [debug] MessageCache using store {class} {"class":"HashBagOStuff"} [Parser] [debug] Parser::setOutputFlag: set user-signature flag on 'DefaultPreferencesFactoryTest'; User signature detected [] === 4) DefaultPreferencesFactoryTest::testUserGroupMemberships Error: Call to a member function getSession() on null /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1429 /Users/kostajh/src/mediawiki/w/includes/Permissions/PermissionManager.php:1369 /Users/kostajh/src/mediawiki/w/extensions/Echo/includes/EchoHooks.php:309 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:338 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookContainer.php:137 /Users/kostajh/src/mediawiki/w/includes/HookContainer/HookRunner.php:1909 /Users/kostajh/src/mediawiki/w/includes/preferences/DefaultPreferencesFactory.php:248 /Users/kostajh/src/mediawiki/w/tests/phpunit/includes/preferences/DefaultPreferencesFactoryTest.php:396 /Users/kostajh/src/mediawiki/w/tests/phpunit/MediaWikiIntegrationTestCase.php:452 === Logs generated by test case [objectcache] [debug] MainWANObjectCache using store {class} {"class":"EmptyBagOStuff"} [localisation] [debug] LocalisationCache using store LCStoreNull [] [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [localisation] [debug] LocalisationCache using store LCStoreNull [] [objectcache] [debug] MainWANObjectCache 
using store {class} {"class":"EmptyBagOStuff"} [objectcache] [debug] MainObjectStash using store {class} {"class":"HashBagOStuff"} [wfDebug] [debug] ParserFactory: using default preprocessor {"private":false} [localisation] [debug] LocalisationCache::isExpired(en): cache missing, need to make one [] [MessageCache] [debug] MessageCache using store {class} {"class":"HashBagOStuff"} [UserOptionsManager] [debug] Loading options from database {"user_id":0} [Parser] [debug] Parser::setOutputFlag: set user-signature flag on 'DefaultPreferencesFactoryTest'; User signature detected [] [CentralAuthVerbose] [info] Loading state for global user {user} from DB {"user":"","private":false} [CentralAuthVerbose] [info] Loading attached wiki list for global user from DB {"private":false} [CentralAuthVerbose] [info] Loading groups for global user {user} {"user":"","private":false} [objectcache] [debug] fetchOrRegenerate(global:centralauth-user:d41d8cd98f00b204e9800998ecf8427e): miss, new value computed [] [CentralAuthVerbose] [info] Loading CentralAuthUser for user {user} from cache object {"user":"","private":false} === ERRORS! Tests: 13, Assertions: 28, Errors: 4, Incomplete: 1. You should really speed up these slow tests (>50ms)... 1. 200ms to run DefaultPreferencesFactoryTest:testIntvalFilter 2. 158ms to run DefaultPreferencesFactoryTest:testGetForm 3. 94ms to run DefaultPreferencesFactoryTest:testEmailAuthentication with data set #0 4. 68ms to run DefaultPreferencesFactoryTest:testEmailAuthentication with data set #2 5. 67ms to run DefaultPreferencesFactoryTest:testEmailAuthentication with data set #1 ```
    • Task
    CirrusSearch currently uses mediawiki-vagrant as the development environment. Most of the mediawiki development community has moved on from mwv. To move forward we need to: # Identify the functionality CirrusSearch requires to run its full testing suites, integration/unit/etc. (ex: multiwiki deployment) # Identify the environments that currently exist, and come up with an estimate of how much work it would be to add the functionality we require from step 1. Nothing complex, we are mostly looking at them relative to each other.
    • Task
    This is a microtask for {T256239}. See that task for more information and how to get help. - If you are working on this task, assign it to yourself, so others know it's already taken. - Read the MediaWiki-Docker documentation for the repository: [[ https://www.mediawiki.org/wiki/MediaWiki-Docker/Extension/CirrusSearch | MediaWiki-Docker/Extension/CirrusSearch ]]. - Set up the repository on your machine. - Run Selenium tests. - Read the repository documentation: [[ https://www.mediawiki.org/wiki/Extension:CirrusSearch | Extension:CirrusSearch ]]. - You might need to read repository source code: [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CirrusSearch/+/refs/heads/master | CirrusSearch ]]. - You might need to contact developers of the repository. [[ https://www.mediawiki.org/wiki/Developers/Maintainers | Developers/Maintainers ]] lists which team is in charge of which repository. - **Find what's missing from the MediaWiki-Docker documentation for the repository. Update the page.** If the Selenium tests pass, you're done.
    • Task
    **Steps to reproduce:** 1. Go to https://cs.wikipedia.org/ (logged in; Firefox 93) 2. In the search bar, enter `burcak` (as the term `Burčák` exists but I might not have a keyboard that allows to easily add diacritics) 3. Look at the proposed autocomplete results 4. Click the `Hledat` (`Search`) button **Actual outcomes after step 3 and step 4:** {F34688936} {F34688935} **Expected outcome:** Seeing existing https://cs.wikipedia.org/wiki/Burčák listed that I could select
    • Task
    **List of steps to reproduce** (step by step, including full links if applicable): * Make some page edits and wait a while * Perform an insource regex search (See <https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=1049919215#Search_index> for some details) **What happens?**: * Outdated results are returned, even hours later **What should have happened instead?**: * Results should reflect near-realtime index updates (per <https://www.mediawiki.org/wiki/Help:CirrusSearch#How_frequently_is_the_search_index_updated?>) **Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc**:
    • Task
    In templates with TemplateData, only search the template name and description from TemplateData, not the whole doc. A template doc can contain pretty much anything, so in my eyes it makes the most sense to consider only the information within TemplateData. The entire doc should only be searched as a fallback option for templates without TemplateData. Reported on https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagen/Werkstatt/Archiv_2021/3#Passende_Vorlage_suchen_und_finden:_Suchfunktion_und_weitere_Verbesserungen_im_Vorlagenassistenten
    • Task
    (Lydia asked that I write this up, just in case) I thought that "," comma was already added to the Elasticsearch standard tokenizer and would be excluded from simple search? But it seems that there is some overriding decision to have the default config this way on Wikidata? Perhaps the word_delimiter is being used and incorrectly? > Avoid using the word_delimiter filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead. Below as seen in my screenshot, I was looking for entities that contained all 3 words, but it seemed if I DID NOT include the comma, then the entity was not found. The only way that it was displayed was if I did include the comma. {F34615713} I noticed that the string "foot locker inc" will not show the entity in the dropdown, but only "foot locker, inc." which includes the comma? Exact match should only happen by default if a user wraps in double quotes, such as ``` "Foot Locker, Inc." ``` where in my example screenshot I have to include the comma to find the entity. But my expectation was that any U+002C comma in the search string would not be included in the search query. (On that entity, I have since added the full legal name into the alias field to help improve searchability, but still would like to know the decision on why U+002C comma is not being excluded) Why was U+002C comma decided to be included in simple search? Must users use the Advanced Search on Wikidata or the API if they want to actually do simple searches that are not exact match phrases? Doing something advanced in order to do something simple would seem counter-intuitive and the reverse of most users expectations.
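For reference, the combination recommended by the Elasticsearch documentation quoted above (`word_delimiter` on top of a `whitespace` tokenizer instead of the `standard` tokenizer) would look roughly like the settings below, written as a PHP array purely for illustration; the analyzer name is made up and this is not the actual Wikidata/CirrusSearch configuration.

```lang=php
<?php
// Hypothetical analysis settings following the quoted recommendation:
// split on whitespace first, then let word_delimiter handle punctuation
// such as the "," and "." in "Foot Locker, Inc.".
$analysis = [
	'analyzer' => [
		'label_splitting' => [ // made-up analyzer name
			'type' => 'custom',
			'tokenizer' => 'whitespace',
			'filter' => [ 'word_delimiter', 'lowercase' ],
		],
	],
];
echo json_encode( [ 'analysis' => $analysis ], JSON_PRETTY_PRINT );
```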
    • Task
    # Status - The current documentation does not have sufficient information for setting up CirrusSearch locally. - Debugging is challenging without having the repo locally. - Patch: [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/708516 | 708516 ]]
    • Task
    **List of steps to reproduce** * Open Turkish Wikipedia (or any other Turkish wiki) * Search for "special:import" in the search bar. * It sends you to the "Special:İmport" page instead of "Special:Import". **What should have happened instead?**: For some reason it changes the letter "i" to "İ" instead of "I". {F34587372} {F34587732}
    • Task
    As a user of the Search REST api I want to be able to use the same parameters that I used to pass when using opensearch or the action API using search modules so that I can tune, instrument and debug CirrusSearch. How to reproduce: pass `cirrusUseCompletionSuggester=yes` or `cirrusUseCompletionSuggester=no` to enable/disable the completion suggester: Opensearch gets varying results: - https://en.wikipedia.org/w/api.php?action=opensearch&search=Test%20crocket&cirrusUseCompletionSuggester=yes - https://en.wikipedia.org/w/api.php?action=opensearch&search=Test%20crocket&cirrusUseCompletionSuggester=no The REST api does not vary results: - https://en.wikipedia.org/w/rest.php/v1/search/title?q=Test%20crocket&cirrusUseCompletionSuggester=yes - https://en.wikipedia.org/w/rest.php/v1/search/title?q=Test%20crocket&cirrusUseCompletionSuggester=no This is particularly problematic as Cirrus relies on this kind of parameters for configuring its instrumentation and A/B test infrastructure. Seen on tr.wikipedia.org where the search widget is relying on the REST api instead of opensearch. Relates to T281578 AC: - discuss and decide what approach to take - instrumentation and debugging options should still be possible when CirrusSearch is called from the REST API
    • Task
    I’ve noticed that in some search results on mediawiki.org, the link to the section containing the result is a redlink. [standalone group](https://www.mediawiki.org/w/index.php?title=Special:Search&search=standalone+group&ns0=1&ns12=1&ns100=1&ns102=1&ns104=1&ns106=1): {F34534400} [phpunit testing list groups](https://www.mediawiki.org/w/index.php?search=phpunit+testing+list+groups&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns12=1&ns100=1&ns102=1&ns104=1&ns106=1): {F34534405} (Note that the first result has a blue section link.) The links point to the target page without the section, e.g. https://www.mediawiki.org/w/index.php?title=Manual:PHP_unit_testing/Writing_unit_tests_for_extensions&action=edit&redlink=1 for the first example. So far, I haven’t been able to reproduce this on another wiki.
    • Task
    As a maintainer of the search infrastructure I want the long running maintenance tasks to be resilient to node restarts so that such processes do not fail regularly. The scroll API relies on non-persisted state maintained on the elasticsearch nodes that may disappear if a node restarts, causing the underlying maintenance task to fail. This problem currently affects: - dump generation (T265056) - title completion index rebuild - ttmserver - reindex? (might be solved upstream https://github.com/elastic/elasticsearch/issues/42612) One solution is to move the state to the client performing the long running task using `search_after` on a stable field (the page id), as sketched below. AC: - the scroll API is no longer used by long running tasks - a node crash does not cause a long running task to fail
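A minimal sketch of the `search_after` approach, assuming a hypothetical `runSearch( array $body ): array` helper that posts the body to the index's `_search` endpoint and returns the decoded response; the `page_id` sort field name is also an assumption.

```lang=php
<?php
// Cursor-style paging with search_after. Because the cursor (the sort values
// of the last hit) lives on the client, a node restart only requires
// re-sending the same request instead of losing server-side scroll state.
$searchAfter = null;
do {
	$body = [
		'size' => 1000,
		'query' => [ 'match_all' => (object)[] ],
		'sort' => [ [ 'page_id' => 'asc' ] ], // stable, unique sort key (assumed field name)
	];
	if ( $searchAfter !== null ) {
		$body['search_after'] = $searchAfter;
	}
	$response = runSearch( $body ); // hypothetical helper
	$hits = $response['hits']['hits'] ?? [];
	foreach ( $hits as $hit ) {
		// ... process the document ...
	}
	$searchAfter = $hits ? end( $hits )['sort'] : null;
} while ( $hits );
```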
    • Task
    As a maintainer of the search cluster I want to easily know what is the state of the various indices on all the clusters in regard to analysis settings/mappings so that I can more confidently enable/deprecate features without breaking existing usecases. Changing the analysis settings and mappings in CirrusSearch requires reindexing the affected wikis; this process is generally slow and is often delayed so that more changes are packed together. The drawback is that it is prone to mistakes, as the maintainer has to remember what was done and what is left to be done. AC: - verify that UpdateSearchIndexConfig is still able to properly detect discrepancies between the actual and expected settings - a small tool is available in `CirrusSearch/scripts` that produces a list of wiki-cluster pairs to verify
    • Task
    From IRC in `#wikimedia-operations`: ```lang=irc 11:55:23 <Krinkle> If there is code storing data directly in memc bypassing getWithSet(), then that would be a problem. 11:55:26 <Krinkle> I don't know if that's the case. 11:55:49 <Krinkle> It being called out here suggests that maybe it is doing something like that, as otherwise why is it called out at all? 11:56:24 <legoktm> I think the problem is that the CirrusSearch cache is too cold to use after the switchover 11:57:32 <Krinkle> well, if the procedure laid out here is what was done in the past, then I suppose we can do it again, I mean, nothing has changed in terms of wan cache 11:57:39 <legoktm> https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/3df4a9b30707a2ef9ba1ebfcc84f09b915c78e15/includes/Searcher.php#642 11:57:53 <Krinkle> if its use of wan cache is new, then we can re-evaluate it indeed 11:58:37 <legoktm> it's probably not new, I'm just checking to make sure the docs are still up to date, and it seems like they are 11:59:00 <Krinkle> aye, yeah, but this does seem a bit of an anti-pattern. 11:59:30 <Krinkle> it's bypassing virtually all scale and performance levers and automation in the wanobjectcache class by not using getWithSet, I think. 12:05:16 <legoktm> Krinkle: is there a link somewhere that explains why getWithSet is better than get/set? 12:05:20 -*- legoktm is filing a bug 12:08:11 <Krinkle> legoktm: the docs for get() and set() say to consider using getWithSet, and the raw get()/set() enumerate a lot of things to consider if you call them directly. https://doc.wikimedia.org/mediawiki-core/master/php/classWANObjectCache.html 12:08:23 <Krinkle> but more generally, if you ask me, these methods just shouldn' be public in the first place. 12:09:00 <Krinkle> They probably are only public to allow for an optimisation in one or two places somewhere where we haven't bothered to accept or accomodate it in a way that is less damanging to the public API 12:09:17 <Krinkle> and they probably are only called here because someone migrated the code from wgMemc to wanCache 12:09:24 <Krinkle> which is a step in the right direction I guess. ```
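For context, the pattern being suggested is `WANObjectCache::getWithSetCallback()` instead of paired `get()`/`set()` calls. A minimal sketch, with a made-up cache key, TTL and `computeExpensiveValue()` helper (this is not the actual Searcher.php code):

```lang=php
<?php
// getWithSetCallback() wraps the read-compute-write cycle and provides
// stampede protection, regeneration and purge semantics.
$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$value = $cache->getWithSetCallback(
	$cache->makeKey( 'cirrussearch', 'some-derived-value' ), // made-up key
	WANObjectCache::TTL_MINUTE * 10,
	static function () {
		// Expensive computation goes here.
		return computeExpensiveValue(); // hypothetical helper
	}
);
```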
    • Task
    As a user, I want to order search results by the page size of each result, so that I can prioritize articles to work on. Split from T11519: Use cases: to find longest pages in a specific category, such as "stubs".
    • Task
    Split from T11519: Introduce a new keyword "pagesize" to search for pages of a given size, e.g. `incategory:Stubs pagesize:>3000` will give stubs with more than 3000 bytes. A similar keyword, `filesize`, already exists.
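A sketch of the filter clause such a keyword could translate to, assuming the indexed page byte length lives in a field called `text_bytes` (the field name is an assumption); this is only the Elasticsearch query fragment, not a full keyword implementation.

```lang=php
<?php
// "pagesize:>3000" would roughly become a range filter like this,
// combined with whatever the rest of the query produces.
$filter = [ 'range' => [ 'text_bytes' => [ 'gt' => 3000 ] ] ];
echo json_encode( $filter );
```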
    • Task
    **List of steps to reproduce** (step by step, including full links if applicable): * [[ https://en.m.wikipedia.org/w/index.php?title=Special:Search&limit=500&offset=0&profile=default&search=%22Azerbaijan%3A+MTN+%28until+2015%29%22&ns0=1 | Perform this mobile search ]] for "Azerbaijan: MTN (until 2015)" on the English Wikipedia **What happens?**: * You see about 300 articles in the search results. * Click on any of the articles in mobile, and observe that the text "Azerbaijan: MTN (until 2015)" does not appear in the rendered article or in its source text. * Switch to desktop view of any of the articles, and you can see that text inside one of the navboxes. **What should have happened instead?**: * In the mobile view, navboxes are not displayed, so search results on mobile should exclude text that appears in navboxes. **Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc**: Mobile version of English Wikipedia
    • Task
    As a search engineer I want to know the set of available tools so that I can decide which ones are more adapted to my needs and possibly deprecate some of the tools written in relevancyForge. Family of tools and a few examples: - judgement list creation & management (grading query sets) -- [[https://gerrit.wikimedia.org/g/wikimedia/discovery/discernatron|discernatron]] -- [[https://github.com/o19s/quepid/|quepid]] -- [[https://github.com/cormacparle/media-search-signal-test|media-search-signal-test]] -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/search/MjoLniR/+/refs/heads/master|mjolnir query grouping & DBN click model]] - evaluation engines -- [[https://github.com/o19s/quepid/|quepid]] -- [[https://github.com/SeaseLtd/rated-ranking-evaluator|rated-ranking-evaluator]] -- [[https://github.com/cormacparle/media-search-signal-test|media-search-signal-test AnalyzeResults.php]] -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge|relevance forge engine scorer]] -- [[https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge|relevance forge engine scorer & diff tools]] Aspects to evaluate: - ability to customize search integration: how easy is it to integrate a new search subsystem? - ability to store data and manage history (compare performance over time) - UX: UI, multi-tenancy AC: - produce a comprehensive list of tools with a description of their features
    • Task
    Steps to reproduce: * Upload images and PDFs with similar names, e.g. (mycat.png, category.jpg, cat1.pdf) * Edit a page * Insert a gallery * Enter a term in the search box that matches a known image name and a known PDF filename, e.g. cat Expected result: * 2 results with thumbnails for mycat.png and category.jpg Actual result: * No results returned. The Ajax result contains a message like 'Could not normalize image parameters for cat1.pdf'. If I change the search query to `-filemime:pdf cat` I get results correctly.
    • Task
    Currently CirrusSearch jobs are configured to not do any retries because cirrus jobs manage retries internally. Instead, cirrus jobs should report success when they have failed but scheduled a retry, and let change-prop do overall retries in case of a catastrophic job failure.
    • Task
    As a search engineer I want a dedicated dataset with the wikidata entities referenced from commons so that requests do not have to be made to wikidata directly. Commons and wikidata RDF data are available in a hive table. Create a spark job in wikidata/query/rdf/rdf-spark-tools that pulls all wikidata items linked from a mediainfo item using the property `P180` or `P6243` with the following data: - item - labels - aliases - descriptions - P31 (instance of) - P171 (taxon) Example for Q42: ```lang=json { TODO } ``` The resulting dataset should be available in a hive table for downstream operators. Hive table: `discovery.mediasearch_entities` HDFS folder: `hdfs:///wmf/data/discovery/mediasearch_entities` Schedule: should probably be rebuilt as soon as the commons mediainfo RDF dump is processed AC: - a new spark job in `wikidata/query/rdf/rdf-spark-tools` - a new dag in airflow to schedule this new job
    • Task
    # Context As part of a [[ https://fr.wiktionary.org/wiki/Wiktionnaire:Pages_propos%C3%A9es_%C3%A0_la_suppression/f%C3%A9vrier_2021#Discussion | discussion about the integration of a rare word within the French Wiktionary ]], it was pointed out that currently searching with the default internal engine of the Wiktionnaire for a term like [[https://fr.wiktionary.org/w/index.php?search=mqmqn&title=Sp%C3%A9cial%3ARecherche&profile=advanced&fulltext=1&searchengineselect=mediawiki&advancedSearch-current=%7B%7D&ns0=1&ns100=1&ns106=1&ns110=1|mqmqn]] will return no result. Popular general-purpose web search engines will rightfully suggest "did you mean “maman”?" Indeed, in this case, it's obvious to any knowledgeable person that someone most likely typed on a qwerty keyboard layout as if it were an azerty one. # Desired behaviour The minimum improvement would be that the internal search engine could provide a good suggestion for cases like this one. I'm not aware of the actual algorithms behind the search engine, but the Levenshtein distance (LD) between `mqmqn` and `maman` is only 2. That should certainly be taken into account rather than not returning a single result. Compare for example with how [[https://fr.wiktionary.org/w/index.php?search=mamqn&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&searchengineselect=mediawiki&advancedSearch-current=%7B%7D&ns0=1&ns100=1&ns106=1&ns110=1|searching for mamqn]] can suggest something like `brandwerend maken`, which has an LD of 15 from the provided input. It would be even better if it were possible to feed the engine a list of common misspellings, each with a comment on its cause. Such a facility could accept entries such as `mqmqn -> maman : "…qwerty [on] azerty…"` and `p.p. -> papa : "The community decided to abuse the regexp facility to suggest papa as result to pépé, pipi, popo, pypy and so on."` – although this latter example would be a defective use of the feature. The result could then be a search result page with a leading text such as `Did you mean “[[maman]]”? This is a common misspelling of a French word resulting from typing the word on a [[w:qwerty|]] keyboard layout as if it were an [[w:azerty| ]] one.`
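Both ideas are cheap to prototype in plain PHP; a toy sketch follows (the distance check and the format of the misspelling list are assumptions for illustration only):

```lang=php
<?php
// 1) Edit distance: "mqmqn" is only two substitutions away from "maman".
var_dump( levenshtein( 'mqmqn', 'maman' ) ); // int(2)

// 2) A community-maintained list of known misspellings with an explanation,
// as proposed above (format made up for this sketch):
$knownMisspellings = [
	'mqmqn' => [ 'suggest' => 'maman', 'note' => 'qwerty layout used as if it were azerty' ],
];
$query = 'mqmqn';
if ( isset( $knownMisspellings[$query] ) ) {
	echo 'Did you mean "' . $knownMisspellings[$query]['suggest'] . '"? ('
		. $knownMisspellings[$query]['note'] . ')';
}
```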
    • Task
    ```lang=php public function allowRetries() { return false; } ``` This override has a [doc comment](https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/a221008bd90151c60f0e2db568e83da552deb813/includes/Job/ElasticaWrite.php#88) which explains why it is disabled, namely that the job has its own retry logic. However, if I understand correctly, this also means that the job is lost without recovery if the PHP process is killed, e.g. during a deployment when we roll over php-fpm across the fleet, due to any other php-fpm restart (such as opcache capacity being reached, which is done by a cronjob currently), during switchovers when we kill running processes, etc. Disabling retries is relatively rare, and when done it is typically for jobs that exist only as an optimisation or that can self-correct relatively quickly (e.g. warming up the thumbnail or parser cache), or for unsafe/complex code that isn't atomic and cannot restart, with e.g. a user asynchronously waiting on the other end who would notice and know to re-try at the higher level (such as upload chunk assembly). I don't know if there is a regular and automated way by which this would self-correct for CirrusSearch. If not, then it might be worth turning this back on. Given that the code is already wrapped in a try-catch, it should be impossible for blind job queue growth to happen. In cases where a runtime error of the kind that you don't want to retry is found, the existing code will kick in as usual and signal that it should be counted as success. It's only when the process is aborted from the outside, and thus the job runner never gets a response or gets HTTP 500, that it will have permission to retry/requeue up to 3 times.
    • Task
    In T265894 @Tgr suggested the idea of a maintenance script for CirrusSearch to allow setting arbitrary field data in the ES index, for local development. In our suggested edits feature, we rely on ORES topics which are not populated in our local wiki; @Tgr wrote a script P10461 to populate this data. We would also like to set the `hasrecommendation` field (T269493) for articles. Having a maintenance script provided by the extension would be convenient.
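A hypothetical skeleton of what such a maintenance script might look like, only to show the shape: the class name, the options and the elided Elasticsearch update call are all assumptions, not an existing CirrusSearch script.

```lang=php
<?php
// setCirrusField.php (hypothetical): set an arbitrary field on a page's
// search index document, for local development only.
$IP = getenv( 'MW_INSTALL_PATH' );
if ( $IP === false ) {
	$IP = __DIR__ . '/../../..';
}
require_once "$IP/maintenance/Maintenance.php";

class SetCirrusField extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Set an arbitrary field in the search index (dev only)' );
		$this->addOption( 'pageid', 'Page id to update', true, true );
		$this->addOption( 'field', 'Index field name', true, true );
		$this->addOption( 'value', 'JSON-encoded field value', true, true );
	}

	public function execute() {
		$pageId = (int)$this->getOption( 'pageid' );
		$doc = [ $this->getOption( 'field' ) => json_decode( $this->getOption( 'value' ), true ) ];
		// The real script would send a partial update (POST <index>/_update/<pageId>
		// with [ 'doc' => $doc ]) through the Cirrus/Elastica connection; omitted here.
		$this->output( "Would update page $pageId with " . json_encode( $doc ) . "\n" );
	}
}

$maintClass = SetCirrusField::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```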
    • Task
    According to their blogpost (https://www.elastic.co/blog/licensing-change): > Starting with the upcoming Elastic 7.11 release, we will be moving the Apache 2.0-licensed code of Elasticsearch and Kibana to be dual licensed under SSPL and the Elastic License, giving users the choice of which license to apply. Their //FAQ on 2021 License Change//: https://www.elastic.co/pricing/faq/licensing Considering this is happening: 1. Will CirrusSearch rely on ElasticSearch in the near future (say on version 7.x)? 2. Will CirrusSearch rely on ElasticSearch in the long term? 3. According to task T213996, the switch away from MongoDB happened because it was removed from Debian and was therefore unsuitable for long-term use, so it is unclear whether that set a precedent. Is there any policy change or clarification for when another dependency announces a switch to a license that may not be suitable for MediaWiki? ---- Context on proprietary relicensing: https://sfconservancy.org/blog/2020/jan/06/copyleft-equality/ Existing alternative venues: https://opendistro.github.io/for-elasticsearch/contribute.html ("distribution" or "fork" depending on who you ask, no CLA) Some announced forks: https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/ https://logz.io/blog/open-source-elasticsearch-doubling-down/
    • Task
    **Problem:** Special:Search right now only allows for searching across all languages in the Lexeme namespace. It would be useful to allow to restrict the search to a specific language in order to make finding the right Lexeme easier. In order to do this we should introduce new cirrus search keywords. These could be `haslemma:en` and `haslang:Q1860`. **Example:** A search for "[[https://www.wikidata.org/w/index.php?search=a&title=Special:Search&profile=advanced&fulltext=1&ns146=1|a]]" to find the English indefinite article. It is currently the 17th result. **BDD** GIVEN a Lexeme search AND a keyword "haslang:Q1860" THEN the results only contain Lexemes with English as the Lexeme language GIVEN a Lexeme search AND a keyword "haslemma:en" THEN the results only contain Lexemes with English as one Lemma's spelling variant **Acceptance criteria:** * Results on Special:Search can be restricted by language via 2 new keywords **Notes:** * existing keywords specific to Wikibase: https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch
    • Task
    **User Story:** As a search user, I want to get the same results for cross-language suggestions regardless of the case of the query, because that usually doesn't matter to me. As noted below, searching for транзистор on English Wikipedia generates Russian cross-language suggestions, while searching for Транзистор does not (they only differ by the case of the first letter). Language identification via TextCat is currently case-sensitive because the n-gram models were generated without case folding. This makes sense as a model because word-initial caps are different from word-final caps in many cases, and some languages, like German, have different patterns of capitalization that can help identification. However, a side effect of that is that words that differ only by case can get different detection results—usually in the form of "no result" because one string is "too ambiguous" (i.e., there is more than one viable candidate). It would be mostly straightforward to case-fold the existing models (merging n-gram counts) to generate case-insensitive models, but we would have to re-evaluate the models' effectiveness. **Acceptance Criteria:** * Survey of how often differently-cased versions of the same query (original, all lower, all upper, capitalized words) get different language ID results, using the current TextCat params, to get a sense of the scope of the problem. * A review of any accuracy changes for case-folded TextCat models, using the currently optimized parameters. * If the problem is large enough and the accuracy of case-folded models drops too much, we need a plan (i.e., a new sub-ticket) to re-optimize the TextCat params for the case-folded and slightly lower-resolution but more consistent models. _____ **Original Description:** It's an issue I found as I was reporting T270847 :) If I [[ https://en.wikipedia.org/w/index.php?search=%D0%A2%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace of the English Wikipedia for "Транзистор" ]], I find zero results in the main screen, and one result in the right-hand sister project sidebar: "транзистор" in the English Wiktionary. The word means "transistor" in several languages that are written in the Cyrillic alphabet, and note that the search string begins with an uppercase Cyrillic letter. The title of the Wiktionary result, which //is// found, is written with a lowercase letter. If I [[ https://en.wikipedia.org/w/index.php?search=%D1%82%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace of the English Wikipedia for "транзистор" ]], which is the same word, but in all lowercase letters, then I get the same Wiktionary result in the sidebar, and also many results from the Russian Wikipedia (I'd also expect other languages, but that's another issue, T270847). Searching probably shouldn't be case-sensitive, at least not in a case like this.
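For what it's worth, the model-folding step itself is mechanically simple; a toy sketch in plain PHP, where `$model` is a stand-in for a real TextCat character n-gram model. The real work, as noted above, is re-evaluating the folded models, not producing them.

```lang=php
<?php
// Merge counts of n-grams that become identical after lowercasing.
$model = [ 'Тра' => 10, 'тра' => 40, 'ран' => 55 ]; // toy model
$folded = [];
foreach ( $model as $ngram => $count ) {
	$key = mb_strtolower( $ngram, 'UTF-8' );
	$folded[$key] = ( $folded[$key] ?? 0 ) + $count;
}
print_r( $folded ); // [ 'тра' => 50, 'ран' => 55 ]
```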
    • Task
    **User Story:** As a Wikipedia user and speaker of a given language, I would like to know that results are available in my language when searching on a Wikipedia in a different language, so I can read articles in my own language. Our current language identification process chooses the //one// most likely language to show results from. There may be other languages with exact title matches or other reasonable results. It would be useful to let users know that articles/results in those languages exist if possible. Potential pitfalls include the expense of searching more than one additional language and increased potential for poor relevance for general results. Some possible approaches: * Allow more than one language to be used for cross-language searching; possibly based on one or more of geolocation, user language preferences, browser language(s), and language ID results. * Search multiple languages for results; could limit additional languages to title matches or exact title matches. * Update UI: Display multiple results sets, or provide a language selector, or provide links to results/exact title matches. **Acceptance Criteria:** * An assessment of how many languages we can realistically search * If n == 1, give up. `:(` * Best option for how to choose which languages to search * Best option for how to search additional languages * A plan for updating the UI (may require help from outside the team) If/When we move this to current work, this ticket may need to be upgraded to an EPIC to support all those different tasks. ______ **Original Description:** Ukrainian Wikipedia is only sometimes shown in cross-wiki search results even if a relevant result is available. For example, if I [[ https://en.wikipedia.org/w/index.php?search=%D1%82%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search the article namespace English Wikipedia for the string "транзистор" ]] (it means "transistor"), it finds nothing in the English Wikipedia, as you would expect, and shows results from the Russian Wikipedia. An [[ https://uk.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D0%B7%D0%B8%D1%81%D1%82%D0%BE%D1%80 | article with the exact same title ]] exists in the Ukrainian Wikipedia, but it's not shown in the results. Evidently, the showing of results from the Ukrainian Wikipedia works in cross-wiki search for some searches, but not for all. If I [[ https://en.wikipedia.org/w/index.php?search=%D0%BF%D0%B5%D1%82%D1%80%D0%BE%D0%BF%D0%B0%D0%B2%D0%BB%D1%96%D0%B2%D1%81%D1%8C%D0%BA%D0%B0+%D0%B1%D0%BE%D1%80%D1%89%D0%B0%D0%B3%D1%96%D0%B2%D0%BA%D0%B0&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1 | search for "петропавлівська борщагівка" ]] (Petropavlivska Borshchahivka, a name of a village), I get one result from English and multiple results from Ukrainian. I'd expect to see a result from Ukrainian also for "транзистор" (transistor), and not only from Russian. There are also [[ https://www.wikidata.org/wiki/Q5339 | several more wikis ]] where there's an article with the exact same title: Bashkir, Bulgarian, Chechen, Kazakh, and more, and I'd expect to see all of them. It would also be OK to prioritize results for languages that I have configured in my browser, but I tried configuring Ukrainian, and I still see only Russian results. (And even if my browser language is //prioritized//, other languages should be //available//.)
    • Task
    There is no way for users to opt for a default search option to exclude documents. Adding "-filemime:pdf -filemime:djvu" verges on being incomprehensible for most non-tech users. This could be usefully added as a site user preference, or made another field in the Commons search UI. This is an issue made more significant recently, with the IA books project adding a million PDFs to the collections on Commons. Consequently even simple (non-document type) searches like "cats with flowers" are returning lots of uninteresting looking PDFs in the top search returns, unless you happen to be very interested in Seed Trade Catalogs.
    • Task
    **User story:** As an Elasticsearch developer, I want to be able to add useful filters in a logical order without having to worry about how they might interact to create an invalid token order. **Notes:** As outlined in the parent task (T268730) and related comments, because `homoglyph_norm` creates multiple overlapping tokens and `aggressive_splitting` splits tokens, the two can interact to create tokens in an invalid order if `homoglyph_norm` comes before `aggressive_splitting`. For example, a stream of tokens with offsets (0-5, 6-7, 0-5, 6-7), which should be properly ordered as (0-5, 0-5, 6-7, 6-7). The short-term solution is to swap their order, but that is not the logical order they should be applied—though the outcome is the same in the majority of cases (but not all). There is a specific and a generic approach to solving the problem: * Specific: recreate either `aggressive_splitting` or its component `word_delimiter` in such a way that it doesn't create out-of-order tokens. This would require caching incoming tokens to make sure that none that come immediately after would be out of order. * Generic: create a general-purpose reordering filter that would take a stream and reorder tokens in an invalid order (up to some reasonable limit—it shouldn't have to handle a thousand tokens in reverse order, for example). ** Alternatively, it could clobberize highlighting and possibly some other features by simply changing the offset information to be "acceptable", as `word_delimiter_graph` does. So, (0-5, 6-7, 0-5, 6-7) would become (0-5, 6-7, 6-6, 6-7)—it's not right, but at least it isn't broken. The generic case would allow us to reorder tokens for the existing `aggressive_splitting` and could be useful in future situations, but is probably more difficult to code and possibly noticeably slower. **Acceptance Criteria:** * We can order `homoglyph_norm` before `aggressive_splitting` without causing errors on known-troublesome tokens such as `Tolstoу's` (with Cyrillic у).
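A toy PHP illustration of the generic option (the real fix would be a Lucene token filter in Java; this only shows the offset ordering being restored within a small buffered window):

```lang=php
<?php
// Re-emit buffered tokens ordered by (startOffset, endOffset), turning the
// invalid stream (0-5, 6-7, 0-5, 6-7) into (0-5, 0-5, 6-7, 6-7).
$tokens = [ [ 0, 5 ], [ 6, 7 ], [ 0, 5 ], [ 6, 7 ] ];
usort( $tokens, static function ( array $a, array $b ): int {
	return $a[0] <=> $b[0] ?: $a[1] <=> $b[1];
} );
print_r( $tokens ); // [ [0,5], [0,5], [6,7], [6,7] ]
```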
    • Task
    As CirrusSearch maintainer I want MediaSearch to use a dedicated dataset built from wikidata that does not rely on the existing wikidata search APIs so that I can improve one without impacting the other. Sub-tickets will be created as needed but the plan is roughly: - import commons mediainfo dump to hdfs - spark job that joins commons & wikidata and output a dedicated dataset for concept lookups - determine the mapping, possibly experimenting with better techniques (not one field per language) to support multiple languages - custom elasticsearch query to do query expansion&rewrite - adapt mediasearch and replace the wikidata search API using query expansion - optional but would be good to have: provide completion for wikidata items using this same dataset instead of using the wikidata completion API AC: - The MediaSearch query builder is no longer using the wikidata search API - A single request is made to elastic
    • Task
    Searching for several words returns the results in random order instead of sorted by relevance. For example: # https://ru.wikipedia.org/w/index.php?search=Орден+"Святого+Марка"&title=Служебная:Поиск&profile=advanced&fulltext=1&advancedSearch-current={}&ns0=1 Searching for `Орден "Святого Марка"` returns the page `Орден Святого Марка` third instead of first as I would've expected since it is the most relevant result. It also returns the page `Награды Египта` (which also contains this exact sequence -- `Орден Святого Марка`) close to the end of the 1st page. # https://ru.wikipedia.org/w/index.php?search=Орден+Святого+Марка&title=Служебная:Поиск&profile=advanced&fulltext=1&advancedSearch-current={}&ns0=1 Searching for `Орден Святого Марка` (no quotation marks) does return the page `Орден Святого Марка` first but it doesn't return `Награды Египта` on the 1st page at all -- again, despite containing the very sequence that is searched for. My search settings are set to default so it's what most users get.
    • Task
    **User story:** As a non-WMF user of MediaWiki full-text search, I want to be able to configure custom analysis chains that are more appropriate for my use case. This issue came up in a discussion with @Svrl on the [[ https://www.mediawiki.org/wiki/Topic:Vxh0rmkyfef0pm70 | Cirrus help talk page ]]. For example: while you can specify `$wgLanguageCode = 'cs';`, that only allows you to enable the same specific analysis chain as used on cswiki. If we change that analysis on our end, it also changes for external users when they upgrade MediaWiki. If you want to do something different (like using the Czech stemmer + ICU folding), you can't easily do so (it may be possible with lots of hacking and manual maintenance, but that's sub-optimal). @dcausse & @TJones discussed this some, and @dcausse found a way to inject config to update or replace a language-specific configuration. An example config is done for Czech P13907. This should be documented in our on-wiki docs. **Acceptance Criteria:** * Update appropriate documentation page(s) on-wiki with general method for doing this, and at least one specific example. Should be reviewed by another search developer.
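As an illustration of the kind of target configuration being described (Czech stemmer plus ICU folding), the Elasticsearch analysis block might look like the sketch below, written as a PHP array. The analyzer name and exact filter order are assumptions, `icu_folding` requires the analysis-icu plugin, and the actual way to wire this into CirrusSearch is what P13907 demonstrates.

```lang=php
<?php
// Hypothetical target analysis settings: Czech stemming followed by ICU folding.
$analysis = [
	'filter' => [
		'czech_stemmer' => [ 'type' => 'stemmer', 'language' => 'czech' ],
	],
	'analyzer' => [
		'text' => [ // analyzer name assumed
			'type' => 'custom',
			'tokenizer' => 'standard',
			'filter' => [ 'lowercase', 'czech_stemmer', 'icu_folding' ],
		],
	],
];
echo json_encode( [ 'analysis' => $analysis ], JSON_PRETTY_PRINT );
```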
    • Task
    **While performing this command:** php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php **The following error appears:** PHP Fatal error: Declaration of Elasticsearch\Endpoints\Indices\Exists::getParamWhitelist() must be compatible with Elasticsearch\Endpoints\AbstractEndpoint::getParamWhitelist(): array in /var/www/html/mwtest/extensions/Elastica/vendor/elasticsearch/elasticsearch/src/Elasticsearch/Endpoints/Indices/Exists.php on line 60 **Observed on:** MediaWiki 1.31.10 Elasticsearch 5.6.16 Ubuntu 18 PHP 7.2.24
    • Task
    Add a new CirrusSearch keyword itemquality, e.g. -itemquality:A will return all class A items. (In this task, it is not required to update the class automatically; they may be periodically updated, e.g. once every week. This may be decided later.)
    • Task
    I searched ``` incategory:"Files with no machine-readable license" insource:/eview/ -FlickreviewR ``` on Commons, in an effort to find files in that category that have a review template in source wikitext. This query returns some old files. By sorting by edit date, I found for example this: https://commons.wikimedia.org/wiki/File:Korg_Electribe_MX_(EMX-1)_Valve_Force.jpg It was in that category for **less than a minute** when it was uploaded in **2010**! As soon as this edit https://commons.wikimedia.org/w/index.php?diff=43801464 in the same minute of its upload passed, it was already out of that category. Yet it still shows up in my search query 10 years later! The file has been **edited more than 10 times** over the decade, and was last edited in 2017, so your database should have been updated, right?! I don't know whether this kind of false positive is solely related to the incategory keyword or not. Please investigate.
    • Task
    Search is often used for finding articles to edit; the ability to exclude protected articles would make that more effective. #growthexperiments, which offers articles with simple editing tasks to newcomers (and thus needs to avoid recommending protected articles), currently filters out protected articles on the client side (via `action=query&prop=info&inprop=protection`) which is far from ideal and makes proper handling of result sizes and offsets impossible. It would be nice to have a CirrusSearch keyword (maybe `hasprotection:edit` / `hasprotection:move`?) for filtering for protection status. Page protection changes are accompanied by a null edit, which pushes status changes to the search index, so AIUI all that would be needed is to add a protection field to the ElasticSearch index, add it to the EventBus event for new revisions, and register the appropriate search feature.
    • Task
    # Status - v4 - running smoke test in CI ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/617416 | 617416 ]]) fails - v5 - Cindy-the-browser-test-bot fails when running [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/610396 | 610396 ]] without saying which tests failed - Vidhi still doesn't have CirrusSearch running locally, [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]] might help - Vidhi [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007/5#message-282ce3d4e62b3807002d2fdf76f42da08adf25f7 | can't log in ]] to horizon.wikimedia.org --- # TODO [x] add a separate patch renaming `@selenium-test` to `selenium-test` to check if webdriverio v4 tests pass in CI: [[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/617416 | 617416 ]]: [] update T253869 [] count lines of code in `tests/selenium` and `tests/integration` [] count tests in feature files (scenario and scenario outline) [] local development environment [x] [[ https://www.mediawiki.org/wiki/MediaWiki-Docker/Extension/CirrusSearch | MediaWiki-Docker/Extension/CirrusSearch ]]: P11874 [x] run the tests targeting the beta cluster ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/616734 | 616734 ]]): v4 P12120, v5 P12130 [] try [[ https://github.com/montehurd/mediawiki-docker-dev-sdc | montehurd/mediawiki-docker-dev-sdc ]] as local development environment: P12314 [] run the tests targeting mediawiki-vagrant ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]]): paste TODO [] run the tests targeting Wikimedia Cloud Services ([[ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/613007 | 613007 ]]): Not authorized for any projects or domains [x] Update to v5 [] Update to v6 [] Update to v7
    • Task
    MW search suggestions appear when entering some text in the search input in the skin, but if you try to dismiss them after clicking inside the suggestion area in a way that does not take you to the next page (i.e. mouse up outside of the suggestions), the suggestions are stuck even if you click outside of them. Clicking outside of the search focus areas does not force them to hide. Steps to reproduce: * go to [[ https://en.wikipedia.org/wiki/Main_Page | en.wikipedia.org ]] * type //math// in the upper right corner search box * click on the first result, but do not release the mouse * move the mouse outside the suggestions area and release The suggestions list will now not go away unless you click one of the suggestions!
    • Task
    Elastica is an extension created in {T56049} (predating the librarization project) that does nothing other than provide a connection between Elasticsearch and MediaWiki. It contains two parts: one is the Elastica PHP library; the other is ElasticsearchConnection, which could also be converted to a library and might become a part of CirrusSearch.
    • Task
    cirrus_build_completion_indices.sh should not ignore failures of the underlying script `extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php`. Prior to adding `| ts '{}'` it was failing with a 123 status, probably because of wikidata: `Completion suggester disabled, quitting...`. Since we expect this error, UpdateSuggesterIndex could perhaps return success in this case. To sum up, this task is: - make sure cirrus_build_completion_indices.sh in puppet reports a failure when UpdateSuggesterIndex.php fails - make sure that UpdateSuggesterIndex.php does not return with an error when quitting with this message: `Completion suggester disabled, quitting...`
    • Task
See [[https://translatewiki.net/w/i.php?title=MediaWiki:Searchresults-title/wa&diff=9392337&oldid=1189285|here]] and the result [[https://wa.wikipedia.org/w/index.php?search=ok&ns0=0|here]]. The hex HTML entity is not correctly rendered in the title tag:
```
&amp;#x202F;
```
{F31842466}

This is what I see in the title bar of my browser: {F31842489}
```
<title>Rizultats des rcwerances po «&amp;#x202F;ok&amp;#x202F;» — Wikipedia</title>
```

This is what is expected: {F31842491}
```
<title>Rizultats des rcwerances po «&#x202F;ok&#x202F;» — Wikipedia</title>
```

The ampersand character shouldn't be encoded when it is part of an HTML entity.
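A standalone PHP illustration of the double-encoding effect (not the actual MediaWiki code path; `htmlspecialchars()` is just used here to show the difference the `double_encode` flag makes):

```lang=php
<?php
// The localized message already contains the numeric entity &#x202F;
// (narrow no-break space). Escaping the string again turns its ampersand
// into &amp;, which is what shows up in the browser's title bar.
$title = 'Rizultats des rcwerances po «&#x202F;ok&#x202F;» — Wikipedia';

// Current (broken) result: the existing entity is escaped a second time.
echo htmlspecialchars( $title, ENT_QUOTES, 'UTF-8' ), "\n";
// Rizultats des rcwerances po «&amp;#x202F;ok&amp;#x202F;» — Wikipedia

// Expected result: existing entities are left alone (double_encode = false).
echo htmlspecialchars( $title, ENT_QUOTES, 'UTF-8', false ), "\n";
// Rizultats des rcwerances po «&#x202F;ok&#x202F;» — Wikipedia
```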
    • Task
For example, you should be able to use https://en.wikipedia.org/w/index.php?search=linksto%3Am%3A to search for all pages that link to the Meta-Wiki main page. One use case is T249688#6055456, though {T239628} would be a more proper solution for that. See also {T68293}.
    • Task
    See {T248363} and T208425#5992965 for background.
    • Task
For example see https://www.wikidata.org/w/index.php?title=Special:Search&limit=20&offset=20&profile=default&search=haslabel%3Akhw&advancedSearch-current={}&ns0=1&ns120=1; this returns a lot of items that have a khw alias rather than a khw label. See also https://www.wikidata.org/wiki/Q13610143?action=cirrusdump. I proposed to introduce some new keywords:
* hasalias
* haslabeloralias
* haslabel (existing, will search labels only)
* inalias
* inlabeloralias
* inlabel (existing, will search labels only)

Changing the behavior of the existing haslabel and inlabel may be considered a breaking change.
    • Task
restify, used by CirrusSearch, has a vulnerability; please update to a newer version. From the npm advisory (#1171, csv-parse, severity: high):

> Versions of csv-parse prior to 4.4.6 are vulnerable to Regular Expression Denial of Service. The __isInt() function contains a malformed regular expression that processes large specially-crafted input very slowly, leading to a Denial of Service. This is triggered when using the cast option.
    • Task
    Once {T240559} lands, add the new `articletopic` search keyword to AdvancedSearch and provide a nice interface for selecting topics (a fixed list of 64 keywords, one or more of which can be used in the query).
    • Task
This occurs when called from CirrusSearch's forceSearchIndex.php. The result, after indexing some pages but before finishing the job, is:

```
MWException from line 348 of /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php: Bad parser output text.
#0 [internal function]: ParserOutput->{closure}(Array)
#1 /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php(359): preg_replace_callback('#<(?:mw:)?edits...', Object(Closure), '<div class="mw-...')
#2 /var/www/mediawiki-1.34.0/includes/content/WikiTextStructure.php(154): ParserOutput->getText(Array)
#3 /var/www/mediawiki-1.34.0/includes/content/WikiTextStructure.php(223): WikiTextStructure->extractWikitextParts()
#4 /var/www/mediawiki-1.34.0/includes/content/WikitextContentHandler.php(152): WikiTextStructure->getOpeningText()
#5 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(380): WikitextContentHandler->getDataForSearchIndex(Object(WikiPage), Object(ParserOutput), Object(CirrusSearch\CirrusSearch))
#6 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(458): CirrusSearch\Updater::buildDocument(Object(CirrusSearch\CirrusSearch), Object(WikiPage), Object(CirrusSearch\Connection), 0, 0, 0)
#7 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/includes/Updater.php(236): CirrusSearch\Updater->buildDocumentsForPages(Array, 0)
#8 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php(219): CirrusSearch\Updater->updatePages(Array, 0)
#9 /var/www/mediawiki-1.34.0/maintenance/doMaintenance.php(99): CirrusSearch\ForceSearchIndex->execute()
#10 /var/www/mediawiki-1.34.0/extensions/CirrusSearch/maintenance/forceSearchIndex.php(689): require_once('/var/www/mediaw...')
#11 {main}
```

This is with CirrusSearch-REL1_34-a86e0a5.tar.gz.

I subsequently added an `echo` to see what it was choking on, which looks to be this:

```lang=html
<h2><span class="mw-headline" id="Links">Links</span><mw:editsection page="File::Spec" section="1">Links</mw:editsection></h2>
<ul><li><a rel="nofollow" class="external free" href="http://perldoc.perl.org/File/Spec.html">http://perldoc.perl.org/File/Spec.html</a></li></ul>
```

On the older version of the wiki, searching for perldoc (in SphinxSearch in this case) brings up just this: {F31554958} Clicking on it shows this error: {F31554962}

In other words, this appears to involve markup that can only be inherited from really old content, but which might be skipped over more gracefully when encountered. Adding the middle three lines here allowed indexing to run to completion:

```lang=php
if ( $options['enableSectionEditLinks'] ) {
	if (preg_match("|::|",$text)) {
		$text=preg_replace("|::|","\:\:",$text);
	}
	$text = preg_replace_callback(
```

Most likely there's a better way to fix this, though.
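One possible narrower direction, sketched below as a standalone helper rather than an edit to core (untested against MediaWiki internals, and where exactly it would be applied is left open): escape `::` only inside the legacy `<mw:editsection page="...">` markers, instead of rewriting every `::` in the page text as the patch above does.

```lang=php
<?php
// Hypothetical helper, not part of MediaWiki or CirrusSearch: rewrite the
// page="..." attribute of legacy edit-section markers so that the title no
// longer trips the "Bad parser output text" check, while leaving every other
// "::" in the content untouched.
function escapeLegacyEditSectionMarkers( string $html ): string {
	return preg_replace_callback(
		'#<((?:mw:)?editsection) page="(.*?)"#',
		static function ( array $m ): string {
			$escapedPage = str_replace( '::', '\\:\\:', $m[2] );
			return '<' . $m[1] . ' page="' . $escapedPage . '"';
		},
		$html
	);
}

// The offending markup quoted above:
$html = '<mw:editsection page="File::Spec" section="1">Links</mw:editsection>';
echo escapeLegacyEditSectionMarkers( $html ), "\n";
// <mw:editsection page="File\:\:Spec" section="1">Links</mw:editsection>
```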
    • Task
This would have prevented T244479, which made its way all the way to production without being caught.
    • Task
    Similar to {T221135} but for WikibaseLexemeCirrusSearch
    • Task
Example: https://en.wikipedia.org/w/index.php?search=User%3ARobin+Patterson&title=Special:Search&fulltext=Search&ns0=1

The current "Results from sister projects" heading is misleading (people may think the results are in the main namespace). Compare with the local results on the left side.
    • Task
The [[https://www.mediawiki.org/wiki/Help:CirrusSearch|CirrusSearch help]] page describes all parameters in detail, but it is missing a summary (a list or table) listing all the parameters clearly, one under another.
    • Task
[[https://www.mediawiki.org/wiki/Help:CirrusSearch#Explicit_sort_orders|Explicit sort orders]] can only be accessed using the new AdvancedSearch interface or the URL. In the old search options and in the top search bar there is no way to do this other than modifying the URL of the results page.
    • Task
Per @CDanis. Off the top of my head, GlobalUserPage needs updating; there are definitely others. Rough codesearch: https://codesearch.wmflabs.org/operations/?q=https%3A%2F%2F&i=nope&files=php%24&repos=Wikimedia%20MediaWiki%20config
    • Task
Certain characters[1] are lost when highlighted in titles and text snippets.

To reproduce, search for [[ https://en.wiktionary.org/w/index.php?search=intitle%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special%3ASearch&profile=default&fulltext=1&searchengineselect=mediawiki | `intitle:/[𔐀-𔙆]/ anatolian` ]] on English Wiktionary. The three results are 𔐱𔕬𔗬𔑰𔖱, 𔖪𔖱𔖪, and 𔑮𔐓𔗵𔗬. However, they are displayed as 𔖱, 𔖪, and 𔗬; see screenshot: {F31505303}

Looking at the underlying HTML, the title of the first result (𔖱) contains several empty `searchmatch` spans: `<span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch"></span><span class="searchmatch">𔖱</span>`

I //think// this may have something to do with the characters being lost during tokenization (or being the kinds of characters that are lost during tokenization—maybe they are treated as punctuation?). If you search for 𔑮𔐓𔗵𔗬 (no quotes), the only hit is the exact title match. Searching for "𔑮𔐓𔗵𔗬" (with quotes) gives zero results. I verified that the English `text` analyzer returns no tokens for the string 𔑮𔐓𔗵𔗬.

Another example: [[ https://en.wiktionary.org/w/index.php?search=insource%3A%2F%5B%F0%94%90%80-%F0%94%99%86%5D%2F+anatolian&title=Special:Search&profile=advanced&fulltext=1&searchengineselect=mediawiki&ns828=1 | `insource:/[𔐀-𔙆]/ anatolian` ]] restricted to the `Module` namespace gives a snippet with this:

> canonicalName = "Anatolian Hieroglyphs", characters = "-",

//characters = "-"// is //characters = "𔐀-𔙆"// in the original. The underlying HTML is `&quot;<span class="searchmatch"></span>-<span class="searchmatch"></span>&quot;`, again with empty `searchmatch` spans.

[1] I first discovered this when looking into T237332, so the examples so far are Anatolian Hieroglyphs, though other characters may be affected.
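The analyzer observation above can be reproduced directly with Elasticsearch's `_analyze` API; a rough PHP sketch follows, where the host, index name, and analyzer name are assumptions that need to be adjusted to the actual cluster:

```lang=php
<?php
// Sketch: ask the index's full-text analyzer what tokens it produces for the
// Anatolian Hieroglyphs string from the first example. An empty token list
// would explain both the zero-result quoted search and the empty
// <span class="searchmatch"> elements emitted by the highlighter.
$body = json_encode( [
	'analyzer' => 'text',       // assumed analyzer name
	'text'     => '𔑮𔐓𔗵𔗬',
] );

$ch = curl_init( 'http://localhost:9200/enwiktionary_content/_analyze' ); // assumed host/index
curl_setopt_array( $ch, [
	CURLOPT_POST           => true,
	CURLOPT_POSTFIELDS     => $body,
	CURLOPT_HTTPHEADER     => [ 'Content-Type: application/json' ],
	CURLOPT_RETURNTRANSFER => true,
] );
$response = json_decode( curl_exec( $ch ), true );
curl_close( $ch );

var_dump( $response['tokens'] ?? [] );
```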
    • Task
1) Go to the Wikisource of a certain language, for example English: https://en.wikisource.org
2) Search for the //exact// title of something in a different language, for example "Les Enfants du capitaine Grant": https://en.wikisource.org/w/index.php?search=Les+Enfants+du+capitaine+Grant
3) See some results. You have to click the author's page, find the book in the list again, and click on it.

This is a poor example because the word "Grant" appears in the text of the page for both English and French, so the book in English is the second result, but I think you get my point. It should just send me to the English translation (or whichever language Wikisource I'm on) directly, since I typed an exact match of a book title that appears on a different-language Wikisource. This could work by automatically creating redirects when articles are created or moved, or by changing the code of the search page. It should also work like on Wikipedia, where it shows you "We found the following results from French Wikisource".
    • Task
[[https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Recherche?search=Je+suis+venir+te+dire+que+je+m%27en+vais&sourceid=Mozilla-search&ns0=1|Looking for “Je suis venir te dire que je m'en vais” on fr.wp]] finds the “#Je_suis_venue_te_dire_que_je_m'en_vais” section as its second result, but does not find the following pages:
* Je suis venu te dire que je m'en vais
* Je suis venue te dire que je m'en vais…
* Je suis venue te dire que je m'en vais - Sheila live à l'Olympia 89

which are the three top results when [[https://fr.wikipedia.org/w/index.php?sort=relevance&search=Je+suis+venu+te+dire+que+je+m%27en+vais&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1|searching for the correct phrase “Je suis venu te dire que je m'en vais”]].

Note that the Wdsearch gadget results already include the “Je suis venu te dire que je m'en vais” pages, but that is probably T219108.