    • Task
The [HtmlFormatter project](https://www.mediawiki.org/wiki/HtmlFormatter) is used in a few (not that many) places: https://codesearch.wmflabs.org/deployed/?q=use%20HtmlFormatter%5C%5C&i=nope It is built on libxml and XPath, with a bunch of hacks to avoid bugs and a partial CSS-selector-to-XPath translator. We should rebase it on [Remex](https://www.mediawiki.org/wiki/RemexHtml) (to parse HTML) and [zest.php](https://github.com/cscott/zest.php) (to match selectors). This would let us reduce our dependence on libxml, increase code coverage and usage of Remex, improve corner-case parsing of HTML and selectors, and generally put our eggs in fewer baskets. (It's possible we shouldn't use zest, but should instead just use a slightly better version of CSS-selector-to-XPath, which can be shared with Parsoid.)
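For illustration, a minimal sketch of the proposed pairing, assuming the published Remex DOMBuilder pipeline and zest.php's `Zest::find()` entry point (namespaces vary between package versions; the wrapper function itself is hypothetical):
```lang=php
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;
use Wikimedia\Zest\Zest;

/**
 * Hypothetical replacement for HtmlFormatter's libxml/xpath internals:
 * parse with Remex, then match CSS selectors directly with zest.php.
 */
function querySelectorAll( string $html, string $selector ): array {
	$domBuilder = new DOMBuilder();
	$treeBuilder = new TreeBuilder( $domBuilder );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html, [] );
	$tokenizer->execute();
	// Match selectors against the resulting DOM instead of
	// translating CSS selectors to xpath by hand.
	return Zest::find( $selector, $domBuilder->getFragment() );
}
```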
    • Task
**Problem**
Performing a search like this:
https://it.wikipedia.org/w/api.php?action=query&prop=info%7Cpageprops&generator=prefixsearch&gpssearch=1%20dicembre&gpslimit=10&ppprop=disambiguation
does not include the page with the exact title:
https://it.wikipedia.org/w/api.php?action=query&prop=info%7Cpageprops&titles=1%20dicembre
    • Task
    As we plan to work more on autocomplete we should verify that all the data we need is available and collected properly.
    • Task
== Steps to reproduce ==
* Open [[https://de.wikipedia.org/w/index.php?search=&profile=advanced | Special:Search@GermanWikipedia]] in //advanced// mode.
* Enter the following search term: ##incategory:Vorlage:Fremdsprachenunterstützung insource:/invoke:Vorlage:lang\|full/##
* Make sure no particular namespace is specified.
* Query.
* 20 of 223 results should be shown.
* Note that, by the nature of //incategory//, all results are from the template namespace.
* In //advanced// mode all namespaces are searched.
* Now try to get the next 20 from the result set, or extend the number of hits to 50, 100, 250, 500.
* Clicking one of these links no longer leads to any results.
* Affected: Show (previous 20 | next 20) (20 | 50 | 100 | 250 | 500)
== Reason ==
Neither ##&profile=advanced## nor any namespace information is provided by the offered URL for consecutive pages.
* Therefore the follow-up is searching ns=0 only.
* All results are expected in ns=10.
== Remedy ==
The offered next/previous/more links must preserve relevant settings from the current page URL, e.g. ##&profile=advanced## or any ##&ns=## which might influence the result set (see the sketch below).
== Task has been rewritten ==
On first attempt an escaping-encoding-decoding problem was assumed.
* It turned out that this was not the cause.
* It just happened that suspicious characters occurred in the search expression when the behaviour was first encountered.
Please ignore the discussion before 16 December.
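A minimal sketch of the remedy, assuming a MediaWiki `WebRequest` is at hand; the helper function and the parameter whitelist are assumptions, not the actual SpecialSearch code:
```lang=php
/**
 * Hypothetical helper: build the query parameters for a "next 20" link,
 * preserving profile and namespace settings from the current request.
 */
function buildPagingParams( WebRequest $request, int $offset, int $limit ): array {
	$params = [
		'search' => $request->getText( 'search' ),
		'offset' => $offset,
		'limit' => $limit,
	];
	// Carry over the search profile if one was chosen.
	if ( $request->getVal( 'profile' ) !== null ) {
		$params['profile'] = $request->getVal( 'profile' );
	}
	// Carry over every ns<N> checkbox so the namespace filter survives paging.
	foreach ( $request->getValues() as $key => $value ) {
		if ( preg_match( '/^ns\d+$/', $key ) ) {
			$params[$key] = $value;
		}
	}
	return $params;
}
```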
    • Task
The LTR plugin currently scores one doc at a time; certain ranking algorithms may provide a significant speed-up when scoring multiple docs in one pass.
    • Task
This would at least be very useful for maintenance of typo fixes in file titles, and maybe elsewhere as well (in some sense the opposite of T171155, I guess). Generally, when files are renamed (technically, moved to the new name) they are not deleted under the old name; instead a redirect is created. So, if there is a typo fix, the wrong name will still be found with a matching search. But if one frequently searches for new files with the same mistake, the redirects of the fixed files interfere. Examples from the wild:
- //avaition// for //aviation//: https://commons.wikimedia.org/w/index.php?title=Special:Search&search=file%3A+intitle%3A+avaition
- //illistration// for //illustration//: https://commons.wikimedia.org/w/index.php?title=Special:Search&search=file%3A+intitle%3A+illistration
- The spelling of //Potsdam// is slightly unusual; it's easily miswritten as //Postdam// (alas, there is a village with this name in Belgium): https://commons.wikimedia.org/w/index.php?title=Special:Search&search=file%3A+intitle%3Apostdam+-belgique
On the other hand there may be cases when it could be useful to search only for redirects.
//Edit:// The examples above are from Commons only, but this would be useful elsewhere, too: Searching for possible impacts of an issue on dewiki, it was necessary to find lemmata with parentheses containing the templates `commons` and `commonscat`, but the results show quite a lot of redirects with parentheses (very probably created by page moves), making it harder to look for the issue's impact:
- https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=%3A+hastemplate%3A"commons"+intitle%3A%2F\(.%2B\)%2F
- https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=%3A+hastemplate%3A"commonscat"+intitle%3A%2F\(.%2B\)%2F
    • Task
**Steps to reproduce:**
1. Make a patch in the #cirrussearch repository.
2. Let Jenkins and Cindy-the-test-browser-bot run tests.
3. See Cindy-the-test-browser-bot fail the tests.
4. See the output, for example in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/455059/
**Expected outcome:** Output like CI/zuul/integration provides, so the contributor can know where and what is wrong and fix it.
**Actual outcome:** One long line of function names (?) without any pointers to help. No file names, no line numbers.
    • Task
**Details**
Created as the one remaining part of {T183096}. This is blocked on CirrusSearch using extension registration in the first place. {T87892}
Wikibase uses class_exists checks in tests as well as production code to check if a specific extension is enabled:
```lang=php, counterexample
if ( !class_exists( 'CirrusSearch' ) ) {
```
This binds against a fully qualified class name in a super-problematic way that does not fail when the class is renamed or moved to another namespace. This has already caused many regressions in other (related and unrelated) codebases. All these conditionals must be replaced with a proper check:
```lang=php
if ( !ExtensionRegistry::getInstance()->isLoaded( 'CirrusSearch' ) ) {
```
See https://gerrit.wikimedia.org/r/398511 for an example.
> ExtensionRegistry can only be used when the extension to check is using extension.json
**Impact & priority**
This change could avoid confusion and regressions in the future when class names are changed or namespaces are moved. The task itself is easy to complete, so let's get it done as soon as possible.
**Task & Acceptance Criteria**
- All occurrences of class_exists( 'CirrusSearch' ) (or any other CirrusSearch class) in Wikibase have been converted to use extension registration checks.
    • Task
For example, use `Wikipedia pageproperty:disambiguation` to search all disambiguation pages containing "Wikipedia", or use `pageproperty:displaytitle` to search all articles with a DISPLAYTITLE. This is also useful for Wikidata; e.g. together with haswbstatement you can find a list of humans (or any other subjects) without any sitelinks or without any identifiers.
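A minimal sketch of what such a keyword could look like, modeled on CirrusSearch's SimpleKeywordFeature pattern; the `page_properties` field name is an assumption, not a confirmed design:
```lang=php
use CirrusSearch\Query\SimpleKeywordFeature;
use CirrusSearch\Search\SearchContext;

/**
 * Hypothetical pageproperty: keyword, filtering on an indexed
 * page-properties field.
 */
class PagePropertyFeature extends SimpleKeywordFeature {
	protected function getKeywords() {
		return [ 'pageproperty' ];
	}

	protected function doApply( SearchContext $context, $key, $value, $quotedValue, $negated ) {
		// Assumes page props are indexed into a "page_properties" keyword field.
		$filter = new \Elastica\Query\Term( [ 'page_properties' => $value ] );
		// Return the filter and drop the keyword from the query text.
		return [ $filter, false ];
	}
}
```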
    • Task
Certain features of CirrusSearch in Wikibase depend on the elasticsearch wikimedia-extra plugin. Disable those features and emit a warning if the plugin is not installed.
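A sketch of one way to detect the plugin, assuming an Elastica client is at hand; the endpoint is elasticsearch's standard `_nodes/plugins` API, while the function and the plugin name checked are assumptions:
```lang=php
use Elastica\Client;

/**
 * Hypothetical check: is the wikimedia-extra plugin installed on every node?
 */
function hasWikimediaExtra( Client $client ): bool {
	$data = $client->request( '_nodes/plugins' )->getData();
	foreach ( $data['nodes'] as $node ) {
		$names = array_column( $node['plugins'], 'name' );
		// The plugin name "extra" is an assumption here.
		if ( !in_array( 'extra', $names, true ) ) {
			return false;
		}
	}
	return true;
}
```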
    • Task
Something changed in the completion suggester for search. Earlier it gave more accurate results while typing. In Finnish Wikipedia (https://fi.wikipedia.org/wiki/): 1) I have the preference "Default (recommended): Corrects up to two typos. Resolves close redirects." 2) Now if I type "Tsekki" it doesn't suggest the correct page "Tšekki", to which "Tsekki" redirects. There is just one letter of difference, so fewer than two "typos". Another thing that has changed: earlier, when I typed the whole phrase "Novak Djokovic", it gave me the correct form "Novak Đoković", to which it redirects.
    • Task
https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bsearch There's an example at the bottom of the page: api.php?action=query&list=search&srwhat=text&srsearch=meaning -- It lists "totalhits" for all entries and "wordcount" for each entry. Is there a URI to list the number of hits for "meaning" for each entry, like "wordcount"?
    • Task
The clear target should be [[ https://commons.wikimedia.org/wiki/User:Gunnar.offel ]] E.g. https://commons.wikimedia.org/w/index.php?search=Gunnar.offel&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B%222%22%5D%7D&ns2=1 I do not have to say "`intitle:`" is pointless here. It seems an unusual (soft) redirect bypassed the search, which should not happen. Another annoying thing is the (autofill) live search preview, which always searches in the standard namespace (which is not wanted). * parent task: {T73491}
    • Task
**Summary**
Categories with redirects don't appear in search results (particularly when searching using "intitle:").
**Description**
As described on English Wikipedia: If, for instance, I search for intitle:"Applied" with incategory:"Physics journals", I get the following. All is cool, that works.
https://en.wikipedia.org/w/index.php?search=intitle%3A%22Applied%22+AND+incategory%3A%22Physics+journals%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns10=1&searchToken=b6t28msp2y0zy79ysb63vpc54
However, if I search for intitle:"Appl." with incategory:"Redirects from ISO 4", I get this. Which is no results at all.
https://en.wikipedia.org/w/index.php?search=intitle%3A%22Appl.%22+AND+incategory%3A%22Redirects+from+ISO+4%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns10=1&searchToken=4i52hwptx4fianpdm9cwg2wyb
How do I make this work? Can it be done?
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#How_to_find_intitle_matches_for_redirects_in_a_category?
**Notes**
Some initial thoughts from search engineers on #wikimedia-discovery IRC:
> 11:13 AM <•ebernhardson> CKoerner_WMF: problem isn't clear, the two certainly work together. intitle matches redirects, and redirects are the "same" document as the one it redirects to so categories match too
> 11:15 AM unless the user can present a counter case, my assumption would be search is working and there are no results
> 11:17 AM CKoerner_WMF: so, i looked at there is exactly 1 page in the category linked. And that page does not have 'Appl.' in the title or redirects
> 11:17 AM hmm, maybe they are elsewhere though...
> 11:20 AM so https://en.wikipedia.org/wiki/Category:Redirects_from_ISO_4 has 14k pages, and incategory:Redirects_from_ISO_4 has 1 result. no clue why though
> 11:20 AM <•dcausse> I think they expect the redirect page to be in the category
> 11:21 AM <•ebernhardson> dcausse: it is, from the perspective of cirrus, isn't it?
> 11:21 AM dcausse: the redirects are just the document they point at
> 11:21 AM oh, wait are the redirects *themselves* tagged? hmm
> 11:21 AM <•dcausse> but the redirects can have wikitext
> 11:21 AM <•ebernhardson> hmm, if that's the case then yea
> 11:21 AM <•dcausse> but here it's not really a category, it's {{R from ISO 4}}
> 11:22 AM https://en.wikipedia.org/w/index.php?title=Abstr._Appl._Anal.&redirect=no does not mention any category
> 11:23 AM <•ebernhardson> dcausse: you're right, i looked through and afaik the categories are all on the redirect page, not the page it links to, so we don't index it
> 11:23 AM not really sure how that would fit in the model. hmm.
> 11:24 AM <•dcausse> https://en.wikipedia.org/wiki/Wikipedia:Categorizing_redirects#How_to_categorize_a_redirect
> 11:25 AM seems complex :/
> 11:27 AM the category is marked with a template but this template is lost when populating the main doc with the redirect data :(
> 11:36 AM <•CKoerner_WMF> Should I file a bug?
> 11:36 AM <•dcausse> we should index redirect content at some point but that seems tricky to do right. perhaps with a redirect_doc : [ {id:XYZ, text:, vesrsion:.}, {...} ] and a special noop handlers
    • Task
Idea from a user: add + for boosting a specific word in the search query. This feature is already provided by [[ https://duck.co/help/results/syntax | DuckDuckGo ]].
    • Task
Right now, the Wikidata completion search box only searches entity namespaces. We may want to be able to search titles in article namespaces too; e.g. when searching for `Wikidata:Project chat` we may want to display both an entity with the label "Wikidata:Project chat" and the [[ https://www.wikidata.org/wiki/Wikidata:Project_chat | actual page ]] having this title. The current proposal for implementing this is:
1. Refactor the GUI code so that instead of Wikidata fully replacing the completion search box, the standard box allows augmenting the result set data and display
2. Make the GUI code for the standard implementation call standard title search, and in addition have the Wikidata part of the code call wbsearchentities, as now
3. Display both sets of results (likely wbsearchentities first, title search second)
See more discussion of it in https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search#GUI
    • Task
Currently each keyword applies its logic through a broad doApply method, which populates a bunch of state in the SearchContext. We should introduce different interfaces that fit keywords' needs. So far we identified these needs:
- HighlightingKeyword: pass the HighlightBuilder so that keywords can change the highlighting behavior
- RescoringKeyword: pass a RescoreBuilder to affect the rescore window components
- FilteringKeyword: so that they can inject their filters
- ScoringKeyword (better name needed): so that they can affect the main query scores (e.g. intitle)
- InitialSearchContextKeyword: so that they can affect the InitialSearchContext (setting the namespaces or limiting search to local results)
- LegacyKeyword: will have the old doApply method so that we can still support them while we migrate to dedicated interfaces
Keywords will be able to implement multiple interfaces (see the sketch below):
- prefix will implement InitialSearchContextKeyword and FilteringKeyword
- intitle will implement FilteringKeyword and ScoringKeyword
- subpageof/insource/intitle will implement FilteringKeyword and HighlightingKeyword
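A rough sketch of what the split could look like; only the interface names come from the list above, while the method signatures and the Elastica query used are assumptions for illustration:
```lang=php
use Elastica\Query\AbstractQuery;

// Hypothetical signatures; only the interface names are from this task.
interface FilteringKeyword {
	/** Return a filter to inject into the query, or null for none. */
	public function getFilterQuery( string $value ): ?AbstractQuery;
}

interface ScoringKeyword {
	/** Return a clause contributing to the main query score. */
	public function getScoringQuery( string $value ): ?AbstractQuery;
}

// A keyword can implement several of the dedicated interfaces.
class IntitleKeyword implements FilteringKeyword, ScoringKeyword {
	public function getFilterQuery( string $value ): ?AbstractQuery {
		return new \Elastica\Query\MatchQuery( 'title', $value );
	}

	public function getScoringQuery( string $value ): ?AbstractQuery {
		return $this->getFilterQuery( $value );
	}
}
```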
    • Task
    Search engines such as Cirrus should examine the content of all slots when updating the search index.
    • Task
Hi all! While the image is present in https://en.wikivoyage.org/wiki/Special:ListFiles (for example: `ValenciaBanner2.jpg`), it cannot be found in media search results by partial name: {F14028962} {F14028964} {F14028975} {F14028968} Some debug info on a third-party install can be found here: https://phabricator.wikimedia.org/T186997
    • Task
**Problems**
* As a user, I'd like to discover more files or content made / created by a specific user.
* As a user, I'd like to find specific content without paging through Special:ListFiles.
**Background**
Currently there is no way to restrict search results to those uploaded or created by a specific user. Paging through Special:ListFiles is not an activity any sane person would do for users with massive uploads, e.g. Special:ListFiles&dir=prev&user=Ruthven. Attempting to view massive numbers of new pages by a specific user (Special:Contribs) will also result in a timeout on a big enough wiki, especially if the namespace parameter is used. Also, for regular pages, this provides a sensible and easy interface to see and count all (existing) pages created by a user, as this would naturally include the matches.
Other use cases:
* Looking into discussions (talk pages) participated in
* Looking into pages someone created with a specific keyword
* Readers looking into interesting pages or media initially created by a specific contributor
* Anti-vandalism: looking into pages created / edited by a specific user and containing a specific term
**Proposed solution**
* Add a new search keyword "author:", e.g. "author:User1"; AND
* Add a new search keyword "contributor:" to list all pages edited by a particular user;
* Possibly make it possible to include more than one author, e.g. "author:User1|user2|..." or alternatively "author:User1 author:User2"
Note: A file page may be created before a file upload (by another user). So there may be a need to distinguish between an uploader and a file page creator.
Task filed on behalf of a friendly IP at: https://www.mediawiki.org/wiki/Topic:U7lof232wnk4pgbn
    • Task
In VE, the autocomplete (search) link selection tolerates a certain amount of typos. That toleration works for English and other more or less [[https://en.wikipedia.org/wiki/Morphological_typology#Analytic_languages|analytic languages]]. However, [[https://en.wikipedia.org/wiki/Morphological_typology#Synthetic_languages|synthetic languages]] have endings that aren't ignored by the autocomplete. That causes users to fix little regular things again and again, whenever they want to add a link. Please have a look at an example, [[https://en.wikipedia.org/wiki/File:D%C3%A9clinaisons_noms_pluriel_polonais.jpg|Polish plural endings]]. In Polish, when I add a link to an article entitled "Sąd Najwyższ**y** Stanów Zjednoczonych" (US Supreme Court) in a sentence where the genitive case is correct ("Sąd**u** Najwyższ**ego** Stanów Zjednoczonych"), I don't have to change the first word (-u is like a typo) but I have to do it with the second one (y -> ego). It gets messy when more words have endings. And I haven't even mentioned [[https://en.wikipedia.org/wiki/Agglutinative_language|Finnish and the like]] (e.g. very simple changes: Helsinki -> Helsingissä or talo -> taloissani). Each language has its specific endings. It's impossible to list all of those for all the languages in one place. Instead, you could allow wikis to set "tolerated typos" //on their local pages//, just like it has been done with Citoid.
    • Task
Currently morelike queries (used by mobile RelatedArticles) use the CirrusSearch syntax. This type of query is one of the most heavily used keywords in Cirrus. Having morelike as part of the search syntax is interesting, but it should cover only the few use cases of searchers using the input box. I think we should provide a dedicated API endpoint that does not rely on the CirrusSearch syntax, so that heavy consumers have a stable API that is not dependent on the search syntax. Work has been done to expose this API through RESTBase (T125983) but this still uses the syntax and is only used by the mobile apps; mobile web directly calls the action API with list=search (IIRC). The benefits for us would be:
- promote morelike to a first-class API citizen and have more control over it
- make the work on the search syntax parser easier (morelike is one of the few greedy keywords we'd like to refactor); with a dedicated API endpoint we would not have to worry about breaking an important API consumer (155M API requests/day)
- make future work on switching the underlying implementation easier.
    • Task
Currently we run our integration tests from a machine in the cloud (cindy). A script runs on every unmerged commit marked V+2. We have done it this way because there was no way to set up the proper env in jenkins to run the integration tests. We could investigate the feasibility of setting up a docker image with the appropriate env to run them:
- MW with CirrusSearch and its dependencies
- elasticsearch
- elasticsearch plugins
    • Task
Before plugging in a new query parser, we should refactor how the search query is parsed and transformed into an elasticsearch query. Currently there is no strong separation between the parsing logic and the query building logic. This should ease future development on integrating a new parser. Note that MediaSearch (through the work done in T252692) is starting to use some of the components built here. There might be consolidation to be done between MediaSearch (the WikibaseMediaInfo extension) and CirrusSearch when resuming work on this.
    • Task
Hi, I opened a discussion thread and was advised to ask on Phabricator: https://www.mediawiki.org/wiki/Topic:U21h6wzaatak6xzl I use CirrusSearch and Translate. My wiki is in French, translated into English. I need two things:
- In search suggestions, I want to have the translated page name (e.g. Pagename/Page display title/en) according to my chosen language, if it exists.
- In search results, I want to have the page display title instead of Pagename/en as a title.
My goal is to give my non-French speakers a way to know the title of the page in their language. After that, I would set this parameter to favor user-language pages:
$wgCirrusSearchLanguageWeight = [
 'user' => 10.0, // should favor pages in user language
 'wiki' => 1.2, // boost pages in wiki language
];
Thanks a lot for your help!
    • Task
    See T176197#3737728 and following comments. After discussions on Wikipedia and Wiktionary, the general consensus is that we should enable the hiragana to katakana (H2K) mapping for French, Russian, Italian, and Swedish because it has a small effect but seems useful.
    • Task
When index creation or reindexing fails or is interrupted, stale aliases may be left sitting around, which block further reindexing and generally waste space. We should have a maintenance script that allows cleaning them up. The procedure for removing the indexes is already documented at https://wikitech.wikimedia.org/wiki/Search#Removing_Duplicate_Indices but the script should:
* Handle both clusters via a `--cluster` option
* Allow single-wiki cleanup via `--wiki`
* Also allow all-wikis cleanup via a suitable option
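A minimal skeleton of such a script, assuming the standard MediaWiki Maintenance base class; the class name and option wiring are illustrative, and the actual cleanup logic would follow the documented procedure:
```lang=php
$IP = getenv( 'MW_INSTALL_PATH' ) ?: __DIR__ . '/../../..';
require_once "$IP/maintenance/Maintenance.php";

/**
 * Hypothetical skeleton for a stale-alias cleanup script.
 */
class CleanupStaleIndices extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Remove stale CirrusSearch indices left by failed reindexes' );
		$this->addOption( 'cluster', 'Cluster to clean up', false, true );
		$this->addOption( 'all-wikis', 'Clean up indices for every wiki' );
		// --wiki is already provided by the Maintenance base class.
	}

	public function execute() {
		// Find indices that exist but are not referenced by any live alias,
		// confirm they match the expected naming pattern, then delete them.
	}
}

$maintClass = CleanupStaleIndices::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```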
    • Task
**Summary:** Autocomplete in the search box does nothing if you type a character that uses a [[ https://en.wikipedia.org/wiki/Dead_key | dead key ]] (a key that puts an accent or other diacritic on the next letter you type): the Javascript "keypress" event listener doesn't get the message that anything has happened. This is true for Latin and Greek characters with dead keys. Using "onkeyup" may solve the problem. It isn't clear what other UI elements are affected besides the search box, if any. I'm not sure what projects this applies to, since it is a UI problem on an element used for search.
Earlier discussion from T75605:
@Xoristzatziki in T75605#3417714:
> A bug, which may be related, still exists for Greek terms (in ALL projects, even in en.wiktionary). Typing in the search box anything that ends in an accented letter does not provide any suggestions that include the last letter (even if they exist, e.g. καλά). Copying and pasting works. Also typing anything after (e.g. a space) works. It seems (to the user) like the search is not done by the really typed letters, and the code is "waiting" for something.
@TJones in T75605#3421129:
> ... the scope is much bigger than accented Greek characters. ... I'm pretty sure this is a Javascript issue.
>
> For my quick test, I'm on a Mac, using the American, French, and Greek keyboards, and I tested in Chrome, Safari, and Firefox. To my surprise, they all behave the same.
>
> If you use [[ https://en.wikipedia.org/wiki/Dead_key | dead keys ]] (keys that put an accent or other diacritic on the next letter you type), the Javascript "keypress" event listener doesn't get the message that anything has happened. I tested this with both Latin and Greek letters on the Mac keyboards.
>
> As I understand it, the Greek keyboard uses a dead key to add ´ and ¨ to vowels. Similarly, the Mac American keyboard has dead keys for several diacritics (I use ´ ¨ ˆ ` ˜ regularly). If I type //resumé// to search on English Wiktionary, I also don't get any more suggestions for the final é. (BTW, it happens for non-final letters, too, if you pause, but it's easy to miss if you keep typing.)
>
> On the French keyboard, é has its own key, and when I type //resumé// using that keyboard, it behaves as expected.
>
> On the //Mac// Greek keyboard (so this probably does not apply to Windows or Linux), I can type ά by typing option+shift+α. If I type //καλά// this way it gets suggestions as expected. Similarly, you can use option+shift+<x> to type other accented vowels: 1/έ 2/ί 3/ή 4/ό 0/ύ ./ώ —I didn't see any precomposed versions with diaeresis (i.e., ϊ or ϋ). These non-dead-key versions generate new suggestions.
>
> So, the problem isn't accented characters per se, but rather characters that have to be typed with dead keys, at least on a Mac keyboard. I'm not familiar with the UI code that's handling all this, so I have no idea how easy it would be to fix, but searching online shows a lot of people complaining about this, but no obvious solutions.
@Xoristzatziki in T75605#3651879:
> The problem is in the code for sure. The accent in dead keys on the Greek keyboard is typed first, so the last key pressed is a non-dead key. onkeyup works in my tests. (See link for attachments.)
    • Task
I noticed in the saneitizer graph that right as we took down the eqiad cluster, the 'fixed' rate jumped from 0 or 1 up to 1800 for one minute. This implies the error handling somewhere in there is wrong and treats an error during shutdown as something else. A failure of this type should only be able to trigger extra update jobs, so it doesn't really hurt much, but it is still perhaps worth double-checking.
    • Task
Sometimes related articles show inappropriate article recommendations (see e.g. T165223). It is possible to override these using //{{#related:Preferred title}}//. However, that is a very blunt tool when all you want to do is remove an article. If I (to take an example from Swedish Wikipedia) want to remove //Babysitter// from articles related to //AK-101//, this doesn't mean I necessarily want to manually define what readers should see; I just want to solve the problem where a clearly unsuitable article is shown. {{unrelated#Article}} could solve this.
    • Task
These don't currently work, and it might be nice to be able to look at how they actually work in practice. It would also mean we could add some generic parsing tests for what actually gets generated by these.
    • Task
A search for `intitle:/Wikipediaa?a?/` on Commons and enwiki returns nothing, even though this should match the string "Wikipedia" in titles. This is probably a bug in CirrusSearch rather than Apache Lucene, since Lucene regular expressions otherwise function properly. Possibly this is due to the search ignoring punctuation, which should be turned off automatically for regexes.
    • Task
When searching for a page title with a colon in it, it is very common to mistype a semicolon instead. When the target is a page in the main namespace, the search engine usually finds the intended page (e.g. searching for "Star Trek; Voyager" (with or without quotes) finds Star Trek: Voyager as the first result). However, when the intended page is a shortcut (e.g. MOS:HYPHEN) or a page in another namespace (e.g. Template:Unreferenced) and you use a semicolon instead of a colon, the search results don't feature it. In all cases the drop-down search suggestions do include the non-main-namespace pages, but these are not available in all environments and not used in every place they are available. The ideal solution, I think, would be to place a link as a "Did you mean" entry. For example, if you search for Template;Unreferenced the search engine results would be prefaced by: Did you mean: Template:Unreferenced. I'm undecided whether this should apply only to exact matches: should searching for Template;Unreference (which doesn't exist) do as it currently does, or ask "Did you mean: Template:Unreferenced" as is currently the case if you search for Template:Unreferece? In all cases, if a page name with the semicolon exists, this logic should not interfere with taking you to that page (e.g. searching for Help; a Day in the Life should take you to that redirect, not search for a page in the Help namespace).
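A minimal sketch of the pre-check this suggests, assuming it would run before the normal "did you mean" handling; the function itself is hypothetical:
```lang=php
/**
 * Hypothetical check: if a search term contains a semicolon where a colon
 * would name an existing page, return that page as a "Did you mean" target.
 */
function semicolonDidYouMean( string $term ): ?Title {
	if ( strpos( $term, ';' ) === false ) {
		return null;
	}
	// An exact page with the semicolon wins; don't second-guess it.
	$asTyped = Title::newFromText( $term );
	if ( $asTyped && $asTyped->exists() ) {
		return null;
	}
	$candidate = Title::newFromText( str_replace( ';', ':', $term ) );
	return ( $candidate && $candidate->exists() ) ? $candidate : null;
}
```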
    • Task
    Tracking task to collect various ideas related to the completion suggester quality.
    • Task
The [[ https://en.wikipedia.org/wiki/Help:Category#Searching_for_articles_in_categories | Help:Category ]] article says:
> An "OR" can be added to join the contents of one category with the contents of another. For example, enter
>
> incategory:"Suspension bridges" OR incategory:"Bridges in New York City"
>
> to return all pages that belong to either (or both) of the categories, as [[ https://en.wikipedia.org/w/index.php?title=Special:Search&search=incategory%3A%22Suspension+bridges%22+OR+incategory%3A%22Bridges+in+New+York+City%22&ns0=1&fulltext=Search | here ]].
But this search returns no results even though each category by itself returns many results. I've tried this with other categories and it appears that it does an AND (intersection) regardless of the actual operator used.
    • Task
**Expected behavior:** I search for a string, and get shown message "searchmenu-new", offering me to create an article by that name.
**Actual behavior:** the message doesn't appear if the search string includes a quote mark.
**Example:** I try to create a new article named Don "King" Kong, and I do so by searching for that string: https://en.wikipedia.org/w/index.php?search=Don+"King"+Kong ...I do not get the option to create an article of that name. (Please note that a single quote mark anywhere in the string is enough to trigger this behavior.)
This behavior seems to be triggered by the flag $searchContainedSyntax, which I got lost tracking down in the source code. As far as I can tell, this is due to CirrusSearch; running a mediawiki-vagrant machine with vanilla search, everything works as expected. I assume this is because vanilla search has no special search operators. I'll totally understand if this is a WONTFIX...
    • Task
I noticed while poking around today that many interesting categories for an article are not always applied to the article page, but to the talk page. For example: https://en.wikipedia.org/wiki/Talk:Agaricus_deserticola This has the categories:
* Wikipedia featured articles
* Featured articles that have appeared on the main page
* Featured articles that have appeared on the main page once
* FA-Class Fungi articles
* Mid-importance Fungi articles
* WikiProject Fungi articles
If we were doing one-hot encoding of "important" categories, it seems it would be useful to have things like featured-article status as one of those values.
    • Task
A friendly neighborhood IP (I mean that sincerely) left a [[ https://www.mediawiki.org/wiki/Topic:Tnqi9y8vw5vd5ijc | suggestion ]] on the CirrusSearch extension talk page regarding the ability to search for specific external links in queries.
> **Problem**
>
> As a reader, I want to find articles that contain a specific link (e.g. a news story, a hoax or an untrustworthy site) to verify its validity.
>
> As an editor, I want to find articles that mention a specific link and some keyword to eliminate spam or certain vandalism or hoaxes.
>
> **Background**
>
> Currently, CirrusSearch allows searching for internal links, yet it doesn't make it possible to do this for external links. This means that one has to use a page such as Special:LinkSearch or complicated regex with "insource" that may not always find the link, because links can be constructed by templates in hard-to-find ways, e.g. "{{{mainsite}}}.com/{{stringsub}}".
>
> **Proposed solution**
>
> A new search "keyword" or predicate that indexes external links, e.g.:
>
> ```
> banana cures aids extlinksto:/*.hoaxysite.com/
> -extlinksto:/*.hoaxysite.com/
> ```
Before I created this task I spoke to one of the Discovery engineers about this suggestion. Their thoughts: "overall i don't think it's crazy hard and most of the work will just be figuring out what the right analysis chain is for it and perhaps creating a second field for external_link_domains that ignores the rest for searching"
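On the engineer's second-field idea: a sketch of how an external_link_domains field could be populated at index time; the field and function names are hypothetical:
```lang=php
/**
 * Hypothetical index-time step: reduce each external link to its host, so a
 * separate external_link_domains field can be matched without the rest of
 * the URL getting in the way.
 */
function externalLinkDomains( array $externalLinks ): array {
	$domains = [];
	foreach ( $externalLinks as $url ) {
		$host = parse_url( $url, PHP_URL_HOST );
		if ( $host !== null && $host !== false ) {
			$domains[] = strtolower( $host );
		}
	}
	return array_values( array_unique( $domains ) );
}
```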
    • Task
This maintenance script can be extremely slow to run. After T147957 we decided to run the query on vslow. It was maybe not necessary to use vslow in every case. @EBernhardson suggested using vslow only when we have to filter on namespace.
    • Task
It used to be, I could search something like '[[ https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&search=do+not+bite&fulltext=1&searchToken=8fbhok03n8k566h8hgwnz8dqr | do not bite ]]', then [[ https://en.wikipedia.org/w/index.php?title=Special:Search&profile=all&search=do+not+bite&fulltext=1&searchToken=em050jsoh8q4sb0ocph3bwdl4 | hit the "Everything" tab ]], and find "Wikipedia:Please do not bite the newcomers". Over the last few months, however, such Wikipedia-namespace results don't even reach the first page of search results. It seems that content is boosted so aggressively that I can no longer find 'project' content unless I go to "Advanced" and [[ https://en.wikipedia.org/w/index.php?search=do+not+bite&title=Special:Search&profile=advanced&fulltext=1&ns2=1&ns3=1&ns4=1&ns5=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns108=1&ns109=1&ns118=1&ns119=1&ns446=1&ns447=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1&searchToken=35o6l26o7cgtw0wzhtkgqtcj5 | specifically exclude content namespaces ]]. The alternative is to put Wikipedia: or WP: in the search field, but that's just less than ideal. We desperately need a way to search 'Editor' content easily and efficiently. That could be a change to the algorithms, or a change to the UI to exclude content namespaces in one of the presets. But this focus on results for 'readers' has become a nuisance for my personal search experience atm.
    • Task
Search (on Commons): [[ https://commons.wikimedia.org/w/index.php?search=Category%3ASan+Ignacio+Belize&title=Special:Search&go=Go&uselang=en&searchToken=f446f1e4i88az6newrs3qsspa | Category:San Ignacio Belize]]
Actual Results:
* Category:Hawkesworth Bridge
* Category:Maya Flats Airstrip
* Category:Cahal Pech
* Category:San Ignacio, Belize
* Category:Aerial photographs of San Ignacio
Expected Results:
* Category:San Ignacio, Belize
* Other stuff
    • Task
Proposed in #Community-Wishlist-Survey-2016. Received 31 support votes, and ranked #51 out of 265 proposals. [[ https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Search#Built-in_CatScan| View full proposal with discussion and votes here.]]
=== Problem ===
Sometimes one wants a more advanced search that searches specific content in specific categories, articles with specific templates, etc. Sometimes this is useful for editors, sometimes for readers, and sometimes for developers (through the API).
=== Who would benefit ===
Everyone (readers, editors, developers) who needs better search capabilities
=== Proposed solution ===
Integrate [[ https://tools.wmflabs.org/catscan3/catscan2.php | CatScan ]] (or write a similar tool) into MediaWiki itself
=== Technical details ===
--
=== Time, expertise and skills required ===
-- e.g. 2-3 weeks, advanced contributor, javascript, css, etc
=== Suitable for ===
-- e.g. Hackathon, GSOC, Outreachy, etc.
=== Proposer ===
[[ https://meta.wikimedia.org/wiki/User:Ynhockey| Ynhockey ]]
=== Related links ===
-
    • Task
A common approach on wikis, especially outside the content namespaces, is for long-running communication pages to have many archive pages under names such as {name}/Archive42. These pages all have statistically very similar content, but the "main" page should generally be ranked higher than all the archives. Look into ranking subpages lower than their parent page.
    • Task
It seems to be the case that the `prefix:` operator in lsearchd used to support multiple parameters via pipes (so `prefix:foo|bar`), which were treated as an OR operation to look in multiple possible prefixes. I can't think of any reason we can't do something similar. The use case is search templates that want to look in a couple of places at once.
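A sketch of the OR semantics with Elastica; the `title.prefix` field name and the surrounding function are assumptions for illustration:
```lang=php
use Elastica\Query\BoolQuery;
use Elastica\Query\Prefix;

/**
 * Hypothetical filter for prefix:foo|bar — any of the pipe-separated
 * prefixes may match.
 */
function buildPrefixFilter( string $value ): BoolQuery {
	$bool = new BoolQuery();
	foreach ( explode( '|', $value ) as $prefix ) {
		// Field name "title.prefix" is illustrative only.
		$bool->addShould( new Prefix( [ 'title.prefix' => strtolower( $prefix ) ] ) );
	}
	$bool->setMinimumShouldMatch( 1 );
	return $bool;
}
```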
    • Task
Based on a [[ https://www.mediawiki.org/wiki/Topic:Tatx4q0bzjzal301 | suggestion ]], we'd like to investigate whether there is a way we can search a pre-set collection of projects and/or namespaces with a query string; something that would be intuitive for users (logged in or not) and easy to use. This might be done using a new button on the Special:Search page or some other method. For instance, a logged-in user might want to search for 'Kennedy' in the following projects and namespaces to find documentation or maybe an older discussion or presentation:
- English Wikipedia
- Metawiki
- Mediawikiwiki
- Outreachwiki
- WikimaniaXXXX
For additional special search bonus points: could we do this type of search using a pre-set list of languages as well? Maybe by re-using the Compact Language Links or recently used languages, but it might be preferable to use an editable configuration of these languages that the user can update/change at any time.
    • Task
> Background: The search results include words, size, and upload date. License for content pages isn't really needed because most pages have the same license; however, media licenses are important for re-users.
[[ https://www.mediawiki.org/wiki/Topic:Tkc2sravqykrlz6d | Issue ]] raised by a user (and paraphrased):
- As a reader, I can't easily find out what license a file has when viewing search results
- As a media re-user, licenses are not displayed in search results when querying for reusable files
- As an editor, it's difficult to find files without a license for updating, deleting or adding in appropriate licenses
**Proposed solution:** In each search result, add the license (maybe as an icon) fetched from the commons metadata API.
> Note: this is not the same as the previous metadata-related suggestion. This suggestion merely proposes the addition of a license label next to each file result.
    • Task
Under MW 1.28 with the matching 1.28 CirrusSearch & Elastica extensions (REL1_28-0959e38):
```
php /var/www/mediawiki/maintenance/runJobs.php -q
PHP Notice: unserialize(): Error at offset 18862 of 65535 bytes in /var/www/mediawiki/includes/jobqueue/JobQueueDB.php on line 802
[dde4ea3abab8bf5fa6ab94ff] [no req] Error from line 40 of /var/www/mediawiki/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Unsupported operand types
Backtrace:
#0 /var/www/mediawiki/includes/jobqueue/Job.php(74): CirrusSearch\Job\ElasticaWrite->__construct(Title, boolean)
#1 /var/www/mediawiki/includes/jobqueue/JobQueueDB.php(292): Job::factory(string, Title, boolean, string)
#2 /var/www/mediawiki/includes/jobqueue/JobQueue.php(372): JobQueueDB->doPop()
#3 /var/www/mediawiki/includes/jobqueue/JobQueueGroup.php(240): JobQueue->pop()
#4 /var/www/mediawiki/includes/jobqueue/JobRunner.php(157): JobQueueGroup->pop(integer, integer, array)
#5 /var/www/mediawiki/maintenance/runJobs.php(86): JobRunner->run(array)
#6 /var/www/mediawiki/maintenance/doMaintenance.php(111): RunJobs->execute()
#7 /var/www/mediawiki/maintenance/runJobs.php(119): require_once(string)
#8 {main}
```
    • Task
[[ https://www.mediawiki.org/wiki/Topic:Tka4by2qazb777rs | A recent topic appeared ]] on Extension talk:CirrusSearch. The suggestion is to expose file metadata already available via the API to the search engine. Suggested fields:
* Video or audio of a certain "playtime"
* Framecount
* Looped images
* Duration
* Frame rate
* Creation date
The example use case given: "The usecases are numerous, for instance for writing an article about world war one, one may want to filter images from that period. When looking for videos to add to a page one may want short animations to showcase the concept, e.g. a moving hurricane, and not be interested in very long videos. The same applies to animated images because in some cases they illustrate the concept better than others, and in some cases they don't, so it might be good to filter those either way."
Also mentioned were fields from the Commons Metadata API:
* GPSLatitude - latitude
* GPSLongitude - longitude
* LicenseShortName - short human-readable license name
* LicenseUrl
* DateTimeOriginal
Related: {T150809}
    • Task
A search box for books and sources on Wikisource, to enable readers to jump to the parts they want using key words or phrases.
    • Task
We often have questions about how data breaks down in CirrusSearch, for purposes of evaluating changes or making estimates of why something is the way it is. Druid should be great for these kinds of breakdowns, but we need to decide what to load. This ticket is to collect the various questions we want to answer:
* p95/p99 per-wiki for various query types (completion suggester, full text, regex, etc.). A dimension on query length might also be useful.
    • Task
    We've been working on plans to deal with "so many search options", i.e., the large number of current and potential second-try searches: "did you mean" suggestions and rewrites, interwiki search, language ID + cross-wiki search, quote stripping, wrong keyboard detection, etc. I've got a wiki page going with background and current best plan: [[ https://www.mediawiki.org/wiki/Wikimedia_Discovery/So_Many_Search_Options | So Many Search Options ]]. I've been updating it after meetings and working through details. I've been trying not to let it take up too much time, but after a couple of half to full days spent on it, I'm creating the phab ticket I should have created a while back. While full implementation doesn't need to happen now (just before we deploy any more second-try searches), getting the design done is important, especially as it [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/So_Many_Search_Options#API_Proposal | relates to the API ]]. We want to get that hammered out and share with the Mobile team as soon as possible so that we don't end up with a widely-used API we want to restructure. Also called: second try search, second try searching, second chance search, second chance searching. (Added for easier finding later.)
    • Task
Steps to reproduce:
# Open any template (tested on cswiki)
# Open the "What links here" page
# Filter results to transclusions only
# Open the search page and search `hastemplate:xyz` (where xyz is your chosen template) in all namespaces
# Open the search page and search `insource:/\{\{xyz/i` (where xyz is your chosen template) in all namespaces
Expected results: All three listings should be the same (only insource could contain commented or nowiki'd occurrences)
Current results: Many templates get a much smaller list for hastemplate than for the other two possibilities. It behaves badly even for exclusions (by `-hastemplate:xyz`). This is OK for usual search, but e.g. if you exclude them when including templates using a pywikibot (by `-search:"-hastemplate:xyz"`) it is not good; see [[ https://cs.wikipedia.org/w/index.php?title=Bill_Bailey&diff=prev&oldid=14565975 | this edit ]] for instance
    • Task
Feature request: the search completion suggester should automatically search cross-wiki if given an interwiki prefix to start. E.g. at English Wikipedia, typing "wikt:foo" into the search bar would trigger search suggestions that come from English Wiktionary. (Task based on a proposal in https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Archive#Cross-wiki_search_suggestions that is unfeasible in the short term. Full copy:)
* Problem: It is possible to navigate, for example, from English Wikipedia to the article about Argentina in Spanish Wikipedia by entering "es:Argentina" in the search box, or to the English Wiktionary entry about whatever by entering "wikt:whatever". You won't, however, be given search suggestions like when you are about to navigate to another page in the same wiki.
* Who would benefit: Users who are interested in several wikis.
* Proposed solution: Our search engine should learn that, if the text entered into the search box begins with an interwiki prefix as described here, search suggestions should be made from the wiki associated with the given interwiki prefix.
----
* See also: {T139504}
    • Task
Users might search for **8** or **eight** in a title; I want these two words to be treated as synonyms of each other in search results. A similar problem exists when Japanese users search for 8: they might type **8** or **八つ**. I tried to enable synonym_filter in config/elasticsearch.yml and then created a synonym.txt with the content:
8, eight
Searching through the command line with "POST http://localhost:9200/moegirl_content/_search", I got both **8** and **eight** in the returned results. However, on the website neither the MW search suggestions nor the MW search results match both. I hope Wikipedia can add these synonyms to search results and search suggestions too. Some articles have very similar variants of their title, and users have to type exactly the right word to find the article. Adding some necessary synonyms could benefit many people.
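For reference, a sketch of the index settings involved, written as the PHP settings array one would send to elasticsearch when creating the index; the filter and analyzer names are illustrative:
```lang=php
// Hypothetical analysis settings: a synonym filter equivalent to the
// "8, eight" line in synonym.txt, applied in a custom analyzer.
$settings = [
	'analysis' => [
		'filter' => [
			'my_synonyms' => [
				'type' => 'synonym',
				'synonyms' => [ '8, eight' ],
			],
		],
		'analyzer' => [
			'text_with_synonyms' => [
				'type' => 'custom',
				'tokenizer' => 'standard',
				'filter' => [ 'lowercase', 'my_synonyms' ],
			],
		],
	],
];
// Both the full-text field and the completion/suggest field would need the
// filter for suggestions and results to agree.
```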
    • Task
Hi. Try in your draft [[Special:Search/New York filesize:<100]]; the link is not created. [[Special:Search/New York filesize:100]] is.
    • Task
File type search has been warmly welcomed by the communities. Perhaps there's more metadata we could expose to CirrusSearch to improve multimedia search. EXIF data has additional potential fields; the most obvious are camera model/type, focal length, aperture, ISO, and orientation. http://www.sno.phy.queensu.ca/~phil/exiftool/TagNames/EXIF.html
    • Task
In some of our mocks for the new interwiki search, we have also included some multimedia results at the bottom. These vary a bit from the standard interwiki search we are implementing, namely:
* The query has to be re-written from a content search to a search against the file namespace
* Rather than showing a single result with highlighting, the mocks show a few images (3, 6?) with no titles or highlights
If we want to go this route we will need to implement support for this in CirrusSearch's interwiki handling.
    • Task
Background info: [[ https://en.wikipedia.org/wiki/Furigana | Furigana ]]; [[ https://en.wikipedia.org/wiki/Ruby_character#HTML_markup | Ruby character#HTML markup ]]. Words annotated with HTML ruby are not searchable in a user-friendly way when HTML formatting is stripped, as the reading becomes inline by default. For example, 東京 annotated with the reading "To kyo" becomes inline `東(To)京(kyo)`, so neither searching for `東京` nor `Tokyo` would work. ( [[ https://ja.wiktionary.org/w/index.php?search=如(ごと)し | example of searching for HTML-formatted ruby as proof that this is a problem ]], [[ https://en.wiktionary.org/w/index.php?search=%E5%A6%82(%E3%81%94%E3%81%A8)%E3%81%97 | another example ]] )
    • Task
This needs much more UX thought, but it would be useful to have check boxes or equivalent for people to say why they are skipping a query. Some obvious possible reasons include:
- The query doesn't make any sense
- The query is too vague or ambiguous
- I can't determine the original user's intent
- Rating this query would be too much work
- The query is non-encyclopedic
Maybe an "other" field with a text box, maybe not. And the reason text could use some wordsmithing. Making these checkboxes (we'd also discussed radio buttons) might make more sense because then the reasons can overlap some, and more than one reason may apply.
    • Task
Based on the research documented in this [[ https://phabricator.wikimedia.org/T136377#2739250 | comment ]], T149143 and the research documented [[ https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Quotes_and_Questions#Quotes_Corpus | here ]] that was done on Relevance Forge, we'd like to go forward with replacing double quotes with a space in search queries. This replacement will help with queries like
```
albert"einstein" house
```
or
```
"albert einstein"house
```
which are currently treated as three separate words. Using spaces instead of stripping out the double quotes will keep these types of queries as three words. We don't feel that there is any downside to having extra spaces in the query. We'll need to check the edge cases for languages that use spaces and don't use spaces in their words. Likely prerequisite: {T156019}
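The change itself is tiny; a sketch contrasting the intended replacement with the stripping it avoids (hypothetical, for illustration):
```lang=php
$query = 'albert"einstein" house';

// Stripping quotes merges adjacent words:
//   "alberteinstein house"   -> 2 tokens
$stripped = str_replace( '"', '', $query );

// Replacing quotes with spaces keeps the words separate:
//   "albert einstein  house" -> 3 tokens
$replaced = str_replace( '"', ' ', $query );
```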
    • Task
Similar to @mpopov's recent work comparing our zero-result rate to other search engines, we should look over the Discernatron data to see if there are particular query features that we should be taking into account to return better results. Some ideas:
* Are the highly ranked results in our top 20, 100, or 1000? Are we just giving them too low a score?
* If the highly ranked results are not in the top 100 or 1000, why not? Synonyms? Spell correction? Relaxing query constraints?
* Probably others; these are just a couple off the top of my head.
    • Task
The CirrusSearch documentation on mw.org (https://www.mediawiki.org/wiki/Extension:CirrusSearch) is out of date. Cirrus is now tightly coupled with core, and version dependencies are not obvious for third parties. We should maybe start to version CirrusSearch properly or follow core versions, e.g. CirrusSearch 1.28 requires core 1.28. Concerning features, many new features/configs have been added to Cirrus but are not mentioned in the documentation.
    • Task
The following fields are added by CirrusSearch, for all content models:
* namespace
* namespace_text
* redirect
* source_text
* suggest
* timestamp
* title
* text
* text_bytes
Handling for some of these in CirrusSearch is quite complex, but we still think they should be exposed in some (simple) way in core (see the sketch below).
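Core already has a hook point for this kind of exposure; a sketch of how a content model could surface one of these fields through ContentHandler::getFieldsForSearchIndex (the override shown is illustrative, not a finished design):
```lang=php
// Hypothetical content handler exposing a simple search index field.
class MyContentHandler extends TextContentHandler {
	public function getFieldsForSearchIndex( SearchEngine $engine ) {
		$fields = parent::getFieldsForSearchIndex( $engine );
		// Expose an integer field; the engine maps it to its backend.
		$fields['text_bytes'] = $engine->makeSearchFieldMapping(
			'text_bytes',
			SearchIndexField::INDEX_TYPE_INTEGER
		);
		return $fields;
	}
}
```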
    • Task
I suggest closing T134430 as too much effort and instead pursuing an easier approach. Based on my analysis of several wikis, and Erik's suggestion that we find one set of languages to use for language detection for the long tail, I suggest that for the next M wikis (beyond the "top N" in T121541)—down to some limit based on size and/or query volume—we use a standard set of language models, plus a Wiki-Text model (T121545) based on that wiki. The Wiki-Text models are not as accurate on query strings as query-based models, but they are easy to generate mostly automatically. We could deploy a standard list of languages based in part on likelihood of being encountered (English seems to be everywhere) and uniqueness of character set (Greek is generally Greek). My current suggested list would be Arabic, Armenian, Chinese, English, Greek, Hebrew, Japanese, Korean, Russian, and Thai, plus a wiki-text model for the language of the wiki. I'd suggest a staged roll out to see what kind of feedback we get. If we get reports of mis-identifying languages, we could add or remove models as necessary. If we don't get any feedback, then either the results are acceptable, or no one is using it.
Additional features required/desired:
- figure out a way to mix query-based and wiki-text-based models (simple solution: copy wiki-text models to the query-based model directory and note which is which in the docs; more complex solution: allow TextCat to take more complex specifications across model directories) [required]
- a generic feedback mechanism to allow users to easily rate language detection / results and flag instances where things go wrong (need to think about UI and logging—translate to all the required languages, or go for generic icons (e.g., smiley face, neutral face, frownie face); and can we log queries that get poor marks from users to investigate later?) [highly desired]
    • Task
    We want to spread the usefulness of Language Identification (via TextCat) to non-Wikipedia wikis. Rather than do a time-consuming manual analysis for each wiki project, we could do an A/B test on some/all projects in the same language using the default configs for the Wikipedia project in that language (for which analysis is done). Such A/B tests would give us insight into whether the TextCat configs can be straightforwardly shared across projects in the same language. If so, it would help us be able to apply language detection to more of the long tail of wiki projects.
    • Task
If I insert `insource:/gender/i` in the top-right search bar on any translatewiki page (except the Main Page), e.g. **[[ https://translatewiki.net/wiki/Project:News ]]**, and hit ENTER, I get this error message:
```
[3b8b3ab4f3f47544ea7858c9] /w/i.php?title=Special%3ASearch&search=insource%3A%2Fgender%2Fi&go=Vai RuntimeException from line 402 of /srv/mediawiki/tags/2016-07-09_13:33:21/extensions/CirrusSearch/includes/Search/ResultsType.php: regex is only supported with $wgCirrusSearchUseExperimentalHighlighter = true

Backtrace:

#0 /srv/mediawiki/tags/2016-07-09_13:33:21/extensions/CirrusSearch/includes/Searcher.php(1160): CirrusSearch\Search\FullTextResultsType->getHighlightingConfiguration(array)
#1 /srv/mediawiki/tags/2016-07-09_13:33:21/extensions/CirrusSearch/includes/Searcher.php(795): CirrusSearch\Searcher->search(string, string)
#2 /srv/mediawiki/tags/2016-07-09_13:33:21/extensions/CirrusSearch/includes/CirrusSearch.php(403): CirrusSearch\Searcher->searchText(string, boolean)
#3 /srv/mediawiki/tags/2016-07-09_13:33:21/extensions/CirrusSearch/includes/CirrusSearch.php(147): CirrusSearch->searchTextReal(string, NULL)
#4 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/specials/SpecialSearch.php(288): CirrusSearch->searchText(string)
#5 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/specials/SpecialSearch.php(232): SpecialSearch->showResults(string)
#6 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/specials/SpecialSearch.php(142): SpecialSearch->goResult(string)
#7 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/specialpage/SpecialPage.php(505): SpecialSearch->execute(NULL)
#8 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/specialpage/SpecialPageFactory.php(598): SpecialPage->run(NULL)
#9 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/MediaWiki.php(282): SpecialPageFactory::executePath(Title, RequestContext)
#10 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/MediaWiki.php(748): MediaWiki->performRequest()
#11 /srv/mediawiki/tags/2016-07-09_13:33:21/includes/MediaWiki.php(520): MediaWiki->main()
#12 /srv/mediawiki/tags/2016-07-09_13:33:21/index.php(43): MediaWiki->run()
#13 {main}
```
{F4270373}
----
**URL**: https://translatewiki.net/w/i.php?title=Special%3ASearch&search=insource%3A%2Fgender%2Fi&go=Vai
    • Task
> There appear to be a number of places (like categories) where redirects are in italics and normal articles are in non-italics. Could that concept be added to the drop-down list on the search box? (So that, for example, if Phi Gamma Delta is typed into the search box, Phi Gamma Delta, Phi Gamma Delta House, and Phi Gamma Delta Fraternity House (University of Minnesota) show up as normal, but Phi Gamma Delta Chapters occurs in italics since it is a redirect.)
From: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(idea_lab)#Search_Drop_down_list_-_redirects_in_italics.3F
Rough mockup: {F4250316} (note the redirect is also highlighted with a hover state)
Kinda related (in terms of user experience/consistency): T52240
    • Task
In search, most Internet users use the bare Latin alphabet (without the letters č, ć, š, ž and đ). This is similar to how in German a search for "Muenchen" will return the results for "München". Thus, Serbian Wikipedia should support searching in this way, but it doesn't. Example:
# Search for "marković": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovi%C4%87&searchToken=cibdktt9t7eu2hv4o3n1hgg84
#* Observed: 207 search results.
# Search for "markovic": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovic&searchToken=gf3dawrz4tio3a91fujm144m
#* Expected: all 207 previous search results should appear.
#* Observed: only 47 results appear.
An overview of the issue is given at https://wiki.apache.org/solr/SerbianLanguageSupport
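One standard elasticsearch approach is an analyzer with the built-in asciifolding filter, which folds č/ć to c, š to s, ž to z, and đ to d at index and query time. A sketch of the settings as a PHP array; the analyzer name is illustrative:
```lang=php
// Hypothetical analyzer: fold diacritics so "markovic" matches "marković".
$settings = [
	'analysis' => [
		'analyzer' => [
			'serbian_folded' => [
				'type' => 'custom',
				'tokenizer' => 'standard',
				// asciifolding maps accented letters to bare Latin forms.
				'filter' => [ 'lowercase', 'asciifolding' ],
			],
		],
	],
];
```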
    • Task
When I am searching for a certain title and a page with an almost similar name exists, searching redirects me to the existing page. When I then want to create the page, I must edit the URL.
Steps to reproduce:
* go to cs.wikipedia.org
* type "Kris" into the search field
* You will be redirected to [[Kriš]]
But the page [[Kris (dýka)]] also exists; it should be found, and it should be in the place of [[Kris]].
    • Task
Elasticsearch bails out thinking this is a bad query; I don't understand why, though. The error from ES is:
```
parse_exception: Encountered " <OR> "OR "" at line 1, column 1.
```
Fails:
* "foo*" OR "bar"
Works:
* foo* OR "bar"
* "bar" OR "foo*"
    • Task
Sorry, I'm a bit pissed off by this poor search engine (I mean, the first web search engine ever was better; I have some more reasons). Example: [[ https://commons.wikimedia.org/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=Test+wiki+Admin+logo.svg&search-cat-all=&search-cat-none=&search-file-copyright=any&search-file-type=&search-orig-query=&ns6=1 | Test wiki Admin logo.svg]] (File only) gets no correct result, but this is the exact file name you get if you download it (with Firefox). Only one word is switched (the exact name is in the SVG title inside, so it is also displayed on the file description, but this should also not matter). The file is: [[ https://commons.wikimedia.org/wiki/File:Test_wiki_logo_Admin.svg | File:Test wiki logo Admin.svg ]]
    • Task
Given that I search for "nerdmenn" in the search box
When I type a character
Then I get a list of fuzzy-matched entries

Note there is no "nerdmenn", but there is "nordmenn".

Given that I open the URL https://no.wikipedia.org/w/index.php?search=nerdmenn
When I hit return
Then I get no entries in the result

It seems to me that this behaviour is inconsistent.
    • Task
Perhaps a minor thing, but Lucene regex doesn't appear to support \n natively. Instead the \n needs to be provided at the JSON level. Currently we pass the \n all the way through to Lucene and don't find anything. This should be an easy fix, but we might want to consider whether there are other special characters beyond \n (\r, \t? others? I don't know...) that we want to handle.
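A sketch of the kind of pre-processing that could fix this, assuming we intercept the user-supplied insource regex before it is embedded in the JSON query body; the helper name is hypothetical:
```
<?php
// Sketch: translate escape sequences that Lucene's regex engine does not
// understand natively into literal characters, so they survive the trip
// through the JSON request. Hypothetical helper, not the actual
// CirrusSearch code. A real implementation would also need to respect
// escaped backslashes (e.g. "\\n" meaning a literal backslash then "n").
function expandRegexEscapes( string $regex ): string {
	return strtr( $regex, [
		'\\n' => "\n",
		'\\r' => "\r",
		'\\t' => "\t",
	] );
}

// e.g. insource:/foo\nbar/ becomes a pattern containing a real newline
$pattern = expandRegexEscapes( 'foo\\nbar' );
```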
    • Task
From: https://www.mediawiki.org/w/index.php?title=Topic:Swl8xxi4tuz0s2zo
```
$ php updateSearchIndexConfig.php
content index...
	Fetching Elasticsearch version...1.7.5...ok
	Scanning available plugins...none
	Infering index identifier...mwiki_CP_1262-mwiki_CP__content_first
	Picking analyzer...german
	Creating index...
Unexpected Elasticsearch failure.
Elasticsearch failed in an unexpected way. This is always a bug in CirrusSearch.
Error type: Elastica\Exception\ResponseException
Message: InvalidIndexNameException[[mwiki_CP_1262-mwiki_CP__content_first] Invalid index name [mwiki_CP_1262-mwiki_CP__content_first], must be lowercase]
```
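The obvious fix is to lowercase the inferred index identifier before handing it to Elasticsearch, since Elasticsearch rejects uppercase characters in index names. A minimal sketch; the variable names are stand-ins for whatever the maintenance script actually uses:
```
<?php
// Sketch: normalize the inferred index identifier. Wiki IDs such as
// "mwiki_CP_1262-mwiki_CP_" can contain uppercase characters, which
// Elasticsearch rejects in index names. $wikiId and $indexType are
// hypothetical placeholders.
$indexIdentifier = strtolower( "{$wikiId}_{$indexType}_first" );
```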
    • Task
I noticed we generate HTML when CirrusSearch requests ParserOutput for Wikibase entities. From profiling locally, I believe this slows down indexing quite a bit. I don't think full HTML is needed for Wikibase content (except that it might be used for 'text_bytes', but maybe that's not so relevant for entities?). For entities, we build the search text and fields directly from the entity object.
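MediaWiki's `Content::getParserOutput()` already accepts a `$generateHtml` flag, so one plausible approach is for the indexing code path to pass `false` for content models that don't need rendered HTML. A sketch under that assumption; the content-model check and its placement are illustrative:
```
<?php
// Sketch: skip HTML generation when building the search document for
// Wikibase entities. Where exactly this check would live (wherever
// CirrusSearch builds ParserOutput) is an assumption.
$generateHtml = !in_array(
	$content->getModel(),
	[ 'wikibase-item', 'wikibase-property' ]
);
$parserOutput = $content->getParserOutput( $title, $revId, $parserOptions, $generateHtml );
```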
    • Task
Having only English stopwords leads to the following situation, where a fuzzy match is shown but a better match is available if you ignore some non-English stopwords: on English Wikipedia, if I type "Ruta Maya" in the search bar, the only suggestion I get is "Rita May (actress)". I'm impressed that it suggests Rita May, but confused that it doesn't suggest "La Ruta Maya" (which does get suggested at Special:Search). Specific cases can be fixed by adding a redirect, but it seems worth investigating a more generic solution by expanding the stopwords to other languages, especially on English Wikipedia, where there are many non-English titles.
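One generic direction would be to feed the suggester's analyzer a merged stopword list rather than only `_english_`. A hedged sketch of what that could look like as an analysis config array; the filter name and the choice of lists are assumptions, and whether merging lists is the right trade-off is exactly the open question:
```
<?php
// Sketch: a stop filter built from several languages' stopword lists, so
// titles like "La Ruta Maya" aren't penalized for their leading articles.
// The word list here is a tiny illustrative sample, not a proposal.
$analysisConfig['filter']['multilang_stop'] = [
	'type' => 'stop',
	'stopwords' => [ 'a', 'an', 'the', 'la', 'le', 'el', 'der', 'die', 'das' ],
];
```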
    • Task
I wonder if it would be a good idea to add an "inhistory" or "incontributor"/"bycontributor" metafield to CirrusSearch. This would make it possible to search for articles written by a specific user. A full list of each user's contributions would make the database explode, but adding only the user names should be a modest addition. A variation could be to add the timestamps as well, but I'm not sure that is really useful: you usually know who made some contributions, but not when.
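If only the set of distinct contributor user names is indexed, the mapping addition stays small. A sketch of the extra field for the Elasticsearch 1.x/2.x era mapping syntax; the field name is an assumption:
```
<?php
// Sketch: an unanalyzed field holding the distinct user names that appear
// in a page's history. Field name "contributor" is hypothetical.
$pageMapping['properties']['contributor'] = [
	'type' => 'string',           // would be 'keyword' on Elasticsearch 5+
	'index' => 'not_analyzed',
];
```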
    • Task
If a user has already been selected into the TestSearchSatisfaction2 schema and lands on a page with a Google referrer, log an event along with the full referrer. Most of these probably won't include the search term, but those that do might be interesting and provide a ranking signal.
    • Task
Once a day, when QPS is at its highest point, we can see huge spikes in tp99 (20 sec) for the comp_suggest queries; a spike is also visible on full_text queries. Looking at Hive, there are a few queries (17 for March 22, around 20:30 UTC) that can take around 2 minutes to complete. I'm not sure I understand it; after looking at some nodes I see nothing obvious that could cause such spikes.
    • Task
This search: https://meta.wikimedia.org/wiki/Special:Search?search=Hasty&prefix=Research%3ANewsletter%2F20 should return this page: https://meta.wikimedia.org/wiki/Research:Newsletter/2015/March#cite_note-17 which contains the search string as part of a template:
```
{{Cite journal| [...] | last1 = Hasty| first1 = Robert T.| last2 = Garbalosa| first2 = Ryan C.| [...] }}
```
The other author names in that citation seem to have the same problem. On the other hand, the problem doesn't seem to be the template itself, as other uses of the same template show up just fine in search results ([[https://meta.wikimedia.org/wiki/Special:Search?search=weifang&prefix=Research%3ANewsletter%2F20 | example]]).

@EBernhardson observed on IRC: "fwiw it looks to have been removed somewhere in the indexing, because it's in the source_text field and not the text field. [...] https://meta.wikimedia.org/wiki/Research:Newsletter/2015/March?action=cirrusdump [...] for now you can find it with this, but it doesn't do stemming so you wont find hastily when searching for hasty: https://meta.wikimedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3AHasty+prefix%3AResearch%3ANewsletter%2F20&fulltext=Search "

(Context: The [[https://meta.wikimedia.org/wiki/Research:Newsletter/Archives#Search_the_WRN_archives |archive search function of the Wikimedia Research Newsletter]] has become increasingly important as a way to quickly find coverage of academic research publications about Wikipedia from the past half-decade, so it would be great to fix this one way or another.)
    • Task
Elasticsearch 2.0 added a new analysis module that applies phonetic encodings (doublemetaphone, nysiis, etc.). This could be a great improvement to recall for misspelled queries.
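The analysis-phonetic plugin exposes this as a token filter. A sketch of a recall-oriented analyzer using double metaphone; the filter type and encoder names come from the plugin, while the analyzer wiring and names here are illustrative assumptions:
```
<?php
// Sketch: a phonetic analyzer that could back a "sounds like" field used
// only for recall/rescoring, not primary ranking.
$analysisConfig['filter']['dmetaphone'] = [
	'type' => 'phonetic',
	'encoder' => 'double_metaphone',
	'replace' => false, // keep the original token alongside the phonetic one
];
$analysisConfig['analyzer']['phonetic_text'] = [
	'type' => 'custom',
	'tokenizer' => 'standard',
	'filter' => [ 'lowercase', 'dmetaphone' ],
];
```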
    • Task
Full-text search, when given a query like "reise priebus" which has zero results, will rewrite the query to use the provided suggestion, "reese priests". This isn't really a good fix, but it is a feature we have in production, and relevancyLab should do the same.
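The behaviour to replicate is simple. A sketch, with `runSearch()` as a hypothetical stand-in for however relevancyLab issues queries and inspects the returned suggestion:
```
<?php
// Sketch: mirror production's zero-result rewrite. runSearch() is a
// hypothetical helper returning a result set with a suggestion attached.
$result = runSearch( $query );
if ( $result->numRows() === 0 && $result->getSuggestionQuery() !== null ) {
	// Production silently re-runs the suggested query; do the same here
	// so relevancyLab scores what users actually saw.
	$result = runSearch( $result->getSuggestionQuery() );
}
```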
    • Task
On a brand new wiki with just the pages [[Test Page]] and [[Some Page]], using edit.php to create a page [[Another]] causes the "did you mean" search results to disappear. Direct searches for "test page" and "some page" still work, but where searches for "test paeg" and "some paeg" previously asked "did you mean 'test page'" and "did you mean 'some page'", respectively, there are now no results. No amount of refreshing or waiting brings the "did you mean" back.

Procedure to recreate the issue:
1. Revert VM to the initial wiki state with just [[Test Page]] and [[Some Page]]
2. Refresh search ~25 times to confirm for certain that the "did you mean" for both "test paeg" and "some paeg" is not going anywhere
3. Run `echo "This is a test 0" | php "/opt/meza/htdocs/mediawiki/maintenance/edit.php" -u Admin "Another"`
4. Refresh the search page between 7 and 35 times and see that "did you mean" is no longer present for "test paeg" and "some paeg"
5. Refresh the search page 100 more times to confirm "did you mean" isn't coming back

Note that choosing the page title "Another Page" did not cause this issue; the page title "Asdf" did cause the issue. Nothing significant was seen in the Elasticsearch logs with debug enabled. I've tested this extensively on my platform. See https://github.com/enterprisemediawiki/meza/issues/296

Please note that I can't overemphasize the "sometimes" in the title. If you look at the tests I ran at the link above, you'll see that it was hard to pin down this error initially because it wasn't easily repeatable. It is, however, consistently repeatable using the process above, where edit.php is run immediately after creating Test Page and Some Page.

Two configurations tested, both showing the issue:
* CentOS 7, PHP 5.6.16, MW 1.25.5, Elasticsearch 1.6.2, all WMF extensions on REL1_25 branches
* CentOS 7, PHP 5.6.16, MW 1.26.2, Elasticsearch 1.6.2, all WMF extensions on REL1_26 branches
    • Task
In the desktop schema we have moved forward to measure:
* Number of hits returned
* Click-through position (including offset)
* Dwell time on the page clicked through to, as a rough estimate of satisfaction
* Whether the page clicked through to was scrolled or not
* Combine all searches within a time span as a search session
** Web uses a 10 minute timeout, with the timeout refreshing each time a new search is performed
* Article ID clicked through to
* Tie a click-through to a specific search where possible
* Max score (position 1) of search results
* List of article IDs returned in search

Probably more.
    • Task
Mostly hypothetical, but in theory we could end up with articles that never get deleted from the index. I'm thinking mainly of the situation where you start over on a wiki but don't prune the Elasticsearch index; that is, you had a wiki named foowiki, stopped, and are starting foowiki again. The old articles will never be pruned, since nobody can delete non-existing articles :) When we get such a result we already don't display it (we check for page existence application-side after fetching our results), so it shouldn't be too terribly hard to insert a low-priority delete job around that point.
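Since the missing page is already detected application-side, queuing the cleanup could be as small as the sketch below. Whether CirrusSearch's DeletePages job takes exactly these constructor parameters is an assumption:
```
<?php
// Sketch: when a search hit refers to a page that no longer exists,
// queue a low-priority job to remove the orphaned document. The job
// parameters here are assumed, not verified against the actual class.
if ( !$title->exists() ) {
	JobQueueGroup::singleton()->push(
		new CirrusSearch\Job\DeletePages( $title, [ 'docId' => $docId ] )
	);
}
```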
    • Task
Fairly self-explanatory: inject additional latency at a couple of different levels (50 ms, 100 ms, 250 ms?) and measure how much the latency changes user behaviour. This can help us figure out how much additional backend processing could be used, and at what level doing more processing will degrade user behaviour even if the results are better.
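A sketch of the simplest possible injection point, assuming a hypothetical config variable for the delay that would be varied per test bucket:
```
<?php
// Sketch: delay the response by a configured amount before returning
// search results. $wgCirrusSearchInjectedLatencyMs is hypothetical and
// would be set to 50, 100, or 250 depending on the test bucket.
global $wgCirrusSearchInjectedLatencyMs;
if ( $wgCirrusSearchInjectedLatencyMs > 0 ) {
	usleep( $wgCirrusSearchInjectedLatencyMs * 1000 );
}
```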
    • Task
At Elasticon, Nik mentioned that morelike highlighting can be incredibly useful for understanding what morelike is actually doing, and how it came to decide that a particular document was the best match. We could put this behind a flag of some sort and default it to on for web-based morelike queries, which are a fairly small share of traffic, while defaulting it to off for API-based morelike, which is the majority of traffic (currently).
    • Task
At a presentation given by the author of Elasticsearch's completion search feature, he suggested preferring a single shard when possible. We should evaluate the performance of single-shard indices (with multiple replicas as necessary) versus our current usage of 4 shards on the largest completion indices.
    • Task
We have issues generating metrics for interwiki search because the session information is in local storage, and local storage is per domain (not even per TLD). It is possible to work around this by embedding an iframe in pages and using Window.postMessage to send data from the parent to the iframe and back. All Wikimedia sites would need to use the same domain for the iframe (centralauthwiki? I dunno...). This needs careful consideration of security, to prevent other domains from accessing the "globalStorage"; likely this can be done with a list of TLDs to accept. The resulting API should look like mw.storage, but return promises instead of values directly.
    • Task
Files are a large part of commonswiki, and the top-scoring pages are mostly in the File namespace. We should push the popularity score into the file index so it can be used there as well. This may also need some thought about a better way to collect the popularity score on commonswiki; I'm not sure the pageviews table captures views of an image when it is used as part of an article.
    • Task
Jobrunner log on deployment-jobrunner01 shows repeated occurrences of:
```
2016-02-13T06:28:39+0000: Runner loop 0 process in slot 4 gave status '0':
curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=enwiki&type=cirrusSearchLinksUpdatePrioritized&maxtime=30&maxmem=300M'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your request due to
maintenance downtime or capacity problems. Please try again later.</p>
</body></html>
```
Running it manually shows the execution time limit being hit:
```
$ time curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=enwiki&type=cirrusSearchLinksUpdatePrioritized&maxtime=30&maxmem=300M'
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your request due to
maintenance downtime or capacity problems. Please try again later.</p>
</body></html>

real 2m0.014s
user 0m0.022s
sys 0m0.007s
```
    • Task
Unlike other search parameters, //morelike// only works alone. //Morelike// already creates a unique //morelike area//, comparable in usefulness to a category or a [[//en.wikipedia.org/wiki/wp:outline|subject outline]], but it gives many, many more pages, and they're of unspecified characteristics. A //morelike area// might find new use as a stepping-off point, targeting title, template, category, file, and other page characteristics in that unique area of interest. Also, this singular-standalone trait is awkward to rationalize in the Search documentation.

The cool things about //morelike areas// are that they're not manual, and they're not biased. Compared to categories and outlines, //morelike areas// are more complete; they list missing and orphaned articles. They're easier; they are not hampered by the problems of subcategories (T37402) or of input pagenames (finding category pagenames and outline pagenames). The area of interest is definitely fast to create, just not subsequently workable the way it is with //incategory//.

A workaround for wikiprojects, or a user with their own unique area of interest, is to cut and paste a large reference set of words into the query box, and then filter this with other search parameters to get a meaningful page count and a reusable search link.

The [[//en.wikipedia.org/wiki/Wikipedia:Help_desk/Archives/2016_January_31#Looking_for_a_Bot_to_list_imageless_articles_in_a_given_category| need was raised at]] the Wikipedia help desk.
    • Task
This would allow us to push sanitization out to the edge, where it is supposed to be. It was suggested that this is how Bing handles highlighting internally.
    • Task
Additional metrics we could collect/measure/etc. Checked boxes indicate data we currently collect; the relevant SQL queries are in the comments. Ideally we collect most/all of the necessary information in a single schema, TestSearchSatisfaction2, so that the same sessions are being compared everywhere.

[x] click through position
[x] number of searches per session
[x] search session duration
[x] click through position vs number of searches per session
[x] session abandonment rate - how many sessions don't result in a click through
[x] number of queries per search session
[ ] requery rate - similar to above, but understanding when a user is reformulating a query vs starting a different kind of query. See http://www.cs.cmu.edu/~rosie/papers/jonesKlinknerCIKM2008.pdf
[x] number of click throughs per session
[ ] any of the other metrics split by navigational vs explorational queries (how to determine?)
[x] time to first click through
[x] rate click through as SAT/DSAT (satisfied/dissatisfied). Graph ratio of satisfied vs dissatisfied. We can reuse the current definition of "clicked and stayed on article > x seconds"
[x] time to first SAT
[x] SAT queries vs click through position. Are clicks to results > 10 ever satisfied?
[ ] rate queries as popular and unpopular and compare other metrics split on these lines
[ ] rate sessions as anonymous, low edit user, or high edit user and compare other metrics split on these lines
[ ] record score of top result in a query and compare other metrics on high scoring vs low scoring results
[ ] rate click through as high volume vs low volume page and compare other metrics split on these lines
[ ] record if did you mean was clicked in a session and compare other metrics based on if the did you mean was used
[ ] number of results above the fold (needs to be recorded as part of the searchResultPage event). Could run tests to see if more results above the fold changes other metrics.
[ ] running the above metrics while excluding searches with zero results could be interesting, focusing on precision of results
[ ] most common queries with results but no click through. This may be confounded by us not recording which exact query a click through was caused by.

Some things might still need to be defined / refined:
* Is a search session the length of time between first search and last click, or does it include check-ins that record how long the user viewed the article?
* How long does the user need to stay on the article page to be satisfied?
* What if the user was satisfied without clicking through? Perhaps the highlight contained the necessary information (perhaps unlikely with our current configuration).
* ???

It would be interesting to track how this changes over time, and what effect changes we make to search have on it. Some of this data is already available in the latest iteration of the satisfaction schema; other parts may require that we start collecting more data.
    • Task
It would be nice to know how long the user spends between receiving their search results and clicking through to a result. It might give some indication of the quality of the results: users clicking through faster saw what they were looking for immediately, while users taking longer might be scrolling through the results, considering a few options, then clicking one that is maybe kinda/sorta the right answer (or not). Collecting this in javascript might be hard; it would be nice if we had a way to collect this information from the webrequest logs directly. This might tie into the previous idea of giving each search a unique hash, and having click-throughs from search hit a redirect bounce that includes tracking information and then redirects the user on to their final result. We can't necessarily include these extra tracking parameters in the link to the article itself, because that would hurt article cache hit rates (unless we stuff all the data into wprov somehow).
    • Task
This would help us determine how important the ordering of the top few results is; currently users click through to the second result much more often than the third. Is this because the second result is better, or is this just user behaviour? (See the sketch after the table.)
```
| event_position | count  | percent |
+----------------+--------+---------+
| 1              | 154150 | 64.26%  |
| 2              | 34214  | 14.26%  |
| 3              | 16213  | 6.76%   |
```
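One way to answer this is a small interleaving test: for a sampled fraction of result pages, swap the second and third results and see whether the click-through rate follows the position or the document. A sketch; the sampling rate and mechanism are assumptions:
```
<?php
// Sketch: swap results 2 and 3 for a random 1-in-100 sample of result
// pages, then compare click-through rates per position across buckets.
// $results is assumed to be a zero-indexed array of search results.
if ( mt_rand( 1, 100 ) === 1 && count( $results ) >= 3 ) {
	$tmp = $results[1];
	$results[1] = $results[2]; // zero-indexed: positions 2 and 3
	$results[2] = $tmp;
}
```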
    • Task
Might be a useful scoring metric, not sure. For reference, here are the top 20 pages for 2016/2/2 (with obvious search engines excluded; I'm sure I missed a few). Not sure how exactly to do a comparison, or how it would be integrated... just a thought. The numbers are pretty small, so maybe it's not a distinct enough ranking signal... Not sure about stability either. Basically this was just a thought, partially based on Greg Lindahl's comments on https://www.quora.com/How-many-iterations-should-I-do-to-compute-near-accurate-PageRank-of-Wikipedia-Articles-in-the-latest-dump
```
select project, count(distinct referer_host) as num_referers, collect_set(page_title)[0] as page_title
from (select lower(parse_url(referer, 'HOST')) as referer_host, page_id,
             pageview_info['project'] as project, pageview_info['page_title'] as page_title
      FROM webrequest
      where year=2016 and month=2 and day=2
        and is_pageview = true
        and page_id IS NOT NULL and page_id > 1
        and referer IS NOT NULL and referer_class <> 'internal') x
where 0 == instr(referer_host, ".google.")
  and 0 == instr(referer_host, "duckduckgo.com")
  and 0 == instr(referer_host, "search.yahoo.com")
  and 0 == instr(referer_host, "yandex.ru")
  and 0 == instr(referer_host, "baidu.com")
group by project, page_id
order by num_referers desc
limit 20;

project         num_referers  page_title
en.wikipedia    4228          Main_Page
ru.wikipedia    1431          Заглавная_страница
de.wikipedia    1310          Wikipedia:Hauptseite
en.wikipedia    1261          Donald_Trump
ja.wikipedia    944           メインページ
es.wikipedia    906           Wikipedia:Portada
en.wikipedia    718           Stop_words
en.wikipedia    547           World_population
fr.wikipedia    546           Wikipédia:Accueil_principal
en.wikipedia    487           Groundhog_Day
en.wikipedia    463           HTTP_cookie
meta.wikimedia  457           Help:Contents
ru.wikipedia    449           Труд
en.wikipedia    424           Iowa_caucuses
en.wikipedia    395           Zika_virus
en.wikipedia    385           United_States
en.wikipedia    369           Ted_cruz
ar.wikipedia    365           الصفحة_الرئيسية
en.wikipedia    356           RESTful
en.wikipedia    349           Facebook
```
    • Task
The [[//mediawiki.org/wiki/Extension:Translate|Translate extension]] comes with [[//mediawiki.org/wiki/Template: translatable template|{{Translatable template}}]], but a translated template is no longer searchable with the CirrusSearch hastemplate search parameter. Hastemplate can find template usage where the target is a secondary template, and this ability should also extend to finding where the target template is passed as a parameter. Currently hastemplate doesn't recognize the parameter list as "a place for template names", as it does in template code, where it treats the target template as a secondary.

The only workaround is a set of case-insensitive regex searches, for all possible combinations of aliases, and even after all that work it still sacrifices the visibility of secondary templates. For example, these two queries should have the same count, but hastemplate is way off:
* [hastemplate: ApiEx](//mediawiki.org/wiki/Special:Search/all:hastemplate: ApiEx)
* [insource:/TNT *\| *ApiEx/i insource: "tnt apiex"](//mediawiki.org/w/index.php?search=all:insource:/Tnt *\| *ApiEx/i+insource:+"tnt+apiex"&title=Special:Search&go=Go)

For a target template with two aliases, six queries are required, and even then no secondary template usage is found:
* [hastemplate:Documentation](//mediawiki.org/wiki/Special:Search/all:hastemplate: Documentation)
* [insource:/TNT *\| *Documentation/i insource: "tnt documentation"](//mediawiki.org/w/index.php?search=all:insource:/TNT+*\|+*Documentation/i+insource:+"tnt+documentation"&title=Special:Search&go=Go)
* [insource:/TNT *\| *Doc/i insource: "tnt doc"](//mediawiki.org/w/index.php?search=all:insource:/TNT+*\|+*Doc/i+insource:+"tnt+doc"&title=Special:Search&go=Go)
* [insource:/TNT *\| *Template doc/i insource: "tnt template doc"](//mediawiki.org/w/index.php?search=all:insource:/TNT+*\|+*Template+doc/i+insource:+"tnt+template+doc"&title=Special:Search&go=Go)
* [insource:/Translatable template *\| *Documentation/i insource: "Translatable template documentation"](//mediawiki.org/w/index.php?search=all:insource:/Translatable template+*\|+*Documentation/i+insource:+"Translatable template+documentation"&title=Special:Search&go=Go)
* [insource:/Translatable template *\| *Doc/i insource: "Translatable template doc"](//mediawiki.org/w/index.php?search=all:insource:/Translatable template+*\|+*Doc/i+insource:+"Translatable template+doc"&title=Special:Search&go=Go)
* [insource:/Translatable template *\| *Template doc/i insource: "Translatable template template doc"](//mediawiki.org/w/index.php?search=all:insource:/Translatable template+*\|+*Template+doc/i+insource:+"Translatable template+template+doc"&title=Special:Search&go=Go)