
Wikimedia wiki search is broken (outputting inconsistent results)
Closed, DeclinedPublic

Description

When I go to https://hi.wikipedia.org/w/index.php?search=incategory%3A%22%E0%A4%95%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A3%E0%A4%BE+%E0%A4%9C%E0%A4%BF%E0%A4%B2%E0%A4%BE%22&title=%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7%3A%E0%A4%96%E0%A5%8B%E0%A4%9C, I'm getting inconsistent results. Sometimes when the page loads, it shows a proper listing (1 to 20 of 955 results). Other times when the page loads, it shows an improper listing (no results found).

This leads me to believe that the search indices may not be properly synchronized. Or perhaps data is getting dropped somewhere.


Version: unspecified
Severity: critical
URL: http://wikitech.wikimedia.org/view/Search
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=35691
https://bugzilla.wikimedia.org/show_bug.cgi?id=43544

Details

Reference
bz42423

Event Timeline

bzimport raised the priority of this task to Unbreak Now! on Nov 22 2014, 12:56 AM.
bzimport set Reference to bz42423.

I don't think anyone but roots have access to the search cluster at this point. Hrmph.

I'm seeing similar inconsistent output at https://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=default&search=Boston&fulltext=Search. On certain page loads, the results are "1 - 20 of 22"; on other page loads, the results are "no results matching the query". Something is plainly broken.

  • Bug 42426 has been marked as a duplicate of this bug.
  • Bug 42424 has been marked as a duplicate of this bug.

Can anyone still reproduce this?

(In reply to comment #4)

Can anyone still reproduce this?

Yes. Why do you ask?

  • Bug 42431 has been marked as a duplicate of this bug.

Looking at http://wikitech.wikimedia.org/view/Server_admin_log there are several entries which might be related:

November 25
08:11 apergos: from about half an hour ago, restarted lucene search on search13 and forgot to log it

November 23
04:04 Tim: oh yeah, and I upgraded lucene to my version with the timeouts, deployed to pmtpa only via puppet
04:02 Tim: many lucene search servers failed to bind to port 1099 when they were restarted by the upgrade, restarting manually

svenmanguard wrote:

Hey there. I'm just confirming that search is still useless.

Sven

(In reply to comment #8)

Hey there. I'm just confirming that search is still useless.

Sven

Probably useful for you to identify where you are having issues. The wikis reported on the duplicate and above url for wikidata all seem to return data now, so where is there still a problem?

svenmanguard wrote:

It's returning results now, but it wasn't when I made the above post.

The link in the URL field now works reliably for me too (see the IRC log below and the line by nagios-wm). However, it is worth investigating so this doesn't happen again.

<apergos> I got what I think is a no results page
<apergos> I don't see anything useful in the log about hiwiki (on search13 and search14)
<nagios-wm> RECOVERY - Lucene on search14 is OK: TCP OK - 0.002 second response time on port 8123
<apergos> that's odd, I didn't know it was out to lunch (and it didn't behave like it was)
<apergos> there's a lot of 'thread is waiting' messages
<apergos> tim might have some insight (given his recent change to the code)
<apergos> there are also messages like these:
<apergos> Cannot contact RMI registry for host search0x : Unknown host: search0x
<apergos> but it's hard to tell what is setting that off

(In reply to comment #5)

(In reply to comment #4)

Can anyone still reproduce this?

Yes. Why do you ask?

Because several people that had seen it broken (across multiple wikis) had seen it was no longer broken for them. I figured this bug was a good place to fish for people (and cases) where it was still not working.

Issue isn't reproducible anymore for me, lowering severity/priority.

sumanah wrote:

CC'ing Patrick, as Patrick and Tim (already cc'd) are working on fixes to
immediate search problems.

Probably related: For
<nagios-wm> PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
from 50 minutes ago,
https://gerrit.wikimedia.org/r/35345 was submitted.

(In reply to comment #11)

<apergos> Cannot contact RMI registry for host search0x : Unknown host:
search0x

That's just a configuration hack, flooding the logs with exception backtraces to avoid the need to disable those indexes properly.

ForoaW wrote:

Sorry, but if it fails, it should state that it failed but not pretend that it found 0 results.
A possible solution is that it always returns the date of the latest search database update, and displays an impossible date when nothing is received.

Foroa

Bug 16236 comment 14 implies that this still happens on mediawiki.org.

Patrick and Tim: Has there been any outcome from the investigations into this four weeks ago?

ForoaW wrote:

Remains a problem. Search failed several times, even 5 minutes ago.

See http://commons.wikimedia.org/wiki/Commons:Village_pump#Search_faulty.3F too.

Comment #17 is filed as bug 43544. As far as I can tell, the search problems on Commons aren't a problem right now.

ForoaW wrote:

This morning, I've got at least 10 search failures. In general, retrying it after a few tens of seconds works.

From yesterday's logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20121231.txt :
[21:17:23] <binasher> robla: that bugzilla ticket should probably be closed unless it's worth have a ticket to report new search problems as they happen. search was broken over the thanksgiving holiday, which was when mzmcbride opened it
[23:48:36] <robla> thanks for the update. I think, after asking you about that, we established that things are (as of this instant) in ok shape, but something we could use a little more monitoring of

So I guess they've deteriorated again?
It was also suggested to split comment 17 to another bug: bug 43553.

Isarra just experienced this problem on mediawiki.org and I was able to reproduce (after many tries). The HTML source of the page with no results contained "<!-- Served by srv275 in 10.143 secs. -->", though I'm not sure this is very helpful to debugging.

My suspicion is that the search cluster's indices are not all fully synchronized. Or possibly one of the search boxes is simply broken/unresponsive (ten seconds is an awfully long time to take to respond).

I'll upload screenshots in short order.

Created attachment 11593
Screenshot of mediawiki.org incorrectly showing no search results, 2013-01-05

Attached:

mediawiki.org-incorrectly-showing-no-search-results-2013-01-05.png (800×1 px, 127 KB)

Created attachment 11594
Screenshot of mediawiki.org correctly showing search results, 2013-01-05

Attached:

mediawiki.org-correctly-showing-search-results-2013-01-05.png (800×1 px, 200 KB)

Bug 43663 is a potential duplicate.

I'm increasing priority as this seems to affect quite a few people and makes finding information cumbersome and error-prone.

Have not been able to reproduce this on mediawiki.org, both for being logged in and not being logged in. Haven't seen anything suspicious since 2013-01-05 in the server admin log at http://wikitech.wikimedia.org/view/Server_admin_log either (except for job queue with lots of items).
Decreasing prio/seve again.

ForoaW wrote:

Problem still persists on Commons. Search is a major tool to check and complete categories. (Hundreds of thousands of image categorisation backlog).

If the problem exists, please provide explicit and exact steps to reproduce (what to do when, Search in page name vs. page contains etc, a URL / search term to reproduce with) so others can try to reproduce.
"It still happens" only is unfortunately not helpful.

ForoaW wrote:

The basic problem is that search randomly fails to return results without any indication that it failed; we know that it failed only because there are no results and we know that there should be results. I got it this morning a couple of times on Commons. In general, redoing the same search (once or a couple of times) one or more seconds later finally returns some results. So the basic problem is that the service is not reliable and does not report that there is a problem. That is why I proposed on another bug report to return at least a status and the date of the search database update (which is another source of frustration, as it looks as if it takes between 1 and 5 days before new files are included in the search database).

As test procedure, one could easily make a script that uses as search string the name of a random category (or a word of it) and searches in files, galleries and categories: each search should yield some results. Obviously, such tests should be done on en:wiki (that contains the largest volume of data) or Commons (that probably has the most items in its database).
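The test procedure described above could be sketched roughly as follows. This is only an illustration: `find_inconsistent_terms` and the `search` callable are hypothetical names, and `search` is assumed to be any function that takes a search string and returns the number of hits (for example, a wrapper around an HTTP request to Special:Search). Nothing here reflects how the search cluster itself works.

```python
from typing import Callable, Dict, Iterable, List


def find_inconsistent_terms(
    search: Callable[[str], int],
    terms: Iterable[str],
    attempts: int = 5,
) -> Dict[str, List[int]]:
    """Run every term several times and report terms whose hit count varied.

    `search` is a caller-supplied (hypothetical) function returning the
    number of results for a term. A term that sometimes yields hits and
    sometimes yields zero is exactly the symptom reported in this bug.
    """
    inconsistent: Dict[str, List[int]] = {}
    for term in terms:
        counts = [search(term) for _ in range(attempts)]
        # More than one distinct count means the same query gave different answers.
        if len(set(counts)) > 1:
            inconsistent[term] = counts
    return inconsistent
```

Feeding it category names, as suggested, would give a list of terms that returned both "N results" and "no results" within one run, together with the counts seen on each attempt.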

(In reply to comment #30)

If the problem exists, please provide explicit and exact steps to reproduce
(what to do when, Search in page name vs. page contains etc, a URL / search
term to reproduce with) so others can try to reproduce.
"It still happens" only is unfortunately not helpful.

Comment 24, comment 25, and comment 26 could not be any more explicit, showing very clearly both the symptom of the problem and the steps to reproduce (the URL bar was intentionally included in both screenshots).

This problem happens intermittently, through absolutely no fault of users. This bug is waiting on a sysadmin to investigate, debug, and resolve the problem.

(In reply to comment #28)

Have not been able to reproduce this [...]
Decreasing prio/seve again.

Per comment 32, I've marked it "critical" again; it still needs an assignee.
(It could be a legitimate "blocker", it surely blocks any search-related development/debugging/whatever.)

(In reply to comment #32)

Comment 24, comment 25, and comment 26 could not be any more explicit

For your case (mediawiki.org) yes.
But comment 29 (that I answered) was about Commons.

ForoaW wrote:

Failed several times last hour. On the bottom of the source, it reads:

  • <!-- Served by srv232 in 10.193 secs. --> when it returns no results, no idea what time it took
  • <!-- Served by srv192 in 0.631 secs. --> after second attempt with 36 results

A second case http://commons.wikimedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=Bottle+filling+-incategory%3A%22Bottle_filling%22&fulltext=Search&ns0=1&ns6=1&ns14=1&redirs=1&profile=advanced:

  • fail: <!-- Served by mw39 in 10.283 secs. --> after roughly 10 seconds
  • fail: <!-- Served by mw49 in 10.188 secs. --> after roughly 10 seconds
  • 68 results :<!-- Served by mw38 in 0.619 secs. -->

It looks as if a search query taking more than 10 seconds aborts the request. I think that some searches without results return in much less than 10 seconds, I will try to estimate that better.
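To check the 10-second pattern over many saved pages, the "Served by" comment can be pulled out of the HTML source with a few lines of Python. This is a sketch only: the function names and the 10.0 second threshold are assumptions taken from the timings quoted above, not from any server configuration.

```python
import re
from typing import Optional, Tuple

# Matches comments such as: <!-- Served by srv232 in 10.193 secs. -->
SERVED_BY = re.compile(r"<!--\s*Served by (\S+) in (\d+(?:\.\d+)?) secs\.\s*-->")


def parse_served_by(html: str) -> Optional[Tuple[str, float]]:
    """Return (apache_host, seconds) from the page source, or None if absent."""
    match = SERVED_BY.search(html)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def looks_like_timeout(html: str, threshold: float = 10.0) -> bool:
    """Guess whether a page was served after hitting the suspected 10 s limit.

    The threshold is an assumption based on the observed timings only.
    """
    parsed = parse_served_by(html)
    return parsed is not None and parsed[1] >= threshold
```

Note that the host reported here is the web server that rendered the page, so this only helps to separate slow responses from fast ones.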

ForoaW wrote:

The previous failed tests were done within 25 minutes: 3 failures out of 10 or so attempts. Then I did about 100 tests without failure. Below are a couple of failures from this morning, within say 45 minutes, during which about 50 searches returned valid results.

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=B%C3%A9rat+-incategory:%22B%C3%A9rat%22

  • Fail: <!-- Served by mw26 in 10.187 secs. -->
  • 320 results: <!-- Served by mw48 in 1.273 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Commonwealth%20War%20Graves%20Commission%20cemeteries%20in%20England+-incategory:%22Commonwealth_War_Graves_Commission_cemeteries_in_England%22

  • Fail: <!-- Served by mw28 in 10.253 secs. -->
  • 22 results: <!-- Served by mw52 in 0.552 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Images%20from%20the%20Geograph%20British%20Isles%20project%20needing%20categories%20in%20grid%20NW9667+-incategory:%22Images_from_the_Geograph_British_Isles_project_needing_categories_in_grid_NW9667%22

  • Fail: <!-- Served by srv245 in 10.196 secs. -->
  • 1 result: <!-- Served by mw55 in 0.353 secs. -->

http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Schwalm-Radweg+-incategory:%22Schwalm-Radweg%22

  • Fail: <!-- Served by srv237 in 10.204 secs. -->
  • Fail: <!-- Served by srv267 in 10.181 secs. -->
  • Fail: <!-- Served by mw28 in 10.288 secs. -->
  • 1 result: <!-- Served by srv261 in 0.224 secs. -->

(In reply to comment #34)

(In reply to comment #32)

Comment 24, comment 25, and comment 26 could not be any more explicit

For your case (mediawiki.org) yes.
But comment 29 (that I answered) was about Commons.

Yeah, I'm not sure how many times it needs to be confirmed as broken. It's broken. Really. It needs to be fixed and that requires a sysadmin to investigate, debug, and resolve the issue. Can you find a willing sysadmin, please?

Valerie.m.juarez wrote:

(In reply to comment #24)

Isarra just experienced this problem on mediawiki.org and I was able to
reproduce (after many tries). The HTML source of the page with no results
contained "<!-- Served by srv275 in 10.143 secs. -->"...

I could reproduce this error about 10% of the time (About 3 out of 30ish reloads would return no results).

"<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one of the pages.

I'm not sure this is very helpful to debugging.

I would love to know if there is any way we can provide more info from the client side to help track down this issue.

(In reply to comment #38)

(In reply to comment #24)
"<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one
of the pages.

I'm not sure this is very helpful to debugging.

I would love to know if there is any way we can provide more info from the
client side to help track down this issue.

As I understand it, the "served by" HTML comment generally tells a user which server last parsed that particular page. In the context of search results, it tells the user which Apache server served the results. However, I believe for this bug we're interested in which _search_ server or cluster produced the results, not which Apache served the results. I don't believe the search server information is exposed anywhere.

(In reply to comment #39)

(In reply to comment #38)

(In reply to comment #24)
"<!-- Served by mw27 in 10.181 secs. -->" Was contained in the source of one
of the pages.

I'm not sure this is very helpful to debugging.

I would love to know if there is any way we can provide more info from the
client side to help track down this issue.

As I understand it, the "served by" HTML comment generally tells a user which
server last parsed that particular page. In the context of search results, it
tells the user which Apache server served the results. However, I believe for
this bug we're interested in which _search_ server or cluster produced the
results, not which Apache served the results.

That's correct, the "Served by mwXX" comment shows which Apache served the page. However, there is a comment in the HTML of Special:Search that looks like:

<!-- Search results fetched via search=[search14,search14], highlight=[search14], suggest=[search16] in 476 ms -->

Which would probably be more helpful (I assume anyhow. Not overly familiar with search infrastructure)

<!-- Search results fetched via search=[search14,search14],
highlight=[search14], suggest=[search16] in 476 ms -->

Just to be 100% clear. That comment is *NOT* from a search that failed. I just copied and pasted from a successful search to show what the comment looked like.

Valerie.m.juarez wrote:

I don't see a comment like that on a page where the search failed.

Valerie.m.juarez wrote:

Page generated on failed search

Attached:

Valerie.m.juarez wrote:

Attached the php file generated when a search fails. If that helps.

Created attachment 11618
HTML source diffed between responses with and without results (same URL/query)

took less than 10 tries to get a no results page. (and then had to do it again because my phone OOM'd and still it was <10x)

the fetched via line is in fact missing for the empty result set.
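Since the "fetched via" comment appears on successful searches and is missing on the empty ones, a client-side check can use its absence as a hint that the search backend never answered. A rough sketch, assuming the comment always has the shape quoted earlier in this thread (the function name and return format are made up for illustration):

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches: <!-- Search results fetched via search=[a,b], highlight=[c] in 476 ms -->
FETCHED_VIA = re.compile(
    r"<!--\s*Search results fetched via (.*?) in (\d+) ms\s*-->", re.DOTALL
)
ROLE = re.compile(r"(\w+)=\[([^\]]*)\]")


def parse_fetched_via(html: str) -> Optional[Tuple[Dict[str, List[str]], int]]:
    """Return ({role: [hosts]}, milliseconds), or None when the comment is missing.

    A None result on a "no results" page is the case observed in this bug.
    """
    match = FETCHED_VIA.search(html)
    if match is None:
        return None
    roles = {
        role: [host.strip() for host in hosts.split(",") if host.strip()]
        for role, hosts in ROLE.findall(match.group(1))
    }
    return roles, int(match.group(2))
```

The regular expression tolerates the comment being wrapped over two lines, as it is when quoted in replies.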

Attached:

sumanah wrote:

I am still running into this. Just now I searched on mediawiki.org for "blog" and got 0 results at first, then reran the search and got a lot.

(See Bug 16236 for more repro cases, in case anyone wants them.)

I am adding Munagala Ramanath (Ram) to cc and raising priority to "Highest" - Ram, can you take a look at this?

Per the comments above, where it was discovered that timed-out search requests do not include a comment saying which search server was used, we should probably change that (unless I missed something; I'm not too familiar with search).

More specifically for this problem, logging all failed searches and seeing if there is an obvious pattern in terms of which search host failed would probably be a good idea.
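Once such a log exists, looking for a pattern could be as simple as tallying failures per host. The sketch below assumes a hypothetical log already reduced to `(host, succeeded)` pairs; no such log format exists yet as far as this thread shows.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def failure_rate_by_host(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of failed searches per search host.

    `records` is a hypothetical sequence of (host, succeeded) pairs.
    """
    totals: Counter = Counter()
    failures: Counter = Counter()
    for host, succeeded in records:
        totals[host] += 1
        if not succeeded:
            failures[host] += 1
    return {host: failures[host] / totals[host] for host in totals}
```

A single host with a much higher rate than the rest would point at one broken box; roughly equal rates everywhere would point at something shared, such as a timeout in the calling code.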

ram wrote:

Status of this issue is now being tracked in 43544. I've attached a script there
that allows this problem to be reproduced at will.

I'm fully engaged on this issue but it will be another week or two before there is any material progress since it is taking time to understand the PHP code at one end and the Java code at the other.

So, apologies for the problems with unreliable search results on several wikipages so far.
Ram is going to take a look at these problems (see bug 42423 comment 48), but it'll take some more time. I'm tentatively assigning this report to Ram.

Trying to summarize the situation:

Issues with unreliable search on Commons:
Bug 42431 (hmm, marked as dup of bug 42423), bug 43920, bug 35691

Bug 42423 itself is very generic.
Initial comment mentions hi.wikipedia.org.
wikidata.org is mentioned (bug 42424 marked as dup), {en|fr}.wikisource.org (bug 42426 marked as dup).
It also mentions mediawiki.org (copied from bug 16236 comment 14, and bug 42423 comment 24).
Bug 42423 comment 19 states Commons problems.

Bug 42423 comment 35 and bug 42423 comment 36 imply that some requests take longer than 10 seconds and are then aborted.

Better debugging such problems is the subject of bug 43544: Show an error message instead of "zero results".
Thanks to Ram, bug 43544 also has a script to reproduce these problems.

A totally separate issue is bug 43663: Search on ua.wikimedia.org (chapter website) does not work AT ALL.

For general information on the Search situation, also see the posting by Ram at
http://lists.wikimedia.org/pipermail/wikitech-l/2013-February/066273.html

  • Bug 43920 has been marked as a duplicate of this bug.

I experienced a failure on Meta today. I ran a specific PrefixIndex-directed search, which succeeded; ran the same search with broader criteria, which failed; then went back a minute later and it worked.

Search run from m:User:Billinghurst using the COIBot search boxes, search word Abercrombie, circa 22:33, 27 February 2013 (UTC)

Quick update:

Ram (who started a few weeks ago) is trying to improve the Search debugging infrastructure first by working on

  • bug 45266
  • bug 43544

so it will be easier to find potential reasons (some bugs are expected to get fixed by this, or at least easier to identify).

After these two bugs have been resolved, bug 42423 and bug 43663 are very likely next on the list. Sorry that this takes a bit longer, but the plan is to "do it right".

Bug 43544 is fixed now so there should be at least error messages when the results are inconsistent.

Can anybody say if this is the case?

With the given examples in this bug report I could not manage so far to get inconsistent results or errors.

I think this bug can probably be marked resolved/fixed at this point.

ForoaW wrote:

I could not observe a false empty search return with zero results. I notice a red time-out message from time to time (say 15 times per week). Maybe the message could be massaged a bit, such as "Temporary search engine overload, please try again ..."

I must admit that, now that I have done some additional tests, the system impresses me; I did not manage to make it time out.

Closing as per comment 54 and comment 55.
Thanks everybody, and again sorry that improving the situation took a while (and improving the Search is still ongoing work and complicated enough). :-/

Followup issues: bug 45266, bug 47761.