
ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean)
Closed, Resolved · Public · PRODUCTION ERROR

Description

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=160&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3AMaharashtra_Control_of_Organised_Crime_Act&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&uselang=content&smaxage=86400&maxage=86400

/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=160&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3AMaharashtra_Control_of_Organised_Crime_Act&gsrnamespace=0&gsrlimit=3&gsrqiprofile=classic_noboostlinks&uselang=content&smaxage=86400&maxage=86400   BadMethodCallException from line 158 of /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean) {"exception_id":"WTlRuwpAIDwAAEtlYkEAAABD","caught_by":"other"} 
[Exception BadMethodCallException] (/srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiQuerySearch.php:158) Call to a member function termMatches() on a non-object (boolean)
  #0 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiQuerySearch.php(47): ApiQuerySearch->run(ApiPageSet)
  #1 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiPageSet.php(180): ApiQuerySearch->executeGenerator(ApiPageSet)
  #2 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiPageSet.php(144): ApiPageSet->executeInternal(boolean)
  #3 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiQuery.php(238): ApiPageSet->execute()
  #4 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiMain.php(1578): ApiQuery->execute()
  #5 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiMain.php(545): ApiMain->executeAction()
  #6 /srv/mediawiki/php-1.30.0-wmf.2/includes/api/ApiMain.php(516): ApiMain->executeActionWithErrorHandling()
  #7 /srv/mediawiki/php-1.30.0-wmf.2/api.php(83): ApiMain->execute()

Event Timeline

Restricted Application added a subscriber: Aklapper.
Anomie subscribed.

It seems that the call to $search->searchText( $query ) is returning an 'ok' Status object that has ->getValue() as false rather than either null or a SearchResultSet as expected. Removing the "gsrqiprofile=classic_noboostlinks" parameter causes this not to happen.
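
For illustration, the failing shape is roughly the following (a hedged sketch, not the actual ApiQuerySearch.php code; the final guard and exception are only there to show where a check would go):

  // Hedged illustration of the failure shape, not the actual ApiQuerySearch.php
  // code; $search and $query stand in for the generator's SearchEngine and query.
  $matches = $search->searchText( $query );
  if ( $matches instanceof Status ) {
      // Normally an "ok" Status here wraps a SearchResultSet (or null), but in
      // this case getValue() returned boolean false.
      $matches = $matches->getValue();
  }
  if ( $matches !== null && !$matches instanceof SearchResultSet ) {
      // Without some guard along these lines, execution reaches the later
      // termMatches() call at line 158 and fatals as in the trace above.
      throw new UnexpectedValueException( 'Search backend returned no result set' );
  }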

debt triaged this task as High priority. Jun 8 2017, 5:06 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

It looks like what happened here is that there is a cache, internal to CirrusSearch, that is used for "more like" queries. This particular query for some reason got a 502 Bad Gateway from the nginx that fronts the elasticsearch servers with SSL, and that response was cached. Most likely that occurred because we have a rolling restart running across the elasticsearch cluster for a version upgrade right now, and nginx couldn't talk to elasticsearch for a moment. That shouldn't have been cached.

As an immediate fix I've cleared out the cache keys for the ~50 pages that have logged this error. I am working out how to prevent these bad values from being cached in the future, and I'll check with @Gehel about perhaps depooling servers from LVS while they are being restarted. The servers are only down for a handful of seconds during a restart, but it seems that's enough to trigger a few errors. The caching ended up remembering that error and repeating it enough times to be noticed in the logs.

Is there a way to have a shorter cache on 5xx errors? On the other side, yes, we could (and should) depool nodes during restarts! I added the pool/depool steps to my ugly scripts; moving them to cumin will make it cleaner.

This is still happening intermittently.

@Gehel The thing I haven't been able to figure out yet is that the code does not cache errors; it only caches when the search returns a success. Obviously that isn't holding in practice, and somehow errors are being cached anyway. I'm looking back into this to try to reproduce the error in a test environment.
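
For reference, the intent is the usual cache-on-success-only pattern, roughly like this (a hedged sketch, not CirrusSearch's actual caching code; $cache, $key, $ttl and runMoreLikeQuery() are illustrative names):

  // Hedged sketch of the intended "cache only on success" pattern.
  $resultSet = $cache->get( $key );
  if ( $resultSet === false ) { // BagOStuff::get() returns false on a cache miss
      $status = runMoreLikeQuery( $params );
      $resultSet = $status->getValue();
      if ( $status->isOK() && $resultSet instanceof SearchResultSet ) {
          // Only a real result set should ever be written back; the behaviour seen
          // here implies a failed backend response still passed a check like this,
          // e.g. an "ok" Status wrapping boolean false.
          $cache->set( $key, $resultSet, $ttl );
      }
  }

A variant of @Gehel's shorter-TTL idea would be to cache failures too, but with a much smaller $ttl, so a transient 502 ages out quickly instead of being replayed all day.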

debt lowered the priority of this task from High to Medium. Aug 22 2017, 5:30 PM
debt subscribed.

This still appears to be happening at a low rate every day, but we are having trouble reproducing it consistently. Moving to the backlog for now.

Mentioned in SAL (#wikimedia-operations) [2017-10-16T23:46:39Z] <ebernhardson> increase elasticsearch log levels for transport to debug to help track down T167410

Change 384633 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Check multi responses for http errors

https://gerrit.wikimedia.org/r/384633
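
Conceptually the change amounts to something like the following (a hedged sketch of the idea only, not the actual patch; $decoded stands for an already-decoded _msearch response body, and the message key is a placeholder):

  // Check each entry of a multi-search response for a per-query failure
  // instead of trusting the outer HTTP status alone.
  $status = Status::newGood();
  if ( isset( $decoded['responses'] ) ) {
      foreach ( $decoded['responses'] as $response ) {
          if ( isset( $response['error'] ) ) {
              // Elasticsearch reports individual query failures inside the
              // multi-search envelope, so each entry has to be checked before
              // the result is treated (and cached) as a success.
              $status = Status::newFatal( 'cirrussearch-backend-error' );
              break;
          }
      }
  }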

Looked into this a bit more today; correlated log messages:

mediawiki error in logstash: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.10.16/mediawiki/?id=AV8ktvUqePsvZ6LqtH8R

Occurred at 2017-10-16T10:24:50 from mw1232 (10.64.48.67)

nginx error on elastic1037:

2017/10/16 10:24:50 [error] 3545#3545: *239618179 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.64.48.67, server: , request: "POST /_msearch HTTP/1.1", upstream: "http://[::1]:9200/_msearch", host: "search.svc.eqiad.wmnet:9243"

Upstream here refers to elasticsearch, so somehow elasticsearch closed the socket? But with no errors logged on the elasticsearch side, it must consider that a normal close.

Closest log messages before and after the error from production-search-eqiad.log on elastic1037 (not close enough to be correlated):

[2017-10-16T10:19:05,537][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][2222650] overhead, spent [2s] collecting in the last [2.6s]
[2017-10-16T11:04:02,659][WARN ][org.elasticsearch.monitor.jvm.JvmGcMonitorService] [gc][2225344] overhead, spent [2s] collecting in the last [2.8s]

GC logging shows very typical young GCs around that time, nothing that would prevent elasticsearch from responding. 10G of allocations in 5 seconds does sound a bit high, but that might be typical (rough arithmetic after the excerpt below):

2017-10-14T10:24:48.003+0000: 2052284.168: [GC (Allocation Failure) 
Desired survivor size 38797312 bytes, new threshold 1 (max 15)
[PSYoungGen: 10425681K->27408K(10447872K)] 30929942K->20543467K(31419392K), 0.0435420 secs] [Times: user=0.74 sys=0.00, real=0.05 secs] 
2017-10-14T10:24:48.048+0000: 2052284.212: Total time for which application threads were stopped: 0.0511065 seconds, Stopping threads took: 0.0011211 seconds
2017-10-14T10:24:53.264+0000: 2052289.428: [GC (Allocation Failure) 
Desired survivor size 39321600 bytes, new threshold 1 (max 15)
[PSYoungGen: 10437392K->27171K(10437632K)] 30953451K->20560821K(31409152K), 0.0381115 secs] [Times: user=0.70 sys=0.00, real=0.04 secs] 
2017-10-14T10:24:53.303+0000: 2052289.467: Total time for which application threads were stopped: 0.0440409 seconds, Stopping threads took: 0.0002630 seconds
2017-10-14T10:24:55.964+0000: 2052292.128: Total time for which application threads were stopped: 0.0067332 seconds, Stopping threads took: 0.0004151 seconds
2017-10-14T10:24:55.969+0000: 2052292.133: Total time for which application threads were stopped: 0.0048483 seconds, Stopping threads took: 0.0001091 seconds
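
Rough arithmetic from the excerpt above: the two Allocation Failure collections are about 5.3 seconds apart (10:24:48.003 to 10:24:53.264), and each drains a young generation that had refilled to roughly 10.4 GB (e.g. 10437392K -> 27171K), so the allocation rate between them is roughly 10.4 GB / 5.3 s, or about 2 GB/s, with stop-the-world pauses of only ~40-50 ms.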

I haven't had much luck figuring out what's happening on the elasticsearch side. I've just set {"logger":{"transport": "DEBUG"}} in the cluster settings, which I think might turn up the transport logging, but I'm not 100% sure about that. Will have to wait a day or two and see if anything pops up. At a high level, as far as I can tell, elasticsearch is closing the connection to nginx. nginx was expecting a response, but elasticsearch doesn't seem to think closing it was any kind of error (based on the lack of a logged error).

On further thought, I think the transport logger is the wrong one. That looks to be inter-node transport. I'm not sure yet which logger would have information on REST connections; it will probably require more poking around to figure out where that is happening.

Increased logging levels to debug about 1.5 hours ago for:

  • org.elasticsearch.http
  • org.elasticsearch.transport
  • io.netty

Nothing particularly interesting there. A couple of "caught exception while handling client http traffic, closing connection" messages, which is basically the same problem but in the other direction (nginx hung up on elasticsearch before it was expecting it). These messages don't seem to translate into user-visible errors (at least, we don't log anything indicating there was a problem serving a request).

I did a little tcpdump'ing of traffic and verified that nginx is not using keepalive when talking to elasticsearch. Nginx receives a request, opens a connection to elasticsearch, runs it, and closes the connection. This rules out stale connections as a potential cause. I'm going to let the increased logging levels run a while and see if anything correlates with the next termMatches error we log. These errors are relatively rare though, so perhaps it's not worth digging in too deep.

Sounds like a plan — let it run for a bit with increased logging and see if we see more of these errors. Otherwise, we can go ahead and close this out.

Change 384633 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Check multi responses for http errors

https://gerrit.wikimedia.org/r/384633

The error is still pretty rare, and I'm not seeing anything in the new logging that will help. The raw error rate is slightly higher right now, at 160 errors in the last 24 hours, but it's really one error that was cached for 24 hours. We should stop caching that error once the above patch rolls out with wmf.5 this week.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM