
profiler.php sometimes emits RedisException "read error on connection" during request shutdown
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error

Request ID: W7QPvwrAIE0AAI9EhRgAAAAL

message
RedisException: read error on connection
stacktrace
#0 [internal function]: Redis->checkConnection()
#1 [internal function]: Redis->sockReadLine()
#2 [internal function]: Redis->sockReadData(NULL)
#3 [internal function]: Redis->processLongResponse()
#4 /srv/mediawiki/wmf-config/profiler.php(276): Redis->__call(string, array)
#5 [internal function]: Closure$#3()
#6 {main}

Notes

Seen 35 times in the past 30 days, since at least September 14 with 1.32.0-wmf.19.

Event Timeline

From the stacktrace it's unclear whether it's happening during pre-send shutdown or post-send shutdown. If it's pre-send shutdown, then this would probably fatal the webrequest in question.

General improvement

We should move it to post-send shutdown and wrap it in a try/catch so that we also isolate any future potential for errors. (Profiling must never break the web request!)
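The shape of that change, sketched below. This is not the actual patch: register_postsend_function is HHVM's post-send hook, and the channel name and payload variable are placeholders.
$stacks = []; // stand-in for the sampled stack traces collected during the request
register_postsend_function( function () use ( $stacks ) {
	try {
		$redis = new Redis();
		if ( $redis->connect( 'mwlog1001.eqiad.wmnet', 6379, 0.1 ) ) {
			$redis->publish( 'xenon', implode( "\n", $stacks ) ); // 'xenon' is a placeholder channel name
		}
	} catch ( Exception $e ) {
		// Losing a few sampled traces is acceptable; fataling the request is not.
	}
} );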

Specific issue

As for the specific error found in this task, it looks like it might have to do with the connection timeout being reached. That seems fair, given we set it very low, at 0.1s. It's quite acceptable to lose a few traces from the sampling profiler, although it'd be good to confirm that this is indeed just caused by the network and/or the Redis service being a little slower to respond than usual.

Checking online led me to https://stackoverflow.com/a/18087513/319266 and https://github.com/phpredis/phpredis/issues/70 which further suggest three alternative explanations:

  • Redis service might be under load. I was under the impression we're pretty low on load, so if this is the case, it might be worth making sure the underlying system, network, and Redis service are configured and set up to handle the required load, also with future growth in mind. If we're at capacity and dropping a few requests per day, that seems worrying: not because of what we lose today, but because it means we're at capacity.
  • The connect timeout might not be working. The thread suggests the timeout parameter is "relatively new" and might not be supported by our version of phpredis, in which case the 0.1s we're passing would be ineffective. We're not actually using phpredis, but HHVM's built-in port of it, so some differences here aren't unexpected. To be confirmed (see the sketch after this list).
  • The phpredis client can sometimes persist connections without it being obvious, which could lead to the kind of error we found. We don't want to persist connections here, so it should be confirmed that this is indeed not happening.
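A minimal sketch of how the second possibility could be checked, assuming nothing beyond the stock client; the non-routable address is a placeholder that should never answer:
$redis = new Redis();
$start = microtime( true );
$ok = $redis->connect( '10.255.255.1', 6379, 0.1 ); // placeholder non-routable address
var_dump( $ok );                                    // bool(false) either way
printf( "connect() gave up after %.0fms\n", ( microtime( true ) - $start ) * 1000 );
# ~100ms means the timeout parameter is honoured; several seconds (the
# kernel's default TCP connect timeout) would mean it is being ignored.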

Given the start date of September 14, I'm assuming this is probably due to the switchover. We're logging to mwlog1001 which is in eqiad, so the distance is greater now and that would explain why connections would take longer to establish on average.

Based on the stacktrace, this isn't coming from $redis->connect(). Instead, it's coming from publish(). Specifically, it's from after the PUBLISH command is sent to the server, when the client is attempting to read the server's response. The profiling code doesn't actually use the response, though.

Right before reading the response from the Redis socket, HHVM's Redis::sockReadLine() calls checkConnection(), which checks stream_get_meta_data()['timed_out'] and throws based on that. (hhvm src)
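Paraphrased, the check looks roughly like this (not the actual HHVM source, just the shape of the logic that produces the exception in the stacktrace):
function checkConnectionLike( $socket ) {
	// If the last socket operation hit the stream timeout, throw the
	// generic exception seen above.
	$meta = stream_get_meta_data( $socket );
	if ( $meta['timed_out'] ) {
		throw new RedisException( 'read error on connection' );
	}
}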

I've thus eliminated possibility 2 (timeout not working) and possibility 3 (persistent connections being used). This leaves possibility 1 (Redis under load), and the initial premise of it simply being an actual timeout: increased physical latency makes it impossible to always get the data sent and acknowledged in under 100ms.

Physical latency seems almost certain to be the issue. It's 36ms roundtrip from codfw to eqiad, so the tolerance is quite tight.
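To quantify that tightness, a hypothetical probe (not part of the debugging session below, though it reuses the same test channel) could time a single PUBLISH round trip against the 0.1s timeout the profiler passes:
$redis = new Redis();
$redis->connect( 'mwlog1001.eqiad.wmnet', 6379, 1.0 ); // generous timeout for the probe itself
$start = microtime( true );
$redis->publish( 'testing-T206092', 'latency-probe' );
printf( "publish round trip: %.1fms of the 100ms budget\n", ( microtime( true ) - $start ) * 1000 );
# With ~36ms RTT, one reply already consumes over a third of the budget,
# leaving little slack for scheduling delays or a momentarily busy Redis.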

I'd suggest just wrapping it in a try and ignoring failures. It's not really worth doing anything more than that, given that this has only occurred a couple of dozen times.

Yeah, agreed the current loss rate is quite acceptable. The priority is to not fatal user requests (a few dozen per week isn't much, but we currently have a ton of distinct issues like that adding up), and to briefly confirm this isn't an early symptom of a deeper issue that will worsen in the future (e.g. congestion, misconfiguration, or overload).

Change 464178 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] profiler: Prevent Xenon flushes from being able to fatal a request

https://gerrit.wikimedia.org/r/464178

mwdebug2001 / graceful connect
$ret = $redis->connect( 'mwlog1001.eqiad.wmnet', 6379, 0.01 ); // unrealistic timeout, always fails
# bool(false)

// from hhvm/Redis::connect/fsockopen
# Warning: timed out after 0.01 seconds when connecting to mwlog1001.eqiad.wmnet [110]: Connection timed out

// from  hhvm/Redis::connect
# Warning: Failed connecting to redis server at mwlog1001.eqiad.wmnet: No such file or directory
mwdebug2001 / ungraceful publish
$redis->connect( 'mwlog1001.eqiad.wmnet', 6379, 0.035 ); // unrealistic timeout, fails about 50/50 
$redis->publish( 'testing-T206092', 'foo-300');
$redis->publish( 'testing-T206092', 'foo-301');
$redis->publish( 'testing-T206092', 'foo-302');
# exception RedisException: read error on connection
# #0 (): Redis->checkConnection()
# #1 (): Redis->sockReadLine()
# #2 (): Redis->sockReadData()
# #3 (): Redis->processLongResponse()
# #4 /srv/mediawiki/php-1.32.0-wmf.23/maintenance/eval.php(78): Redis->__call()

Mentioned in SAL (#wikimedia-operations) [2018-10-09T02:41:05Z] <krinkle@deploy1001> Synchronized wmf-config/profiler.php: T176916 / T206092 - Ie86e88777c48 (duration: 00m 56s)

Change 464178 merged by jenkins-bot:
[operations/mediawiki-config@master] profiler: Prevent flush from fataling a request

https://gerrit.wikimedia.org/r/464178

Mentioned in SAL (#wikimedia-operations) [2018-10-10T22:16:14Z] <krinkle@deploy1001> Synchronized wmf-config/arclamp.php: T206092 - If607ad111a (duration: 00m 48s)

Krinkle triaged this task as High priority.
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM