Redis latency monitor config change broke Redis in Beta Cluster
Closed, ResolvedPublic

Description

Error on betacluster resetting my Preferences. Tried twice, same results.

[b376ad5b] /wiki/Special:Preferences/reset JobQueueConnectionError from line 753 of /srv/mediawiki/php-master/includes/jobqueue/JobQueueRedis.php: Unable to connect to redis server deployment-redis01.eqiad.wmflabs.

Backtrace:

#0 /srv/mediawiki/php-master/includes/jobqueue/JobQueueRedis.php(210): JobQueueRedis->getConnection()
#1 /srv/mediawiki/php-master/includes/jobqueue/JobQueue.php(324): JobQueueRedis->doBatchPush(array, integer)
#2 /srv/mediawiki/php-master/includes/jobqueue/JobQueue.php(296): JobQueue->batchPush(array, integer)
#3 /srv/mediawiki/php-master/includes/jobqueue/JobQueueGroup.php(129): JobQueue->push(array)
#4 /srv/mediawiki/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php(95): JobQueueGroup->push(array)
#5 /srv/mediawiki/php-master/includes/Hooks.php(195): BetaFeaturesHooks::updateUserCounts(User, array)
#6 /srv/mediawiki/php-master/includes/user/User.php(5095): Hooks::run(string, array)
#7 /srv/mediawiki/php-master/includes/user/User.php(3671): User->saveOptions()
#8 /srv/mediawiki/php-master/includes/specials/SpecialPreferences.php(141): User->saveSettings()
#9 /srv/mediawiki/php-master/includes/htmlform/HTMLForm.php(607): SpecialPreferences->submitReset(array)
#10 /srv/mediawiki/php-master/includes/htmlform/HTMLForm.php(504): HTMLForm->trySubmit()
#11 /srv/mediawiki/php-master/includes/htmlform/HTMLForm.php(519): HTMLForm->tryAuthorizedSubmit()
#12 /srv/mediawiki/php-master/includes/specials/SpecialPreferences.php(131): HTMLForm->show()
#13 /srv/mediawiki/php-master/includes/specials/SpecialPreferences.php(48): SpecialPreferences->showResetForm()
#14 /srv/mediawiki/php-master/includes/specialpage/SpecialPage.php(384): SpecialPreferences->execute(string)
#15 /srv/mediawiki/php-master/includes/specialpage/SpecialPageFactory.php(564): SpecialPage->run(string)
#16 /srv/mediawiki/php-master/includes/MediaWiki.php(280): SpecialPageFactory::executePath(Title, RequestContext)
#17 /srv/mediawiki/php-master/includes/MediaWiki.php(745): MediaWiki->performRequest()
#18 /srv/mediawiki/php-master/includes/MediaWiki.php(517): MediaWiki->main()
#19 /srv/mediawiki/php-master/index.php(43): MediaWiki->run()
#20 /srv/mediawiki/w/index.php(3): include(string)
#21 {main}
Quiddity created this task.Jan 25 2016, 5:58 PM
Quiddity added a subscriber: Quiddity.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptJan 25 2016, 5:58 PM
twentyafterfour@deployment-redis01:~$ sudo /etc/init.d/redis-server start
Starting redis-server: 
*** FATAL CONFIG FILE ERROR ***
Reading the configuration file, at line 549
>>> 'latency-monitor-threshold 100'
Bad directive or wrong number of arguments
failed
twentyafterfour@deployment-redis01:~$ redis-server --version
Redis server v=2.8.4 sha=00000000:0 malloc=jemalloc-3.5.1 bits=64 build=ef3fa00fad205658

Latency monitoring requires version 2.8.13

mmodell assigned this task to ori.Jan 25 2016, 7:02 PM
mmodell added a subscriber: ori.

I commented the line from the config file and started redis, I'm going to leave it to ori to decide what to do about a permanent solution.

@ori: do you know why deployment-redis01 would have an older version of redis-server? I tried to update the package but there wasn't a newer one available.

The puppet code that caused this to break was inside of a conditional that tests for debian jessie. Is that incorrect?

greg changed the title from "Error on betacluster resetting my Preferences - JobQueueConnectionError from JobQueueRedis.php" to "Redis latency monitor config change broke Redis in Beta Cluster".
greg set Security to None.
greg triaged this task as "High" priority.
greg added subscribers: greg, Luke081515, MGChecker and 2 others.

Pri: Higher than normal, so High :)

@greg No, it's the same error as before. Furthermore, Special:RecentChanges doesn't get updated.

To help to prioritize, we have received on the GLAM mailing list a report the error also occurs on Commons Beta using the GLAM tools:

[f79958ce] /wiki/Special:GWToolset JobQueueConnectionError from line 753 of /srv/mediawiki/php-master/includes/jobqueue/JobQueueRedis.php: Unable to connect to redis server deployment-redis01.eqiad.wmflabs.

twentyafterfour@deployment-redis01:~$ redis-server --version
Redis server v=2.8.4 sha=00000000:0 malloc=jemalloc-3.5.1 bits=64 build=ef3fa00fad205658

Latency monitoring requires version 2.8.13

deployment-redis01 is a Trusty instance. And have redis 2.8.4-2+wmf1.

I guess all the production servers have been migrated to Jessie which comes with 3.0.6-2~bpo8+1.

So the proper fix would be to migrate all the beta cluster redis server from Trusty to Jessie.

The status of the latency-monitor-threshold setting:

InstanceDistribution Status
deployment-redis01TrustCommented out previously, puppet does not uncomment it
deployment-sentry01JessieEnabled (expected)
deployment-sentry2TrustyManually commented, puppet happy
Joe added a subscriber: Joe.Jan 27 2016, 9:05 AM

So, the issue was:

  1. Probably a redis puppet change was first merged on the beta puppetmaster without a conditional guard on the latency threshold config, which is now guarded to be applied only on jessie. this applied that setting to a file (redis.conf) we don't manage otherwise.
  2. When the os guard was introduced, see https://github.com/wikimedia/operations-puppet/blob/production/modules/redis/manifests/init.pp#L36-L42, this was not applied again, but in fact the file_line was undeclared in puppet and so was unmanaged
  3. Thus, puppet failed to remove it, making it impossible to start redis
  4. Commenting that line out was the thing to do, once @mmodell did that, the redis instances all correctly recovered.

This is a typical case of a transient error in a puppet config that needed to be manually corrected.

hashar closed this task as "Resolved".Jan 27 2016, 9:25 AM

So it is all fixed now. Thank you @Joe for the helpful summary !

On deployment-redis01 I have manually ran /etc/init.d/redis-server start it started listening on 6379

Then when running puppet:

Notice: /Stage[main]/Redis/Service[redis-server]/ensure: ensure changed 'running' to 'stopped'

Seems we do not want to use the redis-server service.


We have different services defined:

redis-instance-tcp_6379 start/running
redis-instance-tcp_6378 start/running
redis-instance-tcp_6380 start/running
redis-instance-tcp_6381 start/running

But there are no matching process... So I mass restarted them:

# initctl list|grep redis-instance|cut -d\  -f1|xargs -n1 initctl restart
redis-instance-tcp_6379 start/running, process 7157
redis-instance-tcp_6378 start/running, process 7163
redis-instance-tcp_6380 start/running, process 7169
redis-instance-tcp_6381 start/running, process 7175

And they are around now:

# netstat -tlnp|grep redis-server
tcp        0      0 0.0.0.0:6378            0.0.0.0:*               LISTEN      7163/redis-server 0
tcp        0      0 0.0.0.0:6379            0.0.0.0:*               LISTEN      7157/redis-server 0
tcp        0      0 0.0.0.0:6380            0.0.0.0:*               LISTEN      7169/redis-server 0
tcp        0      0 0.0.0.0:6381            0.0.0.0:*               LISTEN      7175/redis-server 0
# ps -A f|grep redis-ser
 7202 pts/3    S+     0:00  |                       \_ grep --color=auto redis-ser
 7157 ?        Ssl    0:10 /usr/bin/redis-server 0.0.0.0:6379            
 7163 ?        Ssl    0:00 /usr/bin/redis-server 0.0.0.0:6378            
 7169 ?        Ssl    0:00 /usr/bin/redis-server 0.0.0.0:6380            
 7175 ?        Ssl    0:00 /usr/bin/redis-server 0.0.0.0:6381

I then killed them all and upstart managed to restart them automatically.

So I am assuming it is now properly fixed.