
confd broken on deployment-redis hosts
Closed, Invalid (Public)

Description

This seems to be breaking sessions on beta wikis, so you can't log in.

From the confd log:

Jun  6 23:30:17 deployment-redis05 confd[2337]: 2018-06-06T23:30:17Z deployment-redis05 /usr/bin/confd[2337]: INFO SRV domain set to -scheme
Jun  6 23:30:17 deployment-redis05 confd[2337]: 2018-06-06T23:30:17Z deployment-redis05 /usr/bin/confd[2337]: FATAL Cannot get nodes from SRV records lookup _etcd._tcp.-scheme: invalid domain name

and

root@deployment-redis05:/etc/confd# service confd status
● confd.service - confd
   Loaded: loaded (/lib/systemd/system/confd.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2018-06-06 23:44:48 UTC; 8s ago
  Process: 3872 ExecStart=/usr/bin/confd -backend $CONFD_BACKEND $CONFD_DISCOVERY $CONFD_OPTS (code=exited, status=1/FAILURE)
 Main PID: 3872 (code=exited, status=1/FAILURE)

Jun 06 23:44:48 deployment-redis05 systemd[1]: confd.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 23:44:48 deployment-redis05 systemd[1]: confd.service: Unit entered failed state.
Jun 06 23:44:48 deployment-redis05 systemd[1]: confd.service: Failed with result 'exit-code'.

Event Timeline

Reedy triaged this task as High priority. Jun 6 2018, 11:54 PM
Reedy updated the task description.

Brandon reckons it's something to do with confd::srv_dns not being set correctly on beta

I'm guessing the config got broken a while ago, but because the change didn't trigger a reload of confd, it was only loaded when the hosts were rebooted as part of the cloud host reboots.
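
For illustration, here is a minimal sketch of how an empty SRV domain could produce the "SRV domain set to -scheme" line in the log above. The task does not show the actual contents of $CONFD_DISCOVERY, so the variable name srv_domain and the exact argument order are assumptions; only the -srv-domain and -scheme flags are taken from confd itself. The loop is a stand-in for a flag parser that takes the next token as the flag's value, not confd's real parser.

```shell
# Hypothetical: the value that confd::srv_dns would normally supply is empty.
srv_domain=""

# Unquoted expansion of an empty variable contributes no argument at all,
# so -srv-domain ends up directly followed by -scheme.
set -- -srv-domain $srv_domain -scheme https

# Stand-in for a flag parser that takes the next token as the flag's value.
parsed_domain=""
while [ $# -gt 0 ]; do
  case "$1" in
    -srv-domain) parsed_domain="$2"; shift 2 ;;
    *) shift ;;
  esac
done

echo "SRV domain set to $parsed_domain"
echo "SRV lookup name: _etcd._tcp.$parsed_domain"
```

If that is what happened here, setting the domain to a non-empty value (or dropping the -srv-domain flag when it is empty) would avoid the invalid lookup name.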

This is not really an issue; redis is working correctly on these servers:

deployment-redis05:~$ systemctl status redis-instance-tcp_6379.service 
● redis-instance-tcp_6379.service - Advanced key-value store
   Loaded: loaded (/lib/systemd/system/redis-instance-tcp_6379.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-06-06 23:10:18 UTC; 14h ago
 Main PID: 438 (redis-server)
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/redis-instance-tcp_6379.service
           └─438 /usr/bin/redis-server 0.0.0.0:6379

and the fact that confd is broken doesn't affect the functionality of the redis server itself.

"Brandon reckons it's something to do with confd::srv_dns not being set correctly on beta"

That's weird, I thought that used to be set.

More session issues: T172560
See also T173646

I found that the nutcracker sockets on some of the mediawiki hosts were refusing connections when I tried to run redis-cli -a $password_here -s /var/run/nutcracker/redis_eqiad.sock get enwiki:captcha:1331838370. Restarting nutcracker fixed that on deployment-mediawiki06.deployment-prep.eqiad.wmflabs, deployment-mira.deployment-prep.eqiad.wmflabs and deployment-tin.deployment-prep.eqiad.wmflabs.
deployment-jobrunner03.deployment-prep.eqiad.wmflabs, deployment-mediawiki-[07,09].deployment-prep.eqiad.wmflabs and deployment-snapshot01.deployment-prep.eqiad.wmflabs are saying "Error: Server closed the connection", and the nutcracker logs show nc_redis.c:1092 parsed unsupported command 'COMMAND'. That is strange, because the nutcracker versions are the same across the working and non-working hosts. I do notice that the working hosts are on jessie and the ones struggling with this are on stretch.
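
Since nutcracker is rejecting a COMMAND call rather than the GET itself, one way to rule out the client would be to send only the GET, without redis-cli. The sketch below builds the Redis wire-format (RESP) encoding of a command by hand. The helper name resp_encode is made up for this example, and it assumes ASCII arguments so that the character count equals the byte count.

```shell
# resp_encode: hypothetical helper that prints its arguments as one RESP
# command: "*<argc>" followed by "$<len>" and the value for each argument,
# every line terminated by CRLF.
resp_encode() {
  printf '*%d\r\n' "$#"
  for arg in "$@"; do
    printf '$%d\r\n%s\r\n' "${#arg}" "$arg"
  done
}

# Show the bytes that a bare GET for the key from the comment above would use.
resp_encode GET enwiki:captcha:1331838370 | od -c
```

Assuming a netcat build with unix socket support, the output could be piped to nc -U /var/run/nutcracker/redis_eqiad.sock instead of od. An AUTH command would probably have to be sent first, since the redis-cli call above uses -a.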

It looks like uninstalling the default redis-tools version I had put on those hosts to test this (3:3.2.6-1, from http://deb.debian.org/debian) and installing 5:4.0.9-2~bpo9+1 (from http://mirrors.wikimedia.org/debian) may have fixed the problem with redis-cli not being able to talk to nutcracker.
I just checked and I can successfully sign into beta.