clouddb* hosts with ipv6 access timeout from cumin
Closed, Resolved · Public

Description

It looks like something has changed recently in the IPv6 name resolution for the clouddb* hosts.
I cannot connect to them from our cumin1001 host, as it appears to resolve (and try) the IPv6 address first:

root@cumin1001:/home/marostegui# telnet clouddb1013.eqiad.wmnet 3311
Trying 2620:0:861:101:10:64:0:28...
^C
root@cumin1001:/home/marostegui# telnet 10.64.0.118 3311
Trying 10.64.0.118...
Connected to 10.64.0.118.
Escape character is '^]'.
Y
5.5.5-10.4.22-MariaDB��s(hVU*no�`uJP4Q|>_Dt?mysql_native_passwordConnection closed by foreign host.

Can we get this looked at?

Event Timeline

Just to make it clear: after a timeout I assume it falls back to the IPv4 address and ends up working, but that takes minutes. So this needs to be looked at anyways as it is also affecting maintain-dbusers script (per @taavi's comment on IRC)
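To see which address family is the slow one without waiting for telnet to time out, each resolved address can be probed separately with a short timeout. This is only a minimal sketch: `probe_addresses` and its `timeout` default are made up for illustration and are not part of any existing tooling mentioned in this task.

```python
import socket


def probe_addresses(host, port, timeout=3.0):
    """Try a TCP connect to every address the resolver returns for `host`.

    Returns a list of (family_name, address, reachable) tuples, in the same
    order the resolver returned them (that is, the order a client would try).
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        family_name = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)  # fail fast instead of waiting minutes
                sock.connect(sockaddr)
            reachable = True
        except OSError:  # covers timeouts, refusals and unreachable networks
            reachable = False
        results.append((family_name, sockaddr[0], reachable))
    return results


# Example call (needs access to the production network):
#   probe_addresses("clouddb1013.eqiad.wmnet", 3311)
```

If the IPv6 entry comes first and is reported as not reachable while the IPv4 one connects, that would match the behaviour described above.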

So this needs to be looked at anyways as it is also affecting maintain-dbusers script (per @taavi's comment on IRC)

maintain-dbusers is broken on labstore1004 by IPv6 being selected to talk to databases:

Nov 23 20:04:19 labstore1004 maintain-dbusers[19780]: pymysql.err.OperationalError: (1045, "Access denied for user 'labsdbadmin'@'2620:0:861:119:10:64:37:19' (using password: YES)")
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Main process exited, code=exited, status=1/FAILURE
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Unit entered failed state.
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Failed with result 'exit-code'.
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Service hold-off time over, scheduling restart.
Nov 23 20:04:19 labstore1004 systemd[1]: Stopped Maintain labsdb accounts.
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Start request repeated too quickly.
Nov 23 20:04:19 labstore1004 systemd[1]: Failed to start Maintain labsdb accounts.
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Unit entered failed state.
Nov 23 20:04:19 labstore1004 systemd[1]: maintain-dbusers.service: Failed with result 'exit-code'.
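The `1045` error above names the client address MariaDB saw, so the address family can be read straight from the message. A small sketch of that check follows; the helper name `denied_client` is hypothetical and the regular expression only assumes the message layout shown in the log line above.

```python
import ipaddress
import re

# Matches the "user'@'host" part of a MariaDB "Access denied" message.
_DENIED = re.compile(r"Access denied for user '(?P<user>[^']*)'@'(?P<host>[^']*)'")


def denied_client(message):
    """Return (user, host, ip_version) from an access-denied message, or None.

    ip_version is 4 or 6 when the host is an IP literal, otherwise None
    (for example when the server reports a hostname instead).
    """
    match = _DENIED.search(message)
    if match is None:
        return None
    host = match.group("host")
    try:
        version = ipaddress.ip_address(host).version
    except ValueError:
        version = None
    return match.group("user"), host, version
```

A version of 6 for `labsdbadmin` would suggest that the connection went out over IPv6 and no grant matched that client address.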
Marostegui raised the priority of this task from High to Unbreak Now!. Nov 24 2022, 6:40 AM

I think enough things are already broken that this deserves a UBN

I see some immediate actions:

I'm going to try the first one for now to make things work again, but probably the second one is the preferred long term solution

Thanks for the ideas:

a) Sounds good
b) That won't solve it, as the problem is not with the IPs but with the grants, which is not something we can easily solve (T270101, and more concretely T270101#7631460)
c) Not sure if that's easy either

I would go for a and then evaluate again

Thanks!

Mentioned in SAL (#wikimedia-cloud) [2022-11-24T09:53:09Z] <dcaro> removed ip6 dns entry from nb for coluddb1013 (T323550)

Yep, we noticed the same issue with b (xD) thanks to a comment in the hiera setting, so going for a.

@Marostegui we should change all the clouddb* right?

Yes, all of them are having the same issue.

Mentioned in SAL (#wikimedia-cloud) [2022-11-24T10:16:19Z] <dcaro> removed ip6 dns name entry from nb for coluddb* (T323550)

Removed all the AAAA entries for the clouddb* servers; maintain-dbusers now works as expected

dcaro lowered the priority of this task from Unbreak Now! to High. Nov 24 2022, 10:26 AM

With a workaround in place we can move to high again to continue the investigation.

dcaro changed the task status from Open to In Progress. Nov 24 2022, 10:27 AM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Summary of the actions taken:

  • Removed the DNS name from Netbox for all the IPv6 addresses attached to clouddb* hosts
  • Ran the DNS repo update cookbook:
      dcaro@cumin1001:~$ sudo cookbook sre.dns.netbox -t T323550 "Removed AAAA entry for all clouddbs"
  • Cleaned the caches on the DNS servers:
      dcaro@cumin1001:~$ for i in 13 14 15 16 17 18 19 20 21; do sudo cumin 'A:dns-rec' 'rec_control wipe-cache clouddb10'$i'.eqiad.wmnet.'; done
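To double check that no AAAA record is still being served after the steps above, something like the following could be run from an affected host. It is a rough sketch, not what was actually used here: `hosts_with_aaaa` and its injectable `resolve` parameter are hypothetical, and the host range is simply the one from the cache-wipe loop. Note that `socket.getaddrinfo` goes through the local resolver (including `/etc/hosts`), and that a misspelled hostname would be silently reported as clean.

```python
import socket

# Host range taken from the cache-wipe loop above (clouddb1013 .. clouddb1021).
CLOUDDB_HOSTS = [f"clouddb10{i}.eqiad.wmnet" for i in range(13, 22)]


def hosts_with_aaaa(hosts, resolve=socket.getaddrinfo):
    """Return {host: [IPv6 addresses]} for hosts that still resolve over IPv6.

    `resolve` defaults to socket.getaddrinfo and can be swapped out so the
    check can be exercised without real DNS.
    """
    remaining = {}
    for host in hosts:
        try:
            infos = resolve(host, None, family=socket.AF_INET6, type=socket.SOCK_STREAM)
        except socket.gaierror:
            continue  # no IPv6 address found for this name
        addresses = sorted({info[4][0] for info in infos})
        if addresses:
            remaining[host] = addresses
    return remaining
```

An empty result from `hosts_with_aaaa(CLOUDDB_HOSTS)` would be consistent with the AAAA entries being gone and the caches wiped.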

The task where the records were changed is T312557

So from my side it is all fine.
We should also double check with @taavi and @bd808 as they also reported issues.