Can't connect to analytics replicas from Toolforge
Closed, Resolved (Public)

Description

See https://replag.toolforge.org: all of the analytics replicas are down, while the web replicas are doing fine. Connecting via Toolforge gives the following error:

tools.vahurzpubot@tools-sgebastion-10:~$ mysql --defaults-file=replica.my.cnf --host enwiki.analytics.db.svc.wikimedia.cloud
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

Event Timeline

FWIW, the databases are not down; WMCS hosts just can't connect to them. From the DBA point of view, everything is normal.

Vahurzpu renamed this task from Analytics wiki replicas are down to Can't connect to analytics replicas from. Jun 5 2023, 6:02 PM
Vahurzpu renamed this task from Can't connect to analytics replicas from to Can't connect to analytics replicas from Toolforge.

Sorry about this. I have attempted to reverse the operation that I did to depool the wikireplicas. It was supposed to fail back to the web wikireplicas without any downtime.

Now I have put back the two dbproxy101[8-9] servers as they were.

btullis@puppetmaster1001:~$ sudo -i confctl select "service=wikireplicas-a" get
{"dbproxy1018.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
{"dbproxy1019.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
btullis@puppetmaster1001:~$ sudo -i confctl select "service=wikireplicas-b" get
{"dbproxy1019.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1018.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
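
To read the pooled state at a glance, the confctl output above can be summarised with a short script. This is only a sketch: it assumes the output is one JSON object per line, exactly as pasted above, and the function name `pooled_hosts` and the `SAMPLE` constant are made up for illustration.

```python
import json

# Sample copied from the confctl output in this comment.
SAMPLE = """
{"dbproxy1018.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
{"dbproxy1019.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
{"dbproxy1019.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1018.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
"""


def pooled_hosts(confctl_output: str) -> dict[str, list[str]]:
    """Map each service name to the hosts reported as pooled (hypothetical helper)."""
    result: dict[str, list[str]] = {}
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts and blank lines
        record = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(pair.split("=", 1) for pair in record.pop("tags").split(","))
        service = tags["service"]
        result.setdefault(service, [])
        # Every remaining key in the record is a host name.
        for host, state in record.items():
            if state.get("pooled") == "yes":
                result[service].append(host)
    return result


if __name__ == "__main__":
    print(pooled_hosts(SAMPLE))
```

For the sample above this gives dbproxy1018 as the only pooled host for wikireplicas-a and dbproxy1019 as the only pooled host for wikireplicas-b, which matches the intended configuration.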

But I am still getting connectivity problems from Toolforge.

btullis@tools-sgebastion-10:~$ sql enwiki_p
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

Mentioned in SAL (#wikimedia-analytics) [2023-06-05T18:20:02Z] <btullis> restarted haproxy service on dbproxy1018 for T338172

Restarting haproxy on dbproxy1018 didn't help either.
Perhaps it is something to do with this layer: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#VM_proxies

I'm not currently in the clouddb-services project, so I can't ssh to clouddb-wikireplicas-proxy-1.clouddb-services.eqiad1.wikimedia.cloud to have a look.

I'd be grateful for any assistance to work out what has happened.

I added you to that project.

It seems like the issue is with 208.80.154.242 (wikireplicas-a.wikimedia.org):

taavi@clouddb-wikireplicas-proxy-2:~$ nc wikireplicas-a.wikimedia.org 3317
:# empty
taavi@clouddb-wikireplicas-proxy-2:~$ nc wikireplicas-b.wikimedia.org 3317
Y
5.5.5-10.4.22-MariaDB)a;BvM^o5?YB{gq$[!0kN#mysql_native_password
^C
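
The same check can be scripted. The readable part of the nc output from wikireplicas-b is the server's initial handshake packet: a 3-byte little-endian payload length and a 1-byte sequence number, followed by a protocol version byte and a NUL-terminated server version string. Below is a rough sketch, not a full client; the names `parse_greeting` and `probe` are made up here, and it assumes the whole greeting arrives in a single `recv` call.

```python
import socket


def parse_greeting(packet: bytes) -> tuple[int, str]:
    """Return (protocol_version, server_version) from an initial handshake packet."""
    if len(packet) < 5:
        raise ValueError("no greeting received (empty or truncated packet)")
    # Header: 3-byte little-endian payload length, then a 1-byte sequence id.
    payload_len = int.from_bytes(packet[:3], "little")
    payload = packet[4:4 + payload_len]
    protocol = payload[0]
    # The server version is a NUL-terminated string right after the protocol byte.
    end = payload.index(b"\x00", 1)
    return protocol, payload[1:end].decode("ascii", "replace")


def probe(host: str, port: int, timeout: float = 5.0) -> tuple[int, str]:
    """Connect and read the greeting; raises on timeout or an empty reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Simplification: assume the greeting fits in one recv() call.
        return parse_greeting(sock.recv(4096))


if __name__ == "__main__":
    for name in ("wikireplicas-a.wikimedia.org", "wikireplicas-b.wikimedia.org"):
        try:
            print(name, probe(name, 3317))
        except (OSError, ValueError) as exc:
            print(name, "FAILED:", exc)
```

An endpoint that accepts the connection but never sends a greeting, as wikireplicas-a does above, shows up as a timeout or as the "no greeting received" error.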

This seems like an LVS issue. I see this in Icinga on lvs1018:

CRITICAL: Services known to PyBal but not to IPVS: set(['208.80.154.242:3316', '208.80.154.242:3317', '208.80.154.242:3314', '208.80.154.242:3315', '208.80.154.242:3312', '208.80.154.242:3313', '208.80.154.242:3311', '208.80.154.242:3318'])
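
For reference, the affected service address and ports can be pulled out of that alert text with a few lines of Python. This is a sketch that assumes the alert format is exactly as quoted above; `missing_services` is a name invented for this example.

```python
import re

# Alert text copied from the Icinga check on lvs1018.
ALERT = (
    "CRITICAL: Services known to PyBal but not to IPVS: "
    "set(['208.80.154.242:3316', '208.80.154.242:3317', '208.80.154.242:3314', "
    "'208.80.154.242:3315', '208.80.154.242:3312', '208.80.154.242:3313', "
    "'208.80.154.242:3311', '208.80.154.242:3318'])"
)


def missing_services(alert: str) -> dict[str, list[int]]:
    """Group the quoted 'ip:port' entries in the alert by IP, with sorted ports."""
    grouped: dict[str, list[int]] = {}
    for ip, port in re.findall(r"'(\d{1,3}(?:\.\d{1,3}){3}):(\d+)'", alert):
        grouped.setdefault(ip, []).append(int(port))
    return {ip: sorted(ports) for ip, ports in grouped.items()}


if __name__ == "__main__":
    print(missing_services(ALERT))
```

For this alert, every entry belongs to 208.80.154.242 (wikireplicas-a), covering ports 3311 through 3318.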

If it helps, my bot was abruptly disconnected at 11:55:39.748 UTC; I believe this is when the replicas went down.

bd808 triaged this task as Unbreak Now! priority.

Things are working for me at the moment when querying from tools-sgebastion-11.tools.eqiad1.wikimedia.cloud (dev.toolforge.org). It looks like https://replag.toolforge.org is happy again too. It would be querying from a Kubernetes exec node (currently tools-k8s-worker-51).

Thanks @bd808 - yes, I confirm that it's also working for me.

I can now verify that LVS is happy, using the same technique as @taavi from the clouddb-wikireplicas-proxy servers.

btullis@clouddb-wikireplicas-proxy-1:~$ nc -vz wikireplicas-a.wikimedia.org 3317
Connection to wikireplicas-a.wikimedia.org 3317 port [tcp/*] succeeded!

Let's call this {{Done}}