galera/haproxy service flaps in eqiad1
Closed, Resolved, Public

Description

The haproxy logs on cloudcontrol100x nodes are somewhat full of messages like this:

Jun 14 19:23:38 cloudcontrol1004 haproxy[370057]: [WARNING] 164/192338 (370057) : Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 56ms. 0 active and 2 backup servers left. Running on backup. 851 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:23:43 cloudcontrol1004 haproxy[370057]: Backup Server mysql/cloudcontrol1003.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 41ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:23:43 cloudcontrol1004 haproxy[370057]: [WARNING] 164/192343 (370057) : Backup Server mysql/cloudcontrol1003.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 41ms. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 14 19:23:43 cloudcontrol1004 haproxy[370057]: Server mysql/cloudcontrol1005.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 52ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 14 19:23:43 cloudcontrol1004 haproxy[370057]: [WARNING] 164/192343 (370057) : Server mysql/cloudcontrol1005.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 52ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 14 19:23:45 cloudcontrol1004 haproxy[370057]: Backup Server mysql/cloudcontrol1003.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 38ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.

This is concerning for a few reasons. It might reflect actual galera/mysql outages, or it might reflect haproxy being overly sensitive and causing service interruptions where there should be none.

I don't see any similar behavior in codfw1dev.

Event Timeline

haproxy doesn't monitor mysql access directly; rather, it uses a purpose-built REST server which checks the state of Galera synchronization and responds with a 200 if all is well. The haproxy line that defines this check looks like this:

server cloudcontrol1004.wikimedia.org 208.80.154.132:23306 check port 9990 inter 2s rise 2 fall 4 backup

That port 9990 is the health check service. There's one running on each cloudcontrol, and the service itself is puppet/modules/profile/files/openstack/base/galera/nodecheck.sh
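
For reference, the tuning knobs on that server line are standard haproxy health-check options:

check port 9990  -> run the health check against port 9990 instead of the mysql port itself
inter 2s         -> send a check every 2 seconds
rise 2           -> 2 consecutive passing checks mark the server UP
fall 4           -> 4 consecutive failing checks mark the server DOWN
backup           -> only used when no non-backup server is available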

I ran a few tests and it looks like if the nodecheck run exits unexpectedly (that is, without returning a specific HTTP status code), that gets logged in haproxy as 'Connection reset by peer'. So the next question is... what is happening to cause that check script to exit unexpectedly?

In theory it logs to /var/log/nodecheck/err.log but that logfile is uniformly empty. I tried to fix that issue with https://gerrit.wikimedia.org/r/c/operations/puppet/+/805475 but that doesn't seem to have made a difference.

I still don't have a theory of why the tests are failing, but this patch makes HAProxy care a bit less about the test failures:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/805478

That seems to have reduced the number of flaps considerably, but not eliminated them. And, in any case, there's still a deeper mystery left to be solved.

Expanding on this: the port-9990 health check is actually handled by a systemd socket unit:

root@cloudcontrol1005:~# systemctl status nodecheck.socket
● nodecheck.socket - Socket to check galera node state
     Loaded: loaded (/lib/systemd/system/nodecheck.socket; enabled; vendor preset: enabled)
     Active: active (listening) since Tue 2022-05-10 19:42:29 UTC; 1 months 5 days ago
     Listen: [::]:9990 (Stream)
   Accepted: 8652915; Connected: 0;
      Tasks: 0 (limit: 154197)
     Memory: 8.0K
     CGroup: /system.slice/nodecheck.socket

Warning: journal has been rotated since unit was started, output may be incomplete.

And it logs some entries to the journal:

root@cloudcontrol1005:~# journalctl -n 500 | grep nodecheck
...
Jun 15 06:39:32 cloudcontrol1005 systemd[1]: nodecheck@8652974-208.80.154.85:9990-208.80.154.85:42992.service: Succeeded.
Jun 15 06:39:33 cloudcontrol1005 systemd[1]: nodecheck@8652975-208.80.154.85:9990-208.80.154.132:35656.service: Succeeded.
Jun 15 06:39:34 cloudcontrol1005 systemd[1]: nodecheck@8652976-208.80.154.85:9990-208.80.154.85:43098.service: Succeeded.
Jun 15 06:39:34 cloudcontrol1005 systemd[1]: nodecheck@8652977-208.80.154.85:9990-208.80.154.23:36122.service: Succeeded.
Jun 15 06:39:35 cloudcontrol1005 systemd[1]: nodecheck@8652978-208.80.154.85:9990-208.80.154.132:35836.service: Succeeded.
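
Those nodecheck@<counter>-<local addr>:<port>-<peer addr>:<port>.service units are what systemd socket activation with Accept=yes looks like: systemd accepts each incoming TCP connection itself and starts one short-lived service instance per connection, with the connection attached to the process's stdin/stdout, so the check script only has to print an HTTP response. A minimal sketch of that shape (not the actual nodecheck.sh):

#!/bin/bash
# Minimal socket-activated responder sketch: systemd (Accept=yes) attaches the
# accepted TCP connection to stdin/stdout, so replying to haproxy is just printing.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nConnection: close\r\nContent-Length: 23\r\n\r\nGalera node is ready.\r\n'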

Running this manually to try to catch one of the failures in the act:

root@cloudcontrol1005:~# while true; do datestamp=$(date +%H%M%S_%d%m%y); outfile="galeralogs/$datestamp"; bash -x nodecheck.sh 2>$outfile.err | tee $outfile | grep -q '200 OK' && { rm $outfile; rm $outfile.err; } || echo "Got one at $datestamp"; sleep 1; date; done

Crossing fingers xd

I have also started a process that curls the systemd socket at the same time, to see at which step in the chain (haproxy -> systemd socket -> script) the issue appears:

counter=0; mkdir -p galeralogs_curl; while true; do datestamp=$(date +%H%M%S_%d%m%y); outfile="galeralogs_curl/$datestamp"; curl -v http://127.0.0.1:9990 2>&1 | tee $outfile | grep -q '200 OK' && { rm $outfile; } || { echo "Got one at $datestamp"; counter=$((counter+1)); }; sleep 1; echo "$(date) - got $counter"; done
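
One caveat with the curl loop (an assumption based on the bare "option httpchk" in the haproxy config): curl sends a GET, while haproxy's httpchk with no arguments probes with "OPTIONS / HTTP/1.0", so a method-dependent difference in the check would not show up in the curl results. A closer imitation of haproxy's probe would be something like:

# hypothetical manual probe mimicking haproxy's default "option httpchk" request
printf 'OPTIONS / HTTP/1.0\r\n\r\n' | nc -w 2 127.0.0.1 9990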

It does seem though as if cloudcontrol1005 loses connectivity to the other two, for example at 03:05:

### 1005
Jun 15 03:05:31 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/030531 (3110491) : Backup Server mysql/cloudcontrol1004.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 48ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 03:05:33 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/030533 (3110491) : Backup Server mysql/cloudcontrol1003.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 43ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 03:05:44 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/030544 (3110491) : Backup Server mysql/cloudcontrol1004.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 44ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
Jun 15 03:05:47 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/030547 (3110491) : Backup Server mysql/cloudcontrol1003.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 52ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.

### 1003
Jun 15 03:05:24 cloudcontrol1004 haproxy[2102169]: [WARNING] 165/030524 (2102169) : Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 63ms. 0 active and 2 backup servers left. Running on backup. 856 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 03:05:44 cloudcontrol1004 haproxy[2102169]: [WARNING] 165/030544 (2102169) : Server mysql/cloudcontrol1005.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 58ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.


### 1004
Jun 15 03:05:24 cloudcontrol1004 haproxy[2102169]: [WARNING] 165/030524 (2102169) : Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 63ms. 0 active and 2 backup servers left. Running on backup. 856 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 03:05:44 cloudcontrol1004 haproxy[2102169]: [WARNING] 165/030544 (2102169) : Server mysql/cloudcontrol1005.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 58ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.

So I suspect that the issue might not be the galera instances or the check script, but the network itself: connectivity between the nodes is briefly severed even though each node still considers itself up, and 1003<->1004 look fine between themselves, so the nodes end up with different views of the world.

Hmm...
Trying to reproduce locally with haproxy, I'm unable to get the same error message (socket error); when the script returns nothing, the check just times out instead:

shell1: docker run --rm --publish 3306 --volume $PWD/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg --name my-running-haproxy  --sysctl net.ipv4.ip_unprivileged_port_start=0 haproxy:2.2.9

...

# this does 10 successful checks, and then starts replying empty, so the server goes up and down
shell2: for i in $(seq 10); do echo -en "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 23\r\n\r\nGalera node is ready.\r\n\r\n" | nc -l 9991 ; done; while true; do echo -en "" | nc -l 9991 ; done;

...

logs:
[WARNING] 165/094916 (8) : Backup Server mysql/test2 is UP, reason: Layer7 check passed, code: 200, check duration: 0ms. 0 active and 1 backup servers online. Running on backup. 0 sessions requeued, 0 total in queue.
[WARNING] 165/094932 (8) : Backup Server mysql/test2 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 165/094932 (8) : proxy 'mysql' has no server available!
# config chunk copied from cloudcontrol1005
> cat haproxy.cfg
global
    #chroot /var/lib/haproxy
    daemon
    group haproxy
    log /dev/log local0 notice
    #stats socket /run/haproxy/haproxy.sock mode 666 expose-fd listeners level user
    user haproxy


defaults
    log global
    maxconn 4096
    mode http
    option dontlognull
    option forceclose
    option forwardfor
    option httplog
    option redispatch
    retries 3
    timeout client 50s
    timeout connect 10s
    timeout http-request 5s
    timeout server 120s

listen mysql
    bind *:3306
    mode tcp
    option tcpka
    option httpchk
    server test1 192.168.1.100:12345 check port 9990 inter 1s rise 2 fall 4 backup
    server test2 192.168.1.100:12346 check port 9991 inter 1s rise 2 fall 4 backup
    timeout  client 90m
    timeout  server 90m

IIRC, when I hacked it to exit(1) every time without echoing anything, I got 'connection reset' on the haproxy side.

So a flap happened and I was monitoring a bunch of stuff.
The first flap from 1003:

Jun 15 11:25:01 cloudcontrol1003 haproxy[3435489]: [WARNING] 165/112501 (3435489) : Server mysql/cloudcontrol1005.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 55ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 11:25:05 cloudcontrol1003 haproxy[3435489]: [WARNING] 165/112505 (3435489) : Server mysql/cloudcontrol1005.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 47ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.

The second flap on 1005:

Jun 15 12:26:32 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/122632 (3110491) : Backup Server mysql/cloudcontrol1004.wikimedia.org is DOWN, reason: Socket error, info: "Connection reset by peer", check duration: 42ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 15 12:26:37 cloudcontrol1005 haproxy[3110491]: [WARNING] 165/122637 (3110491) : Backup Server mysql/cloudcontrol1004.wikimedia.org is UP, reason: Layer7 check passed, code: 200, check duration: 44ms. 1 active and 2 backup servers online. 0 sessions requeued, 0 total in queue.
  • The execution of the nodecheck.sh script on each node did not fail once
  • The curl to the nodecheck systemd socket on localhost did not fail once
  • There were 2 pings lost when pinging from 1004 -> 1005
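
For the next flap it might help to run a timestamped ping in parallel, so the lost packets can be lined up against the haproxy log timestamps (a suggestion; not necessarily how the ping above was run):

# -D prefixes each line with a unix timestamp, -O reports a missed reply as soon as the next probe is sent
ping -D -O cloudcontrol1005.wikimedia.org | tee /tmp/ping_1005.log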

For the second flap, on 1004 I can see all the checks arriving and succeeding (not one missed, 208.80.154.85 is cloudcontrol1005):

Jun 15 12:22:01 cloudcontrol1004 systemd[1]: nodecheck@8623076-208.80.154.132:9990-208.80.154.85:59606.service: Succeeded.
Jun 15 12:22:03 cloudcontrol1004 systemd[1]: nodecheck@8623081-208.80.154.132:9990-208.80.154.85:59680.service: Succeeded.
Jun 15 12:22:05 cloudcontrol1004 systemd[1]: nodecheck@8623087-208.80.154.132:9990-208.80.154.85:59752.service: Succeeded.
Jun 15 12:22:07 cloudcontrol1004 systemd[1]: nodecheck@8623092-208.80.154.132:9990-208.80.154.85:59800.service: Succeeded.
Jun 15 12:22:09 cloudcontrol1004 systemd[1]: nodecheck@8623096-208.80.154.132:9990-208.80.154.85:59850.service: Succeeded.
Jun 15 12:22:11 cloudcontrol1004 systemd[1]: nodecheck@8623101-208.80.154.132:9990-208.80.154.85:59922.service: Succeeded.
Jun 15 12:22:13 cloudcontrol1004 systemd[1]: nodecheck@8623107-208.80.154.132:9990-208.80.154.85:59992.service: Succeeded.
Jun 15 12:22:16 cloudcontrol1004 systemd[1]: nodecheck@8623109-208.80.154.132:9990-208.80.154.85:60048.service: Succeeded.
Jun 15 12:22:18 cloudcontrol1004 systemd[1]: nodecheck@8623115-208.80.154.132:9990-208.80.154.85:60124.service: Succeeded.
Jun 15 12:22:20 cloudcontrol1004 systemd[1]: nodecheck@8623119-208.80.154.132:9990-208.80.154.85:60184.service: Succeeded.
Jun 15 12:22:22 cloudcontrol1004 systemd[1]: nodecheck@8623125-208.80.154.132:9990-208.80.154.85:60248.service: Succeeded.
Jun 15 12:22:24 cloudcontrol1004 systemd[1]: nodecheck@8623129-208.80.154.132:9990-208.80.154.85:60316.service: Succeeded.
Jun 15 12:22:26 cloudcontrol1004 systemd[1]: nodecheck@8623134-208.80.154.132:9990-208.80.154.85:60384.service: Succeeded.
Jun 15 12:22:28 cloudcontrol1004 systemd[1]: nodecheck@8623138-208.80.154.132:9990-208.80.154.85:60466.service: Succeeded.
Jun 15 12:22:30 cloudcontrol1004 systemd[1]: nodecheck@8623144-208.80.154.132:9990-208.80.154.85:60536.service: Succeeded.
Jun 15 12:22:32 cloudcontrol1004 systemd[1]: nodecheck@8623149-208.80.154.132:9990-208.80.154.85:60610.service: Succeeded.
Jun 15 12:22:34 cloudcontrol1004 systemd[1]: nodecheck@8623154-208.80.154.132:9990-208.80.154.85:60678.service: Succeeded.
Jun 15 12:22:36 cloudcontrol1004 systemd[1]: nodecheck@8623160-208.80.154.132:9990-208.80.154.85:60780.service: Succeeded.
Jun 15 12:22:38 cloudcontrol1004 systemd[1]: nodecheck@8623163-208.80.154.132:9990-208.80.154.85:60860.service: Succeeded.
Jun 15 12:22:40 cloudcontrol1004 systemd[1]: nodecheck@8623168-208.80.154.132:9990-208.80.154.85:60916.service: Succeeded.
Jun 15 12:22:42 cloudcontrol1004 systemd[1]: nodecheck@8623174-208.80.154.132:9990-208.80.154.85:32772.service: Succeeded.
Jun 15 12:22:44 cloudcontrol1004 systemd[1]: nodecheck@8623178-208.80.154.132:9990-208.80.154.85:32866.service: Succeeded.
Jun 15 12:22:46 cloudcontrol1004 systemd[1]: nodecheck@8623183-208.80.154.132:9990-208.80.154.85:32938.service: Succeeded.
Jun 15 12:22:48 cloudcontrol1004 systemd[1]: nodecheck@8623189-208.80.154.132:9990-208.80.154.85:32990.service: Succeeded.
Jun 15 12:22:50 cloudcontrol1004 systemd[1]: nodecheck@8623194-208.80.154.132:9990-208.80.154.85:33064.service: Succeeded.
Jun 15 12:22:53 cloudcontrol1004 systemd[1]: nodecheck@8623199-208.80.154.132:9990-208.80.154.85:33128.service: Succeeded.
Jun 15 12:22:55 cloudcontrol1004 systemd[1]: nodecheck@8623204-208.80.154.132:9990-208.80.154.85:33192.service: Succeeded.
Jun 15 12:22:57 cloudcontrol1004 systemd[1]: nodecheck@8623208-208.80.154.132:9990-208.80.154.85:33268.service: Succeeded.
Jun 15 12:22:59 cloudcontrol1004 systemd[1]: nodecheck@8623213-208.80.154.132:9990-208.80.154.85:33360.service: Succeeded.

So I think the issue is not the check script or the checks themselves; something else is breaking the connection.

I'm trying to capture the event with tcpdump to see what's happening at the network level; for that I'm using this script:

#!/bin/bash

# dcaro trying to catch the haproxy flap

old_logs=$(journalctl -u haproxy -n1 | tail -n1)
new_logs=$old_logs
count=3
gotten=0

tcpdump \
    -Z root \
    -i eno1 \
    -w /tmp/galera_testing.pcap \
    -G 120 \
    host cloudcontrol1003.wikimedia.org or host cloudcontrol1004.wikimedia.org \
    &

while [[ $gotten -lt $count ]]; do
    while [[ "$old_logs" == "$new_logs" ]]; do
        sleep 1
        new_logs=$(journalctl -u haproxy -n1 | tail -n1)
    done
    echo "Got one at $(date)"
    cp /tmp/galera_testing.pcap /tmp/galera_testing.pcap.$(date +%H%M%S_%d%m%y)
    gotten=$((gotten+1))
    old_logs=$new_logs  # reset the baseline so we wait for the next log change
done

echo "stopping the tcpdump process, we got a hit"
pkill tcpdump

Running on all three cloudcontrols, it will save a pcap file covering (up to) the last two minutes before each event (the -G 120 rotation window), for a maximum of three events.
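
Once a flap is captured, something like this should show whether any TCP resets touch the health-check port, and which side sends them (a follow-up idea, not something that has run yet):

# look for TCP resets involving the health-check port in the saved captures
for f in /tmp/galera_testing.pcap.*; do
    echo "== $f"
    tcpdump -nn -r "$f" 'tcp port 9990 and tcp[tcpflags] & tcp-rst != 0'
done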

:crossingfingers:

ps. I was unable to reproduce the error locally; even killing processes and such did not produce the "Connection reset by peer" message.


I changed the exit code on 1003; it did not even fail:

root@cloudcontrol1003:~# nodecheck.sh
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 23

Galera node is ready.

root@cloudcontrol1003:~# echo $?
1

root@cloudcontrol1003:~# curl http://127.0.0.1:9990/
Galera node is ready.

root@cloudcontrol1003:~# echo $?
0

root@cloudcontrol1003:~# systemctl status nodecheck.socket
● nodecheck.socket - Socket to check galera node state
     Loaded: loaded (/lib/systemd/system/nodecheck.socket; enabled; vendor preset: enabled)
     Active: active (listening) since Tue 2022-05-10 20:00:11 UTC; 1 months 5 days ago
     Listen: [::]:9990 (Stream)
   Accepted: 8677861; Connected: 0;
      Tasks: 0 (limit: 77086)
     Memory: 12.0K
     CGroup: /system.slice/nodecheck.socket

Warning: journal has been rotated since unit was started, output may be incomplete.

No flapping :/

The difference is whether or not the test script echoes an HTTP status code. If it exits without emitting one, we get the 'Connection reset by peer' failure in haproxy.

On the other side: if the test ALWAYS returns an HTTP status code, haproxy registers the service as up even if mariadb is stopped.
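
For what it's worth, the usual way around this dilemma (the approach percona's clustercheck takes) is to always write a complete HTTP response and let the status code carry the health information: haproxy's httpchk treats anything other than a 2xx/3xx as a failed check, so a 503 marks the backend DOWN cleanly instead of producing a connection reset. A sketch of that shape, not the actual nodecheck.sh (the wsrep query and the lack of credentials handling are assumptions):

#!/bin/bash
# Sketch: always answer; 200 when the node is synced, 503 otherwise.
state=$(mysql --connect-timeout=2 -BN \
        -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state'" 2>/dev/null | awk '{print $2}')

if [[ "$state" == "4" ]]; then    # 4 == Synced
    status='200 OK'
    body=$'Galera node is ready.\r\n'
else
    status='503 Service Unavailable'
    body=$'Galera node is not synced.\r\n'
fi

printf 'HTTP/1.1 %s\r\nContent-Type: text/plain\r\nConnection: close\r\nContent-Length: %d\r\n\r\n%s' \
    "$status" "${#body}" "$body"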

Change 805867 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] galera nodecheck.sh: noisy logging

https://gerrit.wikimedia.org/r/805867

Change 805867 merged by Andrew Bogott:

[operations/puppet@production] galera nodecheck.sh: noisy logging

https://gerrit.wikimedia.org/r/805867

Change 805872 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] galera nodecheck: cat response content in one go

https://gerrit.wikimedia.org/r/805872

Change 805872 merged by Andrew Bogott:

[operations/puppet@production] galera nodecheck: cat response content in one go

https://gerrit.wikimedia.org/r/805872

We have improved logging and can now (maybe) see this issue happening in /var/log/nodecheck/err.log:

cat: write error: Connection reset by peer

There were some more during the night; none caused a flap though, and the novafullstack instances kept getting stuck xd

Change 806437 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Replace the bash/socket-based galera healthcheck with a python flask app

https://gerrit.wikimedia.org/r/806437

Change 806437 merged by Andrew Bogott:

[operations/puppet@production] Replace the bash/socket-based galera healthcheck with a python flask app

https://gerrit.wikimedia.org/r/806437

Change 806449 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Galera/haproxy: remove old bash healthcheck script, replace with flask

https://gerrit.wikimedia.org/r/806449

Change 806448 had a related patch set uploaded (by Andrew Bogott; author: Majavah):

[operations/puppet@production] P:openstack::haproxy: fix galera health check

https://gerrit.wikimedia.org/r/806448

Change 806450 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::openstack::base::galera::node: remove old absented files

https://gerrit.wikimedia.org/r/806450

Change 806448 merged by Andrew Bogott:

[operations/puppet@production] P:openstack::haproxy: fix galera health check

https://gerrit.wikimedia.org/r/806448

Change 806449 merged by Andrew Bogott:

[operations/puppet@production] Galera/haproxy: remove old bash healthcheck script, replace with flask

https://gerrit.wikimedia.org/r/806449

Mentioned in SAL (#wikimedia-cloud) [2022-06-17T17:53:52Z] <andrewbogott> switching to a new python-based health check for galera and haproxy. This may make things more stable, or it may not. T310664

Change 806457 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] remove nodecheck.sh. It was replaced with nodecheck.py

https://gerrit.wikimedia.org/r/806457

This is fixed now; it has been stable for some days, so closing.

Change 806450 merged by Andrew Bogott:

[operations/puppet@production] profile::openstack::base::galera::node: remove old absented files

https://gerrit.wikimedia.org/r/806450

Change 806457 merged by Andrew Bogott:

[operations/puppet@production] remove nodecheck.sh. It was replaced with nodecheck.py

https://gerrit.wikimedia.org/r/806457