dbproxy1005 reports database failover
Closed, ResolvedPublic

Description

dbproxy1005 reports haproxy failover at 19:22:03 via Icinga.

The failover itself also shows up in the server logs:

-- Logs begin at Tue 2018-10-16 09:14:01 UTC, end at Wed 2018-10-24 21:46:44 UTC. --
Oct 24 19:21:54 dbproxy1005 haproxy[17979]: Server mariadb/db1073 is DOWN. 0 active and 1 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

However, after checking the haproxy stats, it seems the host is up and running:

banyek@dbproxy1005:~ $ echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,
mariadb,FRONTEND,,,0,0,5000,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,0,0,0,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,0,0,0,,0,0,
mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN 2934/99999999,1,1,0,46,1,9006,9006,,1,2,1,,0,,2,0,,0,L7OK,0,0,,,,,,,,,,,0,0,,,,,-1,5.5.5-10.1.36-MariaDB,,0,0,0,0,,,,Layer7 check passed,,99999999,20,2934,,,,,,tcp,,,,,,,,
mariadb,db1117:3325,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,14,3,1745239,25,,1,2,2,,0,,2,0,,0,L7OK,0,0,,,,,,,,,,,0,0,,,,,-1,5.5.5-10.1.33-MariaDB,,0,0,0,0,,,,Layer7 check passed,,2,3,4,,,,,,tcp,,,,,,,,
mariadb,BACKEND,0,0,0,0,500,0,0,0,0,0,,0,0,0,0,UP,1,0,1,,0,1748685,0,,1,2,0,,0,,1,0,,0,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,,,,,,,,,,,,,,tcp,,,,,,,,
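The wide CSV above is hard to read by eye, so here is a minimal Python sketch that runs the same `show stat` command over the socket and keeps only the health-related columns. The function names are hypothetical; the socket path and the column names (`status`, `check_status`, `check_rise`, `check_health`) are taken from the output above. One possible reading of the db1073 row, which is an interpretation and not something stated in this task: `DOWN 2934/99999999` together with `check_rise` = 99999999 suggests the checks are passing again, but the server stays DOWN until the rise count is reached, which would explain why only a reload clears it.

```python
import csv
import socket


def haproxy_command(command, sock_path="/run/haproxy/haproxy.sock"):
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(command.encode() + b"\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()


def parse_show_stat(text):
    """Parse `show stat` CSV text into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # The header starts with "# " and ends with a trailing comma.
    header = lines[0].lstrip("# ").rstrip(",").split(",")
    # zip() stops at the header length, dropping the trailing empty field.
    return [dict(zip(header, rec)) for rec in csv.reader(lines[1:])]


def server_summary(rows):
    """Return (svname, status, check_status, check_health, check_rise) per server."""
    summary = []
    for row in rows:
        if row.get("svname") in ("FRONTEND", "BACKEND"):
            continue  # skip the aggregate rows, keep real servers only
        summary.append((
            row.get("svname", ""),
            row.get("status", ""),
            row.get("check_status", ""),
            row.get("check_health", ""),
            row.get("check_rise", ""),
        ))
    return summary


if __name__ == "__main__":
    for entry in server_summary(parse_show_stat(haproxy_command("show stat"))):
        print(*entry)
```

Run on the proxy host as a user with access to the socket, this prints one short line per backend server, so a `DOWN` status next to an `L7OK` check result is visible at a glance.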

Checking on db1073, I don't see anything in the logs (journalctl -u mariadb doesn't have any events since Oct 04). The server is up and running, answering queries, etc.
Also, checking show processlist, I see this server is in use, but the connections are not coming from dbproxy1005.

I think it would be safe to reload haproxy, as this seems to be a transient error, and that would clear the alert. For now I will just ACK it and reload haproxy in the morning, as we discussed.

Event Timeline

Banyek triaged this task as Medium priority. Oct 24 2018, 9:59 PM

The host is not up and running, it says: db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN

You are lucky no one uses dbproxy1005 at the moment; otherwise it would have gone read-only.

Yep, that means it settles the question @Bstorm mentioned.
@jcrespo I mean db1073 itself is up and running, but the service is marked down on the proxy. That's why I said this seems to be a transient error.

Please reload the proxy and work with @Bstorm or whoever may help to identify next steps.

What do you think, could T207881 be related? I don't really think so, but the timestamps correlate. (That may not mean anything, but better to mention it than not.)

After reloading HAProxy it reports correctly:

mariadb,db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP

T207881 is mediawiki, db1072 is m5, nothing to do.

Sorry, I was not clear. I like your idea that there was possibly a small network issue, and I was wondering whether a network issue could also have affected the connection between the mw hosts and the db hosts. (I don't really think so, because if that were the case we would probably have noticed it, but I wanted to bring it up because of the timing.)

Banyek claimed this task.

I was wondering what had happened....

This happened again. Restarting the proxy, as I don't see a clear connection with max_connections. Network instability?

-- Logs begin at Sat 2019-04-20 15:06:53 UTC, end at Thu 2019-05-02 16:07:12 UTC. --
May 02 14:53:39 dbproxy1005 haproxy[14940]: Backup Server mariadb/db1117:3325 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 14:55:25 dbproxy1005 haproxy[14940]: Server mariadb/db1073 is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 14:55:25 dbproxy1005 haproxy[14940]: proxy mariadb has no server available!
May 02 15:04:22 dbproxy1005 haproxy[14940]: Backup Server mariadb/db1117:3325 is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 15:04:22 dbproxy1005 haproxy[14940]: proxy mariadb has no server available!

Both servers were detected as down, so likely a network/app level issue of the proxy, not the databases.
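To check that reasoning against the journal, a small Python sketch can pull the "is DOWN" events out of the haproxy log lines and flag the moments when no active or backup server was left. The regular expression is written against the line format quoted in this task only; the function names are hypothetical, and other haproxy versions or log formats may need a different pattern.

```python
import re

# Matches lines such as:
#   May 02 14:55:25 dbproxy1005 haproxy[14940]: Server mariadb/db1073 is DOWN.
#   0 active and 0 backup servers left. ...
DOWN_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) haproxy\[\d+\]: "
    r"(?:Backup )?Server (?P<proxy>[^/\s]+)/(?P<server>\S+) is DOWN\. "
    r"(?P<active>\d+) active and (?P<backups>\d+) backup servers left\."
)


def down_events(lines):
    """Extract one dict per 'Server ... is DOWN' line; ignore everything else."""
    events = []
    for line in lines:
        m = DOWN_RE.match(line)
        if m:
            events.append({
                "ts": m["ts"],
                "server": m["server"],
                "active": int(m["active"]),
                "backups": int(m["backups"]),
            })
    return events


def total_outages(events):
    """Keep only the events after which the proxy had no server left at all."""
    return [e for e in events if e["active"] == 0 and e["backups"] == 0]
```

Feeding it the output of `journalctl -u haproxy` split into lines gives the timestamps at which the whole backend was considered down. If every server drops within a short window while the databases themselves look healthy, that is consistent with a problem on the proxy or network side, although it does not prove it.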