Page MenuHomePhabricator

HAproxy on dbproxy hosts lack enough logging
Closed, ResolvedPublic

Description

Apparently, haproxy is lacking proper logging on dbproxy roles (or at least, dbproxy1002), but I remember reading logs of why failovers happen and when. It would be nice to have those on journalctl for easy debugging when connection or other failures happen.

This is motivated by a recent dbproxy1002 losing connectivity towards db1062 (m2 master), causing a brief period of read only.

We should may also increase the timeouts so they take more than 10 seconds of unavailability to proceed to a failover, given how much time it takes for the service to be back into good state afterwards (and requires human intervention- e.g. maybe in the order of 60 seconds).

Event Timeline

jcrespo created this task.Aug 2 2018, 4:19 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 2 2018, 4:19 PM
Marostegui moved this task from Triage to Backlog on the DBA board.Aug 3 2018, 7:37 AM
herron triaged this task as Medium priority.Aug 3 2018, 9:11 PM

Should we add prometheus-haproxy-exporter in scope of this, too?

No need, tracked on T191400

My proposal to get this logging would be to enable: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4.2-option%20log-health-checks

When this option is enabled, any change of the health check status or to
the server's health will be logged, so that it becomes possible to know
that a server was failing occasional checks before crashing, or exactly when
it failed to respond a valid HTTP status, then when the port started to
reject connections, then when the server stopped responding at all.

Change 454039 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db-master.cfg: Enable haproxy health-check logging

https://gerrit.wikimedia.org/r/454039

Mentioned in SAL (#wikimedia-operations) [2018-08-21T05:12:06Z] <marostegui> Stop puppet on dbproxy1006 to do some haproxy logging tests - https://phabricator.wikimedia.org/T201021

After all the tests, these are the options needed on db-master.cfg:

option tcplog
option log-health-checks
log /dev/log local0 info

Which works fine with journald and produces the following output

journalctl --no-pager -u haproxy
<snip>
Aug 21 06:15:30 dbproxy1006 haproxy[10474]: Proxy mariadb started.
Aug 21 06:15:30 dbproxy1006 haproxy[10478]: Health check for server mariadb/db1063 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.31-MariaDB", check duration: 0ms, status: 20/20 UP.
Aug 21 06:15:31 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 3/3 UP.
Aug 21 06:16:27 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP.
Aug 21 06:16:31 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2002ms, status: 1/3 UP.
Aug 21 06:16:35 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2002ms, status: 0/2 DOWN.
Aug 21 06:16:35 dbproxy1006 haproxy[10478]: Backup Server mariadb/db1117:3321 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 21 06:17:01 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 1/2 DOWN.
Aug 21 06:17:03 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 3/3 UP.
Aug 21 06:17:03 dbproxy1006 haproxy[10478]: Backup Server mariadb/db1117:3321 is UP. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.

Change 454039 merged by Marostegui:
[operations/puppet@production] db-master.cfg: Enable haproxy health-check logging

https://gerrit.wikimedia.org/r/454039

Mentioned in SAL (#wikimedia-operations) [2018-08-21T07:55:38Z] <marostegui> Slowly reload haproxies on dbproxy10XX - T201021

Change 454205 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db-replicas.cfg: Enable healthcheck logging

https://gerrit.wikimedia.org/r/454205

Change 454205 merged by Marostegui:
[operations/puppet@production] db-replicas.cfg: Enable healthcheck logging

https://gerrit.wikimedia.org/r/454205

Marostegui closed this task as Resolved.Aug 21 2018, 8:19 AM
Marostegui claimed this task.

Deployed, reloaded haproxies everywhere and I can see now the checks on the logs.

jcrespo reopened this task as Open.Sep 6 2018, 5:18 PM
jcrespo raised the priority of this task from Medium to High.

Mentioned in SAL (#wikimedia-operations) [2018-09-06T17:20:11Z] <jynus_> dropping old logs from dbproxy1003 T201021

I will play with the different log levels tomorrow to see which is the minimum we can do to still get the requests logged, or at least the failures

I have purged logs from other dbproxies too just to make sure they are ok.

Mentioned in SAL (#wikimedia-operations) [2018-09-07T04:58:18Z] <marostegui> Disable puppet on dbproxy1006 for logging testing - T201021

Using crit instead of info give us when a server goes down but it doesn't log every single health check request.
So we get to know when the failover was triggered:

Sep 07 05:05:34 dbproxy1006 haproxy[9498]: Backup Server mariadb/db1117:3321 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Change 458633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy: Reduce the amount of logging

https://gerrit.wikimedia.org/r/458633

Change 458633 merged by Marostegui:
[operations/puppet@production] dbproxy: Reduce the amount of logging

https://gerrit.wikimedia.org/r/458633

Mentioned in SAL (#wikimedia-operations) [2018-09-07T08:33:08Z] <marostegui> Rebooting haproxies to pick up new config after all the tests - T201021

Marostegui closed this task as Resolved.Sep 7 2018, 9:06 AM

I have reloaded haproxy everywhere and it is now under control and not logging every single second :-).
Also purged all the logs to make them to be at the same level they used to be before the change.