Maniphest T201021

HAproxy on dbproxy hosts lack enough logging
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	jcrespo
	Aug 2 2018, 4:19 PM

Description

Apparently, haproxy is lacking proper logging on dbproxy roles (or at least, dbproxy1002), but I remember reading logs of why failovers happen and when. It would be nice to have those on journalctl for easy debugging when connection or other failures happen.

This is motivated by a recent dbproxy1002 losing connectivity towards db1062 (m2 master), causing a brief period of read only.

We should may also increase the timeouts so they take more than 10 seconds of unavailability to proceed to a failover, given how much time it takes for the service to be back into good state afterwards (and requires human intervention- e.g. maybe in the order of 60 seconds).

Details

Subject	Repo	Branch	Lines +/-
dbproxy: Reduce the amount of logging	operations/puppet	production	+2 -2
db-replicas.cfg: Enable healthcheck logging	operations/puppet	production	+3 -0
db-master.cfg: Enable haproxy health-check logging	operations/puppet	production	+3 -0

Customize query in gerrit

Related Objects

Mentioned Here: T191400: Add haproxy-exporter to dbproxy hosts

Event Timeline

jcrespo created this task.Aug 2 2018, 4:19 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 2 2018, 4:19 PM

Marostegui moved this task from Triage to Backlog on the DBA board.Aug 3 2018, 7:37 AM

herron triaged this task as Medium priority.Aug 3 2018, 9:11 PM

The timeout increase was done at: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450542/

Should we add prometheus-haproxy-exporter in scope of this, too?

No need, tracked on T191400

My proposal to get this logging would be to enable: https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4.2-option%20log-health-checks

When this option is enabled, any change of the health check status or to
the server's health will be logged, so that it becomes possible to know
that a server was failing occasional checks before crashing, or exactly when
it failed to respond a valid HTTP status, then when the port started to
reject connections, then when the server stopped responding at all.

Change 454039 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db-master.cfg: Enable haproxy health-check logging

https://gerrit.wikimedia.org/r/454039

gerritbot added a project: Patch-For-Review.Aug 20 2018, 3:31 PM

Mentioned in SAL (#wikimedia-operations) [2018-08-21T05:12:06Z] <marostegui> Stop puppet on dbproxy1006 to do some haproxy logging tests - https://phabricator.wikimedia.org/T201021

After all the tests, these are the options needed on db-master.cfg:

option tcplog
option log-health-checks
log /dev/log local0 info

Which works fine with journald and produces the following output

journalctl --no-pager -u haproxy
<snip>
Aug 21 06:15:30 dbproxy1006 haproxy[10474]: Proxy mariadb started.
Aug 21 06:15:30 dbproxy1006 haproxy[10478]: Health check for server mariadb/db1063 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.31-MariaDB", check duration: 0ms, status: 20/20 UP.
Aug 21 06:15:31 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 3/3 UP.
Aug 21 06:16:27 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2000ms, status: 2/3 UP.
Aug 21 06:16:31 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2002ms, status: 1/3 UP.
Aug 21 06:16:35 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 failed, reason: Layer4 timeout, check duration: 2002ms, status: 0/2 DOWN.
Aug 21 06:16:35 dbproxy1006 haproxy[10478]: Backup Server mariadb/db1117:3321 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 21 06:17:01 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 1/2 DOWN.
Aug 21 06:17:03 dbproxy1006 haproxy[10478]: Health check for backup server mariadb/db1117:3321 succeeded, reason: Layer7 check passed, code: 0, info: "5.5.5-10.1.33-MariaDB", check duration: 0ms, status: 3/3 UP.
Aug 21 06:17:03 dbproxy1006 haproxy[10478]: Backup Server mariadb/db1117:3321 is UP. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.

Change 454039 merged by Marostegui:
[operations/puppet@production] db-master.cfg: Enable haproxy health-check logging

https://gerrit.wikimedia.org/r/454039

Mentioned in SAL (#wikimedia-operations) [2018-08-21T07:55:38Z] <marostegui> Slowly reload haproxies on dbproxy10XX - T201021

Change 454205 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db-replicas.cfg: Enable healthcheck logging

https://gerrit.wikimedia.org/r/454205

Change 454205 merged by Marostegui:
[operations/puppet@production] db-replicas.cfg: Enable healthcheck logging

https://gerrit.wikimedia.org/r/454205

Deployed, reloaded haproxies everywhere and I can see now the checks on the logs.

There is now too much logging, or it is not rotated fast enough: logs are consuming 70% of available disk: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=dbproxy1003&panelId=521&fullscreen&from=1533662253688&to=1536254253688

Mentioned in SAL (#wikimedia-operations) [2018-09-06T17:20:11Z] <jynus_> dropping old logs from dbproxy1003 T201021

I will play with the different log levels tomorrow to see which is the minimum we can do to still get the requests logged, or at least the failures

I have purged logs from other dbproxies too just to make sure they are ok.

Mentioned in SAL (#wikimedia-operations) [2018-09-07T04:58:18Z] <marostegui> Disable puppet on dbproxy1006 for logging testing - T201021

Using crit instead of info give us when a server goes down but it doesn't log every single health check request.
So we get to know when the failover was triggered:

Sep 07 05:05:34 dbproxy1006 haproxy[9498]: Backup Server mariadb/db1117:3321 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Change 458633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy: Reduce the amount of logging

https://gerrit.wikimedia.org/r/458633

Change 458633 merged by Marostegui:
[operations/puppet@production] dbproxy: Reduce the amount of logging

https://gerrit.wikimedia.org/r/458633

Mentioned in SAL (#wikimedia-operations) [2018-09-07T08:33:08Z] <marostegui> Rebooting haproxies to pick up new config after all the tests - T201021

I have reloaded haproxy everywhere and it is now under control and not logging every single second :-).
Also purged all the logs to make them to be at the same level they used to be before the change.

HAproxy on dbproxy hosts lack enough loggingClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

HAproxy on dbproxy hosts lack enough logging
Closed, ResolvedPublic
Actions