High database error rates on s2 and s3
Closed, Resolved · Public

Assigned To

Authored By

	jcrespo
	Jan 24 2018, 5:45 PM

Description

(High here means we-should-investigate-high, not things-are-critically-broken-high)

s2 is showing a high error rate on Prometheus:

Screenshot_20180124_093450.png (1×2 px, 192 KB)

s3 is showing a high connection error rate (on db1077, but especially on db1078), coming mostly from the job queue:

https://logstash.wikimedia.org/goto/59e95d9d4db3bdd222658e2f08306efd

The real issue, if any, is not yet clear in either case. It could be bad permissions; it could be a low connection timeout on the PHP side (in which case the errors should probably be mitigated rather than the timeout increased); it could be connection overload; or it could be some other kind of error.
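
As a rough way to tell these hypotheses apart, the error numbers seen in the logs could be bucketed by likely cause. This is only a sketch: the error numbers are standard MySQL/MariaDB codes, but the bucket names, the grouping, and the `classify` and `summarize` helpers are assumptions made for illustration, not something taken from the logstash data.

```python
from collections import Counter

# Standard MySQL/MariaDB error numbers, grouped by hypothesis.
# The grouping itself is an assumption made for illustration.
CAUSES = {
    1044: "permissions",                 # ER_DBACCESS_DENIED_ERROR
    1045: "permissions",                 # ER_ACCESS_DENIED_ERROR
    1040: "overload",                    # ER_CON_COUNT_ERROR (too many connections)
    1203: "overload",                    # ER_TOO_MANY_USER_CONNECTIONS
    2003: "connect_failure_or_timeout",  # CR_CONN_HOST_ERROR
    2013: "connect_failure_or_timeout",  # CR_SERVER_LOST
}


def classify(errno):
    """Return the hypothesis bucket for one error number."""
    return CAUSES.get(errno, "other")


def summarize(errnos):
    """Count how many errors fall into each bucket."""
    return Counter(classify(e) for e in errnos)


if __name__ == "__main__":
    # Made-up error numbers, only to show the call.
    print(summarize([1045, 1045, 1040, 2003, 9999]))
```

A summary dominated by one bucket would point at which of the explanations above to investigate first.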

Event Timeline

jcrespo created this task. Jan 24 2018, 5:45 PM

Restricted Application added a subscriber: Aklapper. Jan 24 2018, 5:45 PM

The biggest offender is one of the new jobrunners, so I'd tend to think it has to do with connection overload, but this certainly merits a better investigation.

• Marostegui moved this task from Triage to Backlog on the DBA board. Feb 5 2018, 11:55 AM

Should we consider this solved? It has not happened for a month: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=10&fullscreen&orgId=1&from=now-30d&to=now

Actually, this got solved for s2 this morning, but now it is happening on s1: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=10&fullscreen&orgId=1&from=1521744688622&to=1521820284514&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All

I am not sure I understand your comment. On that graph (and the one I posted) it is mostly 0 errors, no?

We followed up on IRC and identified that this is a process on terbium trying to connect to db1067 without the right grants. We are not even sure whether it should be running, so we may look at what it is and what it is trying to do, and then either fix it or kill it.

By doing a traffic capture, I have seen proxysql trying to connect from terbium to db1067 every 10 seconds, which matches what we saw in the logs:

libmariadb._pid.1572._client_version.2.3.1	_platform.x86_64.program_name.proxysql_monitor
16:33:29.362863 IP 10.64.32.13.57126 > 10.64.48.22.3306: Flags [F.], seq 201, ack 176, win 58, options [nop,nop,], length 0
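
To check the 10-second cadence from a capture rather than by eye, the timestamps of matching packets can be diffed. The sketch below assumes tcpdump's default one-line text format, as in the line above; `parse_line` and `gaps_between` are hypothetical helper names, and the extra lines in the usage example are made up.

```python
import re
from datetime import datetime

# Matches lines such as:
# 16:33:29.362863 IP 10.64.32.13.57126 > 10.64.48.22.3306: Flags [F.], ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]*)\]"
)


def parse_line(line):
    """Parse one capture line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%H:%M:%S.%f")
    return fields


def gaps_between(lines, flags="F."):
    """Return seconds between consecutive packets carrying the given TCP flags."""
    parsed = (parse_line(line) for line in lines)
    times = [p["ts"] for p in parsed if p is not None and p["flags"] == flags]
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    # First line is from the capture above; the other two are made up.
    sample = [
        "16:33:29.362863 IP 10.64.32.13.57126 > 10.64.48.22.3306: Flags [F.], seq 201, ack 176, win 58, length 0",
        "16:33:39.362863 IP 10.64.32.13.57130 > 10.64.48.22.3306: Flags [F.], seq 201, ack 176, win 58, length 0",
        "16:33:49.362863 IP 10.64.32.13.57134 > 10.64.48.22.3306: Flags [F.], seq 201, ack 176, win 58, length 0",
    ]
    print(gaps_between(sample))
```

Gaps that cluster tightly around one value suggest a periodic health check or monitor rather than ordinary application traffic. Note that the sketch ignores captures that cross midnight, since the timestamps carry no date.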

Finally fixed: it was a badly configured proxysql (not in production).

	F12772060: Screenshot_20180124_093450.png
	Jan 24 2018, 5:45 PM
