Review labsdb1005 MariaDB configuration against prod standards
Closed, Resolved (Public)

Description

MariaDB is not maintained for Jessie and labsdb1005 is Jessie.

We're afraid we might be missing some settings that are used in Stretch, which could be causing the latest outages we've seen.

This task is to sync up with DBAs and identify potential improvements (including evaluating upgrade to Stretch).
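As a starting point for that comparison, here is a minimal sketch of diffing the server variables of two hosts. It assumes each host's settings have been dumped as tab-separated `name<TAB>value` lines, for example with `mysql --batch --skip-column-names -e "SHOW GLOBAL VARIABLES"`. The function names `parse_variables` and `diff_variables` and the sample values are hypothetical, not taken from labsdb1005 or any production host.

```python
def parse_variables(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict.

    Assumes the format produced by a batch-mode SHOW GLOBAL VARIABLES dump.
    A line without a tab is treated as a variable with an empty value.
    """
    variables = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.partition("\t")
        variables[name.strip()] = value.strip()
    return variables


def diff_variables(host_vars, baseline_vars):
    """Return {name: (host_value, baseline_value)} for settings that differ.

    A value of None means the variable is absent on that side, which is
    expected when the two servers run different MariaDB versions.
    """
    diffs = {}
    for name in sorted(set(host_vars) | set(baseline_vars)):
        host_value = host_vars.get(name)
        baseline_value = baseline_vars.get(name)
        if host_value != baseline_value:
            diffs[name] = (host_value, baseline_value)
    return diffs


if __name__ == "__main__":
    # Illustrative values only.
    host = parse_variables("max_connections\t512\nthread_handling\tone-thread-per-connection\n")
    baseline = parse_variables("max_connections\t1024\nthread_handling\tone-thread-per-connection\n")
    for name, (ours, theirs) in diff_variables(host, baseline).items():
        print(f"{name}: host={ours!r} baseline={theirs!r}")
```

Variables that show up as `None` on one side are the interesting ones here, since they point at settings that exist in only one of the two versions.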

Event Timeline

Worse, we may very well have the settings that are used in Stretch, and the settings may not apply well to our version of MariaDB.

If our hardware issues clear up, the highest priority is moving this DB to Stretch and friends (T216173).
Until then...

Besides evaluating against typical standards, a general health check from @jcrespo or @Marostegui, if possible, might help until we can get this moved. Moving things and possibly adding per-user limits as in T216170 are likely ways forward. However, extreme slowness and some odd timeouts were reported before the outage, and I don't know what might have been causing them.

Our MariaDB configuration is essentially the same across MariaDB versions. Our concern with Jessie is that, without upgrades, the MariaDB server stays exposed to vulnerabilities.

At this moment, I've been getting reports of ridiculously slow query times even against unique indexes. We are still running table repair (against very large tables), so we are hoping that's what's up. However, I'm wondering if the performance of this thing is the real cause of the connection pileup. Hard to tell right now with repairs underway.

There were reports of this happening earlier, but none prove whether the slowness is the chicken or the egg. There are definitely "timeout" errors where there shouldn't be any in some places before the connections filled up.

We also found T216202, but it's just one disk...

I've stopped the repair; it is not possible to run it with so many ongoing queries, since that will create metadata locking. I have also killed some queries.
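For anyone repeating the query cleanup, a rough sketch of how long-running queries could be selected for killing. It assumes rows have already been fetched from `information_schema.PROCESSLIST` as dicts keyed by the column names `ID`, `USER`, `COMMAND` and `TIME`. The helper name `kill_statements`, the 300-second threshold and the exempt user list are assumptions for illustration, not what was actually run on this host.

```python
def kill_statements(processlist, max_seconds=300,
                    exempt_users=("system user", "event_scheduler")):
    """Build KILL QUERY statements for queries running at least max_seconds.

    processlist: iterable of dicts with the PROCESSLIST columns
    ID, USER, COMMAND and TIME (TIME is in seconds).
    Sleeping connections and exempt users (for example the replication
    threads, which appear as 'system user') are left alone.
    """
    statements = []
    for row in processlist:
        if row["COMMAND"] != "Query":
            continue  # idle or non-query threads hold no running statement
        if row["USER"] in exempt_users:
            continue
        if row["TIME"] >= max_seconds:
            # KILL QUERY aborts the statement but keeps the connection open.
            statements.append("KILL QUERY %d;" % row["ID"])
    return statements


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        {"ID": 11, "USER": "tool_a", "COMMAND": "Query", "TIME": 4200},
        {"ID": 12, "USER": "tool_b", "COMMAND": "Sleep", "TIME": 9000},
        {"ID": 13, "USER": "system user", "COMMAND": "Query", "TIME": 9000},
    ]
    for statement in kill_statements(rows):
        print(statement)
```

The statements are only generated, not executed, so they can be reviewed before anything is killed.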

Bstorm claimed this task.

The question around this issue appears to have been answered, and the master is now on Stretch which helps.