db1047 out of disk space, eventlogging_sync spam
Closed, Resolved (Public)

Description

db1047 paged for disk space. On examining /var/log, I found a lot of spam in eventlogging_sync.{log,err}. The .log file was already symlinked to /srv but the .err file wasn't, so I symlinked it to /srv as well.

db1047:/var/log$ ls -la eventlogging_sync.*
lrwxrwxrwx 1 root root      43 Dec  5 01:56 eventlogging_sync.err -> /srv/log/eventlogging/eventlogging_sync.err
lrwxrwxrwx 1 root root      43 Aug 26 20:53 eventlogging_sync.log -> /srv/log/eventlogging/eventlogging_sync.log
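
For reference, moving a log out of /var/log and leaving a symlink behind could be sketched roughly as below. The function name relocate_log is hypothetical and not part of eventlogging_sync, and this is not necessarily how the symlinks above were created.

```shell
# relocate_log SRC DEST_DIR
# Move SRC into DEST_DIR and leave a symlink at the original path.
# (Hypothetical helper, for illustration only.)
relocate_log() {
    rl_src=$1
    rl_dest_dir=$2
    # Already a symlink: nothing to do.
    [ -L "$rl_src" ] && return 0
    mkdir -p "$rl_dest_dir" || return 1
    mv "$rl_src" "$rl_dest_dir/" || return 1
    ln -s "$rl_dest_dir/$(basename "$rl_src")" "$rl_src"
}
```

Calling it as `relocate_log /var/log/eventlogging_sync.err /srv/log/eventlogging` would produce a link like the ones listed above. A process that already has the file open keeps writing to its old file descriptor until it reopens the log, so a restart may still be needed.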

The errors themselves are TLS certificate verification failures:

ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
ERROR 2026 (HY000): SSL connection error: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
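
When a log is flooded with one repeated message like this, a quick way to see what dominates it is to count identical lines. A minimal sketch, using a hypothetical helper name top_log_lines:

```shell
# top_log_lines FILE [N]
# Print the N most frequent lines in FILE (default 5), each prefixed by its count.
# (Hypothetical helper, for illustration only.)
top_log_lines() {
    sort "$1" | uniq -c | sort -rn | head -n "${2:-5}"
}
```

For example, `top_log_lines /srv/log/eventlogging/eventlogging_sync.err 3` would list the three most common messages in the .err file.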

dbstore1002 suffers from the same problem but has a larger / (root) filesystem.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2016-12-05T02:12:38Z] <godog> add --skip-ssl to mysql commands on eventlogging_sync on db1047 - T152364

Mentioned in SAL (#wikimedia-operations) [2016-12-05T02:14:20Z] <godog> add --skip-ssl to mysql commands on eventlogging_sync on dbstore1002 - T152364

Change 325257 had a related patch set uploaded (by Marostegui):
eventlogging_sync: By pass ssl check on localhost

https://gerrit.wikimedia.org/r/325257

Thanks for taking care of this. I have submitted a patch to skip this check: https://gerrit.wikimedia.org/r/325257
I am not completely aware of the whole history of this script and process, so I will wait for @jcrespo and his input on this.

Change 325257 merged by Jcrespo:
eventlogging_sync: By pass ssl check on localhost

https://gerrit.wikimedia.org/r/325257

This was caused by the TLS certificate expiring on all analytics hosts, which made all mysql connections from other databases fail. This was part of the mitigation of the ongoing issue on T152188. End users were not affected because a backup refilling process took over, although this created lots of issues. We need to restart all eventlog hosts to revert the patch.
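
An expiry like this can be caught ahead of time with openssl's -checkend option, which reports whether a certificate expires within a given number of seconds. A minimal sketch follows. The helper name cert_expires_within is hypothetical, and the certificate path used on these hosts is not stated in this task, so it is left as a parameter.

```shell
# cert_expires_within CERT_FILE [SECONDS]
# Returns 0 if openssl reports that the certificate has expired or will
# expire within SECONDS (default 0), or if openssl cannot parse the file.
# Returns 1 if the certificate is still valid past that window.
# Returns 2 if CERT_FILE is not readable.
# (Hypothetical helper, for illustration only.)
cert_expires_within() {
    cew_cert=$1
    cew_window=${2:-0}
    [ -r "$cew_cert" ] || return 2
    # "openssl x509 -checkend N" exits 0 when the cert does NOT expire within N seconds.
    if openssl x509 -checkend "$cew_window" -noout -in "$cew_cert" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

Run periodically with a window of, say, 30 days (2592000 seconds), a check along these lines could warn before connections start failing.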

Change 325273 had a related patch set uploaded (by Jcrespo):
Renew expired TLS certificate for eventlogging hosts

https://gerrit.wikimedia.org/r/325273

I have re-enabled puppet and run it to pick up the commit.

jcrespo claimed this task.

The ongoing issues are now resolved. Long term fixes will go on T152188.

Change 325273 merged by Jcrespo:
Renew expired TLS certificate for eventlogging hosts

https://gerrit.wikimedia.org/r/325273