Gerrit 285208 broke eventlogging_sync.sh
Closed, Resolved (Public)

Description

I got an Icinga warning about low disk space on / for db1047.

I took a quick look (I'm on mobile) and saw that the log for /usr/local/bin/eventlogging_sync.sh was filling up all the space because the script was stuck in an infinite loop. The message in the error log was:

/usr/local/bin/eventlogging_sync.sh: line 37: mysql: command not found

I saw that it is managed by Puppet, so I killed it, manually rotated the logs (both the log and the err file were pretty big) and ran Puppet to start it again, hoping that it would pick up the proper environment and work again.

Unfortunately it didn't, but the way it fails changed. Now the error is just:

ERROR 1045 (28000): Access denied for user 'root'@'10.64.16.148' (using password: NO)

So it seems that it is now not reading the credentials from /root/.my.cnf.

After digging a bit I found that https://gerrit.wikimedia.org/r/#/c/285208/ was merged today. It changed the way the script is run, and in doing so broke it on all hosts where it is configured to run.

I'm not sure why the behaviour is different between the two runs.
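
Both errors look environment-related: "command not found" suggests mysql was not on PATH, and "using password: NO" suggests the client did not find /root/.my.cnf (which it locates through HOME). That is a guess, not a confirmed diagnosis. A minimal sketch of a pre-flight check that stops the script instead of letting it retry forever and fill the log; the `preflight` function and the `DEFAULTS_FILE` variable are hypothetical and not part of eventlogging_sync.sh:

```shell
#!/bin/sh
# Hypothetical pre-flight check (an assumption, not the real script's code).
# Verifies that the client binary is on PATH and the credentials file is readable.
DEFAULTS_FILE="${DEFAULTS_FILE:-/root/.my.cnf}"

preflight() {
  # $1: client binary name, $2: credentials file
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "preflight: $1 not found in PATH=$PATH" >&2
    return 1
  fi
  if [ ! -r "$2" ]; then
    echo "preflight: cannot read $2" >&2
    return 2
  fi
  return 0
}

# Intended use before the sync loop (left commented out in this sketch):
#   preflight mysql "$DEFAULTS_FILE" || exit 1
#   mysql --defaults-file="$DEFAULTS_FILE" ...   # must be the first option
```

Passing the credentials file explicitly with `--defaults-file` would make the script independent of whatever HOME the spawning process provides.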

Event Timeline

Change 285249 had a related patch set uploaded (by Volans):
Revert "mariadb: don't spawn el_sync.sh via sudo -u root"

https://gerrit.wikimedia.org/r/285249

Volans triaged this task as Unbreak Now! priority.

I'm reverting the change right now to make it work again; then we can sync up on a proper fix.

Change 285249 merged by Volans:
Revert "mariadb: don't spawn el_sync.sh via sudo -u root"

https://gerrit.wikimedia.org/r/285249

Volans lowered the priority of this task from Unbreak Now! to Low. Apr 25 2016, 8:18 PM

I've merged the revert and killed the process stuck in the infinite loop on db1047 and dbstore1002. From a quick search in Puppet I think these are the only two servers where this runs, but I could be mistaken.

If you are aware of other places where this runs, please kill the process if it is in an infinite loop and force a Puppet run to restart it.

I've also noticed (from Icinga alarms) that eventlogging_sync.sh periodically crashes with this error:

mysqldump: Couldn't find table: "MobileV?b?lickTracking_5929948"
ERROR 1146 (42S02) at line 1: Table 'log.MobileV?b?lickTracking_5929948' doesn't exist

On db1047, SHOW TABLES inside mysql gives:

| MobileV?b?lickTracking_5929948                    |

After setting all connection charsets and collations to utf8 I got:

MobileV<EF><BF><BD>b<EF><BF><BD>lickTracking_5929948

While on the filesystem:

-rw-rw---- 1 mysql mysql  4406 Jan  6  2015 MobileV@fffdb@fffdlickTracking_5929948.frm

It could be my terminal configuration, but it looks like there is some encoding/corruption issue with this table's name that needs to be fixed.
eventlogging_sync.sh will crash each time it reaches this table, and the Icinga alarm will fire until the next Puppet run restarts it.
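
The three renderings above seem to agree with each other, which points away from a terminal problem: EF BF BD is the UTF-8 encoding of U+FFFD (the Unicode replacement character), and, as far as I know, @fffd is how MySQL/MariaDB writes that same code point when mapping an identifier to a file name. A minimal sketch of the byte arithmetic, with a hypothetical helper name:

```shell
#!/bin/sh
# Decode one 3-byte UTF-8 sequence (given as three hex bytes) to its code point.
# Layout of a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
utf8_3byte_codepoint() {
  b1=$((0x$1)); b2=$((0x$2)); b3=$((0x$3))
  printf '%04x\n' $(( ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F) ))
}

# The bytes shown by the utf8 connection for each unreadable character:
utf8_3byte_codepoint EF BF BD   # prints fffd
```

If that reading is right, the stored name really contains two replacement characters, so the original characters were probably lost when the table was created rather than when it was displayed.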

jcrespo claimed this task.

Closing because the ongoing issues were fixed; long-term fixes will be done in T124307.