integration-slave-trusty-1004 can't connect to mysql
Closed, Resolved (Public)

Description

https://integration.wikimedia.org/ci/job/mwext-qunit-composer/3218/console

/srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)

Event Timeline

Mentioned in SAL [2016-05-23T14:32:18Z] <jzerebecki> offlined integration-slave-trusty-1004 because it can't connect to mysql T135997

@thcipriani mentioned that slaves are sometimes missing mysql for some reason :(

I rebooted that host earlier today, so maybe our puppet-managed service does not start on boot.

hashar triaged this task as Unbreak Now! priority. May 23 2016, 7:03 PM
hashar added a project: Essential-Work.

Sounds bad:

integration-slave-trusty-1017.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1004.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1006.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1003.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1025.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1023.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1014.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1001.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1018.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1012.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1015.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1024.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1016.integration.eqiad.wmflabs:
    mysql stop/waiting
integration-slave-trusty-1013.integration.eqiad.wmflabs:
    mysql start/running, process 26269
integration-slave-trusty-1011.integration.eqiad.wmflabs:
    mysql stop/waiting
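
A list like the one above is easier to act on once it is reduced to the affected hostnames. The following is only a sketch: `stopped_hosts` is a hypothetical helper name, and it assumes the output has the shape shown here (a `host:` line followed by an indented status line).

```shell
# Read salt-style output on stdin ("host:" line followed by an indented
# status line) and print the hosts whose mysql job is reported as stopped.
stopped_hosts() {
    awk '
        # A non-indented line ending in ":" names the host for the lines below it.
        /^[^[:space:]].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # A stopped upstart job is reported as "mysql stop/waiting".
        /mysql stop\/waiting/ && host != "" { print host }
    '
}

# Example use on the salt master (not run here):
#   salt -v '*trusty*' cmd.run '/etc/init.d/mysql status' | stopped_hosts
```

Lines that are neither a host line nor a stopped status, such as the job id banner printed by `salt -v`, are ignored.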

The mysql service is managed by puppet. Due to T96230 / T126699 we have a custom patch to handle mysql: https://gerrit.wikimedia.org/r/#/c/204528/19/modules/role/manifests/ci/slave/labs.pp,cm

It is apparently entirely broken for some reason, most probably because of a recent patch that landed in the puppet.git production branch.

These machines seem to have mysql enabled on reboot:

thcipriani@integration-saltmaster:~$ sudo salt -G 'oscodename:trusty' cmd.run 'ls /etc/rc2.d | grep mysql'
integration-slave-trusty-1017.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1023.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1006.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1001.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1011.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1024.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1004.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1025.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1014.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1015.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1012.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1013.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1003.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1016.integration.eqiad.wmflabs:
    S20mysql
integration-slave-trusty-1018.integration.eqiad.wmflabs:
    S20mysql

Puppet logs don't show anything unusual.

The puppet service resource uses provider => debian, and the puppet agent eventually runs:

/etc/init.d/mysql status; echo $?
mysql stop/waiting
0

The init script eventually checks whether there is an upstart job and thus invokes initctl status mysql, which does not know about mysql...
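
Note that the status command above exits with 0 even though it prints stop/waiting. As far as I understand, init-style puppet service providers judge a service by the exit code of its status command, so a check based on the exit code alone would conclude mysql is up. A minimal sketch of a check that looks at the reported text instead (`mysql_reported_running` is a hypothetical helper name, not something from the puppet manifests):

```shell
# Decide from the text printed by `/etc/init.d/mysql status` whether mysql is
# running, instead of trusting the exit status (0 even for "stop/waiting").
# Returns 0 when the text reports start/running, 1 otherwise.
mysql_reported_running() {
    case "$1" in
        *"start/running"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a slave (not run here):
#   mysql_reported_running "$(/etc/init.d/mysql status 2>&1)" || echo "mysql is down"
```

This only illustrates the mismatch between output and exit code; it is not a proposed replacement for the puppet provider.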

Restarting an instance:

Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/var/lib/mysql]/owner: owner changed 'root' to 'mysql'
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/var/lib/mysql]/group: group changed 'root' to 'mysql'
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/var/lib/mysql]/mode: mode changed '1777' to '0775'

There is no mysql running ...

On some instances we have two processes:

/usr/sbin/mysqld

/bin/sh /usr/bin/mysqld_safe
 \_ /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --

That last one is wrong :)
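
To spot instances carrying that second kind of process, one could look for a mysqld whose parent is a mysqld_safe wrapper. A rough sketch, assuming the input is the output of `ps -eo pid,ppid,args`; the function name `mysqld_under_safe` is made up for illustration:

```shell
# Read `ps -eo pid,ppid,args` output on stdin and print the PID of every
# mysqld whose parent process is a mysqld_safe wrapper.
mysqld_under_safe() {
    awk '
        NR == 1 && $1 == "PID" { next }              # skip the ps header line
        { pid[NR] = $1; ppid[NR] = $2; cmd[NR] = $0 }
        $0 ~ /mysqld_safe/ { safe[$1] = 1 }          # remember wrapper PIDs
        END {
            for (i = 1; i <= NR; i++)
                # A real mysqld binary: path ends in "/mysqld", optionally
                # followed by arguments; mysqld_safe itself does not match.
                if ((ppid[i] in safe) && cmd[i] ~ /\/mysqld( |$)/)
                    print pid[i]
        }
    '
}

# Example use on a slave (not run here):
#   ps -eo pid,ppid,args | mysqld_under_safe
```

A mysqld started directly by upstart has no mysqld_safe parent and is therefore not listed.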

hashar claimed this task.

I ended up running killall mysqld, and upstart restarted it:

# salt -v '*trusty*'  cmd.run '/etc/init.d/mysql status'
Executing job with jid 20160523193513679149
-------------------------------------------

integration-slave-trusty-1023.integration.eqiad.wmflabs:
    mysql start/running, process 12035
integration-slave-trusty-1017.integration.eqiad.wmflabs:
    mysql start/running, process 20129
integration-slave-trusty-1014.integration.eqiad.wmflabs:
    mysql start/running, process 8892
integration-slave-trusty-1004.integration.eqiad.wmflabs:
    mysql start/running, process 15530
integration-slave-trusty-1015.integration.eqiad.wmflabs:
    mysql start/running, process 15508
integration-slave-trusty-1024.integration.eqiad.wmflabs:
    mysql start/running, process 5512
integration-slave-trusty-1025.integration.eqiad.wmflabs:
    mysql start/running, process 27209
integration-slave-trusty-1018.integration.eqiad.wmflabs:
    mysql start/running, process 15626
integration-slave-trusty-1012.integration.eqiad.wmflabs:
    mysql start/running, process 12085
integration-slave-trusty-1003.integration.eqiad.wmflabs:
    mysql start/running, process 2599
integration-slave-trusty-1011.integration.eqiad.wmflabs:
    mysql start/running, process 14315
integration-slave-trusty-1006.integration.eqiad.wmflabs:
    mysql start/running, process 6281
integration-slave-trusty-1001.integration.eqiad.wmflabs:
    mysql start/running, process 14202
integration-slave-trusty-1016.integration.eqiad.wmflabs:
    mysql start/running, process 13213
integration-slave-trusty-1013.integration.eqiad.wmflabs:
    mysql start/running, process 1276