
mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges"
Closed, Resolved, Public

Description

https://gerrit.wikimedia.org/r/#/c/269169/ failed to merge with the following error at the end of the mediawiki-extensions-php55 job:

00:39:04 [mediawiki-extensions-php55] $ /bin/bash -xe /tmp/hudson4352663662233031134.sh
00:39:04 + /srv/deployment/integration/slave-scripts/bin/mw-teardown-mysql.sh
00:39:04 ERROR 1269 (HY000) at line 1: Can't revoke all privileges for one or more of the requested users
00:39:05 Build step 'Execute a set of scripts' changed build result to FAILURE
00:39:05 Build step 'Execute a set of scripts' marked build as failure
00:39:05 Archiving artifacts
00:39:05 Finished: FAILURE

For full details, see https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/599/console

Probably unrelatedly, there are also warnings in the middle of the unit tests:

00:28:08 Populating links tables...
00:28:09 Completed
00:28:09 Updated 0  workflows
00:28:09 Failed: 0
00:28:09 
00:28:09 Array
00:28:09 (
00:28:09 )
00:28:09 PHP Notice:  Cannot find site jenkins_u0_mw in sites table [Called from Wikibase\Client\WikibaseClient::newSiteGroup in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php at line 591] in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php on line 300
00:28:09 PHP Stack trace:
00:28:09 PHP   1. {main}() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:0
00:28:09 PHP   2. require_once() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:214
00:28:09 PHP   3. UpdateMediaWiki->execute() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/doMaintenance.php:103
00:28:09 PHP   4. LoggedUpdateMaintenance->execute() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:181
00:28:09 PHP   5. FlowCreateTemplates->doDBUpdates() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/Maintenance.php:1438
00:28:09 PHP   6. FlowCreateTemplates->create() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Flow/maintenance/FlowCreateTemplates.php:84
00:28:09 PHP   7. WikiPage->doEditContent() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Flow/maintenance/FlowCreateTemplates.php:113
00:28:09 PHP   8. WikiPage->prepareContentForEdit() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/page/WikiPage.php:1710
00:28:09 PHP   9. AbstractContent->getParserOutput() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/page/WikiPage.php:2172
00:28:09 PHP  10. Hooks::run() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/content/AbstractContent.php:501
00:28:09 PHP  11. call_user_func_array() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/Hooks.php:195
00:28:09 PHP  12. Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers::onContentAlterParserOutput() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/Hooks.php:195
00:28:09 PHP  13. Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers::newFromGlobalState() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php:87
00:28:09 PHP  14. Wikibase\Client\WikibaseClient->getLangLinkHandler() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php:66
00:28:09 PHP  15. Wikibase\Client\WikibaseClient->getLangLinkSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:704
00:28:09 PHP  16. Wikibase\Client\WikibaseClient->getSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:570
00:28:09 PHP  17. Wikibase\Client\WikibaseClient->newSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:609
00:28:09 PHP  18. wfWarn() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:591
00:28:09 PHP  19. MWDebug::warning() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/GlobalFunctions.php:1181
00:28:09 PHP  20. MWDebug::sendMessage() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php:155
00:28:09 PHP  21. trigger_error() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php:300
00:28:09 
00:28:09 Notice: Cannot find site jenkins_u0_mw in sites table [Called from Wikibase\Client\WikibaseClient::newSiteGroup in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php at line 591] in /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php on line 300
00:28:09 
00:28:09 Call Stack:
00:28:09     0.0006     275400   1. {main}() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:0
00:28:09     0.0036     634296   2. require_once('/mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/doMaintenance.php') /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:214
00:28:09     0.3272   19210824   3. UpdateMediaWiki->execute() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/doMaintenance.php:103
00:28:09     2.2897   38129000   4. LoggedUpdateMaintenance->execute() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/update.php:181
00:28:09     2.2911   38129448   5. FlowCreateTemplates->doDBUpdates() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/maintenance/Maintenance.php:1438
00:28:09     2.2965   38520816   6. FlowCreateTemplates->create() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Flow/maintenance/FlowCreateTemplates.php:84
00:28:09     2.3424   40019200   7. WikiPage->doEditContent() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Flow/maintenance/FlowCreateTemplates.php:113
00:28:09     2.3489   40293360   8. WikiPage->prepareContentForEdit() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/page/WikiPage.php:1710
00:28:09     2.4773   45720160   9. AbstractContent->getParserOutput() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/page/WikiPage.php:2172
00:28:09     2.5731   47579664  10. Hooks::run() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/content/AbstractContent.php:501
00:28:09     2.5735   47607312  11. call_user_func_array() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/Hooks.php:195
00:28:09     2.5735   47607888  12. Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers::onContentAlterParserOutput() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/Hooks.php:195
00:28:09     2.5735   47608080  13. Wikibase\Client\Hooks\ParserOutputUpdateHookHandlers::newFromGlobalState() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php:87
00:28:09     2.5740   47649336  14. Wikibase\Client\WikibaseClient->getLangLinkHandler() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/ParserOutputUpdateHookHandlers.php:66
00:28:09     2.5825   47998984  15. Wikibase\Client\WikibaseClient->getLangLinkSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:704
00:28:09     2.5825   47999072  16. Wikibase\Client\WikibaseClient->getSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:570
00:28:09     2.5825   47999216  17. Wikibase\Client\WikibaseClient->newSiteGroup() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:609
00:28:09     2.5880   48171536  18. wfWarn() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php:591
00:28:09     2.5886   48171816  19. MWDebug::warning() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/GlobalFunctions.php:1181
00:28:09     2.5893   48173800  20. MWDebug::sendMessage() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php:155
00:28:09     2.5893   48174176  21. trigger_error() /mnt/jenkins-workspace/workspace/mediawiki-extensions-php55/src/includes/debug/MWDebug.php:300
00:28:09 
00:28:10 Removed 0 links to special pages.
00:28:10 Completed
00:28:10 Completed

Event Timeline

Catrope created this task. Feb 12 2016, 1:00 AM
Catrope raised the priority of this task to Unbreak Now!.
Catrope updated the task description.
Catrope added a subscriber: Catrope.
Restricted Application added a subscriber: Aklapper. Feb 12 2016, 1:00 AM
Krinkle renamed this task from CI broken for Echo (possibly other things) with error "Can't revoke all privileges for one or more of the requested users" to mediawiki-extensions-php55 broken for Echo (and possibly others) with "mw-teardown-mysql.sh: Can't revoke all privileges". Feb 16 2016, 1:19 AM
Krinkle set Security to None.
Catrope renamed this task from mediawiki-extensions-php55 broken for Echo (and possibly others) with "mw-teardown-mysql.sh: Can't revoke all privileges" to mediawiki-extensions-php55 fails intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges". Feb 16 2016, 4:42 AM
Krinkle renamed this task from mediawiki-extensions-php55 fails intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" to mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges". Feb 16 2016, 2:24 PM
hashar added subscribers: Tgr, StudiesWorld.
hashar added a subscriber: hashar.

From T126810:

In https://gerrit.wikimedia.org/r/#/c/270350/ PS1 the hhvm job is reported as a failure even though it actually isn't.

The "Cannot find site jenkins_u0_mw in sites table" errors are unrelated. Wikibase assumes the sites table has an entry for $wgDBname. Not a big deal.

The error "ERROR 1269 (HY000) at line 1: Can't revoke all privileges for one or more of the requested users" is very concerning. I have no idea what the cause might be, but maybe one of our MySQL instances is borked somehow. I will do some forensics to find out whether it affects a single slave or all of them.

I also noticed a bunch of failures where some database tables mysteriously no longer exist.

Random example from a Wikibase change against wmf.13 ( https://gerrit.wikimedia.org/r/#/c/271273/ ):

update.php (build console) yelling about:

00:00:38.424 Error: 1146 Table 'jenkins_u0_mw.filearchive' doesn't exist (127.0.0.1:3306)

On integration-slave-trusty-1018 I have dmesg filled with:

[Tue Feb  9 13:20:54 2016] init: mysql main process (1087) terminated with status 1
[Tue Feb  9 13:20:54 2016] init: mysql main process ended, respawning
[Tue Feb  9 13:20:54 2016] init: mysql post-start process (1089) terminated with status 1
[Tue Feb  9 13:20:54 2016] type=1400 audit(1455024054.643:26): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/mysqld" pid=1513 comm="apparmor_parser"
[Tue Feb  9 13:20:56 2016] init: mysql main process (1525) terminated with status 1
[Tue Feb  9 13:20:56 2016] init: mysql main process ended, respawning
[Tue Feb  9 13:20:56 2016] init: mysql post-start process (1526) terminated with status 1
[Tue Feb  9 13:20:56 2016] type=1400 audit(1455024056.724:27): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/mysqld" pid=1642 comm="apparmor_parser"
[Tue Feb  9 13:20:58 2016] init: mysql main process (1654) terminated with status 1
[Tue Feb  9 13:20:58 2016] init: mysql respawning too fast, stopped

So somehow the nice MySQL server is being killed.

The whole MySQL setup needs to be overhauled; it is very messy. We have the datadir set to a tmpfs via:

https://gerrit.wikimedia.org/r/#/c/204528/13/manifests/role/ci.pp,cm

And there is a crazy hack that probably causes puppet to restart it randomly :-/

https://integration.wikimedia.org/ci/job/mwext-qunit-composer/804/consoleFull

16:34:15 Building remotely on integration-slave-trusty-1001 (phpflavor-hhvm contintLabsSlave UbuntuTrusty phpflavor-php55) in workspace /mnt/jenkins-workspace/workspace/mwext-qunit-composer

From /var/log/syslog

16:38:13 + /srv/deployment/integration/slave-scripts/bin/mw-teardown-mysql.sh
16:38:13 ERROR 1269 (HY000) at line 1: Can't revoke all privileges for one or more of the requested users

From

Feb 17 16:33:17 integration-slave-trusty-1011 puppet-agent[22288]: Applying configuration version '1455719570'
Feb 17 16:33:24 integration-slave-trusty-1011 crontab[23518]: (root) LIST (root)
Feb 17 16:33:36 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Packages::Libcurl4_gnutls_dev/Package[libcurl4-gnutls-dev]/ensure) ensure changed 'purged' to 'present'
Feb 17 16:34:07 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Sysfs/Service[sysfsutils]/ensure) ensure changed 'stopped' to 'running'
Feb 17 16:34:07 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Sysfs/Service[sysfsutils]) Unscheduling refresh on Service[sysfsutils]
Feb 17 16:34:16 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Contint::Packages::Labs/Exec[/usr/bin/apt-get -y build-dep hhvm]/returns) executed successfully
Feb 17 16:34:41 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Contint::Packages::Python/Package[setuptools]/ensure) created
Feb 17 16:34:43 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Contint::Packages::Python/Package[pip]/ensure) created
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) 
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) --- /etc/init/mysql.override#0112016-02-17 16:05:30.901762365 +0000
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) +++ /tmp/puppet-file20160217-22288-2tv8qb#0112016-02-17 16:34:44.437398490 +0000
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) @@ -0,0 +1 @@
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) +manual
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: FileBucket got a duplicate file {md5}d41d8cd98f00b204e9800998ecf8427e
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]) Filebucketed /etc/init/mysql.override to puppet with sum d41d8cd98f00b204e9800998ecf8427e
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content) content changed '{md5}d41d8cd98f00b204e9800998ecf8427e' to '{md5}cf3f2a865fbea819dadd439586eaee31'
Feb 17 16:34:44 integration-slave-trusty-1011 puppet-agent[22288]: (/Stage[main]/Role::Ci::Slave::Labs/Service[mysql]/enable) enable changed 'false' to 'true'
Feb 17 16:34:46 integration-slave-trusty-1011 puppet-agent[22288]: Finished catalog run in 106.33 seconds
root@integration-slave-trusty-1011:~# puppet agent -tv
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Caching catalog for integration-slave-trusty-1011.integration.eqiad.wmflabs
Info: Applying configuration version '1455729625'
Notice: /Stage[main]/Packages::Libcurl4_gnutls_dev/Package[libcurl4-gnutls-dev]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Contint::Packages::Labs/Package[hhvm-dev]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Sysfs/Service[sysfsutils]: Unscheduling refresh on Service[sysfsutils]
Notice: /Stage[main]/Contint::Packages::Python/Package[setuptools]/ensure: created
Notice: /Stage[main]/Contint::Packages::Python/Package[pip]/ensure: created
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content: 
--- /etc/init/mysql.override    2016-02-17 17:19:28.225371432 +0000
+++ /tmp/puppet-file20160217-15209-3h6niu       2016-02-17 17:29:17.601451551 +0000
@@ -0,0 +1 @@
+manual

Info: FileBucket got a duplicate file {md5}d41d8cd98f00b204e9800998ecf8427e
Info: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]: Filebucketed /etc/init/mysql.override to puppet with sum d41d8cd98f00b204e9800998ecf8427e
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content: content changed '{md5}d41d8cd98f00b204e9800998ecf8427e' to '{md5}cf3f2a865fbea819dadd439586eaee31'
Notice: /Stage[main]/Role::Ci::Slave::Labs/Service[mysql]/enable: enable changed 'false' to 'true'
Notice: Finished catalog run in 107.97 seconds
root@integration-slave-trusty-1011:~# puppet agent -tv
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Caching catalog for integration-slave-trusty-1011.integration.eqiad.wmflabs
Info: Applying configuration version '1455729625'
Notice: /Stage[main]/Packages::Libcurl4_gnutls_dev/Package[libcurl4-gnutls-dev]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Sysfs/Service[sysfsutils]: Unscheduling refresh on Service[sysfsutils]
Notice: /Stage[main]/Contint::Packages::Labs/Exec[/usr/bin/apt-get -y build-dep hhvm]/returns: executed successfully
Notice: /Stage[main]/Contint::Packages::Python/Package[setuptools]/ensure: created
Notice: /Stage[main]/Contint::Packages::Python/Package[pip]/ensure: created
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content: 
--- /etc/init/mysql.override    2016-02-17 17:29:18.381451944 +0000
+++ /tmp/puppet-file20160217-22044-17zpp8d      2016-02-17 17:32:45.141553980 +0000
@@ -0,0 +1 @@
+manual

Info: FileBucket got a duplicate file {md5}d41d8cd98f00b204e9800998ecf8427e
Info: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]: Filebucketed /etc/init/mysql.override to puppet with sum d41d8cd98f00b204e9800998ecf8427e
Notice: /Stage[main]/Role::Ci::Slave::Labs/File[/etc/init/mysql.override]/content: content changed '{md5}d41d8cd98f00b204e9800998ecf8427e' to '{md5}cf3f2a865fbea819dadd439586eaee31'
Notice: /Stage[main]/Role::Ci::Slave::Labs/Service[mysql]/enable: enable changed 'false' to 'true'
Notice: Finished catalog run in 130.63 seconds

Yup, the override file is a craziness we inject, and it is also used by puppet's service {} resource.

From a comment I made on https://gerrit.wikimedia.org/r/#/c/204528/:

I have found out why /etc/init/mysql.override keeps disappearing. The puppet implementation for an upstart service uses the .override file to enable/disable a service. If the service is enabled, puppet magically writes an empty .override; when it is disabled, puppet injects 'manual' into it.
So the override file is always overwritten by puppet. We need to set:
service { 'mysql':
    enable => manual,
}

This really needs to be investigated more.
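
To see which state puppet last left the override file in, something like the following could be run on a slave. It is only a sketch based on the description above (empty or missing file means enabled, a 'manual' line means disabled); the function name upstart_override_state is made up for illustration and is not part of the slave scripts.

```shell
#!/bin/bash
# Hypothetical helper: classify an upstart .override file according to the
# behaviour described above (empty = enabled, 'manual' = disabled by puppet).
upstart_override_state() {
    local override_file="$1"
    if [ ! -s "$override_file" ]; then
        # Missing or zero-length file: nothing overrides the upstart job
        echo "enabled"
    elif grep -qx 'manual' "$override_file"; then
        # A line containing exactly 'manual' stops upstart from auto-starting it
        echo "manual"
    else
        # Anything else was written by something other than the enable/disable logic
        echo "custom"
    fi
}

# Example: upstart_override_state /etc/init/mysql.override
```

Running it before and after a puppet agent run would show whether the file keeps flipping between the two states.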

Another thing is that on instance reboot, the database is recreated from scratch and misses the deb-maint user, which is used to check the status of the service. That might cause trouble as well.

I am tempted to disable the default mysql service entirely and instead spawn a mysql on job start that has its datadir pointing to the per-job tmpfs. That might be a bit more robust. On job completion, kill mysql; the tmpfs is reclaimed anyway.
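
A rough sketch of what such a per-job wrapper could run. It only prints the commands instead of executing them. The function name, the directory layout under the job tmpfs and the use of --skip-networking are assumptions for illustration, not what the slave scripts do today; jobs currently connect to 127.0.0.1:3306, so MediaWiki would have to be pointed at the socket instead.

```shell
#!/bin/bash
# Sketch only: print the commands a job wrapper might use for a throwaway
# MySQL instance. The paths below are hypothetical.
per_job_mysql_commands() {
    local job_tmpfs="$1"
    local datadir="${job_tmpfs}/mysql-data"
    local socket="${job_tmpfs}/mysql.sock"

    # Job start: initialise an empty data directory on the per-job tmpfs
    echo "mysql_install_db --datadir=${datadir}"
    # Job start: run a private server reachable only through the per-job socket
    echo "mysqld --datadir=${datadir} --socket=${socket} --skip-networking &"
    # Job completion: shut the private server down
    echo "mysqladmin --socket=${socket} --user=root shutdown"
}

# Example: per_job_mysql_commands /path/to/job/tmpfs
```

Since every job would get its own server and data directory, a puppet run restarting the system-wide mysql service could no longer wipe a running job's tables.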

This needs a bit more thought.

Mentioned in SAL [2016-02-17T18:11:18Z] <jzerebecki> updated cherry-pick https://gerrit.wikimedia.org/r/#/c/204528/14 on integration-puppetmaster T126699

Mentioned in SAL [2016-02-17T18:59:28Z] <jzerebecki> updating chery-pick https://gerrit.wikimedia.org/r/#/c/204528/15 on integration-puppetmaster T126699

PS15 fixes the problem

JanZerebecki closed this task as Resolved. Feb 17 2016, 7:12 PM
JanZerebecki claimed this task.

During that job run there was no puppet run on that host. Also there is nothing in dmesg that shows mysql or anything else dying. I can't access https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1011/builds , but it seems to not happen on every job on that host.

Maybe I was wrong to merge the tasks. Jan's fix most probably deals with tables randomly going away while tests are running, caused by puppet probably restarting MySQL (yeah, that is crazy). I am confident the hack he found fixes it.

For:
ERROR 1269 (HY000) at line 1: Can't revoke all privileges for one or more of the requested users

I am not sure what this issue is really about. User/database creation and deletion are done via the slave scripts in integration/jenkins.git (look for /bin/*mysql*):

mw-install-mysql.sh:

DROP DATABASE IF EXISTS ${MW_DB};
CREATE DATABASE ${MW_DB};
GRANT ALL ON ${MW_DB}.* to '${MW_DB_USER}'@'${MW_DB_HOST}' identified by '${MW_DB_PASS}';

mw-teardown-mysql.sh:

DROP DATABASE IF EXISTS ${MW_DB};

REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${MW_DB_USER}'@'${MW_DB_HOST}';
DROP USER '${MW_DB_USER}'@'${MW_DB_HOST}';

Maybe the database is still being used at the end of the job because a connection was improperly closed or a leftover procedure is still running.

Maybe we could pass some SQL commands to show the state of the database, user privileges, etc. before revoking and dropping it.
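
A minimal sketch of what those diagnostics could look like, assuming they are piped into the mysql client just before the REVOKE / DROP USER statements. The function name teardown_diagnostics_sql is hypothetical, and the choice of statements is only a guess at what would be useful here.

```shell
#!/bin/bash
# Sketch: emit diagnostic SQL to run right before the teardown statements.
# Arguments mirror MW_DB_USER and MW_DB_HOST from the slave scripts.
teardown_diagnostics_sql() {
    local db_user="$1" db_host="$2"
    cat <<EOF
SHOW PROCESSLIST;
SELECT User, Host FROM mysql.user WHERE User = '${db_user}';
SHOW GRANTS FOR '${db_user}'@'${db_host}';
EOF
}

# Example (hypothetical wiring inside the teardown script):
#   teardown_diagnostics_sql "$MW_DB_USER" "$MW_DB_HOST" | mysql --force || true
```

SHOW GRANTS is placed last because it fails when the account does not exist; with mysql --force the client continues past SQL errors, and the trailing || true keeps a failed diagnostic from aborting a job that runs under bash -xe.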

@JanZerebecki: Any progress / news to share? (Asking as you are set as assignee and this has "Unbreak now" priority)

hashar lowered the priority of this task from Unbreak Now! to High. Mar 9 2016, 2:13 PM

That is supposed to be fixed by https://gerrit.wikimedia.org/r/#/c/204528/ , which I recently rebased; it is still applied on CI instances.

The trick is that the puppet definition was incorrect and would cause puppet to restart MySQL whenever puppet runs (every 20 minutes?) because it thought the service was down.

The hack Jan found is to define the service in puppet to be handled via init script instead of upstart.

I haven't heard of this bug since Jan fixed it. So I guess it is now pending a merge in puppet.git. Lowering priority.

JanZerebecki removed JanZerebecki as the assignee of this task. Mar 10 2016, 3:43 PM

I won't find time to make sure the patch that is cherry picked on CI is getting merged.

greg added a subscriber: greg. Mar 10 2016, 3:46 PM

I can put it into the PuppetSWAT window for next week if you want.

I worded that incorrectly, that is not what I meant. I have not looked at it enough to know if it is ready to be merged and know that there are still problems that cause puppet to churn on CI slaves in at least one of the cherry picked patches. OTOH perhaps I'm being too perfectionist and we should only check if the patches do not affect production systems?

A puppet patch landed a few days ago that moved all of manifests/role/ci.pp under the modules/roles/ci/ tree. That left the puppet master in a dirty rebase state, which is fixed now.

I am confident Jan's hack fixed it properly, albeit at the price of a workaround, but it is well commented and points to this task.

So my position is to get the puppet patch merged and we can mark this task as resolved.

Change 204528 had a related patch set uploaded (by Hashar):
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528

*summary*

This task has a long history; the puppet patch has been applied for a year or so as part of T96230. @JanZerebecki had the final call and we now have something that is robust and stable on the CI labs instances.

So we would just need https://gerrit.wikimedia.org/r/#/c/204528/ to land in puppet.git and we can call this task done.

Change 204528 merged by Filippo Giunchedi:
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528

hashar closed this task as Resolved. Jul 12 2016, 6:59 PM