tools-checker-01:~$ sudo toolscheckerctl stop toolschecker_labs_private stop/waiting toolschecker_toolsdb stop/waiting toolschecker_dumps stop/waiting toolschecker_cron stop/waiting toolschecker_webservice_kubernetes stop/waiting toolschecker_continuous_job_trusty stop/waiting toolschecker_labsdb_labsdb1005 stop/waiting toolschecker_grid_start_precise stop/waiting toolschecker_service_start stop/waiting toolschecker_nfs_secondary_cluster_showmount stop/waiting toolschecker_nfs_home stop/waiting stop: Unknown instance: Traceback (most recent call last): File "/usr/local/sbin/toolscheckerctl", line 33, in <module> subprocess.check_call(['service', service, args.action]) File "/usr/lib/python3.4/subprocess.py", line 561, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['service', 'toolschecker_nfs_showmount', 'stop']' returned non-zero exit status 1 tools-checker-01:~$ sudo toolscheckerctl start toolschecker_labs_private start/running, process 10314 toolschecker_toolsdb start/running, process 10322 toolschecker_dumps start/running, process 10330 toolschecker_cron start/running, process 10338 toolschecker_webservice_kubernetes start/running, process 10346 toolschecker_continuous_job_trusty start/running, process 10354 toolschecker_labsdb_labsdb1005 start/running, process 10362 toolschecker_grid_start_precise start/running, process 10370 toolschecker_service_start start/running, process 10378 toolschecker_nfs_secondary_cluster_showmount start/running, process 10386 toolschecker_nfs_home start/running, process 10394 toolschecker_nfs_showmount start/running, process 10402 start: Job is already running: toolschecker_kubernetes_nodes_ready Traceback (most recent call last): File "/usr/local/sbin/toolscheckerctl", line 33, in <module> subprocess.check_call(['service', service, args.action]) File "/usr/lib/python3.4/subprocess.py", line 561, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['service', 'toolschecker_kubernetes_nodes_ready', 'start']' returned non-zero exit status 1 tools-checker-01:~$ sudo toolscheckerctl start start: Job is already running: toolschecker_labs_private Traceback (most recent call last): File "/usr/local/sbin/toolscheckerctl", line 33, in <module> subprocess.check_call(['service', service, args.action]) File "/usr/lib/python3.4/subprocess.py", line 561, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['service', 'toolschecker_labs_private', 'start']' returned non-zero exit status 1
Description
Related Objects
- Mentioned Here
- T184832: Decommission labsdb1001 and labsdb1003
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2019-01-09T09:59:44Z] <gtirloni> rebooted tools-checker-01 (T213252)
After the reboot, there are checks that are still not running:
tools-checker-01:~$ sudo toolscheckerctl status | sort -k2 -r toolschecker_nfs_showmount stop/waiting toolschecker_labsdb_labsdb1003 stop/waiting toolschecker_labsdb_labsdb1003rw stop/waiting toolschecker_labsdb_labsdb1001 stop/waiting toolschecker_labsdb_labsdb1001rw stop/waiting toolschecker_grid_start_precise stop/waiting toolschecker_continuous_job_precise stop/waiting toolschecker_service_start start/running, process 2529 toolschecker_webservice_kubernetes start/running, process 2519 toolschecker_kubernetes_nodes_ready start/running, process 2508 toolschecker_kubernetes_etcd start/running, process 2498 toolschecker_flannel_etcd start/running, process 2485 toolschecker_cron start/running, process 2475 toolschecker_grid_start_trusty start/running, process 2465 toolschecker_continuous_job_trusty start/running, process 2455 toolschecker_dumps start/running, process 2446 toolschecker_toolsdb start/running, process 2437 toolschecker_labsdb_labsdb1004rw start/running, process 2423 toolschecker_labsdb_labsdb1005 start/running, process 2407 toolschecker_redis start/running, process 2395 toolschecker_nfs_home start/running, process 2387 toolschecker_ldap start/running, process 2378 toolschecker_nfs_secondary_cluster_showmount start/running, process 2371 toolschecker_labs_private start/running, process 2363 toolschecker_puppet_catalog start/running, process 2355 toolschecker_self start/running, process 2347
02:05:28 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount status toolschecker_nfs_showmount stop/waiting 02:06:18 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount start toolschecker_nfs_showmount start/running, process 14698 02:06:22 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount status toolschecker_nfs_showmount start/running, process 14698
Starting the first one manually did seem to work. Are they simply not enabled (start on boot)?
The boot service starting seemed all failed:
Jan 9 09:58:43 tools-checker-01 kernel: [ 15.743269] FS-Cache: Netfs 'nfs' registered for caching Jan 9 09:58:43 tools-checker-01 kernel: [ 15.760279] init: Failed to spawn toolschecker main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.784213] init: Failed to spawn toolschecker_labsdb_labsdb1003 main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.802854] init: Failed to spawn toolschecker_continuous_job_trusty main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.815255] Installing knfsd (copyright (C) 1996 okir@monad.swb.de). Jan 9 09:58:43 tools-checker-01 kernel: [ 15.829076] init: Failed to spawn toolschecker_flannel_etcd main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.853230] init: Failed to spawn toolschecker_labsdb_labsdb1001 main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.855349] init: Failed to spawn toolschecker_puppet_catalog main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.888520] init: Failed to spawn toolschecker_nfs_secondary_cluster_showmount main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.891387] init: Failed to spawn toolschecker_grid_start_trusty main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.915945] init: Failed to spawn toolschecker_labsdb_labsdb1001rw main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.949428] init: Failed to spawn toolschecker_labsdb_labsdb1003rw main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.951630] init: Failed to spawn toolschecker_service_start main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 15.976144] init: Failed to spawn toolschecker_kubernetes_etcd main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.024132] init: Failed to spawn toolschecker_labsdb_labsdb1005 main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.035022] init: Failed to spawn toolschecker_redis main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.044197] init: Failed to spawn toolschecker_toolsdb main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.046291] init: Failed to spawn toolschecker_continuous_job_precise main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.052201] init: Failed to spawn toolschecker_ldap main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.059837] init: Failed to spawn toolschecker_dumps main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.069203] init: Failed to spawn toolschecker_nfs_home main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.071314] init: Failed to spawn toolschecker_labs_private main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.080757] init: Failed to spawn toolschecker_labsdb_labsdb1004rw main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.082921] init: Failed to spawn toolschecker_nfs_showmount main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.101206] init: Failed to spawn toolschecker_self main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.115024] init: Failed to spawn toolschecker_grid_start_precise main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.116600] init: Failed to spawn toolschecker_webservice_kubernetes main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.129750] init: Failed to spawn toolschecker_cron main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.140133] init: Failed to spawn toolschecker_kubernetes_nodes_ready main process: unable to getpwnam: No such file or directory Jan 9 09:58:43 tools-checker-01 kernel: [ 16.759673] init: idmapd main process (939) terminated with status 1 Jan 9 09:58:43 tools-checker-01 kernel: [ 16.759688] init: idmapd main process ended, respawning
The current batch was started by puppet:
Jan 9 09:59:58 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/File[/run/toolschecker]/ensure) created Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[self]/Service[toolschecker_self]/ensure) ensure changed 'stopped' to 'running' Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[self]/Service[toolschecker_self]) Unscheduling refresh on Service[toolschecker_self] Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[puppet_catalog]/Service[toolschecker_puppet_catalog]/ensure) ensure changed 'stopped' to 'running' Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[puppet_catalog]/Service[toolschecker_puppet_catalog]) Unscheduling refresh on Service[toolschecker_puppet_catalog] Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labs_private]/Service[toolschecker_labs_private]/ensure) ensure changed 'stopped' to 'running' Jan 9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labs_private]/Service[toolschecker_labs_private]) Unscheduling refresh on Service[toolschecker_labs_private] Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_secondary_cluster_showmount]/Service[toolschecker_nfs_secondary_cluster_showmount]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_secondary_cluster_showmount]/Service[toolschecker_nfs_secondary_cluster_showmount]) Unscheduling refresh on Service[toolschecker_nfs_secondary_cluster_showmount] Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[ldap]/Service[toolschecker_ldap]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[ldap]/Service[toolschecker_ldap]) Unscheduling refresh on Service[toolschecker_ldap] Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_home]/Service[toolschecker_nfs_home]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_home]/Service[toolschecker_nfs_home]) Unscheduling refresh on Service[toolschecker_nfs_home] Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[redis]/Service[toolschecker_redis]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[redis]/Service[toolschecker_redis]) Unscheduling refresh on Service[toolschecker_redis] Jan 9 10:00:01 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1005]/Service[toolschecker_labsdb_labsdb1005]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:01 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1005]/Service[toolschecker_labsdb_labsdb1005]) Unscheduling refresh on Service[toolschecker_labsdb_labsdb1005] Jan 9 10:00:02 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1004rw]/Service[toolschecker_labsdb_labsdb1004rw]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:02 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1004rw]/Service[toolschecker_labsdb_labsdb1004rw]) Unscheduling refresh on Service[toolschecker_labsdb_labsdb1004rw] Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[toolsdb]/Service[toolschecker_toolsdb]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[toolsdb]/Service[toolschecker_toolsdb]) Unscheduling refresh on Service[toolschecker_toolsdb] Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[dumps]/Service[toolschecker_dumps]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[dumps]/Service[toolschecker_dumps]) Unscheduling refresh on Service[toolschecker_dumps] Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[continuous_job_trusty]/Service[toolschecker_continuous_job_trusty]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[continuous_job_trusty]/Service[toolschecker_continuous_job_trusty]) Unscheduling refresh on Service[toolschecker_continuous_job_trusty] Jan 9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[grid_start_trusty]/Service[toolschecker_grid_start_trusty]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[grid_start_trusty]/Service[toolschecker_grid_start_trusty]) Unscheduling refresh on Service[toolschecker_grid_start_trusty] Jan 9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[cron]/Service[toolschecker_cron]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[cron]/Service[toolschecker_cron]) Unscheduling refresh on Service[toolschecker_cron] Jan 9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[flannel_etcd]/Service[toolschecker_flannel_etcd]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[flannel_etcd]/Service[toolschecker_flannel_etcd]) Unscheduling refresh on Service[toolschecker_flannel_etcd] Jan 9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_etcd]/Service[toolschecker_kubernetes_etcd]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_etcd]/Service[toolschecker_kubernetes_etcd]) Unscheduling refresh on Service[toolschecker_kubernetes_etcd] Jan 9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_nodes_ready]/Service[toolschecker_kubernetes_nodes_ready]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_nodes_ready]/Service[toolschecker_kubernetes_nodes_ready]) Unscheduling refresh on Service[toolschecker_kubernetes_nodes_ready] Jan 9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[webservice_kubernetes]/Service[toolschecker_webservice_kubernetes]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[webservice_kubernetes]/Service[toolschecker_webservice_kubernetes]) Unscheduling refresh on Service[toolschecker_webservice_kubernetes] Jan 9 10:00:07 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[service_start]/Service[toolschecker_service_start]/ensure) ensure changed 'stopped' to 'running' Jan 9 10:00:07 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[service_start]/Service[toolschecker_service_start]) Unscheduling refresh on Service[toolschecker_service_start]
The checkers that weren't started are those that doesn't seem to be managed by puppet. Looking at the list:
toolschecker_nfs_showmount stop/waiting
No idea what this is. I stopped it again.
toolschecker_labsdb_labsdb1003 stop/waiting toolschecker_labsdb_labsdb1003rw stop/waiting toolschecker_labsdb_labsdb1001 stop/waiting toolschecker_labsdb_labsdb1001rw stop/waiting
These labsdb100[13] are some very old databases that have been decommissioned in T184832
toolschecker_grid_start_precise stop/waiting toolschecker_continuous_job_precise stop/waiting
Precise is dead.
I think they are just some leftovers from the past, and puppet don't clean them up when they become unmanaged.
I've deleted any un-Puppetized checks from tools-checker-{01,02}. The backups are in /root/toolschecker-purge-2019-01-10
$ sudo toolscheckerctl stop toolschecker_labs_private stop/waiting toolschecker_toolsdb stop/waiting toolschecker_dumps stop/waiting toolschecker_cron stop/waiting toolschecker_webservice_kubernetes stop/waiting toolschecker_continuous_job_trusty stop/waiting toolschecker_labsdb_labsdb1005 stop/waiting toolschecker_service_start stop/waiting toolschecker_nfs_secondary_cluster_showmount stop/waiting toolschecker_nfs_home stop/waiting toolschecker_kubernetes_nodes_ready stop/waiting toolschecker_self stop/waiting toolschecker_flannel_etcd stop/waiting toolschecker_kubernetes_etcd stop/waiting toolschecker_grid_start_trusty stop/waiting toolschecker_puppet_catalog stop/waiting toolschecker_redis stop/waiting toolschecker_ldap stop/waiting toolschecker_labsdb_labsdb1004rw stop/waiting $ sudo toolscheckerctl start toolschecker_labs_private start/running, process 3426 toolschecker_toolsdb start/running, process 3436 toolschecker_dumps start/running, process 3445 toolschecker_cron start/running, process 3458 toolschecker_webservice_kubernetes start/running, process 3466 toolschecker_continuous_job_trusty start/running, process 3477 toolschecker_labsdb_labsdb1005 start/running, process 3487 toolschecker_service_start start/running, process 3496 toolschecker_nfs_secondary_cluster_showmount start/running, process 3509 toolschecker_nfs_home start/running, process 3518 toolschecker_kubernetes_nodes_ready start/running, process 3529 toolschecker_self start/running, process 3539 toolschecker_flannel_etcd start/running, process 3550 toolschecker_kubernetes_etcd start/running, process 3558 toolschecker_grid_start_trusty start/running, process 3569 toolschecker_puppet_catalog start/running, process 3577 toolschecker_redis start/running, process 3590 toolschecker_ldap start/running, process 3601 toolschecker_labsdb_labsdb1004rw start/running, process 3611
@Bstorm @zhuyifei1999 thanks for helping me investigate this.