Page MenuHomePhabricator

toolscheckerctl fails to stop/start checks
Closed, ResolvedPublic

Description

tools-checker-01:~$ sudo toolscheckerctl stop
toolschecker_labs_private stop/waiting
toolschecker_toolsdb stop/waiting
toolschecker_dumps stop/waiting
toolschecker_cron stop/waiting
toolschecker_webservice_kubernetes stop/waiting
toolschecker_continuous_job_trusty stop/waiting
toolschecker_labsdb_labsdb1005 stop/waiting
toolschecker_grid_start_precise stop/waiting
toolschecker_service_start stop/waiting
toolschecker_nfs_secondary_cluster_showmount stop/waiting
toolschecker_nfs_home stop/waiting
stop: Unknown instance: 
Traceback (most recent call last):
  File "/usr/local/sbin/toolscheckerctl", line 33, in <module>
    subprocess.check_call(['service', service, args.action])
  File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['service', 'toolschecker_nfs_showmount', 'stop']' returned non-zero exit status 1


tools-checker-01:~$ sudo toolscheckerctl start
toolschecker_labs_private start/running, process 10314
toolschecker_toolsdb start/running, process 10322
toolschecker_dumps start/running, process 10330
toolschecker_cron start/running, process 10338
toolschecker_webservice_kubernetes start/running, process 10346
toolschecker_continuous_job_trusty start/running, process 10354
toolschecker_labsdb_labsdb1005 start/running, process 10362
toolschecker_grid_start_precise start/running, process 10370
toolschecker_service_start start/running, process 10378
toolschecker_nfs_secondary_cluster_showmount start/running, process 10386
toolschecker_nfs_home start/running, process 10394
toolschecker_nfs_showmount start/running, process 10402
start: Job is already running: toolschecker_kubernetes_nodes_ready
Traceback (most recent call last):
  File "/usr/local/sbin/toolscheckerctl", line 33, in <module>
    subprocess.check_call(['service', service, args.action])
  File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['service', 'toolschecker_kubernetes_nodes_ready', 'start']' returned non-zero exit status 1

tools-checker-01:~$ sudo toolscheckerctl start
start: Job is already running: toolschecker_labs_private
Traceback (most recent call last):
  File "/usr/local/sbin/toolscheckerctl", line 33, in <module>
    subprocess.check_call(['service', service, args.action])
  File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['service', 'toolschecker_labs_private', 'start']' returned non-zero exit status 1

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-01-09T09:59:44Z] <gtirloni> rebooted tools-checker-01 (T213252)

After the reboot, there are checks that are still not running:

tools-checker-01:~$ sudo toolscheckerctl status | sort -k2 -r
toolschecker_nfs_showmount stop/waiting
toolschecker_labsdb_labsdb1003 stop/waiting
toolschecker_labsdb_labsdb1003rw stop/waiting
toolschecker_labsdb_labsdb1001 stop/waiting
toolschecker_labsdb_labsdb1001rw stop/waiting
toolschecker_grid_start_precise stop/waiting
toolschecker_continuous_job_precise stop/waiting
toolschecker_service_start start/running, process 2529
toolschecker_webservice_kubernetes start/running, process 2519
toolschecker_kubernetes_nodes_ready start/running, process 2508
toolschecker_kubernetes_etcd start/running, process 2498
toolschecker_flannel_etcd start/running, process 2485
toolschecker_cron start/running, process 2475
toolschecker_grid_start_trusty start/running, process 2465
toolschecker_continuous_job_trusty start/running, process 2455
toolschecker_dumps start/running, process 2446
toolschecker_toolsdb start/running, process 2437
toolschecker_labsdb_labsdb1004rw start/running, process 2423
toolschecker_labsdb_labsdb1005 start/running, process 2407
toolschecker_redis start/running, process 2395
toolschecker_nfs_home start/running, process 2387
toolschecker_ldap start/running, process 2378
toolschecker_nfs_secondary_cluster_showmount start/running, process 2371
toolschecker_labs_private start/running, process 2363
toolschecker_puppet_catalog start/running, process 2355
toolschecker_self start/running, process 2347
GTirloni renamed this task from toolscheckerctl fails to start checks to toolscheckerctl fails to stop/start checks.Jan 9 2019, 10:03 AM
GTirloni triaged this task as Medium priority.
02:05:28 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount status
toolschecker_nfs_showmount stop/waiting
02:06:18 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount start
toolschecker_nfs_showmount start/running, process 14698
02:06:22 0 ✓ zhuyifei1999@tools-checker-01: ~$ sudo service toolschecker_nfs_showmount status
toolschecker_nfs_showmount start/running, process 14698

Starting the first one manually did seem to work. Are they simply not enabled (start on boot)?

Are they simply not enabled (start on boot)?

No. Gotta check logs I guess...

The boot service starting seemed all failed:

Jan  9 09:58:43 tools-checker-01 kernel: [   15.743269] FS-Cache: Netfs 'nfs' registered for caching
Jan  9 09:58:43 tools-checker-01 kernel: [   15.760279] init: Failed to spawn toolschecker main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.784213] init: Failed to spawn toolschecker_labsdb_labsdb1003 main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.802854] init: Failed to spawn toolschecker_continuous_job_trusty main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.815255] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Jan  9 09:58:43 tools-checker-01 kernel: [   15.829076] init: Failed to spawn toolschecker_flannel_etcd main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.853230] init: Failed to spawn toolschecker_labsdb_labsdb1001 main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.855349] init: Failed to spawn toolschecker_puppet_catalog main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.888520] init: Failed to spawn toolschecker_nfs_secondary_cluster_showmount main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.891387] init: Failed to spawn toolschecker_grid_start_trusty main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.915945] init: Failed to spawn toolschecker_labsdb_labsdb1001rw main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.949428] init: Failed to spawn toolschecker_labsdb_labsdb1003rw main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.951630] init: Failed to spawn toolschecker_service_start main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   15.976144] init: Failed to spawn toolschecker_kubernetes_etcd main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.024132] init: Failed to spawn toolschecker_labsdb_labsdb1005 main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.035022] init: Failed to spawn toolschecker_redis main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.044197] init: Failed to spawn toolschecker_toolsdb main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.046291] init: Failed to spawn toolschecker_continuous_job_precise main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.052201] init: Failed to spawn toolschecker_ldap main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.059837] init: Failed to spawn toolschecker_dumps main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.069203] init: Failed to spawn toolschecker_nfs_home main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.071314] init: Failed to spawn toolschecker_labs_private main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.080757] init: Failed to spawn toolschecker_labsdb_labsdb1004rw main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.082921] init: Failed to spawn toolschecker_nfs_showmount main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.101206] init: Failed to spawn toolschecker_self main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.115024] init: Failed to spawn toolschecker_grid_start_precise main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.116600] init: Failed to spawn toolschecker_webservice_kubernetes main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.129750] init: Failed to spawn toolschecker_cron main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.140133] init: Failed to spawn toolschecker_kubernetes_nodes_ready main process: unable to getpwnam: No such file or directory
Jan  9 09:58:43 tools-checker-01 kernel: [   16.759673] init: idmapd main process (939) terminated with status 1
Jan  9 09:58:43 tools-checker-01 kernel: [   16.759688] init: idmapd main process ended, respawning

The current batch was started by puppet:

Jan  9 09:59:58 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/File[/run/toolschecker]/ensure) created
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[self]/Service[toolschecker_self]/ensure) ensure changed 'stopped' to 'running'
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[self]/Service[toolschecker_self]) Unscheduling refresh on Service[toolschecker_self]
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[puppet_catalog]/Service[toolschecker_puppet_catalog]/ensure) ensure changed 'stopped' to 'running'
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[puppet_catalog]/Service[toolschecker_puppet_catalog]) Unscheduling refresh on Service[toolschecker_puppet_catalog]
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labs_private]/Service[toolschecker_labs_private]/ensure) ensure changed 'stopped' to 'running'
Jan  9 09:59:59 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labs_private]/Service[toolschecker_labs_private]) Unscheduling refresh on Service[toolschecker_labs_private]
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_secondary_cluster_showmount]/Service[toolschecker_nfs_secondary_cluster_showmount]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_secondary_cluster_showmount]/Service[toolschecker_nfs_secondary_cluster_showmount]) Unscheduling refresh on Service[toolschecker_nfs_secondary_cluster_showmount]
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[ldap]/Service[toolschecker_ldap]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[ldap]/Service[toolschecker_ldap]) Unscheduling refresh on Service[toolschecker_ldap]
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_home]/Service[toolschecker_nfs_home]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[nfs_home]/Service[toolschecker_nfs_home]) Unscheduling refresh on Service[toolschecker_nfs_home]
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[redis]/Service[toolschecker_redis]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:00 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[redis]/Service[toolschecker_redis]) Unscheduling refresh on Service[toolschecker_redis]
Jan  9 10:00:01 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1005]/Service[toolschecker_labsdb_labsdb1005]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:01 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1005]/Service[toolschecker_labsdb_labsdb1005]) Unscheduling refresh on Service[toolschecker_labsdb_labsdb1005]
Jan  9 10:00:02 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1004rw]/Service[toolschecker_labsdb_labsdb1004rw]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:02 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[labsdb_labsdb1004rw]/Service[toolschecker_labsdb_labsdb1004rw]) Unscheduling refresh on Service[toolschecker_labsdb_labsdb1004rw]
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[toolsdb]/Service[toolschecker_toolsdb]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[toolsdb]/Service[toolschecker_toolsdb]) Unscheduling refresh on Service[toolschecker_toolsdb]
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[dumps]/Service[toolschecker_dumps]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[dumps]/Service[toolschecker_dumps]) Unscheduling refresh on Service[toolschecker_dumps]
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[continuous_job_trusty]/Service[toolschecker_continuous_job_trusty]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:03 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[continuous_job_trusty]/Service[toolschecker_continuous_job_trusty]) Unscheduling refresh on Service[toolschecker_continuous_job_trusty]
Jan  9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[grid_start_trusty]/Service[toolschecker_grid_start_trusty]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[grid_start_trusty]/Service[toolschecker_grid_start_trusty]) Unscheduling refresh on Service[toolschecker_grid_start_trusty]
Jan  9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[cron]/Service[toolschecker_cron]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:04 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[cron]/Service[toolschecker_cron]) Unscheduling refresh on Service[toolschecker_cron]
Jan  9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[flannel_etcd]/Service[toolschecker_flannel_etcd]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[flannel_etcd]/Service[toolschecker_flannel_etcd]) Unscheduling refresh on Service[toolschecker_flannel_etcd]
Jan  9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_etcd]/Service[toolschecker_kubernetes_etcd]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:05 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_etcd]/Service[toolschecker_kubernetes_etcd]) Unscheduling refresh on Service[toolschecker_kubernetes_etcd]
Jan  9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_nodes_ready]/Service[toolschecker_kubernetes_nodes_ready]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[kubernetes_nodes_ready]/Service[toolschecker_kubernetes_nodes_ready]) Unscheduling refresh on Service[toolschecker_kubernetes_nodes_ready]
Jan  9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[webservice_kubernetes]/Service[toolschecker_webservice_kubernetes]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:06 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[webservice_kubernetes]/Service[toolschecker_webservice_kubernetes]) Unscheduling refresh on Service[toolschecker_webservice_kubernetes]
Jan  9 10:00:07 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[service_start]/Service[toolschecker_service_start]/ensure) ensure changed 'stopped' to 'running'
Jan  9 10:00:07 tools-checker-01 puppet-agent[1717]: (/Stage[main]/Toollabs::Checker/Toollabs::Check[service_start]/Service[toolschecker_service_start]) Unscheduling refresh on Service[toolschecker_service_start]

The checkers that weren't started are those that doesn't seem to be managed by puppet. Looking at the list:

toolschecker_nfs_showmount stop/waiting

No idea what this is. I stopped it again.

toolschecker_labsdb_labsdb1003 stop/waiting
toolschecker_labsdb_labsdb1003rw stop/waiting
toolschecker_labsdb_labsdb1001 stop/waiting
toolschecker_labsdb_labsdb1001rw stop/waiting

These labsdb100[13] are some very old databases that have been decommissioned in T184832

toolschecker_grid_start_precise stop/waiting
toolschecker_continuous_job_precise stop/waiting

Precise is dead.

I think they are just some leftovers from the past, and puppet don't clean them up when they become unmanaged.

Yeah. There's a lot of checkers that should never be running.

I've deleted any un-Puppetized checks from tools-checker-{01,02}. The backups are in /root/toolschecker-purge-2019-01-10

$ sudo toolscheckerctl stop
toolschecker_labs_private stop/waiting
toolschecker_toolsdb stop/waiting
toolschecker_dumps stop/waiting
toolschecker_cron stop/waiting
toolschecker_webservice_kubernetes stop/waiting
toolschecker_continuous_job_trusty stop/waiting
toolschecker_labsdb_labsdb1005 stop/waiting
toolschecker_service_start stop/waiting
toolschecker_nfs_secondary_cluster_showmount stop/waiting
toolschecker_nfs_home stop/waiting
toolschecker_kubernetes_nodes_ready stop/waiting
toolschecker_self stop/waiting
toolschecker_flannel_etcd stop/waiting
toolschecker_kubernetes_etcd stop/waiting
toolschecker_grid_start_trusty stop/waiting
toolschecker_puppet_catalog stop/waiting
toolschecker_redis stop/waiting
toolschecker_ldap stop/waiting
toolschecker_labsdb_labsdb1004rw stop/waiting

$ sudo toolscheckerctl start
toolschecker_labs_private start/running, process 3426
toolschecker_toolsdb start/running, process 3436
toolschecker_dumps start/running, process 3445
toolschecker_cron start/running, process 3458
toolschecker_webservice_kubernetes start/running, process 3466
toolschecker_continuous_job_trusty start/running, process 3477
toolschecker_labsdb_labsdb1005 start/running, process 3487
toolschecker_service_start start/running, process 3496
toolschecker_nfs_secondary_cluster_showmount start/running, process 3509
toolschecker_nfs_home start/running, process 3518
toolschecker_kubernetes_nodes_ready start/running, process 3529
toolschecker_self start/running, process 3539
toolschecker_flannel_etcd start/running, process 3550
toolschecker_kubernetes_etcd start/running, process 3558
toolschecker_grid_start_trusty start/running, process 3569
toolschecker_puppet_catalog start/running, process 3577
toolschecker_redis start/running, process 3590
toolschecker_ldap start/running, process 3601
toolschecker_labsdb_labsdb1004rw start/running, process 3611

@Bstorm @zhuyifei1999 thanks for helping me investigate this.

GTirloni claimed this task.