On 2018-11-14 we got a page for labnet1001:
PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
At the same time, we had these alerts for labstore1004:
PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
I downtimed both hosts in Icinga for 1h to avoid additional SMS pages and began investigating.
This was present in /var/log/syslog on labstore1004:
Nov 14 12:06:14 labstore1004 maintain-dbusers[18782]: pymysql.err.OperationalError: (1040, 'Too many connections')
And this was present in /var/log/syslog on labnet1001:
aborrero@labnet1001:~ $ sudo grep nova-fullstack /var/log/syslog
Nov 14 12:06:17 labnet1001 kernel: [15801036.224638] init: nova-fullstack main process (14830) terminated with status 1
Nov 14 12:06:17 labnet1001 kernel: [15801036.224658] init: nova-fullstack main process ended, respawning
Nov 14 12:06:18 labnet1001 kernel: [15801037.417144] init: nova-fullstack main process (17255) terminated with status 1
Nov 14 12:06:18 labnet1001 kernel: [15801037.417157] init: nova-fullstack main process ended, respawning
Nov 14 12:06:19 labnet1001 kernel: [15801038.471882] init: nova-fullstack main process (17299) terminated with status 1
Nov 14 12:06:19 labnet1001 kernel: [15801038.471896] init: nova-fullstack main process ended, respawning
Nov 14 12:06:21 labnet1001 kernel: [15801040.428913] init: nova-fullstack main process (17319) terminated with status 1
Nov 14 12:06:21 labnet1001 kernel: [15801040.428925] init: nova-fullstack main process ended, respawning
Nov 14 12:06:23 labnet1001 kernel: [15801042.603344] init: nova-fullstack main process (17329) terminated with status 1
Nov 14 12:06:23 labnet1001 kernel: [15801042.603360] init: nova-fullstack main process ended, respawning
Nov 14 12:06:24 labnet1001 kernel: [15801043.710914] init: nova-fullstack main process (17343) terminated with status 1
Nov 14 12:06:24 labnet1001 kernel: [15801043.710934] init: nova-fullstack main process ended, respawning
Nov 14 12:06:26 labnet1001 kernel: [15801044.917311] init: nova-fullstack main process (17353) terminated with status 1
Nov 14 12:06:26 labnet1001 kernel: [15801044.917331] init: nova-fullstack main process ended, respawning
Nov 14 12:06:27 labnet1001 kernel: [15801046.065372] init: nova-fullstack main process (17364) terminated with status 1
Nov 14 12:06:27 labnet1001 kernel: [15801046.065386] init: nova-fullstack main process ended, respawning
Nov 14 12:06:28 labnet1001 kernel: [15801047.164004] init: nova-fullstack main process (17374) terminated with status 1
Nov 14 12:06:28 labnet1001 kernel: [15801047.164022] init: nova-fullstack main process ended, respawning
Nov 14 12:06:29 labnet1001 kernel: [15801048.121265] init: nova-fullstack main process (17427) terminated with status 1
Nov 14 12:06:29 labnet1001 kernel: [15801048.121283] init: nova-fullstack main process ended, respawning
Nov 14 12:06:30 labnet1001 kernel: [15801049.074531] init: nova-fullstack main process (17434) terminated with status 1
Nov 14 12:06:30 labnet1001 kernel: [15801049.074555] init: nova-fullstack respawning too fast, stopped
Nov 14 12:10:36 labnet1001 puppet-agent[18946]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Service[nova-fullstack]/ensure) ensure changed 'stopped' to 'running'
Nov 14 12:10:36 labnet1001 puppet-agent[18946]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Service[nova-fullstack]) Unscheduling refresh on Service[nova-fullstack]
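For future occurrences, a crash loop like the one in this syslog excerpt could be summarized with a small script instead of reading the grep output by eye. The following is only a sketch: the function name `summarize_respawns` and the returned dictionary layout are made up for illustration, and the regular expressions assume the upstart message wording seen in the log from labnet1001.

```python
import re

# Patterns modeled on the upstart messages seen in the labnet1001 syslog.
TERMINATED = re.compile(
    r"init: (?P<job>\S+) main process \((?P<pid>\d+)\) "
    r"terminated with status (?P<status>\d+)"
)
STOPPED = re.compile(r"init: (?P<job>\S+) respawning too fast, stopped")


def summarize_respawns(lines, job="nova-fullstack"):
    """Count crashes of `job` and note when init gave up respawning it.

    `summarize_respawns` is a hypothetical helper, not an existing tool.
    """
    crashes = 0
    gave_up_at = None
    for line in lines:
        match = TERMINATED.search(line)
        if match and match.group("job") == job:
            crashes += 1
            continue
        match = STOPPED.search(line)
        if match and match.group("job") == job:
            # The first three fields of a syslog line are month, day, time.
            gave_up_at = " ".join(line.split()[:3])
    return {"crashes": crashes, "gave_up_at": gave_up_at}


# Three lines copied from the syslog excerpt.
sample = [
    "Nov 14 12:06:29 labnet1001 kernel: [15801048.121265] init: nova-fullstack main process (17427) terminated with status 1",
    "Nov 14 12:06:30 labnet1001 kernel: [15801049.074531] init: nova-fullstack main process (17434) terminated with status 1",
    "Nov 14 12:06:30 labnet1001 kernel: [15801049.074555] init: nova-fullstack respawning too fast, stopped",
]
summary = summarize_respawns(sample)
print(summary)
```

Fed the full grep output rather than the three sample lines, the same function would report every crash in the loop and the time at which init stopped the job.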
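The two failures started within seconds of each other and maintain-dbusers reported 'Too many connections', so this could be another occurrence of T188589: m5-master overloaded by idle connections to the nova database.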
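One way to check the idle-connection hypothesis is to count sleeping connections per user in the output of MySQL's `SHOW PROCESSLIST`, where idle connections show `Sleep` in the `Command` column. The sketch below assumes the rows have already been fetched as dictionaries keyed by the standard column names (for example with a dictionary-style cursor). The helper name `idle_connections_by_user` and the sample rows are invented for illustration and are not data from this incident.

```python
from collections import Counter


def idle_connections_by_user(rows, min_idle_seconds=0):
    """Count connections in the 'Sleep' state per user.

    `rows` is assumed to be SHOW PROCESSLIST output as a list of dicts
    with at least the 'User', 'Command' and 'Time' columns.
    This is a hypothetical helper, not an existing tool.
    """
    counts = Counter()
    for row in rows:
        if row["Command"] == "Sleep" and row["Time"] >= min_idle_seconds:
            counts[row["User"]] += 1
    return counts


# Made-up rows, only to show the expected input shape.
example_rows = [
    {"Id": 1, "User": "nova", "Command": "Sleep", "Time": 3600},
    {"Id": 2, "User": "nova", "Command": "Sleep", "Time": 12},
    {"Id": 3, "User": "nova", "Command": "Query", "Time": 0},
    {"Id": 4, "User": "other", "Command": "Sleep", "Time": 900},
]
idle = idle_connections_by_user(example_rows, min_idle_seconds=60)
print(idle.most_common())
```

Comparing the per-user totals with the server's `max_connections` setting would show how much of the connection budget is held by idle sessions.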