TL;DR: RAM issue in DIMM A3 & A7; hardware support needed.
On 2018-10-27 at 21:12 UTC the following alerts appeared on the #wikimedia-operations IRC channel:
+icinga-wm> IRC echo bot PROBLEM - Host db1117 is DOWN: PING CRITICAL - Packet loss = 100%
23:13 PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
23:13 PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1
The host was indeed down; I was not able to open an SSH connection to it.
I connected to the server's management interface and started a virtual console, but the console was blank and showed no activity.
I checked the host's power status, and it was 'ON'.
I decided to do a power cycle on the host with:
racadm serveraction powercycle
After the power cycle the host booted up.
On the console I saw that the Prometheus exporter was not able to start:
[ OK ] Started Prometheus exporter for MySQL server.
[ OK ] Stopped Prometheus exporter for MySQL server.
[FAILED] Failed to start Prometheus exporter for MySQL server.
See 'systemctl status prometheus-mysqld-exporter.service' for details.
[ OK ] Started LSB: exim Mail Transport Agent.
I started the host's MariaDB instances one by one:
systemctl start mariadb@m1
systemctl start mariadb@m2
systemctl start mariadb@m3
systemctl start mariadb@m5
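The same start sequence can be sketched as a loop. This is only an illustration, not a tool that exists on the host: the function names unit_names and start_all are made up here, and the instance list is simply the one from this report.

```shell
# Instance names as listed in this report (m1, m2, m3, m5).
instances="m1 m2 m3 m5"

# Print one systemd unit name per instance, e.g. mariadb@m1.
unit_names() {
    for i in $instances; do
        printf 'mariadb@%s\n' "$i"
    done
}

# Hypothetical helper: start each unit in order and stop at the
# first one that fails to start or is not active afterwards.
start_all() {
    for unit in $(unit_names); do
        systemctl start "$unit" || return 1
        systemctl is-active --quiet "$unit" || return 1
    done
}
```

Calling start_all as root would run the four systemctl start commands above in order; it does not check replication, which still has to be verified separately.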
The services started up without any trouble. Replication started and the instances caught up.
According to the output of journalctl -u mariadb@m1 (and m2, m3, m5), all recovery (where needed) finished.
All the Icinga checks became OK except the systemd state check.
Investigating the prometheus-mysqld-exporter service showed that it could not start because it could not find the /var/lib/prometheus/.my.cnf file.
I checked db2078 (also a multi-instance host, in the same cluster) and looked in the /var/lib/prometheus home directory for a .my.cnf file. It was there, and it indicated that the mysqld instance is available at /run/mysqld/mysqld.sock, which is not true: no MySQL instance has that socket.
I manually created the same .my.cnf file on db1117 and started the Prometheus mysqld exporter, which then reported a running state.
This is not good, we have to puppetize this.
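Until this is puppetized, a small check could flag both problems seen here: a missing .my.cnf, and a .my.cnf that names a socket no instance provides. The following is a minimal sketch, not an existing script; the function names cnf_socket and check_cnf are hypothetical, and it assumes the file uses a plain "socket = /path" line.

```shell
# Print the socket path configured in a my.cnf-style file (first match only).
cnf_socket() {
    sed -n 's/^[[:space:]]*socket[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Succeed only if the file exists and the socket it names is a real socket file.
check_cnf() {
    [ -f "$1" ] || { echo "missing: $1"; return 1; }
    sock=$(cnf_socket "$1")
    [ -S "$sock" ] || { echo "no socket at: $sock"; return 1; }
    echo "ok: $sock"
}
```

Run as check_cnf /var/lib/prometheus/.my.cnf, this would be expected to report the missing file on db1117 before the manual fix, and a missing socket on a host whose file points at /run/mysqld/mysqld.sock while no instance listens there.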
After the server became available I started to investigate the root cause of the outage, but I was not able to find anything.
The syslog file has the following entries around the given timestamp:
Oct 27 21:08:44 db1117 systemd[1]: Started Time & Date Service. Oct 27 21:09:01 db1117 CRON[22366]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom) ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'nf_conntrack' Oct 27 21:28:47 db1117 systemd-modules-load[873]: Inserted module 'ipmi_devintf' Oct 27 21:28:47 db1117 systemd[1]: Started Load Kernel Modules. Oct 27 21:28:47 db1117 systemd[1]: Started Remount Root and Kernel File Systems. Oct 27 21:28:47 db1117 systemd[1]: Started Create list of required static device nodes for the current kernel. Oct 27 21:28:47 db1117 systemd[1]: Started LVM2 metadata daemon. Oct 27 21:28:47 db1117 systemd[1]: Starting Create Static Device Nodes in /dev...
The /var/log/messages file also has nothing between the last entry before the outage and the first boot message:
Oct 27 20:57:08 db1117 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Oct 27 21:28:47 db1117 kernel: [ 0.000000] microcode: microcode updated early to revision 0xb00002e, date = 2018-04-19
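The ^@ characters in the syslog excerpt are NUL bytes, which typically show up when a host goes down before buffered log data reaches the disk. A quick way to confirm that a log file contains them is sketched below; count_nul is a hypothetical helper name, and the log path in the usage note is an assumption.

```shell
# Count NUL bytes in a file by deleting every other byte and counting the rest.
# A non-zero result is consistent with an unclean shutdown while the file was being written.
count_nul() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}
```

For example, count_nul /var/log/syslog prints the number of NUL bytes in that file, and 0 for a file with none.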
In the Prometheus graphs there are some anomalies around that time period, but I am not sure how to interpret them.