Page MenuHomePhabricator

2021-01-15: ** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL **
Closed, ResolvedPublic

Description

Service: Ensure mysql credential creation for tools users is running
Host: labstore1004
Address: 10.64.37.19
State: CRITICAL

Date/Time: Fri Jan 15 09:10:56 UTC 2021

Notes URLs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

Acknowledged by :

Additional Info:

CRITICAL - Expecting active but unit maintain-dbusers is failed

Related Objects

StatusSubtypeAssignedTask
ResolvedMarostegui
OpenNone
OpenNone
Resolvedfnegri
ResolvedRobH
Resolved Bstorm
Resolved Bstorm
ResolvedMarostegui
ResolvedMarostegui
StalledNone
ResolvedNone
ResolvedMarostegui
ResolvedMarostegui
ResolvedRobH
ResolvedMarostegui
ResolvedMarostegui
Resolved Cmjohnson
Resolveddcaro

Event Timeline

dcaro renamed this task from 2021-01** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL ** to 2021-01-15: ** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL **.Jan 15 2021, 9:17 AM
dcaro claimed this task.
dcaro updated the task description. (Show Details)

Looking it seems that the server labstore1004 is unable to connect to the DB (timeout):

Jan 15 09:03:15 labstore1004 /usr/local/sbin/maintain-dbusers[21751]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MySQL server on 'clouddb1019.eqiad.wmnet' (timed out)").

The ones that failed so far:

dcaro@labstore1004:~$ sudo journalctl -u maintain-dbusers.service | egrep -o 'Could not connect to [^ ]*' | awk '{print $5}'  | sort | uniq
172.16.7.153
clouddb1014.eqiad.wmnet:3312
clouddb1015.eqiad.wmnet:3316
clouddb1017.eqiad.wmnet:3311
clouddb1017.eqiad.wmnet:3313
clouddb1018.eqiad.wmnet:3312
clouddb1019.eqiad.wmnet:3314
clouddb1019.eqiad.wmnet:3316
clouddb1020.eqiad.wmnet:3318

Port scan:

dcaro@labstore1004:~$ for server_port in $(sudo journalctl -u maintain-dbusers.service | egrep -o 'Could not connect to [^ ]*' | awk '{print $5}'  | sort | uniq); do nc -w2 -z "${server_port%%:*}" "${server_port##*:}" && echo "$server_port alive" || echo "$server_port dead"; done
172.16.7.153 dead
clouddb1014.eqiad.wmnet:3312 alive
clouddb1015.eqiad.wmnet:3316 alive
clouddb1017.eqiad.wmnet:3311 alive
clouddb1017.eqiad.wmnet:3313 alive
clouddb1018.eqiad.wmnet:3312 alive
clouddb1019.eqiad.wmnet:3314 dead
clouddb1019.eqiad.wmnet:3316 dead
clouddb1020.eqiad.wmnet:3318 alive

I got confirmation from @Marostegui that wiki replicas are under maintenance right now. Some of them were rebooted for kernel upgrades, clouddb1019 never came back from the reboot (see T272125).

So I guess this failure is expected.

However I also see:

Jan 15 04:17:36 labstore1004 /usr/local/sbin/maintain-dbusers[44221]: Could not connect to 172.16.7.153 due to (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)').  Skip
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]: Traceback (most recent call last):
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 778, in <module>
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     main()
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 768, in main
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     create_accounts(config)
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 569, in create_accounts
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     labsdb.close()
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 721, in close
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     raise err.Error("Already closed")
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]: pymysql.err.Error: Already closed
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Unit entered failed state.
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Failed with result 'exit-code'.
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Service hold-off time over, scheduling restart.
Jan 15 04:17:36 labstore1004 systemd[1]: Stopped Maintain labsdb accounts.
Jan 15 04:17:36 labstore1004 systemd[1]: Started Maintain labsdb accounts.
Jan 15 04:17:42 labstore1004 /usr/local/sbin/maintain-dbusers[23437]: Could not connect to 172.16.7.153 due to (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))").  Skipp
Jan 15 04:17:42 labstore1004 maintain-dbusers[23437]: Traceback (most recent call last):

Which indicates toolsdb (clouddb1001.clouddb-services.eqiad1.wikimedia.cloud) had some kind of issue too?

There isn't much on journal logs for that time on the toolsdb host:

Jan 15 04:11:01 clouddb1001 CRON[27677]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:11:01 clouddb1001 CRON[27678]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:11:01 clouddb1001 CRON[27677]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:12:01 clouddb1001 CRON[27680]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:12:01 clouddb1001 CRON[27681]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:12:01 clouddb1001 systemd[1]: Started Auto restart job: ssh.
Jan 15 04:12:01 clouddb1001 CRON[27680]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:12:01 clouddb1001 wmf-auto-restart[27682]: INFO: 2021-01-15 04:12:01,835 : No restart necessary for service ssh
Jan 15 04:13:01 clouddb1001 CRON[27690]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:13:01 clouddb1001 CRON[27691]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:13:01 clouddb1001 CRON[27690]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:14:01 clouddb1001 CRON[27706]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:14:01 clouddb1001 CRON[27707]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:14:02 clouddb1001 CRON[27706]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:15:01 clouddb1001 CRON[27710]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27709]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27712]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:15:01 clouddb1001 CRON[27713]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 15 04:15:01 clouddb1001 CRON[27711]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27714]: (root) CMD (/usr/local/bin/prometheus-ssh_open_sessions /var/lib/prometheus/node.d/ssh_open_sessions.prom)
Jan 15 04:15:01 clouddb1001 CRON[27711]: pam_unix(cron:session): session closed for user root
Jan 15 04:15:01 clouddb1001 CRON[27709]: pam_unix(cron:session): session closed for user root
Jan 15 04:15:01 clouddb1001 CRON[27710]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:16:01 clouddb1001 CRON[27722]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:16:01 clouddb1001 CRON[27723]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:16:02 clouddb1001 CRON[27722]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:17:01 clouddb1001 CRON[27726]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:17:01 clouddb1001 CRON[27725]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:17:01 clouddb1001 CRON[27727]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:17:01 clouddb1001 CRON[27728]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 15 04:17:01 clouddb1001 CRON[27725]: pam_unix(cron:session): session closed for user root
Jan 15 04:17:01 clouddb1001 CRON[27726]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:18:01 clouddb1001 CRON[27731]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:18:01 clouddb1001 CRON[27732]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:18:02 clouddb1001 CRON[27731]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:19:01 clouddb1001 CRON[27736]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:19:01 clouddb1001 CRON[27737]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:19:01 clouddb1001 CRON[27736]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:20:01 clouddb1001 CRON[27739]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:20:01 clouddb1001 CRON[27741]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:20:01 clouddb1001 CRON[27740]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:20:01 clouddb1001 CRON[27743]: (root) CMD (/usr/local/bin/prometheus-ssh_open_sessions /var/lib/prometheus/node.d/ssh_open_sessions.prom)
Jan 15 04:20:01 clouddb1001 CRON[27739]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:20:02 clouddb1001 CRON[27740]: pam_unix(cron:session): session closed for user root
aborrero triaged this task as Medium priority.Jan 15 2021, 10:08 AM

Change 656439 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain-dbusers: skipping servers needs to skip closing too

https://gerrit.wikimedia.org/r/656439

Change 656439 merged by Bstorm:
[operations/puppet@production] maintain-dbusers: skipping servers needs to skip closing too

https://gerrit.wikimedia.org/r/656439

That fixed it:

Jan 15 15:59:19 labstore1004 systemd[1]: Started Maintain labsdb accounts.
Jan 15 15:59:34 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3314 due to (2003, "Can't connect to MyS
Jan 15 15:59:34 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3314: Already closed
Jan 15 15:59:44 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MyS
Jan 15 15:59:44 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3316: Already closed
Jan 15 16:02:08 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3314 due to (2003, "Can't connect to MyS
Jan 15 16:02:08 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3314: Already closed
Jan 15 16:02:18 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MyS
Jan 15 16:02:18 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3316: Already closed

Mentioned in SAL (#wikimedia-operations) [2021-01-15T16:17:22Z] <bstorm> canceled downtime for maintain-dbusers on labstore1004 T272127