Page MenuHomePhabricator

2021-01-15: ** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL **
Closed, ResolvedPublic

Description

Service: Ensure mysql credential creation for tools users is running
Host: labstore1004
Address: 10.64.37.19
State: CRITICAL

Date/Time: Fri Jan 15 09:10:56 UTC 2021

Notes URLs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

Acknowledged by :

Additional Info:

CRITICAL - Expecting active but unit maintain-dbusers is failed

Related Objects

StatusSubtypeAssignedTask
ResolvedMarostegui
OpenNone
OpenJhernandez
ResolvedRobH
OpenBstorm
ResolvedBstorm
ResolvedMarostegui
ResolvedMarostegui
OpenNone
OpenNone
ResolvedJdforrester-WMF
OpenMarostegui
ResolvedRobH
OpenNone
OpenNone
ResolvedMarostegui
OpenMarostegui
ResolvedCmjohnson
Resolveddcaro

Event Timeline

dcaro renamed this task from 2021-01** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL ** to 2021-01-15: ** PROBLEM alert - labstore1004/Ensure mysql credential creation for tools users is running is CRITICAL **.Jan 15 2021, 9:17 AM
dcaro claimed this task.
dcaro updated the task description. (Show Details)

Looking it seems that the server labstore1004 is unable to connect to the DB (timeout):

Jan 15 09:03:15 labstore1004 /usr/local/sbin/maintain-dbusers[21751]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MySQL server on 'clouddb1019.eqiad.wmnet' (timed out)").

The ones that failed so far:

dcaro@labstore1004:~$ sudo journalctl -u maintain-dbusers.service | egrep -o 'Could not connect to [^ ]*' | awk '{print $5}'  | sort | uniq
172.16.7.153
clouddb1014.eqiad.wmnet:3312
clouddb1015.eqiad.wmnet:3316
clouddb1017.eqiad.wmnet:3311
clouddb1017.eqiad.wmnet:3313
clouddb1018.eqiad.wmnet:3312
clouddb1019.eqiad.wmnet:3314
clouddb1019.eqiad.wmnet:3316
clouddb1020.eqiad.wmnet:3318

Port scan:

dcaro@labstore1004:~$ for server_port in $(sudo journalctl -u maintain-dbusers.service | egrep -o 'Could not connect to [^ ]*' | awk '{print $5}'  | sort | uniq); do nc -w2 -z "${server_port%%:*}" "${server_port##*:}" && echo "$server_port alive" || echo "$server_port dead"; done
172.16.7.153 dead
clouddb1014.eqiad.wmnet:3312 alive
clouddb1015.eqiad.wmnet:3316 alive
clouddb1017.eqiad.wmnet:3311 alive
clouddb1017.eqiad.wmnet:3313 alive
clouddb1018.eqiad.wmnet:3312 alive
clouddb1019.eqiad.wmnet:3314 dead
clouddb1019.eqiad.wmnet:3316 dead
clouddb1020.eqiad.wmnet:3318 alive

I got confirmation from @Marostegui that wiki replicas are under maintenance right now. Some of them were rebooted for kernel upgrades, clouddb1019 never came back from the reboot (see T272125).

So I guess this failure is expected.

However I also see:

Jan 15 04:17:36 labstore1004 /usr/local/sbin/maintain-dbusers[44221]: Could not connect to 172.16.7.153 due to (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)').  Skip
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]: Traceback (most recent call last):
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 778, in <module>
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     main()
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 768, in main
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     create_accounts(config)
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/local/sbin/maintain-dbusers", line 569, in create_accounts
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     labsdb.close()
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 721, in close
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]:     raise err.Error("Already closed")
Jan 15 04:17:36 labstore1004 maintain-dbusers[44221]: pymysql.err.Error: Already closed
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Main process exited, code=exited, status=1/FAILURE
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Unit entered failed state.
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Failed with result 'exit-code'.
Jan 15 04:17:36 labstore1004 systemd[1]: maintain-dbusers.service: Service hold-off time over, scheduling restart.
Jan 15 04:17:36 labstore1004 systemd[1]: Stopped Maintain labsdb accounts.
Jan 15 04:17:36 labstore1004 systemd[1]: Started Maintain labsdb accounts.
Jan 15 04:17:42 labstore1004 /usr/local/sbin/maintain-dbusers[23437]: Could not connect to 172.16.7.153 due to (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))").  Skipp
Jan 15 04:17:42 labstore1004 maintain-dbusers[23437]: Traceback (most recent call last):

Which indicates toolsdb (clouddb1001.clouddb-services.eqiad1.wikimedia.cloud) had some kind of issue too?

There isn't much on journal logs for that time on the toolsdb host:

Jan 15 04:11:01 clouddb1001 CRON[27677]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:11:01 clouddb1001 CRON[27678]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:11:01 clouddb1001 CRON[27677]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:12:01 clouddb1001 CRON[27680]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:12:01 clouddb1001 CRON[27681]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:12:01 clouddb1001 systemd[1]: Started Auto restart job: ssh.
Jan 15 04:12:01 clouddb1001 CRON[27680]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:12:01 clouddb1001 wmf-auto-restart[27682]: INFO: 2021-01-15 04:12:01,835 : No restart necessary for service ssh
Jan 15 04:13:01 clouddb1001 CRON[27690]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:13:01 clouddb1001 CRON[27691]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:13:01 clouddb1001 CRON[27690]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:14:01 clouddb1001 CRON[27706]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:14:01 clouddb1001 CRON[27707]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:14:02 clouddb1001 CRON[27706]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:15:01 clouddb1001 CRON[27710]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27709]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27712]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:15:01 clouddb1001 CRON[27713]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 15 04:15:01 clouddb1001 CRON[27711]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:15:01 clouddb1001 CRON[27714]: (root) CMD (/usr/local/bin/prometheus-ssh_open_sessions /var/lib/prometheus/node.d/ssh_open_sessions.prom)
Jan 15 04:15:01 clouddb1001 CRON[27711]: pam_unix(cron:session): session closed for user root
Jan 15 04:15:01 clouddb1001 CRON[27709]: pam_unix(cron:session): session closed for user root
Jan 15 04:15:01 clouddb1001 CRON[27710]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:16:01 clouddb1001 CRON[27722]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:16:01 clouddb1001 CRON[27723]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:16:02 clouddb1001 CRON[27722]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:17:01 clouddb1001 CRON[27726]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:17:01 clouddb1001 CRON[27725]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:17:01 clouddb1001 CRON[27727]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:17:01 clouddb1001 CRON[27728]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 15 04:17:01 clouddb1001 CRON[27725]: pam_unix(cron:session): session closed for user root
Jan 15 04:17:01 clouddb1001 CRON[27726]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:18:01 clouddb1001 CRON[27731]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:18:01 clouddb1001 CRON[27732]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:18:02 clouddb1001 CRON[27731]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:19:01 clouddb1001 CRON[27736]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:19:01 clouddb1001 CRON[27737]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:19:01 clouddb1001 CRON[27736]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:20:01 clouddb1001 CRON[27739]: pam_unix(cron:session): session opened for user prometheus by (uid=0)
Jan 15 04:20:01 clouddb1001 CRON[27741]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jan 15 04:20:01 clouddb1001 CRON[27740]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 15 04:20:01 clouddb1001 CRON[27743]: (root) CMD (/usr/local/bin/prometheus-ssh_open_sessions /var/lib/prometheus/node.d/ssh_open_sessions.prom)
Jan 15 04:20:01 clouddb1001 CRON[27739]: pam_unix(cron:session): session closed for user prometheus
Jan 15 04:20:02 clouddb1001 CRON[27740]: pam_unix(cron:session): session closed for user root
aborrero triaged this task as Medium priority.Jan 15 2021, 10:08 AM

Change 656439 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain-dbusers: skipping servers needs to skip closing too

https://gerrit.wikimedia.org/r/656439

Change 656439 merged by Bstorm:
[operations/puppet@production] maintain-dbusers: skipping servers needs to skip closing too

https://gerrit.wikimedia.org/r/656439

That fixed it:

Jan 15 15:59:19 labstore1004 systemd[1]: Started Maintain labsdb accounts.
Jan 15 15:59:34 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3314 due to (2003, "Can't connect to MyS
Jan 15 15:59:34 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3314: Already closed
Jan 15 15:59:44 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MyS
Jan 15 15:59:44 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3316: Already closed
Jan 15 16:02:08 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3314 due to (2003, "Can't connect to MyS
Jan 15 16:02:08 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3314: Already closed
Jan 15 16:02:18 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not connect to clouddb1019.eqiad.wmnet:3316 due to (2003, "Can't connect to MyS
Jan 15 16:02:18 labstore1004 /usr/local/sbin/maintain-dbusers[51677]: Could not close connection to clouddb1019.eqiad.wmnet:3316: Already closed

Mentioned in SAL (#wikimedia-operations) [2021-01-15T16:17:22Z] <bstorm> canceled downtime for maintain-dbusers on labstore1004 T272127