
labstore - replication to codfw broken or not working yet
Closed, Duplicate (Public)

Description

icinga says that on labstore1001, there are issues with either some backup jobs, or the monitoring of those backup jobs.

Last backup of the maps filesystem
Last backup of the others filesystem
Last backup of the tools filesystem

CRITICAL - Last run result for unit replicate-tools was exit-code
CRITICAL - Last run result for unit replicate-others was exit-code

..

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=labstore1001&nostatusheader

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added projects: Toolforge, Cloud-VPS, SRE.
Dzahn subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Since it says "was exit-code", it looks more like a typo in the monitoring script?

on neon i can see the check commands used:

@neon:/etc/icinga# grep check_replicate puppet_services.cfg 
	check_command                  nrpe_check!check_replicate-maps-state!10
	check_command                  nrpe_check!check_replicate-others-state!10
	check_command                  nrpe_check!check_replicate-tools-state!10

but "check_replicate" can't be found anywhere in the puppet repo?


on labstore1001:

cd /etc/nagios/nrpe.d/
cat check_replicate*.cfg

# File generated by puppet. DO NOT edit by hand
command[check_replicate-maps-state]=/usr/local/bin/nrpe_check_systemd_unit_state 'replicate-maps' periodic 90000
# File generated by puppet. DO NOT edit by hand
command[check_replicate-others-state]=/usr/local/bin/nrpe_check_systemd_unit_state 'replicate-others' periodic 90000
# File generated by puppet. DO NOT edit by hand
 /usr/local/bin/nrpe_check_systemd_unit_state 'replicate-maps' periodic 90000
CRITICAL - Last run result for unit replicate-maps was exit-code
103     if state['Result'] != 'success':
104         crit("Last run result for unit %s was %s" % (unit, state['Result']))
  • /usr/local/bin/nrpe_check_systemd_unit_state on labstore is puppetized, but the check_replicate commands and NRPE config seem to be missing from the puppet repo
  • nrpe_check_systemd_unit_state may need a fix around lines 103/104

Never mind, the script is fine; it just gets this value from systemctl, like this:

/bin/systemctl show replicate-maps | grep Result
Result=exit-code
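To make the above concrete, here is a minimal sketch of how a check like this can map the `Result` property from `systemctl show` output to a Nagios-style status. It is not the actual nrpe_check_systemd_unit_state code; the function names and the OK message are hypothetical, and only the CRITICAL message format is taken from the lines quoted above.

```python
def parse_systemctl_show(text):
    """Turn `systemctl show` KEY=VALUE output into a dict (split on the first '=')."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def check_result(unit, show_output):
    """Return (exit_code, message) using the Nagios convention: 0 = OK, 2 = CRITICAL."""
    state = parse_systemctl_show(show_output)
    result = state.get("Result")
    if result != "success":
        # Same message format as lines 103/104 of the real script quoted above.
        return 2, "CRITICAL - Last run result for unit %s was %s" % (unit, result)
    # Hypothetical OK message; the real script's wording is not shown in this task.
    return 0, "OK - Last run result for unit %s was success" % unit
```

So "exit-code" is not a typo: it is the value systemd reports in `Result` when the unit's main process exited with a non-zero status, and the check simply echoes it.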

This is what is executed:

ExecStart={ path=/usr/local/sbin/storage-replicate ; argv[]=/usr/local/sbin/storage-replicate /srv/project/maps labstore2001.codfw.wmnet

and:

usage: storage-replicate [-h] path host dest
storage-replicate: error: the following arguments are required: dest
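The usage message above looks like standard Python argparse output for a parser with three required positionals. As a rough sketch (the real storage-replicate source is not shown here, so this parser is an assumption reconstructed from the usage line), calling it with only a path and a host fails the same way:

```python
import argparse


def build_parser():
    """Hypothetical parser reconstructed from the usage line; not the real script."""
    parser = argparse.ArgumentParser(prog="storage-replicate")
    parser.add_argument("path")
    parser.add_argument("host")
    parser.add_argument("dest")
    return parser


if __name__ == "__main__":
    # Same two arguments as in the ExecStart line: no dest is given, so argparse
    # prints the usage and error to stderr and exits with status 2.
    build_parser().parse_args(["/srv/project/maps", "labstore2001.codfw.wmnet"])
```

A non-zero exit like this is what systemd records as Result=exit-code, which would match what the check reports if the unit is invoked this way.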

Dzahn renamed this task from "failed backups on labstore?" to "labstore - replication to codfw broken or not working yet". Feb 4 2016, 12:11 AM
Dzahn set Security to None.

Old snapshots on labstore2001 had filled up, causing lvs to fail, which in turn caused the backup script to fail. I've cleaned them out on labstore2001 but haven't restarted the replication; I think it should start again on its timer.