
Investigate cp1044's strange Ganglia graphs
Closed, Declined · Public

Description

Since yesterday, cp1044's Ganglia graphs seem completely messed up: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1044.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1460840203&g=cpu_report&z=large&c=Maps%20caches%20eqiad

I don't know what's currently going on with cp1044: perhaps Icinga says the server has issues (which I can't see because I have no access), the ganglia daemon just has some problems, or this is expected behaviour. I hope someone can take a look. cp1044 still seems to serve content properly judging by the X-Cache header on maps.wikimedia.org, so I guess restarting the ganglia daemon should fix it.
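(For reference, the X-Cache check amounts to something like the following sketch; which cache hostnames show up depends on how the request is routed.)

$ curl -sI https://maps.wikimedia.org/ | grep -i '^x-cache'    # response should list the caches that handled the request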

Event Timeline

so this is _just_ for cp1044, right?

killed gmond on cp1044 (which was busy running varnishstat), then started it again with /etc/init.d/ganglia-monitor start
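For the record, that amounts to roughly the following (a sketch; the actual kill may have targeted the specific pid):

cp1044:~$ sudo pkill gmond                          # stop the gmond that was stuck running varnishstat
cp1044:~$ sudo /etc/init.d/ganglia-monitor start    # start it again via the init script
cp1044:~$ pgrep -a gmond                            # confirm a fresh gmond process is up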

does not look like it fixed it. cp1044 is all green in Icinga

@Dzahn: cp1044 is one of the new hosts with Varnish 4; there was a problem with gmond that ema fixed a couple of weeks ago. Not sure about this one, though!

Aha, that's a good hint. @ema, is it possible this one is different from the others because it was used to test the fix or something?

I'm seeing graphs for cp1044 in ganglia (for misc eqiad though, not maps caches) https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cp1044.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS

However, the service is reporting some failures with respect to varnish:

cp1044:~$ sudo systemctl status ganglia-monitor -l
● ganglia-monitor.service - (null)
   Loaded: loaded (/etc/init.d/ganglia-monitor)
   Active: active (running) since Mon 2016-04-25 23:00:39 UTC; 1 day 16h ago
  Process: 26349 ExecStop=/etc/init.d/ganglia-monitor stop (code=exited, status=0/SUCCESS)
  Process: 26623 ExecStart=/etc/init.d/ganglia-monitor start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/ganglia-monitor.service
           └─26625 /usr/sbin/gmond --pid-file /var/run/gmond.pid

Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MGT.child_died'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MAIN.bans_lurker_tests_tested'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MAIN.sess_drop'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MEMPOOL.req0.randry'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MAIN.pools'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.SMA.s0.g_alloc'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MEMPOOL.busyobj.toosmall'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MEMPOOL.sess0.live'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.MEMPOOL.sess0.sz_wanted'. Possible that the module has not been loaded.
Apr 25 23:00:44 cp1044 /usr/sbin/gmond[26625]: Unable to find the metric information for 'frontend.SMA.s0.c_fail'. Possible that the module has not been loaded.
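Given those errors, a quick sanity check is whether the ganglia user can read Varnish's shared memory stats at all (a sketch, assuming Varnish 4's default VSM path under /var/lib/varnish; named instances would need varnishstat -n):

cp1044:~$ ls -l /var/lib/varnish/$(hostname)/                          # VSM file ownership and permissions
cp1044:~$ sudo -u ganglia varnishstat -1 >/dev/null && echo readable   # fails if the ganglia user cannot open the VSM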
fgiunchedi triaged this task as Medium priority. Apr 27 2016, 3:52 PM

@Southparkfan see the link in the comment above. It seems to be just a matter of which cluster the server is in: your link had the "maps caches eqiad" part, but it's in "misc caches eqiad" (maybe that changed at some point), and the graphs are there. Enough investigation?

Dzahn lowered the priority of this task from Medium to Low. Apr 27 2016, 7:13 PM

@elukey: the ganglia/varnish issue we saw when upgrading cp1044 was due to the VSM files not being readable by the ganglia user since the v4 upgrade.
The solution is simple: add the ganglia user to the varnish group. Unfortunately I failed to do that in puppet: https://gerrit.wikimedia.org/r/#/c/281918/. Any help on that front would be greatly appreciated!
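For the record, the manual equivalent of that change would be roughly the following (a sketch only; the gerrit change above is the proper, puppetized fix):

cp1044:~$ sudo usermod -a -G varnish ganglia       # give the ganglia user read access to the varnish-owned VSM files
cp1044:~$ sudo service ganglia-monitor restart     # restart gmond so the new group membership takes effect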

At any rate, the problem described in this ticket is probably due to cp1044 not being part of any cluster anymore.

cp1044 has been decommissioned per T133614.