
block labs IPs from sending data to prod ganglia
Closed, ResolvedPublic

Description

the misc_eqiad cluster in ganglia shows labs IPs; block that UDP traffic

As of May 10th

list of IPs that still show up now, and the names they resolve to:

| IP | Status | FQDN |
| 10.68.17.70 | OK | integration-slave-precise-1011.integration.eqiad.wmflabs. |
| 10.68.16.53 | OK | integration-raita.integration.eqiad.wmflabs. |
| 10.68.17.87 | OK | integration-slave-precise-1002.integration.eqiad.wmflabs. |
| 10.64.48.132 | OK | 3(NXDOMAIN) (fixed by krenair) |
| 10.68.16.36 | ?? | proofreadpage.wikisource-dev.eqiad.wmflabs. |
| 10.68.16.78 | OK | language-dev.language.eqiad.wmflabs. |
| 10.68.16.94 | ?? | limn1.analytics.eqiad.wmflabs. |
| 10.68.16.179 | OK | graphite1.graphite.eqiad.wmflabs. |
| 10.68.16.201 | ?? | phab-01.phabricator.eqiad.wmflabs. |
| 10.68.16.247 | ?? | phab-test.contributors.eqiad.wmflabs. |
| 10.68.16.253 | OK | property-suggester.wikidata-dev.eqiad.wmflabs. |
| 10.68.17.10 | OK | vitalsigns-01.dashiki.eqiad.wmflabs. |
| 10.68.17.134 | OK | wlmjurytool2014.wlmjurytool.eqiad.wmflabs. |
| 10.68.17.235 | OK | fulltext.full-text-reference-tool.eqiad.wmflabs. |
| 10.68.19.15 | ?? | rt1.servermon.eqiad.wmflabs. |
| 10.68.19.245 | ?? | maintenance.analytics.eqiad.wmflabs. |
| 10.68.22.38 | OK | graphite-labs.graphite.eqiad.wmflabs. |
| 10.68.23.143 | OK | integration-slave-trusty-1004.integration.eqiad.wmflabs. |

Potential way to get rid of Ganglia on an instance:

dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond
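
A hedged sanity check after running that (assuming the default gmond UDP port 8649 and that netstat is available on the instance):

# confirm nothing is still bound to or sending on the default gmond port 8649
netstat -anup | grep 8649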

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description. (Show Details)
Dzahn added a project: acl*sre-team.
Dzahn subscribed.
Dzahn added a subscriber: ori.
Dzahn triaged this task as High priority. Oct 19 2015, 11:14 PM
Dzahn added a project: netops.

Change 284912 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: don't run ganglia-monitor in labs

https://gerrit.wikimedia.org/r/284912

I think to properly fix this we'd need PRODUCTION_NETWORKS from https://gerrit.wikimedia.org/r/#/c/260926/, though https://gerrit.wikimedia.org/r/#/c/284912/ is a partial fix, since we shouldn't be running ganglia in labs anyway.

Change 284912 merged by Dzahn:
ganglia: don't run ganglia-monitor in labs

https://gerrit.wikimedia.org/r/284912

thanks @Dzahn! There are another ~25 labs instances reporting data to misc-eqiad, likely because they are not running puppet or are using a self-hosted puppet master, but we can fix that with ferm and/or ACLs.

thanks for the fix. definitely fewer IPs in there now. the remaining ones i see currently:

10.68.16.147 (down)
10.68.17.204 (down)

10.68.16.53 (integration-raita.integration.eqiad.wmflabs.): < mutante> !log integration integration-raita "Could not find class role::ci::raita" puppet error. manually stopping ganglia-monitor - apt-get remove ganglia-monitor

10.68.16.62
10.68.16.66
10.68.16.78
10.68.16.94
10.68.16.179
10.68.16.201
10.68.16.247
10.68.16.253
10.68.17.10
10.68.17.70
10.68.17.87
10.68.17.134
10.68.17.207
10.68.17.235
10.68.18.253
10.68.19.15
10.68.19.37
10.68.19.245
10.68.21.109
10.68.22.38
10.68.23.143

12:31 < mutante> if i gave you a list of just IP addresses, could you easily run a salt command on that?
12:31 < mutante> (they are instances)
12:33 < YuviPanda> mutante: salt is useless in labs mostly. I recommend xargs + ssh (which is what I use) + root@
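
A minimal sketch of that xargs + ssh approach, assuming a hypothetical file ips.txt with one instance IP per line and working root SSH access to each instance:

# run the ganglia removal one-liner on every IP listed in ips.txt
xargs -a ips.txt -I{} ssh -n root@{} 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond'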

Mentioned in SAL [2016-04-27T19:46:42Z] <mutante> - gptest1.catgraph manually stopping ganglia (T115330)

host 10.68.16.66 is special, look how many names that has:

host 10.68.16.66 | wc -l
27

all in contintcloud.eqiad.wmflabs

Mentioned in SAL [2016-04-27T20:01:33Z] <mutante> language-dev dpkg-configure -a to fix borked dpkg (manually interrupted dist-upgrade?) , manually removing ganglia (T115330)

hashar> mutante: chasemp relevant task is T126518 duplicate of T115194

As part of T134808, on CI and the beta cluster I have run:

salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia'

Some instances still had gmond running, which prevented the purge of the package with the error "ganglia is still logged in"; in those cases I killed the gmond process.
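
A hedged sketch of that workaround, assuming the usual ganglia-monitor init service name on these Precise/Trusty instances:

# stop gmond first so the purge does not fail, falling back to killall
service ganglia-monitor stop || killall -9 gmond
dpkg --purge libganglia1 ganglia-monitor
rm -fR /etc/ganglia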

That also gets rid of /etc/ganglia, which causes puppet to randomly fail whenever a class ends up relying on include ganglia or adding a custom monitoring probe under /etc/ganglia/. I have some patches attached to T134808 that rely on $::standard::has_ganglia to skip those bits on labs.

Mentioned in SAL [2016-05-09T20:57:59Z] <hashar> Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330

list of IPs that still show up now, and the names they resolve to:

10.68.17.70    integration-slave-precise-1011.integration.eqiad.wmflabs.
10.68.16.53    integration-raita.integration.eqiad.wmflabs.
10.68.17.87    integration-slave-precise-1002.integration.eqiad.wmflabs.
10.64.48.132   3(NXDOMAIN)
10.68.16.36    proofreadpage.wikisource-dev.eqiad.wmflabs.
10.68.16.78    language-dev.language.eqiad.wmflabs.
10.68.16.94    limn1.analytics.eqiad.wmflabs.
10.68.16.179   graphite1.graphite.eqiad.wmflabs.
10.68.16.201   phab-01.phabricator.eqiad.wmflabs.
10.68.16.247   phab-test.contributors.eqiad.wmflabs.
10.68.16.253   property-suggester.wikidata-dev.eqiad.wmflabs.
10.68.17.10    vitalsigns-01.dashiki.eqiad.wmflabs.
10.68.17.134   wlmjurytool2014.wlmjurytool.eqiad.wmflabs.
10.68.17.235   fulltext.full-text-reference-tool.eqiad.wmflabs.
10.68.19.15    rt1.servermon.eqiad.wmflabs.
10.68.19.245   maintenance.analytics.eqiad.wmflabs.
10.68.22.38    graphite-labs.graphite.eqiad.wmflabs.
10.68.23.143   integration-slave-trusty-1004.integration.eqiad.wmflabs.

Ran dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia on rt1.servermon.eqiad.wmflabs (10.68.19.15)

Puppet is broken on that instance: Could not find class role::racktables

Edit: Caused by https://gerrit.wikimedia.org/r/#/c/285308/ - had to reinstall the packages to get puppet running again, then it did the ensure=>removed thing on them

I can't log in to phab-01.phabricator.eqiad.wmflabs (10.68.16.201), even as root. Maybe someone with access to the labs salt master can get in.
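
A hedged sketch of what someone on the labs salt master could run, assuming phab-01's minion ID matches its hostname (hypothetical target pattern):

# from the labs salt master, purge ganglia on phab-01 only
salt -v 'phab-01*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond'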

I did graphite-labs.graphite.eqiad.wmflabs and graphite1.graphite.eqiad.wmflabs

integration-raita can be disregarded. that was fixed by hashar. i think it just needs a little more time to disappear from the UI but there is no new data

10.64.48.132   3(NXDOMAIN)

wmf4727-test.eqiad.wmnet - resolving that hostname was broken by https://gerrit.wikimedia.org/r/#/c/287099/

wlmjurytool2014.wlmjurytool.eqiad.wmflabs. - killed gmond; there is a puppet failure about starting ganglia-monitor and I think it's a self-hosted puppetmaster, but gmond is gone and not coming back

Mentioned in SAL [2016-05-10T17:03:21Z] <mutante> killed gmond on vitalsigns-01, puppet::self master (T115330)

Mentioned in SAL [2016-05-10T17:11:24Z] <mutante> killed gmond on language-dev, self hosted puppetmaster, (T115330)

Mentioned in SAL [2016-05-10T17:14:28Z] <mutante> killed gmond on property-suggester, self hosted puppetmaster (T115330)

Mentioned in SAL [2016-05-10T17:17:42Z] <mutante> killed gmond, self hosted puppetmaster (T115330)

Ran dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia

I did the same on the last couple of instances where I killed gmond.

I have moved @Dzahn's list of IPs/FQDNs to the task description ( https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-4wg7fhgbo3bwvli/ ) so that anyone can amend the table and mark hosts as fixed.

beta-cluster and CI should be all good. Even got a few patches in puppet T134808#2281327

hashar updated the task description. (Show Details)
Dzahn updated the task description. (Show Details)

Maybe somebody in Analytics could take care of limn and maintenance.analytics?

Change 291192 had a related patch set uploaded (by Alexandros Kosiaris):
ferm: Restrict ganglia aggregator a bit more

https://gerrit.wikimedia.org/r/291192

Change 291192 merged by Alexandros Kosiaris:
ferm: Restrict ganglia aggregator a bit more

https://gerrit.wikimedia.org/r/291192

https://gerrit.wikimedia.org/r/291819 and friends should be setting the grounds for finally fixing this mess

So, this is happening because carbon has a public IP and somewhat relaxed firewall rules. The premise is that hosts with public IPs have them because they need to be accessible from the entire internet, which includes labs. Unfortunately a more lax rule using ALL_NETWORKS was used for the gmond instance as well, allowing labs instances to push to carbon. The change above should lay the groundwork for finally fixing this.
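
A hedged way to spot-check the tightened rule on the aggregator, assuming the default gmond UDP port 8649 and that ferm has loaded its rules into iptables:

# list the iptables rules for the gmond port and verify only production
# source networks are allowed (no labs 10.68.0.0/16)
iptables -S | grep 8649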

phab-01.phabricator.eqiad.wmflabs should no longer be doing it

akosiaris claimed this task.

This is finally fixed in rOPUPb3ef0ad. Labs VMs in https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous%2520eqiad&tab=m&vn=&hide-hf=false are already marked as down and metrics no longer flow in; gmond will clean them up and remove them fully in 1 day's time.

Finally resolving