
block labs IPs from sending data to prod ganglia
Closed, ResolvedPublic

Description

the misc_eqiad cluster in ganglia shows labs IPs; block that UDP traffic

As of May 10th

list of IPs that still show up now, and the names they resolve to:

| IP | Status | FQDN |
| 10.68.17.70 | OK | integration-slave-precise-1011.integration.eqiad.wmflabs. |
| 10.68.16.53 | OK | integration-raita.integration.eqiad.wmflabs. |
| 10.68.17.87 | OK | integration-slave-precise-1002.integration.eqiad.wmflabs. |
| 10.64.48.132 | OK | 3(NXDOMAIN) (fixed by krenair) |
| 10.68.16.36 | ?? | proofreadpage.wikisource-dev.eqiad.wmflabs. |
| 10.68.16.78 | OK | language-dev.language.eqiad.wmflabs. |
| 10.68.16.94 | ?? | limn1.analytics.eqiad.wmflabs. |
| 10.68.16.179 | OK | graphite1.graphite.eqiad.wmflabs. |
| 10.68.16.201 | ?? | phab-01.phabricator.eqiad.wmflabs. |
| 10.68.16.247 | ?? | phab-test.contributors.eqiad.wmflabs. |
| 10.68.16.253 | OK | property-suggester.wikidata-dev.eqiad.wmflabs. |
| 10.68.17.10 | OK | vitalsigns-01.dashiki.eqiad.wmflabs. |
| 10.68.17.134 | OK | wlmjurytool2014.wlmjurytool.eqiad.wmflabs. |
| 10.68.17.235 | OK | fulltext.full-text-reference-tool.eqiad.wmflabs. |
| 10.68.19.15 | ?? | rt1.servermon.eqiad.wmflabs. |
| 10.68.19.245 | ?? | maintenance.analytics.eqiad.wmflabs. |
| 10.68.22.38 | OK | graphite-labs.graphite.eqiad.wmflabs. |
| 10.68.23.143 | OK | integration-slave-trusty-1004.integration.eqiad.wmflabs. |

Potential way to get rid of Ganglia on an instance:

dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond
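
A hedged sanity check after running that (assuming the default gmond UDP port 8649 and that netstat is available on the instance):

# confirm nothing is still bound to or sending on the default gmond port 8649
netstat -anup | grep 8649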

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description. (Show Details)
Dzahn added a project: acl*sre-team.
Dzahn subscribed.
Dzahn added a subscriber: ori.
Dzahn triaged this task as High priority. Oct 19 2015, 11:14 PM
Dzahn added a project: netops.

Change 284912 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: don't run ganglia-monitor in labs

https://gerrit.wikimedia.org/r/284912

I think to properly fix this we'd need PRODUCTION_NETWORKS from https://gerrit.wikimedia.org/r/#/c/260926/, though https://gerrit.wikimedia.org/r/#/c/284912/ is a partial fix, since we shouldn't be running ganglia in labs anyway.

Change 284912 merged by Dzahn:
ganglia: don't run ganglia-monitor in labs

https://gerrit.wikimedia.org/r/284912

thanks @Dzahn! There are another ~25 labs instances reporting data to misc-eqiad, likely because they are not running puppet or are using a self-hosted puppet master, but we can fix that with ferm and/or ACLs.

thanks for the fix. definitely fewer IPs in there now. the remaining ones i see currently:

10.68.16.147 (down)
10.68.17.204 (down)

10.68.16.53 (integration-raita.integration.eqiad.wmflabs.): < mutante> !log integration integration-raita "Could not find class role::ci::raita" puppet error. manually stopping ganglia-monitor - apt-get remove ganglia-monitor

10.68.16.62
10.68.16.66
10.68.16.78
10.68.16.94
10.68.16.179
10.68.16.201
10.68.16.247
10.68.16.253
10.68.17.10
10.68.17.70
10.68.17.87
10.68.17.134
10.68.17.207
10.68.17.235
10.68.18.253
10.68.19.15
10.68.19.37
10.68.19.245
10.68.21.109
10.68.22.38
10.68.23.143

12:31 < mutante> if i gave you a list of just IP addresses, could you easily run a salt command on that?
12:31 < mutante> (they are instances)
12:33 < YuviPanda> mutante: salt is useless in labs mostly. I recommend xargs + ssh (which is what I use) + root@
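
A minimal sketch of that xargs + ssh approach, assuming a hypothetical file ips.txt with one instance IP per line and working root SSH access to each instance:

# run the ganglia removal one-liner on every IP listed in ips.txt
xargs -a ips.txt -I{} ssh -n root@{} 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond'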

Mentioned in SAL [2016-04-27T19:46:42Z] <mutante> - gptest1.catgraph manually stopping ganglia (T115330)

host 10.68.16.66 is special, look how many names that has:

host 10.68.16.66 | wc -l
27

all in contintcloud.eqiad.wmflabs

Mentioned in SAL [2016-04-27T20:01:33Z] <mutante> language-dev dpkg-configure -a to fix borked dpkg (manually interrupted dist-upgrade?) , manually removing ganglia (T115330)

hashar> mutante: chasemp relevant task is T126518 duplicate of T115194

As part of T134808, on CI and the beta cluster I have run:

salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia'

Some instances still had gmond running, which prevented the purge of the package with the error "ganglia is still logged in"; in those cases I killed the gmond process.
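
A hedged sketch of that workaround, assuming the usual ganglia-monitor init service name on these Precise/Trusty instances:

# stop gmond first so the purge does not fail, falling back to killall
service ganglia-monitor stop || killall -9 gmond
dpkg --purge libganglia1 ganglia-monitor
rm -fR /etc/ganglia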

That also gets rid of /etc/ganglia, which causes puppet to randomly fail whenever a class ends up relying on include ganglia or adding a custom monitoring probe under /etc/ganglia/. I have some patches attached to T134808 that rely on $::standard::has_ganglia to skip those bits on labs.

Mentioned in SAL [2016-05-09T20:57:59Z] <hashar> Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330

list of IPs that still show up now, and the names they resolve to:

10.68.17.70    integration-slave-precise-1011.integration.eqiad.wmflabs.
10.68.16.53    integration-raita.integration.eqiad.wmflabs.
10.68.17.87    integration-slave-precise-1002.integration.eqiad.wmflabs.
10.64.48.132   3(NXDOMAIN)
10.68.16.36    proofreadpage.wikisource-dev.eqiad.wmflabs.
10.68.16.78    language-dev.language.eqiad.wmflabs.
10.68.16.94    limn1.analytics.eqiad.wmflabs.
10.68.16.179   graphite1.graphite.eqiad.wmflabs.
10.68.16.201   phab-01.phabricator.eqiad.wmflabs.
10.68.16.247   phab-test.contributors.eqiad.wmflabs.
10.68.16.253   property-suggester.wikidata-dev.eqiad.wmflabs.
10.68.17.10    vitalsigns-01.dashiki.eqiad.wmflabs.
10.68.17.134   wlmjurytool2014.wlmjurytool.eqiad.wmflabs.
10.68.17.235   fulltext.full-text-reference-tool.eqiad.wmflabs.
10.68.19.15    rt1.servermon.eqiad.wmflabs.
10.68.19.245   maintenance.analytics.eqiad.wmflabs.
10.68.22.38    graphite-labs.graphite.eqiad.wmflabs.
10.68.23.143   integration-slave-trusty-1004.integration.eqiad.wmflabs.

Ran dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia on rt1.servermon.eqiad.wmflabs (10.68.19.15)

Puppet is broken on that instance: Could not find class role::racktables

Edit: Caused by https://gerrit.wikimedia.org/r/#/c/285308/ - had to reinstall the packages to get puppet running again, then it did the ensure=>removed thing on them

I can't log in to phab-01.phabricator.eqiad.wmflabs (10.68.16.201), even as root. Maybe someone with access to the labs salt master can get in.
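
A hedged sketch of what someone on the labs salt master could run, assuming phab-01's minion ID matches its hostname (hypothetical target pattern):

# from the labs salt master, purge ganglia on phab-01 only
salt -v 'phab-01*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia; killall -9 gmond'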

I did graphite-labs.graphite.eqiad.wmflabs and graphite1.graphite.eqiad.wmflabs

integration-raita can be disregarded. that was fixed by hashar. i think it just needs a little more time to disappear from the UI but there is no new data

10.64.48.132   3(NXDOMAIN)

wmf4727-test.eqiad.wmnet - resolving that hostname was broken by https://gerrit.wikimedia.org/r/#/c/287099/

wlmjurytool2014.wlmjurytool.eqiad.wmflabs. - killed gmond; there is a puppet failure about starting ganglia-monitor and I think it's a self-hosted puppetmaster, but gmond is gone and not coming back

Mentioned in SAL [2016-05-10T17:03:21Z] <mutante> killed gmond on vitalsigns-01, puppet::self master (T115330)

Mentioned in SAL [2016-05-10T17:11:24Z] <mutante> killed gmond on language-dev, self hosted puppetmaster, (T115330)

Mentioned in SAL [2016-05-10T17:14:28Z] <mutante> killed gmond on property-suggester, self hosted puppetmaster (T115330)

Mentioned in SAL [2016-05-10T17:17:42Z] <mutante> killed gmond, self hosted puppetmaster (T115330)

Ran dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia

I did the same on the last couple of instances where I killed gmond.

I have moved @Dzahn's list of IPs/FQDNs to the task description ( https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-4wg7fhgbo3bwvli/ ) so that anyone can amend the table and mark hosts as fixed.

beta-cluster and CI should be all good. Even got a few patches in puppet T134808#2281327

hashar updated the task description. (Show Details)
Dzahn updated the task description. (Show Details)

Maybe somebody in Analytics could take care of limn and maintenance.analytics?

Change 291192 had a related patch set uploaded (by Alexandros Kosiaris):
ferm: Restrict ganglia aggregator a bit more

https://gerrit.wikimedia.org/r/291192

Change 291192 merged by Alexandros Kosiaris:
ferm: Restrict ganglia aggregator a bit more

https://gerrit.wikimedia.org/r/291192

https://gerrit.wikimedia.org/r/291819 and friends should be setting the grounds for finally fixing this mess

So, this is happening because carbon has a public IP and somewhat relaxed firewall rules. The premise is that hosts with public IPs have them because they need to be accessible from the entire internet, which includes labs. Unfortunately a more lax rule using ALL_NETWORKS was used for the gmond instance as well, allowing labs instances to push to carbon. The change above should lay the groundwork for finally fixing this.
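
A hedged way to spot-check the tightened rule on the aggregator, assuming the default gmond UDP port 8649 and that ferm has loaded its rules into iptables:

# list the iptables rules for the gmond port and verify only production
# source networks are allowed (no labs 10.68.0.0/16)
iptables -S | grep 8649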

phab-01.phabricator.eqiad.wmflabs should no longer be doing it

akosiaris claimed this task.

This is finally fixed in rOPUPb3ef0ad. Labs VMs in https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous%2520eqiad&tab=m&vn=&hide-hf=false are already marked as down and metrics no longer flow in; gmond will clean them up and remove them fully in 1 day's time.

Finally resolving