clouddumps1002: ferm is being started on every puppet run
Closed, Resolved, Public

Description

For some reason on clouddumps1002, the ferm service is being started (corrective, stopped -> running) on every puppet run:

Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective)

This does not happen on other hosts. I just noticed it while verifying unrelated patches.

In general, "resources change on every puppet run" used to be an Icinga alert, and because this one involves ferm it seems more important than others of its type.

Event Timeline

By elimination, I've convinced myself that the issue here is 10_dumps_rsyncd:

# Autogenerated by puppet. DO NOT EDIT BY HAND!
#
# 
&R_SERVICE(tcp, 873, @resolve((mwlog1002.eqiad.wmnet mwlog2002.codfw.wmnet phab1004.eqiad.wmnet dumpsdata1001.eqiad.wmnet dumpsdata1002.eqiad.wmnet dumpsdata1003.eqiad.wmnet clouddumps1001.wikimedia.org clouddumps1002.wikimedia.org stat1006.eqiad.wmnet stat1007.eqiad.wmnet wdqs1009.eqiad.wmnet wcqs2001.codfw.wmnet wdqs2009.codfw.wmnet sagres.c3sl.ufpr.br sagres.c3sl.ufpr.br poincare.acc.umu.se ftp.acc.umu.se mirror.accum.se poincare.acc.umu.se ftp.acc.umu.se mirror.accum.se ftpmirror.your.org ftpmirror-ae0-4.us.your.org ftpmirror.your.org crcdtn01.crc.nd.edu wmrsync.crc.nd.edu wikimedia.mirror.us.dev wikimedia.mirror.us.dev 65.19.157.35 wikimedia.bringyour.com wikimedia.bringyour.com mirror.clarkson.edu mirror.clarkson.edu wikipedia.mirror.pdapps.org wikipedia.mirror.pdapps.org)));

Likely because one or more of those can't be resolved.

The troublesome entries are:

ftp.acc.umu.se
mirror.accum.se
ftp.acc.umu.se
mirror.accum.se

And yeah, each is in there twice, but being in twice is not causing the issue.

I don't see any real problem with those hosts other than that they're duplicates of each other. Seems like a ferm bug.

@Andrew Could it be 65.19.157.35? That is the only IP in there, and it fails to resolve:

[clouddumps1002:/etc/ferm/conf.d] $ host 65.19.157.35
Host 35.157.19.65.in-addr.arpa. not found: 3(NXDOMAIN)

All the others resolve; only this one fails:

for host in mwlog1002.eqiad.wmnet mwlog2002.codfw.wmnet phab1004.eqiad.wmnet dumpsdata1001.eqiad.wmnet dumpsdata1002.eqiad.wmnet dumpsdata1003.eqiad.wmnet clouddumps1001.wikimedia.org clouddumps1002.wikimedia.org stat1006.eqiad.wmnet stat1007.eqiad.wmnet wdqs1009.eqiad.wmnet wcqs2001.codfw.wmnet wdqs2009.codfw.wmnet sagres.c3sl.ufpr.br sagres.c3sl.ufpr.br poincare.acc.umu.se ftp.acc.umu.se mirror.accum.se poincare.acc.umu.se ftp.acc.umu.se mirror.accum.se ftpmirror.your.org ftpmirror-ae0-4.us.your.org ftpmirror.your.org crcdtn01.crc.nd.edu wmrsync.crc.nd.edu wikimedia.mirror.us.dev wikimedia.mirror.us.dev 65.19.157.35 wikimedia.bringyour.com wikimedia.bringyour.com mirror.clarkson.edu mirror.clarkson.edu wikipedia.mirror.pdapps.org; do host $host; done | grep NX


Host 35.157.19.65.in-addr.arpa. not found: 3(NXDOMAIN)
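
A note on reading this output: when `host` is given an IP literal it performs a reverse (PTR) lookup, which is why the failure is reported for 35.157.19.65.in-addr.arpa rather than for a name. A missing PTR record does not by itself mean the address is unusable. A minimal sketch that separates IP literals from hostnames in such a list, so that each group can be checked in the appropriate way, might look like this. The function name `split_literals` is hypothetical and is not part of ferm or puppet.

```python
import ipaddress
from typing import Iterable, List, Tuple


def split_literals(entries: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split entries into (IP literals, hostnames), preserving input order."""
    literals: List[str] = []
    names: List[str] = []
    for entry in entries:
        try:
            # ip_address() raises ValueError for anything that is not an IP literal.
            ipaddress.ip_address(entry)
            literals.append(entry)
        except ValueError:
            names.append(entry)
    return literals, names


# Two entries taken from the rule quoted in this task.
literals, names = split_literals(["ftp.acc.umu.se", "65.19.157.35"])
print("IP literals:", literals)
print("hostnames:", names)
```

Only the hostnames need a forward lookup; the literals can be taken as they are.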

That was the first thing I tried! ferm-status is only satisfied if I remove ftp.acc.umu.se and mirror.accum.se -- removing 65.19.157.35 doesn't make a difference.

Can you try running ferm-status with --verbose?

Yeah, it doesn't produce any output. That's why I had to do a binary search to figure out which entries it didn't like.
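
For reference, the binary search described here can be sketched roughly as follows. This is an illustration under assumptions, not the procedure that was actually run: `passes` is a hypothetical callback standing in for "write this subset of hosts into the rule and see whether ferm-status is satisfied", and the sketch assumes each offending entry causes a failure on its own.

```python
from typing import Callable, List, Sequence


def find_bad_entries(
    entries: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the entries that make `passes` fail, by recursive halving."""
    if not entries or passes(entries):
        return []  # nothing to check, or this whole chunk is fine
    if len(entries) == 1:
        return [entries[0]]  # a single failing entry is a culprit
    mid = len(entries) // 2
    return find_bad_entries(entries[:mid], passes) + find_bad_entries(entries[mid:], passes)


# Stand-in check for demonstration only; a real check would involve ferm-status.
KNOWN_BAD = {"ftp.acc.umu.se", "mirror.accum.se"}


def fake_passes(chunk: Sequence[str]) -> bool:
    return not any(host in KNOWN_BAD for host in chunk)


hosts = [
    "poincare.acc.umu.se",
    "ftp.acc.umu.se",
    "mirror.accum.se",
    "ftpmirror.your.org",
    "65.19.157.35",
]
print(find_bad_entries(hosts, fake_passes))
```

Halving keeps the number of checks small when only a few entries are at fault, which matters when the check itself gives no diagnostic output.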

Interesting! So what stands out about these to me is:

  • mirror.accum.se is an alias for ftp.acc.umu.se., so basically it's just that one
  • ftp.acc.umu.se has multiple A records, which is unique among the other hosts in the list
;; ANSWER SECTION:
ftp.acc.umu.se.		478	IN	A	194.71.11.165
ftp.acc.umu.se.		478	IN	A	194.71.11.163
ftp.acc.umu.se.		478	IN	A	194.71.11.173

I would not be surprised if it's a bug related to finding more than one A record.
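
One way to test that hypothesis against the whole list is to flag every entry that resolves to more than one IPv4 address. The following is a minimal sketch, assuming a plain system resolver lookup is representative of what ferm sees; the function names are hypothetical, and the resolver is injectable so the logic can be exercised without live DNS.

```python
import socket
from typing import Callable, Dict, Iterable, List


def ipv4_addresses(host: str) -> List[str]:
    """Resolve a name to its IPv4 addresses; return [] if the lookup fails."""
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
        return sorted(set(socket.gethostbyname_ex(host)[2]))
    except OSError:
        return []


def hosts_with_multiple_addresses(
    hosts: Iterable[str],
    resolve: Callable[[str], List[str]] = ipv4_addresses,
) -> Dict[str, List[str]]:
    """Map each host that has more than one address to its address list."""
    result: Dict[str, List[str]] = {}
    for host in dict.fromkeys(hosts):  # drop duplicates, keep order
        addresses = resolve(host)
        if len(addresses) > 1:
            result[host] = addresses
    return result


# Canned answers for demonstration: the ftp.acc.umu.se records are the ones
# quoted in this task; 192.0.2.10 is a documentation placeholder, not real data.
canned = {
    "ftp.acc.umu.se": ["194.71.11.163", "194.71.11.165", "194.71.11.173"],
    "ftpmirror.your.org": ["192.0.2.10"],
}
print(hosts_with_multiple_addresses(canned, lambda host: canned[host]))
```

Calling `hosts_with_multiple_addresses(hosts)` without the second argument would use the system resolver instead of the canned answers.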

I wonder if it would disappear if we used one of the three IP addresses instead of the host name. As the other IP in the list shows, it should be OK to add IPs.

Change 911338 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] dumps::distribution::ferm: update to resolve hosts in puppetmaster

https://gerrit.wikimedia.org/r/911338

I'm unable to recreate this. Did you fix it? Either way, I think this would be more strict if you pushed the DNS resolution to the puppetmaster instead of ferm. This can prevent errors where the DNS changes between the time the rules were loaded and the time the ferm-status script checks them. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/911338
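
To illustrate the idea (this is not the contents of the patch, which is Puppet code): if names are resolved once, when the rule file is generated, the file contains only literal addresses, so a later check cannot disagree with it because of a DNS change. A rough sketch, assuming the rendered rule keeps the `&R_SERVICE(...)` shape quoted earlier in this task; `render_rsync_rule` and the injected `resolve` callback are hypothetical names.

```python
from typing import Callable, Iterable, List


def render_rsync_rule(
    hosts: Iterable[str],
    resolve: Callable[[str], List[str]],
    proto: str = "tcp",
    port: int = 873,
) -> str:
    """Render one rule line with pre-resolved, de-duplicated, sorted addresses."""
    addresses = set()
    for host in dict.fromkeys(hosts):  # drop duplicate names, keep order
        found = resolve(host)
        if not found:
            # Fail loudly at generation time instead of shipping a partial rule.
            raise ValueError(f"could not resolve {host}")
        addresses.update(found)
    # Sorting keeps the output stable even if the resolver rotates its answers.
    return f"{proto_rule(proto, port)}({' '.join(sorted(addresses))}));"


def proto_rule(proto: str, port: int) -> str:
    """Build the fixed prefix of the rule line."""
    return f"&R_SERVICE({proto}, {port}, "


# Canned answers using addresses quoted in this task.
canned = {
    "ftp.acc.umu.se": ["194.71.11.165", "194.71.11.163"],
    "65.19.157.35": ["65.19.157.35"],
}
print(render_rsync_rule(canned, lambda host: canned[host]))
```

The sort and de-duplication matter as much as the early resolution: they make the rendered line identical from run to run as long as the underlying record set is unchanged.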

I can confirm the issue that made me create this ticket is gone. So it's resolved. I don't know how it got resolved though.

So, now the status is:

[clouddumps1002:~] $ host ftp.acc.umu.se
ftp.acc.umu.se has address 194.71.11.163
ftp.acc.umu.se has address 194.71.11.165

The IP "194.71.11.173" that was there in T323324#8521638 is gone.

So it seems like this was technically resolved by someone controlling umu.se, or similar.
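
The change in the record set can be made explicit by diffing the two answers quoted in this task. A small sketch, with a hypothetical helper name:

```python
from typing import Dict, Iterable, List


def diff_records(before: Iterable[str], after: Iterable[str]) -> Dict[str, List[str]]:
    """Report which addresses disappeared and which appeared between two lookups."""
    old, new = set(before), set(after)
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Record sets as quoted in this task for ftp.acc.umu.se.
earlier = ["194.71.11.165", "194.71.11.163", "194.71.11.173"]
current = ["194.71.11.163", "194.71.11.165"]
print(diff_records(earlier, current))
```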

Dzahn claimed this task.
Dzahn removed Dzahn as the assignee of this task.

Change 911338 merged by Jbond:

[operations/puppet@production] dumps::distribution::ferm: update to resolve hosts in puppetmaster

https://gerrit.wikimedia.org/r/911338