prometheus in codfw can't reach cloudservices' pdns to fetch metrics
Closed, ResolvedPublic

Description

There are two jobs to fetch powerdns metrics in codfw from prometheus:

root@prometheus2005:/srv/prometheus/ops/targets# cat cloud-dev-pdns_codfw.yaml
# This file is managed by puppet
---
- labels:
    cluster: misc
    site: codfw
  targets:
  - cloudservices2004-dev:8081
  - cloudservices2005-dev:8081

root@prometheus2005:/srv/prometheus/ops/targets# cat cloud-dev-pdns-rec_codfw.yaml
# This file is managed by puppet
---
- labels:
    cluster: misc
    site: codfw
  targets:
  - cloudservices2004-dev:8082
  - cloudservices2005-dev:8082

These stopped working about a month ago (the JobUnavailable alert is firing). Is this known/expected?
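A quick way to see how the scrape targets fail is to attempt a plain TCP connection to each one from the Prometheus host. The following is a minimal sketch, not the tooling actually used on this task: the target list is copied from the two files above, while the helper names (`split_target`, `probe`, `report`) and the 3 second timeout are assumptions made for illustration. A "timeout" result is consistent with packets being dropped somewhere along the path, whereas "refused" would suggest nothing is listening on the port.

```python
import socket

# Targets copied from the two target files shown above:
# port 8081 is the pdns job, port 8082 the pdns-rec job.
TARGETS = [
    "cloudservices2004-dev:8081",
    "cloudservices2005-dev:8081",
    "cloudservices2004-dev:8082",
    "cloudservices2005-dev:8082",
]


def split_target(target):
    """Split a 'host:port' string into (host, port_as_int)."""
    host, _, port = target.rpartition(":")
    return host, int(port)


def probe(target, timeout=3.0):
    """Attempt a TCP connection and classify the outcome (hypothetical helper)."""
    host, port = split_target(target)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout: traffic may be dropped in transit.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"error: {exc}"


def report(targets):
    """Print one line per target with its probe result."""
    for target in targets:
        print(target, probe(target))
```

Calling `report(TARGETS)` on the Prometheus host prints one result per target; the short hostnames rely on the resolver's search domains on that host.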

Event Timeline

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

aborrero edited projects, added cloud-services-team, User-aborrero; removed Cloud-Services.
aborrero added a subscriber: aborrero.

We renamed the servers, from .wikimedia.org to .codfw.wmnet.

Are these targets configured in puppet?

I just checked real quick, the ports should be available to the prometheus server:

aborrero@cloudservices2004-dev:~ $ sudo iptables-save -c | grep 10.192.16.75
[58321:3499260] -A INPUT -s 10.192.16.75/32 -j ACCEPT
aborrero@cloudservices2004-dev:~ $ sudo ss -putanl | grep -E 8081\|8082
tcp   LISTEN 0      10            10.192.20.10:8081       0.0.0.0:*    users:(("pdns_server",pid=1391150,fd=7))
tcp   LISTEN 0      10            10.192.20.10:8082       0.0.0.0:*    users:(("pdns_recursor",pid=3163970,fd=40))

Maybe the firewalling is happening elsewhere in the switches or core routers, CC @cmooney
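The bracketed prefix in the `iptables-save -c` output above is the rule's `[packets:bytes]` counter pair, which is what supports the reasoning here: a non-zero count on the ACCEPT rule for 10.192.16.75 (presumably the Prometheus host) means its packets are arriving and being accepted locally, so any filtering must be happening elsewhere. As a small sketch of reading that line programmatically (the function name `parse_counters` is hypothetical):

```python
import re

# iptables-save -c prefixes each rule with "[packets:bytes]".
RULE_RE = re.compile(r"^\[(\d+):(\d+)\]\s+(.*)$")


def parse_counters(line):
    """Return (packets, bytes, rule) for a counted rule line, or None if it does not match."""
    match = RULE_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


line = "[58321:3499260] -A INPUT -s 10.192.16.75/32 -j ACCEPT"
packets, byte_count, rule = parse_counters(line)
# A non-zero packet count means inbound traffic from that source hit this rule.
print(packets > 0, rule)
```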

Yes, I just checked with tcpdump, and it is the return traffic that is not being accepted.

Change 938819 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/homer/public@master] CR: cloud-host: allow return traffic for PDNS servers

https://gerrit.wikimedia.org/r/938819

> We renamed the servers, from .wikimedia.org to .codfw.wmnet.
>
> Are these targets configured in puppet?

Yes, they are. However, hostnames are searched in all domains, both codfw.wmnet and wikimedia.org, so the rename does not affect them (JFYI, you already found the root cause anyway!). Thank you.
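To illustrate why the rename does not break the short target names: with search domains configured, a short name is tried under each domain until one resolves. The sketch below mimics that behaviour in a simplified form. The order of `SEARCH_DOMAINS` and the function names are assumptions, and a real resolver also applies rules such as `ndots` that are not modelled here.

```python
import socket

# Assumed search list; the real order on the Prometheus host may differ.
SEARCH_DOMAINS = ["codfw.wmnet", "wikimedia.org"]


def candidates(short_name, search_domains=SEARCH_DOMAINS):
    """Build the fully qualified names that would be tried for a short hostname."""
    return [f"{short_name}.{domain}" for domain in search_domains]


def first_resolving(short_name, search_domains=SEARCH_DOMAINS,
                    resolve=socket.gethostbyname):
    """Return (fqdn, address) for the first candidate that resolves, else None."""
    for fqdn in candidates(short_name, search_domains):
        try:
            return fqdn, resolve(fqdn)
        except OSError:
            # socket.gaierror is a subclass of OSError: try the next domain.
            continue
    return None
```

For example, `first_resolving("cloudservices2004-dev")` would return the codfw.wmnet name once the old wikimedia.org record is gone, without any change to the target files.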

Change 938819 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] CR: cloud-host: allow return traffic for PDNS servers

https://gerrit.wikimedia.org/r/938819

Patch deployed, should be fixed now.

fgiunchedi claimed this task.

Can confirm it is fixed! Thank you @aborrero !