k3s.catalyst-dev.eqiad1.wikimedia.cloud DNS lookup not working
Closed, Resolved, Public

Description

I tried to ssh to k3s.catalyst-dev.eqiad1.wikimedia.cloud today but I was not able to connect. After some debugging I found that the DNS is not resolving the hostname:

dancy@k3s-worker01:~$ hostname -f
k3s-worker01.catalyst-dev.eqiad1.wikimedia.cloud
dancy@k3s-worker01:~$ host k3s.catalyst-dev.eqiad1.wikimedia.cloud
Host k3s.catalyst-dev.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)

I tried rebooting the VM today but it did not help.
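As a complement to the `host` lookup above, here is a minimal sketch of a resolution check that can be scripted. The function name `check_resolves` is hypothetical. Note that `getent hosts` goes through the system resolver (NSS), so unlike `host` it also consults /etc/hosts.

```shell
# Hypothetical helper: report whether a hostname resolves via the system resolver.
check_resolves() {
    # getent exits non-zero when no address is found for the name.
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "resolves"
    else
        echo "unresolved"
    fi
}

check_resolves k3s.catalyst-dev.eqiad1.wikimedia.cloud
```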

Details

Other Assignee
bd808

Event Timeline

https://openstack-browser.toolforge.org/server/k3s.catalyst-dev.eqiad1.wikimedia.cloud says the instance IP is 172.16.0.47. It seems to be the only instance in the project without an IPv6 address. I'm not sure why that would mess with its exported DNS record, though.

@dancy have you been able to access this VM in the past?

I just recently joined the catalyst project so this is my first time attempting to reach that host.

ok! It would be useful to know if/when that VM was ever functional and reachable.

I've added @jnuche as a subscriber to this ticket. He will probably be able to shed some light on the topic.

I edited my ~/.ssh/config so that the correct jump host, key, and username would be used to connect to 172.16.0.47. With ssh -vvv 172.16.0.47 I see my session hopping through bastion.wmcloud.org, offering my key to Host '172.16.0.47', and then it looks like the instance closes the connection. I think more is messed up here than the PDNS records.
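The same hop can be expressed on the command line instead of in ~/.ssh/config. This is only a sketch of an equivalent invocation, not the exact configuration used above: the helper name `jump_ssh`, the username, and the key path are all placeholders.

```shell
# Hypothetical helper: ssh to an internal address through the bastion with debug output.
# Usage: jump_ssh <username> <private-key-path> <target-ip>
jump_ssh() {
    # -J hops through the jump host, -i selects the key,
    # and -vvv logs each step of the connection and authentication.
    ssh -vvv -i "$2" -J "$1@bastion.wmcloud.org" "$1@$3"
}
```

For example, `jump_ssh example-user ~/.ssh/id_ed25519 172.16.0.47` would attempt the connection described above, assuming the same username is valid on both the bastion and the instance.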

I have bodged up an A record and a PTR record for this host, so you should be able to access it now. Because I created the records by hand, they will likely live on forever after the VM is deleted, so I'm going to keep this task open for a while.

If someone has a story about what happened and/or if that VM worked in the past, I'm still interested!
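For reference, the PTR record for an IPv4 address lives under a name built from the reversed octets plus `in-addr.arpa`. A minimal sketch of that mapping, using the instance IP from this task (the function name `ptr_name` is hypothetical):

```shell
# Hypothetical helper: build the in-addr.arpa name that holds the PTR record
# for a dotted-quad IPv4 address.
ptr_name() {
    # Split the address on dots into four octets, then emit them reversed.
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo "$d.$c.$b.$a.in-addr.arpa"
}

ptr_name 172.16.0.47   # prints 47.0.16.172.in-addr.arpa
```

Both directions can then be checked with `host k3s.catalyst-dev.eqiad1.wikimedia.cloud` (A record) and `host 172.16.0.47` (PTR record).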

(I am able to ssh to the VM with a root key. I haven't exercised sssd/pam/ldap/etc there)

I can get in with my cloud root key now. I'm also realizing that I'm not a member of the project, so my user key shouldn't have worked no matter what.

2026-02-10T21:57:27.035803+00:00 k3s sshd[36482]: fatal: Access denied for user bd808 by PAM account configuration [preauth]

That's things working as they should. id bd808 works so sssd is probably doing ok.
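The reasoning here is that if the account can be looked up at all, the directory lookup path is answering, and the PAM denial is an access decision rather than a lookup failure. A minimal sketch of that lookup check, with a hypothetical function name `user_known`:

```shell
# Hypothetical helper: report whether an account name can be resolved on this host.
user_known() {
    # id exits non-zero when the name service cannot find the user.
    if id "$1" >/dev/null 2>&1; then
        echo "known"
    else
        echo "unknown"
    fi
}

user_known bd808
```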

I have bodged up an A record and a PTR record for this host, so you should be able to access it now. Because I created the records by hand, they will likely live on forever after the VM is deleted, so I'm going to keep this task open for a while.

If someone has a story about what happened and/or if that VM worked in the past, I'm still interested!

Thanks! I'm able to ssh in now. I thought I'd find more running on the node (such as kubernetes pods) but it's very idle. I'll inquire with @jnuche about that.

Thanks @Andrew and @bd808 for helping with this!

ok! It would be useful to know if/when that VM was ever functional and reachable.

Yes, it's the master host of our K3s development cluster that we use to test changes to our infra stack before deploying to prod.

I was on vacation all of January, so I haven't touched the dev cluster recently; I think the last time was late November. OpenTofu had no issues connecting to this host at the time, but then again the stack connects using the host's IP, so I can't say for certain that there wasn't already an issue with the DNS back then.

Thanks! I'm able to ssh in now. I thought I'd find more running on the node (such as kubernetes pods) but it's very idle. I'll inquire with @jnuche about that.

I just ssh'd into the host and saw that K3s had crashed. I've restarted it and I can see the expected pods (note that our workloads are in the cat-env namespace):

jnuche@k3s:~$ kubectl -n cat-env get pod | head
NAME                                              READY   STATUS    RESTARTS      AGE
envdb-0                                           1/1     Running   2 (71d ago)   152d
wiki-3874e4396d-1915-mediawiki-f744ddbc5-plrx4    2/2     Running   2 (71d ago)   159d
wiki-bb0b246d45-2393-mediawiki-7bb946cdfd-fn9dl   2/2     Running   2 (71d ago)   123d
wiki-24ea17598d-2549-mediawiki-694449f967-5dhdf   2/2     Running   2 (71d ago)   110d
wiki-5484592608-3880-mediawiki-78dfbbf798-b5tkh   2/2     Running   0             5d14h
wiki-befc2d19e9-1723-mediawiki-5744dd5ddd-cnb5l   2/2     Running   2 (71d ago)   173d
wiki-482bd7320e-3823-mediawiki-5fcd88cfc9-wgbmt   2/2     Running   0             12d
wiki-33f926502f-3111-mediawiki-f47b4575f-xc9dg    2/2     Running   4 (54d ago)   71d
wiki-318839e344-3611-mediawiki-6b85544d46-hr8j4   2/2     Running   0             20d

thcipriani assigned this task to Andrew.
thcipriani updated Other Assignee, added: bd808.
thcipriani subscribed.

Can SSH now, thanks @bd808 and @Andrew for the help!