
zotero failing with getaddrinfo ENOTFOUND en.wikipedia.org and url-downloader.codfw.wikimedia.org, causing Citoid errors
Closed, Resolved · Public

Description

See https://logstash.wikimedia.org/goto/1c5a271e011c321e22104cec2272a6b1

Screenshot 2021-07-08 at 17-44-09 Discover - Elastic.png (attached screenshot, 456 KB)

This appears to line up with Citoid errors from Zotero https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-2d&to=now&refresh=5m

Screenshot 2021-07-08 at 17-44-43 Citoid - Grafana.png (attached screenshot, 91 KB)

It's causing Icinga to alert too:

15:53:52 <+icinga-wm> PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
15:57:28 <+icinga-wm> PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 400 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid

(IRC timestamps PDT)

At 2021-07-09 00:39 UTC I did a rolling restart of the zotero deployment in codfw, but per logstash, it didn't make a difference.
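For reference, a rolling restart of a deployment like the one described above can be done with `kubectl rollout`. This is a hedged sketch, not the exact command used here: the namespace (`zotero`) and deployment name are assumptions, and WMF production uses its own deployment tooling (helmfile) rather than raw kubectl.

```shell
# Trigger a rolling restart of the (assumed) zotero deployment in codfw,
# then wait for the rollout to complete. Namespace/name are illustrative.
kubectl -n zotero rollout restart deployment/zotero
kubectl -n zotero rollout status deployment/zotero
```

Note that a rolling restart replaces the service's own pods; it would not help here because the fault turned out to be in a coredns pod serving DNS for the cluster, not in zotero itself.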

Related Objects

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-07-09T00:47:21Z] <legoktm> zotero rolling restart didn't help, filed T286360 for DNS issues

I can temporarily silence this if you want, by commenting out the offending test and redeploying, at least for the time being? It doesn't seem to actually affect operation. We used to probe www.example.com, but that caused errors like this too, so we switched to probing our own infrastructure so that uptime would be correlated... do you think it's a temporary DNS issue?

See also T163986

Change 703833 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Temporarily silence alerting test

https://gerrit.wikimedia.org/r/703833

Mentioned in SAL (#wikimedia-operations) [2021-07-09T11:40:00Z] <_joe_> deleting coredns pod in codfw, potentially causing T286360

I suspected something was wrong with some of the coredns pods, so I queried each of them and combed through the logs. I found errors in the logs of one pod and decided to delete it.
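The per-pod triage described above can be sketched as follows. This is illustrative only: the namespace and label selector are assumptions based on a stock Kubernetes coredns deployment, and the actual WMF cluster layout may differ.

```shell
# List coredns pods (namespace/label are assumptions), scan each pod's
# logs for errors, then delete the faulty one so the Deployment/ReplicaSet
# replaces it with a fresh pod.
for pod in $(kubectl -n kube-system get pods -l k8s-app=kube-dns -o name); do
  echo "=== $pod ==="
  kubectl -n kube-system logs "$pod" | grep -i error
done

# Once the bad pod is identified (name is a placeholder):
kubectl -n kube-system delete pod coredns-<pod-id>
```

Deleting the pod works because coredns runs under a controller that immediately schedules a replacement; only the broken instance's state is discarded.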

Let's see over the next hour whether anything else goes wrong, but this strongly suggests we need to tweak our readinessProbe for coredns.
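The point of tweaking the readinessProbe is that a coredns pod which is up but failing lookups should be taken out of service rotation automatically instead of requiring a manual delete. A minimal sketch, assuming the standard CoreDNS `ready` plugin (which serves `/ready` on port 8181) is enabled in the Corefile; the timing values are illustrative, not the values actually deployed:

```yaml
# Hypothetical readinessProbe for a coredns container. With the "ready"
# plugin enabled, /ready returns 200 only while all plugins report healthy,
# so a degraded pod stops receiving traffic after failureThreshold misses.
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 5
  failureThreshold: 3
```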

Joe claimed this task.

As expected, killing the faulty coredns pod fixed things.

Change 703833 abandoned by Mvolz:

[mediawiki/services/citoid@master] Temporarily silence alerting test

Reason:

Resolved via another route

https://gerrit.wikimedia.org/r/703833