Page MenuHomePhabricator

Update references from labmon.* to cloudmetrics.* on deployment-prep (beta cluster)
Open, MediumPublic

Description

Logstash is flooded with connection errors following the rename of labmon instances to cloudmonitor ones as per https://github.com/wikimedia/puppet/commit/95fcb029bef97b7422ceb9f7054d4e23b6a0820b and https://github.com/wikimedia/puppet/commit/54f1f161aa4a681fc63637eebb525514f5427499. We should hunt and rename the affected instances on deployment-prep on Horizon and rename them there so logstash-beta becomes readable again.

Affected instances
[maurelio@ubuntu cloud-instance-puppet-readonly](master)$ git grep -il "labmon"
deployment-prep/_.yaml
deployment-prep/deployment-docker-cxserver01.deployment-prep.eqiad.wmflabs.yaml
deployment-prep/deployment-eventgate-1.deployment-prep.eqiad.wmflabs.yaml
deployment-prep/deployment-eventgate-2.deployment-prep.eqiad.wmflabs.yaml
deployment-prep/deployment-eventgate-3.deployment-prep.eqiad.wmflabs.yaml
deployment-prep/deployment-eventlog05.deployment-prep.eqiad.wmflabs.yaml
deployment-prep/deployment-kafka.yaml
deployment-prep/deployment-maps.yaml
openstack/cloudcontrol-codfw1dev-test.yaml
openstack/cloudcontrol-upgrade.yaml
ores-staging/_.yaml
ores/_.yaml
phabricator/_.yaml

Event Timeline

MarcoAurelio triaged this task as Medium priority.EditedDec 26 2019, 3:20 PM
MarcoAurelio updated the task description. (Show Details)

Could not change on deployment-eventgate1-3 as those ain't in Horizon (?)

After replacing all usages of labmon1001 I continue to see in Logstash-Beta the following for deployment-cpjobqueue:

Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet

Details:

Error: Error sending hot-shots message: Error: getaddrinfo ENOTFOUND labmon1001.eqiad.wmnet
    at handleCallback (/srv/deployment/changeprop/deploy-cache/revs/c25a1c25ca9eb8d4a7c1459add82849b16f665b3/node_modules/hot-shots/lib/statsd.js:369:32)
    at doSend (dgram.js:372:7)
    at afterDns (dgram.js:362:5)
    at /srv/deployment/changeprop/deploy-cache/revs/c25a1c25ca9eb8d4a7c1459add82849b16f665b3/node_modules/dnscache/lib/index.js:136:58
    at Array.forEach (native)
    at /srv/deployment/changeprop/deploy-cache/revs/c25a1c25ca9eb8d4a7c1459add82849b16f665b3/node_modules/dnscache/lib/index.js:136:34
    at GetAddrInfoReqWrap.asyncCallback [as callback] (dns.js:62:16)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:76:17)

Not sure where to look at now. Shall I restart cpjobqueue from Horizon?

This is also interesting:

maurelio@deployment-changeprop:~$ sudo service rsyslog status
● rsyslog.service - System Logging Service
   Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled)
   Active: active (running) since Thu 2019-12-26 16:25:46 UTC; 1h 21min ago
     Docs: man:rsyslogd(8)
           https://www.rsyslog.com/doc/
 Main PID: 15454 (rsyslogd)
   CGroup: /system.slice/rsyslog.service
           └─15454 /usr/sbin/rsyslogd -n -iNONE

Dec 26 17:32:39 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:34:11 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:35:43 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:37:13 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:38:46 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:40:18 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:41:48 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:43:19 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:44:51 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Dec 26 17:46:23 deployment-changeprop rsyslogd[15454]: could not load module 'lmnsd_tcp', errors: trying to load module /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_tcp.so: /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_...g.com/e/2066 ]
Hint: Some lines were ellipsized, use -l to show in full.

Could not change on deployment-eventgate1-3 as those ain't in Horizon (?)

They don't exist. I think they were deleted at some point but remain in encapi. @Andrew?

Not sure where to look at now. Shall I restart cpjobqueue from Horizon?

No need to reboot the instance. I just restarted the cpjobqueue service inside the instance. Those messages should no longer be getting shipped to logstash.

Not sure where to look at now. Shall I restart cpjobqueue from Horizon?

No need to reboot the instance. I just restarted the cpjobqueue service inside the instance. Those messages should no longer be getting shipped to logstash.

Thank you @Krenair. It looks it keeps trying to reach labmon1001 though: https://logstash-beta.wmflabs.org/goto/5f21c9d24cd7ed32e83699e34cba2965 (just picked the most recent one).

On advice of @Andrew I've rebooted via Horizon deployment-changeprop. It looks like, after the reboot, deployment-cpjobqueue is no longer trying to reach labmon1001 and thus I can't see more of those EMERG messages.

Hmm, this spawns each ten minutes:

{
  "_index": "logstash-2019.12.27",
  "_type": "changeprop",
  "_id": "AW9HrbB0pvLXCyw4pFS6",
  "_version": 1,
  "_score": null,
  "_source": {
    "err": {
      "name": "Error",
      "stack": "Error: Local: All broker connections are down\n    at Error (native)",
      "message": "all broker connections are down",
      "origin": "local"
    },
    "level": "ERR",
    "pid": 136,
    "type": "changeprop",
    "message": "Producer error",
    "normalized_message": "Producer error",
    "levelPath": "error/producer",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "syslog",
      "es",
      "normalized_message_untrimmed"
    ],
    "@timestamp": "2019-12-27T14:06:02.801Z",
    "host": "deployment-changeprop",
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      1577455562801
    ]
  },
  "sort": [
    1577455562801
  ]
}

Hmm, this spawns each ten minutes:

{
  "_index": "logstash-2019.12.27",
  "_type": "changeprop",
  "_id": "AW9HrbB0pvLXCyw4pFS6",
  "_version": 1,
  "_score": null,
  "_source": {
    "err": {
      "name": "Error",
      "stack": "Error: Local: All broker connections are down\n    at Error (native)",
      "message": "all broker connections are down",
      "origin": "local"
    },
    "level": "ERR",
    "pid": 136,
    "type": "changeprop",
    "message": "Producer error",
    "normalized_message": "Producer error",
    "levelPath": "error/producer",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "syslog",
      "es",
      "normalized_message_untrimmed"
    ],
    "@timestamp": "2019-12-27T14:06:02.801Z",
    "host": "deployment-changeprop",
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      1577455562801
    ]
  },
  "sort": [
    1577455562801
  ]
}

Doubt that has anything to do with this task.

Based on running grep labmon /var/log/syslog | tail through cumin, these hosts still have errors connecting to labmon: deployment-parsoid09, deployment-sca02, deployment-docker-cxserver01. And possibly deployment-sca01, apparently grep thinks that log file is binary.

Restarted parsoid on -parsoid09, eventstreams on -sca02, mediawiki-services-cxserver on -docker-cxserver01 (which for some reason hasn't come back up properly, might be some weird systemd-docker interaction going on), and eventstreams on -sca01.

journalctl -xu mediawiki-services-cxserver.service has shown the reason for the container on -docker-cxserver01 not coming back up: Dec 27 14:35:22 deployment-docker-cxserver01 docker[13850]: /usr/bin/docker: Error response from daemon: Conflict. The name "/mediawiki-services-cxserver.service" is already in use by container 9e5c76a4f7. Seems that ID appears on docker ps --all as Dead status. I've run docker rm 9e5c76a4f7b0 and started the service to no avail.

Restarted the docker daemon itself and it seems to have recovered.

Change 612160 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/mediawiki-config@master] beta: Update domain for statsd

https://gerrit.wikimedia.org/r/612160

Not sure if this is the right task (the other candidate would be T243322: Unknown host: labmon1001.eqiad.wmnet (PHP Warning: socket_sendto(): Host lookup failed -10001)), but anyway, because of this migration server side logging to Statsd from https://en.wikipedia.beta.wmflabs.org no longer works, because labmon1001.eqiad.wmnet: Name or service not known. I've submitted a patch that maybe fixes the issue, but I'm not sure if it's as simple as updating the domain to cloudservices1001.eqiad.wmnet

Change 612160 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Update domain for statsd

https://gerrit.wikimedia.org/r/612160

Server-side logging to statsd still seems to not work (e.g. BetaMediaWiki.timing in Grafana Labs doesn't seem to have server-side generated data), but I'm not sure where/what to debug next.