CloudVPS: diamond report some metrics with the host IP address instead of host name
Closed, Declined (Public)

Description

Since hosts started being migrated to the new region, Diamond seems to report some metrics using the host IP address instead of the hostname. It seems to sort itself out eventually. I suspect there is a race condition where Diamond starts before the hostname is available. I do not think this happened in the old region.

Entries look like: project.integration.host-172.16.0.78
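
To find the affected series, one could match the placeholder name inside the metric path. The following is a minimal sketch, not anything Diamond or Grafana provides. It assumes the placeholder is always `host-` followed by four octets joined by dots (as in the entry above) or dashes (as in DNS-style names), and the function name `find_placeholder_host` is hypothetical.

```python
import re

# Assumption: placeholder names are "host-" plus an IPv4 address whose octets
# are joined by dots (metric paths) or dashes (DNS-style names).
PLACEHOLDER_RE = re.compile(
    r"\bhost-(\d{1,3})[.-](\d{1,3})[.-](\d{1,3})[.-](\d{1,3})\b"
)


def find_placeholder_host(metric_path):
    """Return the embedded IPv4 address as a dotted string, or None."""
    match = PLACEHOLDER_RE.search(metric_path)
    if match is None:
        return None
    octets = match.groups()
    # Reject digit groups that cannot be IPv4 octets.
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)


if __name__ == "__main__":
    print(find_placeholder_host("project.integration.host-172.16.0.78"))
```

A metric path that carries a regular hostname yields `None`, so the function can serve as a filter over a list of series names.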

Example for the integration project on Grafana:

https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?orgId=1&from=now-1h&to=now&var-project=integration&var-server=All

Event Timeline

aborrero renamed this task from Diamond report some metrics with the host IP address instead of host name to CloudVPS: diamond report some metrics with the host IP address instead of host name. Nov 20 2018, 12:04 PM
aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.
aborrero added subscribers: Andrew, aborrero.

CC'ing @Andrew to see if this is a known issue.

Relevant extracts from wikimedia-releng:

Nov 19 22:30:29 <Krenair>	root@deployment-puppetmaster03:~# puppet cert list
Nov 19 22:30:29 <Krenair>	  "deployment-puppetdb02.deployment-prep.eqiad.wmflabs" (SHA256) 5D:B2:96:51:8D:49:63:30:B5:51:27:2D:78:35:8B:2F:E2:FC:3A:88:5C:F9:AE:64:49:E6:ED:03:73:6B:9D:03
Nov 19 22:30:29 <Krenair>	  "host-172-16-4-100.deployment-prep.eqiad.wmflabs"     (SHA256) 28:A8:84:1C:29:EC:08:03:9E:A4:D5:C3:25:3B:A4:3D:C1:6E:D9:F2:61:B3:EE:DC:24:68:E7:34:E8:11:73:22
Nov 19 22:30:29 <Krenair>	  "host-172-16-4-106.deployment-prep.eqiad.wmflabs"     (SHA256) A4:A6:93:2F:0D:1A:FD:7A:73:B2:14:48:BF:2E:33:AE:E8:22:68:15:5B:B2:FA:3F:4D:23:2D:55:33:AD:51:AC
Nov 19 22:30:29 <Krenair>	  "host-172-16-4-116.deployment-prep.eqiad.wmflabs"     (SHA256) 76:52:82:18:AF:CD:79:7D:50:63:5E:82:99:E9:6D:D7:D6:69:6F:6D:B6:A9:CD:01:BB:9E:83:9D:9C:B7:83:CB
Nov 19 22:30:29 <Krenair>	  "host-172-16-4-19.deployment-prep.eqiad.wmflabs"      (SHA256) 15:05:C5:7D:86:10:BA:ED:68:73:D2:DD:00:13:52:FB:CB:C5:BD:5A:E1:82:C6:D5:92:51:AC:AB:FA:F0:51:F2
Nov 19 22:30:29 <Krenair>	root@deployment-puppetmaster03:~# puppet cert sign deployment-p
Nov 19 22:31:07 <Krenair>	those must be mid-migration instances but why are they starting up and trying to get puppet certs with the wrong hostname?
Nov 19 23:42:52 <andrewbogott>	thank you for fixing puppet, Krenair.  Those puppet certs with bogus hostnames will probably keep creeping in (the migrated hosts boot once with the wrong hostname before getting fixed.)  I can clean them up at the end unless they're actively breaking things in the meantime.
Nov 19 23:44:15 <Krenair>	andrewbogott, they're not actively breaking things but is it really necessary to boot them with the wrong names?
Nov 19 23:47:20 <andrewbogott>	I don't know why it happens.  It's not a race, since dhcp has had the whole copy time to get up to date with the right names.
Nov 19 23:48:15 <andrewbogott>	they come up with their old IP, ask dhcp for a name and get host-172-whatever along with their new, correct IP.  Then after a reboot they get the right hostname (and already have the right IP from before).
Nov 19 23:48:25 <andrewbogott>	It's ugly but seems mostly harmless
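
For the eventual cleanup, the bogus certificate names could be picked out of the `puppet cert list` output before anyone reviews them. This is only a rough sketch. It assumes each request line has the quoted-name-then-fingerprint layout shown in the log above (without the IRC timestamp and nick prefix), and the helper name `placeholder_certnames` is hypothetical. It only lists names and does not sign or remove anything.

```python
import re
from typing import List

# Assumption: each pending request is printed as  "certname" (DIGEST) fingerprint
CERT_LINE_RE = re.compile(r'^\s*"([^"]+)"\s+\(')
# Assumption: bogus names start with "host-" plus four dash-separated octets.
PLACEHOLDER_RE = re.compile(r"^host-\d{1,3}(?:-\d{1,3}){3}\.")


def placeholder_certnames(cert_list_output: str) -> List[str]:
    """Return the certnames in the listing that look like IP-derived placeholders."""
    names = []
    for line in cert_list_output.splitlines():
        match = CERT_LINE_RE.match(line)
        if match and PLACEHOLDER_RE.match(match.group(1)):
            names.append(match.group(1))
    return names
```

Applied to the listing quoted above, this would keep the four `host-172-16-4-*` entries and skip `deployment-puppetdb02`.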

There was also the issue of SSH host keys changing - I think on Jessie instances? - that was brought up again recently

Let's just decline this task based on @Andrew's comment? It's transient, and instances eventually get the proper hostname.

bd808 subscribed.

> Let's just decline this task based on @Andrew's comment? It's transient, and instances eventually get the proper hostname.

Plus, we are hoping to eliminate Diamond (T210993).