
Service name needed for things wanting to write data to WMCS's statsd service
Closed, Declined (Public)

Description

Hey, thanks for doing this, but I really think this needs some sort of announcement. Lots of tools in labs depend on this, and one of our production services that has a beta version in the beta cluster broke because of it. It took me some time to find out what labmon1001 was changed to (the wikitech documentation doesn't have anything either).

@Ladsgroup could you explain the dependency on the physical host name? It sounds to me like there may be a use case for a service name to be used instead.

It's used in the beta cluster (see https://github.com/search?q=org%3Awikimedia+labmon1001&type=Code) as the host that receives statsd traffic; even mediawiki-config uses it.

Folks (including past WMCS team members) have hard-coded the labmon1001.eqiad.wmnet hostname as a destination for statsd traffic. This broke things for people when the physical host was renamed as part of its Buster reimaging. It was a bad idea even before that, however, because labmon1001 could have been made secondary at any time.

If it is reasonable for folks to be using this statsd endpoint, we should create a service name for it that will be durable across host renaming as well as cluster state changes. If it is not reasonable for folks to be using it outside of Puppet-managed data collection, we should find a way to limit access so that failures happen during development rather than at runtime.

Event Timeline

bd808 triaged this task as Medium priority. Jan 7 2020, 10:16 PM
bd808 moved this task from Needs discussion to Inbox on the cloud-services-team (Kanban) board.

Discussed in the 2020-01-07 WMCS team meeting. Nobody seemed especially happy about the idea of providing statsd/graphite as a generic service to Cloud VPS projects, but we also did not have a strong reason to block folks from doing so. Adding a service name as a simple DNS CNAME record is possible. Our main concern with that is ensuring that the CNAME is updated as various changes happen to the cloudmetrics cluster hosts.
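
To make the "is the CNAME still up to date?" concern concrete, here is a minimal sketch of a check that compares what a service alias resolves to against what the intended backend host resolves to. It uses only the Python standard library. The function names are hypothetical, no such check is known to exist in the WMCS tooling, and any alias name passed to it would be an assumption, since no service name was ever created for this task.

```python
import socket


def resolve_ipv4(name):
    """Return the set of IPv4 addresses that `name` resolves to.

    Returns an empty set if the name cannot be resolved.
    """
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_DGRAM
        )
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def alias_points_at(alias, expected_host):
    """True if `alias` resolves, and resolves to the same IPv4
    addresses as `expected_host` (both arguments are caller-supplied)."""
    alias_addrs = resolve_ipv4(alias)
    return bool(alias_addrs) and alias_addrs == resolve_ipv4(expected_host)
```

Run periodically, a check like this would flag an alias that was left pointing at a renamed or retired host, which is the failure mode the meeting was worried about.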

For what it’s worth, my Tool-inteGraality uses the WMCS statsd service, with a hardcoded host, cloudmetrics1001. After T297444, and because the pystatsd library simply crashes if the host is unreachable, the entire web service was down; see T325936: cloudmetrics1001 is unreachable, preventing integraality webserver to start for details.
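
One way a tool could avoid this kind of outage is to treat metrics as best-effort, so that an unreachable or unresolvable statsd host never stops the web service from starting. The following is a rough sketch of that idea using only the Python standard library. It is not pystatsd's API: the `SafeStatsd` class is a hypothetical hand-rolled stand-in, and 8125 is only the conventional statsd UDP port, assumed here rather than confirmed for the WMCS service.

```python
import socket


class SafeStatsd:
    """Best-effort statsd counter client (hypothetical sketch).

    The host name is resolved at send time rather than at construction,
    so a missing or renamed host cannot break application startup.
    """

    def __init__(self, host, port=8125, prefix=""):
        self.host = host
        self.port = port
        self.prefix = prefix
        # Creating a UDP socket does not contact the remote host.
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, stat, count=1):
        """Send a counter increment; return False instead of raising."""
        name = f"{self.prefix}.{stat}" if self.prefix else stat
        # statsd counter wire format: <name>:<value>|c
        payload = f"{name}:{count}|c".encode("ascii")
        try:
            self._sock.sendto(payload, (self.host, self.port))
            return True
        except OSError:
            # Covers resolution failures (socket.gaierror) and
            # network errors; metrics are dropped, not fatal.
            return False
```

With this shape, a renamed host costs the tool its metrics until the configuration is fixed, but the web service itself keeps running.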