We should set a different service name ('restbase-staging') for the staging cluster, so that metrics, logs, and monitoring are separated from production. This will avoid confusion and alerts when load testing in staging.
The disagreement here seems to be about whether we should use different mechanisms for graphite vs. ganglia, nagios and salt.
My proposal (as discussed on the patch) is to use the puppet cluster variable consistently for all of them, as is already the case for ganglia, nagios and salt.
So to summarize: 6 months after the original changeset was created, production and staging metrics (and logging, etc) are still combined. I can't help but point out that this seems like the epitome of Perfect is the enemy of good. What do we need to move this forward?
I'm disappointed by the delay as well. The main issue has been work and discussion stalling.
Giuseppe described how we can update nagios and graphite on the patch. Among other things, this would avoid the false staging alerts in the ops channel. It would also solve the issue consistently for all service::node instances.
I think it's worth looking into this solution, if only to fix the alerts. If it turns out too complex for some reason, then we can still go with a custom variable.
I'd have described the main issue as a lack of consensus about how it should be done.
Giuseppe described how we can update nagios and graphite on the patch, it's just that nobody has gotten around to do it.
@fgiunchedi followed up afterward expressing a desire to keep them separated for now, so I don't think this follows.
I'm happy to resurrect this, but let's wait until we can circle @fgiunchedi back in on the how.
Change 272989 had a related patch set uploaded (by Filippo Giunchedi):
restbase: move test/staging to its own cluster
https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own $cluster (as puppet calls it). That said, I still think it is a good idea not to have restbase logging and statsd depend on $cluster, which is the point of https://gerrit.wikimedia.org/r/#/c/238431 and https://gerrit.wikimedia.org/r/#/c/272536. If anything, the conventions are different (e.g. - vs _ as a separator). Also icinga wise we're not using service/host groups to route notifications afaics. Also IIRC due to the way ganglia works when a host moves clusters it'll have its metrics reset.
Change 273051 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster
Change 273051 abandoned by Eevans:
restbase: override statsd metric prefix for restbase test cluster
Reason:
Try again to sort the dependency.
Change 273052 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster
Change 273061 had a related patch set uploaded (by Eevans):
restbase: override logging name
Ok, so to summarize:
Icinga is a red herring here as far as notifications go; we don't group notifications by {host,service}group. However, putting staging in its own cluster will result in proper host grouping (which will arrange things nicely in the UI, if nothing else).
Changing $cluster for the staging nodes will also change the grouping in Ganglia (Good), but will also result in the metrics being reset for them (Not As Good).
Does this sound about right? If so, it means that a) we need a separate changeset to update Icinga notifications accordingly, and b) we need to be OK with losing Ganglia metrics history for staging (I am personally OK with this; the benefits outweigh the costs IMO).
Otherwise we currently have:
- https://gerrit.wikimedia.org/r/#/c/238431 (configurable statsd prefix)
- https://gerrit.wikimedia.org/r/#/c/273052 (override statsd prefix in staging)
- https://gerrit.wikimedia.org/r/#/c/272989 (move staging to "restbase_test" cluster)
- https://gerrit.wikimedia.org/r/#/c/273061 (override logging name)
... which should cover everything else.
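To picture what the statsd prefix override in the list above amounts to, here is a minimal sketch. It assumes the plain StatsD line format (`<name>:<value>|<type>`); the environment names, the prefix values, and the helper functions are hypothetical and not taken from the actual changesets.

```python
# Hypothetical per-environment prefixes; the real values are set in puppet/hiera.
PREFIX_BY_ENV = {
    "production": "restbase",
    "staging": "restbase-staging",
}


def prefix_for(env: str) -> str:
    """Return the metric prefix for an environment, defaulting to production's."""
    return PREFIX_BY_ENV.get(env, "restbase")


def statsd_line(prefix: str, metric: str, value: int, kind: str = "c") -> str:
    """Format a single StatsD datagram: '<prefix>.<metric>:<value>|<type>'."""
    return f"{prefix}.{metric}:{value}|{kind}"


if __name__ == "__main__":
    # The same counter lands under a different Graphite subtree per environment.
    for env in ("production", "staging"):
        print(statsd_line(prefix_for(env), "requests", 1))
```

With the prefix configurable and overridden only for staging, production metric names stay untouched and staging load tests write to a separate subtree.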
If anything, the conventions are different (e.g. - vs _ as a separator).
For the staging cluster, using _ as the separator should be fine, IMHO. We use both hyphen and underscore as graphite metric separators.
Also icinga wise we're not using service/host groups to route notifications afaics.
We need to separate notifications between staging & production *somehow*, and host groups seem to be a reasonable way to achieve this. Do you see an easier option?
Also IIRC due to the way ganglia works when a host moves clusters it'll have its metrics reset.
That's fine, as it should only affect staging hosts.
Change 272989 merged by Filippo Giunchedi:
restbase: move test/staging to its own cluster
OK, I've deployed https://gerrit.wikimedia.org/r/#/c/272989 this morning, so hosts are now separated. I've moved the per-host rrd files so that per-host history is kept, but the cluster summary graph isn't backfilled.
next up is https://gerrit.wikimedia.org/r/#/c/273061/1 and https://gerrit.wikimedia.org/r/#/c/273052/1
Change 273052 merged by Filippo Giunchedi:
restbase: override statsd metric prefix for restbase test cluster
Mentioned in SAL [2016-02-29T15:30:34Z] <urandom> forcing puppet run in restbase staging ((noop) config deploy) : T103124
Mentioned in SAL [2016-02-29T15:32:32Z] <urandom> forcing puppet run in restbase clustter ((noop) config deploy) : T103124
Mentioned in SAL [2016-02-29T15:33:59Z] <urandom> Perform rolling restart of restbase in staging cluster : T103124
Mentioned in SAL [2016-02-29T15:42:07Z] <urandom> Rolling restart of restbase staging complete : T103124
Mentioned in SAL [2016-02-29T15:43:12Z] <urandom> Perform rolling restart of restbase in production cluster : T103124
Mentioned in SAL [2016-02-29T15:50:33Z] <urandom> Rolling restart of restbase production complete : T103124
I believe Icinga notifications (staging vs. production) are also in-scope for this ticket. I'll try to have a look at this today.
Re: Icinga notifications
For Services I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.
@fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?
I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing. It's... not an impressive sight, in part I think because there hasn't been any traffic on Staging since the rename, and because some of the graphed metrics might not be applicable for staging.
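For anyone repeating the export/munge/import step, a rough sketch of the "munging" part could look like the following. It assumes the exported dashboard is JSON in which Graphite queries sit under `target` keys; the sample dashboard, the metric name, and the two prefixes are made up for illustration.

```python
import json


def rewrite_targets(node, old_prefix, new_prefix):
    """Return a copy of an exported dashboard with Graphite targets re-prefixed."""
    if isinstance(node, dict):
        return {
            key: (
                value.replace(old_prefix, new_prefix)
                if key == "target" and isinstance(value, str)
                else rewrite_targets(value, old_prefix, new_prefix)
            )
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rewrite_targets(item, old_prefix, new_prefix) for item in node]
    return node


# Tiny stand-in for an exported dashboard (hypothetical structure and metric).
exported = json.loads(
    '{"title": "restbase",'
    ' "rows": [{"panels": [{"targets": [{"target": "restbase.requests.rate"}]}]}]}'
)

# Assumed prefixes: production "restbase.", staging "restbase-staging.".
staging = rewrite_targets(exported, "restbase.", "restbase-staging.")
staging["title"] = "restbase-staging"

if __name__ == "__main__":
    print(json.dumps(staging, indent=2))
```

Only `target` strings are touched, so panel titles and layout carry over unchanged; panels whose metrics do not exist in staging would still need pruning by hand.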
Cassandra:
@GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?
I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing.
Makes sense. Should we remove https://grafana.wikimedia.org/dashboard/db/restbase-staging-backup?
@GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?
This is FYI.
host-specific icinga checks (e.g. CQL port (not) responding) are not affected by the metrics split (i.e. they'll continue to go to team-services contact)
for graphite-based checks (the ones in role::restbase::alerts) now the prefix is different, thus we can define (a subset of) checks against a different prefix
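As a rough illustration of defining a subset of graphite-based checks against the staging prefix, the sketch below builds Graphite render API URLs per prefix. The base URL, metric names, prefixes, and the choice of subset are all hypothetical; the real checks are defined in puppet (role::restbase::alerts) and are not reproduced here.

```python
from urllib.parse import urlencode

# Hypothetical metric suffixes that production checks might watch.
PRODUCTION_CHECKS = ["req.5xx.rate", "req.latency.p99", "storage.errors"]
# Staging only gets a subset, since not every check makes sense there.
STAGING_CHECKS = ["req.5xx.rate"]


def render_url(base_url: str, prefix: str, metric: str, window: str = "-10min") -> str:
    """Build a Graphite render API URL for one prefixed metric."""
    query = urlencode({"target": f"{prefix}.{metric}", "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


def check_urls(base_url: str, prefix: str, metrics: list) -> list:
    """One render URL per metric, all rooted at the given prefix."""
    return [render_url(base_url, prefix, metric) for metric in metrics]


if __name__ == "__main__":
    base = "https://graphite.example.org"  # placeholder host
    for url in check_urls(base, "restbase-staging", STAGING_CHECKS):
        print(url)
```

Because the prefix is the only thing that differs, the staging checks can reuse the production definitions and simply be given a shorter metric list and, if desired, different thresholds or contacts.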
@fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?
AFAICS we don't have a way around default irc notifications because the minimum is contact_groups admins and that's routed to -operations too.
So if I understand correctly: Icinga monitoring at the WMF (as configured in Puppet) assumes a certain uniformity in how notifications are handled (i.e. it lacks the flexibility to do what's described above). Basically, if a service is worth monitoring for availability, any notifications of failures are expected to be broadcast (minimally) to admins. If true, that would seem to imply that the RESTBase staging/test environment is exceptional in this regard. Do we have any other "non-production" systems on the production network monitored by Icinga? Is there any impetus outside this issue to change things?
TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?
Yes, restbase staging/test is exceptional in that we don't have similar non-production systems in the production network afaik. Re: impetus, it is what you mentioned, namely T124179.
TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?
I'm going to resolve this for now since the bulk of it is done, we can revisit based on T124179
It would seem this issue is still incomplete; RESTBase logging for the staging cluster continues to be grouped with production.
Logstash still contains entries from staging, cf. https://logstash.wikimedia.org/#dashboard/temp/AVSZVMRu0z-7ykXO0IvV
Change 290922 had a related patch set uploaded (by Mobrovac):
RESTBase: use the appropriate logger name
Change 290922 merged by Alexandros Kosiaris:
RESTBase: use the appropriate logger name