
Separate metrics, logs, and monitoring between staging and production
Closed, Resolved (Public)

Description

We should set a different service name ('restbase-staging') for the staging cluster, so that metrics, logs, and monitoring are separated from production. This will avoid confusion and alerts when load testing in staging.
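For illustration, a minimal sketch of what the description proposes (the class and parameter names here are assumptions, not the actual puppet interface; the real changes are in the Gerrit patches discussed below):

```
# Hypothetical sketch only: point the staging cluster at its own service
# name, so that metric keys, log entries, and monitoring checks are all
# derived from 'restbase-staging' instead of 'restbase'.
class role::restbase::staging {
    class { '::restbase':
        service_name => 'restbase-staging',  # production uses 'restbase'
    }
}
```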

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: RESTBase.
GWicke subscribed.
GWicke set Security to None.
GWicke edited subscribers, added: mobrovac, Eevans, Pchelolo; removed: Aklapper.
Eevans renamed this task from "Separate metrics & logs between staging and production" to "Separate metrics, logs, and monitoring between staging and production". (Feb 23 2016, 9:19 PM)
Eevans updated the task description. (Show Details)
Eevans added a subscriber: fgiunchedi.

The disagreement here seems to be about whether we should use different mechanisms for graphite vs. ganglia, nagios and salt.

My proposal (as discussed on the patch) is to use the puppet cluster variable consistently for all of them, as is already the case for ganglia, nagios and salt.
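As a rough sketch of that proposal (the node patterns and cluster value below are hypothetical; in the WMF tree, $cluster is the pre-existing grouping variable):

```
# Illustrative only: one $cluster value per environment drives grouping
# consistently, as it already does for ganglia, nagios and salt.
node /^restbase-test\d+\./ {
    # Grouping for ganglia, nagios and salt already keys off $cluster;
    # the proposal is for metrics and logging to do the same.
    $cluster = 'restbase-staging'  # hypothetical staging cluster name
    include role::restbase
}

node /^restbase\d+\./ {
    $cluster = 'restbase'
    include role::restbase
}
```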

So, to summarize: six months after the original changeset was created, production and staging metrics (and logging, etc.) are still combined. I can't help but point out that this seems like the epitome of "perfect is the enemy of good". What do we need to move this forward?

I'm disappointed by the delay as well. The main issue has been work and discussion stalling.

Giuseppe described how we can update nagios and graphite on the patch. Among other things, this would avoid the false staging alerts in the ops channel. It would also solve the issue consistently for all service::node instances.

I think it's worth looking into this solution, if only to fix the alerts. If it turns out to be too complex for some reason, then we can still go with a custom variable.

> I'm disappointed by the delay as well. The main issue has been work and discussion stalling.

I'd have described the main issue as a lack of consensus about how it should be done.

> Giuseppe described how we can update nagios and graphite on the patch; it's just that nobody has gotten around to doing it.

@fgiunchedi followed up afterward expressing a desire to keep them separated for now, so I don't think this follows.


I'm happy to resurrect this, but let's wait until we can circle @fgiunchedi back in on the how.

Change 272989 had a related patch set uploaded (by Filippo Giunchedi):
restbase: move test/staging to its own cluster

https://gerrit.wikimedia.org/r/272989

https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own $cluster (as puppet calls it). That said, I still think it is a good idea not to have RESTBase logging and statsd depend on $cluster, which is the point of https://gerrit.wikimedia.org/r/#/c/238431 and https://gerrit.wikimedia.org/r/#/c/272536. If anything, the conventions are different (e.g. - vs _ as a separator). Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS. Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

Change 273051 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273051

Change 273051 abandoned by Eevans:
restbase: override statsd metric prefix for restbase test cluster

Reason:
Try again to sort the dependency.

https://gerrit.wikimedia.org/r/273051

Change 273052 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273052

Change 273061 had a related patch set uploaded (by Eevans):
restbase: override logging name

https://gerrit.wikimedia.org/r/273061
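The patches themselves are linked above; roughly, the two overrides amount to something like this (the parameter names are illustrative assumptions, not the real restbase class interface):

```
# Hypothetical sketch: decouple metric and log naming from $cluster by
# overriding both identifiers explicitly for the test cluster.
class role::restbase::test {
    class { '::restbase':
        statsd_prefix => 'restbase-staging',  # metrics go to a separate graphite tree
        logging_name  => 'restbase-staging',  # logstash entries are tagged separately
    }
}
```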

> https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own $cluster (as puppet calls it). That said, I still think it is a good idea not to have RESTBase logging and statsd depend on $cluster, which is the point of https://gerrit.wikimedia.org/r/#/c/238431 and https://gerrit.wikimedia.org/r/#/c/272536. If anything, the conventions are different (e.g. - vs _ as a separator). Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS. Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

Ok, so to summarize:

Icinga is a red herring here as far as notifications go; we don't group notifications by {host,service}group. However, putting staging in its own cluster will result in proper host grouping (which will arrange things nicely in the UI, if nothing else).

Changing $cluster for the staging nodes will also change the grouping in Ganglia (Good), but will also result in the metrics being reset for them (Not As Good).

Does this sound about right? If so, it means that a) we need a separate changeset to update Icinga notifications accordingly, and b) we need to be OK with losing the Ganglia metrics history for staging (I am personally OK with this; IMO the benefits outweigh the loss).

Otherwise we currently have:

... which should cover everything else.

> If anything, the conventions are different (e.g. - vs _ as a separator).

For the staging cluster, using _ as the separator should be fine, IMHO. We use both hyphen and underscore as graphite metric separators.

> Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS.

We need to separate notifications between staging & production *somehow*, and host groups seem to be a reasonable way to achieve this. Do you see an easier option?

> Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

That's fine, as it should only affect staging hosts.

Change 272989 merged by Filippo Giunchedi:
restbase: move test/staging to its own cluster

https://gerrit.wikimedia.org/r/272989

OK, I've deployed https://gerrit.wikimedia.org/r/#/c/272989 this morning, so hosts are now separated. I've moved the per-host RRD files so that history is kept, but the cluster summary graph isn't backfilled.

Next up are https://gerrit.wikimedia.org/r/#/c/273061/1 and https://gerrit.wikimedia.org/r/#/c/273052/1.

Change 273061 merged by Filippo Giunchedi:
restbase: override logging name

https://gerrit.wikimedia.org/r/273061

Change 273052 merged by Filippo Giunchedi:
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273052

Mentioned in SAL [2016-02-29T15:30:34Z] <urandom> forcing puppet run in restbase staging ((noop) config deploy) : T103124

Mentioned in SAL [2016-02-29T15:32:32Z] <urandom> forcing puppet run in restbase cluster ((noop) config deploy) : T103124

Mentioned in SAL [2016-02-29T15:33:59Z] <urandom> Perform rolling restart of restbase in staging cluster : T103124

Mentioned in SAL [2016-02-29T15:42:07Z] <urandom> Rolling restart of restbase staging complete : T103124

Mentioned in SAL [2016-02-29T15:43:12Z] <urandom> Perform rolling restart of restbase in production cluster : T103124

Mentioned in SAL [2016-02-29T15:50:33Z] <urandom> Rolling restart of restbase production complete : T103124

I believe Icinga notifications (staging vs. production) are also in-scope for this ticket. I'll try to have a look at this today.

Re: Icinga notifications

For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.

@fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?

See also these existing staging dashboards:

I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing. It's... not an impressive sight, in part I think because there hasn't been any traffic on Staging since the rename, and because some of the graphed metrics might not be applicable for staging.

Cassandra:

@GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?

> I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing.

Makes sense. Should we remove https://grafana.wikimedia.org/dashboard/db/restbase-staging-backup?

> @GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?

This is FYI.

Yup, done; for posterity's sake, I attached an export.


> Re: Icinga notifications
>
> For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.
>
> @fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?

Host-specific Icinga checks (e.g. CQL port (not) responding) are not affected by the metrics split (i.e. they'll continue to go to the team-services contact).

For graphite-based checks (the ones in role::restbase::alerts), the prefix is now different, so we can define (a subset of) checks against the new prefix.
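For example, a staging-scoped check might look roughly like this (monitoring::graphite_threshold is the resource typically used for such checks in the WMF puppet tree, but the metric path and thresholds below are made up for illustration):

```
# Illustrative sketch: the same kind of threshold check, pointed at the
# new staging prefix. Metric path and threshold values are hypothetical.
monitoring::graphite_threshold { 'restbase-staging_request_5xx':
    description   => 'RESTBase staging 5xx rate',
    metric        => 'restbase-staging.request.5xx.rate',  # assumed metric path
    warning       => 1,
    critical      => 5,
    contact_group => 'team-services',  # keep staging noise off the wider lists
}
```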

>> Re: Icinga notifications
>>
>> For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.
>>
>> @fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?
>
> Host-specific Icinga checks (e.g. CQL port (not) responding) are not affected by the metrics split (i.e. they'll continue to go to the team-services contact).
>
> [ ... ]

@fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.

How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?

> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>
> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?

AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.
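To make the constraint concrete, a hedged sketch: a check can name a narrower contact group, but per the above the effective minimum still includes admins, whose notifications feed the -operations channel. Resource and parameter names here follow common WMF puppet usage, but the check name and command are illustrative assumptions:

```
# Illustrative only: even with an explicit, narrower contact group on a
# staging check, the 'admins' floor keeps IRC notifications flowing to
# #wikimedia-operations, which is exactly the limitation described above.
monitoring::service { 'restbase-staging-http':
    description   => 'RESTBase staging HTTP',       # hypothetical check
    check_command => 'check_http_on_port!7231',     # assumed command name
    contact_group => 'team-services',
}
```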

>> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>>
>> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?
>
> AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.

So if I understand correctly: Icinga monitoring at the WMF (as configured in Puppet) assumes a certain uniformity in how notifications are handled (i.e. it lacks the flexibility to do what's described above). Basically, that if a service is worth monitoring for availability, any notifications of failures are expected to be broadcast (minimally) to admins. If true, that would seem to imply that the RESTBase staging/test environment is exceptional in this regard. Do we have any other "non-production" systems on the production network monitored by Icinga? Is there any impetus outside this issue to change things?

TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?

>> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>>
>> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?
>>
>> AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.
>
> So if I understand correctly: Icinga monitoring at the WMF (as configured in Puppet) assumes a certain uniformity in how notifications are handled (i.e. it lacks the flexibility to do what's described above). Basically, that if a service is worth monitoring for availability, any notifications of failures are expected to be broadcast (minimally) to admins. If true, that would seem to imply that the RESTBase staging/test environment is exceptional in this regard. Do we have any other "non-production" systems on the production network monitored by Icinga? Is there any impetus outside this issue to change things?

Yes, RESTBase staging/test is exceptional, in that we don't have similar non-production systems in the production network AFAIK. Re: impetus, it's what you mentioned, namely T124179.

> TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?

I'm going to resolve this for now since the bulk of it is done; we can revisit based on T124179.

It would seem this issue is still incomplete; RESTBase logging for the staging cluster continues to be grouped with production.

Moving to @Eevans for the logstash/restbase part.

Eevans moved this task from Backlog to Next on the Cassandra board.

Change 290922 had a related patch set uploaded (by Mobrovac):
RESTBase: use the appropriate logger name

https://gerrit.wikimedia.org/r/290922

Change 290922 merged by Alexandros Kosiaris:
RESTBase: use the appropriate logger name

https://gerrit.wikimedia.org/r/290922

Eevans closed this task as Resolved. (Aug 3 2016, 3:06 PM)

I think this is now complete; closing.