
Separate metrics, logs, and monitoring between staging and production
Closed, Resolved (Public)

Description

We should set a different service name ('restbase-staging') for the staging cluster, so that metrics, logs, and monitoring are separated from production. This will avoid confusion and alerts when load testing in staging.
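For illustration, a minimal sketch of what the description proposes (the class and parameter names here are assumptions, not the actual puppet interface; the real changes are in the Gerrit patches discussed below):

```
# Hypothetical sketch only: point the staging cluster at its own service
# name, so that metric keys, log entries, and monitoring checks are all
# derived from 'restbase-staging' instead of 'restbase'.
class role::restbase::staging {
    class { '::restbase':
        service_name => 'restbase-staging',  # production uses 'restbase'
    }
}
```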

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: RESTBase.
GWicke subscribed.
GWicke set Security to None.
GWicke edited subscribers, added: mobrovac, Eevans, Pchelolo; removed: Aklapper.
Eevans renamed this task from "Separate metrics & logs between staging and production" to "Separate metrics, logs, and monitoring between staging and production". (Feb 23 2016, 9:19 PM)
Eevans updated the task description. (Show Details)
Eevans added a subscriber: fgiunchedi.

The disagreement here seems to be about whether we should use different mechanisms for graphite vs. ganglia, nagios and salt.

My proposal (as discussed on the patch) is to use the puppet cluster variable consistently for all of them, as is already the case for ganglia, nagios and salt.
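As a rough sketch of that proposal (the node patterns and cluster value below are hypothetical; in the WMF tree, $cluster is the pre-existing grouping variable):

```
# Illustrative only: one $cluster value per environment drives grouping
# consistently, as it already does for ganglia, nagios and salt.
node /^restbase-test\d+\./ {
    # Grouping for ganglia, nagios and salt already keys off $cluster;
    # the proposal is for metrics and logging to do the same.
    $cluster = 'restbase-staging'  # hypothetical staging cluster name
    include role::restbase
}

node /^restbase\d+\./ {
    $cluster = 'restbase'
    include role::restbase
}
```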

So, to summarize: six months after the original changeset was created, production and staging metrics (and logging, etc.) are still combined. I can't help but point out that this seems like the epitome of "perfect is the enemy of good". What do we need to move this forward?

I'm disappointed by the delay as well. The main issue has been work and discussion stalling.

Giuseppe described how we can update nagios and graphite on the patch. Among other things, this would avoid the false staging alerts in the ops channel. It would also solve the issue consistently for all service::node instances.

I think it's worth looking into this solution, if only to fix the alerts. If it turns out to be too complex for some reason, then we can still go with a custom variable.

> I'm disappointed by the delay as well. The main issue has been work and discussion stalling.

I'd have described the main issue as a lack of consensus about how it should be done.

> Giuseppe described how we can update nagios and graphite on the patch; it's just that nobody has gotten around to doing it.

@fgiunchedi followed up afterward expressing a desire to keep them separated for now, so I don't think this follows.


I'm happy to resurrect this, but let's wait until we can circle @fgiunchedi back in on the how.

Change 272989 had a related patch set uploaded (by Filippo Giunchedi):
restbase: move test/staging to its own cluster

https://gerrit.wikimedia.org/r/272989

https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own $cluster (as puppet calls it). That said, I still think it is a good idea not to have RESTBase logging and statsd depend on $cluster, which is the point of https://gerrit.wikimedia.org/r/#/c/238431 and https://gerrit.wikimedia.org/r/#/c/272536. If anything, the conventions are different (e.g. - vs _ as a separator). Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS. Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

Change 273051 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273051

Change 273051 abandoned by Eevans:
restbase: override statsd metric prefix for restbase test cluster

Reason:
Try again to sort the dependency.

https://gerrit.wikimedia.org/r/273051

Change 273052 had a related patch set uploaded (by Eevans):
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273052

Change 273061 had a related patch set uploaded (by Eevans):
restbase: override logging name

https://gerrit.wikimedia.org/r/273061
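The patches themselves are linked above; roughly, the two overrides amount to something like this (the parameter names are illustrative assumptions, not the real restbase class interface):

```
# Hypothetical sketch: decouple metric and log naming from $cluster by
# overriding both identifiers explicitly for the test cluster.
class role::restbase::test {
    class { '::restbase':
        statsd_prefix => 'restbase-staging',  # metrics go to a separate graphite tree
        logging_name  => 'restbase-staging',  # logstash entries are tagged separately
    }
}
```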

> https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own $cluster (as puppet calls it). That said, I still think it is a good idea not to have RESTBase logging and statsd depend on $cluster, which is the point of https://gerrit.wikimedia.org/r/#/c/238431 and https://gerrit.wikimedia.org/r/#/c/272536. If anything, the conventions are different (e.g. - vs _ as a separator). Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS. Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

Ok, so to summarize:

Icinga is a red herring here as far as notifications go; we don't group notifications by {host,service}group. However, putting staging in its own cluster will result in proper host grouping (which will arrange things nicely in the UI, if nothing else).

Changing $cluster for the staging nodes will also change the grouping in Ganglia (Good), but will also result in the metrics being reset for them (Not As Good).

Does this sound about right? If so, it means that a) we need a separate changeset to update Icinga notifications accordingly, and b) we need to be OK with losing the Ganglia metrics history for staging (I am personally OK with this; IMO the benefits outweigh the loss).

Otherwise we currently have:

... which should cover everything else.

> If anything, the conventions are different (e.g. - vs _ as a separator).

For the staging cluster, using _ as the separator should be fine, IMHO. We use both hyphen and underscore as graphite metric separators.

> Also, Icinga-wise, we're not using service/host groups to route notifications, AFAICS.

We need to separate notifications between staging & production *somehow*, and host groups seem to be a reasonable way to achieve this. Do you see an easier option?

> Also, IIRC, due to the way Ganglia works, when a host moves clusters it'll have its metrics reset.

That's fine, as it should only affect staging hosts.

Change 272989 merged by Filippo Giunchedi:
restbase: move test/staging to its own cluster

https://gerrit.wikimedia.org/r/272989

OK, I've deployed https://gerrit.wikimedia.org/r/#/c/272989 this morning, so hosts are now separated. I've moved the per-host RRD files so that history is kept, but the cluster summary graph isn't backfilled.

Next up are https://gerrit.wikimedia.org/r/#/c/273061/1 and https://gerrit.wikimedia.org/r/#/c/273052/1.

Change 273061 merged by Filippo Giunchedi:
restbase: override logging name

https://gerrit.wikimedia.org/r/273061

Change 273052 merged by Filippo Giunchedi:
restbase: override statsd metric prefix for restbase test cluster

https://gerrit.wikimedia.org/r/273052

Mentioned in SAL [2016-02-29T15:30:34Z] <urandom> forcing puppet run in restbase staging ((noop) config deploy) : T103124

Mentioned in SAL [2016-02-29T15:32:32Z] <urandom> forcing puppet run in restbase cluster ((noop) config deploy) : T103124

Mentioned in SAL [2016-02-29T15:33:59Z] <urandom> Perform rolling restart of restbase in staging cluster : T103124

Mentioned in SAL [2016-02-29T15:42:07Z] <urandom> Rolling restart of restbase staging complete : T103124

Mentioned in SAL [2016-02-29T15:43:12Z] <urandom> Perform rolling restart of restbase in production cluster : T103124

Mentioned in SAL [2016-02-29T15:50:33Z] <urandom> Rolling restart of restbase production complete : T103124

I believe Icinga notifications (staging vs. production) are also in-scope for this ticket. I'll try to have a look at this today.

Re: Icinga notifications

For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.

@fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?

See also these existing staging dashboards:

I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing. It's... not an impressive sight, in part I think because there hasn't been any traffic on Staging since the rename, and because some of the graphed metrics might not be applicable for staging.

Cassandra:

@GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?

> I created https://grafana-admin.wikimedia.org/dashboard/db/restbase-staging by exporting https://grafana-admin.wikimedia.org/dashboard/db/restbase, munging the metric names, and importing.

Makes sense. Should we remove https://grafana.wikimedia.org/dashboard/db/restbase-staging-backup?

> @GWicke: was this meant to be informational, or was there something about these dashboards that needs to change?

This is FYI.

Yup, done; for posterity's sake, I attached an export.


> Re: Icinga notifications
>
> For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.
>
> @fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?

Host-specific Icinga checks (e.g. CQL port (not) responding) are not affected by the metrics split (i.e. they'll continue to go to the team-services contact).

For graphite-based checks (the ones in role::restbase::alerts), the prefix is now different, so we can define (a subset of) checks against the new prefix.
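For example, a staging-scoped check might look roughly like this (monitoring::graphite_threshold is the resource typically used for such checks in the WMF puppet tree, but the metric path and thresholds below are made up for illustration):

```
# Illustrative sketch: the same kind of threshold check, pointed at the
# new staging prefix. Metric path and threshold values are hypothetical.
monitoring::graphite_threshold { 'restbase-staging_request_5xx':
    description   => 'RESTBase staging 5xx rate',
    metric        => 'restbase-staging.request.5xx.rate',  # assumed metric path
    warning       => 1,
    critical      => 5,
    contact_group => 'team-services',  # keep staging noise off the wider lists
}
```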

>> Re: Icinga notifications
>>
>> For Services, I assume it is true that we'd want to continue receiving email notifications for failures in staging, and that this boils down to the two classes of notifications being somehow distinguishable.
>>
>> @fgiunchedi do you have any suggestions for how to best accomplish this (via puppet)?
>
> Host-specific Icinga checks (e.g. CQL port (not) responding) are not affected by the metrics split (i.e. they'll continue to go to the team-services contact).
>
> [ ... ]

@fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.

How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?

> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>
> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?

AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.
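To make the constraint concrete, a hedged sketch: a check can name a narrower contact group, but per the above the effective minimum still includes admins, whose notifications feed the -operations channel. Resource and parameter names here follow common WMF puppet usage, but the check name and command are illustrative assumptions:

```
# Illustrative only: even with an explicit, narrower contact group on a
# staging check, the 'admins' floor keeps IRC notifications flowing to
# #wikimedia-operations, which is exactly the limitation described above.
monitoring::service { 'restbase-staging-http':
    description   => 'RESTBase staging HTTP',       # hypothetical check
    check_command => 'check_http_on_port!7231',     # assumed command name
    contact_group => 'team-services',
}
```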

>> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>>
>> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?
>
> AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.

So if I understand correctly: Icinga monitoring at the WMF (as configured in Puppet) assumes a certain uniformity in how notifications are handled (i.e. it lacks the flexibility to do what's described above). Basically, that if a service is worth monitoring for availability, any notifications of failures are expected to be broadcast (minimally) to admins. If true, that would seem to imply that the RESTBase staging/test environment is exceptional in this regard. Do we have any other "non-production" systems on the production network monitored by Icinga? Is there any impetus outside this issue to change things?

TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?

>> @fgiunchedi I think the general idea is that for the staging servers, it doesn't make sense to Sound The Alarm at the same intensity as for the production nodes, nor does it make sense to send them to the wider audience. These nodes will only ever contribute to pager fatigue.
>>
>> How difficult would it be to limit the contact group to team-services for staging (or perhaps to an alternate contact definition altogether, to simplify filtering)? How difficult would it be to exclude these hosts from #wikimedia-operations channel-spam?
>>
>> AFAICS we don't have a way around default IRC notifications, because the minimum is contact_groups admins, and that's routed to -operations too.
>
> So if I understand correctly: Icinga monitoring at the WMF (as configured in Puppet) assumes a certain uniformity in how notifications are handled (i.e. it lacks the flexibility to do what's described above). Basically, that if a service is worth monitoring for availability, any notifications of failures are expected to be broadcast (minimally) to admins. If true, that would seem to imply that the RESTBase staging/test environment is exceptional in this regard. Do we have any other "non-production" systems on the production network monitored by Icinga? Is there any impetus outside this issue to change things?

Yes, RESTBase staging/test is exceptional, in that we don't have similar non-production systems in the production network AFAIK. Re: impetus, it's what you mentioned, namely T124179.

> TL;DR Is this worth following through on, or should we resolve this issue for the time being and leave Icinga notifications as-is?

I'm going to resolve this for now since the bulk of it is done; we can revisit based on T124179.

It would seem this issue is still incomplete; RESTBase logging for the staging cluster continues to be grouped with production.

Moving to @Eevans for the logstash/restbase part.

Eevans moved this task from Backlog to Next on the Cassandra board.

Change 290922 had a related patch set uploaded (by Mobrovac):
RESTBase: use the appropriate logger name

https://gerrit.wikimedia.org/r/290922

Change 290922 merged by Alexandros Kosiaris:
RESTBase: use the appropriate logger name

https://gerrit.wikimedia.org/r/290922

Eevans closed this task as Resolved. (Aug 3 2016, 3:06 PM)

I think this is now complete; closing.