
Grafana, icinga, prometheus in cloud-analytics project
Closed, Declined · Public

Description

Since the cloud-analytics Cloud VPS project will host a service used by Cloud VPS users, we should be able to monitor and maintain it much in the same way that we would in production.

Is there a Prometheus and/or Icinga and/or Grafana instance in Cloud VPS that we can use, or do we also need to configure and maintain these inside of Cloud VPS?

Event Timeline

Ottomata triaged this task as High priority.
Restricted Application removed a project: Patch-For-Review. · Dec 10 2018, 11:06 PM

There isn't a great monitoring solution available for VMs. I can add your project to shinken so that basic VM stats (up/down/puppet failures, etc.) have monitoring and alerting.

There are other piecemeal solutions here and there but I don't know a ton about what people are doing.

Ottomata updated the task description. · Dec 11 2018, 2:28 PM
Nuria added a subscriber: bd808. · Dec 11 2018, 5:37 PM

cc @bd808 who was asking about this issue with monitoring

bd808 added a comment. · Edited · Dec 11 2018, 6:01 PM

https://grafana-labs.wikimedia.org/ (and https://grafana-labs-admin.wikimedia.org) exist. There are Prometheus instances in the tools and deployment-prep projects, but there is no multi-tenant Prometheus deployment for all projects to share today. As @Andrew mentioned in T211640#4812373, we do have a multi-tenant Shinken service that is the Icinga equivalent for instances. It is used by at least the tools and deployment-prep projects to generate email and IRC alerts for basic instance health.

We have wishes/dreams for more shared services in this area (T194333: [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants), but we haven't been able to allocate time and people to work on them to date. One of the challenges here is that not many of the FOSS monitoring solutions are ready for multi-tenant use out of the box. As we are moving more and more shared Cloud VPS infrastructure (mail relays, puppetmasters, etc.) into virtual instances, we may need to invest in a set of deployed tools for monitoring them. In the absence of truly shared infrastructure for all tenants, those tools could be extended beyond the cloudinfra project to select projects like this one.

Ah ha, and IIRC, I need to get Puppet exported resources to work in my project, right? I'm not using a custom self hosted puppetmaster. Will exported resources work for me?

I'll also probably need Cumin set up, in order to use get_clusters via cumin::selector (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810/ and T204088: Prometheus resources in deployment-prep to create grafana graphs of EventLogging).

Change 479030 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Prometheus server for cloud-analytics project

https://gerrit.wikimedia.org/r/479030

bd808 added a comment. · Dec 11 2018, 7:23 PM

Ah ha, and IIRC, I need to get Puppet exported resources to work in my project, right? I'm not using a custom self hosted puppetmaster. Will exported resources work for me?

I'll also probably need Cumin set up, in order to use get_clusters via cumin::selector (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462810/ and T204088: Prometheus resources in deployment-prep to create grafana graphs of EventLogging).

I think both exported resources and Cumin will require you to have a project-local puppetmaster with PuppetDB enabled. One by one, you are about to rediscover all the nice things in the production network that are not included as out-of-the-box services for Cloud VPS customers. :/
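For context, exported resources would mainly be used here to tell a Prometheus server which instances to scrape. Without PuppetDB, one possible stopgap is to generate a target file for Prometheus file-based service discovery by hand. The following is only a minimal sketch of that idea, not something proposed in this task: the host names and the `project` label are hypothetical, and port 9100 assumes the node exporter is running on its default port.

```python
import json


def build_file_sd_targets(hosts, port, labels):
    """Build a Prometheus file_sd structure: a list of target groups.

    Each group is a dict with a "targets" list of "host:port" strings
    and an optional "labels" dict applied to every target in the group.
    """
    targets = sorted(f"{host}:{port}" for host in hosts)
    return [{"targets": targets, "labels": dict(labels)}]


# Hypothetical instance names; a real list would come from the project itself.
hosts = ["host-b.example", "host-a.example"]

# 9100 is assumed here as the node exporter port.
groups = build_file_sd_targets(hosts, 9100, {"project": "cloud-analytics"})

# The JSON printed here is what would be written to a file that a
# file_sd_configs entry in prometheus.yml points at.
print(json.dumps(groups, indent=2))
```

A list maintained this way has to be updated by hand whenever instances are added or removed, which is the work that exported resources would otherwise automate.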

@faidon @chasemp AAHHHH CAN WE GO BACK TO PROD AHHHHHH :( :)

fdans lowered the priority of this task from High to Normal.

It looks like this ticket can be closed. In our meeting, after reviewing the state of monitoring in labs, we concluded that it would be best to run Presto in prod behind LDAP. We will need "tool-access" for LDAP so that tools like Quarry can also connect to the datastore. Labs users will see no real difference: Presto will be in prod the same way the analytics replicas on labs are in prod. While this is a technical solution, it is a bit sad that our use case (making data access much easier for cloud users) cannot be accommodated on the cloud platform.

Nuria closed this task as Declined. · Jan 26 2019, 12:17 AM

Change 479030 abandoned by Ottomata:
[WIP] Prometheus server for cloud-analytics project

https://gerrit.wikimedia.org/r/479030