
Create dashboards for beta cluster maps instances
Closed, Resolved (Public)

Description

It would be good to have monitoring dashboards for our beta cluster maps instance(s) similar to the ones in production.

Copying over related discussion from T172090:

Should we connect the test servers to a dashboard like the production one, where we could see stats for tile generation rates, etc.? (Excuse me if one already exists somewhere; I couldn't find it.) It seems like a good idea to check on the servers before rolling out to production when we make changes.

There is some info here, though it is not very friendly: https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org

Go to https://grafana.wikimedia.org/ and log in (click the Wikimedia logo at the top left and log in with your LDAP credentials):

Screen Shot 2018-08-17 at 16.48.54.png (200×199 px, 19 KB)

Once logged in, click the Home button at the top left:

Screen Shot 2018-08-17 at 16.50.07.png (56×123 px, 8 KB)

You will see a "New dashboard" button:

Screen Shot 2018-08-17 at 16.50.17.png (289×1 px, 65 KB)

There you get a GUI where you can add panels with different types of graphs, and new rows, to the dashboard. Whenever I've looked into it, I've explored rather than followed a tutorial.

Screen Shot 2018-08-17 at 16.50.41.png (243×393 px, 27 KB)

Once done, remember to save your dashboard:

Screen Shot 2018-08-17 at 16.50.41.png (243×393 px, 27 KB)

Prefix the name with "Maps" so that it is grouped with the other Maps dashboards in the big list.

Another thing I've done before is look at how an existing dashboard is built. For example, if you go to the Maps performance one, you can poke at the panels and edit them to see how they are configured. To do that, click on the title of the panel and a submenu appears; then click "Edit".

Screen Shot 2018-08-17 at 16.55.43.png (369×634 px, 80 KB)

Then you will see the graph fullscreen, with the configuration and its queries at the bottom. Feel free to poke around and change things; there is autocomplete and it is quite approachable:

Screen Shot 2018-08-17 at 16.56.31.png (459×722 px, 55 KB)

NOTE: Don't save your changes if you are just poking around!

Another option is to clone a dashboard, which could be useful for what we are doing here: mirroring the production dashboard, but pointed at the test environment. To do that, click on the settings icon and then "Save as":

Screen Shot 2018-08-17 at 16.58.43.png (380×648 px, 82 KB)

Then you can make the appropriate changes after cloning it.
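If a cloned dashboard is exported as JSON, the "appropriate changes" could also be applied in bulk rather than panel by panel. The following is only a rough sketch: it assumes Graphite-style panels whose queries live in a `target` string, and the metric prefixes `prod.maps.` and `beta.maps.` and the sample dashboard are made up for illustration, not taken from the real Maps dashboards.

```python
import copy


def iter_panels(dashboard):
    """Yield every panel, whether stored at the top level or nested inside rows."""
    for panel in dashboard.get("panels", []):
        yield panel
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            yield panel


def retarget_dashboard(dashboard, old_prefix, new_prefix, title):
    """Return a copy of the dashboard with a new title and rewritten query prefixes."""
    clone = copy.deepcopy(dashboard)  # leave the original dashboard untouched
    clone["title"] = title
    for panel in iter_panels(clone):
        for target in panel.get("targets", []):
            query = target.get("target")
            if isinstance(query, str):
                target["target"] = query.replace(old_prefix, new_prefix)
    return clone


# Hypothetical, heavily trimmed dashboard export (not the real Maps dashboard).
production = {
    "title": "Maps performance",
    "rows": [
        {
            "panels": [
                {
                    "title": "Tile generation rate",
                    "targets": [{"target": "prod.maps.tiles.rate"}],
                }
            ]
        }
    ],
}

# The new title keeps the "Maps" prefix so it groups with the other Maps dashboards.
beta = retarget_dashboard(production, "prod.maps.", "beta.maps.", "Maps beta")
```

Working on a deep copy means the production definition is never modified, which matches the advice above not to save changes to an existing dashboard by accident.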

Regarding the data sources, we'll need to check that the deployment is configured to log to the appropriate statsd/graphite instance (more general info here; there seems to be a beta cluster graphite instance, so maybe we should use that). I'm not sure exactly how to proceed here; it depends on how things are logged and configured.
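For reference when checking that configuration, this is roughly what a service emits when it logs to statsd: a short text line of the form `name:value|type`, usually sent over UDP. The sketch below is illustrative only; the metric name, host, and port are assumptions (8125 is the conventional statsd port), and the real Maps services may use a client library and different naming.

```python
import socket


def format_statsd(name, value, metric_type="c"):
    """Build a statsd line; common types are 'c' (counter), 'g' (gauge), 'ms' (timer)."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(name, value, metric_type="c", host="localhost", port=8125):
    """Send one metric over UDP. Host and port here are placeholders, not real config."""
    payload = format_statsd(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric name, for illustration only.
line = format_statsd("maps.beta.tiles.generated", 1)
```

Pointing `host` at the beta cluster's statsd endpoint instead of the production one would be the kind of setting to verify in the deployment configuration.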

cc/ @Gehel for feedback on how to go about doing this appropriately

Event Timeline

@Mholloway and @Jhernandez I picked this task from the backlog because we might need to debug some things before updating the production cluster with the newest versions of the software.

I tried to gather all the useful information that best suits the Beta Cluster. The result is the following dashboard; let me know if that is what you had in mind.

https://grafana-labs.wikimedia.org/dashboard/db/maps-beta?orgId=1

Looks great to me. Moving to the Sign Off column for @Jhernandez's sign-off.

Minor comment on the dashboard: the "Cassandra memory usage" graph does not name the different pools in the legend (we see multiple entries for "deployment-maps03.memory_pool_usages").

Otherwise, looks good!

@Gehel nice catch, thanks! I just changed that graph legend.
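The thread does not say how the legend was changed. One common way to get distinct legend entries in a Graphite-backed panel is to wrap the query in Graphite's `aliasByNode` function, which builds each series name from selected dot-separated nodes of the metric path (counted from 0). The sketch below assumes a hypothetical path layout in which the pool name is the third node; the actual Cassandra metric paths may differ.

```python
def with_alias_by_node(target, *nodes):
    """Wrap a Graphite target in aliasByNode(...) using the given node indexes."""
    indexes = ", ".join(str(n) for n in nodes)
    return f"aliasByNode({target}, {indexes})"


def preview_alias(series_path, *nodes):
    """Show the legend name aliasByNode would produce for one concrete series path."""
    parts = series_path.split(".")
    return ".".join(parts[n] for n in nodes)


# Hypothetical metric paths: host, metric group, then pool name.
query = with_alias_by_node("deployment-maps03.memory_pool_usages.*", 0, 2)
legend = preview_alias("deployment-maps03.memory_pool_usages.Metaspace", 0, 2)
```

With the pool node included, each series gets its own legend entry instead of several identical "deployment-maps03.memory_pool_usages" lines.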

@Mholloway I am marking this task as resolved. If you think it's not ready for sign-off, feel free to undo my change.