Upgrade grafana to 5.x
Closed, Resolved (Public)

Description

Please take a look at grafana-beta.wikimedia.org and verify any dashboards you care about are working correctly. (Also note that any changes made on grafana-beta will not be preserved once the upgrade is completed.)

Here's 5.4.0 serving stats from prometheus about its own machine: https://grafana-beta.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=grafana1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc

5.x includes many UI upgrades (drag and drop rearranging of panels!) and new features (folders! stable URLs! teams!)

Notable new features between 4.6 and 5.4 are below. Recommendations for stuff to change about our configuration are bolded.

  • Major UI changes, including drag-n-drop to move graphs around, and a new layout engine that claims easier sizing and placement of panels. Video and screenshots here
  • Grafana now has a notion of 'teams', a group of users that can be used in dashboard ACLs, or for setting a default home dashboard. (Looks like automatic linking between Grafana teams and LDAP groups is locked behind Grafana Enterprise, though?)
  • Dashboards can now be placed into a hierarchy of folders. We should figure out a hierarchy that makes some sense. (but I'm quite happy to call this part of T178690)
  • Dashboard URLs are now stable across name changes. For the time being, old URLs will still work as long as the dashboards they point to are not renamed -- but it's a good idea to update any links into grafana to use the new URL scheme, as the old URLs are deprecated and support will be removed in some future release.
  • Data sources can now be 'provisioned' -- specified by config files on disk (YAML, per the upstream provisioning docs), which we could puppetize. This makes them read-only in the UI, which is probably a good idea anyway.
  • Similarly, dashboards can be provisioned. Dashboards configured from JSON are editable in the UI, but can't be saved -- instead the UI offers you a JSON dump which you can check back into source control. At first glance, seems not bad to me? Should we start moving 'core' SRE dashboards to JSON in Puppet?
  • Heatmap UI and Prometheus histograms now work together
  • Grafana has its own native annotations support, including an on-dashboard UI and an HTTP API for adding them
  • Grafana-native alerts now support reminders (I believe the performance team wanted this)
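
To help with the URL change mentioned in the list above, here is a minimal sketch for spotting links that still use the old scheme. It assumes old-style dashboard links look like /dashboard/db/<slug> and new-style ones like /d/<uid>/<slug> (as in the host-overview link earlier in this description). The function name and the old-style sample link are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Old scheme: /dashboard/db/<slug>; new scheme: /d/<uid>/<slug>.
OLD_STYLE = re.compile(r"^/dashboard/db/(?P<slug>[^/]+)/?$")
NEW_STYLE = re.compile(r"^/d/(?P<uid>[^/]+)(?:/(?P<slug>[^/]+))?/?$")


def classify_dashboard_url(url):
    """Return 'old', 'new', or 'other' for a Grafana dashboard link."""
    path = urlparse(url).path  # query string (refresh, orgId, ...) is ignored
    if OLD_STYLE.match(path):
        return "old"
    if NEW_STYLE.match(path):
        return "new"
    return "other"


links = [
    # Hypothetical old-style link, for illustration only.
    "https://grafana.wikimedia.org/dashboard/db/host-overview",
    "https://grafana-beta.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1",
]
# Links that should be rewritten to the new scheme.
stale = [u for u in links if classify_dashboard_url(u) == "old"]
```

Something like this could be run over wiki pages or docs that link into grafana. It only flags candidates: the uid needed for the new URL still has to be looked up per dashboard.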

Current plan: no known issues with the upgrade. Will announce one last time at the SRE meeting on Monday, then proceed with the upgrade Monday afternoon (US Eastern time).

Event Timeline

cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs is alive!

So I thought it would be simple to create a VM, use Horizon to enable the Puppet role used for grafana on it (role::webserver_misc_apps), and then install the updated deb manually (since messing with reprepro for this seems both unnecessary and scary).

I thought that this would 'just work' and I wouldn't have to manually provide any hiera to set grafana's configuration, since I see values specified in hieradata/role/common/webserver_misc_apps.yaml.
However, this is not the case; I'm told it is for similar reasons as https://phabricator.wikimedia.org/T210497 (thanks Fillipo!)

Ok. So added some hiera for my instance in Horizon by hand.

Now I am getting this:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: File[/etc/apache2/mods-available/status.conf] is already declared in file /etc/puppet/modules/httpd/manifests/init.pp:95; cannot redeclare at /etc/puppet/modules/apache/manifests/mod.pp:174 at /etc/puppet/modules/apache/manifests/mod.pp:174:5 at /etc/puppet/modules/apache/manifests/mod.pp:135 on node cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs

Fixed above by manually adding a pile of hiera in Horizon.

Puppet became happy once I manually installed a grafana 5.3.4 deb. It could not install one itself because we package grafana in the WMF repo only for jessie and not for stretch. (I guess we could take advantage of this by doing 4->5 and jessie->stretch at the same time?)

Set up an SSH tunnel and an /etc/hosts entry so I could use the grafana instance in my browser. Then spent an embarrassing amount of time trying to figure out why this was still serving me the default "it works!" apache2-on-Debian HTML page. Then found out about this wonderful command:

root@cdanis-test-grafana5-stretch1:/var/log/apache2# apache2ctl -S
VirtualHost configuration:
[::1]:80               localhost (/etc/apache2/conf-enabled/50-server-status.conf:14)
127.0.0.1:80           localhost (/etc/apache2/conf-enabled/50-server-status.conf:14)
*:80                   is a NameVirtualHost
         default server cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs (/etc/apache2/sites-enabled/50-cdanis-test-grafana5-stretch1-monitoring-eqiad-wmflabs.conf:4)
         port 80 namevhost cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs (/etc/apache2/sites-enabled/50-cdanis-test-grafana5-stretch1-monitoring-eqiad-wmflabs.conf:4)
         port 80 namevhost iegreview.wikimedia.org (/etc/apache2/sites-enabled/50-iegreview-wikimedia-org.conf:6)
         port 80 namevhost racktables.wikimedia.org (/etc/apache2/sites-enabled/50-racktables-wikimedia-org.conf:8)
         port 80 namevhost scholarships.wikimedia.org (/etc/apache2/sites-enabled/50-scholarships-wikimedia-org.conf:6)
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www/html"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex watchdog-callback: using_defaults
Mutex rewrite-map: using_defaults
Mutex ssl-stapling-refresh: using_defaults
Mutex ssl-stapling: using_defaults
Mutex ldap-cache: using_defaults
Mutex proxy: using_defaults
Mutex ssl-cache: using_defaults
Mutex default: dir="/var/run/apache2/" mechanism=default 
Mutex mpm-accept: using_defaults
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33
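
For repeat debugging, output like the above can be checked mechanically. The sketch below is only an illustration: it parses the "port N namevhost" lines and lists the server names Apache will match by Host header. The function name is made up, and the sample is a trimmed copy of the output above.

```python
import re

# Matches lines such as:
#   port 80 namevhost racktables.wikimedia.org (/etc/apache2/sites-enabled/50-racktables-wikimedia-org.conf:8)
NAMEVHOST = re.compile(r"^\s*port (\d+) namevhost (\S+) \((.+):(\d+)\)\s*$")


def parse_namevhosts(output):
    """Return (port, server_name, config_file) tuples from `apache2ctl -S` output."""
    vhosts = []
    for line in output.splitlines():
        match = NAMEVHOST.match(line)
        if match:
            vhosts.append((int(match.group(1)), match.group(2), match.group(3)))
    return vhosts


sample = """\
*:80                   is a NameVirtualHost
         default server cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs (/etc/apache2/sites-enabled/50-cdanis-test-grafana5-stretch1-monitoring-eqiad-wmflabs.conf:4)
         port 80 namevhost cdanis-test-grafana5-stretch1.monitoring.eqiad.wmflabs (/etc/apache2/sites-enabled/50-cdanis-test-grafana5-stretch1-monitoring-eqiad-wmflabs.conf:4)
         port 80 namevhost iegreview.wikimedia.org (/etc/apache2/sites-enabled/50-iegreview-wikimedia-org.conf:6)
         port 80 namevhost racktables.wikimedia.org (/etc/apache2/sites-enabled/50-racktables-wikimedia-org.conf:8)
         port 80 namevhost scholarships.wikimedia.org (/etc/apache2/sites-enabled/50-scholarships-wikimedia-org.conf:6)
"""

names = [name for _port, name, _conf in parse_namevhosts(sample)]
```

If the hostname used in /etc/hosts is not in `names`, Apache falls back to the default server for that address and port, which is one way to end up looking at an unexpected page.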

My plan from here had been to try naively copying grafana's database from krypton to my VM.

However, it seems that the prometheus LVS service IPs (e.g. prometheus.svc.eqiad.wmnet) are not accessible from cloud VPS. Discussed quickly with chasemp on IRC and it sounds like fixing this is not easy.

I guess I can imagine a couple options from here:

  • choosing a spare prod host on which to run a new grafana instance?
  • perform some questionable network-duct-taping using SSH tunnels traversing my laptop for testing

Not feeling super enthused about either of those, especially not without input from the more clueful.

Talked some with volans. Seems like the best thing to do is probably to make a VM in Ganeti running stretch, set it up with a new puppet role just for grafana, copy the DB over, verify that it works well, and then switch over grafana.wikimedia.org to point there.

Please give this a good look! I've never done this before...

Steps I think I need to do to produce a new ganeti VM grafana1001.eqiad.wmnet:

  • update server naming conventions wikitech page with a grafana prefix
  • pick an IP address (how? looks like it should be somewhere in the 10.64.0.0/22 block, but I'm not sure exactly where)
    • add to wmnet and 10.in-addr.arpa DNS files
  • run makevm. the existing host krypton has 1 VCPU, 4G RAM, 20G disk; will reuse these numbers
    • after above outputs MAC address, add to DHCP linux-host-entries.ttyS0-115200
  • update netboot.cfg and preseed.cfg adding grafana1001 to the long list of hostnames for other ganeti hosts (which use partman/flat.cfg and virtual.cfg)
  • gnt-instance start on new instance
  • set boot order back to disk per ganeti docs
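
For the DNS step in the list above, a small sketch using Python's standard ipaddress module can sanity-check a candidate address against the block and produce the name that goes in the 10.in-addr.arpa file. The candidate address below is hypothetical, and the block is only the one guessed at above, not a confirmed allocation.

```python
import ipaddress

# Block guessed at in the list above; not a confirmed allocation.
BLOCK = ipaddress.ip_network("10.64.0.0/22")


def ptr_name(address, block=BLOCK):
    """Return the reverse-DNS (PTR) owner name for address, if it is inside block."""
    ip = ipaddress.ip_address(address)
    if ip not in block:
        raise ValueError(f"{ip} is outside {block}")
    return ip.reverse_pointer


candidate = "10.64.0.50"  # hypothetical address, for illustration only
record_name = ptr_name(candidate)  # -> "50.0.64.10.in-addr.arpa"
```

This does not check whether the address is already taken; that still needs a look at the existing DNS files.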

In parallel with getting the instance up and running:

  • configure reprepro to pull grafana from upstream repository for stretch-wikimedia
  • create new puppet role role::grafana (or maybe role::webserver_grafana?) for running just a grafana server; assign that role to grafana1001

Change 477286 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] add grafana1001 host in row C, which appears to have more free capacity in Ganeti

https://gerrit.wikimedia.org/r/477286

Change 477286 merged by CDanis:
[operations/dns@master] add grafana1001 host in row C, which appears to have more free capacity in Ganeti

https://gerrit.wikimedia.org/r/477286

Change 477293 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add DHCP and autoinstall options for grafana1001

https://gerrit.wikimedia.org/r/477293

Change 477293 merged by CDanis:
[operations/puppet@production] Add DHCP and autoinstall options for grafana1001

https://gerrit.wikimedia.org/r/477293

Change 477311 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Fix MAC address for grafana1001

https://gerrit.wikimedia.org/r/477311

Change 477311 merged by CDanis:
[operations/puppet@production] Fix MAC address for grafana1001

https://gerrit.wikimedia.org/r/477311

Change 477535 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add grafana to wikimedia-stretch apt repo.

https://gerrit.wikimedia.org/r/477535

Change 477535 merged by CDanis:
[operations/puppet@production] Add grafana to wikimedia-stretch apt repo.

https://gerrit.wikimedia.org/r/477535

Mentioned in SAL (#wikimedia-operations) [2018-12-04T13:34:51Z] <cdanis> T210416: adding grafana 5 to wikimedia-stretch: reprepro --restrict grafana update stretch-wikimedia

Change 477546 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] On wikimedia-stretch, add repository thirdparty/grafana

https://gerrit.wikimedia.org/r/477546

Change 477546 merged by CDanis:
[operations/puppet@production] On wikimedia-stretch, add repository thirdparty/grafana

https://gerrit.wikimedia.org/r/477546

Change 477557 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add role::grafana and switch grafana1001.eqiad to it

https://gerrit.wikimedia.org/r/477557

Change 477579 had a related patch set uploaded (by CDanis; owner: CDanis):
[labs/private@master] Copy necessary hieradata from role::webserver_misc_apps to role::grafana

https://gerrit.wikimedia.org/r/477579

Change 477579 merged by CDanis:
[labs/private@master] Copy necessary hieradata from role::webserver_misc_apps to role::grafana

https://gerrit.wikimedia.org/r/477579

Change 477557 merged by CDanis:
[operations/puppet@production] Add role::grafana and switch grafana1001.eqiad to it

https://gerrit.wikimedia.org/r/477557

The first Puppet run with the new configuration on grafana1001 failed. I invoked puppet again and it immediately succeeded.

Here's a snippet of logs from the first run showing the failed resource:

Dec  4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Bacula::Client/File[/etc/bacula/bacula-fd.conf]) Scheduling refresh of Service[bacula-fd]
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Bacula::Client/File[/etc/bacula/bacula-fd.conf]) Scheduling refresh of Service[bacula-fd]
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Bacula::Client/File[/etc/bacula/bacula-fd.conf]) Scheduling refresh of Service[bacula-fd]
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Bacula::Client/Service[bacula-fd]) Triggered 'refresh' from 3 events
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /etc/grafana/ldap.toml20181204-27200-9detc.lock at /etc/puppet/modules/grafana/manifests/init.pp:91
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /etc/grafana/ldap.toml20181204-27200-9detc.lock at /etc/puppet/modules/grafana/manifests/init.pp:91
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: Wrapped exception:
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: No such file or directory @ dir_s_mkdir - /etc/grafana/ldap.toml20181204-27200-9detc.lock
Dec  4 20:06:22 grafana1001 puppet-agent[27200]: (/Stage[main]/Grafana/File[/etc/grafana/ldap.toml]/ensure) change from absent to file failed: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /etc/grafana/ldap.toml20181204-27200-9detc.lock at /etc/puppet/modules/grafana/manifests/init.pp:91
Dec  4 20:06:25 grafana1001 puppet-agent[27200]: (/Stage[main]/Packages::Python_sqlalchemy/Package[python-sqlalchemy]/ensure) created
Dec  4 20:06:27 grafana1001 puppet-agent[27200]: (/Stage[main]/Packages::Rsyslog_kafka/Package[rsyslog-kafka]/ensure) created
Dec  4 20:06:27 grafana1001 puppet-agent[27200]: (/Stage[main]/Profile::Rsyslog::Kafka_shipper/File[/etc/rsyslog.lookup.d]/ensure) created
Dec  4 20:06:27 grafana1001 puppet-agent[27200]: (/Stage[main]/Profile::Rsyslog::Kafka_shipper/File[/etc/rsyslog.lookup.d/lookup_table_output.json]/ensure) defined content as '{md5}f4a1cfcfcacfb53fb398875666e5153a'
Dec  4 20:06:27 grafana1001 puppet-agent[27200]: (/Stage[main]/Profile::Grafana/File[/usr/local/sbin/grafana_create_anon_user]/ensure) defined content as '{md5}25d52a9a1e3abc70700c3cde5c02e72a'

I'm guessing some sort of race condition in the base grafana module?
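
A minimal sketch of the failure mode the log suggests: writing a file under a directory that does not exist yet fails, and declaring an explicit dependency forces the step that creates the directory to run first. This is a Python stand-in rather than Puppet code, the names are made up, and it assumes (without confirmation from the log) that /etc/grafana is created by installing the package.

```python
import tempfile
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path


def apply_in_order(actions, requires):
    """Run each action after everything it requires; return the order used."""
    graph = {name: requires.get(name, set()) for name in actions}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        actions[name]()
    return order


root = Path(tempfile.mkdtemp())
conf_dir = root / "etc" / "grafana"  # stands in for /etc/grafana

actions = {
    # Assumption: installing the package is what creates the config directory.
    "package": lambda: conf_dir.mkdir(parents=True),
    # Raises FileNotFoundError if the directory is missing.
    "ldap.toml": lambda: (conf_dir / "ldap.toml").write_text("# placeholder\n"),
}

# The explicit dependency guarantees "package" is applied before "ldap.toml".
order = apply_in_order(actions, {"ldap.toml": {"package"}})
```

Without the dependency entry the two steps may run in either order, which would match a first run that fails and a second run that succeeds once the directory exists.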

Change 477632 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Fix race condition in ::grafana puppet module

https://gerrit.wikimedia.org/r/477632

Change 477632 merged by CDanis:
[operations/puppet@production] Fix race condition in ::grafana puppet module

https://gerrit.wikimedia.org/r/477632

Okay, grafana1001.eqiad is up and running, with the database (and the plugins tree) copied over from krypton. At first glance it seems to work.

Going to set up an external service (e.g. grafana-beta.wikimedia.org) so others can play with it. Also going to document the new features in 5.x that look interesting to us.

Change 478062 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] grafana-beta.wikimedia.org: add hiera for text varnishes

https://gerrit.wikimedia.org/r/478062

Change 478067 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] grafana-beta.wikimedia.org: add DNS entry for text varnishes

https://gerrit.wikimedia.org/r/478067

Change 478062 merged by CDanis:
[operations/puppet@production] grafana-beta.wikimedia.org: add hiera for text varnishes

https://gerrit.wikimedia.org/r/478062

Change 478067 merged by CDanis:
[operations/dns@master] grafana-beta.wikimedia.org: add DNS entry for text varnishes

https://gerrit.wikimedia.org/r/478067

Change 478099 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Temporarily override grafan1001's HTTP serving domain to grafana-beta.wikimedia.org. Once we are happy with the migration we can re-point grafana.w.o Varnishes to it and simply remove this file.

https://gerrit.wikimedia.org/r/478099

Change 478099 merged by CDanis:
[operations/puppet@production] grafana1001: answer for grafana-beta.wikimedia.org

https://gerrit.wikimedia.org/r/478099

Krinkle triaged this task as Medium priority.
Krinkle edited projects, added Performance-Team; removed Patch-For-Review.
Peter subscribed.

I've checked the WebPageTest and WebPageReplay dashboards and they look OK. There's some GUI fine-tuning I need to do: the new version always displays the row title ("Dashboard Row" by default), so I need to edit the dashboards to either remove the title or add a meaningful name, but I'll do that once we've done the switch.

I've heard of no issues with Grafana 5, and will be upgrading today.

  • copy current grafana DB to the new grafana1001 host one last time
  • spot-check several dashboards and inspect logs for errors
  • re-point varnishes so grafana1001 backs grafana.wikimedia.org
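
The spot-check step could be partly scripted. This is a rough sketch, not what was actually run: it lists dashboards via Grafana's /api/search endpoint and fetches each one via /api/dashboards/uid/<uid>, reporting which ones load. It assumes the instance allows anonymous read access to the HTTP API, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def spot_check(base_url, get_json=http_get_json, limit=5):
    """Return (title, ok) pairs for the first few dashboards the search API lists."""
    found = get_json(f"{base_url}/api/search")
    # Search results include folders too; keep only real dashboards.
    dashboards = [item for item in found if item.get("type") == "dash-db"]
    results = []
    for item in dashboards[:limit]:
        try:
            body = get_json(f"{base_url}/api/dashboards/uid/{item['uid']}")
            ok = "dashboard" in body
        except (OSError, ValueError):  # network/HTTP errors, or invalid JSON
            ok = False
        results.append((item["title"], ok))
    return results
```

Calling `spot_check("https://grafana-beta.wikimedia.org")` would return a short list of (title, ok) pairs. A dashboard that loads through the API can still render badly, so this complements looking at the dashboards by hand and reading the logs.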

Thanks to @ema who found a bug in 5.4.0 -- the tag filter UI seems quite broken.
Reported upstream: https://github.com/grafana/grafana/issues/14437

For now this can be worked around by going to the /dashboards URL (accessible via menu Dashboards -> Manage). We agreed that this bug was annoying but shouldn't block the upgrade. Hopefully upstream fixes quickly -- seems likely it's simple Javascript errors.

This bug may have been fixed in 5.4.1. Going to grab that version into wikimedia-stretch.

Mentioned in SAL (#wikimedia-operations) [2018-12-10T19:58:09Z] <cdanis> T210416: updating grafana to 5.4.1 in stretch-wikimedia: reprepro --restrict grafana update stretch-wikimedia

One part of the bug is fixed. The other (typing in the tag filter dropdown box) is not.

Proceeding as discussed with @ema.

Change 478763 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Revert "grafana1001: answer for grafana-beta.wikimedia.org"

https://gerrit.wikimedia.org/r/478763

Change 478765 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Switch grafana.wikimedia.org to point to grafana1001

https://gerrit.wikimedia.org/r/478765

Change 478763 merged by CDanis:
[operations/puppet@production] Revert "grafana1001: answer for grafana-beta.wikimedia.org"

https://gerrit.wikimedia.org/r/478763

Mentioned in SAL (#wikimedia-operations) [2018-12-10T20:20:07Z] <cdanis> T210416: setting grafana.wikimedia.org (currently served by krypton) to read-only and copying to grafana1001 (serving grafana-beta)

Mentioned in SAL (#wikimedia-operations) [2018-12-10T20:26:18Z] <cdanis> T210416: switching grafana.wikimedia.org to point to grafana1001.eqiad.wmnet

Change 478771 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] grafana1001 answers for grafana.wikimedia.org (the default)

https://gerrit.wikimedia.org/r/478771

Change 478771 merged by CDanis:
[operations/puppet@production] grafana1001 answers for grafana.wikimedia.org (the default)

https://gerrit.wikimedia.org/r/478771

Change 478765 merged by CDanis:
[operations/puppet@production] Switch grafana.wikimedia.org to point to grafana1001

https://gerrit.wikimedia.org/r/478765

Mentioned in SAL (#wikimedia-operations) [2018-12-10T20:35:44Z] <cdanis> T210416: grafana.wikimedia.org switch to point to grafana1001.eqiad.wmnet (running grafana 5.4.1)

Need to create some other tasks to track work that should be done with new 5.x features but marking this as done :)

Change 532702 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] logstash: replace krypton with grafana1001 in collector ferm rules

https://gerrit.wikimedia.org/r/532702

Change 532702 merged by Dzahn:
[operations/puppet@production] logstash: replace krypton with grafana1001 in collector ferm rules

https://gerrit.wikimedia.org/r/532702