
Migrate labmon* to Buster
Open, Medium, Public

Description

There's some discussion to move labmon inside Cloud VPS (T207543), but independent of that, labmon* should probably catch up to the same OS we use for the production Prometheus servers (which use Stretch).

We have some docs! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring

Definition of done:

  • OS migration from Jessie to Buster
  • OS hostname rename (labmon -> cloudmetrics)

This ticket has 3 parts:

Puppet code

labmon1002 migration

labmon1001 migration

Event Timeline

ArielGlenn triaged this task as Medium priority. Jun 11 2019, 7:53 AM
bd808 added a subscriber: bd808.

Moving to Stretch would be a good time to also rename these hosts to get rid of the "lab" qualifier. These hosts seem to be running:

  • grafana
  • graphite
  • prometheus
  • statsite
Andrew renamed this task from Migrate labmon* to Stretch to Migrate labmon* to Stretch (or Buster, better yet!). Jul 29 2019, 3:14 PM
aborrero updated the task description. Jul 29 2019, 3:21 PM
Phamhi claimed this task. Sep 25 2019, 12:15 PM
Phamhi added a comment. Oct 3 2019, 4:12 PM

I commented out the lines related to puppetdb (lines 34 to 44 of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/class_config.pp) and was able to run Puppet locally on a test labmon VM with Buster installed.

The latest run result is below

Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for phamhi-labmon.testlabs.eqiad.wmflabs
Info: Applying configuration version '1570118860'
Notice: openstack::clientpackages::vms::mitaka::buster: no special configuration yet
Notice: /Stage[main]/Openstack::Clientpackages::Vms::Mitaka::Buster/Notify[openstack::clientpackages::vms::mitaka::buster: no special configuration yet]/message: defined 'message' as 'openstack::clientpackages::vms::mitaka::buster: no special configuration yet'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python-whisper' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Package python-whisper is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python-whisper' has no installation candidate
Error: /Stage[main]/Graphite/Package[python-whisper]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python-whisper' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Package python-whisper is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python-whisper' has no installation candidate
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/bin/graphite-manage", line 7, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     import graphite.settings # Assumed to be in the same directory.
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/lib/python3/dist-packages/graphite/settings.py", line 213, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     from local_settings import *  # noqa
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/etc/graphite/local_settings.py", line 61, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     MIDDLEWARE_CLASSES += (
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: NameError: name 'MIDDLEWARE_CLASSES' is not defined
Error: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]
Error: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]
Notice: openstack::clientpackages::mitaka::buster: no special configuration yet
Notice: /Stage[main]/Openstack::Clientpackages::Mitaka::Buster/Notify[openstack::clientpackages::mitaka::buster: no special configuration yet]/message: defined 'message' as 'openstack::clientpackages::mitaka::buster: no special configuration yet'
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install grafana' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package grafana
Error: /Stage[main]/Grafana/Package[grafana]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install grafana' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package grafana
Notice: /Stage[main]/Grafana/File[/etc/grafana/grafana.ini]: Dependency Package[grafana] has failures: true
Warning: /Stage[main]/Grafana/File[/etc/grafana/grafana.ini]: Skipping because of failed dependencies
Warning: /Stage[main]/Grafana/File[/etc/grafana/provisioning/dashboards/provision-puppet-dashboards.yaml]: Skipping because of failed dependencies
Warning: /Stage[main]/Grafana/File[/var/lib/grafana/dashboards]: Skipping because of failed dependencies
Warning: /Stage[main]/Grafana/File[/etc/grafana/ldap.toml]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/File[/usr/share/grafana/public/dashboards/home.json]: Skipping because of failed dependencies
Warning: /Stage[main]/Grafana/Service[grafana-server]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/File[/usr/share/grafana/public/img/grafana_icon.svg]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Exec[/usr/local/sbin/grafana_create_anon_user --create]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Git::Clone[operations/software/grafana/simple-json-datasource]/File[/usr/share/grafana/public/app/plugins/datasource/simple-json-datasource]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Git::Clone[operations/software/grafana/simple-json-datasource]/Exec[git_clone_operations/software/grafana/simple-json-datasource]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/File[/usr/share/grafana/public/app/plugins/datasource/datasource-plugin-genericdatasource]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Httpd::Site[test2]/Httpd::Conf[test2]/File[/etc/apache2/sites-available/50-test2.conf]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Httpd::Site[test2]/Httpd::Conf[test2]/File[/etc/apache2/sites-enabled/50-test2.conf]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Httpd::Site[test]/Httpd::Conf[test]/File[/etc/apache2/sites-available/50-test.conf]: Skipping because of failed dependencies
Warning: /Stage[main]/Profile::Grafana/Httpd::Site[test]/Httpd::Conf[test]/File[/etc/apache2/sites-enabled/50-test.conf]: Skipping because of failed dependencies
Notice: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns:   File "/usr/local/sbin/graphite-auth", line 19, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns:     import django  # noqa: E402
Notice: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns: ImportError: No module named django
Error: '/usr/local/sbin/graphite-auth set admin this_isnt_a_real_password' returned 1 instead of one of [0]
Error: /Stage[main]/Graphite::Web/Exec[create_graphite_admin]/returns: change from 'notrun' to ['0'] failed: '/usr/local/sbin/graphite-auth set admin this_isnt_a_real_password' returned 1 instead of one of [0]
Warning: /Stage[main]/Httpd/Service[apache2]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 6.86 seconds
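
The run above mixes root failures with follow-on "Skipping because of failed dependencies" warnings. As a rough reading aid, here is a minimal Python sketch that keeps only the resources Puppet reported as failed. The function name `failed_resources` and the regular expression are illustrative assumptions, not part of Puppet; the sketch relies only on the `Error: /Stage[...]...: ` line shape seen in this log, and the sample text is abbreviated from it.

```python
import re

# Matches lines such as
#   Error: /Stage[main]/Grafana/Package[grafana]/ensure: change from ...
# and captures the resource path up to the first ": ".
_FAILED = re.compile(r"^Error: (/Stage\[[^\]]+\]/.+?): ")


def failed_resources(log_text):
    """Return failed resource paths in first-seen order, without duplicates."""
    seen = []
    for line in log_text.splitlines():
        match = _FAILED.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Abbreviated lines modeled on the Puppet output in this task.
sample = """\
Error: Execution of '/usr/bin/apt-get -q -y install grafana' returned 100: Reading package lists...
Error: /Stage[main]/Grafana/Package[grafana]/ensure: change from 'purged' to 'present' failed
Warning: /Stage[main]/Grafana/Service[grafana-server]: Skipping because of failed dependencies
Error: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: change from 'notrun' to ['0'] failed
"""

for path in failed_resources(sample):
    print(path)
```

Applied to the full log, this kind of filter would likely leave four entries: the two package installs (python-whisper, grafana) and the two Graphite execs (graphite_syncdb, create_graphite_admin).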
Phamhi added a comment. Edited Oct 4 2019, 2:43 PM

It looks like neither grafana nor python-whisper is available on Buster:

labmon1002$ apt-cache policy python-whisper
python-whisper:
  Installed: 0.9.15-1
  Candidate: 0.9.15-1
  Version table:
 *** 0.9.15-1 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/backports amd64 Packages
        100 /var/lib/dpkg/status
     0.9.12-1 0
        500 http://mirrors.wikimedia.org/debian/ jessie/main amd64 Packages
labmon1002$ apt-cache policy grafana       
grafana:
  Installed: 5.4.5
  Candidate: 6.3.4
  Version table:
     6.3.4 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty amd64 Packages
 *** 5.4.5 0
        100 /var/lib/dpkg/status
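
Note that the `apt-cache policy` output above was captured on labmon1002, whose sources are the jessie repositories, so the same check would need to be repeated on the Buster test VM. A minimal Python sketch for reading such output follows. The helper names `policy_versions` and `has_candidate` are hypothetical, and treating `Candidate: (none)` or empty output as "not installable" is an assumption about apt's output format rather than something shown in this task.

```python
def policy_versions(policy_text):
    """Extract the Installed and Candidate fields from `apt-cache policy` output."""
    fields = {}
    for line in policy_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in ("Installed", "Candidate"):
            fields[key] = value.strip()
    return fields


def has_candidate(policy_text):
    """True when the output names an installable candidate version."""
    candidate = policy_versions(policy_text).get("Candidate")
    return candidate not in (None, "(none)")


# Output copied from the grafana check on labmon1002 in this task.
grafana_policy = """\
grafana:
  Installed: 5.4.5
  Candidate: 6.3.4
  Version table:
     6.3.4 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty amd64 Packages
 *** 5.4.5 0
        100 /var/lib/dpkg/status
"""

print(policy_versions(grafana_policy))
print(has_candidate(grafana_policy))
```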

@fgiunchedi can you give @Phamhi any tips on trying to get our role::wmcs::monitoring working on a Buster host? Is it a lost cause or something that should be possible with a bit of work? Would we be better off using Stretch instead of Buster today?

@Phamhi:

  1. Grafana is installed from an external repository. There's already a config to pull in the new deb package for our Buster repository (Chris is working on setting up a Grafana 6 instance on Buster), but it hasn't been imported to our apt.wikimedia.org repository yet. Best to sync up with him on that, as I'm not 100% sure whether it's best to import grafana 5 or 6 initially.
  2. python-whisper is used by graphite-web and graphite-carbon. In Buster these switched to Python 3, and the Whisper package in Buster only supports Python 3 (it was renamed to python3-whisper). The Puppet code in the graphite class which installs python-whisper (L20 in modules/graphite/manifests/init.pp) isn't really useful to begin with: python-whisper is a dependency of the carbon packages, so installing them will already pull in the correct version of Whisper (python-whisper on jessie/stretch and python3-whisper on buster). As such, simply remove the package declaration for python-whisper and Puppet/apt will do the right thing.

@bd808: I looked over the class and, in addition to what I wrote above, I don't see any issues which would prevent using Buster: Grafana is an external deb and compatible with Buster, and Graphite and Prometheus are shipped in Debian itself (the Prometheus version in Buster, 2.7.1, is the same as the one already used on labmon and in the rest of prod). There'll be smaller Puppet changes to adapt repositories etc., but those would be necessary for Stretch as well.

@bd808 I'm echoing what @MoritzMuehlenhoff said (thanks!) and going with Buster seems worthwhile to me. Specifically Grafana 6 is a safe upgrade AFAIK (cc @CDanis) and ditto for graphite. @Phamhi I'd be happy to help reviewing patches for Buster support!

Hi @CDanis, could you please let me know the timeline for getting the Grafana package into the Buster repo?

Please disregard my last comment. Moritz has just let us know that Grafana 6 is now available on Buster: https://debmonitor.wikimedia.org/packages/grafana

Phamhi added a comment. Edited Mon, Nov 18, 9:04 PM

I'm getting the following

Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for phamhi-labmon.testlabs.eqiad.wmflabs
Info: Applying configuration version '(720c83c41e) Ema - ATS: network settings for ats-be'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/bin/graphite-manage", line 7, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     import graphite.settings # Assumed to be in the same directory.
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/lib/python3/dist-packages/graphite/settings.py", line 212, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     from local_settings import *  # noqa
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/etc/graphite/local_settings.py", line 61, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     MIDDLEWARE_CLASSES += (
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: NameError: name 'MIDDLEWARE_CLASSES' is not defined
Error: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]
Error: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]

Running /usr/bin/graphite-manage alone also yields an error

# /usr/bin/graphite-manage 
Traceback (most recent call last):
  File "/usr/bin/graphite-manage", line 7, in <module>
    import graphite.settings # Assumed to be in the same directory.
  File "/usr/lib/python3/dist-packages/graphite/settings.py", line 212, in <module>
    from local_settings import *  # noqa
  File "/etc/graphite/local_settings.py", line 61, in <module>
    MIDDLEWARE_CLASSES += (
NameError: name 'MIDDLEWARE_CLASSES' is not defined

It looks like this is due to a django version change: https://docs.djangoproject.com/en/2.2/releases/1.10/

New-style middleware
A new style of middleware is introduced to solve the lack of strict request/response layering of the old-style of middleware described in DEP 0005. You’ll need to adapt old, custom middleware and switch from the MIDDLEWARE_CLASSES setting to the new MIDDLEWARE setting to take advantage of the improvements.

Renaming MIDDLEWARE_CLASSES to MIDDLEWARE seems to have fixed this problem.
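
Since the same settings file may need to work on both Stretch and Buster, one option is to extend whichever setting name is present instead of hard-coding one. The following is a minimal sketch under stated assumptions: `add_middleware` is a hypothetical helper, the middleware path `example.middleware.Hypothetical` is a placeholder, and the two dictionaries stand in for a settings module namespace. It is not the approach taken in the patch for this task, which renamed the variable.

```python
def add_middleware(namespace, extra):
    """Append `extra` to whichever middleware setting `namespace` defines.

    Returns the name of the setting that was extended.
    """
    for name in ("MIDDLEWARE", "MIDDLEWARE_CLASSES"):
        current = namespace.get(name)
        if current is not None:
            # Keep the original container type (list or tuple).
            namespace[name] = type(current)(list(current) + list(extra))
            return name
    raise NameError("neither MIDDLEWARE nor MIDDLEWARE_CLASSES is defined")


# Hypothetical settings namespaces standing in for old- and new-style Django.
old_style = {"MIDDLEWARE_CLASSES": ("django.middleware.common.CommonMiddleware",)}
new_style = {"MIDDLEWARE": ["django.middleware.common.CommonMiddleware"]}
extra = ("example.middleware.Hypothetical",)

print(add_middleware(old_style, extra))
print(add_middleware(new_style, extra))
```

Inside a real settings file the call would presumably be `add_middleware(globals(), (...))`, assuming one of the two names has already been imported into that module.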

Change 552107 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: add compatibility in buster

https://gerrit.wikimedia.org/r/552107

Phamhi added a comment. Edited Wed, Nov 20, 6:30 PM

The following changes have been made to be compatible with Buster:

  • Changed the required package from python-whisper to python3-whisper
  • Because of the Django 1.10 middleware change, renamed the MIDDLEWARE_CLASSES variable to MIDDLEWARE in /etc/graphite/local_settings.py
  • Both graphite-auth.py and graphite-index.py now look for the python3 binary
  • Minor change to graphite-index.py for Python 3 compatibility

I have tested the new code on both Stretch and Buster.

As suggested, I have created separate Python files (no longer a template) for the different releases.

Phamhi updated the task description. Mon, Nov 25, 11:49 AM

The scope of this request has been extended to include an OS hostname rename as well. The new hostname is still to be determined.

aborrero added a subscriber: aborrero.

In the WMCS team meeting, we decided to rename these servers to better reflect what they do and to avoid naming clashes:

  • from labmon1001 to cloudmetrics1001 and
  • from labmon1002 to cloudmetrics1002

Change 552107 merged by Phamhi:
[operations/puppet@production] labmon: add compatibility in buster

https://gerrit.wikimedia.org/r/552107

Phamhi updated the task description. Thu, Nov 28, 4:13 AM
Phamhi updated the task description.
Phamhi updated the task description. Thu, Nov 28, 4:20 AM

Change 553441 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553441

Two nits: When reimaging the servers (or when it's done), please also update the Cumin aliases and update https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions

Change 553467 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553467

Change 553441 merged by Phamhi:
[operations/puppet@production] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553441

Change 553467 merged by Phamhi:
[operations/dns@master] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553467

Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts:

labmon1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1002.eqiad.wmnet']

and were ALL successful.

Phamhi updated the task description. Thu, Nov 28, 4:36 PM
Phamhi updated the task description. Thu, Nov 28, 4:39 PM

I am looking into a recently discovered issue with uwsgi-graphite-web.service

Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: Traceback (most recent call last):
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]:   File "/usr/share/graphite-web/graphite.wsgi", line 11, in <module>
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]:     from django.conf import settings
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: ImportError: No module named django.conf
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: unable to load app 0 (mountpoint='') (callable not found or import error)
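
The traceback above shows the interpreter embedded in uwsgi failing to import `django.conf`, while the earlier `graphite-manage` traceback ran from `/usr/lib/python3`. One plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that Django is installed for one Python major version but not for the one uwsgi loads. A minimal sketch for checking this follows; the function name `can_import` is hypothetical, and the script is written so that it should run unchanged under both Python 2.7 and Python 3.

```python
import importlib
import sys


def can_import(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    # The module names come from the tracebacks in this task.
    for module in ("django", "django.conf"):
        status = "ok" if can_import(module) else "MISSING"
        print("python {0}: {1}: {2}".format(version, module, status))
```

Saved under a name of your choosing and run once with `python2` and once with `python3`, the two outputs would show which interpreter can see Django.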

Change 554114 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554114

Change 554115 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554115

Change 554114 abandoned by Phamhi:
labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554114

Phamhi added a comment. Edited Mon, Dec 2, 6:09 PM

For new gerrit patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/554115, I have updated the following to fix the uwsgi-graphite-web.service issue:

Tested on both stretch and buster VMs and confirmed graphite-web is now working (also tested graph generator function)

Phamhi updated the task description. Mon, Dec 2, 9:03 PM

Change 554115 merged by Phamhi:
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554115

Phamhi added a comment. Tue, Dec 3, 2:59 PM

Confirmed that the latest patch 554115 fixed the graphite-web issue.

MoritzMuehlenhoff renamed this task from Migrate labmon* to Stretch (or Buster, better yet!) to Migrate labmon* to Buster. Tue, Dec 3, 3:24 PM

Change 554844 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/554844

Change 554853 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: use hiera for labmon/cloudmetrics instead of harcoded values

https://gerrit.wikimedia.org/r/554853

Change 555565 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555565

Change 555570 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555570