Migrate labmon* to Buster
Closed, ResolvedPublic

Description

There's some discussion about moving labmon inside Cloud VPS (T207543), but independent of that, labmon* should probably catch up to the same OS we use for the production Prometheus servers (which use Stretch).

We have some docs! https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring

Definition of done:

  • OS migration from Jessie to Buster
  • OS hostname rename (labmon -> cloudmetrics)

This ticket has 3 parts:

Puppet code changes

labmon1002 migration

labmon1001 migration

Event Timeline

@Phamhi:

  1. Grafana is installed from an external repository. There's already a config to pull in the new deb package for our Buster repository (Chris is working on setting up a Grafana 6 instance on Buster), but it hasn't been imported to our apt.wikimedia.org repository yet. Best to sync up with him on that, as I'm not 100% sure whether it's best to import grafana 5 or 6 initially.
  2. python-whisper is used by graphite-web and graphite-carbon. In Buster they switched to Python 3, and the Whisper package in Buster only supports Python 3 (it is renamed to python3-whisper). The Puppet code in the graphite class which installs python-whisper (L20 in modules/graphite/manifests/init.pp) isn't really useful to begin with: python-whisper is a dependency of the carbon packages, so installing them will already pull in the correct version of Whisper (python-whisper on jessie/stretch and python3-whisper on buster). As such, simply remove the package declaration for python-whisper and Puppet/apt will do the right thing.
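
To make the package naming above concrete, here is a small sketch of the release-to-package mapping. The names `WHISPER_PACKAGE_BY_RELEASE` and `whisper_package` are hypothetical and only encode what is stated in this comment; in practice apt resolves this through the carbon package dependencies, which is why no explicit declaration is needed.

```python
# Sketch only: encodes the release -> package naming described in the comment above.
# The dictionary and function names are hypothetical, not part of the Puppet repo.
WHISPER_PACKAGE_BY_RELEASE = {
    "jessie": "python-whisper",
    "stretch": "python-whisper",
    "buster": "python3-whisper",
}


def whisper_package(codename):
    """Return the Whisper package name for a Debian release codename."""
    try:
        return WHISPER_PACKAGE_BY_RELEASE[codename.lower()]
    except KeyError:
        raise ValueError("unknown release: %r" % codename)
```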

@bd808: I looked over the class and, in addition to what I wrote above, I don't see any issues which would prevent using Buster: Grafana is an external deb and compatible with Buster, and Graphite and Prometheus are shipped in Debian itself (the Prometheus version in Buster, 2.7.1, is the same as the one already used on labmon and in the rest of prod). There'll be smaller Puppet changes to adapt repositories etc., but those would be necessary for Stretch as well.

@bd808 I'm echoing what @MoritzMuehlenhoff said (thanks!); going with Buster seems worthwhile to me. Specifically, Grafana 6 is a safe upgrade AFAIK (cc @CDanis), and ditto for Graphite. @Phamhi I'd be happy to help review patches for Buster support!

Hi @CDanis, could you please let me know the timeline for getting the Grafana package into the Buster repo?

Please disregard my last comment. Moritz has just let us know that Grafana 6 is now available on Buster: https://debmonitor.wikimedia.org/packages/grafana

I'm getting the following error during the Puppet run:

Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for phamhi-labmon.testlabs.eqiad.wmflabs
Info: Applying configuration version '(720c83c41e) Ema - ATS: network settings for ats-be'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/bin/graphite-manage", line 7, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     import graphite.settings # Assumed to be in the same directory.
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/usr/lib/python3/dist-packages/graphite/settings.py", line 212, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     from local_settings import *  # noqa
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:   File "/etc/graphite/local_settings.py", line 61, in <module>
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns:     MIDDLEWARE_CLASSES += (
Notice: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: NameError: name 'MIDDLEWARE_CLASSES' is not defined
Error: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]
Error: /Stage[main]/Graphite::Web/Exec[graphite_syncdb]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/graphite-manage syncdb --noinput' returned 1 instead of one of [0]

Running /usr/bin/graphite-manage on its own yields the same error:

# /usr/bin/graphite-manage 
Traceback (most recent call last):
  File "/usr/bin/graphite-manage", line 7, in <module>
    import graphite.settings # Assumed to be in the same directory.
  File "/usr/lib/python3/dist-packages/graphite/settings.py", line 212, in <module>
    from local_settings import *  # noqa
  File "/etc/graphite/local_settings.py", line 61, in <module>
    MIDDLEWARE_CLASSES += (
NameError: name 'MIDDLEWARE_CLASSES' is not defined

It looks like this is due to a Django version change; from the release notes at https://docs.djangoproject.com/en/2.2/releases/1.10/:

New-style middleware
A new style of middleware is introduced to solve the lack of strict request/response layering of the old-style of middleware described in DEP 0005. You’ll need to adapt old, custom middleware and switch from the MIDDLEWARE_CLASSES setting to the new MIDDLEWARE setting to take advantage of the improvements.

Renaming MIDDLEWARE_CLASSES to MIDDLEWARE seems to have fixed this problem.
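
As a hedged illustration of the rename, the sketch below appends extra middleware to whichever setting name a settings namespace actually defines, so the same file could work with either name. The helper `add_middleware` is hypothetical and is not the change that was made in Puppet; the middleware path is just an example of a real Django class.

```python
def add_middleware(settings, extra):
    """Append `extra` to whichever middleware setting is defined.

    `settings` is a dict-like namespace (for example globals() inside a
    settings module). Returns the name of the setting that was updated.
    """
    for name in ("MIDDLEWARE", "MIDDLEWARE_CLASSES"):
        # Skip names that are absent or explicitly set to None.
        if settings.get(name) is not None:
            settings[name] = tuple(settings[name]) + tuple(extra)
            return name
    # Neither name is defined: fall back to the new-style setting.
    settings["MIDDLEWARE"] = tuple(extra)
    return "MIDDLEWARE"


# Illustrative namespaces standing in for old-style and new-style settings.
old_style = {"MIDDLEWARE_CLASSES": ("a.Middleware",)}
new_style = {"MIDDLEWARE": ["a.Middleware"]}
extra = ("django.contrib.auth.middleware.RemoteUserMiddleware",)
```

Inside a settings file this could be called as `add_middleware(globals(), extra)`, although a plain rename, as done here, is simpler when only one release needs to be supported per file.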

Change 552107 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: add compatibility in buster

https://gerrit.wikimedia.org/r/552107

The following changes have been made to be compatible with Buster:

  • Changed the required package from python-whisper to python3-whisper
  • Due to the Django framework change, in /etc/graphite/local_settings.py, renamed the MIDDLEWARE_CLASSES variable to MIDDLEWARE
  • Both graphite-auth.py and graphite-index.py now use the python3 binary
  • Minor code change in graphite-index.py for Python 3 compatibility

I have tested the new code on both Stretch and Buster.

As suggested, I have created separate Python files (no longer a template) for each release.

The scope of this request has been extended to include the OS hostname rename as well. The new hostname is still to be determined.

aborrero subscribed.

In the WMCS team meeting, we decided to rename these servers to better reflect what they do and to avoid naming clashes:

  • from labmon1001 to cloudmetrics1001 and
  • from labmon1002 to cloudmetrics1002

This comment was removed by Phamhi.

Change 552107 merged by Phamhi:
[operations/puppet@production] labmon: add compatibility in buster

https://gerrit.wikimedia.org/r/552107

Change 553441 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553441

Two nits: When reimaging the servers (or when it's done), please also update the Cumin aliases and update https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions

Change 553467 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553467

Change 553441 merged by Phamhi:
[operations/puppet@production] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553441

Change 553467 merged by Phamhi:
[operations/dns@master] cloudvps: rename+reimage labmon1002 as cloudmetrics1002

https://gerrit.wikimedia.org/r/553467

Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts:

labmon1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201911281559_phamhi_182318_labmon1002_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1002.eqiad.wmnet']

and were ALL successful.

I am looking into a recently discovered issue with uwsgi-graphite-web.service:

Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: Traceback (most recent call last):
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]:   File "/usr/share/graphite-web/graphite.wsgi", line 11, in <module>
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]:     from django.conf import settings
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: ImportError: No module named django.conf
Nov 28 20:14:12 phamhi-labmon uwsgi-graphite-web[19372]: unable to load app 0 (mountpoint='') (callable not found or import error)
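
The thread does not state the root cause at this point. The unquoted module name in the message is how Python 2 formats this error (Python 3 would report "No module named 'django'"), so one plausible reading, and it is only an assumption, is that uwsgi loaded the app with a Python 2 interpreter while Django was installed for Python 3. A minimal check that can be run under each interpreter to compare:

```python
import importlib
import sys


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers ModuleNotFoundError on Python 3 as well.
        return False
    return True


if __name__ == "__main__":
    print("interpreter: %s (Python %d.%d)"
          % (sys.executable, sys.version_info[0], sys.version_info[1]))
    print("django.conf importable: %s" % module_available("django.conf"))
```

Running the script with both `python` and `python3` on the affected host would show which interpreter can actually see Django.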

Change 554114 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554114

Change 554115 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554115

Change 554114 abandoned by Phamhi:
labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554114

For new gerrit patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/554115, I have updated the following to fix the uwsgi-graphite-web.service issue:

Tested on both stretch and buster VMs and confirmed graphite-web is now working (also tested graph generator function)

Change 554115 merged by Phamhi:
[operations/puppet@production] labmon: update graphite-web to be compatible with buster/stretch

https://gerrit.wikimedia.org/r/554115

Confirmed that the latest patch 554115 fixed the graphite-web issue.

MoritzMuehlenhoff renamed this task from Migrate labmon* to Stretch (or Buster, better yet!) to Migrate labmon* to Buster.Dec 3 2019, 3:24 PM

Change 554844 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/554844

Change 554853 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: use hiera for labmon/cloudmetrics instead of harcoded values

https://gerrit.wikimedia.org/r/554853

Change 555565 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555565

Change 555570 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555570

Change 554853 merged by Phamhi:
[operations/puppet@production] wmcs: use hiera for labmon/cloudmetrics instead of harcoded values

https://gerrit.wikimedia.org/r/554853

Change 556215 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: fix hiera lookup for primary labmon host

https://gerrit.wikimedia.org/r/556215

Change 556215 merged by Phamhi:
[operations/puppet@production] wmcs: fix hiera lookup for primary labmon host

https://gerrit.wikimedia.org/r/556215

Change 556225 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: fix hiera lookup for primary labmon host (typo correction)

https://gerrit.wikimedia.org/r/556225

Change 556225 merged by Phamhi:
[operations/puppet@production] wmcs: fix hiera lookup for primary labmon host (typo correction)

https://gerrit.wikimedia.org/r/556225

Change 554844 merged by Phamhi:
[operations/puppet@production] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/554844

As part of the failover test from labmon1001 to cloudmetrics1002, we discovered that cloudmetrics1002's IP (10.64.4.15) appears to be missing from the network devices' ACL. Ticket https://phabricator.wikimedia.org/T240456 has been created to remedy this issue.
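
A check of this kind can be sketched with Python's `ipaddress` module. The ACL entries below are hypothetical placeholders, not the real device ACL; only the address 10.64.4.15 comes from this task.

```python
import ipaddress


def ip_allowed(address, acl_networks):
    """Return True if `address` falls inside any network in `acl_networks`."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(net) for net in acl_networks)


# Hypothetical ACL entries for illustration only.
EXAMPLE_ACL = ["192.0.2.0/24", "10.64.20.13/32"]

missing = ip_allowed("10.64.4.15", EXAMPLE_ACL)
present = ip_allowed("10.64.4.15", EXAMPLE_ACL + ["10.64.4.15/32"])
```

With these placeholder entries, `missing` is False and `present` is True; checking against the real ACL would require exporting the actual entries from the devices.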

Change 557017 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: cleanup labmon1002 dns records

https://gerrit.wikimedia.org/r/557017

Change 557017 merged by Phamhi:
[operations/dns@master] cloudvps: cleanup labmon1002 dns records

https://gerrit.wikimedia.org/r/557017

Change 557103 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: monitoring: remove rssh

https://gerrit.wikimedia.org/r/557103

The rssh package is still used on the labmon servers to restrict which shell and/or commands can be invoked during rsync from the primary to the standby server. There are a few problems with this.

I believe it's best to remove it altogether and stick with command restriction via the authorized_keys file. This protection is already in place currently (see private/modules/secret/secrets/ssh/wmcs/monitoring/wmcs_monitoring_rsync.pub).

$ cat private/modules/secret/secrets/ssh/wmcs/monitoring/wmcs_monitoring_rsync.pub
command="rsync --server --sender -logDtprSe.Lsfxd . /srv/carbon/whisper/"  ... </truncated>
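
For readers unfamiliar with this mechanism: a `command="..."` option in authorized_keys makes sshd run that command for the key regardless of what the client requests. The sketch below builds such an entry. The helper name, the extra `no-*` options, and the key material are assumptions for illustration; they are not the contents of the truncated file above.

```python
def restricted_key_line(command, key_type, key_data, comment=""):
    """Build an authorized_keys entry that forces `command` for this key."""
    # Double quotes inside the command must be backslash-escaped for sshd.
    escaped = command.replace('"', '\\"')
    options = [
        'command="%s"' % escaped,
        # Standard OpenSSH restrictions; chosen here as an assumption.
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-agent-forwarding",
        "no-pty",
    ]
    fields = [",".join(options), key_type, key_data]
    if comment:
        fields.append(comment)
    return " ".join(fields)


# Command taken from the output above; key data is a placeholder.
RSYNC_COMMAND = "rsync --server --sender -logDtprSe.Lsfxd . /srv/carbon/whisper/"
line = restricted_key_line(RSYNC_COMMAND, "ssh-ed25519",
                           "AAAA-placeholder-key-data", "example-key")
```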

https://gerrit.wikimedia.org/r/c/operations/puppet/+/557103

Change 557103 merged by Phamhi:
[operations/puppet@production] wmcs: monitoring: remove rssh

https://gerrit.wikimedia.org/r/557103

Change 558506 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/558506

It looks like cloudmetrics1002's IP (10.64.4.15) is already in the ACL. We will try https://gerrit.wikimedia.org/r/c/operations/puppet/+/558506 again.

Change 558506 merged by Phamhi:
[operations/puppet@production] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/558506

Change 559448 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/559448

Change 559448 merged by Phamhi:
[operations/dns@master] wmcs: make cloudmetrics1002 the primary instead of labmon1001

https://gerrit.wikimedia.org/r/559448

Change 555565 merged by Phamhi:
[operations/puppet@production] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555565

Change 555570 merged by Phamhi:
[operations/dns@master] cloudvps: rename+reimage labmon1001 as cloudmetrics1001

https://gerrit.wikimedia.org/r/555570

Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts:

labmon1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201912191353_phamhi_153387_labmon1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudmetrics1001.eqiad.wmnet']

and were ALL successful.

Change 559535 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/dns@master] cloudvps: cleanup labmon1001 dns records

https://gerrit.wikimedia.org/r/559535

Change 559535 merged by Phamhi:
[operations/dns@master] cloudvps: cleanup labmon1001 dns records

https://gerrit.wikimedia.org/r/559535

Hey, thanks for doing this but I really think this needs some sort of announcement, lots of tools in labs depend on this and one of our production services that had a beta version in beta cluster got broken because of this. It took me some time to find out what labmon1001 got changed to (wikitech documentation doesn't have anything either).

Hi @Ladsgroup, noted; I apologize for the inconvenience. I have updated the docs. Please let me know if you need anything else from me.

@Ladsgroup could you explain the dependency on the physical host name? It sounds to me like there may be a use case for a service name to be used instead.

It's used in the beta cluster (see https://github.com/search?q=org%3Awikimedia+labmon1001&type=Code) as the host that receives statsd traffic; even mediawiki-config uses it.
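
One way the service-name suggestion could look on the client side is sketched below: the statsd target is read from configuration with a service alias as the default, so a host rename only requires a DNS or config change. The alias `statsd.example.invalid` and the environment variable names are hypothetical, not real Wikimedia records.

```python
import os
import socket

# Hypothetical service alias; clients would point here instead of labmon1001.
DEFAULT_STATSD_HOST = "statsd.example.invalid"
DEFAULT_STATSD_PORT = 8125  # conventional statsd UDP port


def statsd_target(environ=None):
    """Resolve the statsd host and port from the environment, with defaults."""
    env = os.environ if environ is None else environ
    host = env.get("STATSD_HOST", DEFAULT_STATSD_HOST)
    port = int(env.get("STATSD_PORT", DEFAULT_STATSD_PORT))
    return host, port


def counter_line(name, value=1):
    """Format a statsd counter increment, e.g. 'beta.requests:1|c'."""
    return "%s:%d|c" % (name, value)


def send_counter(name, value=1, environ=None):
    """Send one counter increment over UDP to the configured target."""
    host, port = statsd_target(environ)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_line(name, value).encode("ascii"), (host, port))
    finally:
        sock.close()
```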

Some issues that might be related to this: T241462. Regards.