setup netmon1002.wikimedia.org
Closed, Resolved, Public

Description

This task will track the racking, initial setup, and deployment of services onto netmon1002.

This system was requested on T156040 and is being ordered via T139416. Once the system arrives on site, it will be received by @Cmjohnson on T139416.

Once the system is racked by @Cmjohnson and

  • - system racked (rack seems to be immaterial, but network admins may disagree. It also may be easier on network admins to put this in the same rack as netmon1001, since then it's a simple find/replace for any networking-level changes.) (racked this in a4)
  • - update switch config to place in public vlan.
  • - update T159757 with the switch port assignment, so the other related router changes can be made for this system.
  • - mgmt dns entries for both asset tag and hostname (netmon1002)
  • - bios/drac setup and tested [end on-site steps, can hand off to @RobH]
  • - production dns entry in public vlan
  • - install OS (jessie)
  • - sign/accept salt/puppet (don't set a role yet!)
  • - setup IPv6 on host
  • - step by step migration of each service will need to be reviewed, since they are migrating from jessie to stretch (was: from precise to jessie but that was already done in place)

The migration of services to netmon1002 from netmon1001 will require a full review of each service/role assigned to the host, since it is also a migration between OS/distros.

Presently, netmon1001 runs the following, all of which will need to be migrated:

  • - rancid::server (done, migrated)
  • - librenms (in progress)
  • - servermon::wmf (done, moved to netmon1003 for now)
  • - network::monitor (done)
  • - torrus (done, removed)
  • - smokeping (done, migrated)
  • - ganglia::monitor::aggregator (for codfw & eqiad) (done, removed)

The inclusions shouldn't typically cause an issue between precise and jessie at this point, but in case they do:

  • - include ::standard
  • - include ::passwords::network
  • - include ::base::firewall

also see: T166180 (setup netmon2001, the codfw equivalent)

Details

Related Gerrit Patches:
[operations/puppet@production] netmon1002: re-enable Letsencrypt cert creation
[operations/puppet@production] keyholder: add stretch support, fix key validity check
[operations/puppet@production] librenms: rsync rrd data from netmon1001 to netmon1002
[operations/puppet@production] librenms: active_server param, don't pull data from multi servers
[operations/dns@master] switch librenms from netmon1001 to netmon1002
[operations/puppet@production] rancid: change the rsync direction
[operations/puppet@production] scap/dsh: replace netmon1001->netmont1002 for librenms
[operations/puppet@production] netmon: remove librenms from netmon1001
[operations/puppet@production] rancid: add rsync::quickdatacopy to sync /var/lib/rancid
[operations/puppet@production] servermon: add missing package python-mysqldb
[operations/puppet@production] librenms: add missing Apache headers module
[operations/puppet@production] librenms: ensure install_dir exists
[operations/puppet@production] librenms: move php5-ldap package to others, fix for stretch
[operations/puppet@production] librenms: use libapache2-mod-php7.0 if on stretch
[operations/puppet@production] apache: add class for mod_php with PHP 7.0 for stretch
[operations/puppet@production] librenms: add support for stretch, adjust (PHP) packages
[operations/puppet@production] site: remove smokeping role from netmon1001
[operations/puppet@production] cache::misc: add backend for netmon1002
[operations/puppet@production] cache::misc/smokeping: switch smokeping backend to netmon1002
[operations/puppet@production] smokeping: allow rsync of data from netmon1001 to netmon1002
[operations/puppet@production] netmon1002: add smokeping role
[operations/puppet@production] rancid: add APT pin to jessie-backports release
[operations/puppet@production] install_server: switch netmon1002 to stretch
[operations/puppet@production] rancid: drop "server" suffix, apply on netmon1002
[operations/dns@master] add IPv6 for netmon1002, forward and reverse records
[operations/puppet@production] install_server: add netmon1002 to DHCP, partman
[operations/dns@master] fix typo, "netmont1002.mgmt" -> "netmon1002.mgmt"
[operations/puppet@production] add netmon1002 to site

Event Timeline

Dzahn added a comment. Jun 12 2017, 9:35 PM

@akosiaris @RobH Is our goal to shut down netmon1001 after this is done?

RobH added a comment. Jun 12 2017, 9:40 PM

My understanding is netmon1001 will have a new task made for decommission once netmon1002 replaces it. netmon1001 is out of warranty.

Mentioned in SAL (#wikimedia-operations) [2017-06-12T22:01:06Z] <mutante> netmon1002 - apt-get -t jessie-backports install rancid (upgrade from 2.3.8 to 3.6.2 to match version on netmon1001) - rancid version is not specified in puppet so even though backports gets enabled the older version gets installed and this manual step is needed unless we start specifying the version in the manifest (T159756)

"My understanding is netmon1001 will have a new task made for decommission once netmon1002 replaces it. netmon1001 is out of warranty."

That's mine as well.

Change 358645 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rancid: add APT pin to jessie-backports release

https://gerrit.wikimedia.org/r/358645

Change 358647 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch netmon1002 to stretch

https://gerrit.wikimedia.org/r/358647

Change 358647 merged by Dzahn:
[operations/puppet@production] install_server: switch netmon1002 to stretch

https://gerrit.wikimedia.org/r/358647

Mentioned in SAL (#wikimedia-operations) [2017-06-13T19:08:24Z] <mutante> netmon1002 - reinstalled with stretch, revoked puppet cert, salt key, signing new cert, accepting new key, initial puppet run (T159756)

Change 358645 abandoned by Dzahn:
rancid: add APT pin to jessie-backports release

Reason:
not needed anymore. now using stretch and got 3.6.2-2 without any changes

https://gerrit.wikimedia.org/r/358645

Change 358884 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] keyholder: add stretch support, fix key validity check

https://gerrit.wikimedia.org/r/358884

Could not "keyholder arm" on stretch. Reason turns out to be the output of ssh-keygen changed, which keyholder relies on:

21:46 mutante: netmon1002 - was able to "keyholder arm" after stretch install after applying https://gerrit.wikimedia.org/r/358884 as hotfix

now keyholder is armed
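
The kind of breakage described above can be illustrated with a small parser. This is only a sketch, not the code from the keyholder patch: it assumes the change in question is the well-known switch in `ssh-keygen -l` output from a bare colon-separated MD5 hex digest (older OpenSSH, as on jessie) to a hash-prefixed form such as `SHA256:<base64>` (newer OpenSSH, as on stretch). The names `KeyFingerprint` and `parse_fingerprint_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class KeyFingerprint(NamedTuple):
    bits: int
    hash_name: str  # "MD5" or "SHA256"
    digest: str
    comment: str
    key_type: str


# Legacy form: 16 colon-separated hex bytes with no hash name in front.
_LEGACY_MD5 = re.compile(r"^(?:[0-9a-f]{2}:){15}[0-9a-f]{2}$")


def parse_fingerprint_line(line: str) -> Optional[KeyFingerprint]:
    """Parse one '<bits> <fingerprint> <comment> (<type>)' line.

    Accepts both the legacy bare-MD5 form and the newer
    '<HASH>:<digest>' form; returns None for anything else.
    """
    fields = line.split()
    if len(fields) < 4 or not fields[0].isdigit():
        return None
    bits, fingerprint = int(fields[0]), fields[1]
    key_type = fields[-1].strip("()")
    comment = " ".join(fields[2:-1])
    if _LEGACY_MD5.match(fingerprint):
        return KeyFingerprint(bits, "MD5", fingerprint, comment, key_type)
    hash_name, sep, digest = fingerprint.partition(":")
    if sep and hash_name in ("MD5", "SHA256") and digest:
        return KeyFingerprint(bits, hash_name, digest, comment, key_type)
    return None
```

A check that only understands the legacy form would reject every key on the newer release, which matches the symptom of not being able to arm the keyholder.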

Mentioned in SAL (#wikimedia-operations) [2017-06-13T23:04:59Z] <mutante> netmon1001/1002: rsynced /var/lib/rancid/CVS and /var/lib/rancid/GIT from 1001 to 1002 for rancid migration (T159756)

Change 358884 merged by Dzahn:
[operations/puppet@production] keyholder: add stretch support, fix key validity check

https://gerrit.wikimedia.org/r/358884

Mentioned in SAL (#wikimedia-operations) [2017-06-14T01:00:26Z] <mutante> netmon1002 - locally "git clone /var/lib/rancid/GIT/core" into /var/lib/rancid (i rsynced that but it's a bare repository without a work tree; the work tree is /var/lib/rancid/core after this) (T159756)

Mentioned in SAL (#wikimedia-operations) [2017-06-14T01:20:42Z] <mutante> netmon1002 - copied missing router.db, routers.all/.down/.up over from netmon1001 to /var/lib/rancid/core. router.db is an untracked file, the others are in .gitignore. this is all like on netmon1001 as well. adding router.db to .gitignore file on both, like the other router* files already were (T159756)

Dzahn added a comment. Jun 14 2017, 2:41 AM

< mutante> !log netmon1002 - chown rancid:rancid /var/lib/rancid ; touch /var/lib/rancid/.gitconfig, let rancid write to config, then git config --global user.email and user.name as the rancid user | fix permissions on .git/objects files, let rancid user own them all | re-commit .gitignore change ...

SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/lib/rancid/bin/rancid-run as user "rancid" runs finally but still hangs later without exiting

Dzahn added a comment. Jun 22 2017, 2:56 AM

re: rancid migration

Everything seems to be fine, but for some reason the rancid-run on netmon1002 always hangs at the end and doesn't finish, while it does on netmon1001. The keyholder is armed, i can see that it logs in on switches.. watching it with tcpdump making outgoing ssh connections to port 22 .. then at some point it stops.. debugging...

Change 361191 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netmon1002: add smokeping role

https://gerrit.wikimedia.org/r/361191

Dzahn added a comment. Jun 26 2017, 6:35 PM

summary of the issue i'm seeing.

The context is "rancid". What it does is ssh to all the switches, download config, compare to local git repo, if there are diffs then send email about it.

existing setup that works on netmon1001, there is this cron:

[netmon1001:~] $ cat /etc/cron.d/rancid
# Run config differ hourly
1 * * * *	rancid	SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/lib/rancid/bin/rancid-run
# Clean out rancid logs
50 23 * * *	rancid	/usr/bin/find /var/log/rancid -type f -mtime +2 -exec rm {} \;

The rancid-run command is executed as user rancid, using the keyholder auth sock, so it uses that key to SSH to all the switches to download their config.

The keyholder is armed:

[netmon1001:~] $ sudo keyholder status
keyholder-agent: active
- 4096 fd:fc:1f:e6:14:ce:a8:51:29:c7:d1:12:7e:7a:6b:01 /etc/keyholder.d/rancid (RSA)

The command runs regularly, logs in on all the switches, pulls down config, compares to local git repo, checks for diffs..

In /var/log/rancid the latest logfile starts like this:

starting: Mon Jun 26 18:01:01 UTC 2017

and ends like this:

nothing added to commit but untracked files present
Everything up-to-date

ending: Mon Jun 26 18:03:36 UTC 2017

There are no long-running processes besides the current one during a run. (ps aux | grep rancid)

Now compare to netmon1002:

  • same cron
  • keyholder armed
  • confirmed with tcpdump that it actively talks to the switches

last logfile looks like:

starting: Mon Jun 26 18:01:01 UTC 2017

hourly config diffs failed: /tmp/.core.run.lock exists
-rw-r----- 1 rancid rancid 5 Jun 22 04:01 /tmp/.core.run.lock

When looking at process list:

rancid    1516  0.0  0.0   6336   728 ?        Ss   Jun22   0:00 /bin/sh -c SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/lib/rancid/bin/rancid-run
rancid    1517  0.0  0.0   6336   764 ?        S    Jun22   0:00 /bin/sh /usr/lib/rancid/bin/rancid-run
rancid    1519  0.0  0.0   6336   112 ?        S    Jun22   0:00 /bin/sh /usr/lib/rancid/bin/rancid-run
rancid    1522  0.0  0.0   6336  1624 ?        S    Jun22   0:00 /bin/sh /usr/lib/rancid/bin/control_rancid core
rancid    2871  0.0  0.0  27976  3804 ?        Sl   Jun22   0:00 git push
rancid    2872  0.0  0.0   6336   716 ?        S    Jun22   0:00 /bin/sh -c git-receive-pack '/var/lib/rancid/GIT/core' git-receive-pack '/var/lib/rancid/GIT/core'
rancid    2873  0.0  0.0  27336  3632 ?        Sl   Jun22   0:00 git-receive-pack /var/lib/rancid/GIT/core
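
The leftover lock shown above (dated Jun 22, blocking an hourly job on Jun 26) is the kind of thing a small age check can flag. The following is a minimal sketch, not part of rancid itself: the function names and the idea of comparing the lock's mtime against a threshold are assumptions for illustration, and the threshold has to be chosen by the operator.

```python
import os
import time
from typing import Optional


def lock_age_seconds(path: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age of the lock file in seconds, or None if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if the lock exists and is older than max_age_seconds."""
    age = lock_age_seconds(path, now)
    return age is not None and age > max_age_seconds
```

For an hourly cron job, something like `is_stale("/tmp/.core.run.lock", 2 * 3600)` would report a lock that has survived more than two scheduled runs. A stale lock only says that a previous run never finished; the hung processes still need to be investigated before removing it.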

When killing everything (killall -u rancid) and removing lockfile and manually running it (with -u rancid AND -H !)

sudo -u rancid -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/lib/rancid/bin/rancid-run

starting: Mon Jun 26 18:28:34 UTC 2017
..
Trying to get all of the configs.
All routers sucessfully completed.
[master dfcb798] updates
 6 files changed, 51 insertions(+), 9 deletions(-)

And then it just seems to stop and sit there. These processes show up again, so the hang happens at the git part:

rancid   19931  0.0  0.0  27976  3572 pts/0    Sl+  18:30   0:00 git push
rancid   19932  0.0  0.0   6336   756 pts/0    S+   18:30   0:00 /bin/sh -c git-receive-pack '/var/lib/rancid/GIT/core' git-receive-pack '/var/lib/rancid/GIT/core'
rancid   19933  0.0  0.0  27336  1072 pts/0    Sl+  18:30   0:00 git-receive-pack /var/lib/rancid/GIT/core

Please note that the following 2 issues seem _not_ related since they happen on netmon1001 just as well, but the rest works there.

fatal: pathspec '.placeholder' did not match any files
error: pathspec '.placeholder' did not match any file(s) known to git.
Deleted .placeholder
..
Untracked files:
    router.db

Next thing i was going to just wipe out the git repo. Why does it never end on netmon1002?

Dzahn added a comment. Edited Jun 26 2017, 6:49 PM

strace -f with one of those ends in:

[pid 24567] close(1)                    = 0
[pid 24567] fstat(1, 0x7fff45095130)    = -1 EBADF (Bad file descriptor)
[pid 24567] exit_group(0)               = ?
[pid 24567] +++ exited with 0 +++
[pid 24562] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24567
[pid 24562] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24567, si_uid=998, si_status=0, si_utime=15, si_stime=1} ---
[pid 24562] read(3,

logfile at this point is past "nothing added to commit but untracked files present" but process does not exit. (even though it says "exited with 0" above ?)

Dzahn added a comment. Jun 26 2017, 7:47 PM

Issue has been found: permissions in the git bare repo and the working dir. Some files were not owned by rancid:rancid as they should be. It works now, thanks to Apergos for spotting it. It all looks good now.
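
A quick way to spot this class of problem after an rsync is to walk the tree and list everything with unexpected ownership. The sketch below is an illustration only, not what was actually run on netmon1002; the function name `paths_not_owned_by` is hypothetical.

```python
import os
from typing import Iterator


def paths_not_owned_by(root: str, uid: int, gid: int) -> Iterator[str]:
    """Yield root and every path below it whose owner or group differs."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in candidates:
        # lstat so that symlinks are checked themselves, not their targets.
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            yield path
```

On the host, the expected ids could be looked up with `pwd.getpwnam("rancid")` (fields `pw_uid` and `pw_gid`) and the check run against both `/var/lib/rancid/GIT/core` and `/var/lib/rancid/core`, the bare repository and the work tree mentioned earlier in this task.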

Mentioned in SAL (#wikimedia-operations) [2017-06-26T22:27:48Z] <mutante> netmon1001 - deactivate rancid crons - now running on netmon1002 instead - avoid duplicate mails (T159756)

Change 361191 merged by Dzahn:
[operations/puppet@production] netmon1002: add smokeping role

https://gerrit.wikimedia.org/r/361191

Change 361606 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: allow rsync of data from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/361606

Change 361608 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc/smokeping: switch smokeping backend to netmon1002

https://gerrit.wikimedia.org/r/361608

Change 361606 merged by Dzahn:
[operations/puppet@production] smokeping: allow rsync of data from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/361606

Change 361608 merged by Dzahn:
[operations/puppet@production] cache::misc/smokeping: switch smokeping backend to netmon1002

https://gerrit.wikimedia.org/r/361608

Change 361614 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc: add backend for netmon1002

https://gerrit.wikimedia.org/r/361614

Change 361614 merged by Dzahn:
[operations/puppet@production] cache::misc: add backend for netmon1002

https://gerrit.wikimedia.org/r/361614

Mentioned in SAL (#wikimedia-operations) [2017-06-27T03:35:06Z] <mutante> smokeping - stop/rsync/fix permissions/start one more time to minimize gaps in graphs - now fully migrated netmon1001->netmon1002, historic data has been copied (T159756)

Change 361615 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: remove smokeping role from netmon1001

https://gerrit.wikimedia.org/r/361615

Change 361615 merged by Dzahn:
[operations/puppet@production] site: remove smokeping role from netmon1001

https://gerrit.wikimedia.org/r/361615

Dzahn added a comment. Jun 28 2017, 8:32 PM

I tried to move on with the servermon role next, tested the role on stretch on a labs instance. Problem:

Package python-django-south is not available, but is referred to by another package.

That package is described as "Intelligent schema migrations for django apps" (https://packages.debian.org/jessie/python-django-south)

I asked in #debian and #django about it. I was first told that the package was probably dropped because those migration features have moved into Django core. That sounded like good news: we might get away without this package, with python-django alone being sufficient. But there was also bad news:

"< hylje> going from south to django migrations involves modifying the project's code"

So yea.. eh @akosiaris any idea how much work that kind of change would be?

Dzahn updated the task description. Jun 28 2017, 8:36 PM

Change 362014 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: add support for stretch, adjust (PHP) packages

https://gerrit.wikimedia.org/r/362014

Change 362014 merged by Dzahn:
[operations/puppet@production] librenms: add support for stretch, adjust (PHP) packages

https://gerrit.wikimedia.org/r/362014

Change 362119 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] apache: add class for mod_php with PHP 7.0 for stretch

https://gerrit.wikimedia.org/r/362119

Change 362123 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: use libapache2-mod-php7.0 if on stretch

https://gerrit.wikimedia.org/r/362123

"So yea.. eh @akosiaris any idea how much work that kind of change would be?"

An alternative to get unblocked now would be to upload python-django-south internally to stretch-wikimedia, assuming it doesn't drag a lot of other dependencies and it works out of the box on stretch.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T20:46:30Z] <mutante> APT - reprepro copy stretch-wikimedia jessie-wikimedia prometheus-snmp-exporter (to make it available on stretch for netmon1002) (T159756)

Mentioned in SAL (#wikimedia-operations) [2017-06-29T21:51:13Z] <mutante> APT - uploading python-django-south from jessie to wikimedia-stretch for librenms on stretch (T159756)

Change 362119 merged by Dzahn:
[operations/puppet@production] apache: add class for mod_php with PHP 7.0 for stretch

https://gerrit.wikimedia.org/r/362119

Change 362123 merged by Dzahn:
[operations/puppet@production] librenms: use libapache2-mod-php7.0 if on stretch

https://gerrit.wikimedia.org/r/362123

Dzahn added a comment. Jun 30 2017, 7:15 PM

Alright, so:

next up:

  • librenms: missing unit file for systemd, paladox already found the official example unit file and is working on a change
  • servermon: test on labs instance now that php7 module is loaded

Dzahn updated the task description. Jun 30 2017, 8:24 PM

Change 362528 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: move php5-ldap package to others, fix for stretch

https://gerrit.wikimedia.org/r/362528

Change 362528 merged by Dzahn:
[operations/puppet@production] librenms: move php5-ldap package to others, fix for stretch

https://gerrit.wikimedia.org/r/362528

Change 362590 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: ensure install_dir exists, add it as required resource

https://gerrit.wikimedia.org/r/362590

Change 362591 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: add missing Apache headers module

https://gerrit.wikimedia.org/r/362591

Change 362590 merged by Dzahn:
[operations/puppet@production] librenms: ensure install_dir exists

https://gerrit.wikimedia.org/r/362590

Change 362591 merged by Dzahn:
[operations/puppet@production] librenms: add missing Apache headers module

https://gerrit.wikimedia.org/r/362591

Change 362598 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] servermon: add missing package python-mysqldb

https://gerrit.wikimedia.org/r/362598

Change 362598 merged by Dzahn:
[operations/puppet@production] servermon: add missing package python-mysqldb

https://gerrit.wikimedia.org/r/362598

We have the following blocker for servermon:

Using this systemd unit file (https://gerrit.wikimedia.org/r/#/c/362455/) works but:

[2017-06-30 23:07:34 +0000] [9718] [INFO] Worker exiting (pid: 9718)
[2017-06-30 23:07:34 +0000] [9719] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gunicorn/arbiter.py", line 557, in spawn_worker
    worker.init_process()
  File "/usr/lib/python2.7/dist-packages/gunicorn/workers/base.py", line 126, in init_process
    self.load_wsgi()
  File "/usr/lib/python2.7/dist-packages/gunicorn/workers/base.py", line 136, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/lib/python2.7/dist-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 65, in load
    return self.load_wsgiapp()
  File "/usr/lib/python2.7/dist-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/lib/python2.7/dist-packages/gunicorn/util.py", line 384, in import_app
    __import__(module)
  File "/srv/deployment/servermon/servermon-cache/revs/4a2288f7f2723ec50aa143b04661abedc3335d17/servermon/wsgi.py", line 26, in <module>
    application = get_wsgi_application()
  File "/usr/lib/python2.7/dist-packages/django/core/wsgi.py", line 13, in get_wsgi_application
    django.setup(set_prefix=False)
  File "/usr/lib/python2.7/dist-packages/django/__init__.py", line 22, in setup
    configure_logging(settings.LOGGING_CONFIG, settings.LOGGING)
  File "/usr/lib/python2.7/dist-packages/django/conf/__init__.py", line 53, in __getattr__
    self._setup(name)
  File "/usr/lib/python2.7/dist-packages/django/conf/__init__.py", line 41, in _setup
    self._wrapped = Settings(settings_module)
  File "/usr/lib/python2.7/dist-packages/django/conf/__init__.py", line 97, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/srv/deployment/servermon/servermon-cache/revs/4a2288f7f2723ec50aa143b04661abedc3335d17/servermon/settings.py", line 88, in <module>
    TEMPLATE_CONTEXT_PROCESSORS = global_settings.TEMPLATE_CONTEXT_PROCESSORS + (
AttributeError: 'module' object has no attribute 'TEMPLATE_CONTEXT_PROCESSORS'

This error doesn't happen with python-django 1.7 but it does happen with python-django 1.10.

https://stackoverflow.com/questions/41288653/django-attributeerror-module-object-has-no-attribute-template-context-proc

So it seems the answer is:

<paladox> "It looks like you have upgraded Django to 1.10, which has removed global_settings.TEMPLATE_CONTEXT_PROCESSORS. Either downgrade Django, or update your settings. This question should help." – Alasdair, Dec 22 '16 at 17:38
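
For reference, "update your settings" on Django 1.8 and later generally means moving the context processors into the `TEMPLATES` setting instead of extending `global_settings.TEMPLATE_CONTEXT_PROCESSORS`. The fragment below is a sketch of that shape only, not the actual servermon change: `EXTRA_CONTEXT_PROCESSORS` is a hypothetical placeholder for whatever servermon's settings.py appended at line 88, which is not shown in this task.

```python
# Sketch of a settings.py fragment for Django >= 1.8.
# EXTRA_CONTEXT_PROCESSORS is a hypothetical stand-in for the project-specific
# processors that used to be appended to the old global default.
EXTRA_CONTEXT_PROCESSORS = ()

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,
        "OPTIONS": {
            # Processors are listed explicitly instead of being derived from
            # django.conf.global_settings, which no longer defines them in 1.10.
            "context_processors": [
                "django.template.context_processors.debug",
                "django.template.context_processors.request",
                "django.contrib.auth.context_processors.auth",
                "django.contrib.messages.context_processors.messages",
            ] + list(EXTRA_CONTEXT_PROCESSORS),
        },
    },
]
```

Because this form does not import anything from `global_settings`, loading the settings module no longer depends on an attribute that differs between Django versions.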

Dzahn added a comment. Jul 1 2017, 12:47 AM

blocker for librenms: https://github.com/librenms/librenms/issues/6818 "php-net-ipv4 not available any more on current debian and ubuntu"

Mentioned in SAL (#wikimedia-operations) [2017-07-01T00:54:04Z] <mutante> APT - importing php-net-ipv4 to stretch (for librenms) T159756

faidon moved this task from Backlog to In progress on the observability board.
Dzahn added a comment. Jul 11 2017, 8:59 PM

as @RobH pointed out this used the wrong partman recipe and needs to be reinstalled to use both SSDs?!

Change 364617 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch librenms from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/364617

Change 364620 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rancid: add rsync::quickdatacopy to sync /var/lib/rancid

https://gerrit.wikimedia.org/r/364620

Change 364620 merged by Dzahn:
[operations/puppet@production] rancid: add rsync::quickdatacopy to sync /var/lib/rancid

https://gerrit.wikimedia.org/r/364620

Dzahn added a comment. Jul 15 2017, 3:05 AM

servermon is now on netmon1003 (T170653) (ganeti jessie) VM instead, so it will not move to netmon1002 (stretch) yet. therefore checking that box here.

Dzahn updated the task description. Jul 15 2017, 3:06 AM

Dzahn added a comment. Jul 18 2017, 6:44 PM

11:42 < mutante> !log netmon1002 - reinstall OS - didn't use the right partman recipe - didn't have md0 - revoke old puppet cert , salt-key, scheduled downtime, services over at netmon2001

reinstalling so that we have RAID

Mentioned in SAL (#wikimedia-operations) [2017-07-18T23:53:33Z] <mutante> netmon1002 - copied Letsencrypt cert/key for librenms from netmon1001 for migration after netmon1002 has been reinstalled and now has RAID. (T159756)

Change 364617 merged by Dzahn:
[operations/dns@master] switch librenms from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/364617

Dzahn updated the task description. Jul 19 2017, 12:15 AM

Change 366178 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netmon: remove librenms from netmon1001

https://gerrit.wikimedia.org/r/366178

Change 366178 merged by Dzahn:
[operations/puppet@production] netmon: remove librenms from netmon1001

https://gerrit.wikimedia.org/r/366178

Change 366181 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap/dsh: replace netmon1001->netmont1002 for librenms

https://gerrit.wikimedia.org/r/366181

Dzahn added a comment. Jul 19 2017, 1:18 AM

https://gerrit.wikimedia.org/r/366180 - switch prometheus snmp-exporter , netmon1001 -> netmon1002 @fgiunchedi

Change 366181 merged by Dzahn:
[operations/puppet@production] scap/dsh: replace netmon1001->netmont1002 for librenms

https://gerrit.wikimedia.org/r/366181

Dzahn closed this task as Resolved. Jul 19 2017, 1:22 AM

Change 366185 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rancid: change the rsync direction

https://gerrit.wikimedia.org/r/366185

Change 366185 merged by Dzahn:
[operations/puppet@production] rancid: change the rsync direction

https://gerrit.wikimedia.org/r/366185

Change 366310 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: active_server param, don't pull data from multi servers

https://gerrit.wikimedia.org/r/366310

Change 366310 merged by Dzahn:
[operations/puppet@production] librenms: active_server param, don't pull data from multi servers

https://gerrit.wikimedia.org/r/366310

Dzahn reopened this task as Open. Jul 19 2017, 8:07 PM

Change 366324 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] librenms: rsync rrd data from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/366324

Change 366324 merged by Dzahn:
[operations/puppet@production] librenms: rsync rrd data from netmon1001 to netmon1002

https://gerrit.wikimedia.org/r/366324

Mentioned in SAL (#wikimedia-operations) [2017-07-20T02:22:24Z] <mutante> netmon1001 - rsyncing librenms rrd data to netmon1002 - T159756

Dzahn added a comment. Jul 20 2017, 4:56 AM

21:42 < mutante> !log netmon1002 - restarted Apache for LDAP issue - librenms.wm.org switched back to it, after rsyncing rrd data, re-enabling puppet
21:45 < mutante> !log netmon1002 - disable puppet again - crons for librenms running, crons for rancid stopped, rsynced data one last time

Dzahn closed this task as Resolved. Jul 21 2017, 10:12 PM

There was an issue with rancid logging in on switches/routers.

The error was "ssh-agent refused operation". Thanks to thcipriani for pointing out that sometimes you have to kill the proxy part of it: I killed the ssh-agent-proxy process and ran puppet to start it again, and that fixed it.

updates work again, puppet is enabled again

Change 381267 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netmon1002: re-enable Letsencrypt cert creation

https://gerrit.wikimedia.org/r/381267

Change 381267 merged by Dzahn:
[operations/puppet@production] netmon1002: re-enable Letsencrypt cert creation

https://gerrit.wikimedia.org/r/381267