netmon1001 is on precise, but should be jessie
Details
Status | Assigned | Task
---|---|---
Resolved | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved | fgiunchedi | T125020 upgrade netmon1001 to jessie
Resolved | RobH | T156040 hardware request for netmon1001 replacement
Unknown Object (Task) | |
Resolved | Dzahn | T159756 setup netmon1002.wikimedia.org
Resolved | ayounsi | T159757 netmon1002 networking setup
Event Timeline
Change 344112 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] librenms: use base::service_unit
Change 344113 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] apache 2.4 compat for librenms/servermon/smokeping
Change 344112 merged by Filippo Giunchedi:
[operations/puppet] librenms: use base::service_unit
Change 344113 merged by Filippo Giunchedi:
[operations/puppet] apache 2.4 compat for librenms/servermon/smokeping
Change 344168 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] install_server: switch netmon to jessie
Change 344168 merged by Dzahn:
[operations/puppet@production] install_server: switch netmon to jessie
Change 344184 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] temp copy netmon1001 app data to gerrit2001 for migration
Change 344184 merged by Dzahn:
[operations/puppet@production] temp copy netmon1001 app data to gerrit2001 for migration
Mentioned in SAL (#wikimedia-operations) [2017-03-22T19:44:07Z] <mutante> rsyncing /srv of netmon1001 to /srv/netmon1001 on gerrit2001 (T125020)
also tested on a separate jessie labs instance:
role rancid::server - no issues, just finishes run
role librenms - package snmp-mibs-downloader can't be downloaded because it's in non-free, which isn't enabled on labs (it is in prod); also errors because tin can't be reached as deployment server, but that is expected as of now
role torrus - will be replaced by prometheus
role servermon::wmf - only error because tin / deployment in labs as above with librenms, rest appears good though, incl. Apache config, apache restart fine
role smokeping - just works, no issues
role network::monitor - Could not find data item prometheus_nodes in any Hiera data file
@fgiunchedi i rsynced /srv to gerrit2001. to update next week before shutdown, push:
root@netmon1001:/srv# rsync -avp /srv/ rsync://gerrit2001.wikimedia.org/netmon1001
Change 344292 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001 migration: separate rsync modules for each app
Change 344292 merged by Dzahn:
[operations/puppet@production] netmon1001 migration: separate rsync modules for each app
Change 344298 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001 migration: additional paths to backup
Change 344299 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001: add host to Bacula backups
Change 344298 merged by Dzahn:
[operations/puppet@production] netmon1001 migration: additional paths to backup
Change 344299 merged by Dzahn:
[operations/puppet@production] netmon1001: add host to Bacula backups
Change 344300 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] bacula: add file sets for librenms,torrus,smokeping
Change 344300 merged by Dzahn:
[operations/puppet@production] bacula: add file sets for librenms,torrus,smokeping
Needs one more patch to add backup sets to role but my laptop is about to shut down and i could not upload for some reason...
Change 344560 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] librenms/smokeping/torrus: add Bacula backup sets
Change 344561 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] bacula: add /srv/librenms to librenms backup file set
Change 344560 merged by Dzahn:
[operations/puppet@production] librenms/smokeping/torrus: add Bacula backup sets
Change 344561 merged by Dzahn:
[operations/puppet@production] bacula: add /srv/librenms to librenms backup file set
checked that the backups now appear on helium (bacula server):
how to restore:
[helium:~] $ sudo bconsole
Connecting to Director helium.eqiad.wmnet:9101
1000 OK: 102 helium.eqiad.wmnet Version: 7.4.3 (18 June 2016)
Enter a period to cancel a command.

type "restore"

*restore
Automatically selected Catalog: production
Using Catalog "production"

First you select one or more JobIds that contain files
to be restored. You will be presented several methods
of specifying the JobIds. Then you will be allowed to
select which files from those JobIds are to be restored.

To select the JobIds, you have the following choices:
     1: List last 20 Jobs run
     <many other choices>
     ...

select:
     5: Select the most recent backup for a client

Defined Clients:
     <a lot of clients are listed>

select:
     60: netmon1001.wikimedia.org-fd

Select the Client (1-95): 60
The defined FileSet resources are:
     1: librenms
     2: smokeping
     3: torrus

select your tool
bacula starts to build a directory tree:

Select FileSet resource (1-3): 1
+--------+-------+----------+----------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes       | StartTime           | VolumeName     |
+--------+-------+----------+----------------+---------------------+----------------+
| 50,322 | F     | 17,687   | 20,571,901,482 | 2017-03-25 04:31:53 | production0047 |
...

Building directory tree for JobId(s) 50322,50390,50458 ... +++++++++++++++++++++++++++++++++++++++++++++++++
17,584 files inserted into the tree.

You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.

cwd is: /
$

from here on you can use "pwd", "cd", "ls" as in a shell.

pwd
cwd is: /srv/librenms/rrd/
$

now you will use the "mark" command to mark files/directories to be restored.

lsmark    list the marked files in and below the cd
mark      mark dir/file to be restored recursively, wildcards allowed
markdir   mark directory name to be restored (no files)

There is also a "help" at this step with more commands.

Once done marking files, type "done". Something like this will appear:

1 file selected to be restored.
Using Catalog "production"
Run Restore job
JobName: RestoreFiles
Bootstrap: /var/lib/bacula/helium.eqiad.wmnet.restore.1.bsr
Where: /var/tmp/bacula-restores
Replace: Always
FileSet: root
Backup Client: netmon1001.wikimedia.org-fd
Restore Client: netmon1001.wikimedia.org-fd
Storage: helium-FileStorage1
When: 2017-03-27 17:51:43
Catalog: production
Priority: 1
Plugin Options:
OK to run? (yes/mod/no):

At this step type "mod" if you want to restore to a different server instead of the selected one, or to a different path than /var/tmp/bacula-restores. Or "yes", if ok.

It should display that the job is queued.

Job queued. JobId=50484

Now check back a while later in the expected restore location. Files should appear. "exit" to quit bconsole.

I confirmed the restore of one random file on netmon1001 itself.

root@netmon1001:/var/tmp/bacula-restores/var/lib/smokeping/DNS# file ns0.rrd
ns0.rrd: RRDTool DB version 0003

^ this just got created by bacula.
We were talking about maybe making a wrapper script for the above that makes this less involved for the most common cases.
10:42 <godog> and also reminded me I wanted a less involved restore process, looks like bconsole can run somewhat non-interactively
10:44 <mutante> ok, let's look at that, maybe can make a wrapper around it that is just the most common one. "restore latest to default path"
10:47 <godog> yeah that'd be really nice, most of the time we are really interested in just restoring the latest version anyway
10:47 <godog> I think the common use cases would be:
10:48 <godog> - restore this file set back into machine x
10:48 <godog> - restore this file set from machine x into machine y
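A rough sketch of what such a wrapper could look like, covering the two use cases above (restore a file set back into machine x, or from machine x into machine y). Nothing here exists in the repo: the function names `build_restore_cmd` and `restore_latest` are made up for illustration, and the one-line `restore ... select current all done yes` form is assumed from the Bacula console documentation, so it should be tried against the director before anyone relies on it. The client name, file set name and default path are the ones from the interactive session above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any existing tooling).
# Prints a single bconsole "restore" command that selects the most recent
# backup of a file set and restores all of its files without prompting.
#   $1 backup client (e.g. netmon1001.wikimedia.org-fd)
#   $2 file set name (e.g. librenms)
#   $3 restore client, defaults to the backup client
#   $4 restore path, defaults to /var/tmp/bacula-restores
build_restore_cmd() {
    local client="$1" fileset="$2"
    local restore_client="${3:-$1}"
    local where="${4:-/var/tmp/bacula-restores}"
    printf 'restore client=%s restoreclient=%s fileset=%s where=%s select current all done yes\n' \
        "$client" "$restore_client" "$fileset" "$where"
}

# Hypothetical wrapper: feed the generated command to bconsole on stdin.
# Assumes it runs on the director host with sudo rights, as in the
# interactive session above.
restore_latest() {
    build_restore_cmd "$@" | sudo bconsole
}
```

For example, `build_restore_cmd netmon1001.wikimedia.org-fd librenms` only prints the command, so it can be reviewed first; `restore_latest` with the same arguments would actually queue the job. Passing a third argument would cover the "from machine x into machine y" case.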
Mentioned in SAL (#wikimedia-operations) [2017-03-29T01:16:21Z] <mutante> rsyncing librenms/torrus/smokeping app data from netmon1001 to gerrit2001. adding alias "syncit" to do it all at once (T125020)
@fgiunchedi To rsync app data tomorrow you can simply type syncit on netmon1001 now.
The long form is:
rsync -avp /var/lib/librenms/ rsync://gerrit2001.wikimedia.org/librenms-lib
rsync -avp /var/lib/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-lib
rsync -avp /var/cache/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-cache
rsync -avp /var/lib/torrus/ rsync://gerrit2001.wikimedia.org/torrus-lib
rsync -avp /var/cache/torrus/ rsync://gerrit2001.wikimedia.org/torrus-cache
syncitback to copy back after upgrade will be:
rsync -avp rsync://gerrit2001.wikimedia.org/librenms-lib /var/lib/librenms/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-lib /var/lib/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-cache /var/cache/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-lib /var/lib/torrus/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-cache /var/cache/torrus/
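Both directions use the same five module/path pairs, so they could also be written as one function instead of two aliases. This is only a sketch: the name `netmon_sync` and the `NETMON_SYNC_*` variables are made up here and are not what is deployed on netmon1001. By default it just prints the rsync commands; pass `--run` to execute them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the deployed "syncit"/"syncitback" aliases.
# rsync module name and local path, one pair per line, taken from the
# commands above.
NETMON_SYNC_HOST='gerrit2001.wikimedia.org'
NETMON_SYNC_PAIRS='librenms-lib:/var/lib/librenms/
smokeping-lib:/var/lib/smokeping/
smokeping-cache:/var/cache/smokeping/
torrus-lib:/var/lib/torrus/
torrus-cache:/var/cache/torrus/'

# netmon_sync push|pull [--run]
#   push: local paths -> rsync modules on the remote host (like syncit)
#   pull: rsync modules -> local paths (like syncitback)
# Without --run the commands are only printed.
netmon_sync() {
    local direction="$1" mode="${2:-print}" pair module path
    for pair in $NETMON_SYNC_PAIRS; do
        module="${pair%%:*}"   # text before the first colon
        path="${pair#*:}"      # text after the first colon
        case "$direction" in
            push) set -- rsync -avp "$path" "rsync://$NETMON_SYNC_HOST/$module" ;;
            pull) set -- rsync -avp "rsync://$NETMON_SYNC_HOST/$module" "$path" ;;
            *) echo "usage: netmon_sync push|pull [--run]" >&2; return 2 ;;
        esac
        if [ "$mode" = "--run" ]; then
            "$@" || return 1   # stop at the first failing rsync
        else
            echo "$*"
        fi
    done
}
```

`netmon_sync push` prints the five push commands for review, and `netmon_sync pull --run` would copy everything back after the upgrade. Keeping the pairs in one place avoids the two lists drifting apart.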
Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts:
['netmon1001.wikimedia.org']
The log can be found in /var/log/wmf-auto-reimage/201703291025_filippo_6564.log.
Completed auto-reimage of hosts:
['netmon1001.wikimedia.org']
and were ALL successful.
Change 345337 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] netmon: post jessie reimage fixes
Change 345337 merged by Filippo Giunchedi:
[operations/puppet@production] netmon: post jessie reimage fixes
Change 345368 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] rancid: back up /var/lib/rancid
Change 345368 merged by Dzahn:
[operations/puppet@production] rancid: back up /var/lib/rancid
Mentioned in SAL (#wikimedia-operations) [2017-04-07T23:16:40Z] <mutante> gerrit2001 - deleting netmon1001 backup (/srv/netmon1001), stop rsyncd, remove rsyncd config (T125020)