netmon1001 is on precise, but should be jessie
|Resolved||Dzahn||T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production|
|Resolved||fgiunchedi||T125020 upgrade netmon1001 to jessie|
|Resolved||RobH||T156040 hardware request for netmon1001 replacement|
|Unknown Object (Task)|
|Resolved||Dzahn||T159756 setup netmon1002.wikimedia.org|
|Resolved||ayounsi||T159757 netmon1002 networking setup|
also tested on a separate jessie labs instance:
role rancid::server - no issues, just finishes run
role librenms - package snmp-mibs-downloader can't be downloaded because it's in non-free, which isn't enabled on labs (but is in prod); the run also errors because tin can't be reached as deployment server, but that is expected as of now
role torrus - will be replaced by prometheus
role servermon::wmf - the only error is because tin / deployment can't be reached in labs, as above with librenms; the rest appears good, incl. Apache config, and the Apache restart is fine
role smokeping - just works, no issues
role network::monitor - Could not find data item prometheus_nodes in any Hiera data file
checked that the backups now appear on helium (bacula server):
how to restore:
[helium:~] $ sudo bconsole
Connecting to Director helium.eqiad.wmnet:9101
1000 OK: 102 helium.eqiad.wmnet Version: 7.4.3 (18 June 2016)
Enter a period to cancel a command.

type "restore"

*restore
Automatically selected Catalog: production
Using Catalog "production"

First you select one or more JobIds that contain files
to be restored. You will be presented several methods
of specifying the JobIds. Then you will be allowed to
select which files from those JobIds are to be restored.

To select the JobIds, you have the following choices:
     1: List last 20 Jobs run
     <many other choices>
     ...

select:
     5: Select the most recent backup for a client

Defined Clients:
     <a lot of clients are listed>

select:
     60: netmon1001.wikimedia.org-fd

Select the Client (1-95): 60

The defined FileSet resources are:
     1: librenms
     2: smokeping
     3: torrus

select your tool

bacula starts to build a directory tree:

Select FileSet resource (1-3): 1
+--------+-------+----------+----------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes       | StartTime           | VolumeName     |
+--------+-------+----------+----------------+---------------------+----------------+
| 50,322 | F     | 17,687   | 20,571,901,482 | 2017-03-25 04:31:53 | production0047 |
...

Building directory tree for JobId(s) 50322,50390,50458 ... +++++++++++++++++++++++++++++++++++++++++++++++++
17,584 files inserted into the tree.

You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.

cwd is: /
$

from here on you can use "pwd", "cd", "ls" as in a shell.

pwd
cwd is: /srv/librenms/rrd/
$

now you will use the "mark" command to mark files/directories to be restored.

lsmark     list the marked files in and below the cd
mark       mark dir/file to be restored recursively, wildcards allowed
markdir    mark directory name to be restored (no files)

There is also a "help" at this step with more commands.

Once done marking files, type "done". Something like this will appear:

1 file selected to be restored.
Using Catalog "production"
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /var/lib/bacula/helium.eqiad.wmnet.restore.1.bsr
Where:           /var/tmp/bacula-restores
Replace:         Always
FileSet:         root
Backup Client:   netmon1001.wikimedia.org-fd
Restore Client:  netmon1001.wikimedia.org-fd
Storage:         helium-FileStorage1
When:            2017-03-27 17:51:43
Catalog:         production
Priority:        1
Plugin Options:
OK to run? (yes/mod/no):

At this step type "mod" if you want to restore to a different server instead of the selected one, or to a different path than /var/tmp/bacula-restores. Or "yes", if ok.

It should display that the job is queued.

Job queued. JobId=50484

Now check back a while later in the expected restore location. Files should appear.

"exit" to quit bconsole.

I confirmed the restore of one random file on netmon1001 itself.

root@netmon1001:/var/tmp/bacula-restores/var/lib/smokeping/DNS# file ns0.rrd
ns0.rrd: RRDTool DB version 0003

^ this just got created by bacula.
We were talking about maybe making a wrapper script for the above that makes this less involved for the most common cases.
10:42 <godog> and also reminded me I wanted a less involved restore process, looks like bconsole can run somewhat non-interactively
10:44 <mutante> ok, let's look at that, maybe can make a wrapper around it that is just the most common one. "restore latest to default path"
10:47 <godog> yeah that'd be really nice, most of the time we are really interested in just restoring the latest version anyway
10:47 <godog> I think the common use cases would be:
10:48 <godog> - restore this file set back into machine x
10:48 <godog> - restore this file set from machine x into machine y
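A rough sketch of what such a wrapper could look like, covering the two use cases from the IRC log above. Nothing here exists yet: the function name build_restore_cmd is made up, and the one-line "restore ... select current all done yes" form is taken from the Bacula console documentation, not tested against our director. By default it only prints the command, so it is safe to try.

```shell
#!/bin/sh
# Hypothetical helper (not an existing script): print a one-line bconsole
# "restore" command for the "restore latest to default path" case.
# Usage: build_restore_cmd BACKUP_CLIENT FILESET [RESTORE_CLIENT] [WHERE]
build_restore_cmd() {
    # Restore client defaults to the backup client ("back into machine x");
    # pass a third argument for "from machine x into machine y".
    # WHERE defaults to the path bconsole suggested in the session above.
    # "select current all done" picks the most recent backup and marks all
    # files; the trailing "yes" skips the "OK to run?" prompt.
    printf 'restore client=%s restoreclient=%s fileset=%s where=%s select current all done yes\n' \
        "$1" "${3:-$1}" "$2" "${4:-/var/tmp/bacula-restores}"
}

# Dry run: just show what would be sent to the director.
#   build_restore_cmd netmon1001.wikimedia.org-fd librenms
# To actually queue the job (assumption: bconsole reads commands from stdin):
#   build_restore_cmd netmon1001.wikimedia.org-fd librenms | sudo bconsole
```

Printing the command first and piping it into bconsole as a separate step keeps the destructive part explicit, which seems safer with "Replace: Always" being the default.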
@fgiunchedi To rsync app data tomorrow you can simply type syncit on netmon1001 now.
The long form is:
rsync -avp /var/lib/librenms/ rsync://gerrit2001.wikimedia.org/librenms-lib
rsync -avp /var/lib/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-lib
rsync -avp /var/cache/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-cache
rsync -avp /var/lib/torrus/ rsync://gerrit2001.wikimedia.org/torrus-lib
rsync -avp /var/cache/torrus/ rsync://gerrit2001.wikimedia.org/torrus-cache
syncitback to copy back after upgrade will be:
rsync -avp rsync://gerrit2001.wikimedia.org/librenms-lib /var/lib/librenms/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-lib /var/lib/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-cache /var/cache/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-lib /var/lib/torrus/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-cache /var/cache/torrus/
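For reference, both directions could be generated from one list of path/module pairs, so the two command sets cannot drift apart. This is only a sketch: the actual syncit and syncitback scripts are not pasted in this task, and the function name sync_commands is made up. It prints the rsync commands instead of running them.

```shell
#!/bin/sh
# Hypothetical sketch, not the real syncit/syncitback scripts.
REMOTE="rsync://gerrit2001.wikimedia.org"

# Print the rsync commands for one direction:
#   sync_commands out   -> netmon1001 to the rsync modules (syncit)
#   sync_commands back  -> rsync modules to netmon1001 (syncitback)
sync_commands() {
    direction="$1"
    # Each line below is: local path, rsync module name.
    while read -r path module; do
        if [ "$direction" = "back" ]; then
            echo "rsync -avp ${REMOTE}/${module} ${path}"
        else
            echo "rsync -avp ${path} ${REMOTE}/${module}"
        fi
    done <<EOF
/var/lib/librenms/ librenms-lib
/var/lib/smokeping/ smokeping-lib
/var/cache/smokeping/ smokeping-cache
/var/lib/torrus/ torrus-lib
/var/cache/torrus/ torrus-cache
EOF
}

# Review the output first, then run it, e.g.:
#   sync_commands out | sh -x
```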
Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts:
The log can be found in /var/log/wmf-auto-reimage/201703291025_filippo_6564.log.