upgrade netmon1001 to jessie
Closed, ResolvedPublic

Description

netmon1001 is on precise, but should be jessie

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added a project: SRE.
Dzahn set Security to None.
Dzahn added subscribers: Krenair, Ricordisamoa, hashar and 4 others.
Dzahn added a subscriber: akosiaris.
Andrew triaged this task as Medium priority.
Dzahn removed Dzahn as the assignee of this task. Apr 18 2016, 6:57 PM
faidon added a subscriber: faidon.

@akosiaris said he would do this at our last monitoring meeting :)

Change 344112 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] librenms: use base::service_unit

https://gerrit.wikimedia.org/r/344112

Change 344113 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet] apache 2.4 compat for librenms/servermon/smokeping

https://gerrit.wikimedia.org/r/344113

Change 344112 merged by Filippo Giunchedi:
[operations/puppet] librenms: use base::service_unit

https://gerrit.wikimedia.org/r/344112

Change 344113 merged by Filippo Giunchedi:
[operations/puppet] apache 2.4 compat for librenms/servermon/smokeping

https://gerrit.wikimedia.org/r/344113

Change 344168 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] install_server: switch netmon to jessie

https://gerrit.wikimedia.org/r/344168

Change 344168 merged by Dzahn:
[operations/puppet@production] install_server: switch netmon to jessie

https://gerrit.wikimedia.org/r/344168

Change 344184 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] temp copy netmon1001 app data to gerrit2001 for migration

https://gerrit.wikimedia.org/r/344184

Change 344184 merged by Dzahn:
[operations/puppet@production] temp copy netmon1001 app data to gerrit2001 for migration

https://gerrit.wikimedia.org/r/344184

Mentioned in SAL (#wikimedia-operations) [2017-03-22T19:44:07Z] <mutante> rsyncing /srv of netmon1001 to /srv/netmon1001 on gerrit2001 (T125020)

Also tested on a separate jessie labs instance:

role rancid::server - no issues, just finishes run

role librenms - package snmp-mibs-downloader can't be downloaded because it's in non-free, which isn't enabled on labs (it is in prod). It also errors because tin can't be reached as deployment server, but that is expected as of now.

role torrus - will be replaced by prometheus

role servermon::wmf - the only error is the tin / deployment server one in labs, as above with librenms. The rest appears good, including the Apache config; Apache restarts fine.

role smokeping - just works, no issues

role network::monitor - Could not find data item prometheus_nodes in any Hiera data file

@fgiunchedi i rsynced /srv to gerrit2001. to update next week before shutdown, push: root@netmon1001:/srv# rsync -avp /srv/ rsync://gerrit2001.wikimedia.org/netmon1001

Change 344292 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001 migration: separate rsync modules for each app

https://gerrit.wikimedia.org/r/344292

Change 344292 merged by Dzahn:
[operations/puppet@production] netmon1001 migration: separate rsync modules for each app

https://gerrit.wikimedia.org/r/344292

Change 344298 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001 migration: additional paths to backup

https://gerrit.wikimedia.org/r/344298

Change 344299 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] netmon1001: add host to Bacula backups

https://gerrit.wikimedia.org/r/344299

Change 344298 merged by Dzahn:
[operations/puppet@production] netmon1001 migration: additional paths to backup

https://gerrit.wikimedia.org/r/344298

Change 344299 merged by Dzahn:
[operations/puppet@production] netmon1001: add host to Bacula backups

https://gerrit.wikimedia.org/r/344299

Change 344300 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] bacula: add file sets for librenms,torrus,smokeping

https://gerrit.wikimedia.org/r/344300

Change 344300 merged by Dzahn:
[operations/puppet@production] bacula: add file sets for librenms,torrus,smokeping

https://gerrit.wikimedia.org/r/344300

Needs one more patch to add backup sets to role but my laptop is about to shut down and i could not upload for some reason...

Change 344560 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] librenms/smokeping/torrus: add Bacula backup sets

https://gerrit.wikimedia.org/r/344560

Change 344561 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] bacula: add /srv/librenms to librenms backup file set

https://gerrit.wikimedia.org/r/344561

Change 344560 merged by Dzahn:
[operations/puppet@production] librenms/smokeping/torrus: add Bacula backup sets

https://gerrit.wikimedia.org/r/344560

Change 344561 merged by Dzahn:
[operations/puppet@production] bacula: add /srv/librenms to librenms backup file set

https://gerrit.wikimedia.org/r/344561

Checked that the backups now appear on helium (the Bacula server).

how to restore:

[helium:~] $ sudo bconsole
Connecting to Director helium.eqiad.wmnet:9101
1000 OK: 102 helium.eqiad.wmnet Version: 7.4.3 (18 June 2016)
Enter a period to cancel a command.

type "restore"

*restore

Automatically selected Catalog: production
Using Catalog "production"

First you select one or more JobIds that contain files
to be restored. You will be presented several methods
of specifying the JobIds. Then you will be allowed to
select which files from those JobIds are to be restored.

To select the JobIds, you have the following choices:
     1: List last 20 Jobs run
     <many other choices>
     ...

select:      5: Select the most recent backup for a client

Defined Clients:
<a lot of clients are listed>

select:     60: netmon1001.wikimedia.org-fd

Select the Client (1-95): 60
The defined FileSet resources are:
     1: librenms
     2: smokeping
     3: torrus

select the FileSet for your tool

bacula starts to build a directory tree:

Select FileSet resource (1-3): 1
+--------+-------+----------+----------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes       | StartTime           | VolumeName     |
+--------+-------+----------+----------------+---------------------+----------------+
| 50,322 | F     |   17,687 | 20,571,901,482 | 2017-03-25 04:31:53 | production0047 |
...
Building directory tree for JobId(s) 50322,50390,50458 ...  +++++++++++++++++++++++++++++++++++++++++++++++++
17,584 files inserted into the tree.

You are now entering file selection mode where you add (mark) and
remove (unmark) files to be restored. No files are initially added, unless
you used the "all" keyword on the command line.
Enter "done" to leave this mode.

cwd is: /
$ 

from here on you can use "pwd", "cd", "ls" as in a shell.

pwd
cwd is: /srv/librenms/rrd/
$ 

now you will use the "mark" command to mark files/directories to be restored.

  lsmark     list the marked files in and below the cd
  mark       mark dir/file to be restored recursively, wildcards allowed
  markdir    mark directory name to be restored (no files)

There is also a "help" at this step with more commands.

Once done marking files, type "done".

Something like this will appear:

1 file selected to be restored.

Using Catalog "production"
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /var/lib/bacula/helium.eqiad.wmnet.restore.1.bsr
Where:           /var/tmp/bacula-restores
Replace:         Always
FileSet:         root
Backup Client:   netmon1001.wikimedia.org-fd
Restore Client:  netmon1001.wikimedia.org-fd
Storage:         helium-FileStorage1
When:            2017-03-27 17:51:43
Catalog:         production
Priority:        1
Plugin Options:  
OK to run? (yes/mod/no): 

At this step, type "mod" if you want to restore to a different server than the selected one, or to a different path than /var/tmp/bacula-restores. Otherwise type "yes".

It should then display that the job is queued.

Job queued. JobId=50484

Now check back a while later in the expected restore location. Files should appear.

"exit" to quit bconsole.

I confirmed the restore of one random file on netmon1001 itself.

root@netmon1001:/var/tmp/bacula-restores/var/lib/smokeping/DNS# file ns0.rrd 
ns0.rrd: RRDTool DB version 0003

^ this just got created by bacula.

We were talking about maybe making a wrapper script for the above that makes this less involved for the most common cases.

10:42 <godog> and also reminded me I wanted a less involved restore process, looks like bconsole can run somewhat non-interactively
10:44 <mutante> ok, let's look at that, maybe can make a wrapper around it that is just the most common one. "restore latest to default path"
10:47 <godog> yeah that'd be really nice, most of the time we are really interested in just restoring the latest version anyway

10:47 <godog> I think the common use cases would be:
10:48 <godog> - restore this file set back into machine x
10:48 <godog> - restore this file set from machine x into machine y
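
As a starting point for such a wrapper, here is a minimal sketch covering the two use cases listed above. It only builds the one-line restore command and, unless told otherwise, prints it instead of running it. The function names (build_restore_cmd, run_restore) and the DRY_RUN switch are made up for illustration. The keywords (client=, restoreclient=, fileset=, where=, select current all done yes) are taken from the Bacula console documentation, but this has not been verified against the Director on helium, so treat it as an assumption to test before relying on it.

```shell
# Hypothetical wrapper sketch, not an existing tool.
# build_restore_cmd CLIENT FILESET [RESTORE_CLIENT] [WHERE]
#   CLIENT          backup client, e.g. netmon1001.wikimedia.org-fd
#   FILESET         file set name, e.g. librenms
#   RESTORE_CLIENT  where to restore to (defaults to CLIENT)
#   WHERE           target path (defaults to the path shown in the walkthrough)
build_restore_cmd() {
    _client="$1"
    _fileset="$2"
    _restore_client="${3:-$1}"
    _where="${4:-/var/tmp/bacula-restores}"
    # "select current all done yes" asks for the most recent backup, marks
    # every file, leaves selection mode and skips the confirmation prompt.
    printf 'restore client=%s restoreclient=%s fileset=%s where=%s select current all done yes\n' \
        "$_client" "$_restore_client" "$_fileset" "$_where"
}

# run_restore takes the same arguments. By default it only prints the command;
# set DRY_RUN=0 to pipe it into bconsole (assumes sudo access as in the walkthrough).
run_restore() {
    _cmd="$(build_restore_cmd "$@")"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$_cmd"
    else
        echo "$_cmd" | sudo bconsole
    fi
}
```

Restoring the latest librenms file set back into the same machine would then be `run_restore netmon1001.wikimedia.org-fd librenms`. Restoring into another machine means passing that machine's client name as the third argument.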

Mentioned in SAL (#wikimedia-operations) [2017-03-29T01:16:21Z] <mutante> rsyncing librenms/torrus/smokeping app data from netmon1001 to gerrit2001. adding alias "syncit" to do it all at once (T125020)

@fgiunchedi To rsync app data tomorrow you can simply type syncit on netmon1001 now.

The long form is:

rsync -avp /var/lib/librenms/ rsync://gerrit2001.wikimedia.org/librenms-lib
rsync -avp /var/lib/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-lib
rsync -avp /var/cache/smokeping/ rsync://gerrit2001.wikimedia.org/smokeping-cache
rsync -avp /var/lib/torrus/ rsync://gerrit2001.wikimedia.org/torrus-lib
rsync -avp /var/cache/torrus/ rsync://gerrit2001.wikimedia.org/torrus-cache

syncitback, to copy the data back after the upgrade, will be:

rsync -avp rsync://gerrit2001.wikimedia.org/librenms-lib /var/lib/librenms/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-lib /var/lib/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/smokeping-cache /var/cache/smokeping/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-lib /var/lib/torrus/
rsync -avp rsync://gerrit2001.wikimedia.org/torrus-cache /var/cache/torrus/
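
The two command lists above differ only in direction, so both could be driven from one list of directory/module pairs. The following is a sketch of that idea, not the actual definition of the syncit alias. The function name sync_all and the DRY_RUN switch are assumptions; the paths, module names and rsync flags are copied from the commands above.

```shell
# Illustrative sketch only; sync_all and DRY_RUN are made-up names.
REMOTE="rsync://gerrit2001.wikimedia.org"

# One "local-directory rsync-module" pair per line, as in the long form above.
SYNC_PAIRS="/var/lib/librenms/ librenms-lib
/var/lib/smokeping/ smokeping-lib
/var/cache/smokeping/ smokeping-cache
/var/lib/torrus/ torrus-lib
/var/cache/torrus/ torrus-cache"

# sync_all push   copies local data to gerrit2001 (what syncit does)
# sync_all pull   copies it back after the upgrade (what syncitback does)
# By default the commands are only printed; set DRY_RUN=0 to run them.
sync_all() {
    _direction="$1"
    echo "$SYNC_PAIRS" | while read -r _dir _module; do
        if [ "$_direction" = "push" ]; then
            set -- rsync -avp "$_dir" "$REMOTE/$_module"
        else
            set -- rsync -avp "$REMOTE/$_module" "$_dir"
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            # Keep rsync from reading the pair list on stdin.
            "$@" </dev/null
        fi
    done
}
```

Running `sync_all push` prints the five push commands for review, and `DRY_RUN=0 sync_all push` executes them.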

Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts:

['netmon1001.wikimedia.org']

The log can be found in /var/log/wmf-auto-reimage/201703291025_filippo_6564.log.

Completed auto-reimage of hosts:

['netmon1001.wikimedia.org']

and were ALL successful.

Change 345337 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] netmon: post jessie reimage fixes

https://gerrit.wikimedia.org/r/345337

Change 345337 merged by Filippo Giunchedi:
[operations/puppet@production] netmon: post jessie reimage fixes

https://gerrit.wikimedia.org/r/345337

Change 345368 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] rancid: back up /var/lib/rancid

https://gerrit.wikimedia.org/r/345368

Change 345368 merged by Dzahn:
[operations/puppet@production] rancid: back up /var/lib/rancid

https://gerrit.wikimedia.org/r/345368

This is completed!

Mentioned in SAL (#wikimedia-operations) [2017-04-07T23:16:40Z] <mutante> gerrit2001 - deleting netmon1001 backup (/srv/netmon1001), stop rsyncd, remove rsyncd config (T125020)