
labmon1002 as cold standby for labmon1001
Closed, Resolved, Public

Description

What we are looking for: if labmon1001 dies tomorrow, we have reasonable confidence that we can get labmon1002 going in its place. It would also be nice not to lose more than a day of data.

Event Timeline

There are several possible approaches for this:

  • sync data between the two hosts by means of a scheduled process (rsync, cron, etc.)
  • share a filesystem between the two servers for the datasets
  • collect data on both hosts at the same time (by means of a virtual IP address or something similar?)
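
As a rough illustration of the first option, here is a minimal sketch of a helper that the standby could run from cron to pull the datasets from the active host. The directory `/srv/example-data` and the function names are hypothetical placeholders, not taken from the actual setup; only the rsync flags `--archive`, `--delete`, and `--dry-run` are standard.

```python
import shlex
import subprocess


def build_rsync_command(source_host, source_dir, dest_dir, dry_run=False):
    """Return the argv list that pulls source_dir from source_host into dest_dir."""
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Report what would change without touching the destination.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory contents, not the directory itself.
    cmd.append(f"{source_host}:{source_dir.rstrip('/')}/")
    cmd.append(f"{dest_dir.rstrip('/')}/")
    return cmd


def sync(source_host, source_dir, dest_dir, dry_run=False):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    cmd = build_rsync_command(source_host, source_dir, dest_dir, dry_run)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths; print the command instead of running it.
    print(shlex.join(build_rsync_command(
        "labmon1001", "/srv/example-data", "/srv/example-data", dry_run=True)))
```

Running something like this once a day on labmon1002 would keep the data loss within the one-day window mentioned in the description, assuming each run finishes successfully.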

I will investigate how other people are doing this kind of thing.

Meanwhile, we could simply apply role(labs::monitoring) to labmon1002, but we are not sure whether this would cause any issues.
There are several hiera keys that point to labmon1001 for different purposes.

Change 420019 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] site.pp: put labmon1002 into work

https://gerrit.wikimedia.org/r/420019

Change 420019 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] site.pp: put labmon1002 into work

https://gerrit.wikimedia.org/r/420019

Change 421006 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports

https://gerrit.wikimedia.org/r/421006

Change 421006 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports

https://gerrit.wikimedia.org/r/421006

Change 421014 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: remove apt pinning declaration

https://gerrit.wikimedia.org/r/421014

Change 421014 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: remove apt pinning declaration

https://gerrit.wikimedia.org/r/421014

Another cronspam:

Subject: Cron <_graphite@labmon1002> /usr/local/bin/archive-instances

Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 137, in <module>
    level=logging.INFO)
  File "/usr/lib/python2.7/logging/__init__.py", line 1540, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "/usr/lib/python2.7/logging/__init__.py", line 911, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python2.7/logging/__init__.py", line 936, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/graphite/instance-archiver.log'
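
The traceback comes from the script opening its log file at startup, so an unwritable /var/log/graphite aborts the whole cron run. The fix applied in this task was to correct the directory permissions in puppet. Purely as an illustration of the failure mode, the sketch below shows one defensive alternative: fall back to stderr when the log file cannot be opened. The function name `configure_logging` is hypothetical and this is not the code of `archive-instances`.

```python
import logging
import sys


def configure_logging(log_path):
    """Log to log_path, or to stderr if the file cannot be opened."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    try:
        # FileHandler opens the file immediately, so permission problems surface here.
        handler = logging.FileHandler(log_path)
    except OSError as exc:
        # Covers PermissionError (errno 13) as well as a missing directory.
        handler = logging.StreamHandler(sys.stderr)
        root.addHandler(handler)
        root.warning("cannot open %s (%s); logging to stderr", log_path, exc)
        return handler
    root.addHandler(handler)
    return handler
```

Under cron, anything written to stderr is still mailed out, so a fallback like this would not silence the cronspam; it would only let the job itself continue.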

Change 422126 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: introduce hiera support for multiple labmon servers

https://gerrit.wikimedia.org/r/422126

Change 422417 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite

https://gerrit.wikimedia.org/r/422417

Change 422417 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite

https://gerrit.wikimedia.org/r/422417

Change 422126 abandoned by Arturo Borrero Gonzalez:
wmcs: introduce hiera support for multiple labmon servers

Reason:
Not trying such a key factorization right now. Focusing on getting the rsync thing done, so a partial merge of this change is to be found at: https://gerrit.wikimedia.org/r/#/c/422389/

https://gerrit.wikimedia.org/r/422126

Change 424273 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] graphite: archive-instances: import missing yaml python module

https://gerrit.wikimedia.org/r/424273

Change 424273 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] graphite: archive-instances: import missing yaml python module

https://gerrit.wikimedia.org/r/424273

aborrero added a subscriber: fgiunchedi.

If nobody has more comments on this, I would say the standby server is ready to take over in case the active one dies.

Docs at: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring

@fgiunchedi suggested that we collect data on both nodes at the same time rather than running rsync. Perhaps we can have a look at this approach in the future.