What we are looking for: if labmon1001 dies tomorrow, we have reasonable confidence that we can get labmon1002 going in its place, and it would be nice not to have lost more than a day of data.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | aborrero | T189871 labmon1002 as cold standby for labmon1001
Resolved | | aborrero | T190312 labmon: missing puppet configuration for keystone integration
Resolved | | aborrero | T190512 labmon: syncronize whisper files between labmon1001 and labmon1002
Resolved | | aborrero | T190515 labmon: persist apache/httpd configuration
Event Timeline
There are several possible approaches for this:
- sync data between the two hosts by means of a manual process (rsync, cron, etc.)
- use a shared filesystem between the two servers for the datasets
- collect data on both hosts at the same time (by means of a virtual IP address or something?)
I will investigate how other people are doing this out there.
Meanwhile, we could simply apply role(labs::monitoring), but we are not sure whether this could cause any issues.
There are several hiera keys that point to labmon1001 for different purposes.
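For reference, the first approach in the list above (a periodic rsync driven by cron) could look roughly like the following. This is only a minimal sketch, not what was deployed: the whisper directory path and the helper name `build_rsync_command` are assumptions made for illustration. Running it once a day would match the goal of not losing more than a day of data.

```python
import shlex


def build_rsync_command(source_dir, standby_host, dest_dir, dry_run=True):
    """Build an rsync argv that copies source_dir to standby_host:dest_dir.

    Hypothetical helper: directory paths are assumptions, not the real layout.
    """
    cmd = ["rsync", "--archive"]
    if dry_run:
        # Only report what would be transferred; nothing is copied.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory contents, not the directory itself.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append("{}:{}/".format(standby_host, dest_dir.rstrip("/")))
    return cmd


if __name__ == "__main__":
    # Assumed whisper location; adjust to wherever the datasets actually live.
    argv = build_rsync_command("/srv/carbon/whisper", "labmon1002", "/srv/carbon/whisper")
    print(" ".join(shlex.quote(part) for part in argv))
```

The command is returned as a list so that it can be passed to `subprocess.run` without shell quoting problems, or printed for use in a crontab entry.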
Change 420019 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] site.pp: put labmon1002 into work
Change 420019 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] site.pp: put labmon1002 into work
Change 421006 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports
Change 421006 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports
Change 421014 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: remove apt pinning declaration
Change 421014 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: remove apt pinning declaration
Another cronspam:
Subject: Cron <_graphite@labmon1002> /usr/local/bin/archive-instances
Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 137, in <module>
    level=logging.INFO)
  File "/usr/lib/python2.7/logging/__init__.py", line 1540, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "/usr/lib/python2.7/logging/__init__.py", line 911, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python2.7/logging/__init__.py", line 936, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/graphite/instance-archiver.log'
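The traceback shows `logging.basicConfig` failing because the log file cannot be opened by the `_graphite` user. The fix that landed for this task was on the permissions side (see the /var/log/graphite change below). As a side note, a script can also be written so that an unwritable log file does not abort the run. The following is only a sketch of that idea, with a hypothetical helper name; it is not the code in archive-instances.

```python
import logging
import sys


def make_log_handler(path):
    """Return a FileHandler for path, or a stderr handler if path cannot be opened.

    Hypothetical helper for illustration only.
    """
    try:
        # FileHandler opens the file immediately, so permission problems surface here.
        return logging.FileHandler(path)
    except (IOError, OSError):
        # Fall back to stderr so the job keeps running and cron still mails the output.
        return logging.StreamHandler(sys.stderr)


if __name__ == "__main__":
    handler = make_log_handler("/var/log/graphite/instance-archiver.log")
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.info("logging configured with %s", type(handler).__name__)
```

The trade-off is that a fallback like this can hide a misconfigured log directory, so fixing the directory ownership in puppet remains the more direct remedy.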
Change 422126 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: introduce hiera support for multiple labmon servers
Change 422417 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite
Change 422417 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite
Change 422126 abandoned by Arturo Borrero Gonzalez:
wmcs: introduce hiera support for multiple labmon servers
Reason:
Not trying such a key factorization right now. Focusing on getting the rsync thing done. A partial merge of this change can be found at: https://gerrit.wikimedia.org/r/#/c/422389/
Change 424273 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] graphite: archive-instances: import missing yaml python module
Change 424273 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] graphite: archive-instances: import missing yaml python module
If nobody has more comments on this, I would say the standby server is ready to take over in case the active one dies.
Docs at: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring
@fgiunchedi suggested that we collect data on both nodes at the same time rather than running rsync. Perhaps we can have a look at this approach in the future.
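To make the suggestion concrete, here is a rough sketch of what writing each datapoint to both nodes could look like from a client's point of view. It assumes the hosts accept carbon's plaintext protocol (one `path value timestamp` line per datapoint) on port 2003; the port, the metric name and both helper names are assumptions for illustration, and nothing like this has been implemented as part of this task.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one datapoint as a carbon plaintext line: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "{} {} {}\n".format(path, value, int(timestamp))


def send_to_all(line, hosts, port=2003, timeout=5.0):
    """Send the same line to every host; return a dict of host -> error for failures."""
    failures = {}
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(line.encode("ascii"))
        except OSError as exc:
            # One host being down must not prevent delivery to the other.
            failures[host] = exc
    return failures


if __name__ == "__main__":
    # Hypothetical metric name; running this attempts real connections to both hosts.
    line = format_metric("example.instance.cpu", 0.42)
    print(send_to_all(line, ["labmon1001", "labmon1002"]))
```

With this approach both hosts hold the same data at all times, so a failover would not depend on when the last rsync ran.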