labmon1002 as cold standby for labmon1001
Closed, Resolved (Public)

Assigned To

Authored By

	• chasemp
	Mar 16 2018, 1:01 PM

Description

What we are looking for: if labmon1001 dies tomorrow, we have reasonable confidence that we can get labmon1002 going in its place, and ideally we would not lose more than a day of data.

Details

Subject	Repo	Branch	Lines +/-
graphite: archive-instances: import missing yaml python module	operations/puppet	production	+1 -0
wmcs: introduce hiera support for multiple labmon servers	operations/puppet	production	+17 -10
labs: monitoring: fix permissions of /var/log/graphite	operations/puppet	production	+10 -0
prometheus: server: remove apt pinning declaration	operations/puppet	production	+10 -9
prometheus: server: install depends packages from jessie-backports	operations/puppet	production	+9 -0
site.pp: put labmon1002 into work	operations/puppet	production	+2 -7


Related Objects

Status	Assigned	Task
Resolved	aborrero	T189871 labmon1002 as cold standby for labmon1001
Resolved	aborrero	T190312 labmon: missing puppet configuration for keystone integration
Resolved	aborrero	T190512 labmon: syncronize whisper files between labmon1001 and labmon1002
Resolved	aborrero	T190515 labmon: persist apache/httpd configuration

Event Timeline

• chasemp created this task. Mar 16 2018, 1:01 PM

Restricted Application added a subscriber: Aklapper. Mar 16 2018, 1:01 PM

• chasemp triaged this task as Medium priority. Mar 16 2018, 1:01 PM

• chasemp mentioned this in T165784: rack/setup/install labmon1002.

There are several possible approaches for this:

sync data between the two hosts by means of a manual process (rsync, cron, etc.)
share a filesystem between the two servers for the datasets
collect data on both hosts at the same time (by means of a virtual IP address or something similar?)

I will investigate how other people are handling this elsewhere.
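
To make the first option more concrete, here is a minimal sketch of what a cron-driven rsync pull might look like. It only builds the command line; the directory path and the short host name used in the example are assumptions for illustration, not the actual values on the labmon hosts, and the helper name `build_rsync_command` is hypothetical.

```python
import shlex


def build_rsync_command(source_host, source_dir, dest_dir, dry_run=True):
    """Return an rsync argv that pulls source_dir from source_host into dest_dir.

    Paths and host names are supplied by the caller; nothing here is taken
    from the real labmon configuration.
    """
    cmd = ["rsync", "--archive"]
    if dry_run:
        # Only report what would be transferred, without copying anything.
        cmd.append("--dry-run")
    # Trailing slashes make rsync copy the directory contents rather than
    # creating a nested directory on the destination.
    cmd.append(f"{source_host}:{source_dir.rstrip('/')}/")
    cmd.append(f"{dest_dir.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # Hypothetical values: the whisper directory location is an assumption.
    argv = build_rsync_command("labmon1001", "/srv/whisper", "/srv/whisper")
    print(shlex.join(argv))
```

A cron entry on the standby host could then run the resulting command once a day, which would match the goal of not losing more than a day of data.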

Meanwhile, we could simply apply role(labs::monitoring), but we are not sure if this could cause any issues.
There are several hiera keys that point to labmon1001 for different purposes.

Change 420019 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] site.pp: put labmon1002 into work

https://gerrit.wikimedia.org/r/420019

gerritbot added a project: Patch-For-Review. Mar 16 2018, 1:07 PM

Change 420019 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] site.pp: put labmon1002 into work

https://gerrit.wikimedia.org/r/420019

Change 421006 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports

https://gerrit.wikimedia.org/r/421006

Change 421006 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: install depends packages from jessie-backports

https://gerrit.wikimedia.org/r/421006

Change 421014 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: server: remove apt pinning declaration

https://gerrit.wikimedia.org/r/421014

Change 421014 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: server: remove apt pinning declaration

https://gerrit.wikimedia.org/r/421014

aborrero closed subtask T190312: labmon: missing puppet configuration for keystone integration as Resolved. Mar 23 2018, 1:16 PM

aborrero closed subtask T190515: labmon: persist apache/httpd configuration as Resolved. Mar 23 2018, 5:27 PM

Another cronspam:

Subject: Cron <_graphite@labmon1002> /usr/local/bin/archive-instances

Traceback (most recent call last):
  File "/usr/local/bin/archive-instances", line 137, in <module>
    level=logging.INFO)
  File "/usr/lib/python2.7/logging/__init__.py", line 1540, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "/usr/lib/python2.7/logging/__init__.py", line 911, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python2.7/logging/__init__.py", line 936, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 13] Permission denied: '/var/log/graphite/instance-archiver.log'
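
The traceback shows `logging.basicConfig` failing while opening the log file, so the quickest check is whether the cron user can open that file for appending. The following is a rough sketch of such a check (written for Python 3, whereas the script in the traceback runs under Python 2.7); the helper name `can_append` is hypothetical and not part of archive-instances. Note that opening in append mode creates the file if it does not exist and the directory is writable.

```python
def can_append(path):
    """Try to open path the way logging.FileHandler does (append mode).

    Returns (True, None) on success, or (False, error message) if the
    open fails, e.g. with "Permission denied" as in the cron mail.
    """
    try:
        with open(path, "a"):
            pass
    except OSError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # Run this as the same user as the cron job (_graphite in the mail above).
    ok, reason = can_append("/var/log/graphite/instance-archiver.log")
    print("writable" if ok else f"not writable: {reason}")
```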

Change 422126 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: introduce hiera support for multiple labmon servers

https://gerrit.wikimedia.org/r/422126

Change 422417 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite

https://gerrit.wikimedia.org/r/422417

Change 422417 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labs: monitoring: fix permissions of /var/log/graphite

https://gerrit.wikimedia.org/r/422417

Change 422126 abandoned by Arturo Borrero Gonzalez:
wmcs: introduce hiera support for multiple labmon servers

Reason:
Not trying such a key factorization right now. Focusing on getting the rsync thing done. A partial merge of this change can be found at: https://gerrit.wikimedia.org/r/#/c/422389/

https://gerrit.wikimedia.org/r/422126

Change 424273 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] graphite: archive-instances: import missing yaml python module

https://gerrit.wikimedia.org/r/424273

Change 424273 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] graphite: archive-instances: import missing yaml python module

https://gerrit.wikimedia.org/r/424273

bd808 edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Apr 9 2018, 9:26 PM

aborrero closed subtask T190512: labmon: syncronize whisper files between labmon1001 and labmon1002 as Resolved. May 15 2018, 11:46 AM

bd808 moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board. Jun 4 2018, 12:00 AM

bd808 moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

If nobody has further comments on this, I would say the standby server is ready to take over in case the active one dies.

Docs at: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring

@fgiunchedi suggested that we collect data on both nodes at the same time rather than running rsync. Perhaps we can look into this approach in the future.
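
For reference, one way the "collect on both nodes" idea could look from the sender's side is to write each metric to every labmon host using Graphite's plaintext protocol (one `path value timestamp` line per metric, carbon's default plaintext port being 2003). This is only a sketch under those assumptions and not necessarily what @fgiunchedi had in mind; the function names and the metric path in the example are hypothetical.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_all(line, hosts, port=2003, timeout=2.0):
    """Send the same line to every host.

    Returns a dict mapping each host to None on success or to an error
    string on failure, so one dead host does not block the others.
    """
    results = {}
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(line.encode("ascii"))
            results[host] = None
        except OSError as exc:
            results[host] = str(exc)
    return results


if __name__ == "__main__":
    # Hypothetical metric path; short host names used for illustration only.
    line = format_metric("example.project.instance.cpu", 0.42)
    print(send_to_all(line, ["labmon1001", "labmon1002"]))
```

With this approach both hosts would hold the same data without a sync job, at the cost of every sender needing to know about both hosts (or a relay in front of them).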
