
Switch smokeping back to eqiad
Closed, Resolved, Public

Description

In https://gerrit.wikimedia.org/r/#/c/365892/ smokeping was pointed to codfw; it should be moved back to eqiad.

Another point is that even if the web interface is pointing to codfw, the actual data (RRDs) should show measurements made from eqiad (for example, a latency of about 1 ms for https://smokeping.wikimedia.org/?target=eqiad.Core.cr1-eqiad ).

We need to double-check that the rsync configured in https://github.com/wikimedia/puppet/blob/fc2ed6ac60ea8f0395518d02a12e64870eb98544/modules/role/manifests/smokeping.pp#L17 is working properly.
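One possible way to check this on the receiving host is to look at how recently the synced RRD files were written. The sketch below is only an illustration: the function name `fresh_rrd_count` and the time window are hypothetical, it assumes the data files end in `.rrd`, and it relies on `find -mmin`, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
# Sketch only: counts RRD files under a directory that were modified
# within the last N minutes. Name and arguments are hypothetical.
fresh_rrd_count() (
    dir=$1
    minutes=$2
    # -mmin -N matches files modified less than N minutes ago
    find "$dir" -type f -name '*.rrd' -mmin "-${minutes}" | wc -l | tr -d ' '
)

# Example (on the host that receives the data):
#   fresh_rrd_count /var/lib/smokeping 30
```

Because `rsync -a` preserves modification times, a count of zero on the receiving side would suggest either that the sync is not running or that the source has stopped polling.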

Event Timeline

Change 392084 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: switch data rsync direction

https://gerrit.wikimedia.org/r/392084

Change 392086 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: switch backend to eqiad

https://gerrit.wikimedia.org/r/392086

Change 392090 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: enable cron to auto-rsync data

https://gerrit.wikimedia.org/r/392090

Mentioned in SAL (#wikimedia-operations) [2017-11-17T19:37:57Z] <mutante> T180812 - @netmon1002:# rsync -avp /var/lib/smokeping/ /root/backup/netmon1002/201711717/var/lib/smokeping/ | @netmon2001:/# rsync -avp /var/lib/smokeping/ /root/backup/netmon2001/201711717/var/lib/smokeping/

Change 392084 merged by Dzahn:
[operations/puppet@production] smokeping: switch data rsync direction

https://gerrit.wikimedia.org/r/392084

Mentioned in SAL (#wikimedia-operations) [2017-11-17T20:02:18Z] <mutante> T180812 copying smokeping data from 2001 to 1002 - netmon1002: /usr/bin/rsync -avp rsync://netmon2001.wikimedia.org/var-lib-smokeping /var/lib/smokeping/ | switching backend from codfw to eqiad

Change 392086 merged by Dzahn:
[operations/puppet@production] smokeping: switch backend to eqiad

https://gerrit.wikimedia.org/r/392086

OK, so far I have:

  • made a local rsync backup of /var/lib/smokeping into /root/backup/.. on both 1002 and 2001, in case I mess something up
  • switched the rsync direction in puppet to src: 2001, dest: 1002 (this only sets up ferm/rsyncd; it doesn't actually run the rsync command)
  • manually rsynced /var/lib/smokeping from 2001 over to 1002, once
  • switched the backend from 2001 to 1002
  • switched the rsync direction in puppet to src: 1002, dest: 2001 (again, this only sets up ferm/rsyncd; it doesn't actually run the rsync command)
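For reference, the manual backup and one-off copy steps can be sketched as below. This is a dry-run illustration that only prints commands: the paths, hostnames and the `var-lib-smokeping` rsync module come from the SAL entries, while the function names and the date stamp argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# Function names and the date stamp are hypothetical.
SMOKEPING_DIR=/var/lib/smokeping

# Print the local backup command for a host (to be run on that host).
backup_cmd() (
    host=$1
    stamp=$2   # a date string chosen by the operator
    echo "rsync -avp ${SMOKEPING_DIR}/ /root/backup/${host}/${stamp}${SMOKEPING_DIR}/"
)

# Print the one-off pull from the old backend's rsyncd module
# (to be run on the new backend).
pull_cmd() (
    src=$1
    echo "/usr/bin/rsync -avp rsync://${src}.wikimedia.org/var-lib-smokeping ${SMOKEPING_DIR}/"
)

backup_cmd netmon1002 20171117
backup_cmd netmon2001 20171117
pull_cmd netmon2001
```

Printing the commands first makes it easy to review the source and destination before anything is overwritten, which matters here because the rsync direction was flipped twice.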

What I haven't done yet:

  • enable automatic syncing by setting auto_sync to true

Questions:

Do we want to declare the current web backend the "active server" and disable the smokeping service on that one and constantly rsync data from there to the "standby" server as a backup? Or do we just want to run the actual smokeping service on both the whole time? But in that case, does it even make sense to rsync any data? What should be the ideal behaviour when the varnish director is flipped over?

Mentioned in SAL (#wikimedia-operations) [2017-11-17T20:35:21Z] <mutante> netmon1002 - rsync smokeping data back from local backup to show measurements made from eqiad as requested on T180812

Thanks!

Do we want to declare the current web backend the "active server" and disable the smokeping service on that one and constantly rsync data from there to the "standby" server as a backup?

Yes, that's the best option. Similar to LibreNMS.

What should be the ideal behaviour when the varnish director is flipped over?

Fail-over the polling as well.

Mentioned in SAL (#wikimedia-operations) [2017-11-17T20:35:21Z] <mutante> netmon1002 - rsync smokeping data back from local backup to show measurements made from eqiad as requested on T180812

Correction:

To fulfill "if the web interface is pointing to codfw, the actual data (RRDs) should show measurements made from eqiad", I copied the data on netmon1002 back from the local backup, so measurements made in eqiad are shown.

Change 392167 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smokeping: add $active_server parameter and use it

https://gerrit.wikimedia.org/r/392167

Change 392090 merged by Dzahn:
[operations/puppet@production] smokeping: enable cron to auto-rsync data

https://gerrit.wikimedia.org/r/392090

Change 392167 merged by Dzahn:
[operations/puppet@production] smokeping: add $active_server parameter and use it

https://gerrit.wikimedia.org/r/392167

After https://gerrit.wikimedia.org/r/#/c/392167/6, the setup is now:

  • smokeping uses the same $netmon_server setting in Hiera that was already used by LibreNMS; netmon1002 is the $active_server
  • additionally, there is $netmon_server_failover, set to netmon2001
  • puppet disables the smokeping service on the $passive_server
  • both servers have rsyncd/ferm fragments allowing the other
  • only the passive server has a cronjob that runs the sync command and pulls from the active server; I confirmed that the rsync command works

So eqiad is pinging, while smokeping on codfw is stopped and receives the eqiad data as a backup.
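The resulting active/passive behaviour can be summarized with a small sketch. This is not the Puppet code from the change; the function name `smokeping_plan` and its output format are hypothetical, and only the hostnames and the active/failover roles are taken from the description above.

```shell
#!/bin/sh
# Sketch of the active/passive decision; not the actual Puppet code.
# Prints the intended service and cron state for a given host.
smokeping_plan() (
    host=$1
    active=$2     # value of $netmon_server
    failover=$3   # value of $netmon_server_failover
    if [ "$host" = "$active" ]; then
        # The active server polls; it does not pull data from anywhere.
        echo "service=running cron=absent"
    elif [ "$host" = "$failover" ]; then
        # The passive server does not poll; its cronjob pulls from the active one.
        echo "service=stopped cron=present pull_from=${active}"
    else
        echo "service=unmanaged cron=absent"
    fi
)

smokeping_plan netmon1002 netmon1002 netmon2001
smokeping_plan netmon2001 netmon1002 netmon2001
```

Under this model, failing over amounts to swapping the two Hiera values: the former standby starts polling and the former active server becomes the one that pulls data.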