Netbox: move it to dedicated Ganeti VMs
Closed, Resolved, Public

Description

Netbox is currently installed on the netmon hosts. Given the growing importance of this service it would be better to move it to a couple of dedicated VMs in Ganeti.

Along with the move we should also investigate:

  • Options to set it in active/active setup for the frontend part
  • Options to set it in HA for the DB side too (maybe within the same DC?)

Event Timeline

Volans triaged this task as Medium priority. May 14 2019, 3:40 PM

Change 514395 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] profile::netbox: Reorganize for splitting front and back-end.

https://gerrit.wikimedia.org/r/514395

To update this ticket with the current situation:

  • There is a puppet patch which is largely/completely correct at this point (and tested in WMCS).
  • The support for the current configuration is marginal. In a conversation with Riccardo, we've agreed that disabling puppet on Netmon, and then deploying the new VMs would be the ideal scenario. If we need to revert, we will revert the patch, and then enable puppet on Netmon again.
  • The patch makes postgres replicas synchronous. This will make writes somewhat slower, but very safe.
  • Our general course for HA will be to first get two replicas in production (one per datacenter) of both frontend and database. The frontends will be given an external IP address, and netbox.wikimedia.org will CNAME to the eqiad one. To support future HA, we shall round-robin or geodns the frontends (or similar).
  • There are one or two minor changes that need to be made to the patch, but we should be ready to merge shortly.
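The trade-off noted above for synchronous replicas (slower writes, stronger safety) can be illustrated with a toy model. This is only a sketch, not PostgreSQL's actual replication protocol: the `Replica` and `Primary` classes and the latency figures are hypothetical and chosen purely for illustration.

```python
class Replica:
    """Toy standby that records writes and reports a simulated round-trip time."""

    def __init__(self, name, round_trip_ms):
        self.name = name
        self.round_trip_ms = round_trip_ms  # hypothetical network latency
        self.log = []

    def apply(self, record):
        self.log.append(record)
        return self.round_trip_ms


class Primary:
    """Toy primary; in synchronous mode a write waits for every replica."""

    def __init__(self, local_ms, replicas, synchronous):
        self.local_ms = local_ms
        self.replicas = replicas
        self.synchronous = synchronous
        self.log = []
        self.pending = []  # records not yet shipped (asynchronous mode only)

    def write(self, record):
        """Store the record and return the simulated commit latency in ms."""
        self.log.append(record)
        cost = self.local_ms
        if self.synchronous:
            # The commit is acknowledged only after the slowest replica confirms.
            cost += max((r.apply(record) for r in self.replicas), default=0)
        else:
            # The commit returns immediately; replicas catch up later.
            self.pending.append(record)
        return cost


sync_primary = Primary(2, [Replica("remote-dc", 35)], synchronous=True)
async_primary = Primary(2, [Replica("remote-dc", 35)], synchronous=False)
print(sync_primary.write("device-1"), async_primary.write("device-1"))
```

In this model the synchronous write pays the replica's round trip on every commit, but the replica already holds the record when the write returns. The asynchronous write is faster, yet the record exists only on the primary until it is shipped.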

Change 532502 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/dns@master] Add Netbox instance addresses

https://gerrit.wikimedia.org/r/532502

Change 532502 merged by CRusnov:
[operations/dns@master] Add Netbox instance addresses

https://gerrit.wikimedia.org/r/532502

Change 533487 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: add role spare::system to new VMs

https://gerrit.wikimedia.org/r/533487

Change 533487 merged by Volans:
[operations/puppet@production] netbox: add role spare::system to new VMs

https://gerrit.wikimedia.org/r/533487

@crusnov in case you missed my IRC ping yesterday, please re-install the two public ones before proceeding with the installation as they had no firewall (see above hotfix patch).

@Volans Ah hah thanks for this. I was given to believe the 'default' would include the ferm config and didn't even think of looking.

Change 514395 merged by CRusnov:
[operations/puppet@production] profile::netbox: Reorganize for splitting front and back-end.

https://gerrit.wikimedia.org/r/514395

Mentioned in SAL (#wikimedia-operations) [2019-09-04T00:02:45Z] <chaomodus> installing and setting up netbox instances T223291

@crusnov This commit broke Puppet on puppetdb2001.codfw.wmnet (which also uses postgresql::slave) as the data type for includes is switched from array to string. I checked other uses of postgres::slave and previously users of that class were passing both types (so the newly introduced type annotation unveiled a real bug).

There are two ways to fix this:
a) Switch puppetmaster::puppetdb::database to pass the includes as a string
b) Switch all the other uses of that class to pass an array

I think the latter is the more correct fix, as

  • postgres::master also uses an array and we'd be inconsistent otherwise
  • There are probably valid use cases for passing more than one include file

Also adding @akosiaris for input as he initially wrote these classes.

The initial intention was for $includes to be an array indeed. So I favor b) as well.
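As a rough illustration of the string-versus-array mismatch discussed above, here is a Python analogy (not Puppet code, and not taken from the actual classes). The function names are hypothetical, and the rendered `include '...'` lines assume the values end up as PostgreSQL configuration include directives.

```python
from typing import List, Sequence, Union


def normalize_includes(includes: Union[str, Sequence[str]]) -> List[str]:
    """Return includes as a list, accepting either a bare string or a sequence."""
    if isinstance(includes, str):
        # A lone string is treated as a one-element list, matching option b).
        return [includes]
    return list(includes)


def render_include_lines(includes: Union[str, Sequence[str]]) -> str:
    """Render one include directive per file (assumed output format)."""
    return "\n".join(f"include '{path}'" for path in normalize_includes(includes))


print(render_include_lines("tuning.conf"))
print(render_include_lines(["tuning.conf", "replication.conf"]))
```

Settling on an array as the single accepted type, as favoured in option b), keeps callers consistent and still covers the case of more than one include file.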

Mentioned in SAL (#wikimedia-operations) [2019-09-05T21:42:17Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291

Mentioned in SAL (#wikimedia-operations) [2019-09-05T21:42:20Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T03:16:26Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T03:16:46Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing) (duration: 00m 20s)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T03:21:16Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291

Mentioned in SAL (#wikimedia-operations) [2019-09-06T03:21:31Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 14s)

@crusnov : Can you please fix this today, either by (partly) reverting https://gerrit.wikimedia.org/r/514395 or by adapting the type hints to use an array? This has prevented puppet runs on puppetdb2001 for ~ three days now and is blocking the setup of the new Buster-based puppetdb instances.

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:24:23Z] <crusnov@deploy1001> Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:25:52Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 01m 29s)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:25:58Z] <crusnov@deploy1001> Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:26:35Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2 (duration: 00m 37s)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:38:58Z] <crusnov@deploy1001> Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:39:19Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3 (duration: 00m 21s)

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:40:50Z] <crusnov@deploy1001> Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux

Mentioned in SAL (#wikimedia-operations) [2019-09-06T17:43:45Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 02m 55s)

Ah my mistake, apologies. As we discussed on IRC, I shall remove the type hint from the offending part and open a ticket to address it later.

PuppetDB is fixed, and the change appears to be a no-op for the other postgres::slave users.

FWIW netbox.wikimedia.org points at netbox1001.wikimedia.org now. I am working on fixing some minor remaining issues with reports and on getting backups right (the database is currently backed up correctly, but Netbox proper needs its dumps backed up).

crusnov updated the task description.

This has been completed modulo some growing pains.