Improve Netbox active/passive failover process
Open, Low, Public

Description

Forking from T296452#8653039

During the recent DC switchover, Netbox was moved to codfw and became very slow. This means that in the current setup:

  • active/active may not be the best idea
  • we need to update the dc-switch cookbook to also fail over the PostgreSQL database

The issue comes from the extra latency of having the frontend in codfw and the DB in eqiad.
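
As a rough illustration of why a cross-DC database hurts: if a request issues its queries one after another, each query pays one inter-DC round trip, so the added latency grows linearly with the query count. The numbers below are hypothetical placeholders, not measurements from this incident.

```python
def added_latency_ms(queries_per_request: int, rtt_ms: float) -> float:
    """Extra time spent waiting on the network when every sequential
    query pays one round trip between the frontend and the DB."""
    return queries_per_request * rtt_ms


# Hypothetical values: 50 sequential queries for one page, 30 ms round trip.
print(added_latency_ms(50, 30.0))  # 1500.0
```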

There are 2 main ways of solving the issue.

  • We always move the primary DB to where the primary frontend is
    • But this prevents doing active/active
  • We split reads and writes, reads are always done on the local node, and writes are done where the primary DB is
    • This means slower writes, but as Netbox is read heavy this is most likely fine
    • This permits active/active (and ensures all nodes are healthy)

Option 2 seems better to me, but Netbox has no built-in support for it. I've been pointed to this Django module, https://github.com/jbalogh/django-multidb-router, but it hasn't been updated in a while.
A cookbook to ease the DB master switchover would be valuable in either case.
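
For reference, option 2 could be sketched with a plain Django database router, without the third-party module. This is a minimal sketch, not a tested Netbox change: the aliases `default` and `replica` and the class name are hypothetical, and whether Netbox's configuration lets us set `DATABASES` with several aliases and `DATABASE_ROUTERS` would need to be checked.

```python
import random

# Hypothetical aliases; they would have to match the keys of DATABASES.
PRIMARY_ALIAS = "default"            # writable primary, possibly in the other DC
LOCAL_REPLICA_ALIASES = ["replica"]  # read-only replica(s) in the local DC


class LocalReadPrimaryWriteRouter:
    """Route reads to a DC-local replica and writes to the primary."""

    def db_for_read(self, model, **hints):
        # Reads stay local, so they avoid the inter-DC round trip.
        return random.choice(LOCAL_REPLICA_ALIASES)

    def db_for_write(self, model, **hints):
        # Writes always go to the primary, wherever it currently is.
        return PRIMARY_ALIAS

    def allow_relation(self, obj1, obj2, **hints):
        # All aliases hold the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Migrate only the primary; replicas receive changes via replication.
        return db == PRIMARY_ALIAS
```

In plain Django this would be enabled by listing the class's dotted path in the `DATABASE_ROUTERS` setting. One caveat of this simple approach is replication lag: a read issued right after a write may hit the replica before the change has arrived there.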

Event Timeline

ayounsi triaged this task as Low priority.
Restricted Application added a subscriber: Aklapper.

We also have to solve some Puppet-related issues: several things that are driven by which host is primary are currently set in Hiera.

We should focus our efforts on improving the active/passive failover process.

ayounsi renamed this task from Netbox in codfw slowness issue to Improve Netbox active/passive failover process. Aug 23 2024, 10:17 AM