
Netbox and Redis
Open, Low, Public

Description

Follow up from https://gerrit.wikimedia.org/r/c/operations/dns/+/808198

Looking at making Netbox frontends active/active between eqiad and codfw.

While Netbox used to use the central Redis instance (see doc), it got moved to a local one with rOPUP461ff2f55b37: netbox: Adjust settings for supporting Netbox 2.9 series

as newer redis features are required and the redis servers previously depended on are not of a sufficiently new version.

Looking at Netbox's doc:

NetBox v2.9.0 and later require Redis v4.0 or higher.

While now rdb misc uses:

rdb1011:~$ redis-server -v
Redis server v=6.0.14

So we might be able to use a central Redis server again.
@akosiaris are there any reasons on the RDB side that would prevent us from using it for Netbox? For example the latency between eqiad and codfw? Or maybe it's just not made for that.

Some additional thoughts:

  • Allowing active/active should improve performance and ensure that all the Netbox frontends are healthy, since each of them would see some traffic
    • One unknown is whether it will degrade performance on the frontend not local to the DB primary.
  • It's better to use a centrally managed cluster as it benefits from a team's expertise, avoids re-inventing the wheel and scales better. The alternative, a new Redis cluster between the Netbox frontends, would make day-to-day management as well as upgrades more complex
    • I don't think there is a risk of circular dependency (Redis broken and the only way to fix it is through Netbox, which requires Redis). But if it's a real risk we could document a way to switch back to a local instance (eg. Hiera config knob) in an emergency

Event Timeline

ayounsi created this task.

Hi!

Thanks for this task. So, from what I gathered, netbox uses Redis for caching and task queuing purposes, and supports using different databases per function (queuing vs caching). Is that understanding correct?

In both DCs, we have a Redis_misc cluster (non-replicated across DCs) which has at least 1 slot still unallocated and available for this. I expect that caching wise this should be fine.

Task queuing wise, I guess it depends on where and how the task executors are configured.

If task executors are being executed in both DCs, then tasks can be scheduled in both DCs and executed in both DCs and we can just stick to DC-local redis for both use cases.

If only one of the 2 (eqiad/codfw) instances has the task executor, then we need to have the netbox task queue configuration talk to just that Redis_misc instance. That should add some latency to task executions. I am guessing they are asynchronous though, so that shouldn't be an issue.

Thanks for this task. So, from what I gathered, netbox uses Redis for caching and task queuing purposes, and supports using different databases per function (queuing vs caching). Is that understanding correct?

Indeed! according to https://docs.netbox.dev/en/stable/installation/3-netbox/#redis

Note that NetBox requires the specification of two separate Redis databases: tasks and caching. These may both be provided by the same Redis service, however each should have a unique numeric database ID.

I think for queuing anything would work as this is not time sensitive.

What I'm wondering is the caching behavior.
If we have Netbox active/active and each DC uses the DC-local Redis instance (non replicated), I guess there is a high risk that clients hitting a different frontend (in a different DC) will see different results?
For example client 1 writes to eqiad, client 2 fetches from codfw and doesn't see the eqiad write.
If this assumption is correct I guess the next questions are:

  • Is it a bad thing? How much in sync do both DCs need to be?
    • For example someone editing Netbox UI in eqiad while running cookbooks from codfw's cumin
  • If it is indeed a bad thing (and we need close to real time sync), what are our options?
    • Is it possible to replicate Redis across DCs? (I guess not as it would extend the failure domain, but asking just in case :) )
    • Should we have the Netbox frontends active/active but talk to an active/passive Redis? Would there be any gain here?

Thanks for this task. So, from what I gathered, netbox uses Redis for caching and task queuing purposes, and supports using different databases per function (queuing vs caching). Is that understanding correct?

Indeed! according to https://docs.netbox.dev/en/stable/installation/3-netbox/#redis

Note that NetBox requires the specification of two separate Redis databases: tasks and caching. These may both be provided by the same Redis service, however each should have a unique numeric database ID.

I think for queuing anything would work as this is not time sensitive.

OK cool.

What I'm wondering is the caching behavior.

The caching mechanism apparently uses https://github.com/Suor/django-cacheops. In order for caching to be effective in any way (vs a call to the database) it will have to be DC-local (otherwise the round trips will probably eat up the entirety of the speed up).

If we have Netbox active/active and each DC uses the DC-local Redis instance (non replicated), I guess there is a high risk that clients hitting a different frontend (in a different DC) will see different results?
For example client 1 writes to eqiad, client 2 fetches from codfw and doesn't see the eqiad write.

Yes, that's a distinct possibility. We have an entire infrastructure (eventgates+kafka+purged) to avoid that scenario for mediawiki edits and properly invalidate caches (and even that took years to get right).

However, in this case, invalidation appears to be entirely up to django-cacheops and per https://github.com/Suor/django-cacheops#invalidation it is both event driven and time driven.
15 minutes appears to be the default for the time based stuff.
Event driven means that the cache is invalidated when an instance of a django model gets changed by the app. That can NOT be propagated to the other DC.
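
To make the split-view scenario concrete, here is a minimal sketch of two frontends that share one database but each keep a DC-local cache with a 15 minute TTL. It is not django-cacheops code: `TTLCache`, `Frontend` and the `device-1` key are hypothetical stand-ins, and time is passed in explicitly so the example stays deterministic.

```python
class TTLCache:
    """Tiny in-memory stand-in for a DC-local Redis cache with a TTL (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry time in seconds)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None or now >= entry[1]:
            return None  # miss, or entry expired (time-driven invalidation)
        return entry[0]

    def set(self, key, value, now):
        self.store[key] = (value, now + self.ttl)

    def invalidate(self, key):
        self.store.pop(key, None)


class Frontend:
    """A Netbox-like frontend: reads go through its own cache, writes hit the DB."""

    def __init__(self, cache, database):
        self.cache = cache
        self.db = database

    def read(self, key, now):
        value = self.cache.get(key, now)
        if value is None:
            value = self.db[key]
            self.cache.set(key, value, now)
        return value

    def write(self, key, value):
        self.db[key] = value
        # Event-driven invalidation only reaches the cache local to this frontend.
        self.cache.invalidate(key)


TTL = 15 * 60  # the 15 minute default mentioned above
db = {"device-1": "active"}  # stands in for the single primary database
eqiad = Frontend(TTLCache(TTL), db)
codfw = Frontend(TTLCache(TTL), db)

codfw.read("device-1", now=0)               # warms the codfw cache
eqiad.write("device-1", "decommissioned")   # invalidates the eqiad cache only
stale = codfw.read("device-1", now=60)      # still served from the codfw cache
fresh = codfw.read("device-1", now=TTL)     # TTL expired, re-read from the DB
print(stale, fresh)
```

In this sketch codfw keeps serving the old value until its own TTL runs out, which is the window in which two clients can see different states.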

If this assumption is correct I guess the next questions are:

  • Is it a bad thing? How much in sync do both DCs need to be?
    • For example someone editing Netbox UI in eqiad while running cookbooks from codfw's cumin
  • If it is indeed a bad thing (and we need close to real time sync), what are our options?
    • Is it possible to replicate Redis across DCs? (I guess not as it would extend the failure domain, but asking just in case :) )
    • Should we have the Netbox frontends active/active but talk to an active/passive Redis? Would there be any gain here?

Some answers:

  • It is a bad thing, as 2 different SREs could see (within the configurable 15m window) different views of the same objects and proceed to make the same changes. Your cumin example makes it even worse as mass automated actions could be taken based on an old view of the state.
  • It is technically possible to replicate Redis across DCs and we did it for years. It was pretty painful and caused frequent, difficult-to-diagnose issues in the mw job queueing system. Enough that we stopped doing it and I think multiple people will urge against doing that today. It's also practically a leader/follower (or main/secondary if one prefers that terminology) setup and wouldn't particularly help here as writes to the follower aren't a good pattern.
  • Paying 40ms of RTT to talk to the Redis in the active DC would probably negate most of the gains of having a cache in the first place. I doubt querying the database costs that much latency wise in most cases.
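
A rough back-of-the-envelope sketch of that last point. Only the 40ms cross-DC RTT comes from this thread; the database query cost, the local Redis round trip and the hit ratio are made-up assumptions, and the model ignores that a distant frontend would also pay extra latency to reach the DB primary.

```python
def expected_read_ms(hit_ratio, cache_rtt_ms, db_query_ms):
    """Average read latency: always pay the cache round trip, pay the DB on a miss."""
    return cache_rtt_ms + (1.0 - hit_ratio) * db_query_ms


DB_QUERY_MS = 10.0       # assumption: typical cost of the uncached query
LOCAL_RTT_MS = 0.5       # assumption: round trip to a DC-local Redis
CROSS_DC_RTT_MS = 40.0   # RTT between DCs, as quoted above
HIT_RATIO = 0.9          # assumption: 90% of reads are cache hits

no_cache = DB_QUERY_MS
local_cache = expected_read_ms(HIT_RATIO, LOCAL_RTT_MS, DB_QUERY_MS)
remote_cache = expected_read_ms(HIT_RATIO, CROSS_DC_RTT_MS, DB_QUERY_MS)

print(f"no cache:     {no_cache:.1f} ms")
print(f"local cache:  {local_cache:.1f} ms")
print(f"remote cache: {remote_cache:.1f} ms")
```

Under these assumptions the remote cache is slower than skipping the cache entirely, because every read pays the cross-DC round trip before it can even learn whether it was a hit.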

The risk of the "bad" thing happening can of course be lowered by trimming the 15m down to say 5m or even lower. At which point, and given the number of people that are going to be using netbox, one starts to wonder if the cache would help with anything.

One question. Redis is definitely a requirement for netbox 2.6, I am guessing for the async task queueing stuff. Can netbox be configured without the redis cache? After all it just might not be worth it to have it.

Can netbox be configured without the redis cache? After all it just might not be worth it to have it.

Unfortunately not. According to https://docs.netbox.dev/en/stable/configuration/required-parameters/#redis Redis is required for both tasks and caching.

To sum up:

  • the replication and the "split view" are off the table
  • Redis is required
  • we have frontends in both DCs (for redundancy), they can be active/active or active/passive
  • we have primary/backup Postgres

Keeping in mind that the initial goal is to make the frontends active/active (for latency, and to make sure they're both operational) and to not use a "self hosted Redis on localhost" (but a centrally managed one instead)

From there I see two options:

  1. Keep the status quo, frontends stay active/passive, Redis stays local to the frontends
  2. Configure Netbox frontends as active/active and talk to a single "active" Redis instance (in the same site as the primary DB)

Option 2 still seems preferable: we won't get the latency benefits on the "distant frontend", and might even see a latency increase (eg. codfw frontend->Redis miss->DB primary). But the benefits of having active/active frontends and using a central Redis (instead of one on localhost) seem like a win.
In other words losing some performance to gain on standardization/maintainability.

@akosiaris @jbond what do you think? It's not a strong opinion so happy to get my mind changed :)

I also see an option 3:

  • Keep frontends active/passive and talk to a single active Redis instance.

That way, we avoid the potential latency increases mentioned for option 2 and still use the central Redis.

Sounds like a great first step (if we want to test active/active later on), if not the final step (if we keep it as is).

So I guess on the Netbox side we "just" need those config options (from the doc) after the DBs are created on the Redis side (and maybe ACLs updated as well).

REDIS = {
    'tasks': {
        'HOST': 'localhost',      # Redis server
        'PORT': 6379,             # Redis port
        'PASSWORD': '',           # Redis password (optional)
        'DATABASE': 0,            # Database ID
        'SSL': False,             # Use SSL (optional)
    },
    'caching': {
        'HOST': 'localhost',
        'PORT': 6379,
        'PASSWORD': '',
        'DATABASE': 1,            # Unique ID for second database
        'SSL': False,
    }
}

Related question: should we use the central Redis server for the dev netbox instance (netbox-next) as well? On one hand I'm not a fan of mixing prod and dev infra, on the other hand the risk of abuse is minimal and that would help having dev more similar to prod.

Silly question: do we have an idea of the size of the cached dataset? if it's small, do we need to keep redis remote to the VM where netbox runs, or should we install it as a local sidecar?

do we have an idea of the size of the cached dataset?

Good question! I guess this is small compared to https://grafana.wikimedia.org/d/000000174/redis?orgId=1&viewPanel=9

netbox1002:~$ redis-cli -p 6380 info
used_memory_human:1.08M
used_memory_rss_human:7.63M
used_memory_peak_human:64.93M

Redis is currently installed as a local sidecar; the idea here is to move it to a central Redis instance to simplify the Netbox setup and Redis management (1 instance managed by service ops instead of many, including ones managed by I/F)

Just for historical context, when Netbox first introduced a requirement of Redis, it was configured to use the shared Redis installation and then moved to a local instance later on in one of the attempts to reduce some cache issues (both missing items and latency) that we had encountered back then.

Can netbox be configured without the redis cache? After all it just might not be worth it to have it.

Unfortunately not. According to https://docs.netbox.dev/en/stable/configuration/required-parameters/#redis Redis is required for both tasks and caching.

django-cacheops definitely has that functionality. There is CACHEOPS_ENABLED setting. And I see it being set in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox/+/refs/heads/master/netbox/netbox/settings.py#443

I'd suggest we test that in production right now and use it to just disable the caching functionality, evaluating how much latency it shaves off. My gut feeling after looking at that 1.08M says not much in most cases. The 64.93M peak number tells a different story, but that could very well be when automated tooling uses netbox, which might be ok if it is a tad slower (or it might not, let's test).

Related question: should we use the central Redis server for the dev netbox instance (netbox-next) as well? On one hand I'm not a fan of mixing prod and dev infra, on the other hand the risk of abuse is minimal and that would help having dev more similar to prod.

For testing, you can use the same host/port pair, but specify a different redis database (e.g. instead of 0 or 1, specify 2). That way you can keep them separated.
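
A minimal sketch of what that could look like, following the REDIS layout from the NetBox doc quoted earlier. The `redis_settings` helper and the `redis.example.org` hostname are hypothetical, and the IDs 2 and 3 for netbox-next are an assumption: since NetBox wants separate tasks and caching databases, the dev instance would need two IDs that prod does not use.

```python
def redis_settings(host, tasks_db, caching_db, port=6379, password=""):
    """Build a NetBox-style REDIS dict with distinct database IDs (hypothetical helper)."""
    if tasks_db == caching_db:
        raise ValueError("tasks and caching need different Redis database IDs")

    def entry(database):
        return {
            "HOST": host,
            "PORT": port,
            "PASSWORD": password,
            "DATABASE": database,
            "SSL": False,
        }

    return {"tasks": entry(tasks_db), "caching": entry(caching_db)}


SHARED_REDIS_HOST = "redis.example.org"  # hypothetical shared Redis endpoint

# Same host/port pair for both instances, kept apart by database ID only.
PROD_REDIS = redis_settings(SHARED_REDIS_HOST, tasks_db=0, caching_db=1)
NEXT_REDIS = redis_settings(SHARED_REDIS_HOST, tasks_db=2, caching_db=3)
```

Each instance would then assign its own dict to `REDIS` in its configuration.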

The git tree is a bit confusing and needs cleanup, but that file in master seems to be on the old 2.10 version.
You can see the 3.2.2 version there: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox/+/60c8a2e36c86e7484c69237bb0e15b98d1a1d302/netbox/netbox/settings.py#226
Relevant upstream pull request

So from the linked ticket, queryset caching has been removed in v3, but there is still some caching in action.
For example the "bug" I hit in the past: https://wikitech.wikimedia.org/wiki/Netbox#CablePath_matching_query_does_not_exist I think caching is used there, for example, to cache cables' end-to-end paths.

The git tree is a bit confusing and needs cleanup, but that file in master seems to be on the old 2.10 version.
You can see the 3.2.2 version there: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox/+/60c8a2e36c86e7484c69237bb0e15b98d1a1d302/netbox/netbox/settings.py#226
Relevant upstream pull request

Oh, wow. So they've dropped django-cacheops in 3.x. I've been operating under now-invalid knowledge/assumptions until now then. This drastically changes the situation and cancels most of my comments above.

So from the linked ticket, queryset caching has been removed in v3, but there is still some caching in action.
For example the "bug" I hit in the past: https://wikitech.wikimedia.org/wiki/Netbox#CablePath_matching_query_does_not_exist I think caching is used there, for example, to cache cables' end-to-end paths.

They appear to be using the standard Django caching framework, but with very specific usages (i.e. they aren't using some generic Django caching middleware but rather using cache from django.core.cache directly). A look at the code tells me that they are in fact caching very few things: configuration (netbox/config/__init__.py and netbox/extras/models/models.py) and the latest version check (netbox/netbox/views/__init__.py). That's about it.

Disabling Django's standard caching is easy (all it requires is setting the engine to dummy) but that is probably pointless given what they are caching. For the version check use case, it is probably quite irrelevant which DC is set in the configuration. For the configuration use case and IIRC, uwsgi/django integration means that configuration isn't loaded on every request as processes are rather long running, so the entire discussion is kinda moot as well. In both cases, there are no real risks in using the DC-local redis for caching and there appear to be no gains either.
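
For reference, "setting the engine to dummy" in plain Django terms looks like the fragment below. `django.core.cache.backends.dummy.DummyCache` is Django's built-in no-op backend; whether NetBox lets this be set without patching its own settings.py is an assumption not verified here.

```python
# Django settings fragment: every cache.get() misses and cache.set() is a no-op.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.dummy.DummyCache",
    }
}
```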

In this new light, I am gonna say that the choice of which Redis (DC-local or remote) to use for the netbox caching mechanism is irrelevant. I'd go with whatever is easier to configure and reason about.

Awesome, thanks! Then let's stick to the plan of a remote Redis, as a risk of higher latency is better than a risk of split view :) And it's easier to configure.