Upgrade Netbox to 4.x
Closed, Resolved (Public)

Description

Logical follow-up to T314933: Upgrade Netbox to latest 3.2.

The 5 major releases in between come with their share of new features, from which I quickly selected the interesting ones:

3.3

  • L2VPN Modeling (#8157)
  • Toggle Custom Field Visibility (#9166)
  • #8511 - Enable custom fields and tags for circuit terminations
  • #8995 - Enable arbitrary ordering of REST API results
  • #6454 - Include contextual help when creating first objects in UI
  • #10039 - Add "assign FHRP group" option to actions dropdown in interfaces list
  • #10447 - Enable reassigning an inventory item from one device to another

3.4

  • New Global Search (#10560)
  • Virtual Device Contexts (#7854)
  • Saved Filters (#9623)
  • JSON/YAML Bulk Imports (#4347)
  • Scheduled Reports & Scripts (#8366)
  • API for Staged Changes (#10851)
  • #10600 - Allow custom object fields to reference a user or group

3.5

  • Customizable Dashboard (#9416)
  • Remote Data Sources (#11558)
  • ASN Ranges (#8550)
  • Provider Accounts (#9047)
  • #10759 - Support Markdown rendering for custom field descriptions

3.6

  • Custom Field Choice Sets (#12988)
  • #8137 - Add a field for designating the out-of-band (OOB) IP address for devices

3.7

  • Event Rules (#14132)
  • Object Protection Rules (#10244)
  • Improved Custom Field Visibility Controls (#13299)
  • VPN Tunnels (#9816)
  • #12135 - Avoid orphaned interfaces by preventing the deletion of interfaces which have children assigned

4

They also come with their share of breaking changes:

3.3

  • Device position, device type height, and rack unit values are now reported as decimals (e.g. 1.0 or 1.5) to support modeling half-height rack units.
    • Not impactful
  • The nat_outside relation on the IP address model now returns a list of zero or more related IP addresses, rather than a single instance (or None).
    • Not used
  • Several fields on the cable API serializers have been altered or removed to support multiple-object cable terminations: [see table on the link]
  • As with the cable model, several API fields on all objects to which cables can be connected (interfaces, circuit terminations, etc.) have been changed: [see table on the link]
  • The cable path serialization returned by the /paths/ endpoint for pass-through ports has been simplified, and the following fields removed: origin_type, origin, destination_type, destination. (Additionally, is_complete has been added.)
    • Removed fields not used
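For reference, the nat_outside change above is the kind of thing an external client can absorb with a small shim. A minimal sketch, assuming the client works on plain REST payloads; `normalize_nat_outside` is a hypothetical helper and the payload fragments are made up:

```python
def normalize_nat_outside(value):
    """Return nat_outside as a list, whatever shape the API sent.

    Before 3.3 the field held a single related object or None; from 3.3 on it
    is a list of zero or more related objects.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Made-up payload fragments, for illustration only.
old_style = {"id": 7, "address": "192.0.2.10/32"}
new_style = [{"id": 7, "address": "192.0.2.10/32"}]

print(normalize_nat_outside(old_style) == normalize_nat_outside(new_style))
```

Callers can then always iterate over the result instead of branching on the Netbox version.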

3.4

  • Device and virtual machine names are no longer case-sensitive. Attempting to create e.g. "device1" and "DEVICE1" within the same site will raise a validation error.
    • Not impactful
  • The asn, noc_contact, admin_contact, and portal_url fields have been removed from the provider model. Please replicate any data remaining in these fields to the ASN and contact models introduced in NetBox v3.1 prior to upgrading.
    • Updated manually
  • The content_type fields on the CustomLink and ExportTemplate models have been renamed to content_types and now support the assignment of multiple content types per object.
    • Not used
  • Within the Python API, the cf property on an object with custom fields now returns deserialized values. For example, a custom field referencing an object will return the object instance rather than its numeric ID. To access the raw serialized values, reference the object's custom_field_data attribute instead.
    • Not used
  • The NetBoxModelCSVForm class has been renamed to NetBoxModelImportForm. Backward compatibility with the previous name has been retained for this release, but will be dropped in NetBox v3.5.
    • Not used
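The case-insensitive name rule above can be checked ahead of an upgrade. A rough sketch, assuming device names have already been exported as (site, name) pairs and that comparing lowercased names is a close enough approximation of the new validation; `find_case_collisions` and the sample inventory are hypothetical:

```python
from collections import defaultdict


def find_case_collisions(devices):
    """Group (site, name) pairs whose names differ only by case.

    `devices` is an iterable of (site, name) tuples with string names; how
    they are fetched (REST API, CSV export, ...) is left out on purpose.
    """
    seen = defaultdict(set)
    for site, name in devices:
        seen[(site, name.lower())].add(name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Made-up inventory: only the two eqiad entries collide.
inventory = [("eqiad", "device1"), ("eqiad", "DEVICE1"), ("codfw", "device1")]
print(find_case_collisions(inventory))
```

An empty result would match the "Not impactful" assessment above.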

3.5

  • The account field has been removed from the provider model. This information is now tracked using the new provider account model. Multiple accounts can be assigned per provider.
    • Will be updated automatically, not used externally
  • A minimum length of 50 characters is now enforced for the SECRET_KEY configuration parameter.
  • The JobResult model has been moved from the extras app to core and renamed to Job. Accordingly, its REST API endpoint has been moved from /api/extras/job-results/ to /api/core/jobs/.
  • The obj_type field on the Job model (previously JobResult) has been renamed to object_type for consistency with other models.
  • The JOBRESULT_RETENTION configuration parameter has been renamed to JOB_RETENTION.
  • The obj context variable is no longer passed when rendering custom links: Use object instead.
    • Updated
  • The REST API schema is now generated using the OpenAPI 3.0 spec
    • Not impactful
  • The URLs for the REST API schema documentation have changed:
    • /api/docs/ is now /api/schema/swagger-ui/
    • /api/redoc/ is now /api/schema/redoc/
    • Not impactful
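The endpoint moves listed for 3.5 boil down to a prefix rewrite. A small sketch, using only the old and new paths quoted above; `MOVED_PATHS` and `rewrite_path` are hypothetical names, and this is not how any of our tools actually handle it:

```python
# Old prefix -> new prefix, taken from the 3.5 breaking changes listed above.
MOVED_PATHS = {
    "/api/extras/job-results/": "/api/core/jobs/",
    "/api/docs/": "/api/schema/swagger-ui/",
    "/api/redoc/": "/api/schema/redoc/",
}


def rewrite_path(path):
    """Map a pre-3.5 API path to its 3.5+ location; other paths pass through."""
    for old, new in MOVED_PATHS.items():
        if path.startswith(old):
            return new + path[len(old):]
    return path


print(rewrite_path("/api/extras/job-results/42/"))
```

Note that the renamed obj_type field on the Job model is a payload change and is not covered by a path rewrite.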

3.6

  • PostgreSQL 11 is no longer supported (dropped in Django 4.2). NetBox v3.6 requires PostgreSQL 12 or later.
    • We're on 13
  • The device_role field on the Device model has been renamed to role. The device_role field has been temporarily retained on the REST API serializer for devices for backward compatibility, but is read-only.
    • Will need patches
  • The choices array field has been removed from the CustomField model. Any defined choices are automatically migrated to CustomFieldChoiceSets, accessible via the new choice_set field on the CustomField model.
    • Not impacted
  • The napalm_driver and napalm_args fields (which were deprecated in v3.5) have been removed from the Platform model.
    • Feature not in use
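For the device_role to role rename, one possible shape for the patches is an accessor that works on both sides of the change. A minimal sketch, assuming tools read raw REST payloads as dicts; `get_device_role` is a hypothetical helper, the payloads are made up, and the actual patches may look different:

```python
def get_device_role(device):
    """Return the role of a device payload across the 3.6 rename.

    Prefers the new `role` key and falls back to the legacy `device_role`
    key, so the same code runs before and after the upgrade.
    """
    if "role" in device:
        return device["role"]
    return device["device_role"]


# Made-up payloads, for illustration only.
before_rename = {"name": "example1001", "device_role": {"slug": "server"}}
after_rename = {"name": "example1001", "role": {"slug": "server"}}

print(get_device_role(before_rename) == get_device_role(after_rename))
```

Once every consumer is on 4.0, where device_role is gone from the serializer, the fallback branch can be dropped.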

3.7

  • The following fields have been removed from the Webhook model: content_types, type_create, type_update, type_delete, type_job_start, type_job_end, enabled, and conditions. Webhooks are now tied to events via event rules. Existing webhooks will have event rules created automatically upon upgrade.
    • Not impacted
  • The ui_visibility field on the custom field model has been replaced with two new fields: ui_visible and ui_editable. Existing values will be migrated automatically upon upgrade.
    • Not impacted
  • The FeatureQuery class for querying content types by model feature has been removed. Plugins should now use the new with_feature() manager method on NetBox's proxy model for ContentType.
    • Not impacted
  • The ConfigRevision model has been moved from extras to core. Configuration history will be retained throughout the upgrade process.
    • Not impacted
  • The L2VPN and L2VPNTermination models have been moved from the ipam app to the new vpn app. All object data will be retained, however please note that the relevant API endpoints have moved to /api/vpn/.
    • Not impacted
  • The CustomFieldsMixin, SavedFiltersMixin, and TagsMixin classes have moved from the extras.forms.mixins module to netbox.forms.mixins.
    • Not impacted

4.0

  • Support for Python 3.8 and 3.9 has been removed.
    • Requires upgrading from bullseye to bookworm
  • The format for GraphQL query filters has changed. Please see the GraphQL documentation for details and examples.
  • The deprecated device_role & device_role_id filters for devices have been removed. (Use role and role_id instead.)
  • The obsolete device_role field has been removed from the REST API serializer for devices. (Use role instead.)
  • The legacy reports functionality has been dropped. Reports will be automatically converted to custom scripts on upgrade.
  • The parent and parent_id filters for locations now return only immediate children of the specified location. (Use ancestor and ancestor_id to return all descendants.)
    • Not impacted
  • The object_type field on the CustomField model has been renamed to related_object_type.
    • Not impacted
  • The utilities.utils module has been removed and its resources reorganized into separate modules organized by function.
    • Not impacted
  • The obsolete NullableCharField class has been removed. (Use Django's stock CharField class with null=True instead.)
    • Not impacted
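The difference between the parent and ancestor location filters in 4.0 is easiest to see on a toy tree. The sketch below models the behaviour as described in the release note above; it does not call Netbox, and the tree, `immediate_children` and `all_descendants` are made up for the illustration:

```python
# Toy location tree: location id -> parent id (None for a top-level location).
PARENTS = {1: None, 2: 1, 3: 1, 4: 2, 5: 4}


def immediate_children(parents, parent_id):
    """Model of the 4.0 `parent_id` filter: direct children only."""
    return sorted(loc for loc, parent in parents.items() if parent == parent_id)


def all_descendants(parents, ancestor_id):
    """Model of the `ancestor_id` filter: every location below the given one."""
    found = []
    frontier = [ancestor_id]
    while frontier:
        current = frontier.pop()
        kids = immediate_children(parents, current)
        found.extend(kids)
        frontier.extend(kids)
    return sorted(found)


print(immediate_children(PARENTS, 1))
print(all_descendants(PARENTS, 1))
```

Any query that relied on parent returning the whole subtree would need to switch to ancestor.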

Next steps:

TBD

Once Netbox-next is upgraded to 3.6:

  • Update cookbooks if needed
  • Update Homer (& wmf-netbox.py)

Once Netbox is upgraded:

  • Upgrade pynetbox to > 6.6 (e.g. in Homer)
  • Set up an rsync between the /srv/netbox/customscripts directories on both frontends
  • Review all the TODOs in Puppet for cleanup

Details

Related Changes in Gerrit:
Repo                               Branch      Lines +/-
operations/cookbooks               master      +8 -2
operations/puppet                  production  +3 -2
operations/cookbooks               master      +30 -30
operations/software/homer/deploy   master      +25 -25
operations/puppet                  production  +5 -2
operations/puppet                  production  +10 -10
operations/puppet                  production  +25 -5
operations/software/netbox-extras  master      +3 -1
operations/puppet                  production  +3 -4
operations/puppet                  production  +7 -7
operations/puppet                  production  +17 -22
operations/software/netbox-extras  master      +1 -1
operations/puppet                  production  +1 -0
operations/puppet                  production  +2 -2
operations/puppet                  production  +4 -2
operations/puppet                  production  +8 -2
operations/puppet                  production  +2 -2
operations/software/homer          master      +25 -23
operations/software/spicerack      master      +15 -4
operations/software/netbox-deploy  main        +12 -12
operations/puppet                  production  +8 -0
operations/puppet                  production  +3 -0
operations/puppet                  production  +7 -1
operations/puppet                  production  +28 -2
operations/software/netbox-deploy  main        +19 -19
operations/software/netbox-deploy  dev         +24 -15
operations/software/netbox-deploy  dev         +3 -0
operations/puppet                  production  +2 -3
operations/dns                     master      +1 -1
operations/puppet                  production  +2 -6
operations/puppet                  production  +1 -0
operations/software/netbox-extras  dev         +25 -51
operations/software/netbox-extras  dev         +9 -9
operations/software/netbox-extras  dev         +1 K -20
operations/puppet                  production  +0 -1
operations/software/netbox-extras  dev         +13 -12
operations/software/netbox-extras  dev         +13 -17
operations/software/netbox-extras  dev         +1 -1
operations/software/netbox-extras  dev         +24 -24
operations/software/netbox-extras  master      +43 -60
operations/puppet                  production  +20 -1
operations/software/netbox-extras  dev         +4 -3
operations/software/netbox-extras  master      +4 -3
operations/puppet                  production  +1 -1
operations/puppet                  production  +3 -0
operations/software/netbox-deploy  dev         +162 -2 K
operations/puppet                  production  +11 -9
operations/puppet                  production  +40 -24
operations/puppet                  production  +35 -23
operations/puppet                  production  +4 -0
operations/puppet                  production  +0 -1

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-07-09T15:44:19Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275

Change #1053243 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/netbox-deploy@main] Upgrade Netbox to 4.0.7

https://gerrit.wikimedia.org/r/1053243

Change #1053266 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: prepare Puppet for new prod servers

https://gerrit.wikimedia.org/r/1053266

Change #1053266 merged by Ayounsi:

[operations/puppet@production] Netbox 4: prepare Puppet for new prod servers

https://gerrit.wikimedia.org/r/1053266

First Puppetization of the new Netbox frontends:

  • sudo mkdir /srv/deployment/ was needed. TODO: Add to Puppet
  • And then this error, fixed with sudo mkdir /srv/netbox-exports/dns.git/hooks; sudo chown netbox:www-data /srv/netbox-exports/dns.git/hooks. TODO: Add to Puppet
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /srv/netbox-exports/dns.git/hooks/post-update20240711-10161-19bjwgu.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/netbox/manifests/autogit.pp, line: 51)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /srv/netbox-exports/dns.git/hooks/post-update20240711-10161-19bjwgu.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/netbox/manifests/autogit.pp, line: 51)
Wrapped exception:
No such file or directory - A directory component in /srv/netbox-exports/dns.git/hooks/post-update20240711-10161-19bjwgu.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Profile::Netbox::Automation/Netbox::Autogit[dns]/File[/srv/netbox-exports/dns.git/hooks/post-update]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /srv/netbox-exports/dns.git/hooks/post-update20240711-10161-19bjwgu.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/netbox/manifests/autogit.pp, line: 51)
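The failure mode in the log above is generic: a file cannot be created when a directory component of its path is missing. The sketch below reproduces it in Python and shows the "create the parents first" remedy. It is only an illustration, not the Puppet change itself; `write_with_parents` is a hypothetical helper, the path lives in a temporary directory, and ownership (netbox:www-data in the manual fix) is not handled:

```python
import tempfile
from pathlib import Path


def write_with_parents(path, content):
    """Create any missing parent directories, then write the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)


with tempfile.TemporaryDirectory() as tmp:
    hook = Path(tmp) / "dns.git" / "hooks" / "post-update"

    # Without the parent directory the write fails, as in the Puppet run.
    try:
        hook.write_text("#!/bin/sh\n")
        failed_without_parents = False
    except FileNotFoundError:
        failed_without_parents = True

    # With the parents created first, the same write succeeds.
    write_with_parents(hook, "#!/bin/sh\n")
    created = hook.is_file()

print(failed_without_parents, created)
```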

Change #1053636 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: create parent directories

https://gerrit.wikimedia.org/r/1053636

Change #1053640 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Cumin aliases: hardcode current Netbox prod servers

https://gerrit.wikimedia.org/r/1053640

Change #1053640 merged by Ayounsi:

[operations/puppet@production] Cumin aliases: hardcode current Netbox prod servers

https://gerrit.wikimedia.org/r/1053640

Icinga downtime and Alertmanager silence (ID=05ca8c35-9b32-4c3a-9b80-5e01ef75b7f9) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=df7b46ee-b552-4bdd-9b54-9bed50fb98cd) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=4abde2ff-0621-44ff-ad19-09d19fe0d4a2) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=649a3ed0-08fc-40b8-a899-14c7a81aaa41) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb2003.codfw.wmnet

The initial puppet run on netboxdb1003 failed with:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for netboxdb2003.codfw.wmnet Resolv::DNS::Resource::IN::A (file: /srv/puppet_code/environments/production/modules/profile/manifests/netbox/db.pp, line: 36, column: 24) on node netboxdb1003.eqiad.wmnet

Issue avoided by creating the codfw VM.

Change #1053636 merged by Ayounsi:

[operations/puppet@production] Netbox 4: create parent directories

https://gerrit.wikimedia.org/r/1053636

Change #1048402 merged by Ayounsi:

[operations/puppet@production] Netbox 4: create customscript parent directory as well

https://gerrit.wikimedia.org/r/1048402

Icinga downtime and Alertmanager silence (ID=cc358df6-b5c1-490c-aad1-6454f09f0fc8) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=60bc5c40-0301-4c29-907d-b4e0eb5e3cb3) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=24d499e4-d334-4d4e-8fcd-fc9f2feed844) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet

Change #1053000 abandoned by Ayounsi:

[operations/puppet@production] python_deploy_venv.sh enable proxy support

https://gerrit.wikimedia.org/r/1053000

Change #1053243 merged by Ayounsi:

[operations/software/netbox-deploy@main] Upgrade Netbox to 4.0.7

https://gerrit.wikimedia.org/r/1053243

Mentioned in SAL (#wikimedia-operations) [2024-07-16T12:09:11Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.7 to netbox-next - ayounsi@cumin1002 - T336275

Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.7 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-16T12:10:28Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.7 to netbox-next - ayounsi@cumin1002 - T336275

Change #1050453 merged by jenkins-bot:

[operations/software/spicerack@master] Spicerack: fix Netbox 4 breaking changes

https://gerrit.wikimedia.org/r/1050453

Change #1055187 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: point prod service to new servers

https://gerrit.wikimedia.org/r/1055187

Icinga downtime and Alertmanager silence (ID=92ae15a3-d066-4959-9504-9286a87c9cd2) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=3f92060e-7cc5-42b2-b105-d3b395a0abd4) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet

Change #1050377 merged by jenkins-bot:

[operations/software/homer@master] Homer: fix Netbox 4 breaking changes

https://gerrit.wikimedia.org/r/1050377

Mentioned in SAL (#wikimedia-operations) [2024-07-22T09:00:23Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-22T09:03:45Z] <ayounsi@cumin1002> END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-22T09:21:32Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275

Deployed netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-22T09:30:21Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275

Change #1055187 merged by Ayounsi:

[operations/puppet@production] Netbox 4: point prod service to new servers

https://gerrit.wikimedia.org/r/1055187

Change #1055887 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Enable OIDC auth on new netbox

https://gerrit.wikimedia.org/r/1055887

Change #1055887 merged by Ayounsi:

[operations/puppet@production] Enable OIDC auth on new netbox

https://gerrit.wikimedia.org/r/1055887

Change #1055893 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: update cas_server_url to fit OIDC endpoint

https://gerrit.wikimedia.org/r/1055893

Change #1055893 merged by Ayounsi:

[operations/puppet@production] Netbox 4: update cas_server_url to fit OIDC endpoint

https://gerrit.wikimedia.org/r/1055893

Change #1055900 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: set correct oidc_service

https://gerrit.wikimedia.org/r/1055900

Change #1055900 merged by Ayounsi:

[operations/puppet@production] Netbox 4: set correct oidc_service

https://gerrit.wikimedia.org/r/1055900

Change #1055901 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] IDM: Set profile_format FLAT to netbox_oidc

https://gerrit.wikimedia.org/r/1055901

Change #1055901 merged by Ayounsi:

[operations/puppet@production] IDM: Set profile_format FLAT to netbox_oidc

https://gerrit.wikimedia.org/r/1055901

Change #1055924 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] ganeti-netbox-sync: Netbox 4 fix

https://gerrit.wikimedia.org/r/1055924

Change #1055924 merged by jenkins-bot:

[operations/software/netbox-extras@master] ganeti-netbox-sync: Netbox 4 fix

https://gerrit.wikimedia.org/r/1055924

Icinga downtime and Alertmanager silence (ID=67fa4e46-b51e-42b8-9853-92735f7f0f85) set by ayounsi@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Netbox 3 silencing

netbox2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=b31b9f0d-62c5-41f6-9791-fca68557c987) set by ayounsi@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Netbox 3 silencing

netbox1002.eqiad.wmnet

Change #1055932 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: ACLs + breaking changes

https://gerrit.wikimedia.org/r/1055932

Change #1055932 merged by Ayounsi:

[operations/puppet@production] Netbox 4: ACLs + breaking changes

https://gerrit.wikimedia.org/r/1055932

Change #1055940 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: fix script path vs. extra path

https://gerrit.wikimedia.org/r/1055940

Change #1055940 merged by Ayounsi:

[operations/puppet@production] Netbox 4: fix script path vs. extra path

https://gerrit.wikimedia.org/r/1055940

Icinga downtime and Alertmanager silence (ID=6e8b1723-decb-4086-9785-376414b41d2c) set by ayounsi@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=b306ac42-cfcf-4095-a53e-80b1fd183949) set by ayounsi@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet

Change #1056076 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/netbox-extras@master] Interface validator: fix connected_endpoints type

https://gerrit.wikimedia.org/r/1056076

Change #1056076 merged by jenkins-bot:

[operations/software/netbox-extras@master] Interface validator: fix connected_endpoints type

https://gerrit.wikimedia.org/r/1056076

Change #1056505 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Cumin Alias, add temp netbox4 and restore global netbox ones

https://gerrit.wikimedia.org/r/1056505

Change #1056785 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: add version flag

https://gerrit.wikimedia.org/r/1056785

Change #1056785 merged by Ayounsi:

[operations/puppet@production] Netbox 4: add version flag

https://gerrit.wikimedia.org/r/1056785

Change #1056901 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Move the /srv/netbox/ directory creation behind netbox4 flag

https://gerrit.wikimedia.org/r/1056901

Change #1056911 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Netbox 4: add transition flag for Redis database

https://gerrit.wikimedia.org/r/1056911

Change #1056901 merged by Ayounsi:

[operations/puppet@production] Move the /srv/netbox/ directory creation behind netbox4 flag

https://gerrit.wikimedia.org/r/1056901

Change #1056911 merged by Ayounsi:

[operations/puppet@production] Netbox 4: add transition flag for Redis database

https://gerrit.wikimedia.org/r/1056911

Change #1056989 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] netbox.netbox-extra: trigger syncdatasource

https://gerrit.wikimedia.org/r/1056989

Icinga downtime and Alertmanager silence (ID=f06322e6-0a92-414d-aff7-4acdca678dc9) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 2 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=2cf5df78-f1ca-4bf8-800b-9a731e1182f6) set by ayounsi@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:35:22Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:35:28Z] <ayounsi@cumin1002> END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:36:37Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:36:41Z] <ayounsi@cumin1002> END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:58:27Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T15:03:40Z] <ayounsi@cumin1002> END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T15:04:40Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-30T15:09:06Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.8 to netbox-next - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-31T12:25:31Z] <ayounsi@cumin1002> START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275

Deployed netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275

Mentioned in SAL (#wikimedia-operations) [2024-07-31T12:34:43Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.8 to future netbox prod - ayounsi@cumin1002 - T336275

Change #1050379 merged by Ayounsi:

[operations/software/homer/deploy@master] Homer wmf-netbox: fix Netbox 4 breaking changes

https://gerrit.wikimedia.org/r/1050379

Change #1050445 merged by Ayounsi:

[operations/cookbooks@master] Cookbooks: fix Netbox 4 breaking changes

https://gerrit.wikimedia.org/r/1050445

Icinga downtime and Alertmanager silence (ID=d8033fb3-d4d1-4e37-8764-0a7625abbe34) set by ayounsi@cumin1002 for 5 days, 0:00:00 on 2 host(s) and their services with reason: old netbox

netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet

Change #1056505 merged by Ayounsi:

[operations/puppet@production] Cumin Alias, add temp netbox4 and restore global netbox ones

https://gerrit.wikimedia.org/r/1056505

ayounsi added a parent task: Restricted Task. (Aug 1 2024, 9:50 AM)

Change #1056989 merged by jenkins-bot:

[operations/cookbooks@master] netbox.netbox-extra: trigger syncdatasource

https://gerrit.wikimedia.org/r/1056989

Notes from the Debrief meeting

What went wrong?

  • Too optimistic :)
  • Huge number of breaking changes, some undocumented (including the required upgrade to bookworm)
  • Some issues only visible with production workloads/deployments
  • Some issues only visible on clusters with more than one member
  • Intermediate DB migration needed (3.2 -> 3.7 -> 4.0)
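The intermediate migration can be written down as a small planning helper. This is a sketch under one assumption, taken only from the 3.2 -> 3.7 -> 4.0 path noted above: crossing a major version requires stopping at the last minor of the old major first. `LAST_MINOR` and `upgrade_path` are hypothetical, and only the 3.x entry is known from this task:

```python
# Last minor release of each major, as (major, minor) tuples.
# Only the 3.x entry is known from this upgrade.
LAST_MINOR = {3: (3, 7)}


def upgrade_path(current, target):
    """Return the list of versions to step through, including both ends."""
    path = [current]
    major = current[0]
    while major < target[0]:
        stop = LAST_MINOR.get(major)
        # Stop at the last minor of the old major before crossing over.
        if stop is not None and path[-1] != stop:
            path.append(stop)
        major += 1
    if path[-1] != target:
        path.append(target)
    return path


print(upgrade_path((3, 2), (4, 0)))
```

Upgrading more frequently, as suggested below, keeps this list short.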

How to make it better for the next upgrades?

  • Upgrade more frequently - T371889: Upgrade Netbox to 4.3.x
  • Maybe make Netbox-next a two-frontend cluster? With two DBs?
  • Maybe use the central Redis cluster for -next (as for prod)?
  • Use feature flags in Puppet when needed
  • For external scripts/tools, implement backward-compatible fixes for breaking changes (when possible)
  • Staging environment? Pontoon?
  • Write (find?) documentation on how to run spicerack/homer/etc. pointed at -next

What is left to do?

ayounsi claimed this task.