Upgrade Netbox and accompanying to the 2.9 series (Tracking Task)
Closed, ResolvedPublic

Description

We are ready to release after testing

Event Timeline

crusnov created this task.

Tasks to do:

  • Upgrade -next to 2.9 series
  • Port and test scripts
  • Upgrade production to 2.9 series

Not sure why there are 2 similar tasks :)

Could you re-import the database from prod? So -next has more recent data?
Thx

Change 643444 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/netbox-extras@master] Make scripts compatible with Netbox 2.9

https://gerrit.wikimedia.org/r/643444

Change 643681 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/homer/deploy@master] Make Homer compatible with Netbox 2.9

https://gerrit.wikimedia.org/r/643681

Note for myself, once prod is upgraded to 2.9 we can move the virtual-chassis FQDN from the domain to the new name field, then probably simplify some existing scripts.

We need to finalize the pending changes for 2.9, because the scripts and related tooling are one of two things blocking deployment. We should check an item off once the script has an open patch (with a link) or we are reasonably certain it will work on 2.9.

  • customscripts/getstats.py (should work, will verify)
  • customscripts/interface_automation.py
  • customscripts/offline_device.py (should work, will verify)
  • reports/accounting.py (should work, will verify)
  • reports/cables.py (should work, will verify)
  • reports/coherence.py
  • reports/librenms.py (should work, will verify)
  • reports/management.py (should work, will verify)
  • tools/custom_script_proxy.py (verified works)
  • tools/dumpbackup.py (needs table list update, works otherwise)
  • tools/ganeti-netbox-sync.py (needs major rework)
  • tools/import-mgmt-dns.py (this is example code and does not need to be ported)
  • dns/generate_dns_snippets.py (needs to be validated but should work)

Tested all of the low hanging fruit.

  • customscripts/getstats.py (should work, will verify)

Verified on netbox-next.

  • customscripts/offline_device.py (should work, will verify)

Appears to work as expected. Would like @ayounsi or @Volans to also verify.

  • reports/accounting.py (should work, will verify)

Does not work since the reports internal API has changed and overloading run() is different now.

  • reports/cables.py (should work, will verify)

Works.

  • reports/coherence.py

Works.

  • reports/librenms.py (should work, will verify)

Can't test because the LibreNMS database isn't open to -next.

  • reports/management.py (should work, will verify)

Works.

Missed one!

  • reports/puppetdb.py (Works.)

Mentioned in SAL (#wikimedia-operations) [2020-12-21T21:09:43Z] <chaomodus> merging change 643354 for Netbox 2.9 support, puppet disabled on production machines until testing completed T266487

Change 651268 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox: Fix dependency loop introduced in previous patch

https://gerrit.wikimedia.org/r/651268

Change 651268 merged by CRusnov:
[operations/puppet@production] netbox: Fix dependency loop introduced in previous patch

https://gerrit.wikimedia.org/r/651268

Mentioned in SAL (#wikimedia-operations) [2020-12-21T22:18:40Z] <chaomodus> Re-enabling puppet on Netbox production instances after having tested netbox2001 with new puppet code T266487

Hello, with DNS generation verified (patch out) we are ready to deploy. Here is the deployment plan, which will necessitate some Netbox downtime:

Done!

We would prefer to do this early in the day PST, so that as much of the day as possible is available in case anything goes wrong (we don't expect anything to go wrong, but just in case).

If there's a good day for DCOps when this won't be too much of an interruption, please let us know! We can proceed at any time.

  • Disable puppet on netbox[1001, 2001] => this must be the first step before merging anything
  • Ensure puppet enabled on -dev1001 => typo: 2001; should puppet also be run?
  • Enable puppet on netbox[1001,2001] => and run puppet
  • Disable puppet on netbox[1001, 2001] => this must be the first step before merging anything

We can merge -extras without disabling puppet, since puppet is not involved in it at all, but disabling puppet first gives the same outcome, so that's fine.

  • Ensure puppet enabled on -dev1001 => typo: 2001; should puppet also be run?
  • Enable puppet on netbox[1001,2001] => and run puppet

Correct.

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:04:12Z] <chaomodus> proceeding with Netbox 2.9 upgrade T266487

Change 643444 merged by CRusnov:
[operations/software/netbox-extras@master] Make scripts and reports compatible with Netbox 2.9

https://gerrit.wikimedia.org/r/643444

Change 643681 merged by CRusnov:
[operations/software/homer/deploy@master] Make Homer compatible with Netbox 2.9

https://gerrit.wikimedia.org/r/643681

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:12:19Z] <chaomodus> Merged Netbox 2.9 related changes in puppet and -extras; testing on -next T266487

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:30:25Z] <crusnov@deploy1001> Started deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:32:58Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487 (duration: 02m 33s)

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:37:24Z] <chaomodus> Upgrade of Netbox to 2.9 complete, checking support software. T266487

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:46:52Z] <crusnov@deploy1001> Started deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487

Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:46:57Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487 (duration: 00m 05s)

Volans raised the priority of this task from Medium to High (Jan 13 2021, 9:09 AM).

Change 655871 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: fix check report for Netbox 2.9

https://gerrit.wikimedia.org/r/655871

Change 655871 merged by Volans:
[operations/puppet@production] netbox: fix check report for Netbox 2.9

https://gerrit.wikimedia.org/r/655871

Volans lowered the priority of this task from High to Medium (Jan 13 2021, 9:47 AM).

@ayounsi fixed the Homer part by deploying the changes, and I've deployed the above patch to fix the Netbox reports.

I reimaged mc1029 and mc2029 today, and towards the end of the reimaging I got:

12:13:50 | Updated Netbox:
12:13:50 | mc1029.eqiad.wmnet | Unable to run wmf-auto-reimage-host: 'log'
12:13:50 | mc1029.eqiad.wmnet | REIMAGE END | retcode=2

Looking at the logfile I found:

2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: Updated Netbox:
2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: 'log'
2021-01-13 12:36:00 [ERROR] (jiji) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-auto-reimage-host", line 264, in main
    run(args, user, log_path)
  File "/usr/local/sbin/wmf-auto-reimage-host", line 211, in run
    lib.update_netbox(args.host)
  File "/usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py", line 916, in update_netbox
    for log_line in result.json()['log']:
KeyError: 'log'
2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: REIMAGE END | retcode=2

Both hosts are running happily. Let me know if you need more information :)

Change 655909 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: fix Netbox update for 2.9 upgrade

https://gerrit.wikimedia.org/r/655909

@crusnov was the makevm cookbook adapted/tested? From a quick look at the code I think it might be broken: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/ganeti/makevm.py#161

Ah, no. I shall take a pass at it. Thanks for the other fixes above.

Change 655909 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: fix Netbox update for 2.9 upgrade

https://gerrit.wikimedia.org/r/655909

@jijiki thanks for the report, with the above patch it should be fixed, I've already merged and deployed it. Let us know if it works correctly in your next reimage. Sorry for the trouble.

@crusnov the proxy script that fetches data from the GetDeviceStats script is broken too. The script POST API no longer returns the output; you have to take the job ID from the response and then fetch the results separately.
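The "get the job ID, then get the results" flow described above could be sketched roughly as follows. This is only an illustration, not the code from the actual patches: `fetch` is a hypothetical callable that performs the GET on the job result and returns the decoded JSON, and the assumption that the result carries its output in a `data` field that stays empty until the job finishes is mine.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job_result(
    fetch: Callable[[], Dict[str, Any]],
    attempts: int = 10,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job result until its 'data' field is populated.

    `fetch` is a hypothetical helper that GETs the job result and returns
    the decoded JSON. The 'data' key is an assumption about the payload.
    """
    for _ in range(attempts):
        payload = fetch()
        data = payload.get("data")
        if data is not None:
            # The job has finished and produced output.
            return data
        # Not ready yet: wait before asking again.
        sleep(delay)
    raise TimeoutError(f"job result not ready after {attempts} attempts")
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic separate from the HTTP client, so it can be exercised without a live Netbox instance.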

Change 655914 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/cookbooks@master] ganeti.makevm: Make necessary changes to port for Netbox 2.9 API

https://gerrit.wikimedia.org/r/655914

Change 655946 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-extras@master] custom_script_proxy: adjust for Netbox 2.9 API

https://gerrit.wikimedia.org/r/655946

Here is another report. The issue below happened when using wmf-auto-reimage at the very last step after "updating netbox". Confirmed on multiple hosts. Here is a trace as example:

2021-01-13 18:14:38 [INFO] (dzahn) wmf-auto-reimage::print_line: Updated Netbox:
2021-01-13 18:14:38 [INFO] (dzahn) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: 'NoneType' object is not subscriptable
2021-01-13 18:14:38 [ERROR] (dzahn) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-auto-reimage-host", line 264, in main
    run(args, user, log_path)
  File "/usr/local/sbin/wmf-auto-reimage-host", line 211, in run
    lib.update_netbox(args.host)
  File "/usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py", line 919, in update_netbox
    for log_line in result.json()['data']['log']:
TypeError: 'NoneType' object is not subscriptable
2021-01-13 18:14:38 [INFO] (dzahn) wmf-auto-reimage::print_line: REIMAGE END | retcode=2
2021-01-13 18:14:39 [INFO] (dzahn) wmf-auto-reimage::phabricator_task_update: Updated Phabricator task 'T245757'
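The two tracebacks in this task show two different failures when reading the response: a missing top-level `log` key, and a `data` value of `None`. A defensive reader along these lines would tolerate both. It is only a sketch: `extract_log_lines` is a hypothetical helper, and the payload shapes are inferred solely from the two tracebacks, not from the Netbox API documentation.

```python
from typing import Any, Dict, List, Optional


def extract_log_lines(payload: Dict[str, Any]) -> Optional[List[Any]]:
    """Return the script log lines, or None if no log is available yet."""
    # Shape implied by the second traceback: {'data': {'log': [...]}},
    # where 'data' can be None (presumably while the job is still running).
    data = payload.get("data")
    if isinstance(data, dict):
        return list(data.get("log", []))
    # Shape the pre-upgrade code expected: {'log': [...]} at the top level.
    if "log" in payload:
        return list(payload["log"])
    # Neither shape matched: let the caller decide whether to retry.
    return None
```

Returning `None` instead of raising lets the caller tell "not ready yet" apart from "ready, but with an empty log".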

Change 655946 merged by CRusnov:
[operations/software/netbox-extras@master] custom_script_proxy: adjust for Netbox 2.9 API

https://gerrit.wikimedia.org/r/655946

Change 655963 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: poll Netbox script results

https://gerrit.wikimedia.org/r/655963

Change 655963 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: poll Netbox script results

https://gerrit.wikimedia.org/r/655963

Change 655914 merged by CRusnov:
[operations/cookbooks@master] ganeti.makevm: Make necessary changes to port for Netbox 2.9 API

https://gerrit.wikimedia.org/r/655914

We're hitting https://github.com/digitalocean/pynetbox/issues/285
Fix is to upgrade pynetbox to >= v5.0.8

cumin1001:~$ apt show python-pynetbox
Version: 5.0.7-1
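To check a package version such as the one above against the minimum release that carries the fix, a comparison like the following could be used. It is a simplified sketch: `upstream_version` and `has_fix` are hypothetical helpers, and the parsing ignores Debian epochs and non-numeric suffixes.

```python
import re
from typing import Tuple


def upstream_version(debian_version: str) -> Tuple[int, ...]:
    """Parse the upstream part of a Debian version such as '5.0.7-1'."""
    # Drop the Debian revision (the part after the last hyphen), if any.
    upstream = debian_version.rsplit("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def has_fix(debian_version: str, minimum: Tuple[int, ...] = (5, 0, 8)) -> bool:
    """Return True if the version is at least the minimum fixed release."""
    return upstream_version(debian_version) >= minimum
```

With these helpers, `has_fix("5.0.7-1")` is false and `has_fix("5.3.0-1")` is true.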

Change 656131 had a related patch set uploaded (by Volans; owner: Volans):
[operations/debs/pynetbox@debian] Upstream release v5.3.0

https://gerrit.wikimedia.org/r/656131

Change 656131 merged by Volans:
[operations/debs/pynetbox@debian] Upstream release v5.3.0

https://gerrit.wikimedia.org/r/656131

Mentioned in SAL (#wikimedia-operations) [2021-01-14T12:14:04Z] <volans> built and uploaded python3-pynetbox 5.3.0-1 to apt.wikimedia.org - T266487

Change 656143 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: fix for Netbox 2.9 upgrade

https://gerrit.wikimedia.org/r/656143

Change 656143 merged by Volans:
[operations/cookbooks@master] sre.hosts.decommission: fix for Netbox 2.9 upgrade

https://gerrit.wikimedia.org/r/656143

Mentioned in SAL (#wikimedia-operations) [2021-01-14T12:50:03Z] <volans> upgraded python3-pynetbox to 5.3.0-1 on all affected hosts - T266487

This is still failing for me in makevm, getting this traceback: https://phabricator.wikimedia.org/P13758

https://phabricator.wikimedia.org/P13770 is the latest traceback.

That error meant that the IP range was full, and the script couldn't allocate a new IP.
We should catch the error and show a nice message to the user.
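Catching the error and showing a friendlier message could look roughly like this. The sketch makes no claim about what makevm actually does: `AddressPoolExhausted` and `allocate_or_explain` are hypothetical names, `allocate` stands in for whatever call requests the next free IP, and since the real exception type is not shown in this task the caller supplies it.

```python
from typing import Callable, Tuple, Type


class AddressPoolExhausted(Exception):
    """Hypothetical error for a prefix that has no free addresses left."""


def allocate_or_explain(
    allocate: Callable[[], str],
    prefix: str,
    full_errors: Tuple[Type[Exception], ...],
) -> str:
    """Run `allocate`, translating 'range full' failures into a clear error."""
    try:
        return allocate()
    except full_errors as exc:
        # Keep the original exception chained for debugging.
        raise AddressPoolExhausted(
            f"Could not allocate an IP in {prefix}: the range appears to be "
            "full. Free an address or choose another prefix, then retry."
        ) from exc
```

Only the exception types listed in `full_errors` are translated, so unrelated failures still surface with their original traceback.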

Apparently also the interface_automation.ImportPuppetDB script got broken by the upgrade, as now Interface and VMInterface are two different objects and the latter doesn't have the type property.

An exception occurred: AttributeError: 'VMInterface' object has no attribute 'type'

Traceback (most recent call last):
  File "/srv/deployment/netbox/deploy-cache/revs/b17db9919cea6f35b569e5b9f3f18a3c2fb24b3f/src/netbox/extras/scripts.py", line 451, in _run_script
    script.output = script.run(**kwargs)
  File "/srv/deployment/netbox-extras//customscripts/interface_automation.py", line 792, in run
    messages.extend(self._import_interfaces_for_device(device, net_driver, networking, lldp, True))
  File "/srv/deployment/netbox-extras//customscripts/interface_automation.py", line 502, in _import_interfaces_for_device
    self.log_success(f"{device.name}: renamed ##PRIMARY## interface to {dif.name} ({dif.type})")
AttributeError: 'VMInterface' object has no attribute 'type'

From the changelog:

A new model, VMInterface has been introduced to represent interfaces assigned to VirtualMachine instances. Previously, these interfaces utilized the DCIM model Interface. Instances will be replicated automatically upon upgrade, however any custom code which references or manipulates virtual machine interfaces will need to be updated accordingly.
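One way to keep physical and virtual interfaces on separate code paths, rather than branching on attributes inline, is type-based dispatch. The sketch below uses small stand-in classes and is not the real Netbox code: the `Interface` and `VMInterface` dataclasses here only mimic the one difference reported in the traceback (the missing `type` attribute), and `describe` is a hypothetical helper.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Interface:  # stand-in for the DCIM model, not the real class
    name: str
    type: str


@dataclass
class VMInterface:  # stand-in for the new model; it has no 'type'
    name: str


@singledispatch
def describe(iface) -> str:
    """Fallback for objects that are not a known interface class."""
    raise TypeError(f"unsupported interface object: {type(iface).__name__}")


@describe.register
def _(iface: Interface) -> str:
    # Physical interfaces carry a type, so include it in the message.
    return f"{iface.name} ({iface.type})"


@describe.register
def _(iface: VMInterface) -> str:
    # Virtual interfaces have no type; report the name only.
    return iface.name
```

A log call could then use `describe(dif)` instead of reading `dif.type` directly.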

I've looked a bit at the code, I don't think that a hotfix is what we need there. @crusnov It seems that the script needs a larger refactor to split the behaviour between physical and virtual devices instead of adding a bunch of if/else that will make it mostly unreadable IMHO.

That's probably true. There aren't many places where this is actually an issue, but those places can be refactored. I'm working on this now.

Change 656954 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-extras@master] interface_automation.py: Minor refactors and fixes for 2.9

https://gerrit.wikimedia.org/r/656954

Change 656954 merged by CRusnov:
[operations/software/netbox-extras@master] interface_automation.py: Minor refactors and fixes for 2.9

https://gerrit.wikimedia.org/r/656954