We are ready to release after testing
Description
Details
Status | Assigned | Task
---|---|---
Resolved | crusnov | T266487 Upgrade Netbox and accompanying to the 2.9 series (Tracking Task)
Resolved | crusnov | T266488 Upgrade netbox-next to 2.9 series
Event Timeline
Tasks to do:
- Upgrade -next to 2.9 series
- Port and test scripts
- Upgrade production to 2.9 series
Not sure why there are 2 similar tasks :)
Could you re-import the database from prod? So -next has more recent data?
Thx
Change 643444 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/netbox-extras@master] Make scripts compatible with Netbox 2.9
Change 643681 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/software/homer/deploy@master] Make Homer compatible with Netbox 2.9
Note for myself, once prod is upgraded to 2.9 we can move the virtual-chassis FQDN from the domain to the new name field, then probably simplify some existing scripts.
We need to finalize the pending changes for 2.9, because the scripts are one of two things blocking deployment. We should check these off once a script has an open patch (with a link) or we are reasonably certain it will work on 2.9.
- customscripts/getstats.py (should work, will verify)
- customscripts/interface_automation.py
- customscripts/offline_device.py (should work, will verify)
- reports/accounting.py (should work, will verify)
- reports/cables.py (should work, will verify)
- reports/coherence.py
- reports/librenms.py (should work, will verify)
- reports/management.py (should work, will verify)
- tools/custom_script_proxy.py (verified works)
- tools/dumpbackup.py (needs table list update, works otherwise)
- tools/ganeti-netbox-sync.py (needs major rework)
- tools/import-mgmt-dns.py (this is example code and does not need to be ported)
- dns/generate_dns_snippets.py (needs to be validated but should work)
Tested all of the low hanging fruit.
Verified on netbox-next.
- customscripts/offline_device.py (should work, will verify)
Appears to work as expected. Would like @ayounsi or @Volans to also verify.
- reports/accounting.py (should work, will verify)
Does not work since the reports internal API has changed and overloading run() is different now.
- reports/cables.py (should work, will verify)
Works.
- reports/coherence.py
Works.
- reports/librenms.py (should work, will verify)
Can't test because the database isn't open for -next.
- reports/management.py (should work, will verify)
Works.
Missed one!
- reports/puppetdb.py (Works.)
Mentioned in SAL (#wikimedia-operations) [2020-12-21T21:09:43Z] <chaomodus> merging change 643354 for Netbox 2.9 support, puppet disabled on production machines until testing completed T266487
Change 651268 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox: Fix dependency loop introduced in previous patch
Change 651268 merged by CRusnov:
[operations/puppet@production] netbox: Fix dependency loop introduced in previous patch
Mentioned in SAL (#wikimedia-operations) [2020-12-21T22:18:40Z] <chaomodus> Re-enabling puppet on Netbox production instances after having tested netbox2001 with new puppet code T266487
Hello, with DNS generation verified (patch out) we are ready to deploy. Here is the deployment plan, which will necessitate some Netbox downtime:
- Merge (not deploy) DNS generation patch for -extras https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/655040
- Merge (not deploy) 2.9 support patch for -extras https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/643444 and https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/643681
- Disable puppet on netbox[1001, 2001]
- Merge 2.9 support patch in puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/649436
- Ensure puppet enabled on -dev1001
- Deploy -extras to -dev2001
- Spot check DNS generation, reports, etc. (Exhaustive check has been performed already)
- Deploy 2.9 to production machines netbox[1001, 2001]
- Deploy -extras to netbox[1001,2001]
- Enable puppet on netbox[1001,2001]
- Exhaustively check scripts, reports, etc on production Netbox
Done!
We should prefer to do this early PST so that a maximum amount of day is available in case anything goes wrong (we don't predict anything will go wrong, but just in case).
If there's a good day for DCOps when this won't be too much of an interruption, please let us know! We can proceed at any time.
We can merge -extras without disabling puppet, since it is not involved with puppet at all, but ordering disabling puppet first is the same outcome so that's cool.
- Ensure puppet enabled on -dev1001 => typo: 2001, should also be run puppet?
- Enable puppet on netbox[1001,2001] => and run puppet
Correct.
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:04:12Z] <chaomodus> proceeding with Netbox 2.9 upgrade T266487
Change 643444 merged by CRusnov:
[operations/software/netbox-extras@master] Make scripts and reports compatible with Netbox 2.9
Change 643681 merged by CRusnov:
[operations/software/homer/deploy@master] Make Homer compatible with Netbox 2.9
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:12:19Z] <chaomodus> Merged Netbox 2.9 related changes in puppet and -extras; testing on -next T266487
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:30:25Z] <crusnov@deploy1001> Started deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:32:58Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487 (duration: 02m 33s)
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:37:24Z] <chaomodus> Upgrade of Netbox to 2.9 complete, checking support software. T266487
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:46:52Z] <crusnov@deploy1001> Started deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487
Mentioned in SAL (#wikimedia-operations) [2021-01-12T22:46:57Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487 (duration: 00m 05s)
Note that https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/643681 needed a Homer deploy
And Netbox reports Icinga checks are failing.
Change 655871 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: fix check report for Netbox 2.9
Change 655871 merged by Volans:
[operations/puppet@production] netbox: fix check report for Netbox 2.9
@ayounsi fixed the homer part deploying the changes, I've deployed the above patch to fix the Netbox reports.
@crusnov was the makevm cookbook adapted/tested? From a quick look at the code I think it might be broken: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/ganeti/makevm.py#161
I reimaged mc1029 and mc2029 today, and towards the end of the reimaging I got:
12:13:50 | Updated Netbox:
12:13:50 | mc1029.eqiad.wmnet | Unable to run wmf-auto-reimage-host: 'log'
12:13:50 | mc1029.eqiad.wmnet | REIMAGE END | retcode=2
Looking at the logfile I found:
2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: Updated Netbox:
2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: Unable to run wmf-auto-reimage-host: 'log'
2021-01-13 12:36:00 [ERROR] (jiji) wmf-auto-reimage::main: Unable to run wmf-auto-reimage-host
Traceback (most recent call last):
  File "/usr/local/sbin/wmf-auto-reimage-host", line 264, in main
    run(args, user, log_path)
  File "/usr/local/sbin/wmf-auto-reimage-host", line 211, in run
    lib.update_netbox(args.host)
  File "/usr/local/lib/python3.7/dist-packages/wmf_auto_reimage_lib.py", line 916, in update_netbox
    for log_line in result.json()['log']:
KeyError: 'log'
2021-01-13 12:36:00 [INFO] (jiji) wmf-auto-reimage::print_line: REIMAGE END | retcode=2
Both hosts are running happily. Let me know if you need more information:)
Change 655909 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: fix Netbox update for 2.9 upgrade
Change 655909 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: fix Netbox update for 2.9 upgrade
@jijiki thanks for the report, with the above patch it should be fixed, I've already merged and deployed it. Let us know if it works correctly in your next reimage. Sorry for the trouble.
@crusnov the proxy script that fetches data from the GetDeviceStats script is broken too. The script POST API no longer returns the output; you have to get the job ID and then fetch the results.
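The "get the job ID, then fetch the results" flow described above can be sketched as a small polling loop. This is a minimal sketch and not the code from the merged patches: `fetch_job` is a hypothetical callable that wraps whatever HTTP GET retrieves the job result for a given ID, and the status names in `TERMINAL_STATES` are assumptions that should be checked against the running Netbox version.

```python
import time
from typing import Any, Callable, Dict

# Assumed names of the states in which a job will not change any more.
TERMINAL_STATES = {"completed", "errored", "failed"}


def status_value(job: Dict[str, Any]) -> str:
    """Return the job status as a plain string.

    The status may be serialized as a string or as a mapping with a
    "value" key, so both shapes are accepted here.
    """
    status = job.get("status")
    if isinstance(status, dict):
        return str(status.get("value"))
    return str(status)


def wait_for_job(
    fetch_job: Callable[[int], Dict[str, Any]],  # hypothetical HTTP wrapper
    job_id: int,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_job(job_id) until the job reaches a terminal state."""
    for _ in range(attempts):
        job = fetch_job(job_id)
        if status_value(job) in TERMINAL_STATES:
            return job
        sleep(delay)
    raise TimeoutError(f"job {job_id} did not finish after {attempts} attempts")
```

Passing `fetch_job` and `sleep` in as parameters keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.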
Change 655914 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/cookbooks@master] ganeti.makevm: Make necessary changes to port for Netbox 2.9 API
Change 655946 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-extras@master] custom_script_proxy: adjust for Netbox 2.9 API
Here is another report. The issue below happened when using wmf-auto-reimage at the very last step after "updating netbox". Confirmed on multiple hosts. Here is a trace as example:
Change 655946 merged by CRusnov:
[operations/software/netbox-extras@master] custom_script_proxy: adjust for Netbox 2.9 API
Change 655963 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] wmf-auto-reimage: poll Netbox script results
Change 655963 merged by Volans:
[operations/puppet@production] wmf-auto-reimage: poll Netbox script results
Change 655914 merged by CRusnov:
[operations/cookbooks@master] ganeti.makevm: Make necessary changes to port for Netbox 2.9 API
This is still failing for me in makevm, getting this traceback: https://phabricator.wikimedia.org/P13758
We're hitting https://github.com/digitalocean/pynetbox/issues/285
Fix is to upgrade pynetbox to >= v5.0.8
cumin1001:~$ apt show python-pynetbox
Version: 5.0.7-1
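To check whether a given host already carries the fix, the installed package version can be compared against 5.0.8. The following is a rough sketch that assumes simple Debian version strings like the one above (optional epoch, numeric upstream part, optional revision); the helper names are made up for illustration and it is not a full Debian version comparison.

```python
import re
from typing import Tuple

# First pynetbox release that contains the fix, per the comment above.
MIN_FIXED = (5, 0, 8)


def upstream_version(debian_version: str) -> Tuple[int, ...]:
    """Extract the numeric upstream part, e.g. "5.0.7-1" -> (5, 0, 7)."""
    without_epoch = debian_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def has_fix(debian_version: str) -> bool:
    """True if the package version is at least MIN_FIXED."""
    return upstream_version(debian_version) >= MIN_FIXED
```

With the version shown above, `has_fix("5.0.7-1")` is false, while the 5.3.0-1 package built later in this task satisfies the check.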
Change 656131 had a related patch set uploaded (by Volans; owner: Volans):
[operations/debs/pynetbox@debian] Upstream release v5.3.0
Change 656131 merged by Volans:
[operations/debs/pynetbox@debian] Upstream release v5.3.0
Mentioned in SAL (#wikimedia-operations) [2021-01-14T12:14:04Z] <volans> built and uploaded python3-pynetbox 5.3.0-1 to apt.wikimedia.org - T266487
Change 656143 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: fix for Netbox 2.9 upgrade
Change 656143 merged by Volans:
[operations/cookbooks@master] sre.hosts.decommission: fix for Netbox 2.9 upgrade
Mentioned in SAL (#wikimedia-operations) [2021-01-14T12:50:03Z] <volans> upgraded python3-pynetbox to 5.3.0-1 on all affected hosts - T266487
That error meant that the IP range was full, and the script couldn't allocate a new IP.
We should catch the error and show a nice message to the user.
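One possible shape for that, as a sketch only: the thread does not show which exception is actually raised when the range is full, so `NoAvailableIPError` and the `allocate` callable below are placeholders, not names from the cookbook or from pynetbox.

```python
from typing import Callable


class NoAvailableIPError(Exception):
    """Placeholder for whatever error signals that a prefix is full."""


class UserFacingError(Exception):
    """Error whose message is meant to be shown directly to the operator."""


def allocate_ip(prefix: str, allocate: Callable[[str], str]) -> str:
    """Call allocate(prefix) and turn a full range into a readable error."""
    try:
        return allocate(prefix)
    except NoAvailableIPError as exc:
        # Keep the original exception chained for debugging purposes.
        raise UserFacingError(
            f"No free IP addresses left in {prefix}; "
            "free some addresses or choose another range."
        ) from exc
```

The caller can then print the `UserFacingError` message and exit cleanly instead of showing a traceback.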
The interface_automation.ImportPuppetDB script was apparently also broken by the upgrade: Interface and VMInterface are now two different objects, and the latter doesn't have the type property.
An exception occurred: AttributeError: 'VMInterface' object has no attribute 'type'
Traceback (most recent call last):
  File "/srv/deployment/netbox/deploy-cache/revs/b17db9919cea6f35b569e5b9f3f18a3c2fb24b3f/src/netbox/extras/scripts.py", line 451, in _run_script
    script.output = script.run(**kwargs)
  File "/srv/deployment/netbox-extras//customscripts/interface_automation.py", line 792, in run
    messages.extend(self._import_interfaces_for_device(device, net_driver, networking, lldp, True))
  File "/srv/deployment/netbox-extras//customscripts/interface_automation.py", line 502, in _import_interfaces_for_device
    self.log_success(f"{device.name}: renamed ##PRIMARY## interface to {dif.name} ({dif.type})")
AttributeError: 'VMInterface' object has no attribute 'type'
From the changelog:
A new model, VMInterface has been introduced to represent interfaces assigned to VirtualMachine instances. Previously, these interfaces utilized the DCIM model Interface. Instances will be replicated automatically upon upgrade, however any custom code which references or manipulates virtual machine interfaces will need to be updated accordingly.
I've looked a bit at the code, I don't think that a hotfix is what we need there. @crusnov It seems that the script needs a larger refactor to split the behaviour between physical and virtual devices instead of adding a bunch of if/else that will make it mostly unreadable IMHO.
That's probably true. There aren't many places where this would actually be an issue, but those places can be refactored. I'm working on this now.
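The split between physical and virtual interfaces discussed above could look roughly like the sketch below. It is an illustration, not the refactor that was merged: the two dataclasses are stand-ins for the Netbox models (only the attributes seen in the traceback are modelled), and `describe` is a hypothetical helper for building the log message.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Interface:
    """Stand-in for a physical device interface, which has a type."""
    name: str
    type: str


@dataclass
class VMInterface:
    """Stand-in for a virtual machine interface, which has no type."""
    name: str


@singledispatch
def describe(iface) -> str:
    """Build the text used in log messages for an interface."""
    raise TypeError(f"unsupported interface object: {type(iface).__name__}")


@describe.register
def _(iface: Interface) -> str:
    return f"{iface.name} ({iface.type})"


@describe.register
def _(iface: VMInterface) -> str:
    return iface.name
```

Dispatching on the object type keeps the physical and virtual code paths in separate functions, so the calling code does not need an if/else at every place where the two models differ.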
Change 656954 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-extras@master] interface_automation.py: Minor refactors and fixes for 2.9
Change 656954 merged by CRusnov:
[operations/software/netbox-extras@master] interface_automation.py: Minor refactors and fixes for 2.9