Upgrade Netbox to 2.7 series
Closed, ResolvedPublic

Description

Procedure to follow:

  • Inform dc-ops and schedule
  • Take a dump of the database
  • Downtime Netbox and associated services, reports, etc.
  • Merge puppet patch
  • Merge -extras patch
  • Merge deploy patch
  • Scap deploy
  • Test deployment
  • End downtime

Upgrade done, addressing remaining issues.

Event Timeline

crusnov set Due Date to Feb 12 2020, 8:00 AM.

My only request is that this not happen during the planned eqsin PDU work, starting 2020-02-06 16:00 Pacific / 2020-02-07 00:00 GMT / 2020-02-07 08:00 Singapore time and expected to take a few hours.

Change 572123 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox: Update configuration to support v2.7

https://gerrit.wikimedia.org/r/572123

Change 572025 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] Bump Netbox revision to v2.7.4

https://gerrit.wikimedia.org/r/572025

Once all of the changes are approved, the process will be as follows:

  • Downtime Netbox and associated services, reports, etc.
  • Merge puppet patch
  • Merge -extras patch
  • Merge deploy patch
  • Scap deploy
  • Test deployment
  • End downtime

I suppose we will execute this on Tuesday, since it is already Friday and there are days off, barring anything else significant coming up.

Change 572123 merged by CRusnov:
[operations/puppet@production] netbox: Update configuration to support v2.7

https://gerrit.wikimedia.org/r/572123

Change 572025 merged by CRusnov:
[operations/software/netbox-deploy@master] Bump Netbox revision to v2.7.4

https://gerrit.wikimedia.org/r/572025

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:45:05Z] <crusnov@deploy1001> Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:46:24Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:48:23Z] <crusnov@deploy1001> Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2)

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:49:43Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part2) (duration: 01m 19s)

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:51:13Z] <crusnov@deploy1001> Started deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3)

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:51:25Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@f3d56dd]: netbox 2.7.4 upgrade T244291 (part3) (duration: 00m 11s)

Mentioned in SAL (#wikimedia-operations) [2020-02-18T22:52:53Z] <chaomodus> completed upgrading Netbox to 2.7.4 T244291

Okay, fallout from the upgrade:

  • Ganeti sync sometimes fails, we are debugging.
  • ManagementConsole report reports incorrect results.

All external scripts sometimes fail (dump and ganeti sync in particular).

By process of elimination, uWSGI appears to be the culprit: with it removed from the request path, the errors no longer seem to occur.

What happens is that occasionally a script performs an API call which results in the error "Remote end closed connection without response". There is no strong indicator on the uWSGI side that this happened; it appears to return a 200 for the request. The Apache side claims that uWSGI closed the connection before any data was sent (which I believe is consistent with uWSGI doing something unexpected).

https://phabricator.wikimedia.org/P10453

is an example of a script failing in this way, which appears to show the 200s occurring, but then a traceback happening.
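For reference, the message "Remote end closed connection without response" is the text Python's `http.client.RemoteDisconnected` exception carries. The following is only a minimal sketch of how a client script could tolerate an occasional dropped connection while the root cause is investigated; it is not what the affected scripts do. `call_with_retry` and `FlakyCall` are hypothetical names, and the sketch assumes the call is safe to repeat. Scripts that use `requests` would see this exception wrapped in `requests.exceptions.ConnectionError` and would need to catch that instead.

```python
import http.client
import time


def call_with_retry(func, attempts=3, delay=0.5, sleep=time.sleep):
    """Call func(), retrying when the server closes the connection without replying.

    Hypothetical helper: only safe for calls that can be repeated (e.g. GETs).
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except http.client.RemoteDisconnected:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            # Simple linear backoff before the next attempt.
            sleep(delay * attempt)


class FlakyCall:
    """Hypothetical stand-in for an API call that fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise http.client.RemoteDisconnected(
                "Remote end closed connection without response"
            )
        return {"status": 200}


if __name__ == "__main__":
    flaky = FlakyCall(failures=2)
    print(call_with_retry(flaky, delay=0), "after", flaky.calls, "calls")
```

A retry like this only masks the symptom on the client side; it does not explain why the connection is being closed.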

To reiterate, this failure path does not seem to happen with Django's internal server implementation.

I will continue to debug, but @Volans if you could lend a hand that'd be appreciated.

Change 573262 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: remove temporary config post-upgrade

https://gerrit.wikimedia.org/r/573262

Change 573262 merged by Volans:
[operations/puppet@production] netbox: remove temporary config post-upgrade

https://gerrit.wikimedia.org/r/573262

Change 573263 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: better splay scripts in the hour

https://gerrit.wikimedia.org/r/573263

Change 573263 merged by Volans:
[operations/puppet@production] netbox: better splay scripts in the hour

https://gerrit.wikimedia.org/r/573263

Mentioned in SAL (#wikimedia-operations) [2020-02-19T11:56:41Z] <volans> better splay of periodic scripts that interact with Netbox - T244291
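To illustrate what splaying periodic scripts across the hour means, here is a minimal sketch that assigns each job an evenly spaced start minute so they do not all hit Netbox at once. It is an assumption about the general idea, not the content of the puppet patch; `splay_minutes` and the job names are hypothetical.

```python
def splay_minutes(job_names, window=60):
    """Return a start minute per job, evenly spaced across `window` minutes.

    Hypothetical helper: jobs are sorted by name so the assignment is stable
    between runs.
    """
    names = sorted(set(job_names))
    if not names:
        return {}
    step = window / len(names)
    return {name: int(i * step) for i, name in enumerate(names)}


if __name__ == "__main__":
    # Hypothetical job names, for illustration only.
    for job, minute in splay_minutes(["dump", "ganeti_sync", "reports"]).items():
        print(f"{job}: minute {minute} of each hour")
```

With three jobs and a 60-minute window this yields start minutes 0, 20 and 40.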

Change 573330 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] netbox: disable keepalive between Apache and uWSGI

https://gerrit.wikimedia.org/r/573330

Change 573330 merged by Volans:
[operations/puppet@production] netbox: disable keepalive between Apache and uWSGI

https://gerrit.wikimedia.org/r/573330