Swift version and distro upgrade
Closed, ResolvedPublic

Description

We are currently running mixed swift versions (1.13 - 2.2) and distributions (trusty / jessie). T162348 (swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses) uncovered a 1.13-specific bug. We should aim at running the same swift version across the fleet, and ideally the same distro as well.

Current situation:

eqiad

| hosts | distro | swift version | notes |
| --- | --- | --- | --- |
| ms-be[1001-1012] | trusty | 1.13.1 | decom |
| ms-be[1013-1021] | stretch | 2.10.1-3 | |
| ms-be[1022-1039] | jessie | 2.10.1-2~bpo8+1 | |
| ms-fe[1005-1008] | jessie | 2.10.1-2~bpo8+1 | |

codfw

| hosts | distro | swift version | notes |
| --- | --- | --- | --- |
| ms-be[2001-2012] | trusty | 1.13.1 | decom |
| ms-be[2013-2021] | stretch | 2.10.1-3 | |
| ms-be[2022-2039] | jessie | 2.10.1-2~bpo8+1 | |
| ms-fe[2005-2008] | jessie | 2.10.1-2~bpo8+1 | |

esams

Hardware to be decom, not in production.

In terms of swift versions in Debian:

| 2.2.0-1+deb8u1 | jessie |
| 2.7.0-10~bpo8+1 | jessie-backports |
| 2.10.1-2 | stretch |
| 2.10.1-2 | sid |

The end state would be swift >= 2.2 and distro >= jessie everywhere, with priority on the swift version. I've listed below possible solutions in increasing order of complexity and time requirements.
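The end-state condition above can be checked mechanically. The following is a minimal sketch, not tooling used in this task: the `DISTRO_RANK` ordering (trusty treated as older than jessie), the function names, and the simplified version parsing are all assumptions made for illustration. It compares only the numeric upstream part of the package version, which is enough to show why 2.10 must sort after 2.2 (a plain string comparison would get this wrong).

```python
# Assumption: trusty (Ubuntu) is ranked below jessie and stretch (Debian),
# matching the "distro >= jessie" wording in the task description.
DISTRO_RANK = {"trusty": 0, "jessie": 1, "stretch": 2}


def upstream_version(pkg_version):
    """Return the upstream part of a package version as a tuple of ints.

    '2.10.1-2~bpo8+1' -> (2, 10, 1). Epochs and non-numeric upstream parts
    are not handled: this is a simplification, not dpkg's full algorithm.
    """
    upstream = pkg_version.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def meets_end_state(distro, pkg_version, min_swift=(2, 2), min_distro="jessie"):
    """True if a host group satisfies swift >= 2.2 and distro >= jessie."""
    return (
        upstream_version(pkg_version) >= min_swift
        and DISTRO_RANK[distro] >= DISTRO_RANK[min_distro]
    )


# Rows copied from the eqiad table in the description.
FLEET = [
    ("ms-be[1001-1012]", "trusty", "1.13.1"),
    ("ms-be[1013-1021]", "stretch", "2.10.1-3"),
    ("ms-be[1022-1039]", "jessie", "2.10.1-2~bpo8+1"),
    ("ms-fe[1005-1008]", "jessie", "2.10.1-2~bpo8+1"),
]

for hosts, distro, version in FLEET:
    status = "ok" if meets_end_state(distro, version) else "needs work"
    print(f"{hosts}: {status}")
```

For real package version comparisons, `dpkg --compare-versions` is the authoritative tool; the tuple comparison here only covers the simple cases listed in this task.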

a) Backport swift 2.2 to trusty, upgrade trusty hw to 2.2

This is easy because jessie's swift 2.2 builds as-is on trusty and we'll run a single version of swift.

Swift 2.10 (from stretch) on trusty seems trickier because it requires quite a few build dependencies that trusty either lacks or ships in older versions (ditto for swift 2.7 from jessie-backports):

pbuilder-satisfydepends-dummy : Depends: debhelper (>= 10) but 9.20131227ubuntu1 is to be installed.
                                Depends: openstack-pkg-tools (>= 48~) but it is not going to be installed.
                                Depends: python-setuptools (>= 20.6.8) but 3.3-1ubuntu1 is to be installed.
                                Depends: python-cryptography (>= 1.0) which is a virtual package.
                                Depends: python-dnspython (>= 1.14.0) but it is not going to be installed.
                                Depends: python-eventlet (>= 0.17.4) but 0.13.0-1ubuntu2 is to be installed.
                                Depends: python-keystoneclient (>= 1:1.3.0) but 1:0.7.1-ubuntu1 is to be installed.
                                Depends: python-netifaces (> 0.10.1) but it is not going to be installed.
                                Depends: python-openstackdocstheme (>= 1.0.3) which is a virtual package.
                                Depends: python-os-api-ref which is a virtual package.
                                Depends: python-os-testr (>= 0.4.1) which is a virtual package.
                                Depends: python-pyeclib (>= 1.2.0) which is a virtual package.
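To get a quick overview of how much backporting such a build would need, the unmet dependencies can be grouped by the reason apt reports. This is a minimal sketch under stated assumptions: the regular expression, the function names, and the three category labels are made up for this example and only target the message shapes visible in the output above.

```python
import re
from collections import Counter

# Matches lines such as:
#   Depends: debhelper (>= 10) but 9.20131227ubuntu1 is to be installed.
#   Depends: python-os-api-ref which is a virtual package.
DEPENDS_RE = re.compile(
    r"Depends:\s+(?P<name>\S+)"
    r"(?:\s+\((?P<op>[<>=]+)\s+(?P<version>[^)]+)\))?"
    r"\s+(?P<reason>.+)$"
)


def parse_unmet(output):
    """Return one dict per 'Depends:' line (name, op, version, reason)."""
    unmet = []
    for line in output.splitlines():
        match = DEPENDS_RE.search(line)
        if match:
            unmet.append(match.groupdict())
    return unmet


def classify(reason):
    """Map apt's wording to a short label (labels are hypothetical)."""
    if "virtual package" in reason:
        return "no real package"
    if "not going to be installed" in reason:
        return "not installable"
    return "candidate too old"


# A few lines copied from the pbuilder output above.
SAMPLE = """\
pbuilder-satisfydepends-dummy : Depends: debhelper (>= 10) but 9.20131227ubuntu1 is to be installed.
                                Depends: python-cryptography (>= 1.0) which is a virtual package.
                                Depends: python-dnspython (>= 1.14.0) but it is not going to be installed.
                                Depends: python-os-api-ref which is a virtual package.
"""

summary = Counter(classify(dep["reason"]) for dep in parse_unmet(SAMPLE))
for label, count in summary.items():
    print(f"{label}: {count}")
```

Each "no real package" entry would likely need its own backport, which is what makes this option more expensive than (a).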

b) Reimage trusty machines with jessie (16x reimages total, assuming old hardware decom'd)

This is doable with minimal disruption during the reimage (only the SSDs are wiped), but it is more time-consuming than the upgrade above because the right uid/gid for swift must be added before the first puppet run (cf. T123918). Note that this solution assumes we have already finished decommissioning old ms-be hardware (30x machines), which therefore won't need a reimage.

c) Reimage trusty machines with stretch and upgrade to swift 2.10

Stretch ships with swift 2.10, so we could reimage the trusty machines to stretch and upgrade the jessie machines to swift 2.10 (stretch's swift package backports cleanly to jessie). The same disclaimer re: decom applies, plus swift 2.10 needs testing on both backend and frontend.

I think we'll have to do this in steps, namely:

  • Complete (a) first, run swift 2.2 everywhere and a mix of trusty/jessie
  • (Also while the above is in progress) Finish decommissioning old hardware
  • Backport swift 2.10 to jessie, upgrade and test existing jessie machines (running a mix of 2.2 and 2.10 while the step below is in progress). This includes upgrading ms-fe to 2.10 first and then ms-be.
  • Complete (c) by reinstalling trusty hw with stretch and running swift 2.10 everywhere
  • (Nice to have) reimage jessie systems with stretch. ms-fe is easy to do and we should, ms-be more time consuming but doable as well.
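The "ms-fe first, then ms-be" ordering from the steps above can be sketched as a small helper that sorts hosts and yields them in batches. This is only an illustration: the function names, the batch size, and the short host list are hypothetical, and the actual rollout in this task was driven by hand.

```python
def rollout_order(hosts):
    """Sort hosts so frontends (ms-fe*) come before backends (ms-be*)."""
    return sorted(hosts, key=lambda host: (not host.startswith("ms-fe"), host))


def batches(hosts, size):
    """Yield consecutive slices of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


# Hypothetical subset of codfw hosts, deliberately out of order.
HOSTS = ["ms-be2022", "ms-fe2005", "ms-be2023", "ms-fe2006"]

for batch in batches(rollout_order(HOSTS), 2):
    print(batch)
```

Keeping batches small means a problem with the new version only affects a few hosts before it is noticed.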

Event Timeline

faidon subscribed.

The plan above totally makes sense to me and sounds like the path of the least amount of work with the maximum amount of consistency.

I'd add a final step of upgrading the jessie systems to stretch at the end but that's just a nice-to-have and won't really make an impact on Swift version consistency.

I'm also guessing that ms-fes would be upgraded first to jessie/2.10, before ms-bes?

Mentioned in SAL (#wikimedia-operations) [2017-04-11T09:11:31Z] <godog> upgrade swift to 2.2.0 on ms-be2001 - T162609

good point! I've updated the task with both suggestions

Mentioned in SAL (#wikimedia-operations) [2017-04-11T13:01:41Z] <godog> roll-upgrade swift to 2.2.0 across codfw machines - T162609

Mentioned in SAL (#wikimedia-operations) [2017-04-11T13:49:20Z] <godog> roll-upgrade swift to 2.2.0 across eqiad machines - T162609

Change 356206 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reimage ms-be2001 with stretch

https://gerrit.wikimedia.org/r/356206

Change 356206 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reimage ms-be2001 with stretch

https://gerrit.wikimedia.org/r/356206

Mentioned in SAL (#wikimedia-operations) [2017-06-01T11:33:52Z] <godog> test upgrade of swift 2.10 on ms-fe2005 - T162609

proxy-server 2.10 seems to be basically working on ms-fe2005. I've asked upstream about an increase in proxy-server.errors metrics that seem related to ratelimit here: https://bugs.launchpad.net/swift/+bug/1695273

Change 357377 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: ms-be2013 / 16 / 17 to stretch

https://gerrit.wikimedia.org/r/357377

Change 357377 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: ms-be2013 / 16 / 17 to stretch

https://gerrit.wikimedia.org/r/357377

Change 357396 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: create swift user home

https://gerrit.wikimedia.org/r/357396

Change 357396 merged by Filippo Giunchedi:
[operations/puppet@production] swift: create swift user home

https://gerrit.wikimedia.org/r/357396

Change 357422 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add hp-mcp-stretch

https://gerrit.wikimedia.org/r/357422

Change 357568 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: mask object reconstructor on >= jessie

https://gerrit.wikimedia.org/r/357568

Change 357568 merged by Filippo Giunchedi:
[operations/puppet@production] swift: mask object reconstructor on >= jessie

https://gerrit.wikimedia.org/r/357568

Change 357569 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: move ms-be2* trusty hosts to stretch

https://gerrit.wikimedia.org/r/357569

Change 357569 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: move ms-be2* trusty hosts to stretch

https://gerrit.wikimedia.org/r/357569

I've upgraded all ms-fe2* to swift 2.10; the trusty -> stretch conversion of ms-be2* is ongoing. Regardless of the latter, I think we could test some user traffic in swift codfw next week and see how that goes.

Change 358376 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: swift temporary a/a

https://gerrit.wikimedia.org/r/358376

Change 358377 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hierata: swift active in codfw only

https://gerrit.wikimedia.org/r/358377

Change 358376 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: swift temporary a/a

https://gerrit.wikimedia.org/r/358376

Mentioned in SAL (#wikimedia-operations) [2017-06-13T14:14:03Z] <godog> point upload varnish to swift in codfw - T162609

Change 358377 merged by Filippo Giunchedi:
[operations/puppet@production] hierata: swift active in codfw only

https://gerrit.wikimedia.org/r/358377

Change 358609 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: point varnish upload esams to codfw

https://gerrit.wikimedia.org/r/358609

Change 358609 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: point varnish upload esams to codfw

https://gerrit.wikimedia.org/r/358609

Change 358611 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: a/a for varnish swift_thumbs

https://gerrit.wikimedia.org/r/358611

Change 358612 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: a/p for varnish swift_thumbs

https://gerrit.wikimedia.org/r/358612

Change 358611 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: a/a for varnish swift_thumbs

https://gerrit.wikimedia.org/r/358611

Change 358612 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: a/p for varnish swift_thumbs

https://gerrit.wikimedia.org/r/358612

Mentioned in SAL (#wikimedia-operations) [2017-06-19T10:15:42Z] <godog> roll-upgrade swift to 2.10 on to ms-fe1* - T162609

Mentioned in SAL (#wikimedia-operations) [2017-06-19T14:24:10Z] <godog> roll-upgrade swift to 2.10 on ms-be10[22-30] - T162609

Change 359952 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: ms-be10[13-21] to stretch

https://gerrit.wikimedia.org/r/359952

Change 359952 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: ms-be10[13-21] to stretch

https://gerrit.wikimedia.org/r/359952

Current status: swift in esams hasn't been touched since it is slated for decom anyway. ms-be[1001-1012] and ms-be[2001-2012] are decom'd from swift. The remaining machines are running either stretch or jessie, with swift 2.10.

# cumin -p1 'ms-*' 'dpkg-query -W swift'
80 hosts will be targeted:
ms-be[2013-2039].codfw.wmnet,ms-be[1001-1039].eqiad.wmnet,ms-be[3001-3004].esams.wmnet,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,ms-fe[3001-3002].esams.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(12) ms-be[1001-1012].eqiad.wmnet
----- OUTPUT of 'dpkg-query -W swift' -----
swift   2.2.0-1+deb8u1~trusty+1
===== NODE GROUP =====
(45) ms-be[2022-2039].codfw.wmnet,ms-be[1022-1039].eqiad.wmnet,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,ms-fe3001.esams.wmnet
----- OUTPUT of 'dpkg-query -W swift' -----
swift   2.10.1-2~bpo8+1
===== NODE GROUP =====
(18) ms-be[2013-2021].codfw.wmnet,ms-be[1013-1021].eqiad.wmnet
----- OUTPUT of 'dpkg-query -W swift' -----
swift   2.10.1-3
===== NODE GROUP =====
(5) ms-be[3001-3004].esams.wmnet,ms-fe3002.esams.wmnet
----- OUTPUT of 'dpkg-query -W swift' -----
swift   2.2.0-1+deb8u1
================
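As a sanity check on the output above, the per-group host counts can be recomputed from the bracketed host expressions. This is a rough sketch, not part of cumin: `count_hosts` is a hypothetical helper that handles only the simple `prefix[first-last].domain` shape (at most one bracket group per item), not the full node-set syntax.

```python
import re


def count_hosts(nodeset):
    """Count hosts in an expression like 'ms-be[1001-1012].eqiad.wmnet,ms-fe3001.esams.wmnet'."""
    total = 0
    # Split on commas that are not inside a [...] group.
    for item in re.split(r",(?![^\[]*\])", nodeset):
        match = re.search(r"\[([^\]]+)\]", item)
        if not match:
            total += 1  # a plain hostname with no range
            continue
        for part in match.group(1).split(","):
            if "-" in part:
                first, last = part.split("-")
                total += int(last) - int(first) + 1
            else:
                total += 1
    return total


# Node groups copied from the cumin output above, keyed by swift version.
GROUPS = {
    "2.2.0-1+deb8u1~trusty+1": "ms-be[1001-1012].eqiad.wmnet",
    "2.10.1-2~bpo8+1": (
        "ms-be[2022-2039].codfw.wmnet,ms-be[1022-1039].eqiad.wmnet,"
        "ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,"
        "ms-fe3001.esams.wmnet"
    ),
    "2.10.1-3": "ms-be[2013-2021].codfw.wmnet,ms-be[1013-1021].eqiad.wmnet",
    "2.2.0-1+deb8u1": "ms-be[3001-3004].esams.wmnet,ms-fe3002.esams.wmnet",
}

counts = {version: count_hosts(hosts) for version, hosts in GROUPS.items()}
for version, count in counts.items():
    print(f"{version}: {count} hosts")
print("total:", sum(counts.values()))
```

The four groups add up to the 80 targeted hosts, with 63 of them on swift 2.10.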

This is resolved: we're running swift 2.10, and some machines in codfw/eqiad are running stretch too.

Change 357422 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add hp-mcp to stretch-wikimedia

https://gerrit.wikimedia.org/r/357422