
swift upgrade plans: jessie and swift 2.x
Closed, Invalid · Public

Description

current situation:

  • codfw: swift 1.13.1/2.2.0 on trusty/jessie
  • eqiad: swift 1.13.1/2.2.0 on trusty/jessie
  • esams: swift 2.2.0 on jessie (cluster not used)

swift changelog: https://github.com/openstack/swift/blob/master/CHANGELOG

jessie ships with swift 2.2 (juno), though the latest version in backports is 2.7 (mitaka)

changelog diff 2.2.0 -> 2.7.0
swift (2.7.0, OpenStack Mitaka)

    * Bump PyECLib requirement to >= 1.2.0

    * Update container on fast-POST

      "Fast-POST" is the mode where `object_post_as_copy` is set to
      `False` in the proxy server config. This mode now allows for
      fast, efficient updates of metadata without needing to fully
      recopy the contents of the object. While the default for
      `object_post_as_copy` is still True, the plan is to change the default
      to False and then deprecate post-as-copy functionality in later
      releases. Fast-POST now supports container-sync functionality.

    * Add concurrent reads option to proxy.

      This change adds 2 new parameters to enable and control concurrent
      GETs in Swift, these are `concurrent_gets` and `concurrency_timeout`.

      `concurrent_gets` allows you to turn on or off concurrent
      GETs; when on, it will set the GET/HEAD concurrency to the
      replica count. And in the case of EC HEADs it will set it to
      ndata. The proxy will then serve only the first valid source to
      respond. This applies to all account, container, and replicated
      object GETs and HEADs. For EC, only HEAD requests are affected.
      The default for `concurrent_gets` is off.

      `concurrency_timeout` is related to `concurrent_gets` and is
      the amount of time to wait before firing the next thread. A
      value of 0 will fire at the same time (fully concurrent), but
      setting another value will stagger the firing allowing you the
      ability to give a node a short chance to respond before firing
      the next. This value is a float and should be somewhere between
      0 and `node_timeout`. The default is `conn_timeout`, meaning by
      default it will stagger the firing.
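
      As an illustration only (the section name and the example values below
      are assumptions, not taken from the changelog), enabling this in the
      proxy server config might look roughly like:

        [app:proxy-server]
        use = egg:swift#proxy
        # query all eligible nodes concurrently and keep the first valid response
        concurrent_gets = true
        # stagger each additional request by 0.5s; keep between 0 and node_timeout
        concurrency_timeout = 0.5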

    * Added an operational procedures guide to the docs. It can be
      found at http://swift.openstack.org/ops_runbook/index.html and
      includes information on detecting and handling day-to-day
      operational issues in a Swift cluster.

    * Make `handoffs_first` a more useful mode for the object replicator.

      The `handoffs_first` replication mode is used during periods of
      problematic cluster behavior (e.g. full disks) when replication
      needs to quickly drain partitions from a handoff node and move
      them to a primary node.

      Previously, `handoffs_first` would sort that handoff work before
      "normal" replication jobs, but the normal replication work could
      take quite some time and result in handoffs not being drained
      quickly enough.

      In order to focus on getting handoff partitions off the node
      `handoffs_first` mode will now abort the current replication
      sweep before attempting any primary suffix syncing if any of the
      handoff partitions were not removed for any reason - and start
      over with replication of handoffs jobs as the highest priority.

      Note that `handoffs_first` being enabled will emit a warning on
      start up, even if no handoff jobs fail, because of the negative
      impact it can have during normal operations by dog-piling on a
      node that was temporarily unavailable.

    * By default, inbound `X-Timestamp` headers are now disallowed
      (except when in an authorized container-sync request). This
      header is useful for allowing data migration from other storage
      systems to Swift and keeping the original timestamp of the data.
      If you have this migration use case (or any other requirement on
      allowing the clients to set an object's timestamp), set the
      `shunt_inbound_x_timestamp` config variable to False in the
      gatekeeper middleware config section of the proxy server config.
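
      A minimal sketch of that override, assuming the gatekeeper filter is
      declared the usual way in the proxy server config (the `use` line and
      section name are assumptions):

        [filter:gatekeeper]
        use = egg:swift#gatekeeper
        # allow clients (e.g. a migration tool) to supply X-Timestamp on objects
        shunt_inbound_x_timestamp = False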

    * Requesting a SLO manifest file with the query parameters
      "?multipart-manifest=get&format=raw" will return the contents of
      the manifest in the format as was originally sent by the client.
      The "format=raw" is new.

    * Static web page listings can now be rendered with a custom
      label. By default listings are rendered with a label of:
      "Listing of /v1/<account>/<container>/<path>". This change adds
      a new custom metadata key/value pair
      `X-Container-Meta-Web-Listings-Label: My Label` that when set,
      will cause the following: "Listing of My Label/<path>" to be
      rendered instead.

    * Previously, static large objects (SLOs) had a minimum segment
      size (default to 1MiB). This limit has been removed, but small
      segments will be ratelimited. The config parameter
      `rate_limit_under_size` controls the definition of "small"
      segments (1MiB by default), and `rate_limit_segments_per_sec`
      controls how many segments per second can be served (default is 1).
      With the default values, the effective behavior is identical to the
      previous behavior when serving SLOs.
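
      For illustration, assuming the SLO middleware section is named
      [filter:slo], the defaults described above would be written as:

        [filter:slo]
        use = egg:swift#slo
        # segments smaller than 1 MiB (value in bytes) are rate-limited
        rate_limit_under_size = 1048576
        # serve at most one such small segment per second
        rate_limit_segments_per_sec = 1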

    * Container sync has been improved to perform a HEAD on the remote
      side of the sync for each object being synced. If the object
      exists on the remote side, container-sync will no longer
      transfer the object, thus significantly lowering the network
      requirements to use the feature.

    * The object auditor will now clean up any old, stale rsync temp
      files that it finds. These rsync temp files are left if the
      rsync process fails without completing a full transfer of an
      object. Since these files can be large, the temp files may end
      up filling a disk. The new auditor functionality will reap these
      rsync temp files if they are old. The new object-auditor config
      variable `rsync_tempfile_timeout` is the number of seconds old a
      tempfile must be before it is reaped. By default, this variable
      is set to "auto" or the rsync_timeout plus 900 seconds (falling
      back to a value of 1 day).
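
      A sketch of the new knob in the object server config (the section name
      is an assumption; the value shown is simply the documented default):

        [object-auditor]
        # reap rsync temp files older than this many seconds; "auto" means
        # rsync_timeout + 900, falling back to 1 day
        rsync_tempfile_timeout = auto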

    * The Erasure Code reconstruction process has been made more
      efficient by not syncing data files when only the durable commit
      file is missing.

    * Fixed a bug where 304 and 416 responses may not have the right
      Etag and Accept-Ranges headers when the object is stored in an
      Erasure Coded policy.

    * Versioned writes now correctly store the date of previous versions
      using GMT instead of local time.

    * The deprecated Keystone middleware option is_admin has been removed.

    * Fixed log format in object auditor.

    * The zero-byte mode (ZBF) of the object auditor will now properly
      observe the `--once` option.

    * Swift keeps track, internally, of "dirty" parts of the partition
      keyspace with a "hashes.pkl" file. Operations on this file no
      longer require a read-modify-write cycle and use a new
      "hashes.invalid" file to track dirty partitions. This change
      will improve end-user performance for PUT and DELETE operations.

    * The object replicator's succeeded and failed counts are now logged.

    * `swift-recon` can now query hosts by storage policy.

    * The log_statsd_host value can now be an IPv6 address or a hostname
      which only resolves to an IPv6 address.

    * Erasure coded fragments now properly call fallocate to reserve disk
      space before being written.

    * Various other minor bug fixes and improvements.

swift (2.6.0)

    * Dependency changes
      - Updated minimum version of eventlet to 0.17.4 to support IPv6.

      - Updated the minimum version of PyECLib to 1.0.7.

    * The ring rebalancing algorithm was updated to better handle edge cases
      and to give better (more balanced) rings in the general case. New rings
      will have better initial placement, capacity adjustments will move less
      data for better balance, and existing rings that were imbalanced should
      start to become better balanced as they go through rebalance cycles.

    * Added container and account reverse listings.

      A GET request to an account or container resource with a "reverse=true"
      query parameter will return the listing in reverse order. When
      iterating over pages of reverse listings, the relative order of marker
      and end_marker are swapped.

    * Storage policies now support having more than one name.

      This allows operators to fix a typo without breaking existing clients,
      or, alternatively, have "short names" for policies. This is implemented
      with the "aliases" config key in the storage policy config in
      swift.conf. The aliases value is a list of names that the storage
      policy may also be identified by. The storage policy "name" is used to
      report the policy to users (eg in container headers). The aliases have
      the same naming restrictions as the policy's primary name.
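
      For example (the policy index and names below are made up for
      illustration), a swift.conf stanza using the new key could look like:

        [storage-policy:1]
        name = standard
        # clients may refer to this policy by either alias as well
        aliases = silver, replicated-3x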

    * The object auditor learned the "interval" config value to control the
      time between each audit pass.

    * `swift-recon --all` now includes the config checksum check.

    * `swift-init` learned the --kill-after-timeout option to force a service
      to quit (SIGKILL) after a designated time.

    * `swift-recon` now correctly shows timestamps in UTC instead of local
      time.

    * Fixed bug where `swift-ring-builder` couldn't select device id 0.

    * Documented the previously undocumented
      `swift-ring-builder pretend_min_part_hours_passed` command.

    * The "node_timeout" config value now accepts decimal values.

    * `swift-ring-builder` now properly removes devices with zero weight.

    * `swift-init` return codes are updated via "--strict" and "--non-strict"
      options. Please see the usage string for more information.

    * `swift-ring-builder` now reports the min_part_hours lockout time
      remaining.

    * Container sync has been improved to more quickly find and iterate over
      the containers to be synced. This reduces server load and lowers the
      time required to see data propagate between two clusters. Please see
      http://swift.openstack.org/overview_container_sync.html for more details
      about the new on-disk structure for tracking synchronized containers.

    * A container POST will now update that container's put-timestamp value.

    * TempURL header restrictions are now exposed in /info.

    * Error messages on static large object manifest responses have been
      greatly improved.

    * Closed a bug where an unfinished read of a large object would leak a
      socket file descriptor and a small amount of memory. (CVE-2016-0738)

    * Fixed an issue where a zero-byte object PUT with an incorrect Etag
      would return a 503.

    * Fixed an error when a static large object manifest references the same
      object more than once.

    * Improved performance of finding handoff nodes if a zone is empty.

    * Fixed duplication of headers in Access-Control-Expose-Headers on CORS
      requests.

    * Fixed handling of IPv6 connections to memcache pools.

    * Continued work towards python 3 compatibility.

    * Various other minor bug fixes and improvements.

swift (2.5.0, OpenStack Liberty)

    * Added the ability to specify ranges for Static Large Object (SLO)
      segments.

    * Replicator configs now support an "rsync_module" value to allow
      for per-device rsync modules. This setting gives operators the
      ability to fine-tune replication traffic in a Swift cluster and
      isolate replication disk IO to a particular device. Please see
      the docs and sample config files for more information and
      examples.
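
      A minimal sketch, assuming the object replicator section and the
      template placeholders documented upstream (both are assumptions here):

        [object-replicator]
        # expands to one rsync module per device, e.g. 10.0.0.1::object_sdb1
        rsync_module = {replication_ip}::object_{device}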

    * Significant work has gone in to testing, fixing, and validating
      Swift's erasure code support at different scales.

    * Swift now emits StatsD metrics on a per-policy basis.

    * Fixed an issue with Keystone integration where a COPY request to a
      service account may have succeeded even if a service token was not
      included in the request.

    * Ring validation now warns if a placement partition gets assigned to the
      same device multiple times. This happens when devices in the ring are
      unbalanced (e.g. two servers where one server has significantly more
      available capacity).

    * Various other minor bug fixes and improvements.

swift (2.4.0)

    * Dependency changes

      - Added six requirement. This is part of an ongoing effort to add
        support for Python 3.

      - Dropped support for Python 2.6.

    * Config changes

      - Recent versions of Python restrict the number of headers allowed in a
        request to 100. This number may be too low for custom middleware. The
        new "extra_header_count" config value in swift.conf can be used to
        increase the number of headers allowed.

      - Renamed "run_pause" setting to "interval" (current configs with
        run_pause still work). Future versions of Swift may remove the
        run_pause setting.
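
      As a sketch of the first item above, assuming the value lives in the
      [swift-constraints] section of swift.conf (the changelog only names the
      file, not the section):

        [swift-constraints]
        # raise the allowed header count above Python's default limit of 100
        # (illustrative value)
        extra_header_count = 20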

    * Versioned writes middleware

      The versioned writes feature has been refactored and reimplemented as
      middleware. You should explicitly add the versioned_writes middleware to
      your proxy pipeline, but do not remove or disable the existing container
      server config setting ("allow_versions"), if it is currently enabled.
      The existing container server config setting enables existing
      containers to continue being versioned. Please see
      http://swift.openstack.org/middleware.html#how-to-enable-object-versioning-in-a-swift-cluster
      for further upgrade notes.

    * Allow 1+ object-servers-per-disk deployment

      Enabled by a new > 0 integer config value, "servers_per_port" in the
      [DEFAULT] config section for object-server and/or replication server
      configs. The setting's integer value determines how many different
      object-server workers handle requests for any single unique local port
      in the ring. In this mode, the parent swift-object-server process
      continues to run as the original user (i.e. root if low-port binding
      is required), binds to all ports as defined in the ring, and forks off
      the specified number of workers per listen socket. The child, per-port
      servers drop privileges and behave pretty much how object-server workers
      always have, except that because the ring has unique ports per disk, the
      object-servers will only be handling requests for a single disk. The
      parent process detects dead servers and restarts them (with the correct
      listen socket), starts missing servers when an updated ring file is
      found with a device on the server with a new port, and kills extraneous
      servers when their port is found to no longer be in the ring. The ring
      files are stat'ed at most every "ring_check_interval" seconds, as
      configured in the object-server config (same default of 15s).

      In testing, this deployment configuration (with a value of 3) lowers
      request latency, improves requests per second, and isolates slow disk
      IO as compared to the existing "workers" setting. To use this, each
      device must be added to the ring using a different port.
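
      A minimal sketch of the setting (3 is the value used in the testing
      described above; the exact section placement is an assumption):

        [DEFAULT]
        # fork 3 object-server workers per unique local port in the ring;
        # each device must be assigned its own port in the ring for this to help
        servers_per_port = 3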

    * Do container listing updates in another (green)thread

      The object server has learned the "container_update_timeout" setting
      (with a default of 1 second). This value is the number of seconds that
      the object server will wait for the container server to update the
      listing before returning the status of the object PUT operation.

      Previously, the object server would wait up to 3 seconds for the
      container server response. The new behavior dramatically lowers object
      PUT latency when container servers in the cluster are busy (e.g. when
      the container is very large). Setting the value too low may result in a
      client PUT'ing an object and not being able to immediately find it in
      listings. Setting it too high will increase latency for clients when
      container servers are busy.
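
      Roughly, in the object server config (the section name is an
      assumption; the value shown is the documented default):

        [app:object-server]
        # seconds to wait for the container listing update before returning
        # the status of the object PUT
        container_update_timeout = 1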

    * TempURL fixes (closes CVE-2015-5223)

      Do not allow PUT tempurls to create pointers to other data.
      Specifically, disallow the creation of DLO object manifests via a PUT
      tempurl. This prevents discoverability attacks which can use any PUT
      tempurl to probe for private data by creating a DLO object manifest and
      then using the PUT tempurl to head the object.

    * Ring changes

      - Partition placement no longer uses the port number to place
        partitions. This improves dispersion in small clusters running one
        object server per drive, and it does not affect dispersion in
        clusters running one object server per server.

      - Added ring-builder-analyzer tool to more easily test and analyze a
        series of ring management operations.

      - Stop moving partitions unnecessarily when overload is on.

    * Significant improvements and bug fixes have been made to erasure code
      support. This feature is suitable for beta testing, but it is not yet
      ready for broad production usage.

    * Bulk upload now treats user xattrs on files in the given archive as
      object metadata on the resulting created objects.

    * Emit warning log in object replicator if "handoffs_first" or
      "handoff_delete" is set.

    * Enable object replicator's failure count in swift-recon.

    * Added storage policy support to dispersion tools.

    * Support keystone v3 domains in swift-dispersion.

    * Added domain_remap information to the /info endpoint.

    * Added support for a "default_reseller_prefix" in domain_remap
      middleware config.

    * Allow SLO PUTs to forgo per-segment integrity checks. Previously, each
      segment referenced in the manifest also needed the correct etag and
      bytes setting. These fields now allow the "null" value to skip those
      particular checks on the given segment.

    * Allow rsync to use compression via a "rsync_compress" config. If set to
      true, compression is only enabled for an rsync to a device in a
      different region. In some cases, this can speed up cross-region
      replication data transfer.
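
      A sketch of the option, assuming it sits alongside the other rsync
      settings in the object replicator section:

        [object-replicator]
        # compress rsync traffic, but only towards devices in a different region
        rsync_compress = true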

    * Added time synchronization check in swift-recon (the --time option).

    * The account reaper now runs faster on large accounts.

    * Various other minor bug fixes and improvements.


swift (2.3.0, OpenStack Kilo)

    * Erasure Code support (beta)

      Swift now supports an erasure-code (EC) storage policy type. This allows
      deployers to achieve very high durability with less raw capacity than is used
      in replicated storage. However, EC requires more CPU and network
      resources, so it is not good for every use case. EC is great for storing
      large, infrequently accessed data in a single region.

      Swift's implementation of erasure codes is meant to be transparent to
      end users. There is no API difference between replicated storage and
      EC storage.

      To support erasure codes, Swift now depends on PyECLib and
      liberasurecode. liberasurecode is a pluggable library that allows for
      the actual EC algorithm to be implemented in a library of your choosing.

      As a beta release, EC support is nearly fully feature complete, but it
      is lacking support for some features (like multi-range reads) and has
      not had a full performance characterization. This feature relies on
      ssync for durability. Deployers are urged to do extensive testing and
      not deploy production data using an erasure code storage policy.

      Full docs are at http://swift.openstack.org/overview_erasure_code.html

    * Add support for container TempURL Keys.

    * Make more memcache options configurable. connection_timeout,
      pool_timeout, tries, and io_timeout are all now configurable.
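
      For illustration only (the section name and the values are assumptions;
      the four option names are the ones listed above):

        [filter:cache]
        use = egg:swift#memcache
        connection_timeout = 0.5
        pool_timeout = 1.0
        tries = 3
        io_timeout = 2.0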

    * Swift now supports composite tokens. This allows another service to
      act on behalf of a user, but only with that user's consent.
      See http://swift.openstack.org/overview_auth.html for more details.

    * Multi-region replication was improved. When replicating data to a
      different region, only one replica will be pushed per replication
      cycle. This gives the remote region a chance to replicate the data
      locally instead of pushing more data over the inter-region network.

    * Internal requests from the ratelimit middleware now properly log a
      swift_source. See http://swift.openstack.org/logs.html for details.

    * Improved storage policy support for quarantine stats in swift-recon.

    * The proxy log line now includes the request's storage policy index.

    * Ring checker has been added to swift-recon to validate if rings are
      built correctly. As part of this feature, storage servers have learned
      the OPTIONS verb.

    * Add support of x-remove- headers for container-sync.

    * Rings now support hostnames instead of just IP addresses.

    * Swift now enforces that the API version on a request is valid. Valid
      versions are configured via the valid_api_versions setting in swift.conf.
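
      A sketch of the setting (the section name and value below are
      assumptions; the changelog only names the option and the file):

        [swift-constraints]
        valid_api_versions = v1, v1.0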

    * Various other minor bug fixes and improvements.


swift (2.2.2)

    * Data placement changes

      This release has several major changes to data placement in Swift in
      order to better handle different deployment patterns. First, with an
      unbalance-able ring, less partitions will move if the movement doesn't
      result in any better dispersion across failure domains. Also, empty
      (partition weight of zero) devices will no longer keep partitions after
      rebalancing when there is an unbalance-able ring.

      Second, the notion of "overload" has been added to Swift's rings. This
      allows devices to take some extra partitions (more than would normally
      be allowed by the device weight) so that smaller and unbalanced clusters
      will have less data movement between servers, zones, or regions if there
      is a failure in the cluster.

      Finally, rings have a new metric called "dispersion". This is the
      percentage of partitions in the ring that have too many replicas in a
      particular failure domain. For example, if you have three servers in a
      cluster but two replicas for a partition get placed onto the same
      server, that partition will count towards the dispersion metric. A
      lower value is better, and the value can be used to find the proper
      value for "overload".

      The overload and dispersion metrics have been exposed in the
      swift-ring-builder CLI tools.

      See http://docs.openstack.org/developer/swift/overview_ring.html
      for more info on how data placement works now.

    * Improve replication of large out-of-sync, out-of-date containers.

    * Added console logging to swift-drive-audit with a new log_to_console
      config option (default False).

    * Optimize replication when a device and/or partition is specified.

    * Fix dynamic large object manifests getting versioned. This was not
      intended and did not work. Now it is properly prevented.

    * Fix the GET's response code when there is a missing segment in a
      large object manifest.

    * Change black/white listing in ratelimit middleware to use sysmeta.
      Instead of using the config option, operators can set
      "X-Account-Sysmeta-Global-Write-Ratelimit: WHITELIST" or
      "X-Account-Sysmeta-Global-Write-Ratelimit: BLACKLIST" on an account to
      whitelist or blacklist it for ratelimiting. Note: the existing
      config options continue to work.

    * Use TCP_NODELAY on outgoing connections.

    * Improve object-replicator startup time.

    * Implement OPTIONS verb for storage nodes.

    * Various other minor bug fixes and improvements.


swift (2.2.1)

    * Swift now rejects object names with Unicode surrogates.

    * Return 403 (instead of 413) on unauthorized upload when over account
      quota.

    * Fix a rare condition when a rebalance could cause swift-ring-builder
      to crash. This would only happen on old ring files when "rebalance"
      was the first command run.

    * Storage node error limits now survive a ring reload.

    * Speed up reading and writing xattrs for object metadata by using larger
      xattr value sizes. The change is moving from 254 byte values to 64KiB
      values. There is no migration issue with this.

    * Deleted containers beyond the reclaim age are now properly reclaimed.

    * Full Simplified Chinese translation (zh_CN locale) for errors and logs.

    * Container quota is now properly enforced during cross-account COPY.

    * ssync replication now properly uses the configured replication_ip.

    * Fixed an issue where ssync did not replicate custom object headers.

    * swift-drive-audit now has the 'unmount_failed_device' config option
      (defaults to True) that controls whether the process will unmount failed
      drives or not.

    * swift-drive-audit will now dump drive error rates to a recon file.
      The file location is controlled by the 'recon_cache_path' config value
      and it includes each drive and its associated number of errors.

    * When a filesystem doesn't support xattr, the object server now returns
      a 507 Insufficient Storage error to the proxy server.

    * Clean up account and container partition directories if they
      are empty. This keeps the system healthy and prevents a large number
      of empty directories from slowing down the replication process.

    * Show the sum of every policy's amount of async pendings in swift-recon.

    * Various other minor bug fixes and improvements.

Note that 2.10 (at least) fixes a bug introduced in 2.7 that is marked "critical" (https://bugs.launchpad.net/swift/+bug/1651530), so we might hold off on 2.7.

Event Timeline

fgiunchedi raised the priority of this task from to Normal.
fgiunchedi updated the task description.
fgiunchedi added a subscriber: fgiunchedi.
Restricted Application added a subscriber: Aklapper. · Nov 6 2015, 2:50 PM

another option of course is to (officially) backport 2.5 from stretch to jessie-backports.
in terms of upstream support, swift 2.2.0 isn't in Kilo (EOL: 2016-05-02): http://docs.openstack.org/releases/releases/kilo.html

Change 263617 had a related patch set uploaded (by Filippo Giunchedi):
swift: reinstall ms-fe3* with jessie

https://gerrit.wikimedia.org/r/263617

Change 263617 merged by Filippo Giunchedi:
swift: reinstall ms-fe3* with jessie

https://gerrit.wikimedia.org/r/263617

Change 263628 had a related patch set uploaded (by Filippo Giunchedi):
swift: adjust dependencies for jessie

https://gerrit.wikimedia.org/r/263628

Change 263629 had a related patch set uploaded (by Filippo Giunchedi):
swift: adjust mount options for debian and ubuntu

https://gerrit.wikimedia.org/r/263629

Change 263630 had a related patch set uploaded (by Filippo Giunchedi):
swift: add explicit bind_port to servers

https://gerrit.wikimedia.org/r/263630

Change 263628 merged by Filippo Giunchedi:
swift: add python-webob, remove python-swauth deps

https://gerrit.wikimedia.org/r/263628

Change 263630 merged by Filippo Giunchedi:
swift: add explicit bind_port to servers

https://gerrit.wikimedia.org/r/263630

Gilles added a subscriber: Gilles. · Jan 13 2016, 3:06 PM

yet another option proposed would be to stick with trusty but upgrade swift using openstack liberty packages from https://wiki.ubuntu.com/ServerTeam/CloudArchive
this includes upgrading ~20 ms-be/ms-fe machines in eqiad to trusty (related T123525: reduce amount of remaining Ubuntu 12.04 (precise) systems in production)

I did some upgrade + dist-upgrade testing in labs with swift-upgrade-ms-fe01 and swift-upgrade-ms-be01 to move precise -> trusty and it seems successful

Is there a specific feature from 2.5 that we need compared to the stock swift version in jessie?

We could also migrate to standard jessie first (which is still a leap forward from 1.13) and move to a later swift release with stretch in a year.

no critical 2.5 feature afaict, mostly performance related. migrating all swift to trusty first has the added bonus of getting us into a uniform situation: codfw is trusty already and so are some machines in eqiad. from there we can try upgrading swift to 2.2 or 2.5 via cloudarchive and then move to jessie, mostly to minimize the number of swift+os combinations.

Change 264275 had a related patch set uploaded (by Filippo Giunchedi):
swift: reinstall ms-fe300* with trusty

https://gerrit.wikimedia.org/r/264275

Change 264275 merged by Filippo Giunchedi:
swift: reinstall ms-fe300* with trusty

https://gerrit.wikimedia.org/r/264275

I've dist-upgraded swift in esams to trusty, the only precise machines left are ms-fe1001 -> ms-fe1004 and ms-be1001 -> ms-be1015

in terms of next steps I think we should:

  1. dist-upgrade remaining 20 machines in eqiad to trusty, get rid of precise
  2. reimage with jessie and swift 2.2 (and address uid/gid change in T123918: 'swift' user/group IDs should be consistent across the fleet while we're at it) starting e.g. with esams then codfw then eqiad

point 1. isn't strictly necessary but easy to do and moves us forward with goal T123525: reduce amount of remaining Ubuntu 12.04 (precise) systems in production
point 2. should be easy to do, modulo differences in cloudarchive vs debian swift packages. the uid change is easy to do but a lengthy operation even when done in parallel (on ms-be3001 it took ~4h to complete with chown launched in parallel on all disks)

faidon added a subscriber: faidon.Feb 23 2016, 2:09 PM

So the uid issue we should definitely fix at some point, but I don't understand exactly why it's a blocker to a jessie upgrade (= reinstall). We can install with jessie and manually add the swift user with its old uid (either add the swift user before apt-get install swift, or post-install edit /etc/passwd and chown /usr, /var etc.). We can then do the (lengthier) operation of T123918 at some other point in time in the future.

This way, we can a) do a reinstall rather than an uglier (IMO) dist-upgrade, b) just move immediately to jessie, rather than spending considerable time to upgrade to an already aged distribution (2 years old in a month!).

Change 263629 merged by Filippo Giunchedi:
swift: adjust mount options for debian and ubuntu

https://gerrit.wikimedia.org/r/263629

uid change isn't a blocker for jessie, adding swift user pre-swift (i.e. pre-puppet) seems like a good solution!

most of my worry comes from upgrading a majority of hosts in eqiad to jessie plus a new swift version and packaging (likely before the end of the quarter, considering the precise goal as well?). other than that I'm fine with going with jessie.

OK, my misunderstanding regarding the uid issue then. I'm not too worried about jessie or packaging, but I'll concede that moving from 1.13.1 to 2.2 (or 2.6), and in a rush, increases our risks significantly.

As long as we don't defer this work for much later, I wouldn't mind your plan of perhaps doing a quick upgrade to trusty as you suggested, as an immediate stopgap.

sounds good @faidon ! I've retitled T125024: upgrade 15+4 swift servers from precise to trusty accordingly (i.e. dist-upgrade to trusty).
My plan would be to resume jessie work at the beginning of next quarter, and possibly start this quarter by running esams on jessie, for example.

fgiunchedi renamed this task from swift upgrade plans to swift upgrade plans: jessie and swift 2.x. · Feb 24 2016, 4:26 PM
fgiunchedi set Security to None.

Mentioned in SAL [2016-06-14T15:26:01Z] <godog> reimage ms-fe3001 with jessie T117972

Change 294334 had a related patch set uploaded (by Filippo Giunchedi):
install_server: ms-fe300[12] on jessie

https://gerrit.wikimedia.org/r/294334

Change 294334 merged by Filippo Giunchedi:
install_server: ms-fe300[12] on jessie

https://gerrit.wikimedia.org/r/294334

Mentioned in SAL [2016-06-14T16:41:41Z] <godog> reimage ms-fe3002 with jessie T117972

Change 294489 had a related patch set uploaded (by Filippo Giunchedi):
install_server: ms-be300* to jessie

https://gerrit.wikimedia.org/r/294489

Change 294489 merged by Filippo Giunchedi:
install_server: ms-be300* to jessie

https://gerrit.wikimedia.org/r/294489

Change 294517 had a related patch set uploaded (by Filippo Giunchedi):
swift: add systemd unit file for proxy-server

https://gerrit.wikimedia.org/r/294517

Krenair added a subscriber: Krenair. · Oct 3 2016, 2:48 PM
fgiunchedi updated the task description. · Jan 5 2017, 9:05 PM

Change 294517 merged by Filippo Giunchedi:
swift: add systemd unit file for proxy-server

https://gerrit.wikimedia.org/r/294517

fgiunchedi closed this task as Invalid. · Apr 10 2017, 2:43 PM

Superseded by T162609