
create 'attended' upgrade workflow for cloud with Toolforge as canonical case
Closed, Resolved · Public

Description

@aborrero has been doing a lot of good work to shore up our lacking apt workflow across cloud instances and as a result https://gerrit.wikimedia.org/r/#/c/389480/ was merged (unattended upgrades for WMF packages).

We stepped through changes picked up in Toolforge at the time by using unattended-upgrade to generate reports and disabling puppet, with explicit enabling and runs per role. We did pick up a breaking change for elasticsearch and also left behind kernel updates for anything Jessie (tracked in T180809). We have had issues in the past that we believe were caused by host/guest kernel version mismatches for virtio causing IO freezes, and also by unattended upgrades breaking nginx during staff off-hours.

As such, we intend to manage updates during our weekly cloud clinic duty process with a set of scripts that generate a report of available updates and additionally apply them. This will be done explicitly and during the working hours of the majority of cloud admins, so we can respond to issues more appropriately in real time.

Outcomes:

Related: T180811 T177920

Details

Project            Branch      Lines +/-  Subject
operations/puppet  production  +1 -1
operations/puppet  production  +8 -1
operations/puppet  production  +54 -35
operations/puppet  production  +1 -1
operations/puppet  production  +34 -0
operations/puppet  production  +13 -1
operations/puppet  production  +2 -0
operations/puppet  production  +3 -3
operations/puppet  production  +1 -1
operations/puppet  production  +7 -5
operations/puppet  production  +1 -1
operations/puppet  production  +54 -18
operations/puppet  production  +30 -36
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +141 -110
operations/puppet  production  +138 -0
operations/puppet  production  +10 -4
operations/puppet  production  +67 -0
operations/puppet  production  +50 -7

Event Timeline

chasemp triaged this task as High priority. Nov 29 2017, 4:58 PM
chasemp updated the task description.
chasemp updated the task description. Nov 29 2017, 5:12 PM
chasemp updated the task description. Nov 29 2017, 7:53 PM

Change 394200 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] wip: toolforge: follow attended upgrade process

https://gerrit.wikimedia.org/r/394200

chasemp updated the task description. Nov 30 2017, 6:15 PM

Updated the patch by @chasemp to implement only two hiera keys: wmf and updates. Security upgrades cannot be disabled, since they come enabled by default with the unattended-upgrades package installation, and we want them always enabled anyway.

Change 394572 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: unattended-upgrades: add reporter script

https://gerrit.wikimedia.org/r/394572

In https://gerrit.wikimedia.org/r/394572 there is a script that reports every upgradeable package, per source repository, and what could be upgraded with the current unattended-upgrade configuration on the local node.
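As a rough sketch of the kind of per-source accounting such a reporter does (this is not the actual script; it assumes report lines in the pkg:arch/source oldver upgradeable to newver style shown later in this task):

```python
from collections import Counter

def count_upgrades_by_source(lines):
    """Count upgradeable packages per source repository.

    Assumes lines in the apt-show-versions style:
        pkg:arch/SOURCE oldver upgradeable to newver
    """
    counts = Counter()
    for line in lines:
        if " upgradeable to " not in line:
            continue
        pkg_and_source = line.split()[0]           # "pkg:arch/SOURCE"
        source = pkg_and_source.rsplit("/", 1)[1]  # "SOURCE"
        counts[source] += 1
    return counts

sample = [
    "debdeploy-client:all/trusty-wikimedia 0.0.99.1-1+trusty1 upgradeable to 0.0.99.2-1+trusty1",
    "ruby-rgen:all/trusty-wikimedia 0.6.6-1 upgradeable to 0.7.0-1.1~trusty1",
]
print(count_upgrades_by_source(sample))  # Counter({'trusty-wikimedia': 2})
```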

aborrero updated the task description. Dec 12 2017, 6:48 PM

Change 398079 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: unattended-upgrades: add targetted upgrades scripts

https://gerrit.wikimedia.org/r/398079

aborrero updated the task description. Dec 13 2017, 5:33 PM

Change 394200 merged by Rush:
[operations/puppet@production] cloud: setup for attended upgrade process

https://gerrit.wikimedia.org/r/394200

https://phabricator.wikimedia.org/P6464

chasemp updated the task description. Dec 14 2017, 2:08 PM

Change 394572 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: unattended-upgrades: add reporter script

https://gerrit.wikimedia.org/r/394572

aborrero updated the task description. Dec 15 2017, 12:41 PM

I'm not happy with the reporter script: it returns loads of data when run with clush in the tools project. Lots of nodes, Ubuntu with lots of pending upgrades, lots of output. It doesn't scale well in this environment.

Change 398458 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: report-pending-upgrades.sh: add verbosity flag

https://gerrit.wikimedia.org/r/398458

Change 398458 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: report-pending-upgrades.sh: add verbosity flag

https://gerrit.wikimedia.org/r/398458

aborrero updated the task description. Dec 19 2017, 3:34 PM

About the patch for the wrapper tool to do targeted upgrades, @faidon added some comments in gerrit (https://gerrit.wikimedia.org/r/#/c/398079/) and I would like to discuss this here better than there.

  • Use our own mirror, mirrors.wikimedia.org instead of deb.debian.org, faster, easier :)

Sure! thanks

  • This shouldn't be deployed in prod, just in case someone runs this accidentally. Maybe add a labs_apt module with all the labs-specific bits?

Sure, no problem. Perhaps we could put this only in Cloud VPS instances. But anyway, there are many single commands that could potentially hurt/destroy a system if called by mistake.

  • It feels a little odd to have specific apt sources.list here. It's not very DRY, could potentially conflict with the system's configuration and ultimately with what's in configuration management (puppet).

Well, since it is all in the same repo (puppet), one should simply take care to use consistent repos when doing commits?

  • Even past that, sources seem to be hardcoded in the source now, instead of e.g. a configuration file.

I tried to avoid that on purpose, since that would add a lot of config files (2 per source 'channel').

  • I only gave this a cursory look, and I may misunderstand how it works, but I think it just generates a sources.list with only one distribution at a time and instructs apt to use it. If that's the case indeed, won't work. Pin-priorities only make sense if apt has a full view of all the different sources; looking at individual sources one at a time will not have the intended result, I think. For instance, if you have puppet 3.8-1 in stretch-wikimedia and puppet 4.8-1 in stretch, by looking at stretch alone, apt will install 4.8-1, but if you have both in your sources, apt will prefer the stretch-wikimedia one because of pinning.

The script is a shortcut, not meant to be 'intelligent' in any way, so those kinds of problems are to be resolved by the admin. For security upgrades and volatile (our main use case) this just works; if not, additional preferences/priorities could be added (yes, hardcoded or whatever) for apt to have enough context.

If not this approach, then according to our requirements and use cases (see task description), what workflow would you implement to easily keep our systems updated?
We would like to be able to do something like: Hey tools cluster, upgrade [simulate] all volatile/security packages right now. Hey tools cluster, upgrade [simulate] all wmf packages right now.

faidon added a comment (edited). Dec 21 2017, 6:16 PM
  • It feels a little odd to have specific apt sources.list here. It's not very DRY, could potentially conflict with the system's configuration and ultimately with what's in configuration management (puppet).

Well, since it is all in the same repo (puppet), one should simply take care to use consistent repos when doing commits?

Well there are two things broadly that could be done better here: a) decouple code from configuration, b) integrate configuration with configuration management, possibly centralizing them in the same place. The fact that the duplicated information is on the same repository doesn't make it less duplicated or less prone to drift over time, unfortunately.

  • I only gave this a cursory look, and I may misunderstand how it works, but I think it just generates a sources.list with only one distribution at a time and instructs apt to use it. If that's the case indeed, won't work. Pin-priorities only make sense if apt has a full view of all the different sources; looking at individual sources one at a time will not have the intended result, I think. For instance, if you have puppet 3.8-1 in stretch-wikimedia and puppet 4.8-1 in stretch, by looking at stretch alone, apt will install 4.8-1, but if you have both in your sources, apt will prefer the stretch-wikimedia one because of pinning.

The script is a shortcut, not meant to be 'intelligent' in any way, so those kinds of problems are to be resolved by the admin. For security upgrades and volatile (our main use case) this just works; if not, additional preferences/priorities could be added (yes, hardcoded or whatever) for apt to have enough context.

If not this approach, then according to our requirements and use cases (see task description), what workflow would you implement to easily keep our systems updated?
We would like to be able to do something like: Hey tools cluster, upgrade [simulate] all volatile/security packages right now. Hey tools cluster, upgrade [simulate] all wmf packages right now.

Unless I'm misunderstanding the design, it will not work at all right now and will either result in packages being uninstallable or unrelated packages being upgraded or messed with.

The Puppet example I mentioned: we had Puppet 3.8 in stretch-wikimedia, but stretch shipped 4.8.2. "apt upgrade" would keep that as-is (and even do a downgrade). By running an upgrade with only Debian sources (or possibly only security sources, should a Puppet DSA find its way to security.d.o), puppet would be inadvertently upgraded (and would break puppet on that VM).

I also don't think this will work for distributions that are partial. Sometimes new versions of packages posted on security.d.o depend on a new, previously not installed package, e.g. puppet 4 being introduced in jessie-wikimedia, which would depend on a new version of a Ruby module, to be installed from jessie. It's rare, but this kind of thing happens sometimes with security.d.o updates as well.

I don't have any ideas to address that workflow, but I haven't thought much about it and don't have experience with those kinds of Toolforge issues. I understand having to make choices at the package level (e.g. dpkg-holding packages back or rolling specific package updates manually), but I don't really understand the need for doing en-masse upgrades, but only doing them per source repository. Why is that needed?

  • It feels a little odd to have specific apt sources.list here. It's not very DRY, could potentially conflict with the system's configuration and ultimately with what's in configuration management (puppet).

Well, since all is in the same repo (puppet), one simply should take care to use consistent repos when doing commits?

Well there are two things broadly that could be done better here: a) decouple code from configuration, b) integrate configuration with configuration management, possibly centralizing them in the same place. The fact that the duplicated information is on the same repository doesn't make it less duplicated or less prone to drift over time, unfortunately.

You are right on a), and probably on b) as well. I say 'probably' because in this case configuration means one to three lines of config or so (repo URL, prio/pref). That's why I started with the hardcoding approach.

It might not be intelligent, but unless I'm misunderstanding the design, it will not work at all right now and will either result in packages being uninstallable or unrelated packages being upgraded or messed with.

The Puppet example I mentioned: we had Puppet 3.8 in stretch-wikimedia, but stretch shipped 4.8.2. "apt upgrade" would keep that as-is (and even do a downgrade). By running an upgrade with only Debian sources (or possibly only security sources, should a Puppet DSA find its way to security.d.o), puppet would be inadvertently upgraded (and would break puppet on that VM).

What about putting key packages like puppet on 'hold', so they aren't affected at all by apt operations?
We could have a puppet-managed list of key packages to hold, if that doesn't exist already, and let every other non-key package be upgraded to get security updates, bug fixes, et al.
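A minimal model of that idea (in practice the hold would be done with apt-mark hold or dpkg --set-selections, probably managed from puppet; the function and data here are hypothetical):

```python
def filter_held(candidates, held):
    """Drop held key packages from a set of upgrade candidates.

    'candidates' maps package name -> candidate version; 'held' is the
    list of key packages (puppet, the kernel, ...) that bulk apt
    operations should never touch.
    """
    return {pkg: ver for pkg, ver in candidates.items() if pkg not in held}

candidates = {"puppet": "4.8.2-5", "openssl": "1.1.0f-3"}
print(filter_held(candidates, ["puppet"]))  # {'openssl': '1.1.0f-3'}
```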

I also don't think this will work for distributions that are partial. Sometimes new versions of packages posted on security.d.o depend on a new, previously not installed package, e.g. puppet 4 being introduced in jessie-wikimedia, which would depend on a new version of a Ruby module, to be installed from jessie. It's rare, but this kind of thing happens sometimes with security.d.o updates as well.

In this case, it is just a matter of finding the right combination of repos and prios/prefs.

In the case of security.d.o, we have this repo file:

deb http://security.debian.org stretch/updates main contrib non-free
deb http://deb.debian.org/debian/ stretch main contrib non-free

And this prio/prefs:

Package: *
Pin: release l=Debian-Security
Pin-Priority: 990

Package: *
Pin: release l=Debian
Pin-Priority: 99

This can be translated to: upgrade all packages from security, with preference for packages coming from security. It will only pick packages from the other repo if required by deps. Is this true? The thing is that right now I can't find a single instance with pending security upgrades to confirm this.
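A toy model of the pinning logic being discussed may help. This is a simplification (real apt uses Debian version ordering and has extra rules, e.g. for priorities below 100), but it shows why 990 vs 99 prefers security even when plain stretch carries a newer version:

```python
def pick_candidate(versions):
    """Pick the candidate among (version, origin, pin_priority) tuples.

    Simplified apt pinning model: the highest pin priority wins; among
    equal priorities the newer version wins.  Plain string comparison
    stands in for Debian version ordering, which suffices here.
    """
    return max(versions, key=lambda v: (v[2], v[0]))

versions = [
    ("1.0.2-1", "Debian-Security", 990),  # stretch/updates, pinned 990
    ("1.0.3-1", "Debian", 99),            # plain stretch, pinned 99
]
print(pick_candidate(versions))  # ('1.0.2-1', 'Debian-Security', 990)
```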

I don't have any ideas to address that workflow, but I haven't thought much about it or experienced the issues you've been having. I understand having to make choices at the package level (e.g. dpkg-holding packages back or rolling specific package updates manually), but I don't really understand the need for doing en-masse upgrades, but only doing them per source repository.

We have many, many instances/VMs, and no package upgrades on them :-( This is something we would like to improve.

Another approach could be to have a central repository, allow packages to enter the repo after review and evaluation (along with deps), and have the instances/VMs unconditionally upgrade from this central repository.

Change 398079 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: unattended-upgrades: add targetted upgrades script

https://gerrit.wikimedia.org/r/398079

Mentioned in SAL (#wikimedia-cloud) [2018-01-17T14:09:47Z] <arturo> T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

aborrero updated the task description. Jan 17 2018, 2:13 PM

So, this task seems complete. Perhaps we should wait before closing it until we know whether the docs are in shape and the workflow really fulfills our use cases.

chasemp added a comment (edited). Jan 30 2018, 2:31 PM

I walked through this this morning and have a few thoughts. Great stuff overall.

  • Why the decision to split report-pending-upgrades and apt-upgrade across two scripts? Some of this feedback is moot if that's revisited.
  • report-pending-upgrades doesn't handle -h, making it "do things" when given a -h flag
  • apt-upgrade -h: let's change the help from Run a targeted upgrade of packages to Run a targeted upgrade of packages by source
  • For apt-upgrade, is there a way we can pass an option to list available sources? At the moment, if you don't have the sources at hand from some other means, there isn't a way to get started. This allows agnostic source discovery across distros/releases and also an 'all' option.
  • The report-pending-upgrades output is difficult to manage/grep/sort when used via clush for a cluster of hosts:
root@tools-bastion-03:~# /usr/local/sbin/report-pending-upgrades
I: upgradeable packages from trusty-security: 1
I: upgradeable packages from trusty-tools: 1
I: upgradeable packages from trusty-updates: 36
I: upgradeable packages from trusty-wikimedia: 2
I: 40 upgradeable packages, 0 upgradeable packages by unatteneded-upgrades

Standard output across 100 hosts isn't very greppable / sortable.

  1. Can we have a way to put the hostname in front of each line so that it can be combined and grepped on the fly from a source execution host?
  2. Mix of colon and space separation

-v output for report-pending-upgrades is roughly the same

I: upgradeable packages from trusty-wikimedia: 2

  debdeploy-client:all/trusty-wikimedia 0.0.99.1-1+trusty1 upgradeable to 0.0.99.2-1+trusty1
  ruby-rgen:all/trusty-wikimedia 0.6.6-1 upgradeable to 0.7.0-1.1~trusty1
  1. Mix of colon and space separators
  2. Can each candidate package line be:

    <source> <package> <version> <candidate_version>

    With flag options to output as:

    <hostname> <source> <package> <version> <candidate_version>
  • Typo in report-pending-upgrades with I: 40 upgradeable packages, 0 upgradeable packages by unatteneded-upgrades
  • I think this happens if some other apt activity is in progress. Can we have this fail more cleanly so it's clear that it's a pass but not an errant state?
sudo report-pending-upgrades
E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/
Problem renaming the file /var/cache/apt/pkgcache.bin.mvgq1U to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
You may want to run apt-get update to correct these problems
Can't call method "policy" on an undefined value at /usr/bin/apt-show-versions line 56.
  • Is it possible for apt-upgrade to have a [y] style confirmation without a -y passed option?
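The report line format proposed above could be sketched like this (function name hypothetical):

```python
def format_report_line(source, package, version, candidate, hostname=None):
    """Emit the proposed space-separated report line:
        [<hostname>] <source> <package> <version> <candidate_version>
    so clush output from many hosts can be combined, grepped and sorted.
    """
    fields = [source, package, version, candidate]
    if hostname:
        fields.insert(0, hostname)
    return " ".join(fields)

line = format_report_line("trusty-wikimedia", "ruby-rgen",
                          "0.6.6-1", "0.7.0-1.1~trusty1",
                          hostname="tools-bastion-03")
print(line)
# tools-bastion-03 trusty-wikimedia ruby-rgen 0.6.6-1 0.7.0-1.1~trusty1
```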

Probably makes sense to walk me through the workflow for upgrading all of Toolforge with what we have now. Something like:

  1. Spot check an instance of each "type". Exec, Static, Worker, Lighttpd, Bastion, Checker, etc.
  2. Review available upgrades. [assuming good]
  3. Upgrade each canary [per source probably?]
  4. Spot check that it's good
  5. Run report across pool of each type looking for outlier packages that didn't show up on the canary
  6. Upgrade each node type [per source?]
  7. Review and upgrade SPOFs

Change 407465 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] WIP: apt: merge script report-pending-upgrades to apt-upgrade

https://gerrit.wikimedia.org/r/407465

Change 407465 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: merge report-pending-upgrades script into apt-upgrade

https://gerrit.wikimedia.org/r/407465

Change 409086 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: fix confirmation prompt of apt-upgrade

https://gerrit.wikimedia.org/r/409086

Change 409086 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: fix confirmation prompt of apt-upgrade

https://gerrit.wikimedia.org/r/409086

Change 409322 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: sort output of the list operation

https://gerrit.wikimedia.org/r/409322

Change 409322 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: sort output of the list operation

https://gerrit.wikimedia.org/r/409322

Change 409323 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: add switch for the node name output

https://gerrit.wikimedia.org/r/409323

@chasemp and I agreed on the following next steps towards the new workflow:

  • Have a way to exclude packages from apt-upgrade upgrade <suite>

It should read a file with one package regexp per line. The script will create temporary version pinnings for these packages while doing the upgrade, so they won't be touched.
The call would be something like apt-upgrade upgrade <suite> -f file_with_exclusions

  • Put long term version pinning in puppet.

Packages we know we are not going to upgrade on servers (like the linux kernel, due to specific modules, etc.), so we don't have to exclude them every time in the apt-upgrade run.
Currently, modules/apt/manifests/pin.pp supports pinning by version, but we are not pinning any package by version, only by release (i.e. linux from stable). As stable can have several versions of the same package in the archive, async updates of servers could end up with different versions on different servers.
We could, for example, have all tools-exec-* nodes on linux 4.9.whatever and bump the version in puppet when we see an upgraded upstream (Debian) release of the package.
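The exclusion-file handling described above could be sketched as follows (names hypothetical; the real script would turn matches into temporary pins during the run):

```python
import re

def load_exclusions(text):
    """Parse an exclusion file with one package regexp per line."""
    return [re.compile(line.strip())
            for line in text.splitlines() if line.strip()]

def excluded(package, patterns):
    """True if any exclusion regexp matches the whole package name."""
    return any(p.fullmatch(package) for p in patterns)

patterns = load_exclusions("linux-image-.*\nnginx.*\n")
print(excluded("linux-image-4.9.0-5-amd64", patterns))  # True
print(excluded("openssl", patterns))                    # False
```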

  • Have a way to exclude packages from apt-upgrade upgrade <suite>

It should read a file with one package regexp per line. The script will create temporary version pinnings for these packages while doing the upgrade, so they won't be touched.
The call would be something like apt-upgrade upgrade <suite> -f file_with_exclusions

Update: there is no way to generate the pinning with the current implementation. I've been using the apt.Cache [0] class to do all the work, but python-apt [1] can only generate pinning when using the apt_pkg.Cache [2] class, which is lower level.
Switching to the lower-level one implies a major rewrite of the script, which can be done but will take me longer.

[0] https://apt.alioth.debian.org/python-apt-doc/library/apt.cache.html
[1] https://apt.alioth.debian.org/python-apt-doc/library/
[2] https://apt.alioth.debian.org/python-apt-doc/library/apt_pkg.html#apt_pkg.Policy

Change 409323 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: add switch for the node name output

https://gerrit.wikimedia.org/r/409323

Change 410159 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: add package exclusion by reading a file

https://gerrit.wikimedia.org/r/410159

Change 410159 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: add package exclusion by reading a file

https://gerrit.wikimedia.org/r/410159

Update: there is no way to generate the pinning with the current implementation. I've been using the apt.Cache [0] class to do all the work, but python-apt [1] can only generate pinning when using the apt_pkg.Cache [2] class, which is lower level.
Switching to the lower-level one implies a major rewrite of the script, which can be done but will take me longer.

[0] https://apt.alioth.debian.org/python-apt-doc/library/apt.cache.html
[1] https://apt.alioth.debian.org/python-apt-doc/library/
[2] https://apt.alioth.debian.org/python-apt-doc/library/apt_pkg.html#apt_pkg.Policy

An alternative was implemented in https://gerrit.wikimedia.org/r/410159 using mark_keep(), which seems to cover our use case.

Change 410232 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: logging messages go to stdout and stderr

https://gerrit.wikimedia.org/r/410232

Change 410232 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: logging messages go to stdout and stderr

https://gerrit.wikimedia.org/r/410232

Change 411330 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: cleanup report output

https://gerrit.wikimedia.org/r/411330

Change 411330 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: cleanup report output

https://gerrit.wikimedia.org/r/411330

Arturo and I synced up today on ongoing things. We landed the initial pinning patch but figured out os_version was being used as a fact instead of a function. With a quick revision all seems well, with a few mysteries remaining:

  • nginx pinning seems not to be effective atm using nginx-* matching
  • discovered kubernetes-client package was never upgraded for minor version https://gerrit.wikimedia.org/r/c/413213/
  • allow multiple source repos to be passed to apt-upgrade for sanity

Then we need to do a survey and make sure there are no more pinning additions we need to make before finishing the upgrades. Arturo has already done canaries of every type, I believe.

On checking I don't know if the kubernetes-client package pinning is having the effect we expect.

see https://gerrit.wikimedia.org/r/c/413213/

So I expect 1.4.6-3 to be version pinned

root@tools-bastion-05:~# apt-cache policy kubernetes-client
kubernetes-client:
  Installed: 1.4.6-3
  Candidate: 1.4.6-6
  Package pin: 1.4.6-3
  Version table:
     1.4.6-6 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
     1.4.6-4 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
     1.4.6-3 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
 *** 1.4.6-3 1001
       1001 http://apt.wikimedia.org/wikimedia/ trusty-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension

This makes me think it wants to upgrade to 1.4.6-6

which I think may be confirmed with

root@tools-bastion-05:~# apt-upgrade -u list
tools-bastion-05: trusty-tools, trusty-updates, trusty-wikimedia
root@tools-bastion-05:~# apt-upgrade -u report trusty-tools
tools-bastion-05: trusty-tools: kubernetes-client 1.4.6-3 --> 1.4.6-6
tools-bastion-05: trusty-tools: openssh-sftp-server 1:6.6p1-2ubuntu2.10 --> 1:6.9p1-2~trusty1
root@tools-bastion-05:~#

Change 413348 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: fix typo in nginx-* version pinning for jessie

https://gerrit.wikimedia.org/r/413348

Change 413348 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: fix typo in nginx-* version pinning for jessie

https://gerrit.wikimedia.org/r/413348

Arturo and I synced up today on ongoing things. We landed the initial pinning patch but figured out os_version was being used as a fact instead of a function. With a quick revision all seems well, with a few mysteries remaining:

  • nginx pinning seems not to be effective atm using nginx-* matching

This one is solved. There was a typo in the version string itself (see patch).

On checking I don't know if the kubernetes-client package pinning is having the effect we expect.

see https://gerrit.wikimedia.org/r/c/413213/

So I expect 1.4.6-3 to be version pinned

root@tools-bastion-05:~# apt-cache policy kubernetes-client
kubernetes-client:
  Installed: 1.4.6-3
  Candidate: 1.4.6-6
  Package pin: 1.4.6-3
  Version table:
     1.4.6-6 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
     1.4.6-4 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
     1.4.6-3 1001
       1500 http://tools-services-01.tools.eqiad.wmflabs/repo/ trusty-tools/main amd64 Packages
 *** 1.4.6-3 1001
       1001 http://apt.wikimedia.org/wikimedia/ trusty-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension

This makes me think it wants to upgrade to 1.4.6-6

which I think may be confirmed with

root@tools-bastion-05:~# apt-upgrade -u list
tools-bastion-05: trusty-tools, trusty-updates, trusty-wikimedia
root@tools-bastion-05:~# apt-upgrade -u report trusty-tools
tools-bastion-05: trusty-tools: kubernetes-client 1.4.6-3 --> 1.4.6-6
tools-bastion-05: trusty-tools: openssh-sftp-server 1:6.6p1-2ubuntu2.10 --> 1:6.9p1-2~trusty1
root@tools-bastion-05:~#

After a bit of investigation I think I have an answer for this one.

First, we have two kinds of pinning involved here:

  • repo/origin pinning (project aptly and wikimedia repo)
  • package version pinning (the one we just added)

If I adjust the different pinnings so that our own version pinning is the strongest one, then what happens is that apt tries to move the package to another origin, i.e.

aborrero@tools-bastion-05:~$ sudo apt-upgrade -u upgrade trusty-tools
tools-bastion-05: trusty-tools: kubernetes-client 1.4.6-3 --> 1.4.6-3 
commit changes? [y/N]:

In other words: remove the package from trusty-wikimedia (apt.wikimedia.org) and install it from trusty-tools (tools-services-01):

aborrero@tools-bastion-05:~$ sudo apt-get install kubernetes-client -s
[...]
The following packages will be upgraded:
  kubernetes-client
1 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.
Inst kubernetes-client [1.4.6-3] (1.4.6-3 trusty-tools:trusty-tools [amd64])
Conf kubernetes-client (1.4.6-3 trusty-tools:trusty-tools [amd64])

We actually serve kubernetes-client v1.4.6-3 from two different repos (apt.wikimedia.org and tools-services-01), and I bet the resolver is just trying to move to the most recent package (by timestamp); since the versions are the same, everything is fine and there are no policy violations.

Options I see:

  • bump the pinning value to something bigger (e.g. 2000) so we override the archive origin value. A patch will follow shortly for this.
  • remove the kubernetes-client package from tools-services-01
  • let the resolver upgrade the package. It should be harmless since the version is exactly the same. I'm not sure what the next step would be, i.e., yet another upgrade to the same version from the first repo? Hopefully not.
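To illustrate the first option, here is a toy model of the candidate selection (hypothetical numbers and simplified rules; string comparison stands in for Debian version ordering and real apt is more involved):

```python
def candidate_origin(versions):
    """Simplified apt pinning: among (version, origin, priority) tuples,
    the highest priority wins, with ties going to the newer version."""
    return max(versions, key=lambda v: (v[2], v[0]))[1]

# 1.4.6-3 is served from both repos; the project aptly repo wins at
# priority 1500 over our 1001 version pin, so apt "moves" the package:
before = [("1.4.6-3", "trusty-wikimedia", 1001),
          ("1.4.6-3", "trusty-tools", 1500)]
print(candidate_origin(before))  # trusty-tools

# Bumping the version pin above the origin priority (e.g. 2000)
# keeps apt on the trusty-wikimedia copy:
after = [("1.4.6-3", "trusty-wikimedia", 2000),
         ("1.4.6-3", "trusty-tools", 1500)]
print(candidate_origin(after))   # trusty-wikimedia
```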

Change 413357 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: bump pinning value for kubernetes packages

https://gerrit.wikimedia.org/r/413357

Change 413357 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: bump pinning value for kubernetes packages

https://gerrit.wikimedia.org/r/413357

Change 414649 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt_upgrade: include link to wikitech docs

https://gerrit.wikimedia.org/r/414649

Change 414649 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt_upgrade: include link to wikitech docs

https://gerrit.wikimedia.org/r/414649

Change 414657 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: tools-clush-generator: introduce clush group 'one_of_each'

https://gerrit.wikimedia.org/r/414657

Change 414983 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: include clush hosts group for canary servers

https://gerrit.wikimedia.org/r/414983

Change 414657 abandoned by Arturo Borrero Gonzalez:
toollabs: tools-clush-generator: introduce clush group 'one_of_each'

Reason:
Superseded by https://gerrit.wikimedia.org/r/#/c/414983/

https://gerrit.wikimedia.org/r/414657

Change 414983 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: include clush hosts group for canary servers

https://gerrit.wikimedia.org/r/414983

Change 414985 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: fix path of new canary server list

https://gerrit.wikimedia.org/r/414985

Change 414985 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: fix path of new canary server list

https://gerrit.wikimedia.org/r/414985

Today I cleaned up kernel packages using apt-get autoremove and apt-get autoclean on these nodes:

tools-logs-02.tools.eqiad.wmflabs
tools-redis-1002.tools.eqiad.wmflabs
tools-cron-01.tools.eqiad.wmflabs
tools-static-13.tools.eqiad.wmflabs
tools-exec-1442.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs
tools-docker-builder-05.tools.eqiad.wmflabs
tools-grid-master.tools.eqiad.wmflabs
tools-mail.tools.eqiad.wmflabs
tools-k8s-etcd-03.tools.eqiad.wmflabs
tools-k8s-master-01.tools.eqiad.wmflabs
tools-paws-master-01.tools.eqiad.wmflabs
tools-worker-1027.tools.eqiad.wmflabs
tools-paws-worker-1019.tools.eqiad.wmflabs
tools-prometheus-02.tools.eqiad.wmflabs
tools-services-02.tools.eqiad.wmflabs
tools-bastion-03.tools.eqiad.wmflabs
tools-proxy-02.tools.eqiad.wmflabs
tools-webgrid-generic-1404.tools.eqiad.wmflabs
tools-grid-shadow.tools.eqiad.wmflabs
tools-checker-01.tools.eqiad.wmflabs
tools-flannel-etcd-03.tools.eqiad.wmflabs

But inspecting the status of kernel packages in the cluster again, a lot of servers have no apt pinning, which makes working with them a pain:

aborrero@tools-clushmaster-01:~$ clush -w @all "[ ! -r /etc/apt/preferences.d/toolforge_linux_pinning.pref ] && echo 'no apt pinning' || true"
tools-clushmaster-01.tools.eqiad.wmflabs: no apt pinning
tools-docker-registry-02.tools.eqiad.wmflabs: no apt pinning
tools-docker-builder-05.tools.eqiad.wmflabs: no apt pinning
tools-elastic-03.tools.eqiad.wmflabs: no apt pinning
tools-elastic-01.tools.eqiad.wmflabs: no apt pinning
tools-docker-registry-01.tools.eqiad.wmflabs: no apt pinning
tools-elastic-02.tools.eqiad.wmflabs: no apt pinning
tools-flannel-etcd-03.tools.eqiad.wmflabs: no apt pinning
tools-logs-02.tools.eqiad.wmflabs: no apt pinning
tools-k8s-etcd-02.tools.eqiad.wmflabs: no apt pinning
tools-k8s-etcd-01.tools.eqiad.wmflabs: no apt pinning
tools-package-builder-01.tools.eqiad.wmflabs: no apt pinning
tools-k8s-etcd-03.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1001.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1003.tools.eqiad.wmflabs: no apt pinning
tools-static-11.tools.eqiad.wmflabs: ssh: Could not resolve hostname tools-static-11.tools.eqiad.wmflabs: Name or service not known
clush: tools-static-11.tools.eqiad.wmflabs: exited with exit code 255
tools-paws-worker-1005.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1002.tools.eqiad.wmflabs: no apt pinning
tools-paws-master-01.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1006.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1007.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1010.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1013.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1016.tools.eqiad.wmflabs: no apt pinning
tools-prometheus-01.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1017.tools.eqiad.wmflabs: no apt pinning
tools-prometheus-02.tools.eqiad.wmflabs: no apt pinning
tools-paws-worker-1019.tools.eqiad.wmflabs: no apt pinning
tools-redis-1001.tools.eqiad.wmflabs: no apt pinning
tools-redis-1002.tools.eqiad.wmflabs: no apt pinning
tools-static-12.tools.eqiad.wmflabs: no apt pinning
tools-flannel-etcd-01.tools.eqiad.wmflabs: no apt pinning
tools-static-13.tools.eqiad.wmflabs: no apt pinning
tools-flannel-etcd-02.tools.eqiad.wmflabs: no apt pinning

Change 417779 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: add -x/--exclude option

https://gerrit.wikimedia.org/r/417779

Change 417779 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: add -x/--exclude option

https://gerrit.wikimedia.org/r/417779
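The new -x/--exclude option makes it possible to leave specific packages out of an upgrade run (e.g. holding back kernel packages until a coordinated reboot). The real implementation lives in the apt-upgrade script in operations/puppet; the sketch below only illustrates the kind of glob-based filtering such an option typically performs, with hypothetical package names:

```python
import fnmatch

def filter_excluded(packages, exclude_patterns):
    """Drop packages whose names match any exclusion glob.

    Mirrors the behaviour one would expect from an
    `apt-upgrade -x/--exclude` flag; not the real implementation.
    """
    return [
        pkg for pkg in packages
        if not any(fnmatch.fnmatch(pkg, pat) for pat in exclude_patterns)
    ]

# Hypothetical list of upgradable packages on a Toolforge node
upgradable = ["linux-image-4.9.0-6-amd64", "nginx-common", "kubelet"]

# Exclude kernel packages from this run
print(filter_excluded(upgradable, ["linux-image-*"]))
# → ['nginx-common', 'kubelet']
```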

Change 418893 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: capture exception when creating cache

https://gerrit.wikimedia.org/r/418893

Change 418893 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: capture exception when creating cache

https://gerrit.wikimedia.org/r/418893
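The change above adds error handling around apt cache creation, which can fail outside the script's control (for instance when another process holds the dpkg lock, or the package lists are corrupt). Since python-apt is not a stdlib module, this sketch uses a stand-in cache factory purely to show the pattern; the function and names are hypothetical:

```python
import sys

def open_cache(cache_factory):
    """Create the apt cache, failing cleanly instead of tracebacking.

    `cache_factory` stands in for `apt.Cache`; the real script would
    catch the python-apt exception type rather than bare Exception.
    """
    try:
        return cache_factory()
    except Exception as exc:
        print("E: could not create apt cache: {}".format(exc), file=sys.stderr)
        return None

# Simulate a failure, e.g. the dpkg lock being held by another process
def broken_factory():
    raise RuntimeError("could not open lock file /var/lib/dpkg/lock")

print(open_cache(broken_factory))
# → None, with an error message on stderr
```

Returning None (or exiting with a clear message) lets the weekly clinic-duty run report the broken host and continue, rather than dying mid-loop with a stack trace.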

Change 418895 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrade: fix typo in comment for documentation

https://gerrit.wikimedia.org/r/418895

Change 418895 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrade: fix typo in comment for documentation

https://gerrit.wikimedia.org/r/418895

aborrero closed this task as Resolved.Jun 7 2018, 11:08 AM

The workflow should now be in place. As we use it in future operations we can re-evaluate it and make further improvements.