Container image lifecycle management
Open, Medium, Public

Description

With our production software continuing to move to containers running on Kubernetes, we need to build out the tools to help us manage the lifecycle of container images. In particular, an image catalog service will record all versions of all images built and published in the registry.

The image catalog will record the build's dependencies (other Docker images as well as Debian packages) when the build is registered by docker-pkg or blubber, and will integrate with tools like Clair and Debmonitor for managing known security vulnerabilities.

The image catalog will also periodically check which images (and which versions) are currently running in production, in order to track images in need of an update (especially when a security update is available for one of their dependencies).
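As a rough illustration of the tracking described above, here is a minimal sketch in Python. The `ImageBuild` schema, the `images_needing_update` helper, and the `fixed_versions` mapping are hypothetical names invented for this example and are not taken from the imagecatalog code. Version comparison is simplified to plain inequality; real Debian version ordering is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ImageBuild:
    """One registered build of an image (hypothetical schema)."""
    name: str
    tag: str
    # Debian packages in the image, as {package_name: installed_version}.
    debian_packages: dict = field(default_factory=dict)
    # Parent Docker images, as "name:tag" strings.
    parent_images: list = field(default_factory=list)


def images_needing_update(builds, running, fixed_versions):
    """Return the sorted "name:tag" refs that are running and contain a
    package whose installed version differs from the known fixed version.

    builds: iterable of ImageBuild records known to the catalog.
    running: iterable of "name:tag" refs seen in production.
    fixed_versions: {package_name: version_containing_the_fix}.
    """
    by_ref = {f"{b.name}:{b.tag}": b for b in builds}
    stale = set()
    for ref in running:
        build = by_ref.get(ref)
        if build is None:
            continue  # running image was never registered; ignored here
        for pkg, version in build.debian_packages.items():
            fixed = fixed_versions.get(pkg)
            # Simplification: any version other than the fixed one is stale.
            if fixed is not None and version != fixed:
                stale.add(ref)
                break
    return sorted(stale)
```

Given a catalog of builds and the set of refs observed in production, the function returns only those running images that carry an unfixed package.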

For more details see the original design doc.

Event Timeline

RLazarus triaged this task as Medium priority. Jul 21 2021, 10:32 PM
RLazarus created this task.

Change 706048 had a related patch set uploaded (by RLazarus; author: RLazarus):

[integration/config@master] zuul: Add a new project for operations/docker-images/imagecatalog

https://gerrit.wikimedia.org/r/706048

Change 706048 merged by jenkins-bot:

[integration/config@master] Zuul: [operations/docker-images/imagecatalog] add tox-docker

https://gerrit.wikimedia.org/r/706048

Change 723663 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/imagecatalog@master] Minimal version of the image catalog

https://gerrit.wikimedia.org/r/723663

Change 723663 merged by jenkins-bot:

[operations/docker-images/imagecatalog@master] Minimal version of the image catalog

https://gerrit.wikimedia.org/r/723663

Change 742574 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Install and configure OCI image catalog on deploy hosts

https://gerrit.wikimedia.org/r/742574

Change 745196 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog

https://gerrit.wikimedia.org/r/745196

Change 745208 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add imagecatalog user to main and ml

https://gerrit.wikimedia.org/r/745208

Change 745208 merged by JMeybohm:

[labs/private@master] Add imagecatalog user to main and ml

https://gerrit.wikimedia.org/r/745208

Change 745202 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Use dedicated imagecatalog kubernetes user

https://gerrit.wikimedia.org/r/745202

Change 745196 merged by jenkins-bot:

[operations/deployment-charts@master] RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog

https://gerrit.wikimedia.org/r/745196

Mentioned in SAL (#wikimedia-operations) [2021-12-15T17:44:32Z] <jayme> deployed imagecatalog RBAC rules to all k8s clusters - T287130

Change 745202 abandoned by JMeybohm:

[operations/puppet@production] Use dedicated imagecatalog kubernetes user

Reason:

Merged into I2ebf9c25d31334cbb6aad5e4de9293b6ab1d5cdc

https://gerrit.wikimedia.org/r/745202

Change 742574 merged by RLazarus:

[operations/puppet@production] imagecatalog: Install and configure OCI image catalog on deploy hosts

https://gerrit.wikimedia.org/r/742574

Change 747566 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Fix outdated TODO comment

https://gerrit.wikimedia.org/r/747566

Change 747566 merged by RLazarus:

[operations/puppet@production] imagecatalog: Fix outdated TODO comment

https://gerrit.wikimedia.org/r/747566

Change 747610 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: 0770, not 0440, so the init command can create a DB

https://gerrit.wikimedia.org/r/747610

Change 747610 merged by RLazarus:

[operations/puppet@production] imagecatalog: 0770, not 0440, so the init command can create a DB

https://gerrit.wikimedia.org/r/747610

Change 747683 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/imagecatalog@master] Use the Kubernetes config API as it was in v7.0.0 (buster)

https://gerrit.wikimedia.org/r/747683

Change 747685 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Pass cluster names along with config paths

https://gerrit.wikimedia.org/r/747685

Change 747881 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/imagecatalog@master] Add a pod_name column to ActiveContainerImage

https://gerrit.wikimedia.org/r/747881

Change 747683 merged by jenkins-bot:

[operations/docker-images/imagecatalog@master] Use the Kubernetes config API as it was in v7.0.0 (buster)

https://gerrit.wikimedia.org/r/747683

Change 748232 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/imagecatalog@master] Fix --cluster command line parsing and add tests

https://gerrit.wikimedia.org/r/748232

Change 747881 merged by jenkins-bot:

[operations/docker-images/imagecatalog@master] Add a pod_name column to ActiveContainerImage

https://gerrit.wikimedia.org/r/747881

Change 748232 merged by jenkins-bot:

[operations/docker-images/imagecatalog@master] Fix --clusters command line parsing and add tests

https://gerrit.wikimedia.org/r/748232

Change 747685 merged by RLazarus:

[operations/puppet@production] imagecatalog: Pass cluster names along with config paths

https://gerrit.wikimedia.org/r/747685

Change 748799 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Puppet spelling correction, s/str/String/

https://gerrit.wikimedia.org/r/748799

Change 748799 merged by RLazarus:

[operations/puppet@production] imagecatalog: Pass cluster names along with config paths

https://gerrit.wikimedia.org/r/748799

Change 748876 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Add an hourly systemd timer to scan for what's currently running

https://gerrit.wikimedia.org/r/748876

Status update, leaving this for the end-of-year break: 748876 and 748873 are still to be merged; once they are, the hourly scan and the web API should be done and ready.

Still to do:

  • Add an active/passive DNS discovery record (and read it from the Puppet class so that the imagecatalog record systemd timer is only present on the active host)
  • Regularly rsync the database from the active host to the passive one
  • Add code to blubber and docker-pkg to POST to /register_build for each new image build
  • Write it all up on Wikitech
  • Start thinking about integration with Clair and Debmonitor
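For the /register_build item above, a client-side call from a build tool could look roughly like the following Python sketch. Only the /register_build path comes from this task. The payload fields (`image`, `tag`, `dependencies`), the `build_register_request` helper, and the example base URL are assumptions made for illustration; the actual request format is defined by the imagecatalog service.

```python
import json
import urllib.request


def build_register_request(base_url, image, tag, dependencies):
    """Build (but do not send) a POST request to /register_build.

    The JSON field names are hypothetical; check the imagecatalog
    API for the schema it actually expects.
    """
    payload = {"image": image, "tag": tag, "dependencies": dependencies}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/register_build",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (assumes the service is reachable at base_url):
#   with urllib.request.urlopen(request) as response:
#       response.read()
```

Keeping request construction separate from sending makes the build-tool side easy to test without a running catalog.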

I’ll pick that all back up after the break. @Joe: My recollection is you were going to take care of the blubber and docker-pkg parts (although it’s been a while since we talked about it) -- with the exception of adding a DNS record to talk to, that should be unblocked, so feel free to grab it while I’m still out, if you're so inclined.

> @Joe: My recollection is you were going to take care of the blubber and docker-pkg parts (although it’s been a while since we talked about it) -- with the exception of adding a DNS record to talk to, that should be unblocked, so feel free to grab it while I’m still out, if you're so inclined.

Your recollection is correct; I'll do what I can to actually start that work while you're out but not sure I'll get to it :)

>   • Regularly rsync the database from the active host to the passive one

Hello @RLazarus and @Joe,

I have good news. This part is already done :)

I was about to add code for this, then remembered class rsync::deployment: it has the $deployment_server parameter to check which host is active, and it syncs $deployment_path = '/srv/deployment'. Since you are under /srv/deployment/imagecatalog/, this is already included.

root@deploy1002:/srv/deployment/imagecatalog# ls -als catalog.sqlite 
64 -rw-r--r-- 1 imagecatalog imagecatalog 65536 Dec 21 00:10 catalog.sqlite


root@deploy2002:/srv/deployment/imagecatalog# ls -als catalog.sqlite 
64 -rw-r--r-- 1 helm helm 65536 Dec 21 00:10 catalog.sqlite



deploy1002:/etc/rsync.d] $ cat frag-deployment_module 
# This file is being maintained by Puppet.
# DO NOT EDIT

[ deployment_module ]
path            = /srv/deployment
read only       = yes


[deploy2002:~] $  sudo systemctl status sync_deployment_dir.timer
● sync_deployment_dir.timer - Periodic execution of sync_deployment_dir.service
   Loaded: loaded (/lib/systemd/system/sync_deployment_dir.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Wed 2021-02-24 23:27:14 UTC; 9 months 26 days ago
  Trigger: Wed 2021-12-22 19:00:00 UTC; 11min left


[deploy2002:~] $  sudo systemctl status sync_deployment_dir.service
● sync_deployment_dir.service - rsync the deployment server data directory /srv/deployment
   Loaded: loaded (/lib/systemd/system/sync_deployment_dir.service; static; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-12-22 18:00:34 UTC; 48min ago
  Process: 30285 ExecStart=/usr/bin/rsync -avz --delete deploy1002.eqiad.wmnet::deployment_module /srv/deployment (code=exited, status=0/SUCCESS)
 Main PID: 30285 (code=exited, status=0/SUCCESS)

Dec 22 18:00:04 deploy2002 systemd[1]: Started rsync the deployment server data directory /srv/deployment.
Dec 22 18:00:04 deploy2002 rsync[30285]: receiving incremental file list

The passive host pulls from the active host, with --delete.

Change 748876 merged by RLazarus:

[operations/puppet@production] imagecatalog: Add an hourly systemd timer to scan for what's currently running

https://gerrit.wikimedia.org/r/748876

The hourly imagecatalog record timer is working on deploy1002. It's failing on deploy2002, because something keeps overwriting the ownership of /srv/deployment/imagecatalog/catalog.sqlite -- it's supposed to be owned by imagecatalog, and that's how Puppet creates it, but in fact it's owned by helm. I haven't been immediately able to figure out why (the catalog is presumably getting blown away by that rsync, but I don't see where the helm user is coming from) but I don't think it needs to be figured out: that systemd timer is only going to run in the active DC anyway, so as soon as I set that up in the next Puppet patch, it'll be a non-issue.
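To compare ownership on the two hosts without reading ls output by eye, a small check along these lines could be run on each of them. This is a diagnostic sketch only, not part of imagecatalog. The helper names are made up for this example, and it reports both the numeric uid and the resolved user name so the two hosts can be compared directly.

```python
import os
import pwd


def owner_info(path):
    """Return (uid, user_name) for the owner of path.

    If no local account has that uid, the name is the uid as a string.
    """
    uid = os.stat(path).st_uid
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = str(uid)
    return uid, name


def check_owner(path, expected_user):
    """Return (matches, actual_user_name) for the owner of path."""
    _, actual = owner_info(path)
    return actual == expected_user, actual


# Example, assuming the path from this task exists on the host:
#   check_owner("/srv/deployment/imagecatalog/catalog.sqlite", "imagecatalog")
```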

I think we need to do something a little more complicated than just the overall /srv/deployment rsync, because we want to make sure the right thing happens when we switch DCs (regardless of when Puppet happens to run on each host), but we can address that in a follow-up.

Change 757530 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] imagecatalog: Only run on the active deployment host

https://gerrit.wikimedia.org/r/757530

Change 757530 merged by RLazarus:

[operations/puppet@production] imagecatalog: Only run on the active deployment host

https://gerrit.wikimedia.org/r/757530

@RLazarus: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!