
docker-reporter-releng-images => docker registry: status=3/NOTIMPLEMENTED
Closed, Resolved, Public

Description

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=deneb&service=Check+systemd+state


on deneb: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service


[deneb:~] $ systemctl status docker-reporter-releng-images

● docker-reporter-releng-images.service - Report on upgrades to releng images.
   Loaded: loaded (/lib/systemd/system/docker-reporter-releng-images.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2021-10-04 18:44:08 UTC; 4h 14min ago
  Process: 23989 ExecStart=/usr/bin/docker-report --filter-file /etc/docker-report/releng_rules.ini docker-registry.wikimedia.org (code=exited, status=3
 Main PID: 23989 (code=exited, status=3)

Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/typos:0.0.3-s4                   [OK]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/wikimedia-audit-resources:0.1.2-s4[OK]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/zuul-cloner:0.2.1-s5             [OK]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch:1.0.0          [FAIL]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-bundle:0.0.47-s1[FAIL]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.47-s1[FAIL]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php71:0.0.47-s1[FAIL]
Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php72:1.0.0    [FAIL]
Oct 04 18:44:08 deneb systemd[1]: docker-reporter-releng-images.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Oct 04 18:44:08 deneb systemd[1]: docker-reporter-releng-images.service: Failed with result 'exit-code'.
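The unit failed while the report listed several images as [FAIL]. As a rough triage aid, here is a minimal Python sketch that pulls the failing image references out of such log lines, for instance lines copied from `journalctl -u docker-reporter-releng-images`. The regular expression is an assumption inferred only from the excerpt above, not from any documented docker-report output format, and `failed_images` is a hypothetical helper.

```python
import re

# Matches docker-report summary lines such as
#   "... docker-report-releng[23989]: docker-registry.wikimedia.org/releng/typos:0.0.3-s4   [OK]"
# The image reference and the status marker may or may not be separated by spaces.
# Assumption: the pattern is inferred from the log excerpt, not from docker-report docs.
STATUS_RE = re.compile(r"\]: (?P<ref>\S+?)\s*\[(?P<status>OK|FAIL)\]$")


def failed_images(log_lines):
    """Return the image references reported as [FAIL], in log order."""
    failed = []
    for line in log_lines:
        match = STATUS_RE.search(line.rstrip())
        # systemd lines and [OK] lines are skipped.
        if match and match.group("status") == "FAIL":
            failed.append(match.group("ref"))
    return failed
```

Run over the excerpt above, this would return the five releng/quibble-stretch* references; the systemd lines do not match because they do not end in an [OK]/[FAIL] marker.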

Event Timeline

Well... just manually starting it fixed it for now:

[deneb:~] $ sudo systemctl start docker-reporter-releng-images

23:02 <+icinga-wm> RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

except it made me notice these 404s for Jessie stuff:

Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-jessie
Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Got a 404 not found for /v2/releng/ci-jessie/tags/list: possibly a case of https:/
Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-src-setup
Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Got a 404 not found for /v2/releng/ci-src-setup/tags/list: possibly a case of http
Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-src-setup-simple
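The 404s above come from the registry's `/v2/<name>/tags/list` endpoint (Docker Registry HTTP API V2). A minimal Python sketch of how a client can treat "repository not found" separately from other HTTP errors when calling it; `fetch_tags` and its injectable `opener` parameter are hypothetical, and authentication and pagination are left out.

```python
import json
import urllib.error
import urllib.request


def tags_list_url(registry: str, name: str) -> str:
    # Docker Registry HTTP API V2 endpoint that lists a repository's tags.
    return f"https://{registry}/v2/{name}/tags/list"


def fetch_tags(registry: str, name: str, opener=urllib.request.urlopen):
    """Return the tags of `name`, or None if the registry answers 404.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(tags_list_url(registry, name)) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Repository unknown to the registry: report it, don't crash.
            return None
        raise
    # "tags" may be null for a repository that has no tags left.
    return payload.get("tags") or []
```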
Dzahn triaged this task as Low priority. Oct 4 2021, 11:06 PM
Joe raised the priority of this task from Low to High. Oct 5 2021, 5:27 AM
Joe removed a project: serviceops.
Joe subscribed.

The issue here is that all of the Stretch images fail to report because they've not been rebuilt on top of the latest libssl updates. This issue will be fixed once the images are rebuilt.

I already did this for both the production base images and the images that run on Kubernetes, but upgrading the release engineering images is the responsibility of Release-Engineering-Team.

Also triaging as High, since all these images are basically unable to run apt-get or to download from most of the internet when needed.

> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/typos:0.0.3-s4                   [OK]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/wikimedia-audit-resources:0.1.2-s4[OK]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/zuul-cloner:0.2.1-s5             [OK]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch:1.0.0          [FAIL]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-bundle:0.0.47-s1[FAIL]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.47-s1[FAIL]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php71:0.0.47-s1[FAIL]
> Oct 04 18:44:08 deneb docker-report-releng[23989]: docker-registry.wikimedia.org/releng/quibble-stretch-php72:1.0.0    [FAIL]

Looking for usage:

$ pwd
/home/thcipriani/Projects/Wikimedia/integration/config/jjb
$ git log --format='%H' -1
7d6e787f5bba059d3d5301d16fba8e98b0ba078d
$ git ack 'wikimedia-audit-resources:'
misc.yaml
177:         image: docker-registry.wikimedia.org/releng/wikimedia-audit-resources:0.1.2-s4
$ git ack 'image:.*zuul-cloner.*:'
macro-docker.yaml
233:        image: 'docker-registry.wikimedia.org/releng/zuul-cloner:0.2.1-s5'
$ git ack 'quibble-stretch.*:'
$ git ack 'image:.*typos.*:'
misc.yaml
102:            image: docker-registry.wikimedia.org/releng/typos:0.0.3-s4
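The `git ack` searches above can be approximated in a single pass over the JJB checkout. A minimal Python sketch, assuming image references are written out literally as `docker-registry.wikimedia.org/releng/<name>:<tag>` (templated references would be missed); `find_image_refs` and `scan_jjb` are hypothetical helpers, and the `jjb` default path is an assumption.

```python
import re
from pathlib import Path

# Literal releng image references: (image name, tag).
IMAGE_RE = re.compile(r"docker-registry\.wikimedia\.org/(releng/[\w.-]+):([\w.-]+)")


def find_image_refs(text):
    """Return (image, tag) pairs for every releng image pinned in `text`."""
    return IMAGE_RE.findall(text)


def scan_jjb(root="jjb"):
    """Map each YAML file under `root` to the releng images it references."""
    usage = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        refs = find_image_refs(path.read_text(encoding="utf-8"))
        if refs:
            usage[str(path)] = refs
    return usage
```

An image that never shows up in the result, like quibble-stretch in the searches above, is a candidate for deletion rather than rebuilding.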
  • quibble-stretch*: I think we can delete all the quibble-stretch images, poking @hashar to check me on that
  • audit-resources: python3 should be safe to rebuild, poking @Legoktm in case he can foresee issues
  • zuul-cloner: runs a python2 program and it does it for a lot of repos, @hashar, worries?
  • releng/typos: seems to check mediawiki/config for typos using git-grep. Easy rebuild, methinks.
> • quibble-stretch*: I think we can delete all the quibble-stretch images, poking @hashar to check me on that

No stupid questions: do I have the power to delete images and are there docs for that?

> • audit-resources: python3 should be safe to rebuild, poking @Legoktm in case he can foresee issues

Unfortunately this project is totally broken right now because of https://github.com/laurentj/slimerjs/issues/708; I placed my bet on something that died pretty quickly. It's probably best to remove it and disable the CI job until it can be fixed.

> No stupid questions: do I have the power to delete images and are there docs for that?

https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images, but I believe only SREs can access deneb to do that.
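For background, the Docker Registry HTTP API V2 deletes images by manifest digest: a `HEAD` on `/v2/<name>/manifests/<tag>` returns the digest in the `Docker-Content-Digest` header, and a `DELETE` on `/v2/<name>/manifests/<digest>` removes the manifest. The Python sketch below only illustrates those two calls; it is not necessarily the procedure on the wikitech page. It assumes deletion is enabled on the registry and schema 2 manifests, and it leaves out authentication and the garbage-collection step that actually frees storage. All helper names are hypothetical.

```python
import urllib.request

# Assumption: images use the Docker image manifest schema 2 format.
MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def manifest_head_request(registry, name, tag):
    # HEAD returns the manifest digest in the Docker-Content-Digest header.
    return urllib.request.Request(
        f"https://{registry}/v2/{name}/manifests/{tag}",
        headers={"Accept": MANIFEST_V2},
        method="HEAD",
    )


def manifest_delete_request(registry, name, digest):
    # The registry only accepts deletes addressed by digest, not by tag.
    return urllib.request.Request(
        f"https://{registry}/v2/{name}/manifests/{digest}",
        method="DELETE",
    )


def delete_tag(registry, name, tag, opener=urllib.request.urlopen):
    """Resolve `tag` to its digest, delete that manifest, return the digest."""
    with opener(manifest_head_request(registry, name, tag)) as resp:
        digest = resp.headers["Docker-Content-Digest"]
    with opener(manifest_delete_request(registry, name, digest)):
        pass
    return digest
```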

Last time, this came up for the Jessie images and we ended up deleting them (see T251918#7138799 and the following comments).

We still rely on Stretch, and I had all the images rebuilt and switched most jobs to the new versions (T291425). That update included releng/typos, releng/zuul-cloner and releng/wikimedia-audit-resources, which is why they are marked as [OK] in the docker-report log.

The releng/quibble-stretch* images are no longer used; we migrated to Buster-based images (IIRC when PHP 7.2 became available). So even though they are still in the Docker registry, their definitions have been removed from our docker-pkg config files. Assuming the monitor only looks at the latest version of each image, those are the only ones being flagged, since they were not rebuilt.

> Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-jessie
> Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Got a 404 not found for /v2/releng/ci-jessie/tags/list: possibly a case of https:/
> Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-src-setup
> Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Got a 404 not found for /v2/releng/ci-src-setup/tags/list: possibly a case of http
> Oct 04 23:04:41 deneb docker-report-releng[3425]: INFO[docker-report] Fetching tags for releng/ci-src-setup-simple

releng/ci-src-setup has been removed from our docker-pkg config and is no longer maintained (but releng/ci-src-setup-simple is still used and is based on Buster).

Thus please delete:

  • releng/quibble-stretch* images. They are no longer referenced in the Jenkins jobs. If some other project reuses them, it can switch to the Buster equivalent.
  • releng/ci-src-setup image

I guess later on we can add some kind of probe to ensure we delete images from the Docker registry once they are no longer defined :)
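Such a probe could boil down to a set difference between what the registry lists and what docker-pkg still defines. A minimal Python sketch, assuming the defined image names have already been collected from the docker-pkg config (not shown) and that the body of the registry's `/v2/_catalog` response has been fetched; that endpoint is paginated, which this sketch ignores. Function names are hypothetical.

```python
import json


def parse_catalog(body: bytes):
    """Extract repository names from a /v2/_catalog response body."""
    return json.loads(body).get("repositories") or []


def orphaned_images(registry_repos, defined_images, prefix="releng/"):
    """Repositories under `prefix` present in the registry but no longer defined."""
    defined = set(defined_images)
    return sorted({repo for repo in registry_repos if repo.startswith(prefix) and repo not in defined})
```

Restricting to a prefix keeps the probe from flagging other namespaces (such as dev/) that have their own owners.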

We should also likely delete the no-longer-updated dev/stretch-* images, or at least most of them. I can file a separate task for that, as we probably need to double-check that none of those are in obvious use.

Mentioned in SAL (#wikimedia-operations) [2021-10-05T23:02:10Z] <legoktm> deleting old stretch docker images from the registry for T292485

btw, when I ctrl+f for jessie on https://docker-registry.wikimedia.org/ I see:

releng/quibble-jessie-php55
releng/hhvm-jessie
releng/hhvm-jessie-compile

Are those safe to delete too?

ctrl+f for hhvm that's not already covered:

releng/composer-hhvm
releng/composer-package-hhvm
releng/hhvm
releng/hhvm-compile
releng/composer-test-hhvm

> btw, when I ctrl+f for jessie on https://docker-registry.wikimedia.org/ I see:
>
> releng/quibble-jessie-php55
> releng/hhvm-jessie
> releng/hhvm-jessie-compile
>
> Are those safe to delete too?
>
> ctrl+f for hhvm that's not already covered:
>
> releng/composer-hhvm
> releng/composer-package-hhvm
> releng/hhvm
> releng/hhvm-compile
> releng/composer-test-hhvm

Yes, all of those I believe.

> • audit-resources: python3 should be safe to rebuild, poking @Legoktm in case he can foresee issues
>
> Unfortunately this project is totally broken right now because of https://github.com/laurentj/slimerjs/issues/708; I placed my bet on something that died pretty quickly. It's probably best to remove it and disable the CI job until it can be fixed.

Done. Can you manually delete the image so the report doesn't whine?

> Thus please delete:
>
> • releng/quibble-stretch* images. They are no longer referenced in the Jenkins jobs. If some other project reuses them, it can switch to the Buster equivalent.

{{done}}, see P17425 for the full log.

> • releng/ci-src-setup image

I think this was already deleted by someone.

Deleted wikimedia-audit-resources too.

The job passes now:

Oct 06 01:32:54 deneb systemd[1]: docker-reporter-releng-images.service: Succeeded.

I'm not closing this just yet in case there are other images we can delete in this round of cleanup.

Looks fixed now.