Still to be cleared out, but it seems that after cleaning the docker_registry_backup and docker_registry containers on codfw, some layers and images have been corrupted.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | akosiaris | T228196 docker-registry: some layers has been corrupted due to deleting other swift containers
Declined | | None | T229117 create swift container-to-container synchronization metrics
Declined | | None | T229118 create a docker_registry_codfw swift container backup
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2019-07-16T18:05:54Z] <fsero> republishing base images for nodejs-slim due to registry T228196
Mentioned in SAL (#wikimedia-operations) [2019-07-16T19:51:13Z] <fsero> republishing base images for wikimedia-(stretch,jessie and buster) T228196
List of affected images:
coredns dev/mediawiki dev/mediawiki-xdebug dev/restbase dev/stretch dev/stretch-php72 dev/stretch-php72-apache2 dev/stretch-php72-webserver dev/stretch-php72-webserver-xdebug envoy fluent-bit fluentd kubernetes-fluentd-daemonset nodejs10-slim prometheus-statsd-exporter releng/bazel releng/castor releng/ci-jessie releng/ci-src-setup releng/civicrm releng/composer releng/composer-package releng/composer-package-hhvm releng/composer-package-php55 releng/composer-package-php71 releng/composer-package-php72 releng/composer-package-php73 releng/composer-php55 releng/composer-php56 releng/composer-php71 releng/composer-php72 releng/composer-php73 releng/composer-test releng/composer-test-php55 releng/composer-test-php56 releng/composer-test-php72 releng/doxygen releng/gradle releng/helm-linter releng/hhvm-compile releng/hhvm-jessie releng/hhvm-jessie-compile releng/java8 releng/java8-mjolnir releng/java8-sonar-scanner releng/java8-wikidata-query-rdf releng/java8-xgboost releng/mediawiki-phan releng/mediawiki-phan-seccheck releng/mediawiki-tarball releng/node10-test-browser releng/npm releng/npm-browser-test releng/npm-test releng/npm-test-3d2png releng/npm-test-graphoid releng/npm-test-maps-service releng/npm-test-mathoid releng/npm-test-oojsui releng/npm6-browser-test releng/operations-puppet releng/php-ast releng/php-compile releng/php55 releng/php71 releng/php71-compile releng/php72 releng/php72-compile releng/php73 releng/php73-compile releng/phpmetrics releng/quibble-coverage releng/quibble-fresnel releng/quibble-jessie releng/quibble-jessie-hhvm releng/quibble-jessie-php55 releng/quibble-jessie-php56 releng/quibble-stretch releng/quibble-stretch-bundle releng/quibble-stretch-hhvm releng/quibble-stretch-php70 releng/quibble-stretch-php71 releng/quibble-stretch-php72 releng/quibble-stretch-php73 releng/rake-vagrant releng/sury-php releng/tox-acme-chief releng/tox-cergen releng/tox-certcentral releng/tox-conftool releng/tox-poolcounter releng/tox-pyspark releng/tox-pywikibot releng/typos releng/wikimedia-audit-resources releng/zuul-cloner ruby servermon service-checker wikimedia/blubber wikimedia/eventgate-ci wikimedia/mediawiki-services-citoid wikimedia/mediawiki-services-cxserver wikimedia/mediawiki-services-graphoid wikimedia/mediawiki-services-kask wikimedia/mediawiki-services-mathoid wikimedia/mediawiki-services-mobileapps wikimedia/mediawiki-services-recommendation-api wikimedia/mediawiki-services-restbase wikimedia/mediawiki-services-wikifeeds wikimedia/mediawiki-services-zotero wikimedia/wikibase-termbox wikimedia-jessie wikimedia-stretch wmfdebug
If you happen to use one of these images and experience issues, trigger a rebuild and create a new Docker image.
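For a locally cached copy, a minimal sketch of forcing a clean re-pull (using one of the affected images from the list above as an example):

# remove the possibly corrupted local copy, then pull fresh from the registry
docker rmi -f docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1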
Base images wikimedia-jessie and wikimedia-stretch and the affected production images have been uploaded:

Successfully published image docker-registry.discovery.wmnet/nodejs10-devel
Successfully published image docker-registry.discovery.wmnet/service-checker
Successfully published image docker-registry.discovery.wmnet/wmfdebug
Successfully published image docker-registry.discovery.wmnet/golang
Successfully published image docker-registry.discovery.wmnet/prometheus-statsd-exporter
Successfully published image docker-registry.discovery.wmnet/nodejs10-slim
Successfully published image docker-registry.discovery.wmnet/ruby
Successfully published image docker-registry.discovery.wmnet/nodejs-slim
Successfully published image docker-registry.discovery.wmnet/coredns
Successfully published image docker-registry.discovery.wmnet/python3
Successfully published image docker-registry.discovery.wmnet/python3-build-stretch
Successfully published image docker-registry.discovery.wmnet/python3-devel
Successfully published image docker-registry.discovery.wmnet/nodejs-devel
Mentioned in SAL (#wikimedia-releng) [2019-07-16T20:25:11Z] <James_F> Docker: Running a general rebuild for all missing RelEng images T228196
[contint1001.wikimedia.org] out: == Step 0: scanning /etc/zuul/wikimedia/dockerfiles ==
[contint1001.wikimedia.org] out: Will build the following images:
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-poolcounter:0.4.0
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-mysqld:0.4.0
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-conftool:0.4.0
[contint1001.wikimedia.org] out: == Step 1: building images ==
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-poolcounter:0.4.0
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-mysqld:0.4.0
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-conftool:0.4.0
[contint1001.wikimedia.org] out: == Step 2: publishing ==
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/phpmetrics
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-poolcounter
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-php73
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php72
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/ci-src-setup
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-php71
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/civicrm
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-mysqld
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/bazel
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php72-compile
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-test
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php-ast
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-conftool
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-hhvm
[contint1001.wikimedia.org] out: == Build done! ==
Change 523807 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] dockerfiles: Bump all 94 images for docker-registry consistency
Change 523813 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Point all docker references to new images post-T228196
Change 523807 merged by jenkins-bot:
[integration/config@master] dockerfiles: Bump all 94 images for docker-registry consistency
Mentioned in SAL (#wikimedia-releng) [2019-07-16T21:29:23Z] <James_F> Docker: Publishing a whole new set of RelEng images for T228196
https://integration.wikimedia.org/ci/job/labs-striker-tox-docker/152/console
15:37:12 Unable to find image 'docker-registry.wikimedia.org/releng/tox-labs-striker:0.4.0' locally
15:37:13 0.4.0: Pulling from releng/tox-labs-striker
15:37:13 8d22d214682d: Already exists
15:37:13 dd5d82f356b7: Already exists
15:37:13 6adceec4ad0e: Already exists
15:37:13 95587f1fdaf7: Already exists
15:37:13 dadbf101c83c: Already exists
15:37:13 3c032ac12182: Pulling fs layer
15:37:13 docker: error pulling image configuration: image config verification failed for digest sha256:791389b9b3c7d07865a7393dbe145f2d5244e4c3e2fa124d7d4bdee7909fcc9a.
15:37:13 See 'docker run --help'.
Mentioned in SAL (#wikimedia-operations) [2019-07-16T22:28:53Z] <fsero> depooling ms-fe2005 for swift upload for registry T228196
Mentioned in SAL (#wikimedia-operations) [2019-07-16T22:35:13Z] <fsero> uploading only blobs on docker-registry-codfw from a backup on ms-fe2005 T228196
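For reference, re-uploading blobs from a filesystem backup into the Swift container could look roughly like this (a sketch, assuming the standard python-swiftclient CLI, appropriate auth environment variables, and the usual registry-on-swift object layout; the backup path is hypothetical):

# on ms-fe2005, from the root of the registry backup (path is an example)
cd /srv/registry-backup
# the registry's swift storage driver keeps layer data under files/docker/registry/v2/blobs/
swift upload docker_registry_codfw files/docker/registry/v2/blobs/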
After restoring blobs from the ms-fe2005 backup, pulling images seems to be fixed. I don't see any errors doing:
for i in $(cat catalog | jq -r '.repositories[]') ; do docker pull -a docker-registry.wikimedia.org/$i ; done
Mentioned in SAL (#wikimedia-operations) [2019-07-16T23:26:18Z] <fsero> repool ms-fe2005 T228196
Mentioned in SAL (#wikimedia-releng) [2019-07-17T00:04:00Z] <James_F> Docker: Complete rebuild finished and published; only took 2.5 hours. But now T228196 is fixed elsewise anyway, oh well.
It seems that container synchronization is broken and the swift container in eqiad doesn't hold the same data as the one in codfw. Swift is eventually consistent, so let's wait and see if the sync does its job over the weekend. If it doesn't get restored, the best action plan I can think of right now is (see the sketch after this list):
- disable container sync on the docker_registry_codfw container in eqiad.
- disable container sync on the docker_registry_codfw container in codfw.
- delete docker_registry_codfw in eqiad.
- recreate the docker_registry_codfw container in eqiad.
- wait for syncing.
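Roughly, with the standard python-swiftclient CLI, those steps could look like this (a sketch, assuming appropriate auth environment variables on a frontend in each cluster; the sync-to URL and key are placeholders):

# in eqiad: stop syncing, then recreate the container empty
swift post --sync-to '' --sync-key '' docker_registry_codfw   # disable container sync
swift delete docker_registry_codfw                            # delete the container and its objects
swift post docker_registry_codfw                              # recreate it empty
# in codfw: point sync at the recreated eqiad container so the data flows back
swift post --sync-to '//realm/eqiad/AUTH_<account>/docker_registry_codfw' --sync-key '<key>' docker_registry_codfw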
In the meantime I'll submit a CR to get more info about container sync status.
Change 523930 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] swift: enable logging for container-sync-to-sync
Change 523930 merged by Fsero:
[operations/puppet@production] swift: enable logging for container-sync-to-sync
Mentioned in SAL (#wikimedia-operations) [2019-07-17T14:45:41Z] <fsero> enabling container-sync logging T228196
Mentioned in SAL (#wikimedia-operations) [2019-07-17T15:15:49Z] <fsero> restarting swift-container-sync on ms-be* for getting logging configuration T228196
We're seeing this happening now on contint1001 a few times, e.g. https://integration.wikimedia.org/ci/job/composer-php70-docker/812/console:
12:10:39 ++ /usr/bin/env
12:10:39 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL|HHVM_REPO_CENTRAL_PATH)='
12:10:39 Unable to find image 'docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1' locally
12:10:39 0.1.7-s1: Pulling from releng/composer-test
12:10:39 8d22d214682d: Already exists
12:10:39 dd5d82f356b7: Already exists
12:10:39 e74dee1208c4: Already exists
12:10:39 69208455aa1f: Already exists
12:10:39 f1cba75babe0: Already exists
12:10:39 2cd12524c0dc: Already exists
12:10:39 3d3adb31207d: Already exists
12:10:39 e17ba03e55ec: Already exists
12:10:39 e9dd2befc159: Pulling fs layer
12:10:40 e9dd2befc159: Verifying Checksum
12:10:40 docker: filesystem layer verification failed for digest sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27.
12:10:40 See 'docker run --help'.
For that particular image I can reproduce it locally:
$ docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
0.1.7-s1: Pulling from releng/composer-test
8d22d214682d: Already exists
dd5d82f356b7: Already exists
e74dee1208c4: Verifying Checksum
69208455aa1f: Download complete
f1cba75babe0: Verifying Checksum
2cd12524c0dc: Download complete
3d3adb31207d: Download complete
e17ba03e55ec: Waiting
e9dd2befc159: Verifying Checksum
filesystem layer verification failed for digest sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27
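Independent of the Docker client, the corruption can be confirmed against the registry directly (a sketch using the standard Registry v2 blob endpoint; the repository and digest are taken from the failure above):

# fetch the layer blob and check that its sha256 matches its digest
curl -sL https://docker-registry.wikimedia.org/v2/releng/composer-test/blobs/sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27 -o layer.tar.gz
sha256sum layer.tar.gz   # anything other than e9dd2befc159... means the stored layer is corrupted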
Per https://integration.wikimedia.org/ci/job/translatewiki-composer-hhvm-docker/1192/console also docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1. Back to UBN!?
Verified as well. Back to high. I am not setting UBN yet, as it's not currently causing an outage or impact to end users (yet).
Mentioned in SAL (#wikimedia-operations) [2019-07-18T09:03:56Z] <fsero> reuploading missing layers T228196
I've uploaded the missing layers from a backup; it works for me now:
$ docker rmi -f docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
Untagged: docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
Untagged: docker-registry.wikimedia.org/releng/composer-test@sha256:f99405a2dd1173f796b5c81660660bc6f49a0ea3300576af069e936dad9fb63c
Deleted: sha256:1f563408d90777c2f7ff0a79c0594294845d2646f8ae806c1a8e26c2bbd5ec98
Deleted: sha256:9df03fc0e4d06b5294720c5ee64c024ea51402ae5e02734b33a5f08f4f711cb1
Deleted: sha256:13acf6b84ee76cefc707bcbf9196619f37a68f4aeb27cb6f7c8b91ba4dc690da
Deleted: sha256:7389d98ad5d55a9f98be1b39b175859d7d9879f034caf24a59ce2bf90f71fb90
Deleted: sha256:4bc1b5f74b631ae89f3cba6d1dd620b59f7177cd3f54085a0d5194ea2ed3776e
Deleted: sha256:36bf06b66f8f2382a6a7937ed9f414749413193242397de2e6d22e3725bd4f38
Deleted: sha256:211b7399b865254092f9a4395bf6a988381236180888e339fd710d2c64330c2b
Deleted: sha256:55192213c2654ea3edf83585706b4a93c6c8481c76bf247d2b566a5eb9c0ac5a
$ docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
0.1.7-s1: Pulling from releng/composer-test
8d22d214682d: Already exists
dd5d82f356b7: Already exists
e74dee1208c4: Pull complete
69208455aa1f: Pull complete
f1cba75babe0: Pull complete
2cd12524c0dc: Pull complete
3d3adb31207d: Pull complete
e17ba03e55ec: Pull complete
e9dd2befc159: Pull complete
Digest: sha256:f99405a2dd1173f796b5c81660660bc6f49a0ea3300576af069e936dad9fb63c
Status: Downloaded newer image for docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
This also fixes docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1, @Nikerabbit:
0.2.6-s1: Pulling from releng/composer-test-hhvm
d397e275c51e: Already exists
a041ea3cae5b: Already exists
97190e58327a: Already exists
70dfef3c11db: Already exists
30c66113a861: Already exists
9a42aa59c9aa: Already exists
8560bf24903a: Already exists
5953cc9a1ec2: Already exists
78f86f63cfb3: Already exists
e9dd2befc159: Already exists
Digest: sha256:7b6990ce606e46c4bb6d7f719fed8f3195115d1c95879784d3e90a45c462b6b7
Status: Downloaded newer image for docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1
Mentioned in SAL (#wikimedia-operations) [2019-07-18T15:54:30Z] <fsero> depool ms-fe2005 - T228196
I did a complete pull of all images and tags in our registry by running the following (results are in the attached file):
curl 'https://docker-registry.wikimedia.org/v2/_catalog?n=1000' > catalog
for i in $(cat catalog | jq -r .repositories[] ) ; do docker pull -a docker-registry.wikimedia.org/$i | tee -a /tmp/docker-pulls; docker rmi -f $(docker images -a -q); done
There aren't any errors pulling any image.
Mentioned in SAL (#wikimedia-operations) [2019-07-19T05:26:18Z] <fsero> repool ms-fe2005 - T228196
Change 524478 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/docker-images/production-images@master] Keeping in code what i did in boron for T228196.
Change 524478 merged by Fsero:
[operations/docker-images/production-images@master] Keeping in code what i did in boron for T228196.
What are the next steps with this incident task?
The report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190716-docker-registry
@fsero is there a retrospective you'd like to have with others in SRE to prioritize/file tasks for the actionables?
@greg thanks for following this. I definitely would like to have a retrospective about it, and there are some leftovers like creating Phabricator tasks, etc.
If a retrospective / postmortem session is held, it should include people from RelEng as well, because while the registry is maintained by SREs, you also provide Docker images for the organization, and the CI system is a clear dependency of the registry IMO. If people in the organization are interested, I'm more than happy to organize and drive such a session.
Regarding the third bullet point of the actionables, let's try to shed some light on the pinning issue.
During the incident, there was a moment where we decided to rebuild images. Docker images are by definition immutable, but because packages in the Dockerfiles generated by docker-pkg are not pinned, some package versions could change between builds, introducing errors and breaking changes. IMO these packages should also be pinned to a version, so that either the build completes or it fails because that specific version is not available. If we had an artifact storage system, this second kind of failure should never happen, as the artifact storage would hold the artifacts needed for the build, whether tar.gz archives, node packages, pip packages or Debian packages.
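As an illustration of the kind of pinning meant here (a hypothetical Dockerfile fragment; the package name and version are made-up examples, not actual production values):

# unpinned: whatever version the repository carries at build time gets installed
RUN apt-get update && apt-get install -y nodejs
# pinned: the build either installs exactly this version or fails loudly
RUN apt-get update && apt-get install -y nodejs=10.15.2~dfsg-1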
Link: https://gerrit.wikimedia.org/r/c/integration/config/+/523807
Change 523813 abandoned by Jforrester:
jjb: Point all docker references to new images post-T228196
Reason:
Not needed, in the end.