
docker-registry: some layers have been corrupted due to deleting other swift containers
Closed, ResolvedPublic

Description

The details are still to be cleared up, but it seems that after cleaning up docker_registry_backup and docker_registry on codfw, some layers and images have been corrupted.

Details

Event Timeline

fsero created this task.Jul 16 2019, 6:05 PM
Restricted Application added a subscriber: Aklapper.Jul 16 2019, 6:05 PM
fsero triaged this task as Unbreak Now! priority.Jul 16 2019, 6:05 PM
Restricted Application added a subscriber: Liuxinyu970226.Jul 16 2019, 6:05 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-16T18:05:54Z] <fsero> republishing base images for nodejs-slim due to registry T228196

Mentioned in SAL (#wikimedia-operations) [2019-07-16T19:51:13Z] <fsero> republishing base images for wikimedia-(stretch,jessie and buster) T228196

jijiki added a project: Operations.
jijiki added a subscriber: jijiki.
fsero added a comment.Jul 16 2019, 8:22 PM

List of affected images:

coredns
dev/mediawiki
dev/mediawiki-xdebug
dev/restbase
dev/stretch
dev/stretch-php72
dev/stretch-php72-apache2
dev/stretch-php72-webserver
dev/stretch-php72-webserver-xdebug
envoy
fluent-bit
fluentd
kubernetes-fluentd-daemonset
nodejs10-slim
prometheus-statsd-exporter
releng/bazel
releng/castor
releng/ci-jessie
releng/ci-src-setup
releng/civicrm
releng/composer
releng/composer-package
releng/composer-package-hhvm
releng/composer-package-php55
releng/composer-package-php71
releng/composer-package-php72
releng/composer-package-php73
releng/composer-php55
releng/composer-php56
releng/composer-php71
releng/composer-php72
releng/composer-php73
releng/composer-test
releng/composer-test-php55
releng/composer-test-php56
releng/composer-test-php72
releng/doxygen
releng/gradle
releng/helm-linter
releng/hhvm-compile
releng/hhvm-jessie
releng/hhvm-jessie-compile
releng/java8
releng/java8-mjolnir
releng/java8-sonar-scanner
releng/java8-wikidata-query-rdf
releng/java8-xgboost
releng/mediawiki-phan
releng/mediawiki-phan-seccheck
releng/mediawiki-tarball
releng/node10-test-browser
releng/npm
releng/npm-browser-test
releng/npm-test
releng/npm-test-3d2png
releng/npm-test-graphoid
releng/npm-test-maps-service
releng/npm-test-mathoid
releng/npm-test-oojsui
releng/npm6-browser-test
releng/operations-puppet
releng/php-ast
releng/php-compile
releng/php55
releng/php71
releng/php71-compile
releng/php72
releng/php72-compile
releng/php73
releng/php73-compile
releng/phpmetrics
releng/quibble-coverage
releng/quibble-fresnel
releng/quibble-jessie
releng/quibble-jessie-hhvm
releng/quibble-jessie-php55
releng/quibble-jessie-php56
releng/quibble-stretch
releng/quibble-stretch-bundle
releng/quibble-stretch-hhvm
releng/quibble-stretch-php70
releng/quibble-stretch-php71
releng/quibble-stretch-php72
releng/quibble-stretch-php73
releng/rake-vagrant
releng/sury-php
releng/tox-acme-chief
releng/tox-cergen
releng/tox-certcentral
releng/tox-conftool
releng/tox-poolcounter
releng/tox-pyspark
releng/tox-pywikibot
releng/typos
releng/wikimedia-audit-resources
releng/zuul-cloner
ruby
servermon
service-checker
wikimedia/blubber
wikimedia/eventgate-ci
wikimedia/mediawiki-services-citoid
wikimedia/mediawiki-services-cxserver
wikimedia/mediawiki-services-graphoid
wikimedia/mediawiki-services-kask
wikimedia/mediawiki-services-mathoid
wikimedia/mediawiki-services-mobileapps
wikimedia/mediawiki-services-recommendation-api
wikimedia/mediawiki-services-restbase
wikimedia/mediawiki-services-wikifeeds
wikimedia/mediawiki-services-zotero
wikimedia/wikibase-termbox
wikimedia-jessie
wikimedia-stretch
wmfdebug

If you happen to use one of these images and experience an issue, trigger a rebuild and create a new Docker image.

fsero added a comment.Jul 16 2019, 8:23 PM

The base images wikimedia-jessie and wikimedia-stretch and the affected production images

Successfully published image docker-registry.discovery.wmnet/nodejs10-devel
Successfully published image docker-registry.discovery.wmnet/service-checker
Successfully published image docker-registry.discovery.wmnet/wmfdebug
Successfully published image docker-registry.discovery.wmnet/golang
Successfully published image docker-registry.discovery.wmnet/prometheus-statsd-exporter
Successfully published image docker-registry.discovery.wmnet/nodejs10-slim
Successfully published image docker-registry.discovery.wmnet/ruby
Successfully published image docker-registry.discovery.wmnet/nodejs-slim
Successfully published image docker-registry.discovery.wmnet/coredns
Successfully published image docker-registry.discovery.wmnet/python3
Successfully published image docker-registry.discovery.wmnet/python3-build-stretch
Successfully published image docker-registry.discovery.wmnet/python3-devel
Successfully published image docker-registry.discovery.wmnet/nodejs-devel

have been uploaded.

Mentioned in SAL (#wikimedia-releng) [2019-07-16T20:25:11Z] <James_F> Docker: Running a general rebuild for all missing RelEng images T228196

[contint1001.wikimedia.org] out: == Step 0: scanning /etc/zuul/wikimedia/dockerfiles ==
[contint1001.wikimedia.org] out: Will build the following images:
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-poolcounter:0.4.0
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-mysqld:0.4.0
[contint1001.wikimedia.org] out: * docker-registry.discovery.wmnet/releng/tox-conftool:0.4.0
[contint1001.wikimedia.org] out: == Step 1: building images ==
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-poolcounter:0.4.0
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-mysqld:0.4.0
[contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/tox-conftool:0.4.0
[contint1001.wikimedia.org] out: == Step 2: publishing ==
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/phpmetrics
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-poolcounter
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-php73
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php72
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/ci-src-setup
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-php71
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/civicrm
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-mysqld
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/bazel
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php72-compile
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-test
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/php-ast
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/tox-conftool
[contint1001.wikimedia.org] out: Successfully published image docker-registry.discovery.wmnet/releng/composer-package-hhvm
[contint1001.wikimedia.org] out: == Build done! ==

Change 523807 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] dockerfiles: Bump all 94 images for docker-registry consistency

https://gerrit.wikimedia.org/r/523807

Change 523813 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Point all docker references to new images post-T228196

https://gerrit.wikimedia.org/r/523813

Change 523807 merged by jenkins-bot:
[integration/config@master] dockerfiles: Bump all 94 images for docker-registry consistency

https://gerrit.wikimedia.org/r/523807

Mentioned in SAL (#wikimedia-releng) [2019-07-16T21:29:23Z] <James_F> Docker: Publishing a whole new set of RelEng images for T228196

bd808 added a subscriber: bd808.Jul 16 2019, 9:39 PM

https://integration.wikimedia.org/ci/job/labs-striker-tox-docker/152/console

15:37:12 Unable to find image 'docker-registry.wikimedia.org/releng/tox-labs-striker:0.4.0' locally
15:37:13 0.4.0: Pulling from releng/tox-labs-striker
15:37:13 8d22d214682d: Already exists
15:37:13 dd5d82f356b7: Already exists
15:37:13 6adceec4ad0e: Already exists
15:37:13 95587f1fdaf7: Already exists
15:37:13 dadbf101c83c: Already exists
15:37:13 3c032ac12182: Pulling fs layer
15:37:13 docker: error pulling image configuration: image config verification failed for digest sha256:791389b9b3c7d07865a7393dbe145f2d5244e4c3e2fa124d7d4bdee7909fcc9a.
15:37:13 See 'docker run --help'.

Mentioned in SAL (#wikimedia-operations) [2019-07-16T22:28:53Z] <fsero> depooling ms-fe2005 for swift upload for registry T228196

Mentioned in SAL (#wikimedia-operations) [2019-07-16T22:35:13Z] <fsero> uploading only blobs on docker-registry-codfw from a backup on ms-fe2005 T228196

Rescuing the blobs from the ms-fe2005 backup seems to have fixed pulling images. I don't see any errors when running:

for i in $(jq -r '.repositories[]' catalog) ; do docker pull -a "docker-registry.wikimedia.org/$i" ; done
fsero lowered the priority of this task from Unbreak Now! to Medium.Jul 16 2019, 11:18 PM
fsero moved this task from To Triage to Active Situation on the Wikimedia-Incident board.

Mentioned in SAL (#wikimedia-operations) [2019-07-16T23:26:18Z] <fsero> repool ms-fe2005 T228196

Mentioned in SAL (#wikimedia-releng) [2019-07-17T00:04:00Z] <James_F> Docker: Complete rebuild finished and published; only took 2.5 hours. But now T228196 is fixed elsewise anyway, oh well.

It seems that container synchronization is broken and the swift container in eqiad doesn't hold the same data as the one in codfw. Swift is eventually consistent, so let's wait and see whether the sync does its job over the weekend. If it doesn't get restored, the best action plan I can think of right now is:

  1. disable container sync-to-sync on docker_registry_codfw in eqiad.
  2. disable container sync-to-sync on docker_registry_codfw in codfw.
  3. delete docker_registry_codfw in eqiad
  4. recreate docker_registry_codfw container in eqiad
  5. wait for syncing.

In the meantime I'll submit a CR to get more info about the container sync-to-sync status.
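A minimal sketch of how the eqiad/codfw divergence could be checked before and after waiting for the sync. It assumes the object listings of the same container in both datacenters have already been fetched into `{object_name: etag}` dictionaries by some other means; the function name, field names and sample data are hypothetical and were not part of the incident tooling.

```python
from typing import Dict, List, NamedTuple


class SyncDiff(NamedTuple):
    """Differences between two listings of the same container."""
    missing_in_replica: List[str]   # present in primary only
    extra_in_replica: List[str]     # present in replica only
    mismatched: List[str]           # present in both, different etag


def diff_listings(primary: Dict[str, str], replica: Dict[str, str]) -> SyncDiff:
    """Compare two {object_name: etag} listings and report the divergence."""
    missing = sorted(name for name in primary if name not in replica)
    extra = sorted(name for name in replica if name not in primary)
    mismatched = sorted(
        name for name in primary
        if name in replica and primary[name] != replica[name]
    )
    return SyncDiff(missing, extra, mismatched)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the incident.
    codfw = {"blobs/aa": "etag1", "blobs/bb": "etag2", "blobs/cc": "etag3"}
    eqiad = {"blobs/aa": "etag1", "blobs/bb": "stale"}
    print(diff_listings(codfw, eqiad))
```

If all three lists are empty after the waiting period, the two containers hold the same objects and the delete/recreate plan would not be needed.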

Change 523930 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] swift: enable logging for container-sync-to-sync

https://gerrit.wikimedia.org/r/523930

Change 523930 merged by Fsero:
[operations/puppet@production] swift: enable logging for container-sync-to-sync

https://gerrit.wikimedia.org/r/523930

Mentioned in SAL (#wikimedia-operations) [2019-07-17T14:45:41Z] <fsero> enabling container-sync logging T228196

Mentioned in SAL (#wikimedia-operations) [2019-07-17T15:15:49Z] <fsero> restarting swift-container-sync on ms-be* for getting logging configuration T228196

We're seeing this happening now on contint1001 a few times, e.g. https://integration.wikimedia.org/ci/job/composer-php70-docker/812/console:

12:10:39 ++ /usr/bin/env
12:10:39 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL|HHVM_REPO_CENTRAL_PATH)='
12:10:39 Unable to find image 'docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1' locally
12:10:39 0.1.7-s1: Pulling from releng/composer-test
12:10:39 8d22d214682d: Already exists
12:10:39 dd5d82f356b7: Already exists
12:10:39 e74dee1208c4: Already exists
12:10:39 69208455aa1f: Already exists
12:10:39 f1cba75babe0: Already exists
12:10:39 2cd12524c0dc: Already exists
12:10:39 3d3adb31207d: Already exists
12:10:39 e17ba03e55ec: Already exists
12:10:39 e9dd2befc159: Pulling fs layer
12:10:40 e9dd2befc159: Verifying Checksum
12:10:40 docker: filesystem layer verification failed for digest sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27.
12:10:40 See 'docker run --help'.

For that particular image I can recreate locally:

$ docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
0.1.7-s1: Pulling from releng/composer-test
8d22d214682d: Already exists 
dd5d82f356b7: Already exists 
e74dee1208c4: Verifying Checksum 
69208455aa1f: Download complete 
f1cba75babe0: Verifying Checksum 
2cd12524c0dc: Download complete 
3d3adb31207d: Download complete 
e17ba03e55ec: Waiting 
e9dd2befc159: Verifying Checksum 
filesystem layer verification failed for digest sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27

Per https://integration.wikimedia.org/ci/job/translatewiki-composer-hhvm-docker/1192/console, docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1 is affected as well. Back to UBN!?

akosiaris raised the priority of this task from Medium to High.Jul 18 2019, 8:03 AM
akosiaris added a subscriber: akosiaris.

For that particular image I can recreate locally:

$ docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
0.1.7-s1: Pulling from releng/composer-test
8d22d214682d: Already exists 
dd5d82f356b7: Already exists 
e74dee1208c4: Verifying Checksum 
69208455aa1f: Download complete 
f1cba75babe0: Verifying Checksum 
2cd12524c0dc: Download complete 
3d3adb31207d: Download complete 
e17ba03e55ec: Waiting 
e9dd2befc159: Verifying Checksum 
filesystem layer verification failed for digest sha256:e9dd2befc159629b6a09232c6478fa48bedfc34117be1a04c1e337ebd0a46d27

Verified as well. Back to high. I am not setting UBN yet, as it's not currently causing an outage or impact to end users (yet).
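For context on the "verification failed for digest" errors above: registry blobs are content-addressed, so the client hashes what it downloaded and compares the result with the digest it asked for. A minimal sketch of that check, assuming the blob bytes are already in memory; `verify_blob` is a hypothetical helper, not part of Docker or the registry.

```python
import hashlib


def verify_blob(data: bytes, expected_digest: str) -> bool:
    """Return True if data hashes to expected_digest ("sha256:<hex>")."""
    algorithm, _, expected_hex = expected_digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


if __name__ == "__main__":
    # Hypothetical blob; a real layer would be a compressed tarball.
    blob = b"example layer bytes"
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    print(verify_blob(blob, digest))       # intact blob
    print(verify_blob(blob[:-1], digest))  # truncated blob
```

A blob that was truncated or replaced in the storage backend fails this comparison, which is consistent with the pull errors shown in this task.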

Mentioned in SAL (#wikimedia-operations) [2019-07-18T09:03:56Z] <fsero> reuploding missing layers T228196

fsero added a comment.Jul 18 2019, 9:05 AM

I've uploaded the missing layers from a backup; it works for me now:

➜  ~ docker rmi -f docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
Untagged: docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
Untagged: docker-registry.wikimedia.org/releng/composer-test@sha256:f99405a2dd1173f796b5c81660660bc6f49a0ea3300576af069e936dad9fb63c
Deleted: sha256:1f563408d90777c2f7ff0a79c0594294845d2646f8ae806c1a8e26c2bbd5ec98
Deleted: sha256:9df03fc0e4d06b5294720c5ee64c024ea51402ae5e02734b33a5f08f4f711cb1
Deleted: sha256:13acf6b84ee76cefc707bcbf9196619f37a68f4aeb27cb6f7c8b91ba4dc690da
Deleted: sha256:7389d98ad5d55a9f98be1b39b175859d7d9879f034caf24a59ce2bf90f71fb90
Deleted: sha256:4bc1b5f74b631ae89f3cba6d1dd620b59f7177cd3f54085a0d5194ea2ed3776e
Deleted: sha256:36bf06b66f8f2382a6a7937ed9f414749413193242397de2e6d22e3725bd4f38
Deleted: sha256:211b7399b865254092f9a4395bf6a988381236180888e339fd710d2c64330c2b
Deleted: sha256:55192213c2654ea3edf83585706b4a93c6c8481c76bf247d2b566a5eb9c0ac5a
➜  ~ docker pull docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
0.1.7-s1: Pulling from releng/composer-test
8d22d214682d: Already exists 
dd5d82f356b7: Already exists 
e74dee1208c4: Pull complete 
69208455aa1f: Pull complete 
f1cba75babe0: Pull complete 
2cd12524c0dc: Pull complete 
3d3adb31207d: Pull complete 
e17ba03e55ec: Pull complete 
e9dd2befc159: Pull complete 
Digest: sha256:f99405a2dd1173f796b5c81660660bc6f49a0ea3300576af069e936dad9fb63c
Status: Downloaded newer image for docker-registry.wikimedia.org/releng/composer-test:0.1.7-s1
fsero added a comment.Jul 18 2019, 9:07 AM

This also fixes docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1, @Nikerabbit:

0.2.6-s1: Pulling from releng/composer-test-hhvm
d397e275c51e: Already exists 
a041ea3cae5b: Already exists 
97190e58327a: Already exists 
70dfef3c11db: Already exists 
30c66113a861: Already exists 
9a42aa59c9aa: Already exists 
8560bf24903a: Already exists 
5953cc9a1ec2: Already exists 
78f86f63cfb3: Already exists 
e9dd2befc159: Already exists 
Digest: sha256:7b6990ce606e46c4bb6d7f719fed8f3195115d1c95879784d3e90a45c462b6b7
Status: Downloaded newer image for docker-registry.wikimedia.org/releng/composer-test-hhvm:0.2.6-s1

Mentioned in SAL (#wikimedia-operations) [2019-07-18T15:54:30Z] <fsero> depool ms-fe2005 - T228196

fsero added a comment.Jul 19 2019, 4:52 AM

I did a complete pull of all images and tags in our registry by running the following (results are in the attached file):

curl 'https://docker-registry.wikimedia.org/v2/_catalog?n=1000' > catalog
for i in $(jq -r '.repositories[]' catalog) ; do docker pull -a "docker-registry.wikimedia.org/$i" | tee -a /tmp/docker-pulls; docker rmi -f $(docker images -a -q); done

There aren't any errors pulling any image.
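The same full-registry check can be sketched in Python, which makes it easier to collect failures per repository. This is only an illustration: it assumes the `catalog` file holds the `{"repositories": [...]}` JSON returned by the `_catalog` endpoint, it does not handle catalog pagination, and `image_refs` and `pull_commands` are hypothetical helper names.

```python
import json
from typing import List

REGISTRY = "docker-registry.wikimedia.org"


def image_refs(catalog_json: str, registry: str = REGISTRY) -> List[str]:
    """Turn the registry catalog JSON into fully qualified repository names."""
    catalog = json.loads(catalog_json)
    return [f"{registry}/{name}" for name in catalog.get("repositories", [])]


def pull_commands(refs: List[str]) -> List[List[str]]:
    """Build one 'docker pull -a' argument list per repository (all tags)."""
    return [["docker", "pull", "-a", ref] for ref in refs]


if __name__ == "__main__":
    # Hypothetical two-entry catalog; the real one is fetched with curl.
    sample = '{"repositories": ["coredns", "releng/php72"]}'
    for command in pull_commands(image_refs(sample)):
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command)` and the return code recorded, so that a failing repository is reported by name instead of having to be found in the combined log.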

fsero lowered the priority of this task from High to Medium.Jul 19 2019, 4:52 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-19T05:26:18Z] <fsero> repool ms-fe2005 - T228196

jijiki moved this task from Backlog to Next up on the serviceops board.Jul 19 2019, 8:49 AM

Change 524478 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/docker-images/production-images@master] Keeping in code what i did in boron for T228196.

https://gerrit.wikimedia.org/r/524478

Change 524478 merged by Fsero:
[operations/docker-images/production-images@master] Keeping in code what i did in boron for T228196.

https://gerrit.wikimedia.org/r/524478

greg added a subscriber: greg.Jul 25 2019, 11:03 PM

What are the next steps with this incident task?

The report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190716-docker-registry

@fsero is there a retrospective you'd like to have with others in SRE to prioritize/file tasks for the actionables?

fsero added a comment.Jul 26 2019, 9:17 AM

@greg thanks for following this. I would definitely like to have a retrospective about it, and there are some leftovers, like creating Phab tasks et al.

If a retrospective / postmortem session is held, it should include people from RelEng as well, because while the registry is maintained by SREs, you also provide Docker images for the organization, and the CI system is a clear dependency of the registry IMO. If people in the organization are interested, I'm more than happy to organize and drive such a session.

Regarding the third bullet point of the actionables, let's try to shed some light on the pinning issue.

During the incident, there was a moment when we decided to rebuild images. Docker images are by definition immutable, but because packages are not pinned in the Dockerfiles generated by docker-pkg, some package versions could change between builds, introducing errors and breaking changes. IMO these packages should also be pinned to a version, so that the build either completes or fails because that specific version is not available. If we had an artifact storage system, this second error should never happen, as the artifact storage would hold the artifacts needed for the build, be they tar.gz packages, node packages, pip packages, or Debian packages.
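To make the pinning point concrete, here is a rough sketch of a check that flags unpinned Debian packages in a Dockerfile. It is a heuristic under stated assumptions: it only looks at `apt-get install` commands, treats `name=version` as pinned, and does not understand docker-pkg templates or options that take a separate value; `unpinned_packages` is a hypothetical helper.

```python
import re
from typing import List


def unpinned_packages(dockerfile_text: str) -> List[str]:
    """List packages given to 'apt-get install' without a '=version' pin."""
    # Join shell line continuations so each command sits on one line.
    text = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    unpinned = []
    # Capture arguments up to the next shell separator or end of line.
    for match in re.finditer(r"apt-get\s+install\s+([^&;|\n]*)", text):
        for token in match.group(1).split():
            if token.startswith("-"):
                continue  # command-line option such as -y
            if "=" not in token:
                unpinned.append(token)
    return unpinned


if __name__ == "__main__":
    # Hypothetical Dockerfile fragment; the version string is made up.
    example = (
        "RUN apt-get update && apt-get install -y \\\n"
        "    curl=7.64.0-4 \\\n"
        "    git\n"
    )
    print(unpinned_packages(example))
```

A CI job could fail the build when the returned list is non-empty, which would turn silent version drift between rebuilds into an explicit error.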

Link: https://gerrit.wikimedia.org/r/c/integration/config/+/523807

akosiaris closed this task as Resolved.Wed, Nov 20, 8:30 AM
akosiaris claimed this task.

I'll boldly resolve this (no update since July); feel free to reopen.