Page MenuHomePhabricator

docker registry swift replication is not replicating content between DCs
Closed, ResolvedPublic

Description

When a new image is created (like wikimedia-buster image) and uploaded to the active docker-registry (currently on codfw) the configured swift container synchronization should replicate the image on eqiad and indeed it does however it misses some files listing the image as corrupt.

so requesting latest wikimedia-buster image from eqiad returns:

root@registry1001:/tmp# sudo curl -X GET -vvvv docker-registry.svc.eqiad.wmnet/v2/wikimedia-buster/manifests/latest
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 10.2.2.44...
* TCP_NODELAY set
* connect to 10.2.2.44 port 80 failed: Connection refused
* Failed to connect to docker-registry.svc.eqiad.wmnet port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to docker-registry.svc.eqiad.wmnet port 80: Connection refused
root@registry1001:/tmp# sudo curl -X GET -k -vvvv https://docker-registry.svc.eqiad.wmnet/v2/wikimedia-buster/manifests/latest
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 10.2.2.44...
* TCP_NODELAY set
* Connected to docker-registry.svc.eqiad.wmnet (10.2.2.44) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=docker-registry.discovery.wmnet; O=Wikimedia Foundation, Inc.; C=US; ST=California; L=San Francisco
*  start date: Oct 19 08:50:01 2017 GMT
*  expire date: Oct 19 08:50:01 2022 GMT
*  issuer: CN=Puppet CA: palladium.eqiad.wmnet
*  SSL certificate verify ok.
> GET /v2/wikimedia-buster/manifests/latest HTTP/1.1
> Host: docker-registry.svc.eqiad.wmnet
> User-Agent: curl/7.52.1
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< Server: nginx/1.13.6
< Date: Tue, 09 Jul 2019 09:41:40 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 82
< Connection: keep-alive
< Docker-Distribution-Api-Version: registry/2.0
< X-Content-Type-Options: nosniff
< Docker-Distribution-Api-Version: registry/2.0
< 
{"errors":[{"code":"MANIFEST_INVALID","message":"manifest invalid","detail":{}}]}
* Curl_http_done: called premature == 0
* Connection #0 to host docker-registry.svc.eqiad.wmnet left intact

Doing it on codfw, works seamlessly

root@registry1001:/tmp# sudo curl -X GET -k -vvvv https://docker-registry.svc.codfw.wmnet/v2/wikimedia-buster/manifests/latest
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 10.2.1.44...
* TCP_NODELAY set
* Connected to docker-registry.svc.codfw.wmnet (10.2.1.44) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=docker-registry.discovery.wmnet; O=Wikimedia Foundation, Inc.; C=US; ST=California; L=San Francisco
*  start date: Oct 19 08:50:01 2017 GMT
*  expire date: Oct 19 08:50:01 2022 GMT
*  issuer: CN=Puppet CA: palladium.eqiad.wmnet
*  SSL certificate verify ok.
> GET /v2/wikimedia-buster/manifests/latest HTTP/1.1
> Host: docker-registry.svc.codfw.wmnet
> User-Agent: curl/7.52.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx/1.13.6
< Date: Tue, 09 Jul 2019 09:43:08 GMT
< Content-Type: application/vnd.docker.distribution.manifest.v1+prettyjws
< Content-Length: 2045
< Connection: keep-alive
< Docker-Content-Digest: sha256:a8606a6d9865d6eb310cec91ccfb1067ee0c7e971d56e956f15897f922cf2110
< Docker-Distribution-Api-Version: registry/2.0
< Etag: "sha256:a8606a6d9865d6eb310cec91ccfb1067ee0c7e971d56e956f15897f922cf2110"
< X-Content-Type-Options: nosniff
< Docker-Distribution-Api-Version: registry/2.0
< 
{
   "schemaVersion": 1,
   "name": "wikimedia-buster",
   "tag": "latest",
   "architecture": "amd64",
   "fsLayers": [
      {
         "blobSum": "sha256:de51cc96c146da3f64ee154f732b62d50d223835796b1139fa29b324292f0b79"
      }
   ],
   "history": [
      {
         "v1Compatibility": "{\"architecture\":\"amd64\",\"comment\":\"Imported from -\",\"config\":{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[\"LC_ALL=C.UTF-8\"],\"Cmd\":[\"/bin/sh\",\"-c\",\"/bin/bash\"],\"Image\":\"\",\"Volumes\":null,\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":{\"architecture\":\"amd64\",\"build-date\":\"2019-07-05T13:16:38Z\",\"description\":\"Debian buster amd64\",\"name\":\"docker-registry.discovery.wmnet/wikimedia-buster:latest\"}},\"container_config\":{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":null,\"Cmd\":null,\"Image\":\"\",\"Volumes\":null,\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":null},\"created\":\"2019-07-05T13:16:41.55881672Z\",\"docker_version\":\"1.12.6\",\"id\":\"94a6be54d57198c06748cd5c9ed9d058e979567389a8be4eb7b53731c1d0106b\",\"os\":\"linux\"}"
      }
   ],
   "signatures": [
      {
         "header": {
            "jwk": {
               "crv": "P-256",
               "kid": "7MM4:CW3M:EFBF:BGOA:OVGI:WX72:JF2L:RJRY:DZ6C:5VQU:5VLX:UOSN",
               "kty": "EC",
               "x": "oCsa4QRiARg02xc8WwN0GLmLNqU1So4EEr6UIF4JbNI",
               "y": "aBnk6VHQNl5crtYHUSUKqS9esHN2W8tGNst1Yu_IWnE"
            },
            "alg": "ES256"
         },
         "signature": "Tj_vLM1DVQVUh5HDdwYNk3ySoCLPKRSLBEibXlwJBtgt1pE9C68o7l8aSQtGxKSWiI6cp9f6mdu-lV0x0ME5CA",
         "protected": "eyJmb3JtYXRMZW5ndGgiOjEzOTgsImZvcm1hdFRhaWwiOiJDbjAiLCJ0aW1lIjoiMjAxOS0wNy0wOVQwOTo0MzowOFoifQ"
      }
   ]
* Curl_http_done: called premature == 0
* Connection #0 to host docker-registry.svc.codfw.wmnet left intact

Enabling debug log level on eqiad registry gives more context and points to swift as the culprit

Jul 09 09:41:40 registry1001 docker-registry[4519]: time="2019-07-09T09:41:40.599808114Z" level=info msg="rewriting manifest sha256:447f8af8dea0d98d4d7c0abea37a52ec183382fbe1748a4fd087c5d5bf75a62d in schema1 format to support old client" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=8e982fd7-a4b2-496f-b6e5-5b9ca7a9cdf1 http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:41:40 registry1001 docker-registry[4519]: time="2019-07-09T09:41:40.808312244Z" level=error msg="response completed with error" err.code="manifest invalid" err.detail="unknown blob" err.message="manifest invalid" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=8e982fd7-a4b2-496f-b6e5-5b9ca7a9cdf1 http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=511.742978ms http.response.status=400 http.response.written=82 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:41:40 registry1001 docker-registry[4519]: 127.0.0.1 - - [09/Jul/2019:09:41:40 +0000] "GET /v2/wikimedia-buster/manifests/latest HTTP/1.1" 400 82 "" "curl/7.52.1"
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.218922183Z" level=debug msg="authorizing request" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.219050576Z" level=debug msg=GetImageManifest go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.271632199Z" level=debug msg="swift.GetContent("/docker/registry/v2/repositories/wikimedia-buster/_manifests/tags/latest/current/link")" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" trace.duration=52.482071ms trace.file="github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).GetContent" trace.id=27fec14d-3000-49b7-8dd1-fdf175706c4c trace.line=95 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.271752836Z" level=debug msg="(*manifestStore).Get" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.303637355Z" level=debug msg="swift.GetContent("/docker/registry/v2/repositories/wikimedia-buster/_manifests/revisions/sha256/447f8af8dea0d98d4d7c0abea37a52ec183382fbe1748a4fd087c5d5bf75a62d/link")" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" trace.duration=31.78008ms trace.file="github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).GetContent" trace.id=4d9cd200-d9ce-4f56-b604-65204f574029 trace.line=95 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.399049261Z" level=debug msg="swift.Reader("/docker/registry/v2/blobs/sha256/44/447f8af8dea0d98d4d7c0abea37a52ec183382fbe1748a4fd087c5d5bf75a62d/data", 0)" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" trace.duration=94.410943ms trace.file="github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Reader" trace.id=fcc0c90b-c7b4-43e2-9ee4-0cd097bb1054 trace.line=125 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.399219304Z" level=debug msg="(*schema2ManifestHandler).Unmarshal" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.399503086Z" level=info msg="rewriting manifest sha256:447f8af8dea0d98d4d7c0abea37a52ec183382fbe1748a4fd087c5d5bf75a62d in schema1 format to support old client" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.437189999Z" level=debug msg="swift.GetContent("/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link")" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" trace.duration=36.420267ms trace.file="github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).GetContent" trace.id=471aa815-6681-4c08-8ba7-cb318dcaef46 trace.line=95 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.586959387Z" level=debug msg="swift.Stat("/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data")" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" trace.duration=148.682085ms trace.file="github.com/docker/distribution/registry/storage/driver/base/base.go" trace.func="github.com/docker/distribution/registry/storage/driver/base.(*Base).Stat" trace.id=349f5584-08f7-4b2f-b640-4fae8fbc8dda trace.line=155 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: time="2019-07-09T09:44:33.587258847Z" level=error msg="response completed with error" err.code="manifest invalid" err.detail="unknown blob" err.message="manifest invalid" go.version=go1.11.4 http.request.host=docker-registry.svc.eqiad.wmnet http.request.id=11681be1-3028-406f-b9e9-3dcbc30b6c3a http.request.method=GET http.request.remoteaddr=10.2.2.44 http.request.uri="/v2/wikimedia-buster/manifests/latest" http.request.useragent="curl/7.52.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=370.401006ms http.response.status=400 http.response.written=82 vars.name=wikimedia-buster vars.reference=latest
Jul 09 09:44:33 registry1001 docker-registry[20618]: 127.0.0.1 - - [09/Jul/2019:09:44:33 +0000] "GET /v2/wikimedia-buster/manifests/latest HTTP/1.1" 400 82 "" "curl/7.52.1"

it seems that /docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de434 file does not exist on eqiad's swift while it does in codfw.

Searching in the swift's logs the aforementioned file it seems that the synchronization was done but ended in conflict for some reason.

Jul  5 15:32:02 ms-be1044 container-server: 10.64.48.144 - - [05/Jul/2019:15:32:02 +0000] "PUT /sdb3/12897/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" 201 - "PUT http://swift.svc.eqiad.wmnet/sdh1/1088/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" "txb5bcd85d73db477988e57-005d1f6d72" "object-server 1682" 0.0013 "-" 69376 0
Jul  5 15:32:02 ms-be1023 object-server: 10.64.32.221 - - [05/Jul/2019:15:32:02 +0000] "PUT /sdh1/1088/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" 201 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" "txb5bcd85d73db477988e57-005d1f6d72" "proxy-server 19771" 0.0895 "-" 1840 0
Jul  5 15:32:02 ms-be1039 object-server: 10.64.32.221 - - [05/Jul/2019:15:32:02 +0000] "PUT /sdh1/1088/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" 201 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" "txb5bcd85d73db477988e57-005d1f6d72" "proxy-server 19771" 0.0763 "-" 1682 0
Jul  5 15:32:02 ms-be1026 container-server: 10.64.0.138 - - [05/Jul/2019:15:32:02 +0000] "PUT /sda3/12897/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" 201 - "PUT http://swift.svc.eqiad.wmnet/sdn1/1088/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" "txb5bcd85d73db477988e57-005d1f6d72" "object-server 2054" 0.0061 "-" 9543 0
Jul  5 15:32:02 ms-be1044 object-server: 10.64.32.221 - - [05/Jul/2019:15:32:02 +0000] "PUT /sdn1/1088/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" 201 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link" "txb5bcd85d73db477988e57-005d1f6d72" "proxy-server 19771" 0.1022 "-" 2054 0
Jul  5 15:31:57 ms-fe1008 proxy-server: 10.192.0.106 10.64.32.221 05/Jul/2019/15/31/57 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 201 - Python-urllib/2.7 - - - d41d8cd98f00b204e9800998ecf8427e tx7046e24f53a24386aecb6-005d1f6d6d - 0.0886 - cs:valid 1562340717.264012098 1562340717.352655888 0
Jul  5 15:32:02 ms-fe1008 proxy-server: 10.192.16.77 10.64.32.221 05/Jul/2019/15/32/02 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/repositories/wikimedia-buster/_layers/sha256/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/link HTTP/1.0 201 - Python-urllib/2.7 - 71 - 580b57016b2299e0423026d5cec26bfa txb5bcd85d73db477988e57-005d1f6d72 - 0.1075 - cs:valid 1562340722.369966030 1562340722.477478027 0
Jul  5 15:33:35 ms-be1036 object-server: 10.64.0.39 - - [05/Jul/2019:15:33:35 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx7048851e24bd49d1a85f0-005d1f6dcf" "proxy-server 11935" 0.0009 "-" 1843 0
Jul  5 15:33:35 ms-be1030 object-server: 10.64.0.39 - - [05/Jul/2019:15:33:35 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx7048851e24bd49d1a85f0-005d1f6dcf" "proxy-server 11935" 0.0020 "-" 1888 0
Jul  5 15:33:35 ms-be1032 object-server: 10.64.0.39 - - [05/Jul/2019:15:33:35 +0000] "PUT /sde1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx7048851e24bd49d1a85f0-005d1f6dcf" "proxy-server 11935" 0.0045 "-" 1852 0
Jul  5 15:33:35 ms-fe1006 proxy-server: 10.192.32.14 10.64.0.39 05/Jul/2019/15/33/35 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e tx7048851e24bd49d1a85f0-005d1f6dcf - 0.0401 - cs:valid 1562340815.308619022 1562340815.348696947 0
Jul  5 15:34:12 ms-be1036 object-server: 10.64.0.39 - - [05/Jul/2019:15:34:12 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "txbedbf4ce3db345358a6b1-005d1f6df4" "proxy-server 11935" 0.0008 "-" 1761 0
Jul  5 15:34:12 ms-be1032 object-server: 10.64.0.39 - - [05/Jul/2019:15:34:12 +0000] "PUT /sde1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "txbedbf4ce3db345358a6b1-005d1f6df4" "proxy-server 11935" 0.0010 "-" 1849 0
Jul  5 15:34:12 ms-be1030 object-server: 10.64.0.39 - - [05/Jul/2019:15:34:12 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "txbedbf4ce3db345358a6b1-005d1f6df4" "proxy-server 11935" 0.0010 "-" 1878 0
Jul  5 15:34:12 ms-fe1006 proxy-server: 10.192.16.77 10.64.0.39 05/Jul/2019/15/34/12 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e txbedbf4ce3db345358a6b1-005d1f6df4 - 0.0226 - cs:valid 1562340852.382550001 1562340852.405145884 0
Jul  5 15:34:27 ms-be1036 object-server: 10.64.32.221 - - [05/Jul/2019:15:34:27 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx13d782de23f545b89dbae-005d1f6e03" "proxy-server 19759" 0.0007 "-" 1885 0
Jul  5 15:34:27 ms-be1032 object-server: 10.64.32.221 - - [05/Jul/2019:15:34:27 +0000] "PUT /sde1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx13d782de23f545b89dbae-005d1f6e03" "proxy-server 19759" 0.0011 "-" 1791 0
Jul  5 15:34:27 ms-be1030 object-server: 10.64.32.221 - - [05/Jul/2019:15:34:27 +0000] "PUT /sdm1/40029/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" 409 - "PUT http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data" "tx13d782de23f545b89dbae-005d1f6e03" "proxy-server 19759" 0.0022 "-" 1706 0
Jul  5 15:34:27 ms-fe1008 proxy-server: 10.192.0.106 10.64.32.221 05/Jul/2019/15/34/27 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e tx13d782de23f545b89dbae-005d1f6e03 - 0.0456 - cs:valid 1562340867.114228010 1562340867.159805059 0

I don't know what could be happening, i think the only way to debug it is to set synchronization log level and debug it while happening

Event Timeline

fsero created this task.Jul 9 2019, 9:50 AM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald TranscriptJul 9 2019, 9:50 AM
fsero renamed this task from docker registry swift replication is not replication content between DCs to docker registry swift replication is not replicating content between DCs.Jul 9 2019, 9:51 AM
fsero added a project: serviceops.
fsero updated the task description. (Show Details)Jul 9 2019, 11:19 AM

What I could find so far from swift logs:

Four PUTs last friday, one in codfw and three in eqiad. As expected the one in codfw is from the registry and the rest are from swift synchronizing containers

root@lithium:~# grep proxy-server: T227570 
Jul  5 15:30:56 ms-fe2008 proxy-server: 10.192.0.113 10.192.0.113 05/Jul/2019/15/30/56 PUT /v1/AUTH_docker/docker_registry_codfw/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 201 - distribution/v2.7.0-rc.0%2Bunknown AUTH_tk707683dca... - - - txbd4da88eb55247f3b1910-005d1f6d30 - 0.0272 - - 1562340656.111365080 1562340656.138525963 0
Jul  5 15:31:57 ms-fe1008 proxy-server: 10.192.0.106 10.64.32.221 05/Jul/2019/15/31/57 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 201 - Python-urllib/2.7 - - - d41d8cd98f00b204e9800998ecf8427e tx7046e24f53a24386aecb6-005d1f6d6d - 0.0886 - cs:valid 1562340717.264012098 1562340717.352655888 0
Jul  5 15:33:35 ms-fe1006 proxy-server: 10.192.32.14 10.64.0.39 05/Jul/2019/15/33/35 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e tx7048851e24bd49d1a85f0-005d1f6dcf - 0.0401 - cs:valid 1562340815.308619022 1562340815.348696947 0
Jul  5 15:34:12 ms-fe1006 proxy-server: 10.192.16.77 10.64.0.39 05/Jul/2019/15/34/12 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e txbedbf4ce3db345358a6b1-005d1f6df4 - 0.0226 - cs:valid 1562340852.382550001 1562340852.405145884 0
Jul  5 15:34:27 ms-fe1008 proxy-server: 10.192.0.106 10.64.32.221 05/Jul/2019/15/34/27 PUT /v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data HTTP/1.0 202 - Python-urllib/2.7 - - 76 d41d8cd98f00b204e9800998ecf8427e tx13d782de23f545b89dbae-005d1f6e03 - 0.0456 - cs:valid 1562340867.114228010 1562340867.159805059 0

And trying to list that object in codfw indeed returns results but not in eqiad:

root@ms-fe2005:~# swift stat docker_registry_codfw files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data
       Account: AUTH_docker
     Container: docker_registry_codfw
        Object: files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data
  Content Type: application/octet-stream
Content Length: 1080
 Last Modified: Fri, 05 Jul 2019 15:30:57 GMT
          ETag: "4f8f39d1094dc8071b31168436a344a5"
      Manifest: docker_registry_codfw/segments/2f6/46f636b65722f72656769737472792f76322f7265706f7369746f726965732f77696b696d656469612d6275737465722f5f75706c6f6164732f34643438663635302d376437612d343964342d613930332d6235623462616334303431642f64617461f974bdea4eed296cb4fd4267010aad3938768b2be476784faaa64a08d7c18580da39a3ee5e6b4b0d3255bfef95601890afd80709
 Accept-Ranges: bytes
   X-Timestamp: 1562340656.11204
    X-Trans-Id: txf750ada709394b5bb4e7e-005d247467
root@ms-fe1005:~# swift stat docker_registry_eqiad files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data
Object HEAD failed: http://swift.svc.eqiad.wmnet/v1/AUTH_docker/docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/49/490e46c21e6009300716565b30f70bd7c7936af4f81dcec609de4344de3dc5e1/data 404 Not Found
Failed Transaction ID: txfd355478e32549d9851e5-005d24744c

Note that the object in codfw shows a Manifest (i.e. it has been uploaded as a "large file") that points to docker_registry_codfw/segments/... and what it means is that swift will consider this a dynamic large file and the manifest points to a prefix that contains all segments to this file.

It looks like docker registry always uploads files as large files regardless of size, but anyways my working theory is that large files combined with how swift container synchronization works, and the fact that we're synchronizing between two different container names means that large files in codfw won't work as-is in eqiad because the Manifest has the container name in it too. This is the scenario the first note on https://docs.openstack.org/swift/latest/overview_container_sync.html is referring to.

There's only one segment of 1080 bytes for that file in codfw:

root@ms-fe2005:~# swift stat docker_registry_codfw segments/2f6/46f636b65722f72656769737472792f76322f7265706f7369746f726965732f77696b696d656469612d6275737465722f5f75706c6f6164732f34643438663635302d376437612d343964342d613930332d6235623462616334303431642f64617461f974bdea4eed296cb4fd4267010aad3938768b2be476784faaa64a08d7c18580da39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
       Account: AUTH_docker
     Container: docker_registry_codfw
        Object: segments/2f6/46f636b65722f72656769737472792f76322f7265706f7369746f726965732f77696b696d656469612d6275737465722f5f75706c6f6164732f34643438663635302d376437612d343964342d613930332d6235623462616334303431642f64617461f974bdea4eed296cb4fd4267010aad3938768b2be476784faaa64a08d7c18580da39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
  Content Type: application/octet-stream
Content Length: 1080
 Last Modified: Fri, 05 Jul 2019 15:30:55 GMT
          ETag: 64188db2e9f2d08b3482be00831b43ce
 Accept-Ranges: bytes
   X-Timestamp: 1562340654.75711
    X-Trans-Id: txf9eaba8331564f31828bd-005d24759f

And in eqiad the segment is there too, but not in the container that's in the manifest (i.e. docker_registry_codfw vs docker_registry_eqiad)

root@ms-fe1005:~# swift stat docker_registry_eqiad segments/2f6/46f636b65722f72656769737472792f76322f7265706f7369746f726965732f77696b696d656469612d6275737465722f5f75706c6f6164732f34643438663635302d376437612d343964342d613930332d6235623462616334303431642f64617461f974bdea4eed296cb4fd4267010aad3938768b2be476784faaa64a08d7c18580da39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
       Account: AUTH_docker
     Container: docker_registry_eqiad
        Object: segments/2f6/46f636b65722f72656769737472792f76322f7265706f7369746f726965732f77696b696d656469612d6275737465722f5f75706c6f6164732f34643438663635302d376437612d343964342d613930332d6235623462616334303431642f64617461f974bdea4eed296cb4fd4267010aad3938768b2be476784faaa64a08d7c18580da39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
  Content Type: application/octet-stream
Content Length: 1080
 Last Modified: Fri, 05 Jul 2019 15:30:55 GMT
          ETag: 64188db2e9f2d08b3482be00831b43ce
 Accept-Ranges: bytes
   X-Timestamp: 1562340654.75711
    X-Trans-Id: txd6d411ead07447d09f099-005d2475ac

Change 521828 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registry, swift: some images are not replicated.

https://gerrit.wikimedia.org/r/521828

fsero claimed this task.Jul 10 2019, 10:36 AM
fsero triaged this task as High priority.
fsero moved this task from Backlog to Doing on the serviceops board.

I ran an audit of all images that might suffer from this problem, assuming latest is the tag we're looking for (which is for the majority of images AFAICT) looks like 66 images are affected

$ curl -k https://docker-registry.svc.eqiad.wmnet/v2/_catalog?n=10000 > eqiad_catalog
$ curl -k https://docker-registry.svc.codfw.wmnet/v2/_catalog?n=10000 > codfw_catalog
$ md5sum codfw_catalog eqiad_catalog 
f613b7b8cc7c1ebc41fdfc7b4edc3f90  codfw_catalog
f613b7b8cc7c1ebc41fdfc7b4edc3f90  eqiad_catalog
$ for i in $(cat eqiad_catalog | jq .repositories | tr -d \", ) ; do curl -s -k  https://docker-registry.svc.codfw.wmnet/v2/$i/manifests/latest | grep -q MANIFEST_INVALID && echo $i ; done
$ for i in $(cat eqiad_catalog | jq .repositories | tr -d \", ) ; do curl -s -k  https://docker-registry.svc.eqiad.wmnet/v2/$i/manifests/latest | grep -q MANIFEST_INVALID && echo $i ; done
dev/mediawiki
dev/mediawiki-xdebug
dev/parsoid
dev/restbase
dev/stretch
dev/stretch-php72
dev/stretch-php72-apache2
dev/stretch-php72-webserver
dev/stretch-php72-webserver-xdebug
nodejs10-slim
releng/bazel
releng/ci-src-setup
releng/civicrm
releng/composer
releng/composer-hhvm
releng/composer-package
releng/composer-package-hhvm
releng/composer-package-php71
releng/composer-package-php72
releng/composer-package-php73
releng/composer-php71
releng/composer-php72
releng/composer-php73
releng/composer-test
releng/composer-test-php72
releng/curl
releng/helm-linter
releng/java8-sonar-scanner
releng/mediawiki-phan
releng/mediawiki-phan-seccheck
releng/node10-test-browser
releng/npm-test-oojsui
releng/operations-puppet
releng/php
releng/php-ast
releng/php-compile
releng/php71
releng/php71-compile
releng/php72
releng/php72-compile
releng/php73
releng/php73-compile
releng/phpmetrics
releng/quibble-coverage
releng/quibble-fresnel
releng/quibble-stretch
releng/quibble-stretch-bundle
releng/quibble-stretch-hhvm
releng/quibble-stretch-php70
releng/quibble-stretch-php71
releng/quibble-stretch-php72
releng/quibble-stretch-php73
releng/tox
releng/tox-acme-chief
releng/tox-cergen
releng/tox-conftool
releng/tox-eventlogging
releng/tox-labs-striker
releng/tox-mysqld
releng/tox-poolcounter
releng/tox-pyspark
releng/tox-pywikibot
releng/zuul-cloner
service-checker
wikimedia-buster
wikimedia-jessie

Thanks for the audit @fgiunchedi !

I think our next step is to try if creating a swift container with the same name in the eqiad cluster would solve the missing segments problem. In order to do that i'm going to do the following:

  1. backup the swift container on codfw creating docker_registry_backup.
  2. modify the sync-to metadata in codfw to point to docker_registry_codfw on eqiad
  3. create docker_registry_codfw container on eqiad
  4. modify the sync-to metadata in eqiad to point to docker_registry_codfw on codfw

Mentioned in SAL (#wikimedia-operations) [2019-07-11T10:08:45Z] <fsero> creating swift docker_registry_container_backup T227570

Mentioned in SAL (#wikimedia-operations) [2019-07-11T10:28:37Z] <fsero> depooling ms-fe2005 for docker_registry_backups T227570

Change 523100 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registry: introducing read only mode for maintenances

https://gerrit.wikimedia.org/r/523100

Change 521828 merged by Fsero:
[operations/puppet@production] registry, swift: some images are not replicated.

https://gerrit.wikimedia.org/r/521828

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:48:59Z] <fsero> T227570 changing container_synchronization on docker_registry_codfw to //docker_registry/eqiad/AUTH_docker/docker_registry_codfw

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:50:19Z] <fsero> creating docker_registry_codfw on eqiad T227570

Mentioned in SAL (#wikimedia-operations) [2019-07-15T09:39:42Z] <fsero> repooling ms-fe2005 T227570

fsero closed this task as Resolved.Jul 16 2019, 2:24 PM

uploaded a new image today (coredns) and rechecked like @fgiunchedi and it seems to be working \o/ so resolving this issue.

Change 523100 merged by Fsero:
[operations/puppet@production] registry: introducing read only mode for maintenances

https://gerrit.wikimedia.org/r/523100

Change 525223 had a related patch set uploaded (by Fsero; owner: Fsero):
[operations/puppet@production] registry: setting eqiad registries in read_only_mode

https://gerrit.wikimedia.org/r/525223

Change 525223 merged by Fsero:
[operations/puppet@production] registry: setting eqiad registries in read_only_mode

https://gerrit.wikimedia.org/r/525223

fsero added a comment.Wed, Jul 24, 8:14 AM

as result of this issue, registries in the passive DC (eqiad now) are set in read only mode (they accept pulls but no pushes of new images)