kubernetes unable to pull images from registry
Closed, Resolved (Public)


Attempted twice to deploy in staging today; same result both times. The deployment failed with a timeout and was rolled back.

This is the status (visible by running helmfile status in a different process) while waiting for the timeout to expire:

mholloway-shell@deploy1001:/srv/deployment-charts/helmfile.d/services/staging/mobileapps$ source .hfenv; helmfile status
Getting status staging
LAST DEPLOYED: Tue Jun 30 17:33:14 2020
NAMESPACE: mobileapps

==> v1/ConfigMap
NAME                                    DATA  AGE
config-staging                          1     19d
mobileapps-staging-envoy-config-volume  1     19d
mobileapps-staging-metrics-config       1     19d
mobileapps-staging-tls-proxy-certs      2     19d

==> v1/Deployment
NAME                READY  UP-TO-DATE  AVAILABLE  AGE
mobileapps-staging  1/1    1           1          19d

==> v1/NetworkPolicy
NAME                POD-SELECTOR                    AGE
mobileapps-staging  app=mobileapps,release=staging  19d

==> v1/Pod(related)
NAME                                READY  STATUS            RESTARTS  AGE
mobileapps-staging-8f64589f6-tvvss  3/3    Running           0         5d21h
mobileapps-staging-c986ff654-62xmv  2/3    ImagePullBackOff  0         2m13s

==> v1/Secret
NAME                              TYPE    DATA  AGE
mobileapps-staging-secret-config  Opaque  0     19d

==> v1/Service
NAME                            TYPE      CLUSTER-IP    EXTERNAL-IP  PORT(S)        AGE
mobileapps-staging-tls-service  NodePort  <none>       4888:4888/TCP  19d

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-06-30T17:40:41Z] <mdholloway> mobileapps deployments on k8s failing with timeouts; filed T256786

JMeybohm renamed this task from "mobileapps kubernetes deployment is timing out" to "kubernetes unable to pull images from registry". Jul 1 2020, 8:17 AM

Still getting ErrImagePull in kubectl get events:

73s         Normal    Pulling             Pod          pulling image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production"
73s         Warning   Failed              Pod          Failed to pull image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production": rpc error: code = Unknown desc = Error response from daemon: Get https://docker-registry.discovery.wmnet/v1/_ping: x509: certificate has expired or is not yet valid
73s         Warning   Failed              Pod          Error: ErrImagePull
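
For reference, the event list can be narrowed to just the pull failures with a small filter. This is only a sketch: `pull_failures` is a hypothetical helper name, and the `mobileapps` namespace in the usage comment is taken from the helmfile status output above.

```shell
# Hypothetical helper: keep only the image pull failure lines from
# `kubectl get events` output read on stdin.
pull_failures() {
  grep -E 'ErrImagePull|ImagePullBackOff|Failed to pull image'
}

# Usage (assumes kubectl access to the mobileapps namespace):
#   kubectl -n mobileapps get events | pull_failures
```

The "Failed to pull image" line is the useful one, since it carries the underlying daemon error (here the x509 message).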

The certificate looks okay, though:

* Server certificate:
*  subject: CN=docker-registry.discovery.wmnet
*  start date: Aug 27 14:52:23 2019 GMT
*  expire date: Aug 26 14:52:23 2024 GMT
*  subjectAltName: host "docker-registry.discovery.wmnet" matched cert's "docker-registry.discovery.wmnet"
*  issuer: CN=Puppet CA: palladium.eqiad.wmnet
*  SSL certificate verify ok.

Only docker insists that the certificate is not valid, so I guess dockerd does not reload ca-certificates (not even on SIGHUP).
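
One way to spot affected daemons, as a rough sketch: compare when dockerd started with when the CA bundle was last modified. The helper name `started_before_bundle` is hypothetical, and the bundle path, `pidof`, procps-ng `ps -o etimes=` and GNU `stat -c %Y` in the usage comment are assumptions about the nodes, not something confirmed in this task.

```shell
# Hypothetical helper: succeeds (exit 0) when the daemon start time is older
# than the CA bundle's last modification time.
#   $1: daemon start time, seconds since the epoch
#   $2: CA bundle mtime, seconds since the epoch
started_before_bundle() {
  [ "$1" -lt "$2" ]
}

# Usage on a node (all paths and tools below are assumptions):
#   pid=$(pidof dockerd)
#   start=$(( $(date +%s) - $(ps -o etimes= -p "$pid") ))
#   mtime=$(stat -c %Y /etc/ssl/certs/ca-certificates.crt)
#   started_before_bundle "$start" "$mtime" && echo "dockerd predates the CA bundle"
```

A daemon that predates the bundle is only a candidate for a restart; the pull error in the events is still the actual symptom.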

JMeybohm raised the priority of this task from High to Unbreak Now!. Jul 1 2020, 8:44 AM

Raising priority, as we have the same situation on the prod clusters.

Mentioned in SAL (#wikimedia-operations) [2020-07-01T08:53:45Z] <jayme> draining kubernetes staging node kubestage1001.eqiad.wmnet - T256786

Mentioned in SAL (#wikimedia-operations) [2020-07-01T09:23:06Z] <jayme> restarting dockerd on kubestage1002.eqiad.wmnet - T256786

Some docker daemons still have the old Puppet CA loaded.
Unfortunately, a docker reload does not reload the CA, so we need to restart docker on: kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet. Newer Kubernetes nodes started with the updated CA already in place and are fine.
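
A one-node-at-a-time restart over those host ranges could look roughly like the sketch below. It is not the exact procedure used here: `expand_range`, `rolling_docker_restart` and the `RUN` dry-run prefix are hypothetical names, and reaching the nodes via `ssh` and restarting a systemd unit called `docker` are assumptions.

```shell
# Hypothetical helper: expand PREFIX[FIRST-LAST]SUFFIX into one hostname per line.
#   expand_range kubernetes 1001 1004 .eqiad.wmnet
expand_range() {
  for n in $(seq "$2" "$3"); do
    printf '%s%s%s\n' "$1" "$n" "$4"
  done
}

# Drain, restart docker, and uncordon each node in turn, stopping at the
# first failure. Set RUN=echo for a dry run that only prints the commands.
rolling_docker_restart() {
  for node in "$@"; do
    ${RUN:-} kubectl drain "$node" --ignore-daemonsets || return 1
    # Assumption: nodes are reachable over ssh and the unit is named "docker".
    ${RUN:-} ssh "$node" sudo systemctl restart docker || return 1
    ${RUN:-} kubectl uncordon "$node" || return 1
  done
}

# Dry-run usage:
#   RUN=echo; rolling_docker_restart $(expand_range kubernetes 1001 1004 .eqiad.wmnet)
```

Stopping at the first failure keeps a node that did not drain or restart cleanly from being followed by further drains.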

Mentioned in SAL (#wikimedia-operations) [2020-07-01T09:46:35Z] <jayme> cordoning kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet - T256786

Mentioned in SAL (#wikimedia-operations) [2020-07-01T10:45:41Z] <jayme> draining and docker restart (one at a time) kubernetes[1001-1004].eqiad.wmnet - T256786

JMeybohm claimed this task.

Did a rolling restart on all affected nodes; we should be fine now. Sorry for the inconvenience, and thanks a lot for the reports, @jeena && @Mholloway!

I've deployed the latest changes of blubberoid and mobileapps to staging just to be sure.