
Kubernetes timing out before pulling the mediawiki-multiversion image
Open, MediumPublic

Description

When trying to pull multiversion:2021-06-08-184813-publish on kubestage1002, Kubernetes appeared to time out during the extraction of the image.

Warning Failed 7m49s kubelet, kubestage1002.eqiad.wmnet Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-06-08-184813-publish": rpc error: code = Unknown desc = context canceled

Event Timeline

jijiki renamed this task from Kubernetes timeing out befor pulling the mediawiki-multiversion image to Kubernetes timeing out before pulling the mediawiki-multiversion image. Jun 9 2021, 8:15 AM
jijiki triaged this task as Medium priority.
JMeybohm added a subscriber: JMeybohm.

The default timeout is 2 minutes here; unfortunately it is not configurable for pulls alone, only for all runtime requests. See https://v1-17.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/ (--runtime-request-timeout)

For reference: in prod we did T288345 (T288509) to circumvent this (which is unfortunately not an option for staging, as we only have HDDs there).
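For illustration, raising that timeout would mean passing the flag to the kubelet on the staging nodes. Only the flag name comes from the kubelet documentation; the drop-in file path and the KUBELET_EXTRA_ARGS convention below are hypothetical, not how this environment actually configures the kubelet:

```
# Hypothetical systemd drop-in for the kubelet (path and variable are illustrative):
# /etc/systemd/system/kubelet.service.d/99-runtime-timeout.conf
[Service]
# Note: --runtime-request-timeout applies to ALL runtime requests, not just pulls.
Environment="KUBELET_EXTRA_ARGS=--runtime-request-timeout=5m"
```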

We have undeployed mw in staging, so this is not an issue for the time being. If we deploy to staging again in the future, a band-aid would be to download the image to the potential hosts beforehand:

cumin 'kubestage*' 'docker --config /var/lib/kubelet pull docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-07-01-232810-publish'

> We have undeployed mw in staging, so this is not an issue for the time being. If we deploy to staging again in the future, a band-aid would be to download the image to the potential hosts beforehand:
>
> cumin 'kubestage*' 'docker --config /var/lib/kubelet pull docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-07-01-232810-publish'

Is this something we could do in an automated CI job? Asking in the context of T288629.

We discussed a few different options during the ServiceOps meeting today:

  • Do testing on production clusters which have SSDs - prefer not to, in case testing load affects real prod services
  • Do testing on staging-codfw which has SSDs - prefer to keep this as a playground for ServiceOps to break
  • Put SSDs in the staging cluster machines - DC ops prefers not to do this, since it will happen anyway as part of the HW refresh next year
  • Have a timer pull new images to staging cluster - no objections
  • Give CI permissions to pull the image - don't think there were any objections.

I think having CI pull the image is the nicest since you'll only pull image versions that actually need verification, plus you know when the image is ready. But I think it might be harder, as we'd need to give CI access to those boxes (via SSH I assume).

A timer is simpler, we could just pull :latest every minute or so. I'm not exactly sure how CI would figure out whether the image has already been pulled or not though.

> A timer is simpler, we could just pull :latest every minute or so. I'm not exactly sure how CI would figure out whether the image has already been pulled or not though.

Being able to pull a specific image is a must.

@Jelto and I were looking at issues in staging today that were caused by high disk and network IO on kubestage1001 (due to the mediawiki CI deployment).

>   • Have a timer pull new images to staging cluster - no objections
>   • Give CI permissions to pull the image - don't think there were any objections.

> I think having CI pull the image is the nicest since you'll only pull image versions that actually need verification, plus you know when the image is ready. But I think it might be harder, as we'd need to give CI access to those boxes (via SSH I assume).

> A timer is simpler, we could just pull :latest every minute or so. I'm not exactly sure how CI would figure out whether the image has already been pulled or not though.

I'd love not to allow SSH access (basically root, as CI would need to talk to dockerd) from CI, tbh. Pulling :latest on a regular basis could ease the issue, as layers would potentially already be available when CI deploys to Kubernetes (which can then, very quickly, pull the missing specific tag @dancy). But unfortunately CI would not know at which point the image is available on the nodes.
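If some host were ever in a position to check for the image directly, "knowing when the image is available" reduces to a retry loop. This is just a control-flow sketch; the `wait_for` helper is hypothetical, not an existing tool here:

```shell
#!/bin/sh
# wait_for N CMD... : retry CMD up to N times, one second apart,
# succeeding as soon as CMD succeeds.
# A real check on a node could be something like:
#   wait_for 60 docker image inspect "$image"
wait_for() {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Demonstrate the control flow with a trivial predicate:
wait_for 3 true && echo "ready"
```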

Unfortunately it seems the canceled context makes dockerd cancel the pull process (which makes sense), so the "just try for a bit longer" approach discussed with @akosiaris won't work either.

Maybe we should go with increasing the timeout in staging after all and fast-track the replacement of the nodes, so we don't spend more time on this.

> Maybe we should go with increasing the timeout in staging after all and fast-track the replacement of the nodes, so we don't spend more time on this.

This idea is interesting. What is the current timeout and what would it have to be for the pull to succeed?

> The default timeout is 2 minutes here; unfortunately it is not configurable for pulls alone, only for all runtime requests. See https://v1-17.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/ (--runtime-request-timeout)

I actually don't really like that idea, as it leaves us with different settings between prod and staging for a quite critical component. But it might be feasible as a workaround for a limited amount of time. I'd assume we would have to set it to something >=5min.

> A timer is simpler, we could just pull :latest every minute or so. I'm not exactly sure how CI would figure out whether the image has already been pulled or not though.

> Being able to pull a specific image is a must.

@Legoktm I'm backing off from this position. Let's see how far we can get using this technique.

Change 715993 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] [WIP] Automatically pull latest MediaWiki image onto staging cluster

https://gerrit.wikimedia.org/r/715993

To sum up a bit: as already mentioned, we could have a timer pulling the most recent image based on a tag pattern (and avoid using :latest), similar to how we check for one in the timer we use to deploy to mwdebug on kubernetes.

We can do this for now since we will be replacing the kubestage* hosts hopefully in Q2, with machines that have SSDs.
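A minimal sketch of that tag-pattern selection, assuming the build tags keep their `YYYY-MM-DD-HHMMSS-publish` shape so a plain lexicographic sort picks the most recent one. The inlined tag list stands in for whatever a real timer would fetch from the registry:

```shell
#!/bin/sh
# Pick the newest -publish tag from a list of image tags.
tags='2021-06-08-184813-publish
latest
2021-07-01-232810-publish
2021-06-30-110000-publish'

# Filter to timestamped -publish tags, then take the lexicographic maximum.
newest=$(printf '%s\n' "$tags" \
    | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}-publish$' \
    | sort \
    | tail -n 1)

echo "$newest"   # 2021-07-01-232810-publish
# A timer job would then run something like:
#   docker --config /var/lib/kubelet pull \
#       "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:$newest"
```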

I put up a simple patch using :latest for now (I was already in the process of writing it before you commented), I think if it works well enough, then we can improve it by copying from deploy-mwdebug.py so we only try pulling when we see a new tag rather than every minute.

> I put up a simple patch using :latest for now (I was already in the process of writing it before you commented), I think if it works well enough, then we can improve it by copying from deploy-mwdebug.py so we only try pulling when we see a new tag rather than every minute.

I'm fine with trying that, but please keep a close eye on the nodes when you merge that as it might hit them hard as well.
Please keep in mind that this will also heavily increase the space used for docker images which will lead to kubernetes needing to clean up more often (deleting unused images/layers). That again increases load on the disks (deleting layers, then potentially pulling mediawiki image again because it got cleaned up...).
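For context, the kubelet's image cleanup is threshold-driven. These are the documented kubelet flags and their defaults; how (or whether) they are tuned in this environment is not shown here:

```
# kubelet image garbage collection (documented defaults):
#   --image-gc-high-threshold=85   # disk usage % that triggers image GC
#   --image-gc-low-threshold=80    # GC deletes unused images until usage drops below this
```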

Just so everyone is clear, under normal circumstances there will be one big image per week (at the start of the train) and all images created after that are very small layers on top of the big one, so the impact of the timer-based pull on node-local image storage should be reasonable.

Change 715993 merged by Legoktm:

[operations/puppet@production] Automatically pull latest MediaWiki image onto staging cluster

https://gerrit.wikimedia.org/r/715993

Change 716435 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/release@master] Push built image with 'latest' tag as well

https://gerrit.wikimedia.org/r/716435

Change 716435 merged by jenkins-bot:

[mediawiki/tools/release@master] Push built image with 'latest' tag as well

https://gerrit.wikimedia.org/r/716435

The initial pull of the 12G image took ~11 minutes on kubestage1002; after that it just needed to pull one more layer, which took 8 seconds (note that if someone changes an i18n message, the layer could be much larger). And then @dancy was able to do a basic deploy to staging :)

In terms of load, https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=1630602050915&to=1630603604780&var-server=kubestage1002&var-datasource=thanos&var-cluster=kubernetes is what the pull looks like.

> Please keep in mind that this will also heavily increase the space used for docker images which will lead to kubernetes needing to clean up more often (deleting unused images/layers). That again increases load on the disks (deleting layers, then potentially pulling mediawiki image again because it got cleaned up...).

I think as long as the most recent MW image is kept, it should be fine to get rid of all the others, since the build process just stacks the layers.
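A hedged sketch of that cleanup policy: keep only the newest `-publish` tag and list the rest as removal candidates. The inlined tag list stands in for real `docker image ls` output:

```shell
#!/bin/sh
# List all -publish tags except the most recent one; those are safe to
# remove, since newer images stack small layers on top of the weekly base.
tags='2021-06-08-184813-publish
2021-06-30-110000-publish
2021-07-01-232810-publish'

# Timestamped tags sort lexicographically; drop the last (newest) line.
stale=$(printf '%s\n' "$tags" | sort | sed '$d')
printf '%s\n' "$stale"
# A real cleanup would then run, per stale tag:
#   docker rmi "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:$tag"
```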

> then we can improve it by copying from deploy-mwdebug.py so we only try pulling when we see a new tag rather than every minute.

After thinking about this a bit more, I don't see the value in doing this, given that's basically what docker pull is doing internally.

Wow, that's quite some time. :)

> I think as long as the most recent MW image is kept, it should be fine to get rid of all the others, since the build process just stacks the layers.

Yeah, sure. I just wanted to say that it is possible the housekeeping that kicks in when the disk is full will remove the most recent image again (if there is not currently a running container using it).

> then we can improve it by copying from deploy-mwdebug.py so we only try pulling when we see a new tag rather than every minute.

> After thinking about this a bit more, I don't see the value in doing this, given that's basically what docker pull is doing internally.

+1

I'd say we keep this issue open until the replacement (T290199) is done, and remove the hack then.

@dancy Looking at staging I see there is still a ~10h old mediawiki-bruce deployment and pod running. Is that intended? My understanding was that this is just used to verify images after building, so I had assumed the deployment would be cleaned up afterwards; since they tie up quite a few resources, it would be nice to do so.

> @dancy Looking at staging I see there is still a ~10h old mediawiki-bruce deployment and pod running. Is that intended? My understanding was that this is just used to verify images after building, so I had assumed the deployment would be cleaned up afterwards; since they tie up quite a few resources, it would be nice to do so.

Now that image pulling is taken care of I'm now attempting to work on the testing machinery. I had left that deployment running when I stopped working yesterday. I removed it now and I'll be sure not to leave any around for too long in the future.

@JMeybohm By the way, I think I managed to get the 'jenkins' k8s account auto-banned in the staging cluster while experimenting on Friday. At this point any kubectl command I attempt results in a Forbidden response, so I cannot delete a deployment that is currently deployed to staging. Can you help?

> @JMeybohm By the way, I think I managed to get the 'jenkins' k8s account auto-banned in the staging cluster while experimenting on Friday. At this point any kubectl command I attempt results in a Forbidden response, so I cannot delete a deployment that is currently deployed to staging. Can you help?

With kubectl you only have limited (viewer) permissions in the cluster/namespace. You will have to use helmfile destroy or helm delete to have the tiller service (which has elevated privileges) do it for you.

> @JMeybohm By the way, I think I managed to get the 'jenkins' k8s account auto-banned in the staging cluster while experimenting on Friday. At this point any kubectl command I attempt results in a Forbidden response, so I cannot delete a deployment that is currently deployed to staging. Can you help?

> With kubectl you only have limited (viewer) permissions in the cluster/namespace. You will have to use helmfile destroy or helm delete to have the tiller service (which has elevated privileges) do it for you.

Sorry, I shouldn't have mentioned kubectl; that just confused the issue. Here is what is happening, using helm:

helm --tiller-namespace=ci --kubeconfig=/etc/kubernetes/ci-staging.config delete --purge dancy
Error: Get "https://kubestagemaster.svc.eqiad.wmnet:6443/api/v1/namespaces/ci/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller": Forbidden

This was working until about mid-day Friday.

I logged into releases1002 fresh today and helm commands are working. Not sure what happened but I'll report more here if I find out. No action needed at this time.

@dancy Thanks for finding this issue!

I'm pretty sure it's related to some RBAC refactoring I did in 715498. I'm trying to understand what's going on there, but the credentials (kube-config) are missing for the ci namespace user. I think this is an edge case I did not test/think about, because the ci namespace uses the jenkins username. I will try to come up with a fix as soon as possible.

@akosiaris @Jelto @JMeybohm
Sorry for the false alarm. I figured out why things stopped working. I was working on releases1002.eqiad.wmnet. At some point I set the following environment variables to access some Internet resources:
http_proxy=http://webproxy.eqiad.wmnet:8080
https_proxy=http://webproxy.eqiad.wmnet:8080

When these are set, they also (negatively) affect the HTTP requests that k8s clients make.
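The usual fixes are to scope the proxy variables to the single command that needs them, or to exclude internal domains via no_proxy. A sketch, using the proxy host from the comment above (the no_proxy list is an assumption; check what your clients honour):

```shell
#!/bin/sh
# Option 1: set the proxy only for the one command that needs it,
# so kubectl/helm in the same shell are unaffected.
http_proxy=http://webproxy.eqiad.wmnet:8080 \
https_proxy=http://webproxy.eqiad.wmnet:8080 \
    sh -c 'echo "proxy inside child: ${http_proxy}"'

# The parent shell never exported the variable:
echo "proxy in parent: ${http_proxy:-unset}"

# Option 2: if the proxy must stay exported, exclude internal domains
# (assumed list):
#   export no_proxy=.wmnet,localhost,127.0.0.1
```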