restarted pod failed to schedule due to resource constraints
Open, Needs Triage · Public · BUG REPORT

Description

tools.cluebotng@tools-bastion-15:~$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
bot-6dff8488df-snqfw                0/1     Pending   0          3h22m
core-5b4bfd9d88-wtncm               1/1     Running   0          3d8h
grafana-alloy-79f4589c5f-gqwh4      1/1     Running   0          3d8h
irc-relay-7c7c4fdfd-tdb6t           1/1     Running   0          20h
pushgateway-54fc8d676c-ns2tj        1/1     Running   0          3d8h
redis-854fc8fb77-bdbfl              1/1     Running   0          3d8h
report-interface-85f4f9b766-d2xqh   1/1     Running   0          3d8h
report-interface-85f4f9b766-mxkq4   1/1     Running   0          3d8h
tools.cluebotng@tools-bastion-15:~$ kubectl events
LAST SEEN                TYPE      REASON             OBJECT                                    MESSAGE
6m36s (x222 over 108m)   Warning   FailedScheduling   Pod/bot-6dff8488df-snqfw                  0/81 nodes are available: 1 node(s) were unschedulable, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.
94s (x529 over 3h21m)    Warning   FailedScheduling   Pod/bot-6dff8488df-snqfw                  0/81 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.
tools.cluebotng@tools-bastion-15:~$ kubectl describe pod bot-6dff8488df-snqfw
Name:             bot-6dff8488df-snqfw
Namespace:        tool-cluebotng
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/component=deployments
                  app.kubernetes.io/created-by=cluebotng
                  app.kubernetes.io/managed-by=toolforge-jobs-framework
                  app.kubernetes.io/name=bot
                  app.kubernetes.io/version=2
                  jobs.toolforge.org/emails=none
                  pod-template-hash=6dff8488df
                  toolforge=tool
                  toolforge.org/mount-storage=none
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/bot-6dff8488df
Containers:
  job:
    Image:      tools-harbor.wmcloud.org/tool-cluebotng/bot:latest@sha256:3bef93cf1957c6b1a8e510db1917112f9e0816d3fbab30c1d425adafd7ed7130
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      --
      launcher run-cbng
    Limits:
      cpu:     3
      memory:  2Gi
    Requests:
      cpu:     3
      memory:  1073741824
    Liveness:  exec [/bin/sh -c health-check] delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:   exec [/bin/sh -c health-check] delay=0s timeout=5s period=1s #success=1 #failure=120
    Environment:
      NO_HOME:                                   a buildservice pod does not need a home env
      CBNG_BOT_MYSQL_CREDENTIALS:                <set to the key 'CBNG_BOT_MYSQL_CREDENTIALS' in secret 'toolforge.envvar.v1.cbng-bot-mysql-credentials'>                              Optional: false
      CBNG_BOT_PASSWORD:                         <set to the key 'CBNG_BOT_PASSWORD' in secret 'toolforge.envvar.v1.cbng-bot-password'>                                                Optional: false
      CBNG_REPORT_OAUTH_KEY:                     <set to the key 'CBNG_REPORT_OAUTH_KEY' in secret 'toolforge.envvar.v1.cbng-report-oauth-key'>                                        Optional: false
      CBNG_REPORT_OAUTH_SECRET:                  <set to the key 'CBNG_REPORT_OAUTH_SECRET' in secret 'toolforge.envvar.v1.cbng-report-oauth-secret'>                                  Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_CHANNELS:   <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_CHANNELS' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-channels'>    Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_NICK:       <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_NICK' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-nick'>            Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_PORT:       <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_PORT' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-port'>            Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_SERVER:     <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_SERVER' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-server'>        Optional: false
      IRC_RELAY_SENDER_HUGGLE_RECEIVER:          <set to the key 'IRC_RELAY_SENDER_HUGGLE_RECEIVER' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-receiver'>                  Optional: false
      IRC_RELAY_SENDER_HUGGLE_THROTTLER_CONFIG:  <set to the key 'IRC_RELAY_SENDER_HUGGLE_THROTTLER_CONFIG' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-throttler-config'>  Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_CHANNELS:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_CHANNELS' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-channels'>        Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_NICK:         <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_NICK' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-nick'>                Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_PASSWORD:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_PASSWORD' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-password'>        Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_USERNAME:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_USERNAME' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-username'>        Optional: false
      IRC_RELAY_SENDER_MAIN_THROTTLER_CONFIG:    <set to the key 'IRC_RELAY_SENDER_MAIN_THROTTLER_CONFIG' in secret 'toolforge.envvar.v1.irc-relay-sender-main-throttler-config'>      Optional: false
      REDIS_PASSWORD:                            <set to the key 'REDIS_PASSWORD' in secret 'toolforge.envvar.v1.redis-password'>                                                      Optional: false
      TOOL_DEPLOY_TOKEN:                         <set to the key 'TOOL_DEPLOY_TOKEN' in secret 'toolforge.envvar.v1.tool-deploy-token'>                                                Optional: false
      TOOL_REPLICA_PASSWORD:                     <set to the key 'TOOL_REPLICA_PASSWORD' in secret 'toolforge.envvar.v1.tool-replica-password'>                                        Optional: false
      TOOL_REPLICA_USER:                         <set to the key 'TOOL_REPLICA_USER' in secret 'toolforge.envvar.v1.tool-replica-user'>                                                Optional: false
      TOOL_TOOLSDB_PASSWORD:                     <set to the key 'TOOL_TOOLSDB_PASSWORD' in secret 'toolforge.envvar.v1.tool-toolsdb-password'>                                        Optional: false
      TOOL_TOOLSDB_SCHEMA:                       <set to the key 'TOOL_TOOLSDB_SCHEMA' in secret 'toolforge.envvar.v1.tool-toolsdb-schema'>                                            Optional: false
      TOOL_TOOLSDB_USER:                         <set to the key 'TOOL_TOOLSDB_USER' in secret 'toolforge.envvar.v1.tool-toolsdb-user'>                                                Optional: false
      TOOL_TOOLFORGE_API_URL:                    https://api.svc.tools.eqiad1.wikimedia.cloud:30003
      TOOL_REDIS_URI:                            redis://redis.svc.tools.eqiad1.wikimedia.cloud:6379
      TOOL_ELASTICSEARCH_URL:                    http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud:80
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f42pl (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-f42pl:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/created-by=cluebotng,app.kubernetes.io/managed-by=toolforge-jobs-framework,app.kubernetes.io/name=bot,toolforge=tool
Events:
  Type     Reason            Age                      From               Message
  ----     ------            ----                     ----               -------
  Warning  FailedScheduling  7m42s (x222 over 110m)   default-scheduler  0/81 nodes are available: 1 node(s) were unschedulable, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.
  Warning  FailedScheduling  2m40s (x529 over 3h22m)  default-scheduler  0/81 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.

The resource config has not changed in more than 6 months; the last deployment was a day ago (a code change only).

It appears that the resources available in the Toolforge cluster are no longer enough for these long-standing jobs, which is causing an outage outside of the maintainer's control.
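To make the scheduler message easier to read, here is a minimal sketch that tallies the rejection reasons per node. The `MESSAGE` string is copied from the first FailedScheduling event above; the function name `tally_reasons` and the parsing approach are my own illustration, not part of any Kubernetes or Toolforge tooling.

```python
import re

# First sentence of the FailedScheduling message quoted in this report.
MESSAGE = (
    "0/81 nodes are available: 1 node(s) were unschedulable, "
    "2 Insufficient memory, "
    "3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, "
    "3 node(s) had untolerated taint {toolforge.org/gateway: true}, "
    "74 Insufficient cpu."
)


def tally_reasons(message: str) -> dict:
    """Map each rejection reason to the number of nodes that reported it."""
    # Drop the "0/81 nodes are available" prefix and the trailing period.
    _, _, body = message.partition(": ")
    body = body.rstrip(".")
    reasons = {}
    # Reasons are separated by ", " followed by a node count. The taint text
    # contains ": " but never ", <digit>", so this split works for this message.
    for part in re.split(r", (?=\d+ )", body):
        count, _, reason = part.partition(" ")
        reasons[reason] = int(count)
    return reasons


if __name__ == "__main__":
    # Print reasons from most to least frequent.
    for reason, count in sorted(tally_reasons(MESSAGE).items(), key=lambda kv: -kv[1]):
        print(f"{count:3d}  {reason}")
```

On this message, "Insufficient cpu" accounts for 74 of the 81 nodes. The counts add up to 83 rather than 81, which is consistent with the two "Insufficient memory" nodes also being counted under "Insufficient cpu", though that is an inference from the numbers rather than something the output states.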

Flushing all jobs and re-creating them does not help.

tools.cluebotng@tools-bastion-15:~$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
bot-5b5d5f7f7b-csdrn                0/1     Pending   0          25s
core-6db4ddb6cf-k5smt               1/1     Running   0          23s
grafana-alloy-645798f6bf-h9dz2      1/1     Running   0          21s
irc-relay-7c7c4fdfd-kkgx2           1/1     Running   0          19s
pushgateway-54fc8d676c-tpg7j        1/1     Running   0          16s
redis-854fc8fb77-zr2bl              1/1     Running   0          14s
report-interface-85f4f9b766-2wzrd   1/1     Running   0          12s
report-interface-85f4f9b766-964bg   1/1     Running   0          12s
tools.cluebotng@tools-bastion-15:~$ kubectl describe pod bot-5b5d5f7f7b-csdrn 
Name:             bot-5b5d5f7f7b-csdrn
Namespace:        tool-cluebotng
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/component=deployments
                  app.kubernetes.io/created-by=cluebotng
                  app.kubernetes.io/managed-by=toolforge-jobs-framework
                  app.kubernetes.io/name=bot
                  app.kubernetes.io/version=2
                  jobs.toolforge.org/emails=none
                  pod-template-hash=5b5d5f7f7b
                  toolforge=tool
                  toolforge.org/mount-storage=none
Annotations:      <none>
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/bot-5b5d5f7f7b
Containers:
  job:
    Image:      tools-harbor.wmcloud.org/tool-cluebotng/bot:latest@sha256:efe6896abbd4ce8b2e060e088857fa7ea247a0a289b526329d48d8f6ee6f7b31
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      --
      launcher run-cbng
    Limits:
      cpu:     3
      memory:  2Gi
    Requests:
      cpu:     3
      memory:  1073741824
    Liveness:  exec [/bin/sh -c health-check] delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:   exec [/bin/sh -c health-check] delay=0s timeout=5s period=1s #success=1 #failure=120
    Environment:
      NO_HOME:                                   a buildservice pod does not need a home env
      CBNG_BOT_MYSQL_CREDENTIALS:                <set to the key 'CBNG_BOT_MYSQL_CREDENTIALS' in secret 'toolforge.envvar.v1.cbng-bot-mysql-credentials'>                              Optional: false
      CBNG_BOT_PASSWORD:                         <set to the key 'CBNG_BOT_PASSWORD' in secret 'toolforge.envvar.v1.cbng-bot-password'>                                                Optional: false
      CBNG_REPORT_OAUTH_KEY:                     <set to the key 'CBNG_REPORT_OAUTH_KEY' in secret 'toolforge.envvar.v1.cbng-report-oauth-key'>                                        Optional: false
      CBNG_REPORT_OAUTH_SECRET:                  <set to the key 'CBNG_REPORT_OAUTH_SECRET' in secret 'toolforge.envvar.v1.cbng-report-oauth-secret'>                                  Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_CHANNELS:   <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_CHANNELS' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-channels'>    Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_NICK:       <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_NICK' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-nick'>            Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_PORT:       <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_PORT' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-port'>            Optional: false
      IRC_RELAY_SENDER_HUGGLE_CLIENT_SERVER:     <set to the key 'IRC_RELAY_SENDER_HUGGLE_CLIENT_SERVER' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-client-server'>        Optional: false
      IRC_RELAY_SENDER_HUGGLE_RECEIVER:          <set to the key 'IRC_RELAY_SENDER_HUGGLE_RECEIVER' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-receiver'>                  Optional: false
      IRC_RELAY_SENDER_HUGGLE_THROTTLER_CONFIG:  <set to the key 'IRC_RELAY_SENDER_HUGGLE_THROTTLER_CONFIG' in secret 'toolforge.envvar.v1.irc-relay-sender-huggle-throttler-config'>  Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_CHANNELS:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_CHANNELS' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-channels'>        Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_NICK:         <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_NICK' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-nick'>                Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_PASSWORD:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_PASSWORD' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-password'>        Optional: false
      IRC_RELAY_SENDER_MAIN_CLIENT_USERNAME:     <set to the key 'IRC_RELAY_SENDER_MAIN_CLIENT_USERNAME' in secret 'toolforge.envvar.v1.irc-relay-sender-main-client-username'>        Optional: false
      IRC_RELAY_SENDER_MAIN_THROTTLER_CONFIG:    <set to the key 'IRC_RELAY_SENDER_MAIN_THROTTLER_CONFIG' in secret 'toolforge.envvar.v1.irc-relay-sender-main-throttler-config'>      Optional: false
      REDIS_PASSWORD:                            <set to the key 'REDIS_PASSWORD' in secret 'toolforge.envvar.v1.redis-password'>                                                      Optional: false
      TOOL_DEPLOY_TOKEN:                         <set to the key 'TOOL_DEPLOY_TOKEN' in secret 'toolforge.envvar.v1.tool-deploy-token'>                                                Optional: false
      TOOL_REPLICA_PASSWORD:                     <set to the key 'TOOL_REPLICA_PASSWORD' in secret 'toolforge.envvar.v1.tool-replica-password'>                                        Optional: false
      TOOL_REPLICA_USER:                         <set to the key 'TOOL_REPLICA_USER' in secret 'toolforge.envvar.v1.tool-replica-user'>                                                Optional: false
      TOOL_TOOLSDB_PASSWORD:                     <set to the key 'TOOL_TOOLSDB_PASSWORD' in secret 'toolforge.envvar.v1.tool-toolsdb-password'>                                        Optional: false
      TOOL_TOOLSDB_SCHEMA:                       <set to the key 'TOOL_TOOLSDB_SCHEMA' in secret 'toolforge.envvar.v1.tool-toolsdb-schema'>                                            Optional: false
      TOOL_TOOLSDB_USER:                         <set to the key 'TOOL_TOOLSDB_USER' in secret 'toolforge.envvar.v1.tool-toolsdb-user'>                                                Optional: false
      TOOL_TOOLFORGE_API_URL:                    https://api.svc.tools.eqiad1.wikimedia.cloud:30003
      TOOL_REDIS_URI:                            redis://redis.svc.tools.eqiad1.wikimedia.cloud:6379
      TOOL_ELASTICSEARCH_URL:                    http://elasticsearch.svc.tools.eqiad1.wikimedia.cloud:80
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-798nh (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-798nh:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/created-by=cluebotng,app.kubernetes.io/managed-by=toolforge-jobs-framework,app.kubernetes.io/name=bot,toolforge=tool
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  18s               default-scheduler  0/81 nodes are available: 1 node(s) were unschedulable, 2 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.
  Warning  FailedScheduling  8s (x5 over 44s)  default-scheduler  0/81 nodes are available: 1 node(s) were unschedulable, 3 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) had untolerated taint {toolforge.org/gateway: true}, 74 Insufficient cpu. preemption: 0/81 nodes are available: 7 Preemption is not helpful for scheduling, 74 No preemption victims found for incoming pod.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Reduced the CPU request in https://github.com/cluebotng/component-configs/commit/874d7f6f407fc9a3995f52f40312cd7d3a712176.

tools.cluebotng@tools-bastion-15:~$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
bot-6997ff8d76-jr4wg                1/1     Running   0          20s
core-6db4ddb6cf-k5smt               1/1     Running   0          4m20s
grafana-alloy-645798f6bf-h9dz2      1/1     Running   0          4m18s
irc-relay-7c7c4fdfd-kkgx2           1/1     Running   0          4m16s
pushgateway-54fc8d676c-tpg7j        1/1     Running   0          4m13s
redis-854fc8fb77-zr2bl              1/1     Running   0          4m11s
report-interface-85f4f9b766-2wzrd   1/1     Running   0          4m9s
report-interface-85f4f9b766-964bg   1/1     Running   0          4m9s

The previous configuration had been in place for months and should not randomly break, so I am leaving this ticket open.
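For context on why a lower request can unblock the pod: the scheduler places pods based on their resource requests, comparing each request against a node's allocatable capacity minus what other pods have already requested. A rough sketch of that check follows. The node figures are hypothetical (node capacity is not shown anywhere in this ticket), and `fits` is an illustrative helper, not a Kubernetes API.

```python
def fits(request_millicpu: int, allocatable_millicpu: int, requested_millicpu: int) -> bool:
    """Return True if a pod's CPU request fits in the node's unreserved CPU.

    All values are in millicores (1000 = one CPU).
    """
    return request_millicpu <= allocatable_millicpu - requested_millicpu


# Hypothetical node: 8 CPUs allocatable, 5.5 CPUs already requested by other pods.
ALLOCATABLE = 8000
ALREADY_REQUESTED = 5500

if __name__ == "__main__":
    # cpu: 3 is the request shown in the describe output above.
    print("3 CPU request fits:", fits(3000, ALLOCATABLE, ALREADY_REQUESTED))
    # A hypothetical lower request; not necessarily the value used in the commit.
    print("2 CPU request fits:", fits(2000, ALLOCATABLE, ALREADY_REQUESTED))
```

With these assumed numbers the node has 2.5 CPUs unreserved, so the 3 CPU request is rejected while a 2 CPU request is accepted, even if actual CPU usage on the node is low.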

This might be fixed by T420565, which is being deployed today / next week. @DamianZaremba you might want to try reverting your change above and see if it still works. Otherwise, if it's still an issue, we can look into it deeper.