
quarry.wmcloud.org: "This web service cannot be reached" due to redis pod running out of disk space
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Go to https://quarry.wmcloud.org/ (any subpage also works).

What happens?:

You get an error message:

image.png (647×955 px, 39 KB)

What should have happened instead?:

Quarry should have loaded.

Other info:

The bottom of the error message says "proxy-03.project-proxy.eqiad1.wikimedia.cloud". Nothing visible in console.

I first got the error at about 14:45 UTC. I don't know how long it had been down; it was up yesterday, at any rate.

Event Timeline

Aklapper renamed this task from Quarry down? to quarry.wmcloud.org: "This web service cannot be reached".Apr 16 2025, 3:06 PM

From the logs:

redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist to disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
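For reference, the MISCONF state can be confirmed from inside the pod. A sketch, assuming `redis-cli` is available in the redis container image (pod name taken from the `df` paste below):

```shell
# Check whether the last RDB background save failed; a status of "err"
# matches the MISCONF error above. Assumes redis-cli is on PATH in the pod.
kubectl -n quarry exec pod/redis-676b955f95-nc6x2 -- \
    redis-cli INFO persistence | grep rdb_last_bgsave_status

# Stopgap only: re-enable writes without fixing the disk (risks losing
# unsaved data if the instance dies before a successful save).
# kubectl -n quarry exec pod/redis-676b955f95-nc6x2 -- \
#     redis-cli CONFIG SET stop-writes-on-bgsave-error no
```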
bd808 triaged this task as High priority.Apr 16 2025, 5:55 PM
bd808 added a project: cloud-services-team.

Redis RDB persistence is failing as the pod is out of disk space.

4134:C 16 Apr 2025 17:53:32.082 # Failed opening the temp RDB file temp-4134.rdb (in server root dir /data) for saving: No space left on device
sd@quarry-bastion:~$ kubectl -n quarry exec -it pod/redis-676b955f95-nc6x2 -- df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.9G     19.9G     32.0K 100% /
tmpfs                    64.0M         0     64.0M   0% /dev
/dev/vdb                 19.9G     19.9G     32.0K 100% /data
/dev/vda4                19.4G      6.4G     13.0G  33% /etc/hosts
/dev/vda4                19.4G      6.4G     13.0G  33% /dev/termination-log
/dev/vdb                 19.9G     19.9G     32.0K 100% /etc/hostname
/dev/vdb                 19.9G     19.9G     32.0K 100% /etc/resolv.conf
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     7.7G     12.0K      7.7G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     3.9G         0      3.9G   0% /proc/asound
tmpfs                     3.9G         0      3.9G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/latency_stats
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                     3.9G         0      3.9G   0% /proc/scsi
tmpfs                     3.9G         0      3.9G   0% /sys/firmware

I don't see any PersistentVolumes mounted to the redis pod, so I'm not sure how to debug further.
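One way to confirm where `/data` comes from is to dump the pod's declared volumes; a sketch using the pod name from the paste above (an `emptyDir` or node-local volume here would explain the absence of a PersistentVolume):

```shell
# List the volumes declared in the pod spec; requires jq.
kubectl -n quarry get pod redis-676b955f95-nc6x2 -o json \
    | jq '.spec.volumes'

# And check whether the namespace has any PVCs at all:
kubectl -n quarry get pvc
```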

Andrew claimed this task.
Andrew subscribed.

This seems to have been a disk space issue on one of the worker nodes. I rebooted both nodes, and then taavi killed existing pods (kubectl delete pod -n quarry --all) and everything recovered.

There are a few follow-ups, but the immediate issue is resolved.

Follow-ups: T392143 T392141 T392138

bd808 added subscribers: Liz, komla, dcaro.
bd808 subscribed.

Down again per T392169: [bug] Quarry queries don't run. Unfortunately, the magical restart fix did not hold.

bd808@laptop$ ssh root@quarry-bastion.quarry.eqiad1.wikimedia.cloud
root@quarry-bastion:~# export KUBECONFIG=/home/rook/quarry/tofu/kube.config
root@quarry-bastion:~# kubectl -n quarry get po
NAME                      READY   STATUS    RESTARTS       AGE
redis-676b955f95-tkbb7    1/1     Running   0              4h4m
web-d8d77d-2lzww          0/1     Running   2 (4h5m ago)   5h30m
web-d8d77d-57l8m          0/1     Running   0              4h4m
web-d8d77d-j7k8g          0/1     Running   0              4h4m
web-d8d77d-lwz79          0/1     Running   2 (4h5m ago)   5h30m
web-d8d77d-mt7gf          0/1     Running   2 (4h5m ago)   5h30m
web-d8d77d-n94p9          0/1     Running   0              4h4m
web-d8d77d-nnm5v          0/1     Running   2 (4h5m ago)   5h30m
web-d8d77d-qrrt6          0/1     Running   0              4h4m
worker-64cf9db7d8-p7vzf   1/1     Running   0              4h4m
worker-64cf9db7d8-xssxn   1/1     Running   0              5h30m
root@quarry-bastion:~# kubectl -n quarry exec -it pod/redis-676b955f95-tkbb7 -- df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.9G     19.9G     20.0K 100% /
tmpfs                    64.0M         0     64.0M   0% /dev
/dev/vdb                 19.9G     19.9G     20.0K 100% /data
/dev/vda4                19.4G      6.4G     13.0G  33% /etc/hosts
/dev/vda4                19.4G      6.4G     13.0G  33% /dev/termination-log
/dev/vdb                 19.9G     19.9G     20.0K 100% /etc/hostname
/dev/vdb                 19.9G     19.9G     20.0K 100% /etc/resolv.conf
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     7.7G     12.0K      7.7G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     3.9G         0      3.9G   0% /proc/asound
tmpfs                     3.9G         0      3.9G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/latency_stats
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                     3.9G         0      3.9G   0% /proc/scsi
tmpfs                     3.9G         0      3.9G   0% /sys/firmware

Mentioned in SAL (#wikimedia-cloud) [2025-04-17T00:12:02Z] <bd808> kubectl -n quarry delete pod/redis-676b955f95-tkbb7 (T392107)

Mentioned in SAL (#wikimedia-cloud) [2025-04-17T00:19:59Z] <bd808> kubectl delete pod -n quarry --all (T392107)

kubectl delete pod -n quarry --all seems to be the magic temporary fix. I'm going to assume that something out of the ordinary keeps getting started in the cluster and is trashing redis.

@Liz Things should be working again. I just ran a query myself on the system.

I am going to leave this task open for now since the last "victory" only lasted about 4.5 hours.

The redis pod's storage looks good following the pod restart:

root@quarry-bastion:~# kubectl -n quarry exec -it pod/redis-676b955f95-gn4d9 -- df -h /
Filesystem                Size      Used Available Use% Mounted on
overlay                  19.9G      1.5G     18.4G   8% /
bd808 renamed this task from quarry.wmcloud.org: "This web service cannot be reached" to quarry.wmcloud.org: "This web service cannot be reached" due to redis pod running out of disk space.Apr 17 2025, 12:36 AM

I think this may be resolved.

Thank you, much appreciated. This situation seems to happen every 6-9 months. But, luckily, not too often. Thanks again.

Pods are failing to run on node1 again; it seems to have run out of inodes:

dcaro@quarry-bastion:~$ kubectl get pods -A -o wide
NAMESPACE     NAME                                         READY   STATUS              RESTARTS         AGE     IP              NODE                                NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-59df77fcd4-6ch5t     1/1     Running             4 (13d ago)      233d    10.100.104.72   quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   calico-node-29q7q                            1/1     Running             2                198d    172.16.2.212    quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
kube-system   calico-node-bqtf6                            1/1     Running             1 (85d ago)      233d    172.16.2.84     quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   calico-node-x4bgt                            1/1     Running             2 (12h ago)      198d    172.16.0.235    quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
kube-system   coredns-5ff97bd88-2qdtc                      1/1     Running             1 (85d ago)      233d    10.100.104.73   quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   coredns-5ff97bd88-db26s                      1/1     Running             1 (85d ago)      233d    10.100.104.70   quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   dashboard-metrics-scraper-5b9cf67c69-425ph   1/1     Running             1 (85d ago)      233d    10.100.104.71   quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   k8s-keystone-auth-4cnqk                      1/1     Running             2 (85d ago)      233d    172.16.2.84     quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   kube-dns-autoscaler-6c8ccf9bb4-4ckmz         1/1     Running             1                4d21h   10.100.80.103   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
kube-system   kubernetes-dashboard-dd5c86b8f-gxwqm         1/1     Running             6 (13d ago)      233d    10.100.104.69   quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
kube-system   magnum-metrics-server-8457b5867c-vpnvb       1/1     Running             1                4d21h   10.100.80.94    quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
kube-system   npd-9h6s6                                    1/1     Running             2                198d    10.100.80.101   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
kube-system   npd-9mdsx                                    1/1     Running             2 (12h ago)      198d    10.100.207.51   quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
kube-system   openstack-cloud-controller-manager-68bmf     1/1     Running             126 (3d5h ago)   233d    172.16.2.84     quarry-127a-g4ndvpkr5sro-master-0   <none>           <none>
quarry        redis-676b955f95-gn4d9                       1/1     Running             0                6h50m   10.100.207.2    quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        web-d8d77d-2lzww                             1/1     Terminating         2 (11h ago)      12h     10.100.80.92    quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-57l8m                             1/1     Terminating         0                11h     10.100.80.116   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-b6gqw                             0/1     ContainerCreating   0                6h50m   <none>          quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-g8wvq                             1/1     Running             0                6h50m   10.100.207.8    quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        web-d8d77d-j7k8g                             1/1     Terminating         0                11h     10.100.80.112   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-lwz79                             1/1     Terminating         2 (11h ago)      12h     10.100.80.113   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-mt7gf                             1/1     Terminating         2 (11h ago)      12h     10.100.80.96    quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-n94p9                             1/1     Terminating         0                11h     10.100.80.115   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-nnm5v                             1/1     Terminating         2 (11h ago)      12h     10.100.80.100   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-pmknq                             1/1     Running             0                6h50m   10.100.207.16   quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        web-d8d77d-qrrt6                             1/1     Terminating         0                11h     10.100.80.106   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-r455p                             1/1     Running             0                6h50m   10.100.207.3    quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        web-d8d77d-rsr9m                             0/1     ContainerCreating   0                6h50m   <none>          quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-s7kbq                             0/1     ContainerCreating   0                6h50m   <none>          quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        web-d8d77d-wmkcm                             1/1     Running             0                6h50m   10.100.207.46   quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        web-d8d77d-ztzqr                             1/1     Running             0                6h50m   10.100.207.58   quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        worker-64cf9db7d8-55vwd                      1/1     Running             0                6h50m   10.100.207.5    quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        worker-64cf9db7d8-fmfq8                      1/1     Running             0                6h50m   10.100.207.62   quarry-127a-g4ndvpkr5sro-node-0     <none>           <none>
quarry        worker-64cf9db7d8-p7vzf                      1/1     Terminating         0                11h     10.100.80.97    quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
quarry        worker-64cf9db7d8-xssxn                      1/1     Terminating         0                12h     10.100.80.102   quarry-127a-g4ndvpkr5sro-node-1     <none>           <none>
dcaro@quarry-bastion:~$ kubectl get --raw "/api/v1/nodes/quarry-127a-g4ndvpkr5sro-node-1/proxy/stats/summary" | jq '.node.runtime' 
{
  "imageFs": {
    "time": "2025-04-17T07:08:13Z",
    "availableBytes": 20480,
    "capacityBytes": 21407727616,
    "usedBytes": 21552541696,
    "inodesFree": 87,
    "inodes": 49800,
    "inodesUsed": 49206
  }
}

Note that the fs section does not report the inode issue:

dcaro@quarry-bastion:~$ kubectl get --raw "/api/v1/nodes/quarry-127a-g4ndvpkr5sro-node-1/proxy/stats/summary" | jq '.node.fs' 
  "fs": {
    "time": "2025-04-17T07:11:03Z",
    "availableBytes": 13907382272,
    "capacityBytes": 20869787648,
    "usedBytes": 6962405376,
    "inodesFree": 10115146,
    "inodes": 10223040,
    "inodesUsed": 107894
  },