
Request increased quota for pageviews Toolforge tool
Closed, Resolved · Public

Description

Tool Name: pageviews
Quota increase requested: +6 pods
Reason: We appear to have exhausted our 4-pod limit. See comments at T301649 for the research that led to this conclusion.

This chart is very telling: https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=4w&g0.end_input=2022-02-16%2011%3A52&g0.expr=sum(nginx_ingress_controller_request_duration_seconds_sum%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)%20%2F%20sum(nginx_ingress_controller_request_duration_seconds_count%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)&g0.tab=0

As far back as the data goes we've had no problems, except for a few days of 499s starting Feb 3 and then the big, ever-growing jump since Feb 10; you can see it go down after I increased the Kubernetes pod replica count to 4. I'd like to increase that further, but 4 appears to be the maximum.

This is likely due to an increase in automated traffic (see also T226688: Block web crawlers from accessing Cloud Services). Even that shouldn't normally be a problem, since most bots seem to be headless or don't make XMLHttpRequests, but I could be wrong about that. At any rate, getting our pod count increased would seemingly help. If it means anything, Pageviews and its sister apps were once 8 separate tools, which was a nightmare to maintain. Had they stayed that way, we would have 8 * 4 = 32 pods total, so I hope asking for +6 to bring us to 10 is not too much. Assuming these bots are good-faith, we are only serving them cheaply fetched and cached data; upstream services like the DB replicas aren't put under much stress. It's just the pods that can't handle it.

Thank you as always!

Event Timeline

From @AntiCompositeNumber on IRC:

00:26:07 <AntiComposite> https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?viewPanel=1&orgId=1&var-namespace=tool-pageviews&refresh=5m it doesn't look like you're actually using all that much CPU, so you could try decreasing the per-pod allocations to see if you can squeeze more in

Perhaps I'm doing this wrong:

> webservice start --backend=kubernetes --replicas=6 --cpu=6
Starting webservice...............
> kubectl get pods
No resources found in tool-pageviews namespace.

I tried several other variations, such as a low value for --cpu in case it applies to each replica, and the same with --mem. I could not find a working combination other than the status quo of --replicas=4.

I see these webservice options are plainly documented on Wikitech; I just didn't realize what I needed was there, hehe. Anyway, 8-10 pods each with low RAM/CPU, or some combination thereof, should be sufficient, if we can get it to work. Thank you for your help!
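
For reference, this is roughly the invocation I have in mind (just a sketch, assuming --cpu and --mem apply per replica and that --mem accepts Kubernetes-style quantities like 512Mi):

# 8 small pods instead of 4 default-sized ones (values are illustrative)
webservice start --backend=kubernetes --replicas=8 --cpu=250m --mem=512Mi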

Name:                   tool-pageviews
Resource                Used  Hard
--------                ----  ----
configmaps              2     10
count/cronjobs.batch    0     50
count/deployments.apps  1     3
count/jobs.batch        0     15
limits.cpu              2     2
limits.memory           2Gi   8Gi
persistentvolumeclaims  0     3
pods                    4     10
replicationcontrollers  0     1
requests.cpu            600m  2
requests.memory         1Gi   6Gi
secrets                 1     10
services                1     1
services.nodeports      0     0

I think you're being limited not by the cpu request, but by the cpu limit.

Try:

webservice start --backend=kubernetes --replicas=6 --cpu=200m

(Does webservice not have an update action?)
That should give you 6 pods with 0.2 CPU each. You could increase that a bit, up to the 2 CPU limit, if you feel they need more per pod. As for the pods themselves, I believe they are limited to 10 by default.
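
Spelling out the arithmetic against the resourcequota output above (assuming the default per-pod CPU limit is 0.5, which is what 4 pods consuming limits.cpu = 2 suggests):

4 pods  x 0.5 CPU (default limit)  = 2.0 CPU  -> limits.cpu quota (2) exhausted
6 pods  x 0.2 CPU (--cpu=200m)     = 1.2 CPU  -> fits comfortably under the quota
10 pods x 0.2 CPU                  = 2.0 CPU  -> the most that fits without a quota bump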

In T301844#7714497, @mdipietro wrote:

Try:

webservice start --backend=kubernetes --replicas=6 --cpu=200m

That should give you 6 pods with 0.2 CPU each. You could increase that a bit, up to the 2 CPU limit, if you feel they need more per pod. As for the pods themselves, I believe they are limited to 10 by default.

Thanks. Unfortunately I still have the same issue:

> webservice start --backend=kubernetes --replicas=6 --cpu=200m
Starting webservice...............
> kubectl get pods
No resources found in tool-pageviews namespace.

hrm, works for me

tools.anticompositetest@tools-sgebastion-07:~$ webservice --cpu 200m --replicas 8 --backend=kubernetes python3.9 start
Starting webservice.........
tools.anticompositetest@tools-sgebastion-07:~$ kubectl get all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/anticompositetest-d686cb4dc-9scv5   1/1     Running   0          3m17s
pod/anticompositetest-d686cb4dc-c97kq   1/1     Running   0          3m17s
pod/anticompositetest-d686cb4dc-gldqp   1/1     Running   0          3m19s
pod/anticompositetest-d686cb4dc-hf2jm   1/1     Running   0          3m17s
pod/anticompositetest-d686cb4dc-lsqt2   1/1     Running   0          3m17s
pod/anticompositetest-d686cb4dc-n7jqd   1/1     Running   0          3m17s
pod/anticompositetest-d686cb4dc-pvqmh   1/1     Running   0          3m18s
pod/anticompositetest-d686cb4dc-x67pz   1/1     Running   0          3m18s

NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
service/anticompositetest   ClusterIP   10.98.35.39   <none>        8000/TCP   3m19s

NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/anticompositetest   8/8     8            8           3m19s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/anticompositetest-d686cb4dc   8         8         8       3m19s
tools.anticompositetest@tools-sgebastion-07:~$ kubectl describe resourcequota
Name:                   tool-anticompositetest
Namespace:              tool-anticompositetest
Resource                Used   Hard
--------                ----   ----
configmaps              2      10
count/cronjobs.batch    0      50
count/deployments.apps  1      3
count/jobs.batch        0      15
limits.cpu              1600m  2
limits.memory           4Gi    8Gi
persistentvolumeclaims  0      3
pods                    8      10
replicationcontrollers  0      1
requests.cpu            1600m  2
requests.memory         2Gi    6Gi
secrets                 1      10
services                1      1
services.nodeports      0      0

What does kubectl describe resourcequota say?

If you were hitting your quotas, it would create as many pods as it could and fail to create the rest. Can you post the output of kubectl get all and kubectl get events, please?
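
For completeness, these are the commands being asked for (the --sort-by flag is just a habit to make failure events easier to spot, not required):

kubectl get all
kubectl get events --sort-by=.metadata.creationTimestamp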

Reviewing kubectl get events, I can see what happened. It failed to create the pods because the current pods hadn't been removed yet, so it thought it had already hit the CPU limit. This is odd, because I can run webservice start --backend=kubernetes --replicas=4 immediately after webservice stop and it waits for the pods to terminate before creating new ones. With the --cpu flag it doesn't seem to wait, for whatever reason.
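
As a workaround (a sketch I haven't verified end to end), stopping the service and waiting for the old pods to actually disappear before restarting with the new flags should avoid that race:

webservice stop
kubectl get pods --watch   # wait here until the old pods are gone, then Ctrl-C
webservice start --backend=kubernetes --replicas=6 --cpu=200m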

Anyway, I successfully have 6 pods running now! Unfortunately, that didn't seem to resolve the issue :( I can continually reload https://pageviews.toolforge.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Cat|Dog, for instance, and every 5th-10th time the request to /pageviews/api.php hangs and is cancelled after ~8 seconds, leaving "Data unavailable" under "Revisions" on the right side.
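
For anyone who wants to reproduce this outside the browser, a loop like the following should surface the hangs (a sketch; the api.php query parameters are placeholders, not the exact request the frontend makes):

# hit the endpoint repeatedly and print status code plus response time
for i in $(seq 1 20); do
  curl -s -o /dev/null -m 10 -w '%{http_code} %{time_total}s\n' \
    'https://pageviews.toolforge.org/pageviews/api.php?project=en.wikipedia.org&pages=Cat%7CDog'
done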

Maybe more pods isn't the answer? I tried --replicas=10 --cpu=200m and still no dice :( The Prometheus graph is still showing a steady stream of 499s: https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1w&g0.end_input=2022-02-16%2011%3A52&g0.expr=sum(nginx_ingress_controller_request_duration_seconds_sum%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)%20%2F%20sum(nginx_ingress_controller_request_duration_seconds_count%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)&g0.tab=0

It definitely sounds like a quota increase won't solve this, since Pageviews isn't using that much anyway (and increasing the pod count didn't help). I'm happy to close this and move discussion back to T301649, if you'd like.

Thank you all so much for the help!