Request increased quota for anomiebot Toolforge tool
Closed, Resolved (Public)

Description

Tool Name: anomiebot
Quota increase requested: Maybe +5 pods and +2 CPU? Details below.
Reason: T319557: Migrate anomiebot from Toolforge GridEngine to Toolforge Kubernetes


Since T319557 asks me to migrate AnomieBOT to Kubernetes, I looked at what the quotas are versus AnomieBOT's current usage on GridEngine. That current usage is:

ID       Bot                 State    CPU         VMem    Peak    Max     %      Queue                                                                
-------  ------------------  -------  ----------  ------  ------  ------  -----  ---------------------------------------------------------------------
6020222  AnomieBOT-2         running    30:12:10   86.2M   95.2M  350.0M  24.6%  continuous@tools-sgeexec-10-19.tools.eqiad1.wikimedia.cloud          
6020223  AnomieBOT-3         running    51:03:51  138.0M  158.5M  350.0M  39.4%  continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud           
6020224  AnomieBOT-4         running  1028:38:41  120.0M  128.8M  512.0M  23.4%  continuous@tools-sgeexec-10-19.tools.eqiad1.wikimedia.cloud          
6020225  AnomieBOT-5         running     2:29:25  123.1M  140.3M  256.0M  48.1%  continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud           
1237289  AnomieBOT-7         running    12:23:31   94.0M  104.6M  256.0M  36.7%  continuous@tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud          
6017869  AnomieBOT-200       running     0:10:42   81.1M   81.6M  256.0M  31.7%  continuous@tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud          
6017870  AnomieBOT-500       running     0:15:11   80.3M   81.4M  256.0M  31.4%  continuous@tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud          
6017871  AnomieBOT-501       running     1:06:08   80.1M   80.8M  256.0M  31.3%  continuous@tools-sgeexec-10-8.tools.eqiad1.wikimedia.cloud           
6017872  AnomieBOT-999       running    16:24:00  235.0M  425.8M  512.0M  45.9%  continuous@tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud          
9999436  lighttpd-anomiebot  running     0:20:05  175.2M  316.2M    4.0G   4.3%  webgrid-lighttpd@tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud

As I understand it, each of the 10 jobs there would be a "pod" in Kubernetes, and the default quota is 10 pods. That doesn't leave any overhead for, e.g., the daily cron task that sends me a status email and the AnomieBOT-1 job used to run on-demand tasks.

The default 8Gi quota for memory seems like it should be fine; AnomieBOT doesn't use a lot, especially if I can turn the webserver's request down when switching it to Kubernetes.
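
As a rough check of that memory estimate, the sketch below totals the Peak and Max columns from the listing above and compares them with the default 8Gi quota. It assumes the GridEngine "M" and "G" suffixes can be read as MiB and GiB, and that each job's current Max would carry over unchanged as its Kubernetes limit; neither is confirmed here.

```python
# (job, peak, max) copied from the GridEngine listing above.
# Assumption: "M" is read as MiB and "4.0G" as 4096 MiB.
jobs = [
    ("AnomieBOT-2", 95.2, 350.0),
    ("AnomieBOT-3", 158.5, 350.0),
    ("AnomieBOT-4", 128.8, 512.0),
    ("AnomieBOT-5", 140.3, 256.0),
    ("AnomieBOT-7", 104.6, 256.0),
    ("AnomieBOT-200", 81.6, 256.0),
    ("AnomieBOT-500", 81.4, 256.0),
    ("AnomieBOT-501", 80.8, 256.0),
    ("AnomieBOT-999", 425.8, 512.0),
    ("lighttpd-anomiebot", 316.2, 4096.0),
]

QUOTA_MIB = 8 * 1024  # default 8Gi memory quota

total_peak = sum(peak for _, peak, _ in jobs)
total_max = sum(limit for _, _, limit in jobs)

print(f"sum of observed peaks: {total_peak:.1f} MiB")
print(f"sum of current limits: {total_max:.1f} MiB (quota {QUOTA_MIB} MiB)")
print(f"fits within quota: {total_max <= QUOTA_MIB}")
```

Under those assumptions the current limits already fit within the quota, and more than half of the total is the webserver's 4.0G, which is why lowering its request would free the most room.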

As for CPU, that's where I could really use some advice. AnomieBOT-4 clearly does the most processing and could probably use 1 full CPU. The rest should be fine with fractions, although it seems likely that 1/9 each would be low. The +2 requested would be enough for 1/4 each plus 3/4 left over for overhead, but I'd be happy to have more. If you can point me at monitoring (Grafana?), that would also help me balance the allocations once I start switching over.
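
The fractions above can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning: it takes the default CPU quota to be 2 (the value shown in the quota change logged later on this task), gives AnomieBOT-4 one full CPU, and splits the rest across the other 9 jobs in the listing. The variable names are illustrative.

```python
from fractions import Fraction

DEFAULT_CPU = 2               # default quota, per the "2 cpu -> 4" change on this task
REQUESTED_EXTRA = 2           # the "+2 CPU" asked for in the description
TOTAL_JOBS = 10               # jobs in the GridEngine listing, webserver included
HEAVY_JOB_CPU = Fraction(1)   # AnomieBOT-4 gets one full CPU

other_jobs = TOTAL_JOBS - 1

# With the default quota, what is left after AnomieBOT-4 is shared 9 ways.
share_at_default = (DEFAULT_CPU - HEAVY_JOB_CPU) / other_jobs

# With the increase, give each of the other jobs 1/4 CPU and see what remains.
per_job = Fraction(1, 4)
new_quota = DEFAULT_CPU + REQUESTED_EXTRA
leftover = new_quota - HEAVY_JOB_CPU - per_job * other_jobs

print(f"share per job at default quota: {share_at_default}")
print(f"left over at {new_quota} CPU with {per_job} per job: {leftover}")
```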

For background, AnomieBOT currently runs 40 separate tasks for enwiki. Rather than having 40 separate jobs, most usually idle but potentially being a thundering herd if they all wake at once, the tasks are divided among a small number of "runners" that execute tasks in series.

  • AnomieBOT-1 runs on-demand tasks, if someone asks me to run one.
  • AnomieBOT-2 runs 12 different clerking tasks that generally want to run at hourly, 4-hourly, or 6-hourly intervals.
  • AnomieBOT-3 runs 14 continuous but not particularly time-sensitive tasks.
  • AnomieBOT-4 runs a CPU-intensive task that runs pretty much continuously.
  • AnomieBOT-5 runs a task which runs infrequently but is IO-bound when it runs, so I put it on a separate runner to avoid blocking tasks on -2 or -3.
  • AnomieBOT-6 doesn't do anything right now; the task it used to run was discontinued.
  • AnomieBOT-7 runs an IO-bound task that runs fairly continuously.
  • AnomieBOT-200 runs 2 tasks that use the AnomieBOT II account (which has the templateeditor group).
  • AnomieBOT-500 runs 2 tasks that use the AnomieBOT III account (which is an adminbot).
  • AnomieBOT-501 runs a task using the AnomieBOT III account that needs particularly low latency.
  • AnomieBOT-999 runs 6 tasks that operate under a "does not need specific approval" clause of enwiki's bot policy.
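
As a cross-check, the per-runner counts in the list above can be tallied against the stated total of 40 tasks and the 9 AnomieBOT jobs in the GridEngine listing. The sketch assumes that on-demand tasks on AnomieBOT-1 are not counted among the 40 and that each "a task" in the list means exactly one.

```python
# Task counts per runner, taken from the list above.
# Assumption: AnomieBOT-1 (on-demand) and AnomieBOT-6 (idle) contribute 0.
tasks_per_runner = {
    "AnomieBOT-1": 0,
    "AnomieBOT-2": 12,
    "AnomieBOT-3": 14,
    "AnomieBOT-4": 1,
    "AnomieBOT-5": 1,
    "AnomieBOT-6": 0,
    "AnomieBOT-7": 1,
    "AnomieBOT-200": 2,
    "AnomieBOT-500": 2,
    "AnomieBOT-501": 1,
    "AnomieBOT-999": 6,
}

total_tasks = sum(tasks_per_runner.values())
active_runners = [name for name, count in tasks_per_runner.items() if count > 0]

print(f"total tasks: {total_tasks}")
print(f"runners with scheduled tasks: {len(active_runners)}")
```

Both figures agree with the description: the counts add up to 40, and the 9 runners with scheduled tasks are the 9 AnomieBOT jobs shown as running.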

Event Timeline

JJMC89 renamed this task from Request increased quota for <Replace Me> Toolforge tool to Request increased quota for anomiebot Toolforge tool. Oct 14 2022, 5:41 PM

Hi.

As I understand it, each of the 10 jobs there would be a "pod" in Kubernetes, and the default quota is 10 pods. That doesn't leave any overhead for, e.g., the daily cron task that sends me a status email and the AnomieBOT-1 job used to run on-demand tasks.

Correct. Your proposed increase of +5 seems fine to me.

As for CPU, that's where I could really use some advice. AnomieBOT-4 clearly does the most processing and could probably use 1 full CPU. The rest should be fine with fractions, although it seems likely that 1/9 each would be low. The +2 requested would be enough for 1/4 each plus 3/4 left over for overhead, but I'd be happy to have more. If you can point me at monitoring (Grafana?), that would also help me balance the allocations once I start switching over.

Picking a good CPU quota value has always felt like guessing to me. In general what you are proposing seems fine as a start, and if that's not enough for some of the tasks I'm fine with granting more quota to a large, well-established bot like this one.

Mentioned in SAL (#wikimedia-cloud) [2022-10-15T17:58:19Z] <taavi> increase k8s quotas: 2 cpu -> 4, 10 pods -> 15 # T320830

And applied those changes. @Anomie I'm leaving this task open for now, please check if the current limits are enough and close if so.

Thanks. It may be a bit before I can check, as I'll also need T320824 to begin.

https://k8s-status.toolforge.org/namespaces/tool-anomiebot/ has quota information and a link to Grafana.

Thanks for the info. Is there a way to get Grafana to show CPU usage per pod, so I could see how often a pod is reaching the limits?

Not that I am aware of, but maybe one of the Toolforge admins knows better or can create one.

Seems like everything here is now complete. Please re-open if this is not the case.