
Request increased quota for spacemedia Toolforge tool
Closed, ResolvedPublic

Description

Tool Name: spacemedia
Type of quota increase requested: RAM per pod beyond the 4Gi hard limit; ideally the largest size that can be allocated
Reason: T251026: Detect duplicate images via perceptual hashes before upload; see T230284#6087891 for additional information


I have started to deploy my new "spacemedia" tool and just hit the memory limit:

2019-08-10 22:05:57.076 ERROR 9 --- [pool-2-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task.

java.lang.OutOfMemoryError: Java heap space
        at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92) ~[na:1.8.0_212]
        at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:445) ~[na:1.8.0_212]
        at java.awt.image.Raster.createWritableRaster(Raster.java:941) ~[na:1.8.0_212]
        at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074) ~[na:1.8.0_212]
        at javax.imageio.ImageReader.getDestination(ImageReader.java:2892) ~[na:1.8.0_212]
        at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1082) ~[na:1.8.0_212]
        at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050) ~[na:1.8.0_212]
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:78) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:100) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.service.EsaService.updateMissingImages(EsaService.java:291) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.service.EsaService.updateImages(EsaService.java:314) ~[classes!/:0.0.1-SNAPSHOT]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

Can you please increase it? The tool computes SHA-1 hashes of free media released by space agencies in order to detect those missing on Wikimedia Commons. For large images, more memory is needed.

Event Timeline

Is this running on Kubernetes or on Grid Engine?

It's running on Kubernetes / jdk8.

This tool has a 4G limit. We do not currently have a method for raising the memory limit for webservices running on the Toolforge Kubernetes cluster (T183436: Add memory limit configuration for Kubernetes pods).

I have not looked deeply at the source code for the tool, but the stack trace shown seems to imply that the SHA-1 hash is computed using a java.awt.image.BufferedImage object. This seems like a really heavy way to compute a file's SHA-1 value. Is there maybe a technique you could use to compute the SHA-1 from a stream of bytes without keeping the entire decoded image in RAM?
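
For reference, hashing from a byte stream only needs a small, fixed-size buffer, so heap usage stays constant regardless of file size. A minimal sketch of that approach (class and method names here are illustrative, not the tool's actual Utils code, which is not shown in this task):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingSha1 {

    /**
     * Computes the SHA-1 of a remote file by streaming it through a small
     * buffer instead of decoding it as an image.
     */
    public static String sha1(URL url) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new DigestInputStream(url.openStream(), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}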

My bad. I wrote this code some weeks ago and did not remember that the image-loading code was there to ensure the files are indeed valid images (to avoid uploading corrupted files to Wikimedia Commons). Some ESA files are corrupted. The SHA-1 hashing code does not require reading the image.
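
For what it's worth, a header-level check can reject files that ImageIO does not recognise without allocating the full raster, although it will not catch every kind of corruption (truncated pixel data, for instance, only shows up during a full decode). A hypothetical sketch, with illustrative names:

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageHeaderCheck {

    /**
     * Returns true if ImageIO recognises the format and can read the image
     * dimensions. Much cheaper than a full decode, but only a full decode
     * detects corruption in the pixel data itself.
     */
    public static boolean looksLikeAnImage(File file) {
        try (ImageInputStream iis = ImageIO.createImageInputStream(file)) {
            if (iis == null) {
                return false;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return false;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis);
                return reader.getWidth(0) > 0 && reader.getHeight(0) > 0;
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            return false;
        }
    }
}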

This is not a blocking issue, as my tool seems robust to it: a scheduled task that fails because of a memory error will likely succeed on the next try. So I can wait for the other task to be completed :)

After some weeks of runtime it appears my tool was able to handle all files, so I don't really need more memory; I am cancelling this request.

It seems that since T234702 was done, the default memory settings have been considerably lowered. My tool crashes right after startup, and I can see the following using kubectl describe pods:

Name:           spacemedia-597444c5db-fc6mg
Namespace:      tool-spacemedia
Priority:       0
Labels:         name=spacemedia
                toolforge=tool
                tools.wmflabs.org/webservice=true
                tools.wmflabs.org/webservice-version=1
Annotations:
                kubernetes.io/limit-ranger: LimitRanger plugin set: cpu, memory request for container webservice; cpu, memory limit for container webservice
Containers:
  webservice:
    Image:         docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-web:latest
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sat, 11 Apr 2020 12:36:18 +0000
      Finished:     Sat, 11 Apr 2020 12:41:52 +0000
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     150m
      memory:  256Mi

Can the memory settings please be increased to 4Gi?

Don-vip claimed this task.

On IRC (Freenode / #wikimedia-cloud) Krenair helped me to resolve the issue. I just had to adapt my startup script from:

webservice jdk11 start /data/project/spacemedia/run.sh

to:

webservice --mem 4Gi jdk11 start /data/project/spacemedia/run.sh
Don-vip reopened this task as Open. Edited Apr 28 2020, 8:22 AM

Hello,
I am reopening this ticket. As I am working on computing and comparing perceptual hashes in T251026, the application now needs to load all images.
Even with a single thread, it hits the current memory limit (4 Gi) when loading large files:

2020-04-28 07:43:27.111  INFO 8 --- [pool-2-thread-1] o.w.c.donvip.spacemedia.utils.Utils      : Reading image https://www.spacetelescope.org/static/archives/images/original/heic1901a.tif
2020-04-28 07:44:25.903 ERROR 8 --- [pool-2-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

java.lang.OutOfMemoryError: Java heap space
        at java.desktop/java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
        at java.desktop/java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:439)
        at java.desktop/java.awt.image.Raster.createWritableRaster(Raster.java:1005)
        at java.desktop/javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074)
        at com.twelvemonkeys.imageio.ImageReaderBase.getDestination(ImageReaderBase.java:326)
        at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.read(TIFFImageReader.java:900)
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:98)
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:135)
        at org.wikimedia.commons.donvip.spacemedia.service.MediaService.updateReadableStateAndHashes(MediaService.java:150)
        at org.wikimedia.commons.donvip.spacemedia.service.MediaService.updateMedia(MediaService.java:52)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.AbstractAgencyService.doCommonUpdate(AbstractAgencyService.java:543)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.CommonEsoService.updateMediaForUrl(CommonEsoService.java:194)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.CommonEsoService.doUpdateMedia(CommonEsoService.java:436)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.HubbleEsaService.updateMedia(HubbleEsaService.java:41)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
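
For context, a perceptual hash is computed from the pixel data, so each image has to be fully decoded before it can be downscaled and hashed, which is what drives the heap usage. A minimal average-hash sketch (not necessarily the algorithm spacemedia actually uses; names are illustrative):

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AverageHash {

    /**
     * 64-bit average hash: shrink the decoded image to 8x8 grayscale, then
     * set one bit per pixel that is brighter than the mean. The full-resolution
     * BufferedImage must already be in memory before this runs.
     */
    public static long averageHash(BufferedImage image) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(image, 0, 0, 8, 8, null);
        g.dispose();

        int[] pixels = new int[64];
        long sum = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int gray = small.getRaster().getSample(x, y, 0);
                pixels[y * 8 + x] = gray;
                sum += gray;
            }
        }
        long mean = sum / 64;

        long hash = 0;
        for (int i = 0; i < 64; i++) {
            if (pixels[i] >= mean) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    /** Hamming distance between two hashes: small values mean similar images. */
    public static int distance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);
    }
}

Two files are then treated as near-duplicates when the Hamming distance between their hashes falls below some threshold.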

Can you please increase the memory limit to 8 Gi?

Can you please increase the memory limit to 8 Gi?

8Gi is the entire system-level memory for a single Kubernetes worker node. There are other required pods on each worker node, so I'm fairly certain that an 8Gi pod would never end up being scheduled on our cluster. We could try 6Gi in theory, but that's basically giving your tool full control of a worker node. That might be a sign that your tool is ready to graduate out of Toolforge into a dedicated project instead.

That might be a sign that your tool is ready to graduate out of Toolforge into a dedicated project instead.

Thank you for your answer. I'm quite new to the WMF infrastructure; what does it mean/imply to move away from Toolforge to a dedicated project?

Thank you for your answer. I'm quite new to the WMF infrastructure; what does it mean/imply to move away from Toolforge to a dedicated project?

There is an attempt to describe this at https://wikitech.wikimedia.org/wiki/Help:At_a_glance:_Cloud_VPS_and_Toolforge#What_is_the_difference_between_Cloud_VPS_and_Toolforge?

@bd808 thank you for the docs; I now understand better what it implies. It looks like a significant effort to move away from Toolforge, and I'm not ready yet to spend the required amount of work. Can we please try the intermediate solution of increasing the limit to 6Gi, at least to see if it's enough?

bd808 edited projects, added Toolforge (Quota-requests); removed Toolforge.
bd808 renamed this task from Raise spacemedia tool memory limit to Request increased quota for spacemedia Toolforge tool. Jun 19 2020, 5:24 PM
bd808 updated the task description.
aborrero triaged this task as Medium priority.

I see this:

spec:
  hard:
    configmaps: "10"
    limits.cpu: "2"
    limits.memory: 8Gi
    persistentvolumeclaims: "3"
    pods: "4"
    replicationcontrollers: "1"
    requests.cpu: "2"
    requests.memory: 6Gi
    secrets: "10"
    services: "1"
    services.nodeports: "0"

@Bstorm I guess this means the limit is already set to 6Gi memory, right?

I confirmed with @Bstorm yesterday this was done already.

@aborrero I still see a 4 Gi limit when I use kubectl:

tools.spacemedia@tools-sgebastion-08:~$ kubectl get pods -o json | jq .items[0].spec.containers[0].resources
{
  "limits": {
    "cpu": "1",
    "memory": "4Gi"
  },
  "requests": {
    "cpu": "500m",
    "memory": "2147483648"
  }
}

My startup script is:

#!/bin/sh
webservice --cpu 1 --mem 6Gi jdk11 start /data/project/spacemedia/run.sh
kubectl get -n tool-spacemedia limitrange tool-spacemedia -o json | jq .spec.limits
kubectl get pods -o json | jq .items[0].spec.containers[0].resources
kubectl get pods

The only way to start my application is still to request only 4Gi:

webservice --cpu 1 --mem 4Gi jdk11 start /data/project/spacemedia/run.sh

If I ask for more (5Gi or 6Gi), the pod doesn't start.

If I ask for more (5Gi or 6Gi), the pod doesn't start.

What does "not starting" look like? Did you capture any log or event output from Kubernetes about the pod? My hunch is that there was not an exec node with 6Gi of free space to schedule the pod. As mentioned before (T230284#6090359) the memory needs you have are pushing the limits of what Toolforge is currently built to handle.

What does "not starting" look like? Did you capture any log or event output from Kubernetes about the pod?

kubectl get pods

No resources found in tool-spacemedia namespace.

kubectl get events

tools.spacemedia@tools-sgebastion-08:~$ kubectl get events
LAST SEEN   TYPE      REASON              OBJECT                             MESSAGE
2m40s       Normal    Killing             pod/spacemedia-66ff5479bc-lrkvl    Stopping container webservice
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-s7bxq" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-f67x2" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-49pnx" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-gzb89" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-4ljd2" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-7ctmk" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-4rw2d" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m13s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-l9cc5" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m13s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-wjlvh" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
52s         Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   (combined from similar events): Error creating: pods "spacemedia-6c4dc9cd8c-dzs2d" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    ScalingReplicaSet   deployment/spacemedia              Scaled up replica set spacemedia-6c4dc9cd8c to 1

kubectl get events

tools.spacemedia@tools-sgebastion-08:~$ kubectl get events
LAST SEEN   TYPE      REASON              OBJECT                             MESSAGE
2m40s       Normal    Killing             pod/spacemedia-66ff5479bc-lrkvl    Stopping container webservice
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-s7bxq" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi

After some poking around I found a new-to-me setting that limits the per-container RAM and CPU allocations. We have created a LimitRange object in each namespace in addition to the Quota object. This LimitRange looks like this by default:

$ kubectl get limitrange tool-bd808-test2 -o yaml
apiVersion: v1
kind: LimitRange
metadata:
  creationTimestamp: "2019-12-17T04:38:44Z"
  name: tool-bd808-test2
  namespace: tool-bd808-test2
  resourceVersion: "31274100"
  selfLink: /api/v1/namespaces/tool-bd808-test2/limitranges/tool-bd808-test2
  uid: 1db9a555-ce82-4044-b37a-36eb26add217
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 150m
      memory: 256Mi
    max:
      cpu: "1"
      memory: 4Gi
    min:
      cpu: 50m
      memory: 100Mi
    type: Container

The important bit for this bug is the max.memory setting. Let's change that with an admin edit of the object for spacemedia:

$ kubectl --as-group=system:masters --as=admin edit limitrange tool-spacemedia -n tool-spacemedia
limitrange/tool-spacemedia edited
$ kubectl --as-group=system:masters --as=admin get limitrange tool-spacemedia -n tool-spacemedia -o yaml
apiVersion: v1
kind: LimitRange
metadata:
  creationTimestamp: "2019-12-17T01:25:38Z"
  name: tool-spacemedia
  namespace: tool-spacemedia
  resourceVersion: "129451479"
  selfLink: /api/v1/namespaces/tool-spacemedia/limitranges/tool-spacemedia
  uid: 8f82b8fd-ad96-4d99-a973-2495a395e42f
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 150m
      memory: 256Mi
    max:
      cpu: "1"
      memory: 6Gi
    min:
      cpu: 50m
      memory: 100Mi
    type: Container

@Don-vip I think this is ready for you to give it another try. I did a test with a tool of mine and was able to create a 6Gi container after making the same edit I made for spacemedia.

@Don-vip I think this is ready for you to give it another try. I did a test with a tool of mine and was able to create a 6Gi container after making the same edit I made for spacemedia.

It works! The container is created when I request 6 Gi and I can see the increased limit:

Starting webservice....
[
  {
    "default": {
      "cpu": "500m",
      "memory": "512Mi"
    },
    "defaultRequest": {
      "cpu": "150m",
      "memory": "256Mi"
    },
    "max": {
      "cpu": "1",
      "memory": "6Gi"
    },
    "min": {
      "cpu": "50m",
      "memory": "100Mi"
    },
    "type": "Container"
  }
]
{
  "limits": {
    "cpu": "1",
    "memory": "6Gi"
  },
  "requests": {
    "cpu": "500m",
    "memory": "3221225472"
  }
}

Thanks a lot @bd808!