
Request increased quota for spacemedia Toolforge tool
Closed, ResolvedPublic

Description

Tool Name: spacemedia
Type of quota increase requested: RAM per pod beyond the 4Gi hard limit; ideally the largest size that can be allocated
Reason: T251026: Detect duplicate images via perceptual hashes before upload; see T230284#6087891 for additional information


I have started to deploy my new "spacemedia" tool and just hit the memory limit:

2019-08-10 22:05:57.076 ERROR 9 --- [pool-2-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task.

java.lang.OutOfMemoryError: Java heap space
        at java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92) ~[na:1.8.0_212]
        at java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:445) ~[na:1.8.0_212]
        at java.awt.image.Raster.createWritableRaster(Raster.java:941) ~[na:1.8.0_212]
        at javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074) ~[na:1.8.0_212]
        at javax.imageio.ImageReader.getDestination(ImageReader.java:2892) ~[na:1.8.0_212]
        at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1082) ~[na:1.8.0_212]
        at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050) ~[na:1.8.0_212]
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:78) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:100) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.service.EsaService.updateMissingImages(EsaService.java:291) ~[classes!/:0.0.1-SNAPSHOT]
        at org.wikimedia.commons.donvip.spacemedia.service.EsaService.updateImages(EsaService.java:314) ~[classes!/:0.0.1-SNAPSHOT]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_212]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]

Can you please increase it? The tool computes SHA-1 hashes of free media released by space agencies in order to detect those missing on Wikimedia Commons. For large images, more memory is needed.

Event Timeline

Is this running on Kubernetes or on Grid Engine?

It's running on Kubernetes / jdk8.

This tool has a 4G limit. We do not currently have a method for raising the memory limit for webservices running on the Toolforge Kubernetes cluster (T183436: Add memory limit configuration for Kubernetes pods).

I have not looked deeply at the source code for the tool, but the stack trace shown seems to imply that the SHA-1 hash is computed using a java.awt.image.BufferedImage object. This seems like a really heavy way to compute a file's SHA-1 value. Is there maybe a technique you could use to compute the SHA-1 from a stream of bytes without keeping the entire decoded image in RAM?
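
For reference, hashing from a byte stream only needs a small, fixed-size buffer, so heap usage stays constant regardless of file size. A minimal sketch of that approach (class and method names here are illustrative, not the tool's actual Utils code, which is not shown in this task):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingSha1 {

    /**
     * Computes the SHA-1 of a remote file by streaming it through a small
     * buffer instead of decoding it as an image.
     */
    public static String sha1(URL url) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new DigestInputStream(url.openStream(), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}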

My bad. I wrote this code some weeks ago and did not remember that the image-loading code was there to ensure the files are indeed valid images (to avoid uploading corrupted files to Wikimedia Commons). Some ESA files are corrupted. The SHA-1 hashing code does not require reading the image.
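
For what it's worth, a header-level check can reject files that ImageIO does not recognise without allocating the full raster, although it will not catch every kind of corruption (truncated pixel data, for instance, only shows up during a full decode). A hypothetical sketch, with illustrative names:

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageHeaderCheck {

    /**
     * Returns true if ImageIO recognises the format and can read the image
     * dimensions. Much cheaper than a full decode, but only a full decode
     * detects corruption in the pixel data itself.
     */
    public static boolean looksLikeAnImage(File file) {
        try (ImageInputStream iis = ImageIO.createImageInputStream(file)) {
            if (iis == null) {
                return false;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return false;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis);
                return reader.getWidth(0) > 0 && reader.getHeight(0) > 0;
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            return false;
        }
    }
}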

This is not a blocking issue, as my tool seems robust to it: a scheduled task that fails because of a memory error will likely succeed on the next try. So I can wait for the other task to be completed :)

After some weeks of runtime it appears my tool was able to handle all files, so I don't really need more memory; I am cancelling this request.

It seems that since T234702 was done, the default memory settings have been considerably lowered. My tool crashes right after startup, and I can see the following using kubectl describe pods:

Name:           spacemedia-597444c5db-fc6mg
Namespace:      tool-spacemedia
Priority:       0
Labels:         name=spacemedia
                toolforge=tool
                tools.wmflabs.org/webservice=true
                tools.wmflabs.org/webservice-version=1
Annotations:
                kubernetes.io/limit-ranger: LimitRanger plugin set: cpu, memory request for container webservice; cpu, memory limit for container webservice
Containers:
  webservice:
    Image:         docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-web:latest
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sat, 11 Apr 2020 12:36:18 +0000
      Finished:     Sat, 11 Apr 2020 12:41:52 +0000
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     150m
      memory:  256Mi

Can the memory settings please be increased to 4Gi?

Don-vip claimed this task.

On IRC (Freenode / #wikimedia-cloud) Krenair helped me to resolve the issue. I just had to adapt my startup script from:

webservice jdk11 start /data/project/spacemedia/run.sh

to:

webservice --mem 4Gi jdk11 start /data/project/spacemedia/run.sh
Don-vip reopened this task as Open. Edited Apr 28 2020, 8:22 AM

Hello,
I am reopening this ticket. As I am working on computing and comparing perceptual hashes in T251026, the application now needs to load all images.
Even with a single thread, it hits the current memory limit (4 Gi) when loading large files:

2020-04-28 07:43:27.111  INFO 8 --- [pool-2-thread-1] o.w.c.donvip.spacemedia.utils.Utils      : Reading image https://www.spacetelescope.org/static/archives/images/original/heic1901a.tif
2020-04-28 07:44:25.903 ERROR 8 --- [pool-2-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

java.lang.OutOfMemoryError: Java heap space
        at java.desktop/java.awt.image.DataBufferByte.<init>(DataBufferByte.java:92)
        at java.desktop/java.awt.image.ComponentSampleModel.createDataBuffer(ComponentSampleModel.java:439)
        at java.desktop/java.awt.image.Raster.createWritableRaster(Raster.java:1005)
        at java.desktop/javax.imageio.ImageTypeSpecifier.createBufferedImage(ImageTypeSpecifier.java:1074)
        at com.twelvemonkeys.imageio.ImageReaderBase.getDestination(ImageReaderBase.java:326)
        at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.read(TIFFImageReader.java:900)
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:98)
        at org.wikimedia.commons.donvip.spacemedia.utils.Utils.readImage(Utils.java:135)
        at org.wikimedia.commons.donvip.spacemedia.service.MediaService.updateReadableStateAndHashes(MediaService.java:150)
        at org.wikimedia.commons.donvip.spacemedia.service.MediaService.updateMedia(MediaService.java:52)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.AbstractAgencyService.doCommonUpdate(AbstractAgencyService.java:543)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.CommonEsoService.updateMediaForUrl(CommonEsoService.java:194)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.CommonEsoService.doUpdateMedia(CommonEsoService.java:436)
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.HubbleEsaService.updateMedia(HubbleEsaService.java:41)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
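
For context, a perceptual hash is computed from the pixel data, so each image has to be fully decoded before it can be downscaled and hashed, which is what drives the heap usage. A minimal average-hash sketch (not necessarily the algorithm spacemedia actually uses; names are illustrative):

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class AverageHash {

    /**
     * 64-bit average hash: shrink the decoded image to 8x8 grayscale, then
     * set one bit per pixel that is brighter than the mean. The full-resolution
     * BufferedImage must already be in memory before this runs.
     */
    public static long averageHash(BufferedImage image) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(image, 0, 0, 8, 8, null);
        g.dispose();

        int[] pixels = new int[64];
        long sum = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int gray = small.getRaster().getSample(x, y, 0);
                pixels[y * 8 + x] = gray;
                sum += gray;
            }
        }
        long mean = sum / 64;

        long hash = 0;
        for (int i = 0; i < 64; i++) {
            if (pixels[i] >= mean) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    /** Hamming distance between two hashes: small values mean similar images. */
    public static int distance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);
    }
}

Two files are then treated as near-duplicates when the Hamming distance between their hashes falls below some threshold.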

Can you please increase the memory limit to 8 Gi?

Can you please increase the memory limit to 8 Gi?

8Gi is the entire system-level memory for a single Kubernetes worker node. There are other required pods on each worker node, so I'm fairly certain that an 8Gi pod would never end up being scheduled on our cluster. We could try 6Gi in theory, but that's basically giving your tool full control of a worker node. That might be a sign that your tool is ready to graduate out of Toolforge into a dedicated project instead.

That might be a sign that your tool is ready to graduate out of Toolforge into a dedicated project instead.

Thank you for your answer. I'm quite new to the WMF infrastructure; what does it mean/imply to move away from Toolforge to a dedicated project?

Thank you for your answer. I'm quite new to the WMF infrastructure; what does it mean/imply to move away from Toolforge to a dedicated project?

There is an attempt to describe this at https://wikitech.wikimedia.org/wiki/Help:At_a_glance:_Cloud_VPS_and_Toolforge#What_is_the_difference_between_Cloud_VPS_and_Toolforge?

@bd808 thank you for the docs; I now understand better what it implies. It looks like a significant effort to move away from Toolforge, and I'm not ready yet to spend the required amount of work. Can we please try the intermediate solution of increasing the limit to 6Gi, at least to see if it's enough?

bd808 edited projects, added Toolforge (Quota-requests); removed Toolforge.
bd808 renamed this task from Raise spacemedia tool memory limit to Request increased quota for spacemedia Toolforge tool. Jun 19 2020, 5:24 PM
bd808 updated the task description.
aborrero triaged this task as Medium priority.

I see this:

spec:
  hard:
    configmaps: "10"
    limits.cpu: "2"
    limits.memory: 8Gi
    persistentvolumeclaims: "3"
    pods: "4"
    replicationcontrollers: "1"
    requests.cpu: "2"
    requests.memory: 6Gi
    secrets: "10"
    services: "1"
    services.nodeports: "0"

@Bstorm I guess this means the limit is already set to 6Gi memory, right?

I confirmed with @Bstorm yesterday this was done already.

@aborrero I still see a 4 Gi limit when I use kubectl:

tools.spacemedia@tools-sgebastion-08:~$ kubectl get pods -o json | jq .items[0].spec.containers[0].resources
{
  "limits": {
    "cpu": "1",
    "memory": "4Gi"
  },
  "requests": {
    "cpu": "500m",
    "memory": "2147483648"
  }
}

My startup script is:

#!/bin/sh
webservice --cpu 1 --mem 6Gi jdk11 start /data/project/spacemedia/run.sh
kubectl get -n tool-spacemedia limitrange tool-spacemedia -o json | jq .spec.limits
kubectl get pods -o json | jq .items[0].spec.containers[0].resources
kubectl get pods

The only way to start my application is still to request only 4Gi:

webservice --cpu 1 --mem 4Gi jdk11 start /data/project/spacemedia/run.sh

If I ask for more (5Gi or 6Gi), the pod doesn't start.

If I ask for more (5Gi or 6Gi), the pod doesn't start.

What does "not starting" look like? Did you capture any log or event output from Kubernetes about the pod? My hunch is that there was not an exec node with 6Gi of free space to schedule the pod. As mentioned before (T230284#6090359) the memory needs you have are pushing the limits of what Toolforge is currently built to handle.

What does "not starting" look like? Did you capture any log or event output from Kubernetes about the pod?

kubectl get pods

No resources found in tool-spacemedia namespace.

kubectl get events

tools.spacemedia@tools-sgebastion-08:~$ kubectl get events
LAST SEEN   TYPE      REASON              OBJECT                             MESSAGE
2m40s       Normal    Killing             pod/spacemedia-66ff5479bc-lrkvl    Stopping container webservice
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-s7bxq" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-f67x2" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-49pnx" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-gzb89" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-4ljd2" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-7ctmk" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m14s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-4rw2d" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m13s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-l9cc5" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m13s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-wjlvh" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
52s         Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   (combined from similar events): Error creating: pods "spacemedia-6c4dc9cd8c-dzs2d" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m40s       Normal    DELETE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    CREATE              ingress/spacemedia-subdomain       Ingress tool-spacemedia/spacemedia-subdomain
2m16s       Normal    ScalingReplicaSet   deployment/spacemedia              Scaled up replica set spacemedia-6c4dc9cd8c to 1

kubectl get events

tools.spacemedia@tools-sgebastion-08:~$ kubectl get events
LAST SEEN   TYPE      REASON              OBJECT                             MESSAGE
2m40s       Normal    Killing             pod/spacemedia-66ff5479bc-lrkvl    Stopping container webservice
2m15s       Warning   FailedCreate        replicaset/spacemedia-6c4dc9cd8c   Error creating: pods "spacemedia-6c4dc9cd8c-s7bxq" is forbidden: maximum memory usage per Container is 4Gi, but limit is 5Gi

After some poking around I found a new-to-me setting that limits the per-container RAM and CPU allocations. We have created a LimitRange object in each namespace in addition to the Quota object. This LimitRange looks like this by default:

$ kubectl get limitrange tool-bd808-test2 -o yaml
apiVersion: v1
kind: LimitRange
metadata:
  creationTimestamp: "2019-12-17T04:38:44Z"
  name: tool-bd808-test2
  namespace: tool-bd808-test2
  resourceVersion: "31274100"
  selfLink: /api/v1/namespaces/tool-bd808-test2/limitranges/tool-bd808-test2
  uid: 1db9a555-ce82-4044-b37a-36eb26add217
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 150m
      memory: 256Mi
    max:
      cpu: "1"
      memory: 4Gi
    min:
      cpu: 50m
      memory: 100Mi
    type: Container

The important bit for this bug is the max.memory setting. Let's change that with an admin edit of the object for spacemedia:

$ kubectl --as-group=system:masters --as=admin edit limitrange tool-spacemedia -n tool-spacemedia
limitrange/tool-spacemedia edited
$ kubectl --as-group=system:masters --as=admin get limitrange tool-spacemedia -n tool-spacemedia -o yaml
apiVersion: v1
kind: LimitRange
metadata:
  creationTimestamp: "2019-12-17T01:25:38Z"
  name: tool-spacemedia
  namespace: tool-spacemedia
  resourceVersion: "129451479"
  selfLink: /api/v1/namespaces/tool-spacemedia/limitranges/tool-spacemedia
  uid: 8f82b8fd-ad96-4d99-a973-2495a395e42f
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 150m
      memory: 256Mi
    max:
      cpu: "1"
      memory: 6Gi
    min:
      cpu: 50m
      memory: 100Mi
    type: Container

@Don-vip I think this is ready for you to give it another try. I did a test with a tool of mine and was able to create a 6Gi container after making the same edit I made for spacemedia.

@Don-vip I think this is ready for you to give it another try. I did a test with a tool of mine and was able to create a 6Gi container after making the same edit I made for spacemedia.

It works! The container is created when I request 6 Gi and I can see the increased limit:

Starting webservice....
[
  {
    "default": {
      "cpu": "500m",
      "memory": "512Mi"
    },
    "defaultRequest": {
      "cpu": "150m",
      "memory": "256Mi"
    },
    "max": {
      "cpu": "1",
      "memory": "6Gi"
    },
    "min": {
      "cpu": "50m",
      "memory": "100Mi"
    },
    "type": "Container"
  }
]
{
  "limits": {
    "cpu": "1",
    "memory": "6Gi"
  },
  "requests": {
    "cpu": "500m",
    "memory": "3221225472"
  }
}

Thanks a lot @bd808!