
The spacemedia tool keeps crashing and filling kubernetes nodes
Open, High, Public

Description

The spacemedia tool seems to regularly (weekly?) hit some URL that it cannot reach and then starts crashing. This fills the disk with docker logs on the host that it is running on.

So first, I wanted to let you know that it is crashing regularly (since those logs are likely never reaching you), and second, can you send error logs to the tool's NFS homedir at /data/project/spacemedia? I'm taking the action item to tighten up log rotation to prevent this from crashing cluster nodes.

Thanks!

Details

Related Gerrit Patches:
[operations/puppet@production] toolforge-k8s: rotate docker logs

Event Timeline

Bstorm triaged this task as High priority. Mon, Nov 4, 3:20 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Mon, Nov 4, 3:20 PM
Bstorm added a comment. Mon, Nov 4, 3:23 PM
org.jsoup.HttpStatusException: HTTP error fetching URL
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:760) ~[jsoup-1.12.1.jar!/:na]
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:705) ~[jsoup-1.12.1.jar!/:na]
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:295) ~[jsoup-1.12.1.jar!/:na]
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:284) ~[jsoup-1.12.1.jar!/:na]
        at org.wikimedia.commons.donvip.spacemedia.service.agencies.EsaService.updateMedia(EsaService.java:337) ~[classes!/:0.0.1-SNAPSHOT]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_232]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_232]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_232]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_232]
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) [spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.1.9.RELEASE.jar!/:5.1.9.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_232]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_232]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]

That's a sample of the crashes. Once things go wrong, this trace repeats indefinitely in the logs until the disk fills. It seems that functional logs are also going to STDOUT, which is best practice in Docker, but in this environment you won't be able to see the logs unless they go to disk.
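To illustrate one way to keep a failing fetch from repeating without bound, here is a minimal sketch that caps the number of attempts and logs one short line per failure instead of a full stack trace. It is not the tool's actual code: `BoundedFetch`, `Fetcher`, and `fetchWithLimit` are hypothetical names, and the sketch assumes the failure surfaces as an `IOException` (which, as far as I know, is the parent class of jsoup's `HttpStatusException`).

```java
import java.io.IOException;

class BoundedFetch {

    /** Hypothetical stand-in for a remote call such as a jsoup page fetch. */
    @FunctionalInterface
    interface Fetcher<T> {
        T fetch() throws IOException;
    }

    /**
     * Calls the fetcher up to maxAttempts times and returns the first result.
     * If every attempt fails, the last IOException is rethrown so the caller
     * can give up until the next scheduled run.
     */
    static <T> T fetchWithLimit(Fetcher<T> fetcher, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch();
            } catch (IOException e) {
                last = e;
                // One line per failure keeps log growth proportional to maxAttempts.
                System.err.println("fetch attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        try {
            // Simulated fetch that always fails, as an unreachable URL would.
            BoundedFetch.<String>fetchWithLimit(() -> {
                throw new IOException("HTTP error fetching URL");
            }, 3);
        } catch (IOException e) {
            System.err.println("giving up until the next scheduled run: " + e.getMessage());
        }
    }
}
```

A real deployment would probably also want a delay between attempts; that is left out here to keep the sketch short.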

> in this environment, you won't be able to see the logs unless they go to disk.

It is possible to use kubectl logs <pod_name> to see the stdout data from a pod, but this data is not persisted across pod restarts.

Phamhi added a subscriber: Phamhi. Mon, Nov 4, 3:59 PM

I think our best bet is to set the max-size and max-file options for the json-file logging driver:

https://apimirror.com/docker~1.12/engine/admin/logging/overview/index (our version is so old that you can't find this on the official website)

Bstorm added a comment. Mon, Nov 4, 4:25 PM

@Phamhi: That's docker, not kubernetes. Users cannot do that directly.

Change 548304 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-k8s: rotate docker logs

https://gerrit.wikimedia.org/r/548304

Change 548304 merged by Bstorm:
[operations/puppet@production] toolforge-k8s: rotate docker logs

https://gerrit.wikimedia.org/r/548304

@Bstorm thanks a lot for your bug report! Indeed, I have no automatic error reporting system (yet), so I can only detect such issues by connecting to the server and checking the log files directly, which I didn't do last month.
ESA changed its website recently (I wasn't aware), which triggered an unexpected bug causing an infinite loop. I fixed this.
The /data/project/spacemedia/logs folder contained 140 MB of compressed logs. I was not aware that the default logging configuration in Spring Boot 2.1 keeps previous logs indefinitely. I upgraded to version 2.2, so it should now keep only 7 days of logs.
The application was also logging to both standard output and log files. I changed this behaviour to log only to log files when deployed on Toolforge. Can you please confirm it works?
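For readers who want to see the general idea of file-only logging with bounded disk usage, here is a minimal, dependency-free sketch. It deliberately uses `java.util.logging` rather than the logging stack Spring Boot configures, and it rotates by size instead of by age, so it is an analogy and not the configuration described above. `RotatingFileLog`, the `app.%g.log` file name, and the logger name are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class RotatingFileLog {

    /**
     * Returns a logger that writes only to files under dir, rotating through
     * at most maxFiles files of roughly maxBytesPerFile bytes each.
     */
    static Logger create(Path dir, int maxBytesPerFile, int maxFiles) throws IOException {
        Files.createDirectories(dir);
        // %g is replaced by the generation number (0 is the newest file).
        String pattern = dir.resolve("app.%g.log").toString();
        FileHandler handler = new FileHandler(pattern, maxBytesPerFile, maxFiles, true);
        handler.setFormatter(new SimpleFormatter());

        Logger logger = Logger.getLogger("spacemedia-sketch");
        // Skip the root logger's console handler so nothing is written to the console.
        logger.setUseParentHandlers(false);
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spacemedia-logs");
        Logger log = create(dir, 1024 * 1024, 7);
        log.info("logging to " + dir);
    }
}
```

With a size limit and a file count, the worst-case disk usage is roughly the product of the two, regardless of how often an error repeats.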

What remains for me is to update the ESA parsing code with the new website HTML layout (still no API in sight).