
[opsweek] Bump Yarn logs retention period to support debugging long running jobs
Closed, Resolved · Public

Description

On T342911, we were trying to debug a long-running (3+ days) Yarn job, submitted via Airflow, that had run on 2023-06-24, so about a month old. Unfortunately, all the Yarn logs had already been purged:

xcollazo@an-launcher1002:~$ date
Thu 27 Jul 2023 08:20:46 PM UTC
xcollazo@an-launcher1002:~$ sudo -u analytics yarn logs -appOwner analytics -applicationId application_1686833367123_9750
Unable to get ApplicationState. Attempting to fetch logs directly from the filesystem.
File /var/log/hadoop-yarn/apps/analytics/logs/application_1686833367123_9750 does not exist.

Can not find any log file matching the pattern: [ALL] for the application: application_1686833367123_9750
Can not find the logs for the application: application_1686833367123_9750 with the appOwner: analytics

This makes debugging of older jobs impossible. See T342911 to see why we were interested in such an old job.

In this task we should consider bumping the retention period. I suggest 3 months.
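
(For reference, I believe the retention window is controlled by the yarn.log-aggregation.retain-seconds property in yarn-site.xml; a rough sketch of a 90-day value, purely illustrative and not our current config:)

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <!-- illustrative: 90 days = 90 * 24 * 3600 seconds -->
    <value>7776000</value>
  </property>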

Event Timeline

@BTullis not sure about the tags for this one. Is it SRE?

xcollazo renamed this task from Bump Yarn logs retention period to support debugging long running jobs to [opsweek] Bump Yarn logs retention period to support debugging long running jobs.Jul 31 2023, 6:39 PM

Thanks @xcollazo - I can definitely see the use case here. I remember reading something about this in T300937: Evaluate storing logs from applications in yarn with the typical logging infrastructure, and it doesn't quite add up.

@JAllemandou mentioned that we currently keep 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text.

So did we change the log retention from 40 days to 7 in between these times, or is it something about the fact that they're launched by Airflow that reduces the retention to 7 days, or something else?
I'll look around to see if I can find out more about this.

So did we change the log retention from 40 days to 7 in between these times

Ah, what I meant to say is that about a month passed between job start and when I wanted to fetch the logs:

Job start:
Sometime on 2023-06-24.
When I wanted the logs:

xcollazo@an-launcher1002:~$ date
Thu 27 Jul 2023 08:20:46 PM UTC

So about 33 days elapsed between the two - more than 7 days, but also less than 40. Something is fishy here.

40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text.

Hmm, that is a lot. I wonder if there is any setting that would compress the logs, even if retrieval takes longer.

Seems like there is a flag for this: yarn.nodemanager.log-aggregation.compression-type, which takes none, lzo and gzip as options. Source: https://renenyffenegger.ch/notes/development/Apache/Hadoop/YARN/log-aggregation
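
A rough sketch of how that would look in yarn-site.xml (illustrative only; note that, per the incident further down this task, the value YARN actually expects is the file suffix gz rather than the string gzip):

  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <!-- accepted values: none, lzo, or gz (the gzip file suffix) -->
    <value>gz</value>
  </property>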

+1 to add compression to aggregated application logs!

We changed the log-retention with Nicolas Fraison when he was here.
Our idea was that logs are normally useful in case of failure, and 14 days is well enough to debug (or you can copy the logs to a safe spot if those 14 days are not enough).
I'm happy to reconsider this decision, let's talk :)

+1 to add compression to aggregated application logs!

We changed the log-retention with Nicolas Fraison when he was here.

Aha! I see that change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894481

Our idea was that logs are normally useful in case of failure, and 14 days is well enough to debug (or you can copy the logs to a safe spot if those 14 days are not enough).
I'm happy to reconsider this decision, let's talk :)

So how about reverting to 40 days' worth of logs, but enabling lzo compression? Would that be a good compromise, or should we go right to the 90 days requested in the description, with compression?

So how about reverting to 40 days' worth of logs, but enabling lzo compression? Would that be a good compromise, or should we go right to the 90 days requested in the description, with compression?

Happy to compromise, but the midpoint between 14 and 90 is 52, and if we are doing ~52 we might as well round up to 60, no? Just saying! 😸 😸 😸

Ok for me to keep 60 days - I think this will seldom be used, but eh, we have a counter-example :)
In terms of compression, I'd use gzip instead of lzo for this case, assuming logs are not read very frequently and therefore benefit from being more heavily compressed at the cost of some extra CPU at decompression time.
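
Putting the two together, a sketch of the resulting yarn-site.xml values (illustrative; the actual change was made via the Puppet patch below):

  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <!-- 60 days = 60 * 24 * 3600 seconds -->
    <value>5184000</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <!-- gzip compression, expressed as the file suffix gz -->
    <value>gz</value>
  </property>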

Change 950191 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Retain yarn logs for 60 days and compress with gzip

https://gerrit.wikimedia.org/r/950191

Change 950191 merged by Btullis:

[operations/puppet@production] Retain yarn logs for 60 days and compress with gzip

https://gerrit.wikimedia.org/r/950191

Mentioned in SAL (#wikimedia-analytics) [2023-08-22T13:03:44Z] <btullis> deploying the change to the yarn log retention and compression for T342923

I'll leave the ticket open for a while, whilst we check to make sure that there are no unintended consequences.

Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development / non-prod jobs for that long, and dev jobs are often run on a noisier log level.

Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development / non-prod jobs for that long, and dev jobs are often run on a noisier log level.

I like the idea, but from a quick scan of the docs (e.g. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Queue_Properties) I don't think it is possible.
It looks like it is mainly scheduling parameters, ACLs, and the like that are configurable on a per-queue basis. I could be wrong though, so if you find any evidence that supports this type of configuration, I'd be happy for us to try it.
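
For illustration, the per-queue knobs in capacity-scheduler.xml are along these lines (queue name and values here are made up), i.e. capacity shares and ACLs rather than anything log-retention related:

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <!-- hypothetical share of cluster resources for this queue -->
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <!-- hypothetical ACL: which users/groups may submit to this queue -->
    <value>analytics</value>
  </property>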

dev jobs are often run on a noisier log level

At least noisy logs generally compress quite well, so that's better than leaving them in plaintext.

There was one incident that resulted from this change: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn
In short, we should have referenced the file suffix gz in /etc/hadoop/conf/yarn-site.xml, but we put gzip instead.

btullis@an-worker1078:~$ grep -B3 gz /etc/hadoop/conf/yarn-site.xml 
  <property>
    <description>What type of compression should be used for yarn logs.</description>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>

Now that it has been corrected, log aggregation is working as expected.