[opsweek] Bump Yarn logs retention period to support debugging long running jobs
Closed, ResolvedPublic

Assigned To

Authored By

	xcollazo
	Jul 27 2023, 8:25 PM

Description

On T342911, we were trying to debug a long running (3+ days) Yarn job submitted via Airflow that had run on 2023-06-24, so about a month old. Unfortunately, all the Yarn logs were purged:

xcollazo@an-launcher1002:~$ date
Thu 27 Jul 2023 08:20:46 PM UTC
xcollazo@an-launcher1002:~$ sudo -u analytics yarn logs -appOwner analytics -applicationId application_1686833367123_9750
Unable to get ApplicationState. Attempting to fetch logs directly from the filesystem.
File /var/log/hadoop-yarn/apps/analytics/logs/application_1686833367123_9750 does not exist.

Can not find any log file matching the pattern: [ALL] for the application: application_1686833367123_9750
Can not find the logs for the application: application_1686833367123_9750 with the appOwner: analytics

This makes debugging older jobs impossible. See T342911 for why we were interested in such an old job.

In this task we should consider bumping the retention period. I suggest 3 months.

Details

	Subject	Repo	Branch	Lines +/-
	Retain yarn logs for 60 days and compress with gzip	operations/puppet	production	+91 -81


Related Objects

Mentioned In: T330176: [Data Platform] Deploy Spark History Service
T342911: Data Quality Issue: Wikitext History Job fail / rerun in Airflow
Mentioned Here: T300937: Evaluate storing logs from applications in yarn with the typical logging infrastructure
T342911: Data Quality Issue: Wikitext History Job fail / rerun in Airflow

Event Timeline

xcollazo created this task. Jul 27 2023, 8:25 PM

Restricted Application added a subscriber: Aklapper. Jul 27 2023, 8:25 PM

@BTullis not sure about the tags for this one. Is it SRE ?

xcollazo mentioned this in T342911: Data Quality Issue: Wikitext History Job fail / rerun in Airflow. Jul 27 2023, 8:36 PM

LSobanski edited projects, added Data-Platform-SRE; removed SRE. Jul 28 2023, 10:01 AM

LSobanski subscribed.

xcollazo renamed this task from Bump Yarn logs retention period to support debugging long running jobs to [opsweek] Bump Yarn logs retention period to support debugging long running jobs. Jul 31 2023, 6:39 PM

Thanks @xcollazo - I can definitely see the use case here. I remember reading something about this in T300937: Evaluate storing logs from applications in yarn with the typical logging infrastructure and it doesn't quite add up.

@JAllemandou mentioned that we currently keep 40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text.

So did we change the log retention from 40 days to 7 in between these times, or is it something about the fact that they're launched by Airflow that reduces the retention to 7 days, or something else?
I'll look around to see if I can find out more about this.

LSobanski unsubscribed. Aug 1 2023, 10:39 AM

So did we change the log retention from 40 days to 7 in between these times

Ah, what I meant to say is that about a month passed between job start and when I wanted to fetch the logs:

Job start:
Sometime on 2023-06-24.
When I wanted the logs:

xcollazo@an-launcher1002:~$ date
Thu 27 Jul 2023 08:20:46 PM UTC

So about 33 days elapsed between. So more than 7 days but also less than 40. Something is fishy here.
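The arithmetic above can be checked with a small sketch. It assumes, for simplicity, that retention is counted from the job start date; in practice the clock more likely starts when the logs are aggregated, so the real age of the logs would be somewhat lower. The 7 and 40 day windows are the two values mentioned so far in this thread.

```python
from datetime import date

# Dates taken from this thread.
job_start = date(2023, 6, 24)
log_request = date(2023, 7, 27)

elapsed_days = (log_request - job_start).days

# Candidate retention windows mentioned in the discussion (in days).
# Assumption: retention is measured from job start, which is a simplification.
still_retained = {window: elapsed_days <= window for window in (7, 40)}

print(f"elapsed: {elapsed_days} days")
for window, kept in still_retained.items():
    print(f"retention {window:>2} days -> logs {'kept' if kept else 'purged'}")
```

With a 40 day window the logs should still have been there, which is why the effective retention had to be something shorter.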

40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text.

Hmm, that is a lot. I wonder if there is any setting that would compress the logs, even if retrieval takes longer.

In T342923#9059010, @xcollazo wrote:

40 days' worth of YARN logs and that these are about 3.1 TB of uncompressed text.

Hmm, that is a lot. I wonder if there is any setting that would compress the logs, even if retrieval takes longer.

Seems like there is a flag for this: yarn.nodemanager.log-aggregation.compression-type, and takes none, lzo and gzip as options. Source: https://renenyffenegger.ch/notes/development/Apache/Hadoop/YARN/log-aggregation

xcollazo mentioned this in T330176: [Data Platform] Deploy Spark History Service. Aug 16 2023, 4:37 PM

+1 to add compression to aggregated application logs!

We changed the log-retention with Nicolas Fraison when he was here.
Our idea was that normally logs are useful in case of failure, and 14 days is well enough to debug (or, you can copy the logs in a safe spot if needed if those 14 days are not enough).
I'm happy to reconsider this decision, let's talk :)

In T342923#9101565, @JAllemandou wrote:

+1 to add compression to aggregated application logs!

We changed the log-retention with Nicolas Fraison when he was here.

Aha! I see that change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894481

Our idea was that normally logs are useful in case of failure, and 14 days is well enough to debug (or, you can copy the logs in a safe spot if needed if those 14 days are not enough).
I'm happy to reconsider this decision, let's talk :)

So how about reverting to 40 days' worth of logs, but enabling lzo compression? Would that be a good compromise, or should we go right to the 90 days requested in the description, with compression?

So how about reverting to 40 days' worth of logs, but enabling lzo compression? Would that be a good compromise, or should we go right to the 90 days requested in the description, with compression?

Happy to compromise, but the average between 14 and 90 is 59, and if we are doing 59 we might as well do 60, no? Just saying! 😸 😸 😸

Ok for me to keep 60 days - I think this will seldom be used, but eh, we have a counter example :)
In terms of compression, I'd use gzip instead of lzo for this case, assuming the logs are read infrequently and therefore benefit from a higher compression ratio at the cost of some more CPU at decompression time.

BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board. Aug 18 2023, 4:17 PM

Change 950191 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Retain yarn logs for 60 days and compress with gzip

https://gerrit.wikimedia.org/r/950191

gerritbot added a project: Patch-For-Review. Aug 18 2023, 4:21 PM

BTullis moved this task from In Progress to Needs Review on the Data-Platform-SRE board. Aug 18 2023, 4:38 PM

Change 950191 merged by Btullis:

[operations/puppet@production] Retain yarn logs for 60 days and compress with gzip

https://gerrit.wikimedia.org/r/950191

Deploying this change.

Mentioned in SAL (#wikimedia-analytics) [2023-08-22T13:03:44Z] <btullis> deploying the change to the yarn log retention and compression for T342923

I'll leave the ticket open for a while, whilst we check to make sure that there are no unintended consequences.

Maintenance_bot removed a project: Patch-For-Review. Aug 22 2023, 1:11 PM

Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development / non-prod jobs for that long, and dev jobs are often run on a noisier log level.

In T342923#9109622, @fkaelin wrote:

Would it be possible to define the retention period based on the queue an application is running in? It is not required to keep the logs for development / non-prod jobs for that long, and dev jobs are often run on a noisier log level.

I like the idea, but from a quick scan of the docs (e.g. https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Queue_Properties) I don't think it is possible.
It looks like it is mainly scheduling parameters, ACLs, etc. that are configurable on a per-queue basis. I could be wrong though, so if you find any evidence that supports this type of configuration I'd be happy for us to try it.

dev jobs are often run on a noisier log level

At least noisy logs generally compress quite well, so that's better than leaving them in plaintext.
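As a rough illustration of that point, the sketch below compresses some made-up, repetitive log lines. The log format and line count are hypothetical, and Python's gzip module is only a stand-in for whatever YARN does internally when it aggregates logs, so the ratio it prints says nothing precise about our cluster.

```python
import gzip

# Hypothetical, repetitive DEBUG-style lines standing in for noisy job output.
lines = [
    f"2023-08-22 13:00:{i % 60:02d} DEBUG example.Task: processed partition {i}\n"
    for i in range(10_000)
]
raw = "".join(lines).encode("utf-8")

# Compress in memory and compare sizes.
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)

print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.1f}x")
```

Real logs are less uniform than this, so the actual saving would need to be measured on the aggregated log directory itself.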

Gehel assigned this task to BTullis. Aug 22 2023, 3:32 PM

There was one incident that resulted from this change: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-30_hadoop-yarn
In short, the value of yarn.nodemanager.log-aggregation.compression-type in /etc/hadoop/conf/yarn-site.xml should have been gz, but we had set it to gzip instead.

btullis@an-worker1078:~$ grep -B3 gz /etc/hadoop/conf/yarn-site.xml 
  <property>
    <description>What type of compression should be used for yarn logs.</description>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>

Now that it has been corrected, log aggregation is working as expected.
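A check along these lines might have caught the typo before deployment. This is only a sketch: the set of accepted values is an assumption pieced together from this thread (none and lzo from the earlier comment, gz from the fix above), and expressing retention through yarn.log-aggregation.retain-seconds is likewise assumed here rather than taken from the merged patch. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: values accepted for
# yarn.nodemanager.log-aggregation.compression-type, as gathered from this thread.
ASSUMED_VALID_COMPRESSION_TYPES = {"none", "lzo", "gz"}


def retain_seconds(days: int) -> int:
    """Convert a retention period in days to seconds,
    the unit assumed for yarn.log-aggregation.retain-seconds."""
    return days * SECONDS_PER_DAY


def check_compression_type(value: str) -> str:
    """Return the value unchanged if it is in the assumed valid set, else raise."""
    if value not in ASSUMED_VALID_COMPRESSION_TYPES:
        raise ValueError(
            f"unsupported compression type {value!r}; "
            f"expected one of {sorted(ASSUMED_VALID_COMPRESSION_TYPES)}"
        )
    return value


print(retain_seconds(60))            # 60 days expressed in seconds
print(check_compression_type("gz"))  # passes
try:
    check_compression_type("gzip")   # the value that caused the incident
except ValueError as err:
    print(f"rejected: {err}")
```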
