Applications run within the analytics yarn cluster have a logging system separate from the production side of things. These logs currently reside in HDFS as uncompressed binary files (containing lots of ASCII) and are typically accessed with the yarn logs ... command on a per-invocation basis. This ticket is to investigate whether these logs, or some subset of them if they are too voluminous, could be indexed and accessed from logstash.wikimedia.org. This has become more of a problem of late as more teams that typically interact with logstash are starting to integrate task scheduling via airflow with the analytics yarn cluster.
Description
Related Objects
Event Timeline
Comment Actions
Adding some information about logs stored in HDFS by YARN:
- We keep them for 40 days
- Today, 40 days of logs weigh 3.1 TB of uncompressed plain text
- They might contain PII (we process IPs etc, and logs could leak those)
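For a rough sense of the ingestion rate these numbers imply, here is a back-of-the-envelope sketch. It assumes the 3.1 figure is in decimal terabytes and that log volume is spread evenly across the 40-day retention window; neither assumption is confirmed above, and the function name is purely illustrative.

```python
# Back-of-the-envelope estimate of the average daily YARN log volume.
# Assumptions (not confirmed in this ticket): the 3.1 figure is decimal
# terabytes (1 TB = 1000 GB) and volume is uniform across the retention window.

TOTAL_TB = 3.1        # size of the retained logs, from the comment above
RETENTION_DAYS = 40   # retention period, from the comment above


def daily_volume_gb(total_tb: float, days: int, gb_per_tb: int = 1000) -> float:
    """Return the average log volume per day in GB."""
    return total_tb * gb_per_tb / days


if __name__ == "__main__":
    daily = daily_volume_gb(TOTAL_TB, RETENTION_DAYS)
    print(f"{daily:.1f} GB/day")  # prints "77.5 GB/day"
```

Under those assumptions, indexing everything would mean ingesting roughly 77.5 GB of uncompressed text per day on average, before any filtering or PII scrubbing.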
Comment Actions
This seems like a major investment given the data sizes involved. Let's decline for the moment; if someone has a strong use case, we'll revisit.