Evaluate storing logs from applications in yarn with the typical logging infrastructure
Closed, Declined (Public)

Description

Applications run within the analytics YARN cluster have a separate logging system from the production side of things. These logs currently reside in HDFS as uncompressed binary files (containing lots of ASCII) and are typically accessed with the yarn logs ... command on a per-invocation basis. This ticket is to investigate whether these logs, or some subset of them if they are too voluminous, could be indexed and accessed from logstash.wikimedia.org. This has become more of a problem of late as more teams that typically interact with logstash are starting to integrate task scheduling via Airflow with the analytics YARN cluster.

Event Timeline

Adding some information about logs stored in HDFS by YARN:

  • We keep them for 40 days
  • Today, 40 days of logs weigh 3.1 TB of uncompressed plain text
  • They might contain PII (we process IPs etc, and logs could leak those)
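
For a rough sense of scale, the figures above can be turned into an average daily volume. This is only a back-of-the-envelope sketch: it assumes the 3.1 figure means terabytes in decimal units (1 TB = 1000 GB) and that log volume is spread evenly across the 40-day retention window, neither of which is stated in the ticket. The function name is hypothetical.

```python
# Back-of-the-envelope estimate of average daily YARN log volume.
# Assumptions (not from the ticket): decimal units, uniform volume per day.

RETENTION_DAYS = 40   # retention period quoted above
TOTAL_TB = 3.1        # total uncompressed size quoted above


def daily_volume_gb(total_tb: float, days: int) -> float:
    """Average volume per day in GB, assuming 1 TB = 1000 GB."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_tb * 1000 / days


if __name__ == "__main__":
    # 3.1 TB over 40 days is about 77.5 GB of uncompressed logs per day.
    print(f"{daily_volume_gb(TOTAL_TB, RETENTION_DAYS):.1f} GB/day")
```

Under those assumptions, any indexing pipeline would need to absorb on the order of tens of gigabytes of uncompressed text per day, before accounting for PII filtering.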
Gehel subscribed.

This seems like a major investment given the data sizes involved. Let's decline for the moment; if someone has a strong use case, we'll revisit.