
Provide way to enhance Logstash logs with information from the Train Log Triage process
Open, Needs Triage, Public

Description

Forked from T293694

This task is to organize the technical exploration of a tool that incorporates information from the Train Log Triage process into Logstash events.

  • Can Dashboards plugins read/write to an arbitrary index?
  • Can Dashboards read from the Phabricator API?
  • Can a Logstash script or plugin periodically refresh a cache from OpenSearch without restarts?
  • Is the filter approach performant enough for our scale?
  • Can rules submitted through Dashboards be attributed correctly?
  • Can stack traces be parsed or otherwise used to more exactly identify log messages by the code path that led to the event?
  • Can we add information from MediaWiki's GitInfo (cache/gitinfo/*.json) to log messages for traceability from stack trace frames to recent changes (git blame -L {lineno},{lineno} {file})?

Event Timeline

This isn't strictly about Train Log Triage, but during our Trainsperiments week the question came up of whether we can add more information to log messages about the state of the deployed git checkouts. Specifically, each /srv/mediawiki/php-{version}/cache/gitinfo directory contains a number of JSON files that record the state of the mediawiki/core and submodule git checkouts at the time of deployment (this is necessary because .git directories don't actually live on app servers). In MediaWiki, the interface to this information is the GitInfo class.
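As a rough illustration of what reading that deployment state outside of MediaWiki could look like, here is a minimal Python sketch that loads every JSON file in a gitinfo directory. The function names are hypothetical, and the `headSHA1` key is an assumption about the cache file contents that should be checked against real files before relying on it.

```python
import json
from pathlib import Path


def load_gitinfo(gitinfo_dir):
    """Return {file stem: parsed JSON} for every *.json file in gitinfo_dir.

    Files that cannot be read or parsed are skipped rather than raising,
    since a single bad cache file should not block log enrichment.
    """
    info = {}
    for path in sorted(Path(gitinfo_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                info[path.stem] = json.load(fh)
        except (OSError, json.JSONDecodeError):
            continue
    return info


def head_commits(info, key="headSHA1"):
    """Map each cache file to its recorded commit hash.

    ASSUMPTION: the key name "headSHA1" is a guess at the cache schema;
    entries without that key are left out.
    """
    return {
        name: data[key]
        for name, data in info.items()
        if isinstance(data, dict) and key in data
    }
```

A caller would pass one of the /srv/mediawiki/php-{version}/cache/gitinfo directories to `load_gitinfo` and attach the resulting hashes to outgoing log events; where in the pipeline that happens is still an open question.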

This information could be useful for automated or manual triage and escalation. One idea, for example, would be to link from stack trace frames in Kibana to git blames in Gerrit. Looking at git blame output for a particular file:line is one way to find recent changes that might have affected the code path from where an exception was thrown.
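To make the stack-frame idea concrete, here is a small Python sketch that turns one frame into the `git blame -L {lineno},{lineno} {file}` command mentioned in the task description. It assumes frames follow PHP's usual `#N /path/File.php(LINE): call()` string format and that the checkout root is known. The function names and the example path are hypothetical, and building an actual Gerrit link is left out because the URL scheme is not specified here.

```python
import re
import shlex
from pathlib import PurePosixPath

# Matches frames such as "#0 /srv/x/includes/Foo.php(123): Foo->bar()".
# Frames without a file, e.g. "#1 {main}" or "[internal function]", do not match.
FRAME_RE = re.compile(r"^#\d+\s+(?P<file>/\S+?)\((?P<line>\d+)\):")


def parse_frame(frame):
    """Return {"file": str, "line": int} for a frame, or None if it has no file."""
    match = FRAME_RE.match(frame.strip())
    if match is None:
        return None
    return {"file": match.group("file"), "line": int(match.group("line"))}


def blame_command(frame, checkout_root):
    """Build a single-line git blame command for a stack trace frame.

    Returns None when the frame has no file or the file lies outside
    checkout_root, since blame needs a path relative to the repository.
    """
    parsed = parse_frame(frame)
    if parsed is None:
        return None
    try:
        relative = PurePosixPath(parsed["file"]).relative_to(checkout_root)
    except ValueError:
        return None
    line = parsed["line"]
    return f"git blame -L {line},{line} -- {shlex.quote(str(relative))}"


# Hypothetical frame and deployment path, for illustration only.
example = "#0 /srv/mediawiki/php-1.38.0-wmf.9/includes/Foo.php(123): Foo->bar()"
print(blame_command(example, "/srv/mediawiki/php-1.38.0-wmf.9"))
# prints: git blame -L 123,123 -- includes/Foo.php
```

Files in submodule checkouts would need the submodule's own root rather than the core root, which is one place the gitinfo data could help.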

There are potentially a lot of other uses for this data.

Talked with @lmata about this. There are some unknowns from SRE Observability that need experimentation or information from Release-Engineering-Team (and possibly specific questions for other teams):

  • Can Dashboards plugins read/write to an arbitrary index?
  • Can Dashboards read from the Phabricator API?
  • Can we add information from MediaWiki’s GitInfo (cache/gitinfo/*.json) to log messages for traceability from stack trace frames to recent changes (git blame -L {lineno},{lineno} {file})?

A couple of candid thoughts:

  • gitinfo: may require a better view of the logging pipeline than we currently have on our team, but maybe we can think about this together. It may also require more MediaWiki knowledge, depending on where in our current logging pipeline this information can be injected. We definitely know all the places we can get the blame info.
  • phab api: definitely seems possible, but we should try this with a testing openapi