
Stop overriding LogstashFormatter format
Open, Low, Public

Description

When bumping monolog from 1.25.3 to 2.0.2 (T242751), we had to further modify the output of the base LogstashFormatter to avoid a breaking change.

From the changelog:

- Context/extra data are now sent as an object instead of single keys

This meant that entries which previously would have been prefixed (because they duplicated or overwrote existing keys) now end up inside those objects, rather than as "$prefix$key" in the base array.

I'm not 100% clear whether we still need to do anything about these context keys (i.e. is no prefixing/renaming needed?), as they're no longer moved into the top level, per the comment that is there now...

That would imply that either data from the processors doesn't make it into Logstash, or the way processor data and log context are merged has changed and our conflict detection code no longer works, or that we'll end up with some different structure in Logstash (processor data in a sub-array or something like that) which will break the indexing. Any of those would be bad.

To avoid causing this breakage, the code that kept the original format was imported from 1.25.3 (https://github.com/Seldaek/monolog/blob/1.25.3/src/Monolog/Formatter/LogstashFormatter.php#L130-L165). However, this brings on additional Technical-Debt and a maintenance burden: any further upstream changes need to be copied in, and MediaWiki remains a special case.

A longer-term plan should be worked out and implemented to remove this special casing and to decide how we deal with this in Logstash... A patch like https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/579682/ does that removal, but doesn't update the indexing and such on the Logstash side.

Event Timeline

Change 579682 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Cleanup and simplify LogStashFormatter

https://gerrit.wikimedia.org/r/579682

Below is my current understanding of the change, impact, and a proposal for how to roll it out.

Impact
  • Kibana dashboards and dashboard templates.
  • Custom queries saved and permalinked in Puppet for Scap, Icinga alerts, and more.
  • Custom queries submitted to Logstash from Scap for canary checking.
  • Links generated by WikimediaDebug browser extension.
  • Links generated by Phatality for Phabricator task descriptions.
  • Fields read by Phatality within Logstash.
  • Logstash documentation and tutorials.
  • Human habits.
The change
  • The main difference between our override and the new standard is that "extra" keys (known keys added by a Processor) and "context" keys (the free-form array passed with a message to Logger::debug, etc.) will now each be grouped under a single key, instead of being flattened into top-level keys in the Logstash record.
  • The secondary difference is that in the case of key conflicts between the "record", "extra" and "context" keys, we currently prepend c_ and also report the conflict for monitoring queries via logstash_formatter_key_conflict: $keys. This logic will naturally no longer be needed.
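
For orientation, the conflict handling in the current override works roughly like the sketch below. This is an approximation of the behaviour described above, not the actual MediaWiki code; the parameter names are illustrative.

protected function fixKeyConflicts( array $fields, array $context ): array {
	$conflicts = [];
	foreach ( $context as $key => $value ) {
		if ( array_key_exists( $key, $fields ) ) {
			// Key already set by the record or a processor: keep the
			// existing value and park the context value under a
			// prefixed name instead.
			$fields[ 'c_' . $key ] = $value;
			$conflicts[] = $key;
		} else {
			$fields[ $key ] = $value;
		}
	}
	if ( $conflicts ) {
		// Make conflicts discoverable via a monitoring query.
		$fields['logstash_formatter_key_conflict'] = $conflicts;
	}
	return $fields;
}

Under the new grouped format this renaming becomes unnecessary, since context and extra keys can no longer collide with top-level record fields.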
Where we are
{
  "@timestamp": ...,
  "host": ...,
  "message": "Foo Barbara, Baz Quux",

  "exception": { "class": .. "file": ..., "trace": .. },
  "bar": "Barbara",
  "quux": "Quuxified",

  "wiki": ...,
  "url": ...,
  "reqId": ...,
  "normalized_message": "Foo {bar}, Baz {quux}",
}
Where we are going
{
  "@timestamp": ...,
  "host": ...,
  "message": "Foo Barbara, Baz Quux",

  "context": {
    "exception": { "class": .. "file": ..., "trace": .. },
    "bar": "Barbara",
    "quux": "Quuxified"
  },

  "extra": {
    "wiki": ...,
    "url": ...,
    "reqId": ...,
    "normalized_message": "Foo {bar}, Baz {quux}"
  }
}
Proposal

High-level:

  1. Clean the house.
  2. Start doing the new way, but keep the old keys for now. Also remove conflict handling (conflicts are rare, and when they do happen it's generally not through any of the above methods of access but rather something ad hoc; in that case people will find those keys naturally in the new structure).
  3. Wait 90 days (our Logstash retention).
  4. After the 90 days, update dashboards.
  5. Remove the old logic after a week or two.

Concrete:

  1. Clean the house
    • Remove the unused V0 handling from the MW-LogstashFormatter override. Production uses V1 (via CeeFormatter; wmf-config/logging).
    • Update production to no longer pass the then-unused $version parameter.
  2. Start the new, and remove obsolete conflict handling.
    • Update the MW-LogstashFormatter override as follows:
      • For the case where contextKey is not empty, instead of calling our old override, return upstream's format() handling as-is. This means doing the grouping logic instead of the prefixing logic.
      • For the case where contextKey is empty (current production), use our override as today, but update it to match the current upstream format code (grouping assignments), and do an extra iteration, like we have today, over $extra and $context to put each key also in $record if and only if it does not already exist. (A sketch of what this could look like follows this list.)
      • Remove the unused fixKeyConflicts method.
  3. Wait 90 days.
    • During these 90 days, we can start updating impacted consumers that only act on "today" information, such as WikimediaDebug and Phatality.
  4. After the 90 days.
    • Email ops-l and wikitech-l and ask people to update their dashboards. At this point people can cleanly switch from querying old-style fields to new-style fields and still match all the same data. Remember that most Kibana features don't support OR-conditions across different keys, hence we first wait for 90 days of data to accumulate in the new format.
  5. Remove all code in MW-LogstashFormatter except normalizeException.
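
To make step 2 concrete, here is a minimal sketch of what the transitional override could look like. It assumes a class extending Monolog 2's LogstashFormatter, reusing its protected $systemName/$contextKey/$extraKey properties and the inherited normalize()/toJson() helpers; the class name is hypothetical, some upstream header fields are elided, and this is not the actual Gerrit change.

class TransitionalLogstashFormatter extends \Monolog\Formatter\LogstashFormatter {
	public function format( array $record ): string {
		// Normalize dates, exceptions, etc. the same way upstream does.
		$record = $this->normalize( $record );

		// Base fields, roughly as upstream's format() builds them
		// (type/channel/level handling omitted for brevity).
		$message = [
			'@timestamp' => $record['datetime'] ?? gmdate( 'c' ),
			'@version' => 1,
			'host' => $this->systemName,
			'message' => $record['message'] ?? '',
		];

		// New style (upstream): group context and extra under single keys.
		if ( !empty( $record['context'] ) ) {
			$message[ $this->contextKey ] = $record['context'];
		}
		if ( !empty( $record['extra'] ) ) {
			$message[ $this->extraKey ] = $record['extra'];
		}

		// Old style, kept temporarily for the 90-day overlap: also copy
		// each key to the top level, if and only if it is not already set.
		foreach ( [ $record['extra'] ?? [], $record['context'] ?? [] ] as $group ) {
			foreach ( $group as $key => $value ) {
				if ( !array_key_exists( $key, $message ) ) {
					$message[ $key ] = $value;
				}
			}
		}

		return $this->toJson( $message ) . "\n";
	}
}

Once the 90-day overlap ends, the old-style copy loop is dropped and the override can be reduced to (or replaced by) upstream's grouping behaviour, as in the final step.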

Is there any desire/effort to migrate MW logging to the ECS format? If this ticket results in a high-impact logging format change, perhaps it would be worth going all the way and trying to make MW do ECS instead?

I ask because I just filed T291645: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL and would love to see MediaWiki software logs queryable in Hive and elsewhere.

On IRC, @Krinkle wrote:

TLDR: Can we afford to double the size of mediawiki messages within the same document/index?

Assuming that doubling the size is accurate and reflected as actual usage on-disk, my spreadsheet says it will increase disk usage by ~15% across the cluster.

  • Current cluster capacity used: 57%.
  • Projected cluster capacity used: 72%.
  • ~6.5TB per storage tier.

Total cluster usage seems to have increased by about 10% over the last 60 days. Adding a 5% increase for each 30-day interval projects 87% cluster usage by the 90-day mark, when the proposed duplication is expected to end.
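
Spelling out the arithmetic behind those projections (a back-of-envelope sketch; the inputs are the assumptions stated above, not measured data):

$current          = 57; // % of cluster capacity used today
$duplicationCost  = 15; // % added by duplicating the MediaWiki fields
$growthPer30d     = 5;  // % organic growth per 30-day interval

$afterDuplication = $current + $duplicationCost;            // 72%
$atNinetyDays     = $afterDuplication + 3 * $growthPer30d;  // 87% by the 90-day mark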

I would expect Kibana performance to degrade due to larger indexes and bigger responses. This adds an additional 1744 fields for Kibana to track and report on. How much degradation, I'm not sure. This would apply only for the duration of the duplication.

There are a few other options to consider:

  1. Begin the migration to ECS - This is a big project and likely beyond the scope of this task, but this task presents an opportunity. :)
  2. Duplicate the fields. - Potential impacts beyond those previously described include reduced cluster headroom and Kibana performance degradation. These issues are time-limited.
  3. Leverage field aliases to query the fields in their new location from the old locations. - A bit tricky to implement, but should have no adverse effects on dashboards or human habits. We would need to declare the field aliases in the schema (index template), which would make them available at index generation time. We'll want to test whether Kibana handles this situation gracefully; fields becoming unqueryable would make this option a no-go.
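
For illustration, the field-alias approach would amount to adding mappings along these lines to the index template (the field names here are hypothetical; Elasticsearch's alias field type requires the target to exist as a concrete field):

"mappings": {
  "properties": {
    "context": {
      "properties": {
        "bar": { "type": "keyword" }
      }
    },
    "bar": {
      "type": "alias",
      "path": "context.bar"
    }
  }
}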

I would guess that 1 is the heaviest project, but ultimately where we hope to be. 2 is probably the easiest, but riskier. 3 would be a little harder to implement, but carries less risk of cluster overload and performance degradation (among other benefits).