
Stop overriding LogstashFormatter format
Open, Low, Public

Description

When bumping monolog from 1.25.3 to 2.0.2 (T242751), we had to further modify the output of the base LogstashFormatter to avoid a breaking change.

From the changelog:

- Context/extra data are now sent as an object instead of single keys

This meant that entries which previously would have been prefixed (because they duplicated or overwrote existing keys) now end up inside those objects, rather than as "$prefix$key" in the base array.

I'm not 100% clear whether we still need to do anything about these context keys (i.e. is no prefixing/renaming needed?), as they're no longer moved into the top level, per the comment that is there now...

That would imply that either data from the processors doesn't make it into Logstash, or the way processor data and log context are merged has changed and our conflict detection code no longer works, or that we'll end up with some different structure in Logstash (processor data in a sub-array or something like that) which will break the indexing. Any of those would be bad.

To avoid causing this breakage, the code that kept the original format was imported from 1.25.3 (https://github.com/Seldaek/monolog/blob/1.25.3/src/Monolog/Formatter/LogstashFormatter.php#L130-L165). However, this brings on additional Technical-Debt and a maintenance burden: any further upstream changes need to be copied in, and MediaWiki remains a special case.

A longer-term plan should be worked out and implemented to remove this special casing and to decide how we deal with this in Logstash... A patch like https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/579682/ does that removal, but doesn't update the indexing and such on the Logstash side.

Event Timeline

Change 579682 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Cleanup and simplify LogStashFormatter

https://gerrit.wikimedia.org/r/579682

Below is my current understanding of the change, impact, and a proposal for how to roll it out.

Impact
  • Kibana dashboards and dashboard templates.
  • Custom queries saved and permalinked in Puppet for Scap, Icinga alerts, and more.
  • Custom queries submitted to Logstash from Scap for canary checking.
  • Links generated by WikimediaDebug browser extension.
  • Links generated by Phatality for Phabricator task descriptions.
  • Fields read by Phatality within Logstash.
  • Logstash documentation and tutorials.
  • Human habits.
The change
  • The main difference between our override and the new standard is that "extra" keys (known keys added by a Processor) and "context" keys (the free-form array passed with a message to Logger::debug, etc.) will now each be grouped under a single key, instead of being flattened into top-level keys in the Logstash record.
  • The secondary difference is that in the case of key conflicts between the "record", "extra" and "context" keys, we currently prepend c_ and also report the conflict for monitoring queries via logstash_formatter_key_conflict: $keys. This logic will naturally no longer be needed.
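
For orientation, the conflict handling in the current override works roughly like the sketch below. This is an approximation of the behaviour described above, not the actual MediaWiki code; the parameter names are illustrative.

protected function fixKeyConflicts( array $fields, array $context ): array {
	$conflicts = [];
	foreach ( $context as $key => $value ) {
		if ( array_key_exists( $key, $fields ) ) {
			// Key already set by the record or a processor: keep the
			// existing value and park the context value under a
			// prefixed name instead.
			$fields[ 'c_' . $key ] = $value;
			$conflicts[] = $key;
		} else {
			$fields[ $key ] = $value;
		}
	}
	if ( $conflicts ) {
		// Make conflicts discoverable via a monitoring query.
		$fields['logstash_formatter_key_conflict'] = $conflicts;
	}
	return $fields;
}

Under the new grouped format this renaming becomes unnecessary, since context and extra keys can no longer collide with top-level record fields.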
Where we are
{
  "@timestamp": ...,
  "host": ...,
  "message": "Foo Barbara, Baz Quux",

  "exception": { "class": .. "file": ..., "trace": .. },
  "bar": "Barbara",
  "quux": "Quuxified",

  "wiki": ...,
  "url": ...,
  "reqId": ...,
  "normalized_message": "Foo {bar}, Baz {quux}",
}
Where we are going
{
  "@timestamp": ...,
  "host": ...,
  "message": "Foo Barbara, Baz Quux",

  "context": {
    "exception": { "class": .. "file": ..., "trace": .. },
    "bar": "Barbara",
    "quux": "Quuxified"
  },

  "extra": {
    "wiki": ...,
    "url": ...,
    "reqId": ...,
    "normalized_message": "Foo {bar}, Baz {quux}"
  }
}
Proposal

High-level:

  1. Clean the house.
  2. Start doing the new way, but keep the old keys for now. Also remove conflict handling (conflicts are rare, and when they do happen it's generally not through any of the above methods of access but rather something ad hoc; in that case people will find those keys naturally in the new structure).
  3. Wait 90 days (our Logstash retention).
  4. After the 90 days, update dashboards.
  5. Remove the old logic after a week or two.

Concrete:

  1. Clean the house
    • Remove the unused V0 handling from the MW-LogstashFormatter override. Production uses V1 (via CeeFormatter; wmf-config/logging).
    • Update production to no longer pass the then-unused $version parameter.
  2. Start the new, and remove obsolete conflict handling.
    • Update the MW-LogstashFormatter override as follows:
      • For the case where contextKey is not empty, instead of calling our old override, return upstream's format() handling as-is. This means doing the grouping logic instead of the prefixing logic.
      • For the case where contextKey is empty (current production), use our override as today, but update it to match the current upstream format code (grouping assignments), and do an extra iteration, like we have today, over $extra and $context to put each key also in $record if and only if it does not already exist. (A sketch of what this could look like follows this list.)
      • Remove the unused fixKeyConflicts method.
  3. Wait 90 days.
    • During these 90 days, we can start updating impacted consumers that only act on "today" information, such as WikimediaDebug and Phatality.
  4. After the 90 days.
    • Email ops-l and wikitech-l and ask people to update their dashboards. At this point people can cleanly switch from querying old-style fields to new-style fields and still match all the same data. Remember that most Kibana features don't support OR-conditions across different keys, hence we first wait for 90 days of data to accumulate in the new format.
  5. Remove all code in MW-LogstashFormatter except normalizeException.
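
To make step 2 concrete, here is a minimal sketch of what the transitional override could look like. It assumes a class extending Monolog 2's LogstashFormatter, reusing its protected $systemName/$contextKey/$extraKey properties and the inherited normalize()/toJson() helpers; the class name is hypothetical, some upstream header fields are elided, and this is not the actual Gerrit change.

class TransitionalLogstashFormatter extends \Monolog\Formatter\LogstashFormatter {
	public function format( array $record ): string {
		// Normalize dates, exceptions, etc. the same way upstream does.
		$record = $this->normalize( $record );

		// Base fields, roughly as upstream's format() builds them
		// (type/channel/level handling omitted for brevity).
		$message = [
			'@timestamp' => $record['datetime'] ?? gmdate( 'c' ),
			'@version' => 1,
			'host' => $this->systemName,
			'message' => $record['message'] ?? '',
		];

		// New style (upstream): group context and extra under single keys.
		if ( !empty( $record['context'] ) ) {
			$message[ $this->contextKey ] = $record['context'];
		}
		if ( !empty( $record['extra'] ) ) {
			$message[ $this->extraKey ] = $record['extra'];
		}

		// Old style, kept temporarily for the 90-day overlap: also copy
		// each key to the top level, if and only if it is not already set.
		foreach ( [ $record['extra'] ?? [], $record['context'] ?? [] ] as $group ) {
			foreach ( $group as $key => $value ) {
				if ( !array_key_exists( $key, $message ) ) {
					$message[ $key ] = $value;
				}
			}
		}

		return $this->toJson( $message ) . "\n";
	}
}

Once the 90-day overlap ends, the old-style copy loop is dropped and the override can be reduced to (or replaced by) upstream's grouping behaviour, as in the final step.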

Is there any desire/effort to migrate MW logging to the ECS format? If this ticket results in a high-impact logging format change, perhaps it would be worth going all the way and trying to make MW do ECS instead?

I ask because I just filed T291645: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL and would love to see MediaWiki software logs queryable in Hive and elsewhere.

On IRC, @Krinkle wrote:

TLDR: Can we afford to double the size of mediawiki messages within the same document/index?

Assuming that doubling the size is accurate and reflected as actual usage on-disk, my spreadsheet says it will increase disk usage by ~15% across the cluster.

  • Current cluster capacity used: 57%.
  • Projected cluster capacity used: 72%.
  • ~6.5TB per storage tier.

Total cluster usage seems to have increased by about 10% over the last 60 days. Adding a 5% increase for each 30-day interval projects 87% cluster usage by the 90-day mark, when the proposed duplication is expected to end.
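
Spelling out the arithmetic behind those projections (a back-of-envelope sketch; the inputs are the assumptions stated above, not measured data):

$current          = 57; // % of cluster capacity used today
$duplicationCost  = 15; // % added by duplicating the MediaWiki fields
$growthPer30d     = 5;  // % organic growth per 30-day interval

$afterDuplication = $current + $duplicationCost;            // 72%
$atNinetyDays     = $afterDuplication + 3 * $growthPer30d;  // 87% by the 90-day mark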

I would expect Kibana performance to degrade due to larger indexes and bigger responses. This adds an additional 1744 fields for Kibana to track and report on. How much degradation, I'm not sure. This would apply only for the duration of the duplication.

There are a few other options to consider:

  1. Begin the migration to ECS - This is a big project and likely beyond the scope of this task, but this task presents an opportunity. :)
  2. Duplicate the fields. - Potential impacts beyond those previously described include reduced cluster headroom and Kibana performance degradation. These issues are time-limited.
  3. Leverage field aliases to query the fields in their new location from the old locations. - A bit tricky to implement, but should have no adverse effects on dashboards or human habits. We would need to declare the field aliases in the schema (index template), which would make them available at index generation time. We'll want to test whether Kibana handles this situation gracefully; fields becoming unqueryable would make this option a no-go.
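
For illustration, the field-alias approach would amount to adding mappings along these lines to the index template (the field names here are hypothetical; Elasticsearch's alias field type requires the target to exist as a concrete field):

"mappings": {
  "properties": {
    "context": {
      "properties": {
        "bar": { "type": "keyword" }
      }
    },
    "bar": {
      "type": "alias",
      "path": "context.bar"
    }
  }
}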

I would guess that 1 is the heaviest project, but ultimately where we hope to be. 2 is probably the easiest, but riskier. 3 would be a little harder to implement, but carries less risk of cluster overload and performance degradation (among other benefits).