
Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array
Closed, Resolved · Public

Description

Log messages in the 'VisualEditor' channel are not recorded in Kibana if any extra data ($context array) is passed.

I discovered this after SWAT-ting these two patches: https://gerrit.wikimedia.org/r/q/Ide5138d8f8f462b9c3d7da10f26ff57c9d17f1c9 https://gerrit.wikimedia.org/r/q/I0f321df057068e1dd396a04a9e194853c2b0e9e4 (debug logging for T233320 and T233127)

And seeing nothing in this Kibana search: https://logstash.wikimedia.org/goto/4d053e9de35d9fccb40fabcabd46ff00

@Jdforrester-WMF and @Krinkle helped me experiment, deploying several versions of this patch to mwdebug1002 and testing what gets logged: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464

In the end, I found out that things are only logged if the $context parameter is not given or empty: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540464/3/includes/ApiVisualEditorEdit.php#281

I then changed our actual debug logging to put all the extra data into the message string: https://gerrit.wikimedia.org/r/q/Iae778f95774df2e24b30387221e39e097e25a4cf, which immediately resulted in the messages being logged properly (they are now visible in the same search linked above).
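
For illustration, the difference comes down to whether a PSR-3 context array is passed. A minimal sketch using MediaWiki's LoggerFactory (the variable names are illustrative, not the exact patch):

use MediaWiki\Logger\LoggerFactory;

$logger = LoggerFactory::getInstance( 'VisualEditor' );
$etag = $params['etag']; // whatever extra data we want to record

// Recorded in Kibana: everything embedded in the message string.
$logger->info( "ApiVisualEditorEdit::postData: Received funny ETag from client: $etag" );

// Silently missing from Kibana: the same data passed as structured context.
$logger->info(
	'ApiVisualEditorEdit::postData: Received funny ETag from client: {etag}',
	[ 'etag' => $etag ]
);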

However, this is obviously quite inconvenient to work with. Can anyone help investigate why this happens?

Details

Related Gerrit Patches:
mediawiki/extensions/VisualEditor (master): Try using structured logging again
mediawiki/extensions/VisualEditor (wmf/1.35.0-wmf.2): Revert "Revert "Try using structured logging again""
mediawiki/extensions/VisualEditor (wmf/1.35.0-wmf.2): Try using structured logging again
mediawiki/extensions/VisualEditor (wmf/1.35.0-wmf.1): Try using structured logging again
operations/puppet (production): logstash: set template_overwrite true in elasticsearch outputs
operations/puppet (production): logstash: raise elasticsearch mapping limit
operations/puppet (production): logstash: add an index for deployment related logs

Event Timeline

matmarex created this task. Oct 3 2019, 7:39 PM
Restricted Application added a subscriber: Aklapper. Oct 3 2019, 7:39 PM
matmarex updated the task description.
Anomie added a subscriber: Anomie.

I see the logs with context showing up on mwlog1001:

2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing test2  
2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing test2 {"etag":"test2"} 
2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing {"etag":"test2"} 
2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing {"etag":"test2"} 
2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing {etag}  
2019-10-02 21:03:36 [XZUQqApAAC4AAAhfMooAAAAD] mwdebug1002 enwiki 1.34.0-wmf.24 VisualEditor INFO: ApiVisualEditorEdit::postData: Testing

So they seem to be making it out of MediaWiki fine. I'm going to untag Core Platform Team and MediaWiki-Debug-Logger accordingly, and add Wikimedia-Logstash.

In some quick testing, it seems that the ES/Kibana pipeline may be dropping any log event with an "unknown" field. I sent some log messages from mwmaint1002 with shell.php, and events with "ip" and "feature" worked, but events with "etag", "XXXtag", "X", "xxx", and "lksnfbewgoibrwogibeoibapifbewgew" did not show up in Kibana.
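
Roughly what the shell.php test looked like (channel, message text, and values here are illustrative, not the exact commands):

$logger = \MediaWiki\Logger\LoggerFactory::getInstance( 'VisualEditor' );

// Context key the index already has a mapping for: the event shows up in Kibana.
$logger->info( 'matmarex-test {ip}', [ 'ip' => '127.0.0.1' ] );

// Context key the index has never seen: the whole event silently disappears.
$logger->info( 'matmarex-test {etag}', [ 'etag' => 'test' ] );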

herron added a subscriber: herron. Oct 9 2019, 4:31 PM
herron added a comment. Oct 9 2019, 5:23 PM

Looking back through the logs, I see:

logstash-plain-2019-10-03.log:[2019-10-02T22:11:49,346][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2019.10.02", :_type=>"mediawiki", :_routing=>nil}, 2019-10-02T22:11:48.837Z mw1282 ApiVisualEditorEdit::postData: Received funny ETag from client: ], :response=>{"index"=>{"_index"=>"logstash-2019.10.02", "_type"=>"mediawiki", "_id"=>"AW2Oh4TFP44edBvOl78Q", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Limit of total fields [1000] in index [logstash-2019.10.02] has been exceeded"}}}}

Just yesterday I merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/540486/ in hopes of improving the "Limit of total fields [1000] in index [logstash-2019.10.02] has been exceeded" situation, but sadly some events are still hitting the field limit today on the logstash-mediawiki-YYYY.MM.DD index.
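
For reference, the limit being hit is the per-index Elasticsearch setting index.mapping.total_fields.limit; raising it in an index template looks roughly like this (a sketch, not the actual puppet template):

{
  "settings": {
    "index.mapping.total_fields.limit": 2048
  }
}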

I think we'll need to reduce the field count on these events, and reuse existing fields where possible. Is there a raw example available of a problem message with the would-be fields and values?

herron added a comment.Oct 9 2019, 7:10 PM

Actually, pulling an example raw message from Kafka answers my own question about an example problem message:

{
  "timestamp": "2019-10-02T17:24:13+00:00",
  "logsource": "mw1341",
  "host": "mw1341",
  "program": "mediawiki",
  "severity": "info",
  "facility": "user",
  "@timestamp": "2019-10-02T17:24:13.117249+00:00",
  "@version": 1,
  "message": "ApiVisualEditorEdit::postData: Received funny ETag from client: \"56160056/8a415220-cf2f-11e9-zzzz-f7f94064aead-df\"",
  "type": "mediawiki",
  "channel": "VisualEditor",
  "level": "INFO",
  "shard": "s2",
  "normalized_message": "ApiVisualEditorEdit::postData: Received funny ETag from client: {etag}",
  "phpversion": "7.2.22-1+0~20190902.26+debian9~1.gbpd64eb7+wmf1",
  "url": "/w/api.php",
  "ip": "zzz",
  "http_method": "POST",
  "server": "pt.wikipedia.org",
  "referrer": "https://pt.m.wikipedia.org/wiki/Partido_Justicialista",
  "unique_id": "XZTdPQpAIDUAAFk774AAAABW",
  "wiki": "ptwiki",
  "mwversion": "1.34.0-wmf.24",
  "reqId": "zzz",
  "etag": "\"56160056/8a415220-cf2f-11e9-zzzz-f7f94064aead-df\"",
  "requestPath": "transform/html/to/wikitext/Partido_Justicialista/56160056"
}
Krinkle added a comment (edited). Oct 10 2019, 11:42 PM

MediaWiki deployments through Scap depend on being able to find errors reliably in Logstash (within a few seconds). This is used automatically to decide whether to proceed with or abort a staged deployment. Discarding messages in this way could easily cause a preventable 10-15 minute outage of Wikipedia.

Since when is Logstash discarding messages for containing a not-before-seen key in the message? And through what means would an unknown key become known? (I'm not aware of any registration mechanism existing.)

I agree we should have fewer keys overall; however, these aren't written by a single code component. Anywhere in MediaWiki core or an extension, code may write a string to Logstash with a hashtable as the second parameter, containing freeform key-value pairs that developers expect to see when inspecting a row in Logstash. There is no central coordination or registration of which keys are used or common, so re-use is rare and does not seem feasible to introduce.

Most of these keys don't need to be indexed as far as I know. The only things we really need indexed for search/filtering are type, channel, and level, and maybe a few term-based ones like message, url, or exception.trace. If it is possible for keys to exist without being indexed, that might be a way forward (we'd whitelist which keys to index).

At the very least, exclusion from index optimisations would be a more graceful fallback than discarding the message outright.
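
For what it's worth, Elasticsearch mappings can express exactly this kind of whitelist. A minimal sketch of a template fragment (the listed fields are assumptions):

{
  "mappings": {
    "dynamic": false,
    "properties": {
      "type":    { "type": "keyword" },
      "channel": { "type": "keyword" },
      "level":   { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}

With "dynamic": false, unmapped context keys are still stored in _source (so they still display when inspecting a document) but create no new mappings, and therefore cannot hit the field limit.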

My past experience from T189333 tells me that we should have fewer keys, but the fact that the set grew so effortlessly also tells me that there is definitely no hard restriction on using new keys, which is incompatible with the findings of this task. I'd like to better understand where this comes from, and how we can move forward.

In the short term, to be able to deploy every day with confidence, it is important that we unbreak this contract by not discarding type:mediawiki messages, except when rate-limited (by the recently introduced plugin) for being a high-volume exact duplicate.

Krinkle renamed this task from "Log messages in the 'VisualEditor' channel are not recorded in Kibana if any extra data ($context array) is passed" to "Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array". Oct 10 2019, 11:43 PM
Krinkle triaged this task as High priority.
Krinkle moved this task from Untriaged to Meta on the Wikimedia-production-error board.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a comment (edited). Oct 10 2019, 11:56 PM
… Could not index event to Elasticsearch. {:status=>400, … routing=>nil}, … "reason"=>"Limit of total fields [1000] in index [logstash-2019.10.02] has been exceeded"}}}}

Just yesterday I merged https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/540486/ in hopes of improving the "Limit of total fields [1000] in index [logstash-2019.10.02] has been exceeded" situation, but sadly some events are still hitting the field limit today on the logstash-mediawiki-YYYY.MM.DD index.

Would it be possible to give type:mediawiki channel:(error OR exception OR fatal) a separate index as well? These are the only critical ones involved in deployment and should not suffer due to spam from random info/debug channels.

We might want to include type:syslog program:php72-fpm and type:scap in there as well.

These follow a strict structure (no arbitrary key pairs), and I assume they use fewer than 1000 unique keys as well. But given how many scenarios I've seen in the past months that resulted in dropped messages for one reason or another, it'd be great if we could make this index less sensitive to such things, and/or have the act of dropping messages result in a loud and critical alert that can't be missed by SRE or deployers.
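
A minimal sketch of how such a split could look in the logstash output configuration (index name and conditionals are assumptions, not the actual patch):

output {
  # Route deployment-critical events to a dedicated index, away from
  # the noisy channels that keep exhausting the shared field mapping.
  if [type] == "mediawiki" and [channel] in ["error", "exception", "fatal"] {
    elasticsearch {
      index => "logstash-deploy-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      index => "logstash-%{+YYYY.MM.dd}"
    }
  }
}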

Change 542557 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add an index for deployment related logs

https://gerrit.wikimedia.org/r/542557

In T234564#5565056, @Krinkle wrote:
Would it be possible to give type:mediawiki channel:(error OR exception OR fatal) a separate index as well? These are the only critical ones involved in deployment and should not suffer due to spam from random info/debug channels.
We might want to include type:syslog program:php72-fpm and type:scap in there as well.

That makes sense to me. Uploaded https://gerrit.wikimedia.org/r/542557 as a starting point.

These follow a strict structure (no arbitrary key pairs), and I assume they use fewer than 1000 unique keys as well. But given how many scenarios I've seen in the past months that resulted in dropped messages for one reason or another, it'd be great if we could make this index less sensitive to such things, and/or have the act of dropping messages result in a loud and critical alert that can't be missed by SRE or deployers.

+1 to both; I'll think a bit about how to approach implementation of the latter.

Change 543192 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] logstash: raise elasticsearch mapping limit

https://gerrit.wikimedia.org/r/543192

Change 542557 merged by Herron:
[operations/puppet@production] logstash: add an index for deployment related logs

https://gerrit.wikimedia.org/r/542557

Change 543192 merged by Herron:
[operations/puppet@production] logstash: raise elasticsearch mapping limit

https://gerrit.wikimedia.org/r/543192

Do these patches mean that we can try using the structured format in VE logging again? (Reverting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/540487)

hashar added a subscriber: hashar. Fri, Oct 18, 2:13 PM

Tentatively, the new index created for today should have a higher limit (2048 instead of 1000). So potentially, yes, new fields should be accepted.

Change 544209 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set template_overwrite true in elasticsearch outputs

https://gerrit.wikimedia.org/r/544209

Change 544209 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: set template_overwrite true in elasticsearch outputs

https://gerrit.wikimedia.org/r/544209

Change 544941 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@master] Try using structured logging again

https://gerrit.wikimedia.org/r/544941

Change 544957 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.1] Try using structured logging again

https://gerrit.wikimedia.org/r/544957

Change 544958 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.2] Try using structured logging again

https://gerrit.wikimedia.org/r/544958

Change 544957 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.1] Try using structured logging again

https://gerrit.wikimedia.org/r/544957

Mentioned in SAL (#wikimedia-operations) [2019-10-21T17:08:06Z] <jforrester@deploy1001> Synchronized php-1.35.0-wmf.1/extensions/VisualEditor/: Update VisualEditor for set of back-ports in wmf.1 T233320, T234564, T235959 (duration: 00m 56s)

Change 544958 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.2] Try using structured logging again

https://gerrit.wikimedia.org/r/544958

This was reverted by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/545001.

Change 544941 abandoned by Bartosz Dziewoński:
Try using structured logging again

https://gerrit.wikimedia.org/r/544941

I tested that on mwdebug1001, triggered the case that should log stuff, and did not see anything being logged in Kibana. I think it still doesn't work.

FTR: as of the indices starting on Oct 22nd, the limit is now 2048 fields instead of the previous 1000.

I tested that on mwdebug1001, triggered the case that should log stuff, and did not see anything being logged in Kibana. I think it still doesn't work.

It seems that the raised index limit also required another change for Elasticsearch to notice it when creating a new index. That was:

[operations/puppet@production] logstash: set template_overwrite true in elasticsearch outputs
https://gerrit.wikimedia.org/r/544209

That got merged on Oct 21st, so when you enabled the context logging in VisualEditor on the same day, the index for that day did not yet have the new limit.

A day has passed, and the index for Oct 22nd has the new limit. So you can try again: https://gerrit.wikimedia.org/r/544941
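
For reference, the option lives on the logstash elasticsearch output: by default logstash will not replace an index template that already exists, so a changed template (such as the raised field limit) never reaches Elasticsearch without it. A sketch, with the template path assumed:

elasticsearch {
  # Push our template even if one with the same name is already installed.
  template           => "/etc/logstash/elasticsearch-template.json"
  template_overwrite => true
}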

Oh, thanks! Just in time for the SWAT, let's try…

Change 545234 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.2] Revert "Revert "Try using structured logging again""

https://gerrit.wikimedia.org/r/545234

I tested that on mwdebug1001, triggered the case that should log stuff, and did not see anything being logged in Kibana. I think it still doesn't work.

Did the logs with context show up on mwlog1001 when you tried yesterday? Also note that if the 1000-fields-per-index limit is indeed the root cause, then indices starting today (the 22nd) have a higher limit (2048).

Did the logs with context show up on mwlog1001 when you tried yesterday?

No idea, I don't have production access. You can grep the logs for 'matmarex-test' if you want to check. (Note that I also used this text for testing changes to non-structured logging later that day.)
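
(On the mwlog host that would be something like the following; the exact log path is an assumption:)

grep matmarex-test /srv/mw-log/VisualEditor.log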

Change 545234 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.35.0-wmf.2] Revert "Revert "Try using structured logging again""

https://gerrit.wikimedia.org/r/545234

It works now! Example: https://logstash.wikimedia.org/goto/a8cc9833aa3f3a21a672322a89740154

The custom fields have mildly concerning warning symbols next to them, though.

From my perspective (VisualEditor's), this is resolved, but since it seems that you expanded the scope of this task, I'll leave it for you to close :)

Change 544941 restored by Bartosz Dziewoński:
Try using structured logging again

https://gerrit.wikimedia.org/r/544941

Change 544941 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Try using structured logging again

https://gerrit.wikimedia.org/r/544941

Mentioned in SAL (#wikimedia-operations) [2019-10-23T07:28:24Z] <hashar> logstash: refreshing index fields for logstash-* indices (via https://logstash.wikimedia.org/app/kibana#/management/kibana/indices/logstash-* ) # T234564

Great!!!

The custom fields have mildly concerning warning symbols next to them, though.

I have refreshed the index mapping cache (via https://logstash.wikimedia.org/app/kibana#/management/kibana/indices/logstash-*). The warning is gone now.

From my perspective (VisualEditor's), this is resolved, but since it seems that you expanded the scope of this task, I'll leave it for you to close :)

Indeed, what we need now is a way to monitor the number of fields in the indices, and a loud alarm whenever we are about to reach the limit.

Looking back through the logs, I see:

logstash-plain-2019-10-03.log:[2019-10-02T22:11:49,346][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2019.10.02", :_type=>"mediawiki", :_routing=>nil}, 2019-10-02T22:11:48.837Z mw1282 ApiVisualEditorEdit::postData: Received funny ETag from client: ], :response=>{"index"=>{"_index"=>"logstash-2019.10.02", "_type"=>"mediawiki", "_id"=>"AW2Oh4TFP44edBvOl78Q", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Limit of total fields [1000] in index [logstash-2019.10.02] has been exceeded"}}}}

So at least we have the error logged somewhere; we should thus be able to monitor it somehow. Maybe as a follow-up subtask.

I also note that when refreshing the indices at https://logstash.wikimedia.org/app/kibana#/management/kibana/indices/logstash-*, the number of fields went from ~9900 to 11091. That does not even match the previous default of index.mapping.total_fields.limit = 1000, so maybe that is not related :D
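
A rough sketch of such a check against a single day's index (host and index name assumed; this counts the keys of every "properties" object in the mapping, which approximates, but does not exactly match, what Elasticsearch counts against the limit):

curl -s 'http://localhost:9200/logstash-2019.10.22/_mapping' |
  jq '[.. | objects | select(has("properties")) | .properties | keys[]] | length'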

Since the specific issue is resolved, I'm for resolving this task. I've created a follow-up about alerting when the field limit is hit: T236343: Alert on logstash index failures on too many fields.

fgiunchedi closed this task as Resolved. Mon, Oct 28, 1:11 PM
fgiunchedi claimed this task.

Resolving, as logs are collected as expected now; reopen if you disagree!

Restricted Application added a project: User-Ryasmeen. Mon, Oct 28, 1:11 PM