
Indexing errors / malformed logs for aqs on cassandra timeout
Closed, Declined (Public)

Description

Noticed this due to high indexing failure counts reported by Elasticsearch today; it looks like the log message is malformed and can't be parsed as valid JSON. Not sure if this is AQS-specific or related to some library.

[2020-09-15T12:52:43,315][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2020.09.15", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x2e39b975>], :response=>{"index"=>{"_index"=>"logstash-2020.09.15", "_type"=>"_doc", "_id"=>"5SbTkXQBGIXcFGALEnhm", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse field [res.body] of type [text] in document with id '5SbTkXQBGIXcFGALEnhm'. Preview of field's value: '{stack=Error: Operation timed out - received only 1 responses.
    at ResponseError.DriverError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/errors.js:14:19)
    at new ResponseError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/errors.js:51:24)
    at FrameReader.readError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/readers.js:317:13)
    at Parser.parseBody (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/streams.js:194:66)
    at Parser._transform (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/streams.js:137:10)
    at Parser.Transform._read (_stream_transform.js:190:10)
    at Parser.Transform._write (_stream_transform.js:178:12)
    at doWrite (_stream_writable.js:410:12)
    at writeOrBuffer (_stream_writable.js:394:5)
    at Parser.Writable.write (_stream_writable.js:294:11)
    at Protocol.ondata (_stream_readable.js:667:20)
    at Protocol.emit (events.js:189:13)
    at addChunk (_stream_readable.js:284:12)
    at readableAddChunk (_stream_readable.js:265:11)
    at Protocol.Readable.push (_stream_readable.js:220:10)
    at Protocol.Transform.push (_stream_transform.js:151:32), err={consistencies=6, stack=Error: Operation timed out - received only 1 responses.
    at ResponseError.DriverError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/errors.js:14:19)
    at new ResponseError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/errors.js:51:24)
    at FrameReader.readError (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/readers.js:317:13)
    at Parser.parseBody (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/streams.js:194:66)
    at Parser._transform (/srv/deployment/analytics/aqs/deploy-cache/revs/95d6432bab3f85ff30a70cc776ee5dd06f478340/node_modules/cassandra-driver/lib/streams.js:137:10)
    at Parser.Transform._read (_stream_transform.js:190:10)
    at Parser.Transform._write (_stream_transform.js:178:12)
    at doWrite (_stream_writable.js:410:12)
    at writeOrBuffer (_stream_writable.js:394:5)
    at Parser.Writable.write (_stream_writable.js:294:11)
    at Protocol.ondata (_stream_readable.js:667:20)
    at Protocol.emit (events.js:189:13)
    at addChunk (_stream_readable.js:284:12)
    at readableAddChunk (_stream_readable.js:265:11)
    at Protocol.Readable.push (_stream_readable.js:220:10)
    at Protocol.Transform.push (_stream_transform.js:151:32), coordinator=10.64.16.78:9042, code=4608, isDataPresent=1, query=select \"project\",\"article\",\"granularity\",\"timestamp\",\"aa\",\"ab\",\"as\",\"au\",\"da\",\"db\",\"ds\",\"du\",\"maa\",\"mab\",\"mas\",\"mau\",\"mwa\",\"mwb\",\"mws\",\"mwu\",\"_domain\",\"_del\",\"_tid\" from \"local_group_default_T_pageviews_per_article_flat\".\"data\" where \"project\" = ? AND \"article\" = ? AND \"granularity\" = ? AND \"timestamp\" >= ? AND \"timestamp\" <= ? AND \"_domain\" = ?, name=ResponseError, received=1, message=Operation timed out - received only 1 responses., blockFor=2, info=Represents an error message from the server}, title=Error in Cassandra table storage backend, type=query_error, req={headers={}, body={\"table\":\"pageviews.per.article.flat\",\"attributes\":{\"project\":\"es.wikipedia\",\"article\":\"Primera_Guerra_Mundial\",\"granularity\":\"daily\",\"timestamp\":{\"between\":[\"2020090700\",\"2020091100\"]},\"_domain\":\"ana, uri=/analytics.wikimedia.org/sys/table/pageviews.per.article.flat/}}'", "caused_by"=>{"type"=>"illegal_state_exception", "reason"=>"Can't get text on a START_OBJECT at 1:792"}}}}}

Event Timeline

mforns triaged this task as Medium priority. Sep 17 2020, 4:00 PM
mforns moved this task from Operational Excellence to Ops Week on the Analytics board.

After talking with @elukey, we're not sure there is anything we want to do here.
The error happened when Luca roll-restarted the AQS hosts.
@Pchelolo: do you have any idea whether upgrading hyperswitch would change anything here?

@JAllemandou this is a known issue with the current Logstash configuration and one of the primary drivers behind adopting a Common Logging Schema (T234565).

In a nutshell, within a given daily index a field gets its type from the first log message that contains it, whether that value is an object, a string, or a number. Any subsequent message containing that field must use the same type, or Elasticsearch throws this mapping exception. In this case, res.body was a string when it was first encountered, but in the event above res.body was an object (the serialized Cassandra error).
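To illustrate the conflict, here is a minimal sketch (not the production Logstash pipeline) that reproduces the same mapper_parsing_exception against a scratch Elasticsearch instance; the localhost URL and the mapping-conflict-demo index name are assumptions for illustration only.

import requests

ES = "http://localhost:9200"      # assumed local test instance
INDEX = "mapping-conflict-demo"   # hypothetical index name

# First document: res.body is a string, so dynamic mapping types the field as
# text, just as logstash-2020.09.15 did when it first saw res.body.
r1 = requests.post(f"{ES}/{INDEX}/_doc", json={"res": {"body": "HTTP 200 OK"}})
print(r1.status_code)                # 201 - indexed; res.body is now mapped as text

# Second document: res.body is an object (like the serialized Cassandra error
# above), so Elasticsearch rejects it.
r2 = requests.post(f"{ES}/{INDEX}/_doc", json={
    "res": {"body": {"stack": "Error: Operation timed out ...", "type": "query_error"}}
})
print(r2.status_code)                # 400
print(r2.json()["error"]["type"])    # mapper_parsing_exception

The second request fails in the same way as the event above: the field is already mapped as text, and the parser hits a START_OBJECT where it expects a string.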

Can we decline or resolve this ticket? It's been open for 18 months, it's a known issue, and the ultimate resolution is adopting the ECS common logging standard everywhere possible.
I'm not sure there is anything more we can do, but please let me know if I'm mistaken.

Agreed @BTullis! I don't think we've seen a recurrence (though I might be mistaken), so I'm boldly closing. Please reopen if you see/know otherwise!