Page MenuHomePhabricator

Make cxserver Logstash logs readable and reliable
Open, HighPublic8 Estimated Story Points

Description

After deploying cxserver as a part of T357950 logstash entries seem probably scattered in multiple entries for a single log message. For example: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.10.23?id=3FzXuZIB8AvXVHu8uWhv is not useful.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

It looks like the new lines in ECS Json data is treated as seperate log. And the log.level is interpreted as NOTICE when the value is "error". I cann't tell whether this is related to containerd migration. If new lines in messages are not allowed in ECS log messages, we can fix it in cxserver side(stacktraces if present in log message, will have new line). But it would be nice if there is a way to validate without trying to deploy a change and see what happens.

An example sequence of logs from logstash- Notice they are mixed up in timeline. and each line is considered as log.

levellog
NOTICEHTTPError: 503: api_error
NOTICEthrow new HTTPError( {
NOTICE/srv/service/lib/mw/MwApiRequest.js:111
NOTICEstatus: 503,
NOTICEat process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
NOTICEat TitleInfoRequest.mwGet (/srv/service/lib/mw/MwApiRequest.js:111:10)
NOTICENode.js v20.5.1
NOTICE}
NOTICEtitle: 'MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&prop=pageprops%7Cpageimages%7Cdescription&piprop=thumbnail&pithumbsize=80&pilimit=2&ppprop=wikibase_item&titles=Fender+%28vehicle%29%7CPenny+%28British+pre-decimal+coin%29&redirects=true&continue='
NOTICEtype: 'api_error',

I don't think it's related given the example log in the description is from a node running docker.

I was able to grep a raw log from the disk of mw1485 (running docker as well) which looks like this:

{"log":"{\"@timestamp\":\"2024-11-11T10:04:22.279Z\",\"ecs.version\":\"8.10.0\",\"log.level\":\"error\",\"message\":\"Error suggesting sections for title \\\"Manchester United F.C.\\\" in language \\\"en\\\" to \\\"el\\\": HTTPError: 503: api_error\\n    at MWApiRequest.mwGet (/srv/service/lib/mw/MwApiRequest.js:111:10)\\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\\n    at async SectionSuggester.getSections (/srv/service/lib/suggestion/SectionSuggester.js:98:20)\\n    at async Promise.all (index 0)\\n    at async SectionSuggester.getMissingSections (/srv/service/lib/suggestion/SectionSuggester.js:154:54)\\n    at async RoutesV2.suggestSections (/srv/service/lib/routes/v2.js:344:21)\",\"service\":\"cxserver\"}\n","stream":"stdout","time":"2024-11-11T10:04:22.280184455Z"}

Without having checked for correct escaping it seems like the multi-line output is at least captured as such and then maybe gets split up wrongly in the logging pipeline? I think observability might be able to figure that out more efficiently.

mw1485
{"log":"Using config file: /etc/cxserver/config.yaml\n","stream":"stdout","time":"2024-11-13T14:57:28.479752632Z"}
{"log":"/srv/service/lib/mw/MwApiRequest.js:111\n","stream":"stderr","time":"2024-11-13T15:14:49.841987918Z"}
{"log":"\u0009\u0009\u0009throw new HTTPError( {\n","stream":"stderr","time":"2024-11-13T15:14:49.842048435Z"}
{"log":"\u0009\u0009\u0009      ^\n","stream":"stderr","time":"2024-11-13T15:14:49.842053868Z"}
{"log":"\n","stream":"stderr","time":"2024-11-13T15:14:49.842056813Z"}
{"log":"HTTPError: 503: api_error\n","stream":"stderr","time":"2024-11-13T15:14:49.842059353Z"}
{"log":"    at TitlePairRequest.mwGet (/srv/service/lib/mw/MwApiRequest.js:111:10)\n","stream":"stderr","time":"2024-11-13T15:14:49.842062054Z"}
{"log":"    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {\n","stream":"stderr","time":"2024-11-13T15:14:49.84206637Z"}
{"log":"  status: 503,\n","stream":"stderr","time":"2024-11-13T15:14:49.842069493Z"}
{"log":"  type: 'api_error',\n","stream":"stderr","time":"2024-11-13T15:14:49.842072127Z"}
{"log":"  title: MW API error from URL: http://localhost:6500/w/api.php?<snip>\n","stream":"stderr","time":"2024-11-13T15:14:49.842074953Z"}
{"log":"}\n","stream":"stderr","time":"2024-11-13T15:14:49.842084263Z"}
{"log":"\n","stream":"stderr","time":"2024-11-13T15:14:49.842087139Z"}
{"log":"Node.js v20.5.1\n","stream":"stderr","time":"2024-11-13T15:14:49.842089989Z"}

Some part of cxserver is sending each log line as a separate event.

There are other cxserver logs in Logstash that are ECS-formatted - possibly indicates an incomplete migration?

Maybe the issue is uncaught exceptions?

Change #1092771 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Improve error handling and logging

https://gerrit.wikimedia.org/r/1092771

Change #1092771 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Improve error handling and logging

https://gerrit.wikimedia.org/r/1092771

Change #1098488 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-11-20-121713-production

https://gerrit.wikimedia.org/r/1098488

Change #1098488 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-11-20-121713-production

https://gerrit.wikimedia.org/r/1098488

Something is still wrong.

I don't see any expected log entries from logger. It's only two async fatal errors that appear, and having the same multiline issue.

Nikerabbit changed the point value for this task from 4 to 8.Nov 28 2024, 7:42 AM

Change #1101744 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] logging: 404s for page fetches are not application errors.

https://gerrit.wikimedia.org/r/1101744

@KartikMistry

  • Please take a tail of k8s log and compare it with what we see in logstash to analyse if any structured logs coming out from application is parsed or missed in logstash. if so provide examples here.
  • Please look at application log and see if there is any message that is not in json format(ECS logs) and report here.

Change #1101744 merged by jenkins-bot:

[mediawiki/services/cxserver@master] logging: 404s for page fetches are not application errors

https://gerrit.wikimedia.org/r/1101744

Change #1108544 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-01-07-045930-production

https://gerrit.wikimedia.org/r/1108544

Change #1108544 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-01-07-045930-production

https://gerrit.wikimedia.org/r/1108544

Nikerabbit renamed this task from cxserver: Logstash entries seems difficult to read to Make cxserver Logstash logs readable and reliable.Jan 9 2025, 10:03 AM

Change #1109751 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] Error handling: Skip passing the error to Express

https://gerrit.wikimedia.org/r/1109751

This patch related to this task was tagged with an incorrect bug ID: 1109751: Error handling: Skip passing the error to Express | https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1109751

Change #1111521 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] Error handling: Avoid calling next if response is sent

https://gerrit.wikimedia.org/r/1111521

Change #1111566 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] Remove stray console.log when fetching MT provider

https://gerrit.wikimedia.org/r/1111566

Change #1111566 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Remove stray console.log when fetching MT provider

https://gerrit.wikimedia.org/r/1111566

Change #1111601 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-01-15-103159-production

https://gerrit.wikimedia.org/r/1111601

Change #1111601 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-01-15-103159-production

https://gerrit.wikimedia.org/r/1111601

Mentioned in SAL (#wikimedia-operations) [2025-01-15T12:02:09Z] <kart_> Updated cxserver to 2025-01-15-103159-production (T377966)

This patch related to this task was tagged with an incorrect bug ID: 1109751: Error handling: Skip passing the error to Express | https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1109751

This patch updated the error middleware to avoid passing the errors to Express if the response was already sent to the client. Express was logging the errors again to stderr in a multi-line format which was then picked up by Logstash.

After the deployment, I'm noticing a few more instances of these errors:

    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7)
    at Protocol.handleNetworkError (/srv/service/node_modules/mysql/lib/protocol/Protocol.js:369:14)
    at Handshake.<anonymous> (/srv/service/node_modules/mysql/lib/Connection.js:526:10)
    at Object.onceWrapper (node:events:628:28)
    at Handshake.<anonymous> (file:///srv/service/lib/suggestion/SectionSuggester.js:39:12)
    at Connection._handleConnectTimeout (/srv/service/node_modules/mysql/lib/Connection.js:414:8)
    at Socket.emit (node:events:514:28)
    at SectionSuggester.getMissingSections (file:///srv/service/lib/suggestion/SectionSuggester.js:157:30)
    at Sequence.end (/srv/service/node_modules/mysql/lib/protocol/sequences/Sequence.js:83:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async suggestSections (file:///srv/service/lib/routes/v2.js:344:20)
    at Connection._handleNetworkError (/srv/service/node_modules/mysql/lib/Connection.js:418:18)
    at Handshake._callback (/srv/service/node_modules/mysql/lib/Connection.js:488:16)
    at SectionSuggester.getSectionMapping (file:///srv/service/lib/suggestion/SectionSuggester.js:117:30)
    at SectionSuggester.querySectionMapping (file:///srv/service/lib/suggestion/SectionSuggester.js:37:15)
    at Connection.connect (/srv/service/node_modules/mysql/lib/Connection.js:116:18)
    at Protocol._enqueue (/srv/service/node_modules/mysql/lib/protocol/Protocol.js:144:48)
    at process.processTimers (node:internal/timers:514:7)
    --------------------
Error: Could not connect to section mapping databaseError: connect ETIMEDOUT
    at Object.onceWrapper (node:events:628:28)
    at Socket.emit (node:events:514:28)
    at Connection._handleConnectTimeout (/srv/service/node_modules/mysql/lib/Connection.js:409:13)
    at listOnTimeout (node:internal/timers:573:17)
    at Socket._onTimeout (node:net:589:8)
    at Protocol.handshake (/srv/service/node_modules/mysql/lib/protocol/Protocol.js:51:23)
file:///srv/service/lib/suggestion/SectionSuggester.js:39
					throw new Error( 'Could not connect to section mapping database' + err.stack );
					^

My assumption is that this error is not being handled.

Change #1111675 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] SectionSuggester: Open db connection in the same execution context

https://gerrit.wikimedia.org/r/1111675

Change #1111675 merged by jenkins-bot:

[mediawiki/services/cxserver@master] SectionSuggester: Open db connection in the same execution context

https://gerrit.wikimedia.org/r/1111675

Change #1111948 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-01-16-103443-production

https://gerrit.wikimedia.org/r/1111948

Change #1111948 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-01-16-103443-production

https://gerrit.wikimedia.org/r/1111948

Mentioned in SAL (#wikimedia-operations) [2025-01-16T11:49:28Z] <kart_> Updated cxserver to 2025-01-16-103443-production (T383854, T377966)

We're now monitoring to see if there are more multiline logs that appear in Logstash. These appear to be happening due to improperly handled exceptions, causing the exceptions to be caught by Express and then logged to stderr.

No further QA needs.

Change #1112779 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] Use ecs format for logging

https://gerrit.wikimedia.org/r/1112779

Change #1112779 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use ecs format for logging

https://gerrit.wikimedia.org/r/1112779

Change #1112822 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] Logging: Log request path and http method along with every request

https://gerrit.wikimedia.org/r/1112822

Change #1112823 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] logging: Add a request id field to every log

https://gerrit.wikimedia.org/r/1112823

Change #1112824 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/services/cxserver@master] logging: Log error directly in error handling middleware

https://gerrit.wikimedia.org/r/1112824

Change #1112871 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-01-20-172318-production

https://gerrit.wikimedia.org/r/1112871

Change #1112871 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-01-20-172318-production

https://gerrit.wikimedia.org/r/1112871

Mentioned in SAL (#wikimedia-operations) [2025-01-21T12:15:53Z] <kart_> Updated cxserver to 2025-01-20-172318-production (T377966, T377813)

The new dashboard that shows all the logs from cxserver is here: https://logstash.wikimedia.org/app/dashboards#/view/AXDAG2TFarkxubcmr_Pw?_g=h@c823129&_a=h@9299df7

I've submitted a few more patches to improve the logs further:

  1. 1112822: logging: Log request path and http method along with every request | https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1112822
  2. 1112823: logging: Add a unique request id field to every log | https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1112823
  3. 1112824: logging: Log error directly in error handling middleware | https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1112824

Change #1111521 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Error handling: Avoid calling next if response is sent

https://gerrit.wikimedia.org/r/1111521

Change #1112822 merged by jenkins-bot:

[mediawiki/services/cxserver@master] logging: Log request path and http method along with every request

https://gerrit.wikimedia.org/r/1112822

Change #1116912 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-02-03-095815-production

https://gerrit.wikimedia.org/r/1116912

Change #1116912 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-02-03-095815-production

https://gerrit.wikimedia.org/r/1116912

Thanks for all the work on this.

I have a question. Per this log entry somehow, querying /_info ends up spewing an error about failing to fetch the Frankenstein Test page, due to a 404.

Leaving aside the fact that the displayed Frankenstein test page's title is malformed (it has a url encoded + in the middle), how does this log line end up happening? Per lib/routes/info.js there seems to be nothing in the /_info endpoint that could trigger such a call.