Page MenuHomePhabricator

Improve wikifunctions logging
Closed, ResolvedPublic

Description

Currently wikifunctions pods log ~2500 lines per hour in production (in one DC). The absolute majority of those logs are from "levelPath: trace/req" for the /metrics and /_info endpoints, which could probably suppressed. There are also a bunch of "levelPath: debug" logs logging memory and cpu usage, lacking any other identifier (like a request-id). Same goes for "levelPath: info" which seems to print the function call (plus arguments?) and some "levelPath: notice".

But especially the amount of trace/req logs probably made it hard to spot the problem that lead to T344998 right away, see https://logstash.wikimedia.org/goto/aa3497c585530d231916e0d2b479c026 vs. https://logstash.wikimedia.org/goto/d9f2a3f0cab9474b5d48bed41cf7b8c1

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Replace all remaining log writes to trace/req with real valuesrepos/abstract-wiki/wikifunctions/function-evaluator!61jforresterT346264main
Replace all remaining log writes to trace/req with real valuesrepos/abstract-wiki/wikifunctions/function-orchestrator!71jforresterT346264main
Update function-schemata sub-module to HEAD (958de5c)repos/abstract-wiki/wikifunctions/function-orchestrator!64jforrestersync-function-schematamain
Update function-schemata sub-module to HEAD (958de5c)repos/abstract-wiki/wikifunctions/function-evaluator!56jforrestersync-function-schematamain
Update function-schemata sub-module to HEAD (958de5c)repos/abstract-wiki/wikifunctions/wikilambda-cli!15jforrestersync-function-schematamain
Logging: Actually use the logger in the correct way, passing data objectsrepos/abstract-wiki/wikifunctions/function-orchestrator!60jforresterT346264main
LoggerWrapper: Pass data object to logger service toorepos/abstract-wiki/wikifunctions/function-schemata!39jforresterT346264main
Customize query in GitLab

Event Timeline

JMeybohm renamed this task from Improcve wikifunctions logging to Improve wikifunctions logging.
Jdforrester-WMF changed the task status from Open to In Progress.Sep 26 2023, 2:09 PM
Jdforrester-WMF claimed this task.
Jdforrester-WMF triaged this task as High priority.
Jdforrester-WMF moved this task from To Triage to In Progress on the Abstract Wikipedia team board.

Change 961848 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (958de5c)

https://gerrit.wikimedia.org/r/961848

Change 961848 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (958de5c)

https://gerrit.wikimedia.org/r/961848

Change 962650 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Use function-orchestrator image with better logging

https://gerrit.wikimedia.org/r/962650

Change 962650 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Use function-orchestrator image with better logging

https://gerrit.wikimedia.org/r/962650

When I check https://logstash.wikimedia.org/goto/dbb318a11eaeb4b54503b9878c805f15 for the last 1 hour and remove NOT levelPath:trace/req, the number of events is unchanged, so the levelPath:trace/req is suppressed.
But levelPath:trace is still present on Apps Logs (Kubernetes) (about 3,600 in 1 hour). We don't need to suppress them too?

Also, the dashboard wikifunctions has events with level 10 (`levelPath
trace`), ~8,404; and level INFO accounts for 4,551 events. Overall (without filtering), wikifunctions dashboard presents 13,110 events in 1 hour. Should the dashboard filters be adjusted further?

Additional improvements for logging will be addressed in other tasks if necessary. trace/req is not present anymore. Closing this task as Resolved.