
Find out why function_orchestrator_nonrequesterror and function_orchestrator_outgoingresponsecount aren't being emitted, and fix them
Closed, ResolvedPublic

Description

In the function-orchestrator, we call makeMetric() in four places; two of them are showing up in the data stream we can query in Grafana (function_orchestrator_incomingrequestcount and function_orchestrator_router_request_duration_seconds), but two aren't (function_orchestrator_nonrequesterror and function_orchestrator_outgoingresponsecount).

  • Determine why they aren't being called / they're not emitting
  • Fix them to be called / to emit, or adjust other code to do so instead
  • Have a new production release of the orchestrator image made and demonstrate that they're being emitted
  • Wire the data into the appropriate place in the dashboard, if needed.

Details

Title | Reference | Author | Source Branch | Dest Branch
Prod monitoring: fix so we can emit all four metrics | repos/abstract-wiki/wikifunctions/function-orchestrator!181 | ecarg | grace/T364410/debugging-prod-resp-count-metrics | main

Event Timeline

Findings so far:

  • In /function-orchestrator/app.js, the system doesn't like req, res, or next, the params in app.all(). Anytime I try to log them, the logs get held up.
  • Therefore (or in addition), we never get inside app.all( '*', ( req, res, next ) => { ..., which calls sUtil.initAndLogRequest( req, app ); that in turn calls the metrics that are currently not being emitted.

> In /function-orchestrator/app.js, the system doesn't like req, res, or next, the params in app.all(). Anytime I try to log them, the logs get held up.

Do you have a branch where we can see this? I'll be happy to investigate if you can show me how to repro locally.

Apologies, I did not notice this comment last wk T_T!

I'll just push up all of the console.log's I have atm here: https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merge_requests/181

Jdforrester-WMF changed the task status from Open to In Progress. Tue, Jun 4, 5:12 PM
Jdforrester-WMF assigned this task to ecarg.

Update: successful 🍐 w/ James today. Moving the two metrics out of the unused function, req.issueRequest = async ( request ) => {..., solves this bug. Checking which metrics are triggered locally, I can see that all four have been emitted:

...
function_orchestrator_router_request_duration_seconds_count{service="function-orchestrator",path="--domain/v1/evaluate",method="POST",status="200"} 1

# HELP function_orchestrator_incomingrequestcount Wikifunctions function-orchestrator request count
# TYPE function_orchestrator_incomingrequestcount counter
function_orchestrator_incomingrequestcount 1

# HELP function_orchestrator_nonrequesterror Wikifunctions function-orchestrator non-request error count
# TYPE function_orchestrator_nonrequesterror counter
function_orchestrator_nonrequesterror 1

# HELP function_orchestrator_outgoingresponsecount Wikifunctions function-orchestrator response count
# TYPE function_orchestrator_outgoingresponsecount counter
function_orchestrator_outgoingresponsecount 1

Merged and available in image 2024-06-06-134535 and later. General plan is to deploy on Wednesdays in our regular slot – think it's OK to wait here?

Change #1042269 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2024-06-05-003919 to 2024-06-11-223956

https://gerrit.wikimedia.org/r/1042269

Change #1042269 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2024-06-05-003919 to 2024-06-11-223956

https://gerrit.wikimedia.org/r/1042269