Page MenuHomePhabricator

Function Evaluator log data loss due to ECS nonconforming fields
Closed, ResolvedPublic

Description

Uncovered while working on T382105: Increase in Logstash warning logs

function-evaluator-javascript-evaluator
{
  "@timestamp": "2025-01-10T21:42:16.231Z",
  "ecs.version": "8.10.0",
  "log.level": "info",
  "message": "Incoming evaluator request",
  "request": {                                          <-- Must be nested inside a "http" hash or dot-delimited e.g. "http.request"  https://doc.wikimedia.org/ecs/#ecs-http
    "headers": {
      "user-agent": "kube-probe/1.23",
      "x-request-id": "258db3fa-d55c-4d72-a76d-8f93f944f9e2"
    },
    "method": "GET",
    "params": {
      "0": "/_info"
    },
    "query": {},
    "remoteAddress": "10.64.0.105",
    "remotePort": 44150,
    "url": "/_info"
  },
  "requestId": "258db3fa-d55c-4d72-a76d-8f93f944f9e2",   <-- Only need one of these and should be part of http hash e.g. "http.request.id" https://doc.wikimedia.org/ecs/#field-http-request-id
  "request_id": "258db3fa-d55c-4d72-a76d-8f93f944f9e2",
  "service": "function-evaluator"                        <-- Service must be a hash, e.g. "service.name"  https://doc.wikimedia.org/ecs/#field-service-name
}
function-evaluator-python-evaluator
{
  "@timestamp": "2025-01-10T21:31:29.435Z",
  "ecs.version": "8.10.0",
  "log.level": "info",
  "message": "Incoming evaluator request",
  "request": {
    "headers": {                                          <-- Must be nested inside a "http" hash or dot-delimited e.g. "http.request" https://doc.wikimedia.org/ecs/#ecs-http
      "user-agent": "kube-probe/1.23",
      "x-request-id": "b3059da9-be9a-47c5-be35-eda150f0227f"
    },
    "method": "GET",
    "params": {
      "0": "/_info"
    },
    "query": {},
    "remoteAddress": "10.64.32.110",
    "remotePort": 55134,
    "url": "/_info"
  },
  "requestId": "b3059da9-be9a-47c5-be35-eda150f0227f",   <-- Only need one of these and should be part of http hash e.g. "http.request.id" https://doc.wikimedia.org/ecs/#field-http-request-id
  "request_id": "b3059da9-be9a-47c5-be35-eda150f0227f",
  "service": "function-evaluator"                        <-- Service must be a hash, e.g. "service.name"  https://doc.wikimedia.org/ecs/#field-service-name
}

Event Timeline

Jdforrester-WMF subscribed.

This might be a bug in service-utils, which is I think what's controlling the log structure. To investigate.

Thank you, looking into this

ecarg changed the task status from Open to In Progress.Jan 17 2025, 3:28 AM

notes:

  • one of the last occurrences of a 'normal' log and then goes silent once the clock turns 11/30
  • yesterday, above was true; but revisiting the same date range today (1/17), the shift is now from 11/19-11/20, just as in the orchestrator (why is this?!?!) log event
  • interesting that this can also be seen in the orchestrator but a couple days prior from 11/19-20 ticket for that issue
  • when the evaluator events were coming through, we see different 'host's firing the same log: mw1421, parse2020, kubestage1004, wikikube-worker2032 (attachment in next comment block)
  • whereas, when that stops/thing aren't 'working' we only see the generic stuff from kubestage2002
This comment was removed by ecarg.

Hi @colewhite would you be able to provide a link or snapshot from LogStash of these particular events or similar? I'm having difficulty trying to find these specific events even when filtering with the given request-id and timestamps; the thing is, when I'm looking at kubernetes.namespace_name: wikifunctions I can't find that request-id at all, it's only when I have all filters off do I find a bunch of events (with that request-id) but have yet to find these occurrences. I'm trying to rule out some things but it would be super helpful bc log structure varies depending on where the event is coming from and its host! Thanks a lot!

Hi @colewhite would you be able to provide a link or snapshot from LogStash of these particular events or similar? I'm having difficulty trying to find these specific events even when filtering with the given request-id and timestamps; the thing is, when I'm looking at kubernetes.namespace_name: wikifunctions I can't find that request-id at all, it's only when I have all filters off do I find a bunch of events (with that request-id) but have yet to find these occurrences. I'm trying to rule out some things but it would be super helpful bc log structure varies depending on where the event is coming from and its host! Thanks a lot!

These particular events are going to be near impossible to find because the useful attributes, (request id, etc) are dropped due to the problem described by this bug report.

Here is a page showing recent events in the affected log stream: https://logstash.wikimedia.org/goto/918c58bc408d126bbf640ef5af989aa9

Change #1113572 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-01-08-143723 to 2025-01-22-212306

https://gerrit.wikimedia.org/r/1113572

Change #1113572 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-01-08-143723 to 2025-01-22-212306

https://gerrit.wikimedia.org/r/1113572

Did this deploy solve it?

It looks like it solved most of it!

There are some misplaced http.request fields making it through:

  • http.request.params.0 looks like url.path
  • http.request.remoteAddress looks like source.ip
  • http.request.remotePort looks like source.port
  • http.request.url looks like url.full

Change #1115028 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249

https://gerrit.wikimedia.org/r/1115028

Change #1115029 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344

https://gerrit.wikimedia.org/r/1115029

Change #1115032 had a related patch set uploaded (by Cory Massaro; author: Cory Massaro):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249

https://gerrit.wikimedia.org/r/1115032

Change #1115032 abandoned by Cory Massaro:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from version: 2025-01-22-203140 to 2025-01-28-144249

Reason:

already done

https://gerrit.wikimedia.org/r/1115032

Change #1115028 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-01-22-203140 to 2025-01-28-144249

https://gerrit.wikimedia.org/r/1115028

Change #1115029 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-01-22-212306 to 2025-01-29-140344

https://gerrit.wikimedia.org/r/1115029

I can see our logs coming through now, is this issue resolved from SRE end?

cc: @colewhite

Thank you! And @tchin for the utils updates!

Change #1121380 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756

https://gerrit.wikimedia.org/r/1121380

Change #1121381 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923

https://gerrit.wikimedia.org/r/1121381

Change #1121380 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756

https://gerrit.wikimedia.org/r/1121380

Change #1121381 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923

https://gerrit.wikimedia.org/r/1121381