Page MenuHomePhabricator

Epic: Improve logging for ipoid app
Closed, DeclinedPublic

Description

Logs can be viewed via the ipoid dashboard on logstash.

These need a few improvements:

  • All logs are currently at NOTICE level, including errors (example)
  • We're logging too much - e.g. a log line for each batch that is imported (several thousand per day)
  • The logs are unstructured, so each output line of a multi-line error has its own log

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
update-db: Adjust error loggingrepos/mediawiki/services/ipoid!227kharlanmain-Ieb75bf687679ac57a26e97c38ed03cd483bc26b3main
logging: Switch pipeline logging to winstonrepos/mediawiki/services/ipoid!224kharlanmain-Ifcc5f1d87852bb7dc6084f27f8f3cca9823daf8dmain
logging: Use winston logger with ECS formattingrepos/mediawiki/services/ipoid!213kharlanmain-I2f4167934135b466a03d09a713f4ae763efa1f76main
Log at info level instead of debug levelrepos/mediawiki/services/ipoid!211stranmain-I64b09ff1b146227e95a956048a2d15f4e58bf391main
Customize query in GitLab

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

One option would be to remove our dependency on bash scripts; in that case we'd do something like T348479: Consider a CLI framework for ipoid and rewrite the bash scripts as nodejs scripts, which shell out as needed. Then we'd make use of our existing logging library (bunyan) and replace existing usages of console.log with the bunyan logger.

If we want to do something that involves less rewriting, https://sevic.dev/notes/bash-script-json-logging/ has an example of a JSON logger function we could include in the existing shell scripts.

Maybe it makes sense to do both: start with the JSON logger function for bash scripts, but eventually migrate the scripts to nodejs so that they're testable and easier to update and maintain.

We're logging too much - e.g. a log line for each batch that is imported (several thousand per day)

I've opted not to do anything about this. A few thousand (structured) entries at an info/debug level don't seem to be a problem, IMHO. Perhaps we could revisit that if we find that it is still too much.

Change 989503 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/989503

Change 989503 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/989503

Change 989726 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/989726

Change 989726 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/989726

Change 990056 had a related patch set uploaded (by Tchanders; author: Tchanders):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/990056

Change 990056 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/990056

Looks like the level wasn't the problem: logstash dashboard

I think there's one more thing to do in this task, which is to convert the logging format used for requests to make use of https://github.com/pinojs/pino and the elastic common scheme format.

Looks like the level wasn't the problem: logstash dashboard

It is working after all, we just need to use the ecs-* data interface, e.g. https://logstash.wikimedia.org/goto/cc0625d83eeea78edcb8ea54602d6eab. Working on converting the existing dashboard to reference that.

Change 995197 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/995197

Change 995197 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/995197

stran merged https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/213

logging: Use winston logger with ECS formatting

@jijiki @akosiaris I've deployed that one to staging. Here's an example log entry. Before we move that patch forward to eqiad/codfw, could you please review the metadata we're logging with the request and let us know if there's anything we should be filtering out before we emit a log event?

stran merged https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/213

logging: Use winston logger with ECS formatting

@jijiki @akosiaris I've deployed that one to staging. Here's an example log entry. Before we move that patch forward to eqiad/codfw, could you please review the metadata we're logging with the request and let us know if there's anything we should be filtering out before we emit a log event?

It looks pretty good to me. Note that the fields with the warning have no mappings in elasticsearch and as such they won't be searchable (e.g. you won't be able to search with a filter of http.response.headers.content-length)

Adding @colewhite so they can override me and object in case I am wrong, but I 'd say feel free to proceed.

P.S. I assume that http.response.body.bytes equaling 404 is coincidental, right?

stran merged https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/213

logging: Use winston logger with ECS formatting

@jijiki @akosiaris I've deployed that one to staging. Here's an example log entry. Before we move that patch forward to eqiad/codfw, could you please review the metadata we're logging with the request and let us know if there's anything we should be filtering out before we emit a log event?

It looks pretty good to me. Note that the fields with the warning have no mappings in elasticsearch and as such they won't be searchable (e.g. you won't be able to search with a filter of http.response.headers.content-length)

Sure, that's fine.

Adding @colewhite so they can override me and object in case I am wrong, but I 'd say feel free to proceed.

Ack

P.S. I assume that http.response.body.bytes equaling 404 is coincidental, right?

Yeah I was mildly concerned about that too -- it is a coincidence, see e.g. this request with a different value.

Yeah I was mildly concerned about that too -- it is a coincidence, see e.g. this request with a different value.

😆 😆. Cool, thanks for confirming.

It looks pretty good to me. Note that the fields with the warning have no mappings in elasticsearch and as such they won't be searchable (e.g. you won't be able to search with a filter of http.response.headers.content-length)

Our schema diverges from upstream slightly to support our needs. "http.response.headers" is one such field that has no upstream equivalent.

The errors you saw are from OpenSearch Dashboards having never encountered this specific "content-length" field.

I refreshed the field mapping - the warnings are gone and the fields are now searchable.

Change 1002930 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002930

Change 1002930 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002930

Change 1002981 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002981

Change 1002981 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002981

@STran @kostajh Should I move this task into Done, or into QA on our current sprint?

@STran @kostajh Should I move this task into Done, or into QA on our current sprint?

There are still the two subtasks.

I'd like to also clean up the log messages a bit more, but I can file a new subtask for that.

@STran @kostajh Should I move this task into Done, or into QA on our current sprint?

There are still the two subtasks.

I'd like to also clean up the log messages a bit more, but I can file a new subtask for that.

Filed as T358617: Simplify log messages and add more useful information to messages

Moving to the backlog, as we've got too many other things going on right now.

Moving to the backlog, as we've got too many other things going on right now.

I propose we decline this task and its subtasks in favor of iPoid-Service (IPoid OpenSearch)

STran renamed this task from Improve logging for ipoid app to Epic: Improve logging for ipoid app.Nov 26 2024, 12:28 PM

As we're no longer certain of the timeline on migrating to OpenSearch, let's improve our logging for our current situation. I'm editing this task to make it an epic since we've got a bunch of small improvement tickets under this topic.