
Configure datahub to produce structured logs by configuring slf4j or similar
Open, High, Public

Description

datahub-mae-consumer and datahub-mce-consumer appear to produce logs at ~15k events/sec when experiencing Kafka issues. The excessive rate appears to be due to the project not using structured logs and instead logging each line of a Java stack trace as a separate event.

We should enable datahub-mae-consumer to produce structured logs, so that all lines of a stack trace arrive as a single event and the rate of log production drops.
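To illustrate the effect being asked for, here is a minimal JDK-only sketch. It is not DataHub code and does not use any logback or slf4j API; the class name `StructuredLogSketch`, the field names (`@timestamp`, `level`, `message`, `stack_trace`) and the example exception are all assumptions made for illustration. It only shows how the same stack trace becomes many events when emitted as plain text, but a single event when serialized as one JSON line.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;

/** Illustrative only: not DataHub code and not a logback/slf4j API. */
public class StructuredLogSketch {

    /** Escape a string so it can sit inside a JSON string literal. */
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the generic JSON escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Render the full stack trace as one multi-line string. */
    public static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /** Build one single-line JSON event that carries the whole stack trace. */
    public static String toJsonEvent(String level, String message, Throwable t) {
        // Field names are hypothetical; a real encoder defines its own schema.
        return "{\"@timestamp\":\"" + Instant.now() + "\""
            + ",\"level\":\"" + escapeJson(level) + "\""
            + ",\"message\":\"" + escapeJson(message) + "\""
            + ",\"stack_trace\":\"" + escapeJson(stackTraceOf(t)) + "\"}";
    }

    public static void main(String[] args) {
        // Hypothetical failure standing in for a Kafka-related exception.
        Throwable t = new IllegalStateException("hypothetical Kafka failure");

        // Plain text: a line-oriented log shipper sees one event per line.
        int plainEvents = stackTraceOf(t).split("\\R").length;
        System.out.println("plain-text events: " + plainEvents);

        // Structured: the same trace is a single line, hence a single event.
        System.out.println(toJsonEvent("ERROR", "consumer failed", t));
    }
}
```

In a real deployment this serialization would presumably be done by the logging framework's own JSON encoder rather than by hand-written code like the above.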

Event Timeline

Thanks @colewhite - we tracked this incident earlier in T363843: Datahub mae consumer spamming logstash in wikikube staging and I did a rolling restart of that deployment, which I believe fixed the log spamming issue.
Are you able to confirm whether this ticket relates to another similar incident, or is this more of a follow-up task to convert the datahub pod logs to a structured format?

I see the Kafka consumer lag dropping here, which I believe correlates with the time at which I restarted the deployment:

image.png (226×1 px, 49 KB)

... so hopefully the incident is resolved and the high event processing rate will drop back to a nominal level once the backlog has completely cleared. Is this assumption right?

I very much like the idea of generating structured logs from DataHub components, but I'm not sure how feasible it will be, as it is an upstream project and I cannot find any reference to different logging formats within their codebase.
e.g. https://datahubproject.io/docs/how/extract-container-logs/

Are there any other logging options available to us, other than getting the application to log directly into a structured format?

You might look into configuring log4j/slf4j (whichever they're using for the mae component).

Thanks @colewhite. I have asked a question on the DataHub slack channel.

image.png (225×498 px, 33 KB)

I had a quick search of the docs and I'm not aware of anyone having done this, but it seems reasonable.

BTullis renamed this task from datahub-mae-consumer producing logs at excessive rate to Configure datahub to produce structured logs by configuring slf4j or similar. May 1 2024, 10:47 AM
BTullis triaged this task as Medium priority. May 1 2024, 10:56 AM

It looks like the only way to get structured logs out of DataHub is to modify the logback configuration at build time.

image.png (265×651 px, 43 KB)

This is doable as part of the build we already do here: https://gitlab.wikimedia.org/repos/data-engineering/datahub, but it might take a bit of time to get it right.

BTullis subscribed.

Change #1037495 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: drop all logs from datahub-mae-consumer-main

https://gerrit.wikimedia.org/r/1037495

Change #1037496 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop datahub-mae-consumer-main logs

https://gerrit.wikimedia.org/r/1037496

Change #1037495 merged by Cwhite:

[operations/puppet@production] profile: drop all logs from datahub-mae-consumer-main

https://gerrit.wikimedia.org/r/1037495

Change #1037496 merged by Cwhite:

[operations/puppet@production] logstash: drop datahub-mae-consumer-main logs

https://gerrit.wikimedia.org/r/1037496

This is an issue again today. We had to start dropping logs from datahub-mae-consumer-main.

Change #1038787 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: expand datahub drop filters to match all consumers

https://gerrit.wikimedia.org/r/1038787

Change #1038787 merged by Cwhite:

[operations/puppet@production] logstash: expand datahub drop filters to match all consumers

https://gerrit.wikimedia.org/r/1038787

Change #1081998 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop datahub-upgrade-job logs

https://gerrit.wikimedia.org/r/1081998

Change #1081998 merged by Cwhite:

[operations/puppet@production] logstash: drop datahub-upgrade-job logs

https://gerrit.wikimedia.org/r/1081998