Logs from Toolhub's Django container not reaching logstash from Kubernetes
Closed, ResolvedPublicBUG REPORT

Description

Toolhub is configured to emit log records to stderr as json blobs in the Elastic Common Schema format (T276374). These log events can be seen inside the k8s cluster with a command like kubectl logs -f --since=10m --all-containers=true --max-log-requests=15 -lapp=toolhub. The events look something like:

{"@timestamp":"2021-09-29T15:00:58.365Z","log.level":"error","message":"Timeout connecting to https://toolhub.toolforge.org/toolinfo.json","ecs":{"version":"1.6.0"},"error":{"message":"HTTPSConnectionPool(host='toolhub.toolforge.org', port=443): Max retries exceeded with url: /toolinfo.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f2075fca7f0>, 'Connection to toolhub.toolforge.org timed out. (connect timeout=5)'))","stack_trace":"  File \"/srv/app/toolhub/apps/crawler/tasks.py\", line 178, in fetch_content\n    timeout=(5, 13),\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py\", line 75, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py\", line 61, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py\", line 542, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py\", line 655, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/adapters.py\", line 504, in send\n    raise ConnectTimeout(e, request=request)\n","type":"ConnectTimeout"},"log":{"logger":"toolhub.apps.crawler.tasks","origin":{"file":{"line":199,"name":"tasks.py"},"function":"fetch_content"}},"process":{"name":"MainProcess","pid":1,"thread":{"id":139777462257472,"name":"MainThread"}},"trace":{"id":"none"}}

In the logstash cluster, there are log events related to Toolhub (https://logstash.wikimedia.org/goto/c1d0a6ef181ed59da2bdb4f0fe1962eb), but these events are all from Kubernetes itself and not from the Django pod.
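
For illustration, a minimal Django LOGGING sketch that would emit records like the one above to stderr, assuming the ecs-logging library's StdlibFormatter (the actual Toolhub settings may differ):

```
# Illustrative sketch only, not Toolhub's real configuration.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # ecs_logging.StdlibFormatter serializes stdlib LogRecords as ECS JSON blobs.
        "ecs": {"()": "ecs_logging.StdlibFormatter"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stderr",  # k8s captures the container's stderr
            "formatter": "ecs",
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
```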

Event Timeline

Based on patches attached to T207200: Revisit the logging work done on Q1 2017-2018 for the standard pod setup, it looks like the @cee: cookie should be automatically added to the events inside rsyslog.
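
As a point of reference, a tiny Python sketch of the @cee: convention being referred to; the helper below is hypothetical, and in the standard pod setup the prefixing is expected to happen inside rsyslog, not in the application:

```
# Hypothetical illustration of the "@cee:" cookie: a syslog message whose payload
# starts with "@cee:" followed by JSON is treated as structured data by downstream
# parsers such as rsyslog's mmjsonparse. This is not WMF's actual rsyslog config.
import json

def cee_wrap(event: dict) -> str:
    return "@cee: " + json.dumps(event, separators=(",", ":"))

print(cee_wrap({"log.level": "error", "message": "Timeout connecting to https://toolhub.toolforge.org/toolinfo.json"}))
# output: @cee: {"log.level":"error","message":"Timeout connecting to https://toolhub.toolforge.org/toolinfo.json"}
```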

bd808 triaged this task as High priority.Oct 4 2021, 4:18 PM
colewhite subscribed.

The situation for logging in k8s is complicated.

I'll have a look.

Change 726129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move kubernetes_docker parsing towards the front of the pipeline

https://gerrit.wikimedia.org/r/726129

The logs aren't showing up at all because the app is producing ecs.version == 1.6.0. The pipeline only recognizes ecs.version == 1.7.0.

All unsupported versions of ECS are currently dropped in the logging pipeline. We may want to re-evaluate this behavior and perhaps leverage the DLQ (dead letter queue).

Note: The log.level field is not a part of the log object in the description. K8s log parsing happens much later than the early-stage ECS filters, which do dot expansion and log level normalization. If we want to take advantage of those, we'll want to move the filter to earlier in the pipeline.
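
To make that concrete, here is a conceptual Python sketch of what the dot expansion and log level normalization stages do to an event; the real filters are Logstash configuration in puppet, and the upper-casing below is only one plausible normalization:

```
# Conceptual illustration only; not the actual Logstash filter code.
def expand_dots(event: dict) -> dict:
    """Turn dotted keys like "log.level" into nested objects: {"log": {"level": ...}}."""
    out: dict = {}
    for key, value in event.items():
        parts = key.split(".")
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

def normalize_level(event: dict) -> dict:
    """Normalize the nested log level, e.g. to upper case, so levels match in Kibana."""
    level = event.get("log", {}).get("level")
    if level is not None:
        event["log"]["level"] = str(level).upper()
    return event

raw = {"log.level": "error", "message": "Timeout connecting ...", "ecs.version": "1.6.0"}
print(normalize_level(expand_dots(raw)))
# {'log': {'level': 'ERROR'}, 'message': 'Timeout connecting ...', 'ecs': {'version': '1.6.0'}}
```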

> The logs aren't showing up at all because the app is producing ecs.version == 1.6.0. The pipeline only recognizes ecs.version == 1.7.0.

The app is using v1.0.1 of the https://github.com/elastic/ecs-logging-python library. That library is setting the 1.6.0 version number.
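
For anyone checking this locally, a quick diagnostic sketch. The library calls are real ecs-logging / stdlib APIs; the values in the comments are just what is reported above, and importlib.metadata needs Python 3.8+ (on the image's Python 3.7 the importlib_metadata backport would be needed instead):

```
# Diagnostic sketch: confirm the installed ecs-logging release and the ECS schema
# version it stamps on emitted events.
import json
import logging
from importlib.metadata import version  # Python 3.8+; use importlib_metadata on 3.7

import ecs_logging

print(version("ecs-logging"))  # reportedly 1.0.1 in the Toolhub image

formatter = ecs_logging.StdlibFormatter()
record = logging.LogRecord("demo", logging.INFO, __file__, 1, "hello", None, None)
print(json.loads(formatter.format(record))["ecs"]["version"])  # reportedly 1.6.0
```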

> Note: The log.level field is not a part of the log object in the description. K8s log parsing happens much later than the early-stage ECS filters, which do dot expansion and log level normalization. If we want to take advantage of those, we'll want to move the filter to earlier in the pipeline.

The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

The exception here is that k8s log content isn't available at the beginning of the pipeline, bypassing the dot expansion filter for ECS. K8s logs only become available at priority 30, which, I'd argue, is a bug. CC: @akosiaris @fgiunchedi

K8s logs should probably be moved to priority 15, but the ramifications of changing the priority are unknown and would almost certainly affect other k8s producers. This is because k8s logs are type: syslog, and priority 30 comes after the main syslog filters (20).
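
As a toy illustration of that ordering (the 15/20/30 priorities are from this discussion; the stage behaviour below is heavily simplified and hypothetical, not the real puppet-managed filters):

```
# Toy model: Logstash filter stages run in ascending priority, so if the k8s JSON
# is only decoded at priority 30, the earlier ECS dot-expansion stage never sees it
# and "log.level" stays a dotted string key.
import json

def k8s_decode(event):      # roughly what the kubernetes_docker stage does
    event.update(json.loads(event.pop("raw_container_log")))
    return event

def ecs_dot_expand(event):  # roughly what the early ECS stage does
    if "log.level" in event:
        event.setdefault("log", {})["level"] = event.pop("log.level")
    return event

raw = {"type": "syslog", "raw_container_log": '{"log.level": "error", "message": "boom"}'}

for order, label in [([ecs_dot_expand, k8s_decode], "current (k8s at 30)"),
                     ([k8s_decode, ecs_dot_expand], "proposed (k8s at 15)")]:
    event = dict(raw)
    for stage in order:
        event = stage(event)
    print(label, "->", event)
# current (k8s at 30) -> {'type': 'syslog', 'log.level': 'error', 'message': 'boom'}
# proposed (k8s at 15) -> {'type': 'syslog', 'message': 'boom', 'log': {'level': 'error'}}
```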

Agreed, it seems to me that with respect to dot expansion we'd like all producers to be treated the same. Thanks for the investigation, @colewhite! I wonder if we could reduce uncertainty via logstash tests in puppet?

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

> The exception here is that k8s log content isn't available at the beginning of the pipeline, bypassing the dot expansion filter for ECS. K8s logs only become available at priority 30, which, I'd argue, is a bug. CC: @akosiaris @fgiunchedi

Nice find! And thanks for that change.

> K8s logs should probably be moved to priority 15, but the ramifications of changing the priority are unknown and would almost certainly affect other k8s producers. This is because k8s logs are type: syslog, and priority 30 comes after the main syslog filters (20).

That fact, type: syslog for all k8s logs, is something that has been bugging me every now and then for some time. It's useless to filter on in Kibana, so nobody uses it and it adds zero value for users. The same goes for the facility (local0). I'd be OK with changing parts of that, and we even have a testbed for it (the staging clusters). In my understanding, that would let us fix the mess with priority 30 and move to 15?

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

Why is the log.level field not nested like other fields?

> The reason that log.level is dotted like that is to make ECS more human readable. Basically the fields @timestamp, log.level, and message are always printed before the rest of the fields to imitate an unstructured log.
> [...]
> You can also see this encoded in the ECS spec, see the index field and the log.level.comment field

> Why is the log.level field not nested like other fields?

> The reason that log.level is dotted like that is to make ECS more human readable. Basically the fields @timestamp, log.level, and message are always printed before the rest of the fields to imitate an unstructured log.
> [...]
> You can also see this encoded in the ECS spec, see the index field and the log.level.comment field

Good find! The reasoning doesn't make sense to me because they're prioritizing console human-readability over consistency. IMO, ECS exists to solve inconsistencies that make the ELK stack difficult to use.

The pipeline will expand log.level and place it appropriately if the log is parsed at the right point in the pipeline. This is not the core of this task, though, so I think we should refocus on the real issues preventing ingestion, namely:

  1. ecs.version: 1.6.0 should be ecs.version: 1.7.0
  2. k8s log event parsing should happen earlier in the pipeline.

Change 726671 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: test moving the k8s parsing to earlier in the pipeline

https://gerrit.wikimedia.org/r/726671

Change 726129 merged by Cwhite:

[operations/puppet@production] logstash: set log field aside prior to parsing k8s logs

https://gerrit.wikimedia.org/r/726129

Change 726671 merged by Cwhite:

[operations/puppet@production] logstash: test moving the k8s parsing to earlier in the pipeline

https://gerrit.wikimedia.org/r/726671

herron changed the task status from Open to In Progress.Oct 7 2021, 5:28 PM
herron moved this task from Inbox to FY2021/2022-Q2 on the SRE Observability board.
herron moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.

Change 728682 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: dot_expander: better handling of field collisions

https://gerrit.wikimedia.org/r/728682

Change 728683 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move kubernetes_docker parsing to priority 15

https://gerrit.wikimedia.org/r/728683

Change 728682 merged by Cwhite:

[operations/puppet@production] logstash: dot_expander: better handling of field collisions

https://gerrit.wikimedia.org/r/728682

Change 728683 merged by Cwhite:

[operations/puppet@production] logstash: move kubernetes_docker parsing to priority 15

https://gerrit.wikimedia.org/r/728683

(note to future self) These logs are stored in the ecs-* indexes, so if you are looking for logs in the logstash discovery screen, don't forget to change to that index collection.