Logs from Toolhub's Django container not reaching logstash from Kubernetes
Closed, ResolvedPublicBUG REPORT

Description

Toolhub is configured to emit log records to stderr as json blobs in the Elastic Common Schema format (T276374). These log events can be seen inside the k8s cluster with a command like kubectl logs -f --since=10m --all-containers=true --max-log-requests=15 -lapp=toolhub. The events look something like:

{"@timestamp":"2021-09-29T15:00:58.365Z","log.level":"error","message":"Timeout connecting to https://toolhub.toolforge.org/toolinfo.json","ecs":{"version":"1.6.0"},"error":{"message":"HTTPSConnectionPool(host='toolhub.toolforge.org', port=443): Max retries exceeded with url: /toolinfo.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f2075fca7f0>, 'Connection to toolhub.toolforge.org timed out. (connect timeout=5)'))","stack_trace":"  File \"/srv/app/toolhub/apps/crawler/tasks.py\", line 178, in fetch_content\n    timeout=(5, 13),\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py\", line 75, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/api.py\", line 61, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py\", line 542, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/sessions.py\", line 655, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/requests/adapters.py\", line 504, in send\n    raise ConnectTimeout(e, request=request)\n","type":"ConnectTimeout"},"log":{"logger":"toolhub.apps.crawler.tasks","origin":{"file":{"line":199,"name":"tasks.py"},"function":"fetch_content"}},"process":{"name":"MainProcess","pid":1,"thread":{"id":139777462257472,"name":"MainThread"}},"trace":{"id":"none"}}

In the logstash cluster, there are log events related to Toolhub (https://logstash.wikimedia.org/goto/c1d0a6ef181ed59da2bdb4f0fe1962eb), but these events are all from Kubernetes itself and not from the Django pod.
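
For illustration, a minimal Django LOGGING sketch that would emit records like the one above to stderr, assuming the ecs-logging library's StdlibFormatter (the actual Toolhub settings may differ):

```
# Illustrative sketch only, not Toolhub's real configuration.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # ecs_logging.StdlibFormatter serializes stdlib LogRecords as ECS JSON blobs.
        "ecs": {"()": "ecs_logging.StdlibFormatter"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stderr",  # k8s captures the container's stderr
            "formatter": "ecs",
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
```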

Event Timeline

Based on patches attached to T207200: Revisit the logging work done on Q1 2017-2018 for the standard pod setup, it looks like the @cee: cookie should be automatically added to the events inside rsyslog.
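
As a point of reference, a tiny Python sketch of the @cee: convention being referred to; the helper below is hypothetical, and in the standard pod setup the prefixing is expected to happen inside rsyslog, not in the application:

```
# Hypothetical illustration of the "@cee:" cookie: a syslog message whose payload
# starts with "@cee:" followed by JSON is treated as structured data by downstream
# parsers such as rsyslog's mmjsonparse. This is not WMF's actual rsyslog config.
import json

def cee_wrap(event: dict) -> str:
    return "@cee: " + json.dumps(event, separators=(",", ":"))

print(cee_wrap({"log.level": "error", "message": "Timeout connecting to https://toolhub.toolforge.org/toolinfo.json"}))
# output: @cee: {"log.level":"error","message":"Timeout connecting to https://toolhub.toolforge.org/toolinfo.json"}
```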

bd808 triaged this task as High priority.Oct 4 2021, 4:18 PM
colewhite subscribed.

The situation for logging in k8s is complicated.

I'll have a look.

Change 726129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move kubernetes_docker parsing towards the front of the pipeline

https://gerrit.wikimedia.org/r/726129

The logs aren't showing up at all because the app is producing ecs.version == 1.6.0. The pipeline only recognizes ecs.version == 1.7.0.

All unsupported versions of ECS are currently dropped in the logging pipeline. We may want to re-evaluate this behavior and perhaps leverage the DLQ (dead letter queue).

Note: The log.level field is not a part of the log object in the description. K8s log parsing happens much later than the early-stage ECS filters, which do dot expansion and log level normalization. If we want to take advantage of those, we'll want to move the filter to earlier in the pipeline.
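
To make that concrete, here is a conceptual Python sketch of what the dot expansion and log level normalization stages do to an event; the real filters are Logstash configuration in puppet, and the upper-casing below is only one plausible normalization:

```
# Conceptual illustration only; not the actual Logstash filter code.
def expand_dots(event: dict) -> dict:
    """Turn dotted keys like "log.level" into nested objects: {"log": {"level": ...}}."""
    out: dict = {}
    for key, value in event.items():
        parts = key.split(".")
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

def normalize_level(event: dict) -> dict:
    """Normalize the nested log level, e.g. to upper case, so levels match in Kibana."""
    level = event.get("log", {}).get("level")
    if level is not None:
        event["log"]["level"] = str(level).upper()
    return event

raw = {"log.level": "error", "message": "Timeout connecting ...", "ecs.version": "1.6.0"}
print(normalize_level(expand_dots(raw)))
# {'log': {'level': 'ERROR'}, 'message': 'Timeout connecting ...', 'ecs': {'version': '1.6.0'}}
```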

> The logs aren't showing up at all because the app is producing ecs.version == 1.6.0. The pipeline only recognizes ecs.version == 1.7.0.

The app is using v1.0.1 of the https://github.com/elastic/ecs-logging-python library. That library is setting the 1.6.0 version number.
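
For anyone checking this locally, a quick diagnostic sketch. The library calls are real ecs-logging / stdlib APIs; the values in the comments are just what is reported above, and importlib.metadata needs Python 3.8+ (on the image's Python 3.7 the importlib_metadata backport would be needed instead):

```
# Diagnostic sketch: confirm the installed ecs-logging release and the ECS schema
# version it stamps on emitted events.
import json
import logging
from importlib.metadata import version  # Python 3.8+; use importlib_metadata on 3.7

import ecs_logging

print(version("ecs-logging"))  # reportedly 1.0.1 in the Toolhub image

formatter = ecs_logging.StdlibFormatter()
record = logging.LogRecord("demo", logging.INFO, __file__, 1, "hello", None, None)
print(json.loads(formatter.format(record))["ecs"]["version"])  # reportedly 1.6.0
```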

> Note: The log.level field is not a part of the log object in the description. K8s log parsing happens much later than the early-stage ECS filters, which do dot expansion and log level normalization. If we want to take advantage of those, we'll want to move the filter to earlier in the pipeline.

The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

The exception here is that k8s log content isn't available at the beginning of the pipeline, bypassing the dot expansion filter for ECS. K8s logs only become available at priority 30, which, I'd argue, is a bug. CC: @akosiaris @fgiunchedi

K8s logs should probably be moved to priority 15, but the ramifications of changing the priority are unknown and would almost certainly affect other k8s producers. This is because k8s logs are type: syslog, and priority 30 comes after the main syslog filters (20).
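
As a toy illustration of that ordering (the 15/20/30 priorities are from this discussion; the stage behaviour below is heavily simplified and hypothetical, not the real puppet-managed filters):

```
# Toy model: Logstash filter stages run in ascending priority, so if the k8s JSON
# is only decoded at priority 30, the earlier ECS dot-expansion stage never sees it
# and "log.level" stays a dotted string key.
import json

def k8s_decode(event):      # roughly what the kubernetes_docker stage does
    event.update(json.loads(event.pop("raw_container_log")))
    return event

def ecs_dot_expand(event):  # roughly what the early ECS stage does
    if "log.level" in event:
        event.setdefault("log", {})["level"] = event.pop("log.level")
    return event

raw = {"type": "syslog", "raw_container_log": '{"log.level": "error", "message": "boom"}'}

for order, label in [([ecs_dot_expand, k8s_decode], "current (k8s at 30)"),
                     ([k8s_decode, ecs_dot_expand], "proposed (k8s at 15)")]:
    event = dict(raw)
    for stage in order:
        event = stage(event)
    print(label, "->", event)
# current (k8s at 30) -> {'type': 'syslog', 'log.level': 'error', 'message': 'boom'}
# proposed (k8s at 15) -> {'type': 'syslog', 'message': 'boom', 'log': {'level': 'error'}}
```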

Agreed, it seems to me that with respect to dot expansion we'd like all producers to be treated the same. Thanks for the investigation, @colewhite! I wonder if we could reduce uncertainty via logstash tests in puppet?

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

> The exception here is that k8s log content isn't available at the beginning of the pipeline, bypassing the dot expansion filter for ECS. K8s logs only become available at priority 30, which, I'd argue, is a bug. CC: @akosiaris @fgiunchedi

Nice find! And thanks for that change.

> K8s logs should probably be moved to priority 15, but the ramifications of changing the priority are unknown and would almost certainly affect other k8s producers. This is because k8s logs are type: syslog, and priority 30 comes after the main syslog filters (20).

That fact, type: syslog for all k8s logs, is something that has been bugging me every now and then for some time. It's useless to filter on in Kibana, so nobody uses it and it adds zero value for users. The same goes for the facility (local0). I'd be OK with changing parts of that, and we even have a testbed for it (the staging clusters). In my understanding, that would let us fix the mess with priority 30 and move to 15?

> The log.level field is placed in the object this way by the official Python ECS logging library, so it would be nice if our log ingestion stack could handle it.

> This sounds like a bug in the library. The ECS docs define the structure:

> The document structure should be nested JSON objects. If you use Beats or Logstash, the nesting of JSON objects is done for you automatically. If you’re ingesting to Elasticsearch using the API, your fields must be nested objects, not strings containing dots.

Why is the log.level field not nested like other fields?

> The reason that log.level is dotted like that is to make ECS more human readable. Basically the fields @timestamp, log.level, and message are always printed before the rest of the fields to imitate an unstructured log.
> [...]
> You can also see this encoded in the ECS spec, see the index field and the log.level.comment field

> Why is the log.level field not nested like other fields?

> The reason that log.level is dotted like that is to make ECS more human readable. Basically the fields @timestamp, log.level, and message are always printed before the rest of the fields to imitate an unstructured log.
> [...]
> You can also see this encoded in the ECS spec, see the index field and the log.level.comment field

Good find! The reasoning doesn't make sense to me because they're prioritizing console human-readability over consistency. IMO, ECS exists to solve inconsistencies that make the ELK stack difficult to use.

The pipeline will expand log.level and place it appropriately if the log is parsed at the right point in the pipeline. This is not the core of this task, though, so I think we should refocus on the real issues preventing ingestion, namely:

  1. ecs.version: 1.6.0 should be ecs.version: 1.7.0
  2. k8s log event parsing should happen earlier in the pipeline.

Change 726671 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: test moving the k8s parsing to earlier in the pipeline

https://gerrit.wikimedia.org/r/726671

Change 726129 merged by Cwhite:

[operations/puppet@production] logstash: set log field aside prior to parsing k8s logs

https://gerrit.wikimedia.org/r/726129

Change 726671 merged by Cwhite:

[operations/puppet@production] logstash: test moving the k8s parsing to earlier in the pipeline

https://gerrit.wikimedia.org/r/726671

herron changed the task status from Open to In Progress.Oct 7 2021, 5:28 PM
herron moved this task from Inbox to FY2021/2022-Q2 on the SRE Observability board.
herron moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q2) board.

Change 728682 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: dot_expander: better handling of field collisions

https://gerrit.wikimedia.org/r/728682

Change 728683 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move kubernetes_docker parsing to priority 15

https://gerrit.wikimedia.org/r/728683

Change 728682 merged by Cwhite:

[operations/puppet@production] logstash: dot_expander: better handling of field collisions

https://gerrit.wikimedia.org/r/728682

Change 728683 merged by Cwhite:

[operations/puppet@production] logstash: move kubernetes_docker parsing to priority 15

https://gerrit.wikimedia.org/r/728683

(note to future self) These logs are stored in the ecs-* indexes, so if you are looking for logs in the logstash discovery screen, don't forget to change to that index collection.