Page MenuHomePhabricator

Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana
Closed, ResolvedPublic

Description

We had this in the past already and it kind of healed itself IIRC but we now have the situation again that staging-eqiad do not show up in Elasticsearch/Kibana at all.

Last log lines in elasticsearch:

Aug 20, 2021 @ 10:43:39.207 kubestage1002.eqiad.wmnet
Aug 20, 2021 @ 10:43:39.193 kubestage1001.eqiad.wmnet
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: mmkubernetes: failed to connect to [https://kubestagemaster.svc.eqiad.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-6674d59bd8-dhl56] - 7:Couldn't connect to server  [v8.1901.0]
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: action 'action-0-mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

It turns out the mmkubernetes plugin is terminated rather than suspended in case of connection errors to k8s API. So a retry would never happen: https://github.com/rsyslog/rsyslog/issues/4669

Event Timeline

JMeybohm created this task.

AIUI the logs do reach Elasticsearch, but they lack the Kubernetes API metadata and therefore do not show up in the usual queries/dashboards.
https://logstash.wikimedia.org/goto/c64588ec490c741fbc8f89d05549a7df

And that since...

/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: mmkubernetes: failed to connect to [https://kubestagemaster.svc.eqiad.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-6674d59bd8-dhl56] - 7:Couldn't connect to server  [v8.1901.0]
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: action 'action-0-mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

Yeah...great:

Error Handling

If the plugin encounters a 404 Not Found [...]

If the plugin gets a 429 Busy response [...]

For other errors, the plugin will assume they are “hard” errors requiring admin intervention, return an error code, and rsyslog will suspend the plugin. Use the Statistic Counter to monitor for problems getting data from the Kubernetes API service.

Change 715024 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Add explicit name to mmkubernetes action

https://gerrit.wikimedia.org/r/715024

Change 715024 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Add explicit name to mmkubernetes action

https://gerrit.wikimedia.org/r/715024

Change 715054 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Fix name in rsyslog action

https://gerrit.wikimedia.org/r/715054

Change 715054 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Fix name in rsyslog action

https://gerrit.wikimedia.org/r/715054

I tried to convince rsyslog to retry/resume the action with:

action(type="mmkubernetes"
       name="mmkubernetes"
       action.resumeRetryCount="-1"
       action.resumeInterval="1"
       action.resumeIntervalMax="300"
       action.reportSuspensionContinuation="on")

So far I have seen zero attempts to re-establish a connection, though. Also I don't see the mmkubernetes module being listed as suspended in the prometheus metrics, so maybe I'm totally running in the wrong direction here and RS_RET_ERR actually means "there is no way out": https://github.com/rsyslog/rsyslog/blob/25e6b014b8bc9ec5fe0621aa2146cfd304ba4f67/contrib/mmkubernetes/mmkubernetes.c#L1439-L1443

...which actually seems to be the case:
https://github.com/rsyslog/rsyslog/blob/25e6b014b8bc9ec5fe0621aa2146cfd304ba4f67/action.c#L1105-L1114

I've patched the mmkubernetes module to actually be suspended (and thus can be resumed) rather than to be disabled. Although that works, I'm not sure it is the correct approach here. Opened upstream issue https://github.com/rsyslog/rsyslog/issues/4669 to discuss.

Change 715227 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/rsyslog@debian/stretch-wikimedia] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/715227

Change 715227 merged by JMeybohm:

[operations/debs/rsyslog@debian/stretch-wikimedia] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/715227

The patch needs to be ported to the buster package as well, T277739 /cc @elukey

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:19:06Z] <jayme> importes rsyslog 8.1901.0-1~bpo9+wmf2 to stretch-wikimedia - T289766

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:25:36Z] <jayme> updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-staging - T289766

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:58:46Z] <jayme> updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-workers - T289766

Change 720670 had a related patch set uploaded (by Elukey; author: JMeybohm):

[operations/debs/rsyslog@debian/buster-wikimedia-k8s] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/720670

Change 720670 merged by Elukey:

[operations/debs/rsyslog@debian/buster-wikimedia-k8s] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/720670

Change 720677 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Enable resume/retry for mmkubernetes

https://gerrit.wikimedia.org/r/720677

Change 720677 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Enable resume/retry for mmkubernetes

https://gerrit.wikimedia.org/r/720677

JMeybohm updated the task description. (Show Details)
JMeybohm reopened this task as Open.EditedDec 10 2021, 5:28 PM

Unfortunately this does not seem to be enough. During a kubernetes master reboot (for T294119) mmkubernetes failed on both staging-codfw nodes and never revovered:

Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:24:19 kubestage2001 rsyslogd[27364]: mmkubernetes: Forbidden: no access - check permissions to view url [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48]  [v8.1901.0]
Nov 24 15:24:19 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:24:19 kubestage2002 rsyslogd[29758]: mmkubernetes: Forbidden: no access - check permissions to view url [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater]  [v8.1901.0]
Nov 24 15:24:19 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

The generic "rsyslog_action" matrics for mmkubernetes don't show any issues:

jayme@kubestage2002:~$ curl -s 10.192.16.137:9105/metrics | grep mmkuber
rsyslog_action_failed{action="mmkubernetes"} 0
rsyslog_action_processed{action="mmkubernetes"} 3.625691e+06
rsyslog_action_resumed{action="mmkubernetes"} 0
rsyslog_action_suspended{action="mmkubernetes"} 0
rsyslog_action_suspended_duration{action="mmkubernetes"} 0

But rsyslog_kubernetes_pod_metadata_error_total and rsyslog_kubernetes_namespace_metadata_error_total do have values > 0

Change 752652 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Add rule to alert on terminated/failing rsyslog mmkubernetes

https://gerrit.wikimedia.org/r/752652

Change 752652 merged by jenkins-bot:

[operations/alerts@master] Add rule to alert on terminated/failing rsyslog mmkubernetes

https://gerrit.wikimedia.org/r/752652

JMeybohm lowered the priority of this task from High to Low.Jan 11 2022, 9:45 AM

I've added a prometheus alert that fires if mmkubernetes has not processed any records for 5min.
Lowering priority but keeping this open as the actual issue is not resolved and we should do something about it "when we have time"

Mar 11 08:45:01 kubernetes2011 rsyslogd: mmkubernetes: server returned unexpected code [500] for url [https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/cert-manager/pods/cert-manager-58cf4b4966-4znql]  [v8.1901.0]
Mar 11 08:45:01 kubernetes2011 rsyslogd: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

Mentioned in SAL (#wikimedia-operations) [2022-03-11T09:00:51Z] <jayme> kubernetes2011:~# systemctl restart rsyslog.service - T289766

Starting from v8.2108.0 an implementation for error handling has been added to rsyslog: https://github.com/rsyslog/rsyslog/commit/80e4ce176951754da7a94c953cc031a83c411709 and v8.2108.0 has been accepted to bullseye-backports recently.

We should import that version into bullseye-wikimedia/component/rsyslog-k8s when upstream binary packages become available.

Mentioned in SAL (#wikimedia-operations) [2022-09-14T12:54:52Z] <jayme> imported rsyslog 8.2208.0-1~bpo11+1 into bullseye-wikimedia component/rsyslog-k8s - T289766

Unfortunately v8.2208.0 added a new pstats component that brought back a lot "error handling stats line" messages (T210137: Handle unknown stats in rsyslog_exporter). I tried to fight them and potential future versions off with: https://github.com/aleroyer/rsyslog_exporter/pull/7

Change 832453 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/prometheus-rsyslog-exporter@master] Allow to silence "error handling stats line" messages

https://gerrit.wikimedia.org/r/832453

Change 832453 merged by JMeybohm:

[operations/debs/prometheus-rsyslog-exporter@master] Allow to silence "error handling stats line" messages

https://gerrit.wikimedia.org/r/832453

Mentioned in SAL (#wikimedia-operations) [2022-09-15T11:22:30Z] <jayme> importing prometheus-rsyslog-exporter 0.0.0+git20201008-4 to stretch-wikimedia, buster-wikimedia, bullseye-wikimedia - T289766

New rsyslog & exporter rolled out to A:wikikube-staging-master or A:wikikube-staging-worker

Mentioned in SAL (#wikimedia-operations) [2022-09-15T12:17:42Z] <jayme> fleet wide update of prometheus-rsyslog-exporter to 0.0.0+git20201008-4 - T289766

Change 832493 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Prevent rsyslog-exporter from logging "error handling stats line" messages

https://gerrit.wikimedia.org/r/832493

Change 832493 merged by JMeybohm:

[operations/puppet@production] Prevent rsyslog-exporter from logging "error handling stats line" messages

https://gerrit.wikimedia.org/r/832493

Mentioned in SAL (#wikimedia-operations) [2022-09-15T13:50:58Z] <jayme> updated rsyslog to 8.2208.0-1~bpo11+1 on all kubernetes masters and nodes - T289766

Resolving this in a hopeful manner