Page MenuHomePhabricator

Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana
Open, LowPublic

Description

We had this in the past already and it kind of healed itself IIRC but we now have the situation again that staging-eqiad do not show up in Elasticsearch/Kibana at all.

Last log lines in elasticsearch:

Aug 20, 2021 @ 10:43:39.207 kubestage1002.eqiad.wmnet
Aug 20, 2021 @ 10:43:39.193 kubestage1001.eqiad.wmnet
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: mmkubernetes: failed to connect to [https://kubestagemaster.svc.eqiad.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-6674d59bd8-dhl56] - 7:Couldn't connect to server  [v8.1901.0]
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: action 'action-0-mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

It turns out the mmkubernetes plugin is terminated rather than suspended in case of connection errors to k8s API. So a retry would never happen: https://github.com/rsyslog/rsyslog/issues/4669

Event Timeline

JMeybohm created this task.

AIUI the logs do reach Elasticsearch, but they lack the Kubernetes API metadata and therefore do not show up in the usual queries/dashboards.
https://logstash.wikimedia.org/goto/c64588ec490c741fbc8f89d05549a7df

And that since...

/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: mmkubernetes: failed to connect to [https://kubestagemaster.svc.eqiad.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-6674d59bd8-dhl56] - 7:Couldn't connect to server  [v8.1901.0]
/var/log/syslog.6.gz:Aug 20 10:43:39 kubestage1002 rsyslogd: action 'action-0-mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

Yeah...great:

Error Handling

If the plugin encounters a 404 Not Found [...]

If the plugin gets a 429 Busy response [...]

For other errors, the plugin will assume they are “hard” errors requiring admin intervention, return an error code, and rsyslog will suspend the plugin. Use the Statistic Counter to monitor for problems getting data from the Kubernetes API service.

Change 715024 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Add explicit name to mmkubernetes action

https://gerrit.wikimedia.org/r/715024

Change 715024 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Add explicit name to mmkubernetes action

https://gerrit.wikimedia.org/r/715024

Change 715054 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Fix name in rsyslog action

https://gerrit.wikimedia.org/r/715054

Change 715054 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Fix name in rsyslog action

https://gerrit.wikimedia.org/r/715054

I tried to convince rsyslog to retry/resume the action with:

action(type="mmkubernetes"
       name="mmkubernetes"
       action.resumeRetryCount="-1"
       action.resumeInterval="1"
       action.resumeIntervalMax="300"
       action.reportSuspensionContinuation="on")

So far I have seen zero attempts to re-establish a connection, though. Also I don't see the mmkubernetes module being listed as suspended in the prometheus metrics, so maybe I'm totally running in the wrong direction here and RS_RET_ERR actually means "there is no way out": https://github.com/rsyslog/rsyslog/blob/25e6b014b8bc9ec5fe0621aa2146cfd304ba4f67/contrib/mmkubernetes/mmkubernetes.c#L1439-L1443

...which actually seems to be the case:
https://github.com/rsyslog/rsyslog/blob/25e6b014b8bc9ec5fe0621aa2146cfd304ba4f67/action.c#L1105-L1114

I've patched the mmkubernetes module to actually be suspended (and thus can be resumed) rather than to be disabled. Although that works, I'm not sure it is the correct approach here. Opened upstream issue https://github.com/rsyslog/rsyslog/issues/4669 to discuss.

Change 715227 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/rsyslog@debian/stretch-wikimedia] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/715227

Change 715227 merged by JMeybohm:

[operations/debs/rsyslog@debian/stretch-wikimedia] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/715227

The patch needs to be ported to the buster package as well, T277739 /cc @elukey

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:19:06Z] <jayme> importes rsyslog 8.1901.0-1~bpo9+wmf2 to stretch-wikimedia - T289766

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:25:36Z] <jayme> updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-staging - T289766

Mentioned in SAL (#wikimedia-operations) [2021-09-10T07:58:46Z] <jayme> updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-workers - T289766

Change 720670 had a related patch set uploaded (by Elukey; author: JMeybohm):

[operations/debs/rsyslog@debian/buster-wikimedia-k8s] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/720670

Change 720670 merged by Elukey:

[operations/debs/rsyslog@debian/buster-wikimedia-k8s] Suspend mmkubernetes on connection errors

https://gerrit.wikimedia.org/r/720670

Change 720677 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] rsyslog/kubernetes: Enable resume/retry for mmkubernetes

https://gerrit.wikimedia.org/r/720677

Change 720677 merged by JMeybohm:

[operations/puppet@production] rsyslog/kubernetes: Enable resume/retry for mmkubernetes

https://gerrit.wikimedia.org/r/720677

JMeybohm updated the task description. (Show Details)
JMeybohm reopened this task as Open.EditedDec 10 2021, 5:28 PM

Unfortunately this does not seem to be enough. During a kubernetes master reboot (for T294119) mmkubernetes failed on both staging-codfw nodes and never revovered:

Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:24:19 kubestage2001 rsyslogd[27364]: mmkubernetes: Forbidden: no access - check permissions to view url [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-wdw48]  [v8.1901.0]
Nov 24 15:24:19 kubestage2001 rsyslogd[27364]: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/kube-system/pods/coredns-dc4c94565-crsl5] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: mmkubernetes: failed to connect to [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater] - 7:Couldn't connect to server  [v8.1901.0]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' suspended (module 'mmkubernetes'), retry 0. There should be messages before this one giving the reason for suspension. [v8.1901.0 try https://www.rsyslog.com/e/2007 ]
Nov 24 15:23:46 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' resumed (module 'mmkubernetes') [v8.1901.0 try https://www.rsyslog.com/e/2359 ]
Nov 24 15:24:19 kubestage2002 rsyslogd[29758]: mmkubernetes: Forbidden: no access - check permissions to view url [https://kubestagemaster.svc.codfw.wmnet:6443/api/v1/namespaces/rdf-streaming-updater]  [v8.1901.0]
Nov 24 15:24:19 kubestage2002 rsyslogd[29758]: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

The generic "rsyslog_action" matrics for mmkubernetes don't show any issues:

jayme@kubestage2002:~$ curl -s 10.192.16.137:9105/metrics | grep mmkuber
rsyslog_action_failed{action="mmkubernetes"} 0
rsyslog_action_processed{action="mmkubernetes"} 3.625691e+06
rsyslog_action_resumed{action="mmkubernetes"} 0
rsyslog_action_suspended{action="mmkubernetes"} 0
rsyslog_action_suspended_duration{action="mmkubernetes"} 0

But rsyslog_kubernetes_pod_metadata_error_total and rsyslog_kubernetes_namespace_metadata_error_total do have values > 0

Change 752652 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Add rule to alert on terminated/failing rsyslog mmkubernetes

https://gerrit.wikimedia.org/r/752652

Change 752652 merged by jenkins-bot:

[operations/alerts@master] Add rule to alert on terminated/failing rsyslog mmkubernetes

https://gerrit.wikimedia.org/r/752652

JMeybohm lowered the priority of this task from High to Low.Jan 11 2022, 9:45 AM

I've added a prometheus alert that fires if mmkubernetes has not processed any records for 5min.
Lowering priority but keeping this open as the actual issue is not resolved and we should do something about it "when we have time"

Mar 11 08:45:01 kubernetes2011 rsyslogd: mmkubernetes: server returned unexpected code [500] for url [https://kubemaster.svc.codfw.wmnet:6443/api/v1/namespaces/cert-manager/pods/cert-manager-58cf4b4966-4znql]  [v8.1901.0]
Mar 11 08:45:01 kubernetes2011 rsyslogd: action 'mmkubernetes' (module 'mmkubernetes') message lost, could not be processed. Check for additional error messages before this one. [v8.1901.0]

Mentioned in SAL (#wikimedia-operations) [2022-03-11T09:00:51Z] <jayme> kubernetes2011:~# systemctl restart rsyslog.service - T289766