
Jupyter notebook logs should appear in Logstash
Closed, ResolvedPublic

Description

As part of T287339: jupyter notebook causing syslog/etc.. to fill up with error messages a configuration change was made which should have caused Jupyter notebooks to be sent to our ELK+ system.

  • A field of SyslogIdentifier=jupyterhub-conda-singleuser was added to the systemd service related to each user's notebook
  • An entry for this programname was added to the file: /etc/rsyslog.lookup.d/lookup_table_output.json of the form: {"index" : "jupyterhub-conda-singleuser", "value" : "kafka" }
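For reference, rsyslog lookup tables wrap entries like that one in a `table` array. A minimal, self-contained reconstruction of the relevant part of the file (the `version`/`nomatch` wrapper is the standard rsyslog lookup-table shape; the exact surrounding contents of the real file are assumed here) can be sanity-checked like so:

```shell
# Recreate a minimal stand-in for /etc/rsyslog.lookup.d/lookup_table_output.json
cat > /tmp/lookup_table_output.json <<'EOF'
{
  "version": 1,
  "nomatch": "default",
  "type": "string",
  "table": [
    {"index" : "jupyterhub-conda-singleuser", "value" : "kafka" }
  ]
}
EOF
# Confirm the programname is routed to the kafka output
entry=$(grep jupyterhub-conda-singleuser /tmp/lookup_table_output.json)
echo "$entry"
```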

However the logs aren't yet appearing in Logstash, although I thought that this configuration would be all that was required.

As an aside, we submitted a patch to the systemdspawner (https://github.com/jupyterhub/systemdspawner/pull/83) which enables interpolation of {USERNAME} and {USERID} variables. This might be useful down the line in differentiating different users' notebooks from each other in Logstash.

Event Timeline

BTullis added subscribers: Ottomata, razzi.

I think that I can see why logs aren't appearing in Logstash for this.

On each of the stat100x servers we can list the notebooks that have been run since this configuration change was put in place, e.g. on stat1006:

btullis@stat1006:~$ find /var/run/systemd/transient/ -name jupyter-* -print0|xargs -0 grep -i syslogIdentifier
/var/run/systemd/transient/jupyter-ebernhardson-singleuser.service:SyslogIdentifier=jupyterhub-conda-singleuser
/var/run/systemd/transient/jupyter-milimetric-singleuser.service:SyslogIdentifier=jupyterhub-conda-singleuser
/var/run/systemd/transient/jupyter-btullis-singleuser.service:SyslogIdentifier=jupyterhub-conda-singleuser
/var/run/systemd/transient/jupyter-piccardi-singleuser.service:SyslogIdentifier=jupyterhub-conda-singleuser
/var/run/systemd/transient/jupyter-iflorez-singleuser.service:SyslogIdentifier=jupyterhub-conda-singleuser

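Incidentally, the unquoted `jupyter-*` in that find command only works because nothing in the current directory happens to match it; quoting the pattern keeps the shell from expanding the glob before find sees it. A self-contained illustration (the temp directory and file names are made up):

```shell
# Simulate the transient unit directory with dummy files
mkdir -p /tmp/transient-demo
touch /tmp/transient-demo/jupyter-alice-singleuser.service \
      /tmp/transient-demo/jupyter-bob-singleuser.service \
      /tmp/transient-demo/dbus.service
# Quoting 'jupyter-*' ensures find, not the shell, expands the glob
matches=$(find /tmp/transient-demo -name 'jupyter-*' | sort)
echo "$matches"
```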
I can see that log entries were correctly created with SyslogIdentifier=jupyterhub-conda-singleuser when I set up my conda environment:

btullis@stat1006:~$ sudo journalctl -u jupyter-btullis-singleuser.service|head -n 5
-- Logs begin at Sat 2021-10-02 18:41:11 UTC, end at Mon 2021-10-11 10:50:27 UTC. --
Oct 05 19:15:29 stat1006 systemd[1]: Started /bin/bash -c cd /home/btullis && exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh __NEW__ --port=54875 --NotebookApp.default_url=/lab.
Oct 05 19:15:29 stat1006 jupyterhub-conda-singleuser[15526]: Creating new stacked conda env 2021-10-05T19.15.29_btullis...
Oct 05 19:15:36 stat1006 jupyterhub-conda-singleuser[15526]: Collecting package metadata (current_repodata.json): ...working... done
Oct 05 19:15:37 stat1006 jupyterhub-conda-singleuser[15526]: Solving environment: ...working... done

I deployed this change to rsyslog which seems to have been applied correctly.

btullis@stat1006:~$ grep jupyterhub-conda-singleuser /etc/rsyslog.lookup.d/lookup_table_output.json 
    {"index" : "jupyterhub-conda-singleuser", "value" : "kafka" }

However, since these logs are not native JSON, I think that I need to apply the ecs_170 conversion template to them.

Here is a snippet from /etc/rsyslog.d/30-output-kafka.conf:

# if ecs_170 in log_outputs, use that template to format
# non-json-formatted syslog events into an ecs-compatible form
if ( $.log_outputs contains "ecs_170" ) then {
    action(type="omkafka"
           broker=["kafka-logging1001.eqiad.wmnet:9093","kafka-logging1002.eqiad.wmnet:9093","kafka-logging1003.eqiad.wmnet:9093"]
           topic="kafka_topic"
           dynatopic="on"
           dynatopic.cachesize="1000"
           partitions.auto="on"
           template="ecs_170"

Without this, it tries to parse the logs as JSON and fails. I think that this is why we're not yet seeing these logs in Logstash here: https://logstash.wikimedia.org/app/discover
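The failure mode is easy to demonstrate: a raw syslog line, like the journal output above, is not valid JSON, so an output path that expects JSON rejects it. This is exactly what the ecs_170 template works around. The check below is just an illustration using one of the log lines from the journal:

```shell
line='Creating new stacked conda env 2021-10-05T19.15.29_btullis...'
# Try to parse the raw log line as JSON, the way the non-templated path would
if python3 -c 'import json,sys; json.loads(sys.argv[1])' "$line" 2>/dev/null; then
    verdict="valid json"
else
    verdict="not json"
fi
echo "$verdict"
```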

Change 729957 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the ecs_170 tag to the jupyterhub log pipeline

https://gerrit.wikimedia.org/r/729957

Change 729957 merged by Btullis:

[operations/puppet@production] Add the ecs_170 tag to the jupyterhub log pipeline

https://gerrit.wikimedia.org/r/729957

That ecs_170 patch has definitely worked to some extent. I have seen my first message in Logstash, but not yet as many as I was expecting.


I will keep investigating and possibly make a dashboard to make these logs more user-accessible.

I have created a basic dashboard for these JupyterHub user notebook logs in Kibana.
Jupyter Hub

I've also added a link to it from the Home Dashboard.

It's pretty rudimentary at the moment, because it doesn't allow us to differentiate one user's notebook from another. The only thing that it allows us to do is to filter easily based on which stat host is sending the messages.


However, I think that it should not be too hard to add extra functionality to this, now that we have the log shipping mechanism in place and our upstream patch to systemdspawner was merged.
https://github.com/jupyterhub/systemdspawner/pull/83

razzi changed the task status from Open to In Progress.Oct 19 2021, 4:15 PM
razzi moved this task from Done to In Code Review on the Analytics-Kanban board.

Discussed in standup today, this has stopped producing: https://logstash.wikimedia.org/app/dashboards#/view/af2a44f0-2c0e-11ec-81e9-e1226573bad4?_g=h@bb5fdf8&_a=h@df0be1b

Kicking back to code review

The last message reported to be in Logstash with the program field of jupyterhub-conda-singleuser was timestamped at 09:41 on October 13th.


I know that fresh logs are still being produced by JupyterHub, because I can tail the current logs of my own notebook as shown:

sudo journalctl -f -u jupyter-btullis-singleuser

Log lines just from browsing the notebook look like this:

Oct 27 10:18:43 stat1008 jupyterhub-conda-singleuser[29896]: [I 2021-10-27 10:18:43.390 SingleUserNotebookApp log:174] 200 GET /user/btullis/api/contents?content=1&1635329923155 (btullis@127.0.0.1) 4.86ms
Oct 27 10:18:43 stat1008 jupyterhub-conda-singleuser[29896]: [I 2021-10-27 10:18:43.616 SingleUserNotebookApp log:174] 200 GET /user/btullis/api/sessions?1635329923331 (btullis@127.0.0.1) 3.74ms
Oct 27 10:18:43 stat1008 jupyterhub-conda-singleuser[29896]: [I 2021-10-27 10:18:43.914 SingleUserNotebookApp log:174] 200 GET /user/btullis/api/kernelspecs?1635329923687 (btullis@127.0.0.1) 5.18ms

So the questions I have are:

  • Where do these messages go, if not into Logstash/Elasticsearch?
  • Do they make it as far as the Kafka topic?
  • Are they consumed from the Kafka topic?
  • Are they filtered out by Logstash for some reason?

Do they make it as far as the Kafka topic?

To check this, log into kafka-logging1001 and use kafkacat -C -b localhost:9092 -t <topic-name> -o end to tail the logs.
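Since kafkacat is only usable from the Kafka hosts themselves, here is a self-contained stand-in for the same filtering step, using two fabricated ECS events in place of the live stream:

```shell
# In production the input would come from:
#   kafkacat -q -C -b localhost:9092 -t rsyslog-info -o end
events='{"service.type":"jupyterhub-conda-singleuser","message":"200 GET /user/x/api/sessions"}
{"service.type":"sshd","message":"session opened"}'
# Count events from the notebook service, filtering the stream with grep
matched=$(printf '%s\n' "$events" | grep -c 'jupyterhub-conda-singleuser')
echo "$matched"
```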

I have verified that there are logs appearing in the kafka topic: rsyslog-info

{"timestamp":"2021-10-21T10:18:45.128895+00:00", "message":"[I 2021-10-21 10:18:45.127 SingleUserNotebookApp log:174] 200 GET \/user\/akhatun\/api\/kernels?1634811507303 (akhatun@127.0.0.1) 3.24ms", "host.name":"stat1008", "log.syslog.severity.code":"6", "log.syslog.severity.name":"info", "log.level":"info", "log.syslog.facility.code":"3", "log.syslog.facility.name":"daemon", "log.syslog.priority":"30", "service.type":"jupyterhub-conda-singleuser", "ecs.version": "1.7.0"}
{"timestamp":"2021-10-21T10:18:45.305505+00:00", "message":"[I 2021-10-21 10:18:45.305 SingleUserNotebookApp log:174] 200 GET \/user\/akhatun\/api\/contents\/Notebooks\/WD%20subgraph?content=1&1634811507504 (akhatun@127.0.0.1) 12.06ms", "host.name":"stat1008", "log.syslog.severity.code":"6", "log.syslog.severity.name":"info", "log.level":"info", "log.syslog.facility.code":"3", "log.syslog.facility.name":"daemon", "log.syslog.priority":"30", "service.type":"jupyterhub-conda-singleuser", "ecs.version": "1.7.0"}
{"timestamp":"2021-10-21T10:42:17.604496+00:00", "message":"[I 2021-10-21 10:42:17.604 SingleUserNotebookApp log:174] 200 GET \/user\/akhatun\/api\/contents\/Notebooks\/scholarly%20article?content=1&1634812845859 (akhatun@127.0.0.1) 17.00ms", "host.name":"stat1008", "log.syslog.severity.code":"6", "log.syslog.severity.name":"info", "log.level":"info", "log.syslog.facility.code":"3", "log.syslog.facility.name":"daemon", "log.syslog.priority":"30", "service.type":"jupyterhub-conda-singleuser", "ecs.version": "1.7.0"}

OK, sounds like a Logstash filter/ingestor problem then. @colewhite maybe can help?

Thanks @Ottomata

The -o end helped there. I was able to match one refresh of my notebook to a single Kafka event.
I was browsing from a stat box, but otherwise the command was the same:

kafkacat -q -C -b kafka-logging1001.eqiad.wmnet:9092 -t rsyslog-info -o end |grep jupyterhub-conda-singleuser
{"timestamp":"2021-10-27T13:46:31.647139+00:00", "message":"[I 2021-10-27 13:46:31.646 SingleUserNotebookApp log:174] 200 GET \/user\/btullis\/api\/contents?content=1&1635342391442 (btullis@127.0.0.1) 39.25ms", "host.name":"stat1008", "log.syslog.severity.code":"6", "log.syslog.severity.name":"info", "log.level":"info", "log.syslog.facility.code":"1", "log.syslog.facility.name":"user", "log.syslog.priority":"14", "service.type":"jupyterhub-conda-singleuser", "ecs.version": "1.7.0"}

I have found and fixed the problem!

The logs were being saved to a different index pattern by the Logstash output configuration: index => ecs-1.7.0-5-%{[@metadata][partition]}-%{+YYYY.ww}, and the program field was mapped to the service.type field.
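That index pattern is week-based: in Joda-time (which Logstash sprintf dates use) YYYY is the week-year and ww the week of that year, which should match the ISO week date that GNU date prints with %G.%V. So the index a given event lands in can be reproduced as below (partition 0 is just an illustrative value for the %{[@metadata][partition]} placeholder):

```shell
ts="2021-10-13"                    # date of the last message seen in Logstash
week=$(date -u -d "$ts" +%G.%V)    # ISO week-year and week number
index="ecs-1.7.0-5-0-$week"
echo "$index"                      # the weekly index this event would land in
```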


I have updated the saved query in Logstash and the JupyterHub dashboard.

Related: T288623

It seems to me important to get the word out that ECS adoption necessitates the use of a separate index pattern, and that there is documentation around field names and types: https://doc.wikimedia.org/ecs/#ecs-field-reference