[toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots
Open, High, Public

Description

Currently, tools just default to writing log files on to NFS. While simple, this causes a number of problems:

  1. It adds additional load on our NFS server, which isn't doing great as it is
  2. There's a delay between the logs being written on the exec node and being readable on bastion, which is both confusing and annoying
  3. Logrotate is a PITA with GridEngine + NFS

A solution (probably based on Elasticsearch, to mirror what we have in production) should allow us to do the following:

  1. Take load off NFS
  2. Make it far faster to see the actual logs from processes
  3. Be able to search through logs more easily
  4. Automatically drop older logs
  5. Provide a filesystem-based interface for log ingress
  6. Provide more standard and modern interfaces (e.g. GELF) for log ingress (see the sketch after this list)
  7. Provide a filesystem-based interface for log reading
  8. Provide a more modern interface for log reading as well
  9. Be secure in allowing only authenticated members to read a particular tool's logs.
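
For a concrete sense of what point 6 could mean from a tool's perspective, here is a minimal sketch of GELF ingress over UDP. Everything here is hypothetical: logs.example and the gelf_log helper don't exist, and 12201 is merely Graylog's conventional GELF UDP port.

# Minimal sketch: emit a GELF 1.1 message over UDP with netcat.
# logs.example:12201 is a hypothetical endpoint, not a real Toolforge service.
# GELF accepts plain uncompressed JSON datagrams; for simplicity this assumes
# the message contains no characters that need JSON escaping.
gelf_log() {
    local message="$1"
    printf '{"version": "1.1", "host": "%s", "short_message": "%s", "_tool": "%s"}' \
        "$(hostname)" "${message}" "${USER}" \
        | nc -u -w1 logs.example 12201
}

gelf_log "webservice started"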

This is the tracking ticket for this overhaul.

This is specifically *only* for Toolforge, and not for use by general Cloud VPS projects, mostly due to concern 9.

Other related tasks:

Event Timeline

Graylog (https://github.com/graylog2/graylog2-server, https://www.graylog.org/) seems to cover most of these points:

  1. Take load off NFS - logs are stored on an Elasticsearch cluster
  2. Make it far faster to see the actual logs from processes - doesn't depend on NFS, so should be fast
  3. Be able to search through logs easier - searching is easy: http://docs.graylog.org/en/2.0/pages/queries.html, "The search syntax is very close to the Lucene syntax. By default all message fields are included in the search if you don’t specify a message field to search in."
  4. Automatically drop older logs - index rotation can be configured based on message count, index size, or index time: http://docs.graylog.org/en/2.0/pages/index_model.html#eviction-of-indices-and-messages
  5. Provide a Filesystem based interface for log ingress - Graylog supports this: http://docs.graylog.org/en/2.0/pages/sending_data.html#reading-from-files, "we provide the Collector Sidecar which acts as a supervisor process for other programs, such as nxlog and Filebeats, which have specifically been built to collect log messages from local files and ship them to remote systems like Graylog."
  6. Provide more standard and modern interfaces (gelf? etc) for log ingress - Graylog supports GELF and syslog: http://docs.graylog.org/en/2.0/pages/sending_data.html
  7. Provide a filesystem based interface for log reading - I don't think this is supported, but you can export search results to CSV: http://docs.graylog.org/en/2.0/pages/queries.html#export-results-as-csv
  8. Provide a more modern interface for log reading as well - Graylog's interface looks fairly modern and easy to use to me:
    (screenshots: Graylog overview drilldown and overview dashboard views)
  9. Be secure in allowing only authenticated members to read a particular tool's logs. - Graylog has access control out-of-the-box, and can integrate with LDAP users and groups: http://docs.graylog.org/en/2.0/pages/users_and_roles/external_auth.html#ldap-active-directory

Graylog has Streams (basically categories for log messages): http://docs.graylog.org/en/2.0/pages/streams.html, alerting based on those streams: http://docs.graylog.org/en/2.0/pages/getting_started/stream_alerts.html, and dashboards: http://docs.graylog.org/en/2.0/pages/dashboards.html.
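
For points 3 and 8, reading logs programmatically could look something like this. This is a sketch against Graylog 2.x's REST search API; graylog.example and GRAYLOG_TOKEN are hypothetical, not an existing deployment.

# Minimal sketch: fetch the last hour of one tool's messages through
# Graylog's REST API. Token auth uses the token as the username and the
# literal string "token" as the password.
curl -s -u "${GRAYLOG_TOKEN}:token" -H 'Accept: application/json' \
    'https://graylog.example/api/search/universal/relative?query=source%3Amytool&range=3600&limit=50'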

Thank you for finding this, and taking the time to work out all the points @tom29739! I think 7) is not too important -- the use case (as far as I can tell) is users who need to use the output of one script as input to the next. For those users, it's totally fine to just write to NFS.

In general, I think our first target is the webservice logs, and not so much the generic job output files -- those tend not to grow to such crazily large sizes.

bd808 renamed this task from Overhaul logging setup for Tools (Tracking) to Provide modern, non-NFS error log solution for Toolforge webservices and bots. Nov 27 2018, 11:38 PM
bd808 added a project: Epic.
bd808 updated the task description. (Show Details)

Another related (maybe dependent?) task: T293672

https://kubernetes.io/docs/concepts/cluster-administration/logging/ also contains a few pointers on how such an architecture could be introduced on Kubernetes.
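
The node-level pattern from that page boils down to an agent (fluentd, promtail, etc.) tailing the per-container files the kubelet leaves on each node. A minimal sketch of what such an agent consumes, assuming a standard kubelet setup (paths and the tool-mytool namespace are illustrative):

# On a worker node, container stdout/stderr lands under /var/log/pods,
# with per-container symlinks in /var/log/containers named
# <pod>_<namespace>_<container>-<id>.log.
sudo ls /var/log/containers/
# Tail one tool's webservice logs straight from the node:
sudo tail -f /var/log/containers/mytool-*_tool-mytool_*.log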

dcaro renamed this task from Provide modern, non-NFS error log solution for Toolforge webservices and bots to [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots. Feb 21 2024, 10:18 AM
dcaro renamed this task from [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots to [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots. Mar 5 2024, 4:10 PM

For anyone curious, running the following:

# Dump the last 24h of logs from every pod in every namespace.
for namespace in $(kubectl get ns --no-headers | awk '{print $1}'); do
    for pod in $(kubectl get pods -n "${namespace}" --no-headers | awk '{print $1}'); do
        kubectl -n "${namespace}" logs "${pod}" --all-containers --since=24h
    done
done

shows about 500 megabytes of logs in the last twenty-four hours, which suggests that a monolithic Loki could work.
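
A variant of the same loop that counts the bytes directly, rather than eyeballing the dumped output (a sketch using the same kubectl calls):

# Minimal sketch: sum the byte size of the last 24h of pod logs.
total=0
for namespace in $(kubectl get ns --no-headers | awk '{print $1}'); do
    for pod in $(kubectl get pods -n "${namespace}" --no-headers | awk '{print $1}'); do
        size=$(kubectl -n "${namespace}" logs "${pod}" --all-containers --since=24h | wc -c)
        total=$((total + size))
    done
done
echo "pod log bytes in the last 24h: ${total}"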

I fear that this query is missing most of the logs. It ignores most toolforge-jobs jobs that are configured to log to a file on NFS, as well as any cron jobs that are not running at that specific time.

Fair enough. Do you have any estimate of how much those logs would amount to in a day?
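
One rough way to get such a ballpark from the NFS side would be a sketch like this, run where the tool homes are mounted. It assumes the common *.log / *.out / *.err naming directly under /data/project/<tool>/, and it counts whole files modified in the last day, so it overestimates long-lived logs and misses anything named or nested differently:

# Minimal sketch: sum the sizes of log-like files touched in the last 24h.
find /data/project -maxdepth 2 \
    \( -name '*.log' -o -name '*.out' -o -name '*.err' \) \
    -mtime -1 -printf '%s\n' 2>/dev/null \
    | awk '{ total += $1 } END { printf "%.1f MiB\n", total / (1024 * 1024) }'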