
[toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools
Open, High, Public

Description

Currently, tools just default to writing log files onto NFS. While simple, this causes a number of problems:

  • It adds additional load on our NFS server, which is already struggling
  • Logrotate is a PITA with NFS
  • The logs are not accessible from the API, which prevents using them in other places (like the Toolforge UI)
  • K8s container logs are only available for a limited time

What

A service, exposed through the Toolforge API (https://api-docs.toolforge.org) and implemented for now as a cli subcommand (toolforge logs), that allows you to:

  • Get the logs for one of your jobs
  • Get the logs for your webservice
  • Get the logs for your scheduled jobs (we have to think about how to expose different runs)
  • Get the logs for your one-off jobs
  • Get the logs for your builds
  • In addition to persisting the logs, we need to persist what logs are persisted and where they came from, maybe a table relating a given log to a given container/job + access status.
  • There should be a retention policy, probably based on size, more than age of the logs (or both, but if possible not only age).
  • Should be extensible enough to allow future use cases like:
    • Add platform-level logs ("your webservice was deployed" kind of thing, or "failed to build"), maybe some kind of 'activity' log for your tool (similar to the Horizon activity tab for an instance)
  • Ideally it should be easy/transparent for users to use (if we can just pipe the k8s pods' output there, awesome), but it should also allow pushing custom logs (for future use cases, like the platform-level ones mentioned above)
  • It should only allow members of a specific tool to access that tool's logs (see how other services use the api-gateway for user<->api authentication)
What not
  • Not handling logs persisted in the tools home (NFS), only logs generated from the run in k8s
  • Supporting access to the logs as a filesystem (be that NFS or anywhere else) is out of scope; access is only through the cli

Access to any other UI (ex. Grafana, Kibana, Graylog, Elasticsearch…) is secondary, and not necessary for this implementation. If available and easy to implement, it should be clearly shown (notice banner or something) that it is not stable and should not be relied on. Using the api and toolforge logs is the future-proof way of accessing your tool's logs (and sufficient for most use cases, our current focus). Like kubectl logs, those UIs might change, get deprecated, or just stop working without notice when we upgrade k8s.

API structure

There are some "common standards" that we use for Toolforge APIs:

  • Generate the code from the OpenAPI definition or the definition from the code, but don't maintain it manually (essentially, copy-paste the boilerplate from one of the existing ones, components-api/fastapi or envvars-api/golang); that will also bring the CI and deployment boilerplate
  • All APIs have the endpoints
    • /v1/healthz endpoint used from the api-gateway
    • /v1/metrics endpoint used for prometheus metrics
    • /openapi.json endpoint used for the api-gateway
  • All endpoints have the prefix /<version>/tool/<toolname>/…, for example /v1/tool/wm-lol/builds (see https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/blob/main/openapi/openapi.yaml?ref_type=heads), the api-gateway adds then another prefix /<api_name> to it (see https://api-docs.toolforge.org)
  • All endpoints except /openapi.json and /v1/metrics return a JSON response, wrapped in an object {"messages": {"info": [], "warning": [], "error": []}, …} where the messages are meta-messages ('deployment started correctly', 'api endpoint deprecated', 'unable to find deployment', …); the cli automatically shows those in colors and such.
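As a sketch, a call through the api-gateway and the envelope it returns might look like this (the "logs" api name, the endpoint path, and the example message are assumptions for illustration, not the real schema):

```shell
# Hypothetical request, following the /<api_name>/<version>/tool/<toolname>/...
# convention ("logs" as the api name is an assumption):
#   curl https://<api-gateway>/logs/v1/tool/wm-lol/logs
# Every response is wrapped in the meta-messages envelope:
response='{"messages": {"info": ["deployment started correctly"], "warning": [], "error": []}}'
# The cli extracts these and shows them in colors; the same extraction
# done by hand from a shell:
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["messages"]["info"][0])'
```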
How

It should be part of the toolforge set of services, this means it should be deployed in a similar way (using toolforge-deploy), using repos under the toolforge group in gitlab, similar CI/testing/etc. that the other toolforge services use.

It should preferably be deployable in lima-kilo too, as much as makes sense (up to the implementer on where to put that line, ex. just adding local storage vs adding local s3 integration, ...).

K8s specifics

K8s stores logs per container: a dir per pod, a dir per container, and a logfile inside /var/log/pods – users can only see these via 'toolforge webservice logs' / 'toolforge jobs logs', which run 'kubectl logs' under the hood.
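To make that layout concrete, here is a runnable sketch that fakes the on-node directory structure and reads it back the way the kubelet does for kubectl logs (all names are made up, and the exact path pattern can vary by k8s version):

```shell
# Fake the /var/log/pods layout described above:
#   /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart-count>.log
root=$(mktemp -d)
mkdir -p "$root/tool-example_mypod-abc123_0000-1111/webservice"
printf '2025-01-01T00:00:00Z stdout F hello from the tool\n' \
    > "$root/tool-example_mypod-abc123_0000-1111/webservice/0.log"

# 'kubectl logs' (and thus 'toolforge webservice logs') ultimately
# serves the content of these per-container files:
line=$(cat "$root"/tool-example_*/webservice/0.log)
echo "$line"
```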

Other related tasks:

Event Timeline

dcaro renamed this task from Provide modern, non-NFS error log solution for Toolforge webservices and bots to [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots.Feb 21 2024, 10:18 AM
dcaro renamed this task from [toolforge] Provide modern, non-NFS log solution for Toolforge webservices and bots to [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots.Mar 5 2024, 4:10 PM

For anyone curious

for namespace in $(kubectl get ns | tail -n +2 | awk '{print $1}') ;
do
    for pod in $(kubectl get pods -n ${namespace} | tail -n +2 | awk '{print $1}') ;
    do
        kubectl -n ${namespace} logs ${pod} --all-containers --since=24h
    done
done

Suggests that there are about 500 megabytes of logs in the last twenty-four hours, which in turn suggests that a monolithic Loki could work.
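For an actual number rather than an eyeballed one, the same loop can accumulate byte counts; a runnable sketch with a stand-in for the kubectl call (so it needs no cluster):

```shell
total=0
for pod in pod-a pod-b ; do
    # stand-in for: kubectl -n ${namespace} logs ${pod} --all-containers --since=24h
    bytes=$(printf 'line one\nline two\n' | wc -c)
    total=$((total + bytes))
done
echo "${total} bytes in the last 24h"
```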

I fear that this query is missing most of the logs. It ignores most toolforge-jobs jobs that are configured to log onto an NFS file, as well as any cron jobs that are not running at that specific time.

Fair enough, do you have any estimate on how much those logs would account for in a day?

dcaro renamed this task from [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots to [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools.Jan 8 2025, 9:11 AM
dcaro updated the task description. (Show Details)

Just updated the task requirements a bit to reflect the current status of Toolforge. We might want to create a proposal on how to expose the service to users (what cli commands to implement, api calls, what to show, etc.); let me know if you want me to bootstrap it.

Just out of curiosity, what's wrong with setting up a syslog server and letting all the tools log to it?

Toolforge is a multi-user, multi-tenant environment where there is no trust guarantee between the users. If everyone could see the logs from every tool there is a non-trivial chance that secrets would be exposed. It is not uncommon for secrets to end up in stack traces from application crashes as an example.

hey @rook I see some activity happening on T386480: [o11y,logging,infra] Deploy Loki to store Toolforge tool log data, and I'm curious about the architecture of this system, and have a few questions.

  • Most notably promtail is now deprecated in favor of grafana alloy
    • this implies a significant architecture change: from a sidecar container approach, to a centralized "log discovery & scraping" approach
    • this may be a good thing, as the notes from Taavi mentioned the need to create a mutating webhook, which would no longer exist :-)
  • Storage: I'm assuming openstack S3 will be used as the destination for loki to store logs
    • Would you create a bucket per tool? or a single bucket for all tools? Do you have any thoughts on this?

I had read those notes, though I haven't considered them too much at this point. Currently I'm only working to get Loki into lima-kilo to see what happens in a live environment.

>   • Most notably promtail is now deprecated in favor of grafana alloy
>     • this implies a significant architecture change: from a sidecar container approach, to a centralized "log discovery & scraping" approach
>     • this may be a good thing, as the notes from Taavi mentioned the need to create a mutating webhook, which would no longer exists :-)
>   • Storage: I'm assuming openstack S3 will be used as the destination for loki to store logs
>     • Would you create a bucket per tool? or a single bucket for all tools? Do you have any thoughts on this?

Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

Indeed I was working off of a different set of Loki documentation, https://grafana.com/docs/loki/latest/get-started/deployment-modes/, which suggests monolithic is fine up to 20 GB/day, and it is unclear to me that we process more than that. Though at this point the intention is only to get a prototype working in lima-kilo, so it can easily be changed should the real deployment have different requirements.

All in all my hope is to get the project moving, as there has been sporadic discussion over the nine years that this ticket has been open, but of course things have changed in that time and fresh opinions are needed. Considering that my patches caught your attention and commentary, it appears that I was successful :)

> Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

It would be possible to connect to s3 from loki in lima-kilo if we wanted (ex. creating buckets on toolsbeta repo, or a specific project for lima-kilo), though I'd start without that connection and add it only later if needed.
Maybe test it manually once before deploying on toolsbeta if it makes it easier to develop, but not enabled by default in lima-kilo.

> Indeed I was working off of a different set of loki documentation https://grafana.com/docs/loki/latest/get-started/deployment-modes/ that suggests monolithic is fine up to 20gb/day, which it is unclear to me that we process more of that. Though at this point the intention is only to get a prototype working in lima-kilo, as such can easily be changed should the deployment install have different requirements.

+1 on starting simple and moving to the distributed deployment when we find the simple one is not enough (given our current guess), unless you find that using the monolithic one prevents us from using the distributed one somehow.

> Haven't considered this part thus far no. Current focus is getting loki into lima-kilo which I don't believe supports s3, no?

We don't have S3 in lima-kilo, but there are a number of options to emulate an S3 endpoint for development purposes (minio, openstack swift, localstack, https://s3ninja.net/, etc.)

I guess the right choice for lima-kilo depends on how the store will be used, and how we can best represent the final production Toolforge architecture (i.e., if we are going to use openstack swift in production, it may make sense to introduce swift to lima-kilo).

I think I'm feeling inclined towards each tool having a dedicated S3 bucket for logs. Feels easier for us, to control quotas, to implement whatever multi-tenancy controls, to do management etc.

> All in all my hope is to get the project moving. As there has been sporadic discussion over the last nine years that this ticket has been open, but of course things have changed in that time and recent opinions are needed. Considering that my patches caught your attention and commentary it appears that I was successful :)

As you figured, installing loki was the easy part. Doing the rest of the architecture may be a bit more challenging :-P

fwiw, while I haven't read all of the recent discussion, I'm still very interested in this project and would be happy to spend some time on it. I'm not touching anything atm because I don't want to step on anyone's toes, since you all seem to already be working on it, but please do let me know if there's some way I can be helpful here.

We discussed loki storage a bit, it seems like we can get a decent prototype in lima-kilo with https://grafana.com/docs/loki/latest/operations/storage/tsdb/ and a local filesystem; does that seem like an OK place to start?

Then we can either create a toy s3 implementation in lima-kilo or just add a switch that uses a different backend in tools/toolsbeta.
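A minimal sketch of what that could look like, assuming a monolithic Loki with the TSDB index and a local filesystem object store (all values are illustrative and untested, not tied to any actual deployment):

```yaml
# Sketch: monolithic Loki, TSDB index, local filesystem storage
auth_enabled: true            # multi-tenant mode, e.g. one tenant per tool (assumption)
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: "2025-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks
```

Swapping the filesystem object store for S3 later would mostly mean changing object_store and storage_config, which keeps the lima-kilo vs tools/toolsbeta switch small.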

I have been thinking about this. What if a design & architecture document is created before building a prototype?

This can be a wikitech page (inspiration), which describes every aspect of the logging system we are going to build, and how the different pieces of software will be glued and work together.

Some of the questions I think are important:

  • How is storage going to be implemented (s3 per-tool bucket, per-job bucket? shared bucket for all tools?)
  • How will multi-tenancy security & isolation be implemented in the ingestion, storage and consumption sides? Ex. What controls will be in place to prevent tool-X from reading logs owned by tool-Y.
  • How ingestion is going to be configured. I mentioned here https://phabricator.wikimedia.org/T127367#10562687 the move from a sidecar container approach to a centralized "log discovery & scraping" approach
  • Will we need any new admission webhooks?
  • What changes will be needed for other Toolforge components? For example, will jobs-api need to inject additional config for jobs being created? if so, what?

In my opinion, this design & architecture phase should be completed before any implementation work is started. The work that needs to be done at this stage is this design.

There's a balance to be struck between defining too much and not defining enough. Currently we have some idea of what tools can be used for this (ex. loki) to start trying to create a POC, even if implementation details are not yet clear.

In that regard and at this point, I think that going forward with the POC will answer those questions (and more!) more effectively than just reading documentation and gluing all the theoretical pieces together in a wiki.

Similar to what was done by Taavi to figure out what's in the wiki page he wrote.

That does not mean that everything is figured out, or that the POC will be shipped to production as-is; it just means that the POC will help us learn by doing, and avoid some theoretical back-and-forth by leaving only the possibilities that are actually doable (iterate instead of making a big investment up front).

Note that that is just my opinion, I'm ok if @rook prefers going with the extra definition route instead.

Per the last cloud-services-team meeting, the rough plan here is:

  • Finish deploying Loki to store all the Kubernetes pod output created by tools (this is T386480)
  • Swap jobs-api to query logs from Loki instead of from Kubernetes (this is T398645)
  • Then we can remove support for file logging from jobs-api and migrate existing tools using it to Loki.

The unified logging service currently described in the task description is not in scope for the initial implementation.

Restoring the unified logging service description; it should not be lost, and should be tackled soon after (not in the initial implementation, but as the next step)

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1001

api-gateway: bump to 0.0.81-20251016082112-1c4f5a64

Change #1197587 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] p:toolforge::prometheus: add logs api

https://gerrit.wikimedia.org/r/1197587

Change #1197587 merged by David Caro:

[operations/puppet@production] p:toolforge::prometheus: add logs api

https://gerrit.wikimedia.org/r/1197587