One of the recurring issues in Toolforge is NFS disk utilization and log files (T183920, T206239, T233120). For various reasons, we have never come up with a solid recommendation for Tool maintainers about how to manage the log files that their processes generate (T68623, T152235, T127367). The TL;DR version is that several of the common log generation points, including lighttpd and grid engine jobs, do not handle file rotation robustly.
With rotation being challenging, the next best option is to look for ways to reduce logging in general. One potential piece of low-hanging fruit for log reduction is disabling access.log generation for lighttpd (the web server software used by many webservice backends). Today we generate a lighttpd config that includes:
server.modules = ( ... "mod_accesslog", ... )
accesslog.use-syslog = "disable"
accesslog.filename = "/data/project/<tool>/access.log"
Based on my reading of the upstream docs for mod_accesslog, if we removed the accesslog.filename = "..." line from the default config no access.log output would be generated. It would then be possible for a given tool to use the $HOME/.lighttpd.conf file to add the accesslog.filename = "..." stanza back into their composite configuration if desired.
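As a rough sketch of what that per-tool opt-in could look like, a tool's $HOME/.lighttpd.conf might contain a single line like the one below. This assumes the generated default config still loads mod_accesslog and that the local file is merged into it as described above. The path simply mirrors the current default and `<tool>` is a placeholder, not a real tool name.

```
# $HOME/.lighttpd.conf (illustrative only; <tool> is a placeholder)
# Assumes "mod_accesslog" is still listed in server.modules by the
# generated default config, so only the filename needs restoring.
accesslog.filename = "/data/project/<tool>/access.log"
```

A tool that sets this would be expected to manage the size of the resulting file itself, since rotation is not handled robustly.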
My operating theory here is that the majority of tool maintainers never use/process their access.log data. Some polling should be done to test this theory before making any sweeping change.
Update:
The decision going forward is not to enable access.log by default. If needed, a user can enable it by overriding the setting in their ~/.lighttpd.conf file.
TODO:
- Code the new behavior (remove the accesslog.filename = "..." line from the generated lighttpd config)
- Document the new behavior on wikitech as a News/... article including instructions on how to add local config to generate the access.log if needed by a Tool
- Add a message that will be shown on start and restart commands linking to the News/... article
- Merge and deploy the updated webservice code including Docker image rebuilds
- Announce the new behavior to cloud-announce including link to News/... article
- Wait one week for folks to restart things normally and pick up the new behavior in their tools
- Force restart lighttpd-powered webservices on the grid and Kubernetes that have been running since before the new webservice was deployed
- Remove notice of behavior change from webservice in gerrit so that the next build + deploy will remove the notice