
Remove access.log generation from default lighttpd.conf generated by `webservice`
Open, Needs Triage · Public

Description

One of the recurring issues in Toolforge is NFS disk utilization and log files (T183920, T206239, T233120). For various reasons, we have never come up with a solid recommendation for Tool maintainers about how to manage the log files that their processes generate (T68623, T152235, T127367), but the TL;DR version is that several of the common log generation points including lighttpd and grid engine jobs do not handle file rotation robustly.

With rotation being challenging, the next best option is to look for ways to reduce logging in general. One potential piece of low-hanging fruit for log reduction is disabling access.log generation for lighttpd (the web server software used by many webservice backends). Today we generate a lighttpd config that includes:

/var/run/lighttpd/<tool>
server.modules = (
  ...
  "mod_accesslog",
  ...
)

accesslog.use-syslog = "disable"
accesslog.filename = "/data/project/<tool>/access.log"

Based on my reading of the upstream docs for mod_accesslog, if we removed the accesslog.filename = "..." line from the default config no access.log output would be generated. It would then be possible for a given tool to use the $HOME/.lighttpd.conf file to add the accesslog.filename = "..." stanza back into their composite configuration if desired.
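To make the proposed change concrete, here is a minimal sketch of a config generator that leaves out the accesslog.filename line unless it is explicitly requested. This is not the actual `webservice` code: the function name `render_lighttpd_config` and the `enable_access_log` flag are hypothetical, and only the directives quoted above are reproduced.

```python
def render_lighttpd_config(tool: str, enable_access_log: bool = False) -> str:
    """Return an illustrative lighttpd config fragment for a tool.

    Hypothetical helper, not the real webservice generator.
    mod_accesslog stays loaded so that a tool can add its own
    accesslog.filename line later.
    """
    lines = [
        "server.modules = (",
        '  "mod_accesslog",',
        ")",
        "",
        'accesslog.use-syslog = "disable"',
    ]
    if enable_access_log:
        # Only emitted on request; without this line no access.log is written.
        lines.append(f'accesslog.filename = "/data/project/{tool}/access.log"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_lighttpd_config("mytool"))
```

Keeping the module in server.modules while dropping only the filename line matches the idea described above: logging is off by default, but a tool can opt back in with a single directive.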

My operating theory here is that the majority of tool maintainers never use/process their access.log data. Some polling should be done to test this theory before making any sweeping change.

Update:

The decision going forward is not to enable access.log by default. If needed, the user can enable it by adding the setting to their ~/.lighttpd.conf file.

TODO:

  • Code the new behavior (remove the accesslog.filename = "..." line from the generated lighttpd config)
  • Document the new behavior on wikitech as a News/... article including instructions on how to add local config to generate the access.log if needed by a Tool
  • Add a message that will be shown on start and restart commands linking to the News/... article
  • Merge and deploy the updated webservice code including Docker image rebuilds
  • Announce the new behavior to cloud-announce including link to News/... article
  • Wait one week for folks to restart things normally and pick up the new behavior in their tools
  • Force restart lighttpd powered webservices on the grid and Kubernetes that have been running since before the new webservice was deployed
  • Remove notice of behavior change from webservice in gerrit so that the next build + deploy will remove the notice

Event Timeline

bd808 created this task. Sep 19 2019, 7:04 PM
Restricted Application added a subscriber: Aklapper. · Sep 19 2019, 7:04 PM
bd808 added a comment. Sep 19 2019, 7:06 PM

Dropping this in the "Discussion Needed" column on the cloud-services-team workboard to get some initial input on the idea. Comments from others outside the team are very much welcome as well, especially comments describing how you actually use access.log data if you are opposed to this basic idea.

Phamhi added a subscriber: Phamhi. Sep 20 2019, 2:28 PM

I also wanted to add that there's no way to truncate a live access.log (https://phabricator.wikimedia.org/T152235#3521216). I can confirm this behaviour. The only workaround is to shut down the webservice, truncate the access.log, and then start the webservice again.
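The stop, truncate, start workaround could be scripted roughly as follows. This is only a sketch under assumptions: it presumes that `webservice stop` and `webservice start` can be run without further arguments for the tool in question, and the helper name `truncate_access_log` and its injectable `run` parameter are hypothetical.

```python
import subprocess
from typing import Callable, Optional, Sequence


def truncate_access_log(
    log_path: str,
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> None:
    """Stop the webservice, empty the log file, then start the webservice.

    `run` executes a command given as a list of arguments. It defaults to
    subprocess.run with check=True and can be replaced for dry runs.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)

    run(["webservice", "stop"])
    try:
        # Opening in "w" mode truncates the file to zero bytes.
        with open(log_path, "w"):
            pass
    finally:
        # Always try to bring the webservice back, even if truncation failed.
        run(["webservice", "start"])
```

Passing a stub such as `run=print` shows the commands that would be executed without touching a running webservice.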

I wasn't able to override the accesslog.filename configuration directive from $HOME/.lighttpd.conf. For example, if I add `accesslog.filename = "{home}/xxxaccess.log"`, it prevents lighttpd from starting (I still have to investigate why).

Phamhi added a comment. Edited · Sep 20 2019, 2:48 PM

I did some napkin math, and all of the access.log files together take up approximately 342GB of disk space right now. I agree with bd808 that most users most likely don't really care about this data.
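For anyone who wants to repeat that kind of estimate, a rough sketch is below. It assumes one access.log per tool directly under a project root such as /data/project. The function names are hypothetical, and this is not necessarily how the 342GB figure was obtained. It sums apparent file sizes, which can differ from the blocks actually used on disk.

```python
from pathlib import Path


def total_access_log_bytes(project_root: str) -> int:
    """Sum the apparent sizes of <project_root>/<tool>/access.log files."""
    total = 0
    for log in Path(project_root).glob("*/access.log"):
        try:
            total += log.stat().st_size
        except OSError:
            # Skip files that vanish or cannot be read while scanning.
            continue
    return total


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


if __name__ == "__main__":
    size = total_access_log_bytes("/data/project")
    print(f"access.log total: {to_gib(size):.1f} GiB")
```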

I now know why I couldn't override "accesslog.filename" in the user's .lighttpd.conf file: our current version of lighttpd (1.4.45) doesn't support this (https://redmine.lighttpd.net/boards/2/topics/6279)...

# /usr/sbin/lighttpd -v
lighttpd/1.4.45 (ssl) - a light and fast webserver
Build-Date: Jan 14 2017 21:07:19

We have to upgrade to https://www.lighttpd.net/2017/10/21/1.4.46/ for the ability to override prior config values (https://redmine.lighttpd.net/issues/2799) ..sigh
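A small check along these lines could tell whether a given host's lighttpd is new enough to override earlier config values. It is a sketch that parses the text printed by `lighttpd -v` (as shown above) and compares it with 1.4.46, the release linked above. The helper names are hypothetical.

```python
import re
from typing import Tuple

# First release that allows overriding prior config values, per the links above.
MIN_OVERRIDE_VERSION = (1, 4, 46)


def parse_lighttpd_version(output: str) -> Tuple[int, ...]:
    """Extract the version tuple from `lighttpd -v` output."""
    match = re.search(r"lighttpd/(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("no lighttpd version found in output")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_config_override(output: str) -> bool:
    """Return True if the reported version is at least 1.4.46."""
    return parse_lighttpd_version(output) >= MIN_OVERRIDE_VERSION


if __name__ == "__main__":
    sample = "lighttpd/1.4.45 (ssl) - a light and fast webserver"
    print(parse_lighttpd_version(sample), supports_config_override(sample))
```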

I can confirm that if we comment out the accesslog.filename = "..." line, the lighttpd webservice won't produce any access.log.
If the user wants access.log, it's as simple as adding the line accesslog.filename = "/data/project/<homedir>/access.log" to their local ~/.lighttpd.conf file.

So that part works as hoped. :)

Andrew added a subscriber: Andrew. Tue, Sep 24, 4:16 PM

Approved during WMCS meeting

Andrew assigned this task to Phamhi. Wed, Sep 25, 3:30 PM
Andrew moved this task from Needs discussion to Doing on the cloud-services-team (Kanban) board.
bd808 added a comment. Wed, Oct 2, 8:58 PM

Some polling should be done to test this theory before making any sweeping change.

This line in the original task description made us think about doing a community consultation of some sort. Having thought about the nature of the change some more, @Phamhi and I are now thinking that life will be easier for everyone if we instead just treat this like a breaking change for the Toolforge community (and hopefully a breaking change that really does not break much). The rollout plan that we discussed for this looks like:

  • Code the new behavior (remove the accesslog.filename = "..." line from the generated lighttpd config)
  • Document the new behavior on wikitech as a News/... article including instructions on how to add local config to generate the access.log if needed by a Tool
  • Add a message that will be shown on start and restart commands linking to the News/... article
  • Merge and deploy the updated webservice code including Docker image rebuilds
  • Announce the new behavior to cloud-announce including link to News/... article
  • Wait one week for folks to restart things normally and pick up the new behavior in their tools
  • Force restart lighttpd powered webservices on the grid and Kubernetes that have been running since before the new webservice was deployed
  • Remove notice of behavior change from webservice in gerrit so that the next build + deploy will remove the notice

One TODO from this list is to determine which runtimes are using lighttpd and make sure we know how to force restarts of them.
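The selection step for the forced restarts might look something like the sketch below. Everything in it is illustrative: the `RunningService` record, the set of lighttpd-backed type names, and the timestamps are placeholders, and gathering the real data from the grid and Kubernetes is not shown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass(frozen=True)
class RunningService:
    """Hypothetical record describing one running webservice."""
    tool: str
    service_type: str
    started_at: datetime


def needs_forced_restart(
    services: Iterable[RunningService],
    lighttpd_types: Set[str],
    deployed_at: datetime,
) -> List[str]:
    """Return tools whose lighttpd-backed webservice predates the deploy."""
    return sorted(
        svc.tool
        for svc in services
        if svc.service_type in lighttpd_types and svc.started_at < deployed_at
    )


if __name__ == "__main__":
    deploy = datetime(2019, 10, 15, 12, 0)  # placeholder deploy time
    running = [
        RunningService("tool-a", "lighttpd", datetime(2019, 9, 1)),
        RunningService("tool-b", "lighttpd", datetime(2019, 10, 16)),
        RunningService("tool-c", "other", datetime(2019, 8, 1)),
    ]
    print(needs_forced_restart(running, {"lighttpd"}, deploy))
```

Passing the lighttpd-backed types in as a parameter keeps the open question from the TODO (which runtimes actually use lighttpd) separate from the date comparison.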

Change 541609 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/software/tools-webservice@master] tools-webservice: Disable access.log feature by default

https://gerrit.wikimedia.org/r/541609

Phamhi updated the task description. Tue, Oct 8, 8:06 PM
Phamhi updated the task description. Wed, Oct 9, 2:34 PM
Phamhi updated the task description.
Phamhi updated the task description. Tue, Oct 15, 10:23 AM

Change 541609 merged by jenkins-bot:
[operations/software/tools-webservice@master] tools-webservice: Disable access.log feature by default

https://gerrit.wikimedia.org/r/541609

Phamhi updated the task description. Tue, Oct 15, 2:53 PM

Wikitech documentation https://w.wiki/9go has been updated

Phamhi updated the task description. Thu, Oct 17, 4:04 PM