Remove access.log generation from default lighttpd.conf generated by `webservice`
Closed, Resolved, Public

Description

One of the recurring issues in Toolforge is NFS disk utilization and log files (T183920, T206239, T233120). For various reasons, we have never come up with a solid recommendation for Tool maintainers about how to manage the log files that their processes generate (T68623, T152235, T127367), but the TL;DR version is that several of the common log generation points including lighttpd and grid engine jobs do not handle file rotation robustly.

With rotation being challenging, the next best option is to look for ways to reduce logging in general. One potential piece of low-hanging fruit for log reduction is disabling access.log generation for lighttpd (the web server software used by many webservice backends). Today we generate a lighttpd config that includes:

/var/run/lighttpd/<tool>
server.modules = (
  ...
  "mod_accesslog",
  ...
)

accesslog.use-syslog = "disable"
accesslog.filename = "/data/project/<tool>/access.log"

Based on my reading of the upstream docs for mod_accesslog, if we removed the accesslog.filename = "..." line from the default config no access.log output would be generated. It would then be possible for a given tool to use the $HOME/.lighttpd.conf file to add the accesslog.filename = "..." stanza back into their composite configuration if desired.
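A minimal sketch of the proposal, written in Python because tools-webservice is a Python package. The template below is abbreviated and the name `build_config` is hypothetical; this is not the actual tools-webservice code. It only illustrates a default config that omits accesslog.filename, with the tool's local config appended afterwards.

```python
# Hypothetical sketch; not the actual tools-webservice implementation.
# The template is abbreviated for illustration only.
DEFAULT_TEMPLATE = """\
server.modules = (
  "mod_accesslog",
)

accesslog.use-syslog = "disable"
"""


def build_config(local_config=""):
    """Return the default config followed by the tool's local config, if any."""
    parts = [DEFAULT_TEMPLATE]
    if local_config.strip():
        parts.append(local_config.strip() + "\n")
    return "\n".join(parts)


default = build_config()
with_log = build_config('accesslog.filename = "/data/project/<tool>/access.log"')
print("accesslog.filename" in default)   # False
print("accesslog.filename" in with_log)  # True
```

With this shape, a tool that wants an access.log only needs the one accesslog.filename line in its local config; every other tool gets no access.log at all.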

My operating theory here is that the majority of tool maintainers never use/process their access.log data. Some polling should be done to test this theory before making any sweeping change.

Update:

The decision going forward is not to enable access.log by default. If needed, the user can enable it by overriding the setting in their ~/.lighttpd.conf file.

TODO:

  • Code the new behavior (remove the accesslog.filename = "..." line from the generated lighttpd config)
  • Document the new behavior on wikitech as a News/... article including instructions on how to add local config to generate the access.log if needed by a Tool
  • Add a message that will be shown on start and restart commands linking to the News/... article
  • Merge and deploy the updated webservice code including Docker image rebuilds
  • Announce the new behavior to cloud-announce including link to News/... article
  • Wait one week for folks to restart things normally and pick up the new behavior in their tools
  • Force restart lighttpd powered webservices on the grid and Kubernetes that have been running since before the new webservice was deployed
  • Remove notice of behavior change from webservice in gerrit so that the next build + deploy will remove the notice

Event Timeline

Dropping this in the "Discussion Needed" column on the cloud-services-team workboard to get some initial input on the idea. Comments from others outside the team are very much welcome as well, especially comments describing how you actually use access.log data if you are opposed to this basic idea.

I also wanted to add that there's no way to truncate a live access.log (https://phabricator.wikimedia.org/T152235#3521216). I can confirm this behaviour. The only workaround is to shut down the webservice, truncate the access.log, then start the webservice again.
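The workaround can be sketched roughly as follows. The function name `truncate_access_log` and the injectable `run` parameter are hypothetical, and the exact `webservice` arguments a given tool needs (backend, type) are an assumption left out here.

```python
import subprocess


def truncate_access_log(log_path, run=subprocess.check_call):
    """Stop the webservice, empty the log file, then start the webservice again."""
    # Assumption: a plain `webservice stop` / `webservice start` is enough
    # for this tool; some tools need extra backend or type arguments.
    run(["webservice", "stop"])
    try:
        # Opening in "w" mode truncates the file to zero bytes.
        with open(log_path, "w"):
            pass
    finally:
        # Always try to bring the webservice back up.
        run(["webservice", "start"])
```

Passing a different `run` callable makes it possible to dry-run the sequence without touching a real webservice.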

I wasn't able to override the accesslog.filename configuration directive from $HOME/.lighttpd.conf. For example, if I add `accesslog.filename = "{home}/xxxaccess.log"`, it prevents lighttpd from starting (I still have to investigate why).

I did some napkin math... and all of the access.log files take up approximately 342GB of disk space right now. And I agree with bd808 that most users most likely don't really care about this data.
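The thread does not say how the 342GB figure was computed. One way to get a comparable number is sketched below, assuming each tool keeps its log at `<root>/<tool>/access.log`; the function names are hypothetical.

```python
import os


def total_access_log_bytes(root):
    """Sum the sizes of <root>/<tool>/access.log for every tool directory."""
    total = 0
    for entry in os.scandir(root):
        if not entry.is_dir():
            continue
        log_path = os.path.join(entry.path, "access.log")
        try:
            total += os.path.getsize(log_path)
        except OSError:
            # No access.log for this tool, or it is unreadable; skip it.
            continue
    return total


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3
```

For example, `to_gib(total_access_log_bytes("/data/project"))` would report the apparent size of all access.log files in GiB. Note that `os.path.getsize` returns the file length, which can differ from the blocks actually used on disk.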

I know why I couldn't override "accesslog.filename" in the user's .lighttpd.conf file: our current version of lighttpd (1.4.45) doesn't support this (https://redmine.lighttpd.net/boards/2/topics/6279)...

# /usr/sbin/lighttpd -v
lighttpd/1.4.45 (ssl) - a light and fast webserver
Build-Date: Jan 14 2017 21:07:19

We have to upgrade to https://www.lighttpd.net/2017/10/21/1.4.46/ for the ability to override prior config values (https://redmine.lighttpd.net/issues/2799) ..sigh
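A small sketch of that version check, parsing the `lighttpd -v` banner shown above. The function names are hypothetical; the 1.4.46 threshold is the one cited in this comment.

```python
import re

# First lighttpd release that can override earlier config values,
# per the upstream links in this thread.
MIN_OVERRIDE_VERSION = (1, 4, 46)


def parse_lighttpd_version(version_output):
    """Extract (major, minor, patch) from `lighttpd -v` output."""
    match = re.search(r"lighttpd/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no lighttpd version found in output")
    return tuple(int(part) for part in match.groups())


def supports_override(version_output):
    """Return True if this lighttpd can override prior config values."""
    return parse_lighttpd_version(version_output) >= MIN_OVERRIDE_VERSION


banner = "lighttpd/1.4.45 (ssl) - a light and fast webserver"
print(parse_lighttpd_version(banner))  # (1, 4, 45)
print(supports_override(banner))       # False
```

Comparing integer tuples avoids the string-comparison trap where "1.4.100" would sort before "1.4.46".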

I can confirm that if we comment out the accesslog.filename = "..." line, the lighttpd webservice won't produce any access.log.

If the user wants access.log, it's as simple as adding the line accesslog.filename = "/data/project/<homedir>/access.log" to their local ~/.lighttpd.conf file.

So that part works as hoped. :)

Approved during WMCS meeting

Some polling should be done to test this theory before making any sweeping change.

This line in the original task description made us think about doing a community consultation of some sort. Having thought about the nature of the change some more, @Phamhi and I are now thinking that life will be easier for everyone if we instead just treat this like a breaking change for the Toolforge community (and hopefully a breaking change that really does not break much). The rollout plan that we discussed for this looks like:

  • Code the new behavior (remove the accesslog.filename = "..." line from the generated lighttpd config)
  • Document the new behavior on wikitech as a News/... article including instructions on how to add local config to generate the access.log if needed by a Tool
  • Add a message that will be shown on start and restart commands linking to the News/... article
  • Merge and deploy the updated webservice code including Docker image rebuilds
  • Announce the new behavior to cloud-announce including link to News/... article
  • Wait one week for folks to restart things normally and pick up the new behavior in their tools
  • Force restart lighttpd powered webservices on the grid and Kubernetes that have been running since before the new webservice was deployed
  • Remove notice of behavior change from webservice in gerrit so that the next build + deploy will remove the notice

One TODO from this list is to determine which runtimes are using lighttpd and make sure we know how to force restarts of them.

Change 541609 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/software/tools-webservice@master] tools-webservice: Disable access.log feature by default

https://gerrit.wikimedia.org/r/541609

Change 541609 merged by jenkins-bot:
[operations/software/tools-webservice@master] tools-webservice: Disable access.log feature by default

https://gerrit.wikimedia.org/r/541609

Wikitech documentation https://w.wiki/9go has been updated

Hi @bd808 .. I think I have a grip on creating a new Debian package, uploading it to the repo, and updating/pushing new Docker images. Can we proceed with the step "Merge and deploy the updated webservice code including Docker image rebuilds"?

Sounds good to me. I am really interested to see if we can actually notice reduced load on the NFS server once this is live everywhere. :)

Mentioned in SAL (#wikimedia-cloud) [2019-10-23T12:09:24Z] <phamhi> Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools (T233347)

The new version of toollabs-webservice package 0.47 has been pushed out to:

tools-checker-03
tools-sgebastion-[07-09]
tools-sgecron-01
tools-sgewebgrid-generic-[0901-0904]
tools-sgewebgrid-lighttpd-[0902-0928]

Mentioned in SAL (#wikimedia-cloud) [2019-10-23T20:00:43Z] <phamhi> Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 (T233347)

Will you delete these files?

In T233347#5600635, @Zoranzoki21 wrote:

Will you delete these files?

We could consider that after folks have had some time to decide if they need to keep generating $HOME/access.log files. It would be relatively simple to use find on the NFS primary itself to delete all of the access.log files that have an mtime that is stale by several days/weeks.
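The comment suggests using find on the NFS primary. An equivalent dry-run sketch in Python is shown below; it only lists candidates and deletes nothing. The function name and the `<root>/<tool>/access.log` layout are assumptions.

```python
import os
import time


def stale_access_logs(root, max_age_days, now=None):
    """Yield <root>/<tool>/access.log paths whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        log_path = os.path.join(entry.path, "access.log")
        try:
            mtime = os.path.getmtime(log_path)
        except OSError:
            # No access.log for this tool, or it is unreadable; skip it.
            continue
        if mtime < cutoff:
            yield log_path
```

A stale mtime indicates that the tool has not re-enabled logging since the change, so reviewing the listed paths before any deletion keeps logs that maintainers chose to keep generating.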

I am looking for a sane way to restart all of the lighttpd powered webservices to force the change to take effect...

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T16:44:56Z] <phamhi> restarted lighttpd based webservice pods on tools-worker-100[1-9] (T233347)

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T17:06:43Z] <phamhi> restarted lighttpd based webservice pods on tools-worker-101[0-9] (T233347)

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T17:34:24Z] <phamhi> restarted lighttpd based webservice pods on tools-worker-102[0-9] (T233347)

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T17:38:15Z] <phamhi> restarted lighttpd based webservice pods on tools-worker-103x and 1040 (T233347)

All of the lighttpd based k8s pods have been restarted

sudo cumin --force "O{project:tools name:^tools-worker-10..}" 'for i in $(docker ps|grep webservice|cut -d " " -f1); do docker exec $i grep "access.log" /usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py ;done'

Mentioned in SAL (#wikimedia-cloud) [2019-11-06T11:57:03Z] <phamhi> restarted all webservices in grid (T233347)

Change 549075 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/software/tools-webservice@master] tools-webservice: Remove user warning on start/restart on access.log

https://gerrit.wikimedia.org/r/549075

Change 549075 merged by Phamhi:
[operations/software/tools-webservice@master] tools-webservice: Remove user warning on start/restart on access.log

https://gerrit.wikimedia.org/r/549075

Glad that the old files are still around and that the change has been documented, so that browser version data can be gathered again. I hope that moving the logs to my local machine once every few years will help free up some disk space while complying with the privacy terms.