Page MenuHomePhabricator

Delete Django logs after two weeks
Closed, ResolvedPublic

Description

In T256465 we identified that the Library Card platform has a range of logs containing various data, and we deal with those in a highly inconsistent way.

For the Django logs, they are truncated/dropped only when new deployments happen. While this currently happens on a regular basis, it's variable and inconsistent. Since we don't typically need to look back further than about 1 week, we could set up a clear retention schedule which removes logs older than 2 weeks.

Event Timeline

Samwalton9-WMF raised the priority of this task from Medium to High.Jan 29 2021, 11:49 AM

This turned into more of a head-scratcher than I anticipated. Docker's builtin log rotation can only rotate based on size, not date:
https://docs.docker.com/config/containers/logging/configure/
Meaning we needed to handle it externally on the host.
On twl staging, I added the following file:
/etc/logrotate.d/twl_docker_container_logs
with these contents:

/var/lib/docker/containers/*/*-json.log {
    rotate 2
    daily
    missingok
    copytruncate
}

This should rotate logs daily, and delete logfiles after two rotations. I set it to daily instead of weekly, just to allow for quicker observation of the rotation behavior. I'm pretty sure once the logs get rotated out they will disappear from any docker log commands, so those files would have to be searched manually. If that's the case, we might do better to just retain 1 week instead of 2. I recommend we let this sit in review for a while while we suss out how all the pieces behave together.

I accidentally wiped the log here, so we'll need to wait an extra day.

I just verified that the log rotation is working out how I thought: logs are getting rotated, but when they do, they are no longer accessible via docker.

$ date
Thu 07 Oct 2021 07:31:10 PM UTC
$ docker service logs -t staging_twlight
2021-10-07T05:42:23.015568454Z staging_twlight.1.b8up62yt99e0@staging    | Not Found: /partners/22/
2021-10-07T07:04:12.542289195Z staging_twlight.1.b8up62yt99e0@staging    | Sending requests with default User-Agent.  Set 'user_agent' on mwoauth.flask.MWOAuth to quiet this message.
2021-10-07T07:04:13.620970307Z staging_twlight.1.b8up62yt99e0@staging    | Sending requests with default User-Agent.  Set 'user_agent' on mwoauth.flask.MWOAuth to quiet this message.
2021-10-07T07:04:17.321498981Z staging_twlight.1.b8up62yt99e0@staging    | Not Found: /users/my_library/..
$ ls /var/lib/docker/containers/<id>/
<id>-json.log    checkpoints     hostconfig.json  hosts   resolv.conf
<id>-json.log.1  config.v2.json  hostname	      mounts  resolv.conf.hash
$ cat /var/lib/docker/containers/<id>/<id>-json.log
{"log":"Not Found: /partners/22/\n","stream":"stderr","time":"2021-10-07T05:42:23.015568454Z"}
{"log":"Sending requests with default User-Agent.  Set 'user_agent' on mwoauth.flask.MWOAuth to quiet this message.\n","stream":"stderr","time":"2021-10-07T07:04:12.542289195Z"}
{"log":"Sending requests with default User-Agent.  Set 'user_agent' on mwoauth.flask.MWOAuth to quiet this message.\n","stream":"stderr","time":"2021-10-07T07:04:13.620970307Z"}
{"log":"Not Found: /users/my_library/...\n","stream":"stderr","time":"2021-10-07T07:04:17.321498981Z"}
$ cat /var/lib/docker/containers/<id>/<id>-json.log.1
{"log":"[2021-10-05 13:16:56 +0000] [1] [INFO] Starting gunicorn 20.1.0\n","stream":"stderr","time":"2021-10-05T13:16:56.616907987Z"}
{"log":"[2021-10-05 13:16:56 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)\n","stream":"stderr","time":"2021-10-05T13:16:56.617659792Z"}
{"log":"[2021-10-05 13:16:56 +0000] [1] [INFO] Using worker: gthread\n","stream":"stderr","time":"2021-10-05T13:16:56.617913218Z"}
{"log":"[2021-10-05 13:16:56 +0000] [29] [INFO] Booting worker with pid: 29\n","stream":"stderr","time":"2021-10-05T13:16:56.62417655Z"}
{"log":"[2021-10-05 13:16:56 +0000] [30] [INFO] Booting worker with pid: 30\n","stream":"stderr","time":"2021-10-05T13:16:56.636739373Z"}
{"log":"[2021-10-05 13:16:56 +0000] [31] [INFO] Booting worker with pid: 31\n","stream":"stderr","time":"2021-10-05T13:16:56.700445614Z"}
{"log":"[2021-10-05 13:16:56 +0000] [32] [INFO] Booting worker with pid: 32\n","stream":"stderr","time":"2021-10-05T13:16:56.789011343Z"}
{"log":"Not Found: /users/my_library/...\n","stream":"stderr","time":"2021-10-05T13:34:48.343415943Z"}
{"log":"Sending requests with default User-Agent.  Set 'user_agent' on mwoauth.flask.MWOAuth to quiet this message.\n","stream":"stderr","time":"2021-10-05T13:34:48.746940466Z"}
...
<truncated for clarity>
...
{"log":"Not Found: /users/my_library/...\n","stream":"stderr","time":"2021-10-06T11:00:56.684903065Z"}
{"log":"Not Found: /users/my_library/...\n","stream":"stderr","time":"2021-10-06T12:26:07.385185365Z"}

I've updated the rotation file to only keep one day:

/var/lib/docker/containers/*/*-json.log {
    rotate 1
    daily
    missingok
    copytruncate
}

I'll check back tomorrow to verify that we then just get a truncation with no extra file.

Here are my concerns about using this method:
This is not rolling retention; when the rotation happens, everything from the moment before the rotation is gone. If something breaks just before rotation, then we don't have any logs to look at.
My concern about keeping an extra week of logs that aren't visible in docker is that it will make troubleshooting and understanding our log retention confusing.

this change still left us with the extra file, so now we have:

/var/lib/docker/containers/*/*-json.log {
    rotate 0
    daily
    missingok
}

which I think should just wipe the log once per day.

It looks like we need the truncate operation to avoid breaking the logfile that docker is trying to write to.
Changing to:

/var/lib/docker/containers/*/*-json.log {
    rotate 0
    daily
    missingok
    copytruncate
}

which may still leave us with an extra file, but shouldn't break logging.

@Samwalton9
That did leave us with a dangling log file that wasn't available to docker. I've now landed on:

/var/lib/docker/containers/*/*-json.log {
    rotate 0
    daily
    missingok
    copytruncate
    postrotate
        rm "${1}.1"
    endscript
}

This cleans up that extra file so we don't have more log data on disk that what we would expect based on docker logs.
If we're okay with the caveat that we won't have log data immediately after rotation, then I think this is good to go (after setting it to weekly)

So if I understand correctly: The log will clear once per week, meaning we won't always have 7 days of logs, but will have 0-7 based on how recently the clear was?

If that's the case and you think that's OK then it's fine by me.

Correct: there's no nicely packaged tool for us to do a rolling deletion of logs based on timestamp unless we want to setup something more complicated like setup a syslog server or do a custom cron shell script to edit the live log files in place with sed or something.

With this solution, I think there will be times when we wished we had more log data, but it will reliably keep drop old logs without breaking docker or requiring any more engineering time.

Okay, I've deployed this to twl production and staging, and I updated the wiki page for server setup since we want to make sure to get this on new servers too:
https://github.com/WikipediaLibrary/TWLight/wiki/Debian-Server-setup