
BDB transaction files on OpenLDAP servers
Closed, Resolved, Public

Description

Icinga flagged a disk space warning for seaborgium's root partition. It currently has 1221 transaction log file chunks (10 MB each) in /var/lib/ldap/labs, with names like /var/lib/ldap/labs/log.0000001. This is described in the OpenLDAP FAQ: http://www.openldap.org/faq/index.cgi?_highlightWords=bdb&file=738
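As a rough sanity check on those numbers, the chunks can be counted and their footprint estimated from the shell. This is only a minimal sketch: the function name estimate_bdb_logs and the 10 MB default chunk size are illustrative assumptions, and it simply counts files matching log.* in the given directory.

```shell
# Count BDB transaction log chunks in a directory and estimate their size.
# Hypothetical helper; assumes chunks are named log.* and are ~10 MB each.
estimate_bdb_logs() {
    dir=$1
    chunk_mb=${2:-10}
    count=0
    for f in "$dir"/log.*; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$f" ] || continue
        count=$((count + 1))
    done
    echo "$count chunks, ~$((count * chunk_mb)) MB"
}
```

Running estimate_bdb_logs /var/lib/ldap/labs on a directory holding 1221 chunks would report roughly 12210 MB, which is in line with the disk usage discussed in this task.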

I ran db_archive on pollux and seaborgium, and it showed that all files except the last one are not currently in use and could be removed. Still, the documentation advises keeping them around:

The db5.3_archive utility writes the pathnames of log files that are no longer in use (for example, no longer involved in active transactions), to the standard output, one pathname per line.
These log files should be written to backup media to provide for recovery in the case of catastrophic failure (which also requires a snapshot of the database files), but they may then be deleted from the system to reclaim disk space.
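For reference, the check described above can be wrapped in a small helper. This is a sketch under assumptions: the function names and the DB_ARCHIVE override variable are hypothetical, and the binary name db5.3_archive is taken from the man page excerpt. The -h option points db_archive at the database home directory.

```shell
# Hypothetical wrapper around db_archive; DB_ARCHIVE can be overridden
# on hosts where the binary has a different versioned name.
DB_ARCHIVE=${DB_ARCHIVE:-db5.3_archive}

# Print the log files Berkeley DB reports as no longer in use, one per line.
list_unused_logs() {
    "$DB_ARCHIVE" -h "$1"
}

# Count those files without deleting anything.
count_unused_logs() {
    list_unused_logs "$1" | wc -l | tr -d ' '
}
```

For example, count_unused_logs /var/lib/ldap/labs prints how many chunks could be archived or removed. db_archive also has a -d option that removes the files it would otherwise list, which is deliberately not used here.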

We could add these to the backup and prune them from the servers (currently 13 GB of disk for the nine months seaborgium has been in service, and growing continuously), or we could discard older entries and keep only the last N transaction logs, which would still allow storing previous snapshots, I guess. In the event of a catastrophic failure of both serpens and seaborgium we would rather restore from backup anyway. Opinions?
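The "keep only the last N" option could look roughly like the following dry run. It is a sketch, not a tested procedure: the function name is hypothetical, it assumes zero-padded log.<digits> names so that lexical order matches age, and it does not check whether Berkeley DB still needs a file, so only files that db_archive also reports as unused should ever be removed.

```shell
# List log chunks that fall outside the newest N, oldest first.
# Dry run only: it prints candidates and deletes nothing.
# Assumes zero-padded names (log.<digits>) so lexical order matches age.
logs_beyond_last_n() {
    dir=$1
    keep=$2
    total=$(ls "$dir" | grep -c '^log\.[0-9][0-9]*$' || true)
    drop=$((total - keep))
    # Nothing to report when N or fewer chunks exist.
    [ "$drop" -gt 0 ] || return 0
    ls "$dir" | grep '^log\.[0-9][0-9]*$' | sort | head -n "$drop"
}
```

Calling logs_beyond_last_n /var/lib/ldap/labs 10 would print every chunk except the ten newest.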

Event Timeline

We already dump the DB to backups from both servers anyway. My take is "Let's just delete them". DB_LOG_AUTOREMOVE seems the best way since it's handled by OpenLDAP (actually Berkeley DB) anyway.

I think the way to set it in slapd.conf is:

dbconfig set_flags DB_LOG_AUTOREMOVE

Of course setting it in slapd.conf only sets the default. We must also set it manually in DB_CONFIG in our already initialized installs.
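For the already initialized installs, the manual DB_CONFIG change could be done idempotently along these lines. This is a minimal sketch: the function name is hypothetical, and it assumes the DB_CONFIG form of the setting is the same "set_flags DB_LOG_AUTOREMOVE" line that the dbconfig directive would have written.

```shell
# Append the auto-remove flag to an existing DB_CONFIG unless it is already set.
# Hypothetical helper; pass the DB_CONFIG path of the database directory.
ensure_log_autoremove() {
    config=$1
    line='set_flags DB_LOG_AUTOREMOVE'
    if grep -qxF "$line" "$config" 2>/dev/null; then
        echo "already set in $config"
    else
        # Make sure the file ends with a newline before appending.
        if [ -s "$config" ] && [ -n "$(tail -c 1 "$config")" ]; then
            echo >> "$config"
        fi
        printf '%s\n' "$line" >> "$config"
        echo "added to $config"
    fi
}
```

For example, ensure_log_autoremove /var/lib/ldap/labs/DB_CONFIG (the path is an assumption based on the database directory mentioned in the description), followed by a slapd restart so that the environment is reopened with the new setting.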

I see we already have checkpoint 512 30 as well, which is required for those log file chunks to no longer be relevant anyway.

Agreed on removing these. According to slapd-bdb(5), with that option checkpointing occurs every 512 KB or 30 minutes (whichever comes first), and checkpoints are explicitly documented as serving that purpose: (https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/checkpoint.html)

Performing checkpoints is necessary for two reasons. First, you can remove the Berkeley DB log files from your system only after a checkpoint. Second, the frequency of your checkpoints is inversely proportional to the amount of time it takes to run database recovery after a system or application failure.

What about the actual logs? Is it OK to reduce their verbosity or (probably better) rotate them a bit more often?

The actual OpenLDAP logs are harmless: slapd logs to syslog, and on seaborgium that amounts to only 10 megabytes. We should rotate/compress the slapo-audit log file, though.

demon triaged this task as Medium priority. Aug 31 2016, 6:34 PM

Mentioned in SAL (#wikimedia-operations) [2016-10-03T10:11:09Z] <akosiaris> restarting slapd on pollux.wikimedia.org T143302

Mentioned in SAL (#wikimedia-operations) [2016-10-03T10:13:14Z] <akosiaris> restarting slapd on serpens.wikimedia.org T143302

Mentioned in SAL (#wikimedia-operations) [2016-10-03T10:16:49Z] <akosiaris> restarting slapd on seaborgium.wikimedia.org T143302

Mentioned in SAL (#wikimedia-operations) [2016-10-03T10:21:51Z] <akosiaris> restarting slapd on dubnium.wikimedia.org T143302

akosiaris claimed this task.

All 4 servers have been manually migrated and checked. Disk space usage dropped on all 4, but mostly on the labs ones. Everything seems ok.

The following line is being logged quite often on seaborgium:

Oct  3 10:19:06 seaborgium slapd[2660]: connection_read(1289): no connection!

but according to http://www.openldap.org/lists/openldap-bugs/201005/msg00023.html

it is pretty much unavoidable and irrelevant. It was happening before as well.

I'll resolve this, feel free to reopen.