Tracking and Reducing cron-spam from root@
Open, Normal, Public

Description

Cronspam is annoying, but it sometimes carries very useful information. This Phabricator task tracks the effort to reduce the current level of cronspam from root@ and to keep it under control from now on.

The final goal must be to improve our docs, best practices, etc. alongside the patches that we send; otherwise there is no point in proceeding with this work.

Some questions:

  1. Do we really value cron-alerts in Wikimedia?
  2. What is an acceptable level of cron-spam?
  3. How do we improve our monitoring/alarming to avoid relying on people reading cron spam emails?

Related Objects

Status      Assigned            Task
Open        None
Resolved    elukey
Resolved    elukey
Resolved    faidon
Open        None
Resolved    faidon
Open        None
Resolved    Andrew
Resolved    fgiunchedi
Open        None
Open        None
Resolved    jcrespo
Resolved    None
Resolved    elukey
Resolved    None
Resolved    Dzahn
Resolved    ema
Resolved    MoritzMuehlenhoff
Resolved    Catrope
Resolved    None
Resolved    elukey
Open        None
Resolved    None
Open        None
Resolved    Dzahn
Resolved    faidon
Duplicate   None
Resolved    fgiunchedi
Open        None
Open        None
Open        None
Open        None
Open        None
Open        None
Resolved    None
Resolved    None
Open        None
Open        None
Resolved    fgiunchedi
Open        None
Resolved    elukey

Change 291752 merged by Ema:
update-ocsp-all: write output to logfile

https://gerrit.wikimedia.org/r/291752
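
For context, cron mails whatever a job prints on stdout or stderr, so sending that output to a logfile is one common way to stop the email. The sketch below is only an illustration of that idea and is not the content of the change above; `run_logged` is a hypothetical helper name and the logfile path is chosen by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the merged change): run a command with
# stdout and stderr appended to a logfile, so that cron sees no output and
# therefore sends no mail. The command's exit status is preserved.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    logfile=$1
    shift
    # Record when the run started so the logfile can replace the emails.
    printf '%s starting: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >>"$logfile"
    # Append both output streams to the logfile instead of printing them.
    "$@" >>"$logfile" 2>&1
    status=$?
    printf '%s exit status: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" >>"$logfile"
    return "$status"
}
```

A crontab entry would then call a small script that sources this function and runs the real job through it. Note that this also hides failures from root@, so it probably only makes sense when the logfile or the exit status is checked by some other monitoring.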

Change 292936 had a related patch set uploaded (by Ema):
Don't install apt-show-versions

https://gerrit.wikimedia.org/r/292936

Change 292936 merged by Ema:
Don't install apt-show-versions

https://gerrit.wikimedia.org/r/292936

apt-xapian-index was throwing errors on random hosts. It is entirely unused: it was only present on approx. 350 systems, which were installed at a time when apt was still configured to install "Recommends:" packages. I therefore removed it across the fleet. It has no purpose for us (in Debian the index is used for debtags, which we don't care about).

Change 296707 had a related patch set uploaded (by Jcrespo):
Avoid log spam as 99% of these would be non-errors

https://gerrit.wikimedia.org/r/296707

Change 296707 merged by Jcrespo:
Avoid log spam as 99% of these would be non-errors

https://gerrit.wikimedia.org/r/296707

Change 298289 had a related patch set uploaded (by Filippo Giunchedi):
mediawiki: add delaycompress for jobrunner logs

https://gerrit.wikimedia.org/r/298289

Change 298289 merged by Filippo Giunchedi:
mediawiki: add delaycompress for jobrunner logs

https://gerrit.wikimedia.org/r/298289

Change 298779 had a related patch set uploaded (by Elukey):
Remove cronspam coming from Gerrit log deletion

https://gerrit.wikimedia.org/r/298779

Change 298785 had a related patch set uploaded (by Elukey):
Reduce cronspam from terbium related to the echo_mail_batch cron script

https://gerrit.wikimedia.org/r/298785

Change 298779 abandoned by Elukey:
Remove cronspam coming from Gerrit log deletion

Reason:
Spam seems to have stopped.

https://gerrit.wikimedia.org/r/298779

Change 298785 abandoned by Elukey:
Reduce cronspam from terbium related to the echo_mail_batch cron script

https://gerrit.wikimedia.org/r/298785

Change 307918 had a related patch set uploaded (by Elukey):
Remove g w flag to /srv/log/eventlogging to allow proper logrotation

https://gerrit.wikimedia.org/r/307918

Change 307918 merged by Elukey:
Remove g w flag to /srv/log/eventlogging to allow proper logrotation

https://gerrit.wikimedia.org/r/307918

Change 308526 had a related patch set uploaded (by Elukey):
Attempt to stop cronspam from graphite-web

https://gerrit.wikimedia.org/r/308526

Change 308526 merged by Elukey:
Attempt to stop cronspam from graphite-web

https://gerrit.wikimedia.org/r/308526

Change 308715 had a related patch set uploaded (by Elukey):
Reduce cronspam to root@ from zookeeper zkCleanup script

https://gerrit.wikimedia.org/r/308715

Change 308715 merged by Elukey:
Reduce cronspam to root@ from zookeeper zkCleanup script

https://gerrit.wikimedia.org/r/308715

Change 308940 had a related patch set uploaded (by Elukey):
Add the logrotate delaycompress setting everywhere

https://gerrit.wikimedia.org/r/308940

Mentioned in SAL [2016-09-07T07:58:19Z] <elukey> executed apt-get purge tmpreaper on gallium (T132324)

Change 308940 merged by Elukey:
Add the logrotate delaycompress setting everywhere

https://gerrit.wikimedia.org/r/308940

Mentioned in SAL (#wikimedia-operations) [2016-09-21T07:19:29Z] <elukey> Moved some hhvm logs (/var/log/hhvm) from root:adm to www-data:www-data on mw127[678] to remove cronspam (T132324)

Mentioned in SAL (#wikimedia-operations) [2016-09-23T07:33:59Z] <elukey> executed 'find /var/log/hhvm/ -type f -user root -exec chown www-data:www-data {} \;' for all the api and appservers to remove/prevent cronspam (root:adm files also related to new reimaged hosts, Rsyslog needs to be configured before hhvm) - T132324
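
Before running a fleet-wide `chown` like the one logged above, it can help to list the affected files first. The following is a minimal dry-run sketch under that assumption; `list_owned_by` is a hypothetical helper name, and it only prints matches without changing anything.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the regular files under DIR that are
# owned by OWNER (a user name or numeric uid). Nothing is modified.
# Usage: list_owned_by OWNER DIR
list_owned_by() {
    owner=$1
    dir=$2
    find "$dir" -type f -user "$owner" -print
}
```

For example, `list_owned_by root /var/log/hhvm` would show which log files the `find ... -exec chown` command above is going to touch.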

Change 313563 had a related patch set uploaded (by Andrew Bogott):
l10nupdate: Add 'su' to logrotate script

https://gerrit.wikimedia.org/r/313563

Change 313563 merged by Andrew Bogott:
l10nupdate: Add 'su' to logrotate script

https://gerrit.wikimedia.org/r/313563

Change 314026 had a related patch set uploaded (by Elukey):
Set l10nupdate group in its related logrotate config file

https://gerrit.wikimedia.org/r/314026

Change 314026 merged by Elukey:
Set l10nupdate group in its related logrotate config file

https://gerrit.wikimedia.org/r/314026

Change 323528 had a related patch set uploaded (by Elukey):
Remove cron notifications to root@ for jobchron/runner service status

https://gerrit.wikimedia.org/r/323528

Change 323528 merged by Faidon Liambotis:
Remove cron notifications to root@ for jobchron/runner service status

https://gerrit.wikimedia.org/r/323528

Mentioned in SAL (#wikimedia-operations) [2016-12-20T08:27:54Z] <elukey> renamed some log files ($something.1.gz to $something.1a.gz) on cp1008 and rutherium to unblock logrotation and reduce cronspam - T132324

Mentioned in SAL (#wikimedia-operations) [2016-12-22T07:26:54Z] <elukey> created /var/log/squid3/access.log.1.gz on aluminum to fix cronspam - T132324

elukey moved this task from Backlog to In Progress on the User-Elukey board. (Dec 22 2016, 8:55 AM)
elukey moved this task from In Progress to Ops Backlog on the User-Elukey board. (Dec 23 2016, 3:12 PM)
elukey moved this task from Ops Backlog to Stalled on the User-Elukey board.

Mentioned in SAL (#wikimedia-operations) [2017-01-03T07:58:28Z] <elukey> chown www-data:www-data all the root:adm hhvm log files on mw codfw hosts (T132324)

Mentioned in SAL (#wikimedia-operations) [2017-01-05T07:54:14Z] <elukey> chown www-data:www-data all the root:adm hhvm log files on mw eqiad hosts (T132324)

Change 336218 had a related patch set uploaded (by Elukey):
Silence apt rsync repo activities

https://gerrit.wikimedia.org/r/336218

Change 336218 merged by Dzahn:
Silence apt rsync repo activities

https://gerrit.wikimedia.org/r/336218

Change 343276 had a related patch set uploaded (by Elukey):
[operations/puppet] Add delay compress to upstart's logrotate

https://gerrit.wikimedia.org/r/343276

Change 343276 merged by Elukey:
[operations/puppet] Add delay compress to upstart's logrotate

https://gerrit.wikimedia.org/r/343276