Page MenuHomePhabricator

cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet
Closed, ResolvedPublic

Description

Hi!

root@ cronspam registered for (updated to Sept 16th 2016):

  1. labnet1002.eqiad.wmnet
Cron Daemon root@labnet1002.eqiad.wmnet via wikimedia.org 
to root 
/etc/cron.daily/logrotate:
error: nova-manage:1 duplicate log entry for /var/log/nova/nova-manage.log

This seems to be the issue:

elukey@labnet1002:~$ dpkg -L nova-common | grep logrotate
/etc/logrotate.d
/etc/logrotate.d/nova-common
/etc/logrotate.d/nova-manage

elukey@labnet1002:~$ cat /etc/logrotate.d/nova-common
/var/log/nova/*.log {
    daily
    missingok
    compress
    delaycompress
    copytruncate
}
elukey@labnet1002:~$ cat /etc/logrotate.d/nova-manage
/var/log/nova/nova-manage.log {
    daily
    missingok
    compress
    delaycompress
    notifempty
}
  1. labtestservices2001.wikimedia.org
/etc/dns-floating-ip-updater.py
..
..
keystoneclient.exceptions.Unauthorized: The request you have made requires authentication. (HTTP 401) (Request-ID: req-5722c432-fd24-449b-9bf6-df11f20994ff)
  1. labtestcontrol2001.wikimedia.org
to root 
/etc/cron.daily/logrotate:
 /usr/bin/mysqladmin: refresh failed; error: 'Unknown error'
error: error running shared postrotate script for '/var/log/mysql.log /var/log/mysql/mysql.log /var/log/mysql/mysql-slow.log /var/log/mysql/error.log '
run-parts: /etc/cron.daily/logrotate exited with return code 1
  1. labservices1001.wikimedia.org and labnet1001.eqiad.wmnet
/etc/cron.daily/logrotate:
gzip: stdin: file size changed while zipping

Any attempt to remove them would be really awesome!

Thanks!

Luca

Event Timeline

elukey created this task.Apr 12 2016, 6:50 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 12 2016, 6:50 AM
elukey renamed this task from cronspam from labscontrol1001 and labstore1001 to cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet .Apr 12 2016, 6:57 AM
elukey updated the task description. (Show Details)
elukey updated the task description. (Show Details)Apr 21 2016, 7:33 AM
chasemp triaged this task as Medium priority.May 31 2016, 3:26 PM
chasemp added a subscriber: chasemp.

@elukey is this still happening?

faidon raised the priority of this task from Medium to High.Aug 14 2016, 5:02 PM
faidon added subscribers: Andrew, faidon.

I've pinged @Andrew a few times last week on IRC about Labs cron spam. Some of it was fixed but Labs still accounts for the vast majority of our cron spam:

Also note that one of these leaks a database password as well (I've hidden it from the screenshot), so I'm going to raise this to High.

I've pinged @Andrew a few times last week on IRC about Labs cron spam. Some of it was fixed but Labs still accounts for the vast majority of our cron spam:


Also note that one of these leaks a database password as well (I've hidden it from the screenshot), so I'm going to raise this to High.

I've tested the dns-floating-ip-updater script and it returns nothing on stderr to me currently. Errors we have seen coming from it before were symptoms of problems elsewhere within labtest.
The mysql failure is interesting though. It says:

ERROR 1040 (08004): Too many connections

That most likely explains why we're having issues with labtesthorizon login and why restarting mysql (or as I've now found, after looking at the processlist, following https://www.percona.com/blog/2009/05/21/mass-killing-of-mysql-connections/) fixes the issue. I don't yet know a) why this doesn't happen in real-labs, and b) why all these connections are remaining open.

Edit: I started looking at the processes using the ports listed in mysql's processlist... there's a lot of keystone-all processes on labtestcontrol and nova-api processes on labtestnet. But maybe this is normal and labcontrol1001 just has a higher mysql max client connections limit, I have no way of knowing.

Mentioned in SAL [2016-08-15T20:17:11Z] <Krenair> labtestcontrol2001: Raise max_connections mysql global to 500 to match real-labs' on m5-master (db1009). old value was 151, see my comment at T132422

I've bumped labtestcontrol2001 mysql's max_connections from 151 to 500, to match real-labs' value of 500 on m5-master (db1009). You should no longer see that mysql failure.

If anything else in labtest spams root@, let me know and I'll take a look.

labtestservices2001 still warns daily about:
error: designate-common:1 duplicate log entry for /var/log/designate/designate-api.log
I think @Andrew recently fixed this in some (but not all) of the other hosts.

Other lab* hosts (non-labtest) are continuing to spam us too.

I fixed several of these just now; will keep an eye on my cronspam to see what I missed.

Thanks a lot, appreciated :) I'll keep an eye out too.

Still getting regular emails about gzip: stdin: file size changed while zipping, it shouldn't be anything related to the logrotate files in operations/puppet since I added delaycompress to all of them.

@Andrew any ideas?

Maybe these are coming from unpuppetised base services installed by Debian/Ubuntu?

Seeing again dns-floating-ip-updater.py exceptions:

Cron Daemon <root@labtestservices2001.wikimedia.org>
10:22 PM (10 hours ago)

to root 
No handlers could be found for logger "keystoneclient.auth.identity.base"
Traceback (most recent call last):
  File "/etc/dns-floating-ip-updater.py", line 81, in <module>
    for zone in designate_client.zones.list():
  File "/usr/lib/python2.7/dist-packages/designateclient/v2/zones.py", line 47, in list
    return self._get(url, response_key="zones")
  File "/usr/lib/python2.7/dist-packages/designateclient/client.py", line 59, in _get
    resp, body = self.client.session.get(url)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/adapter.py", line 170, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python2.7/dist-packages/designateclient/v2/client.py", line 51, in request
    response, body = super(self.__class__, self).request(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/adapter.py", line 206, in request
    resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/adapter.py", line 95, in request
    return self.session.request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/utils.py", line 337, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/session.py", line 386, in request
    resp = send(**kwargs)
  File "/usr/lib/python2.7/dist-packages/keystoneclient/session.py", line 430, in _send_request
    raise exceptions.ConnectionRefused(msg)
keystoneclient.exceptions.ConnectionRefused: Unable to establish connection to http://labtestservices2001.wikimedia.org:9001/zones?

Maybe these are coming from unpuppetised base services installed by Debian/Ubuntu?

Will investigate the ones without delaycompress rotating them one by one :)

AlexMonk-WMF added a comment.EditedSep 14 2016, 3:02 PM

Seeing again dns-floating-ip-updater.py exceptions:

keystoneclient.exceptions.ConnectionRefused: Unable to establish connection to http://labtestservices2001.wikimedia.org:9001/zones?

This probably got fixed when I restarted designate-api yesterday (I think it got stuck connecting to rabbitmq, which had been broken earlier due to a disk space issue). We should probably have proper monitoring for this type of thing (see T133656, T90784, T91226 etc. but for designate) instead of relying on dns-floating-ip-updater.py breaking :)

@AlexMonk-WMF and now we have this one :D

keystoneclient.exceptions.Unauthorized: The request you have made requires authentication. (HTTP 401) (Request-ID: req-9ab0ec01-49bb-4d4c-99e6-decebeef299f)
elukey updated the task description. (Show Details)
AlexMonk-WMF added a comment.EditedSep 16 2016, 10:37 AM

@AlexMonk-WMF and now we have this one :D

keystoneclient.exceptions.Unauthorized: The request you have made requires authentication. (HTTP 401) (Request-ID: req-9ab0ec01-49bb-4d4c-99e6-decebeef299f)

Apparently another project has appeared in labtest that doesn't have the correct rights for the novaadmin user. This one is called 'secretproject'

Edit: Ran openstack role add --project secretproject --user novaadmin projectadmin and openstack role add --project secretproject --user novaadmin user, things should be better now

About the gzip: stdin: file size changed while zipping email: it should be related to the upstart logrotate config. I logged onto labvirt1014.eqiad.wmnet and found all the files under /etc/logrotate.d that have compress but not delaycompress; I then filtered the ones that are daily, and only one came up:

/var/log/upstart/*.log {
        daily
        missingok
        rotate 7
        compress
        notifempty
       	nocreate
}

A quick research seems to confirm: https://lists.ubuntu.com/archives/foundations-bugs/2015-September/253771.html

Summary after today's hacking with @Andrew:

  1. logrotate errors while zipping should be resolved via https://gerrit.wikimedia.org/r/#/c/313558/
  1. mysql refresh errors on labtestcontrol were due to apparmor not allowing /var/log/mysql/** so we added it manually since it wasn't puppetized.
  1. the nova-manage logrotate file was removed manually and not re-installed when we tried to install nova-commons again. This is probably from an old installation.

So after these patches are merged we should see a big reduction in cronspam from labs \o/

elukey closed this task as Resolved.Oct 5 2016, 12:40 PM
elukey claimed this task.

Mentioned in SAL (#wikimedia-operations) [2017-06-09T07:15:03Z] <elukey> deleted /etc/logrotate.d/nova-manage from labtestvirt2003 to reduce cronspam (same solution used in T132422#2679434)