Description

The job count in the webgrid-lighttpd queue among trusty nodes has been climbing (https://graphite-labs.wikimedia.org/render/?width=879&height=548&_salt=1474394266.622&target=tools.tools-services-01.sge.webgrid-lighttpd.job_count&from=-30d) - we need to add a few instances to handle the load.
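For reference, the same series can be fetched programmatically (a minimal sketch, assuming only the standard Graphite render API with format=json; the target is the one from the URL above):

import json
import urllib2  # Python 2, matching the grid tooling

# Fetch the last 30 days of the webgrid-lighttpd job count as JSON.
url = ("https://graphite-labs.wikimedia.org/render/"
       "?target=tools.tools-services-01.sge.webgrid-lighttpd.job_count"
       "&from=-30d&format=json")
series = json.load(urllib2.urlopen(url))
# Each datapoint is [value, timestamp]; print the most recent non-null value.
latest = [v for v, t in series[0]["datapoints"] if v is not None][-1]
print("current job count: %s" % latest)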
Event Timeline
Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:14Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1415 (T146212)
Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:23Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1416 (T146212)
Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:42Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1418 (T146212)
Mentioned in SAL (#wikimedia-labs) [2016-09-20T21:17:06Z] <madhuvishy|food> Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212)
Mentioned in SAL (#wikimedia-labs) [2016-09-20T21:23:48Z] <madhuvishy|food> Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212)
I depooled these - they were running into issues wrt https://phabricator.wikimedia.org/T115194, causing problems when people started webservices:
File "/usr/bin/webservice-runner", line 26, in <module> - proxy.register(port) - File "/usr/lib/python2.7/dist-packages/toollabs/webservice/proxy.py", line 31, in register - current_ip = socket.gethostbyname(socket.getfqdn())Options
thanks @yuvipanda
@madhuvishy you'll probably have to sync w/ @Krenair or @Andrew on some DNS leak cleanup here :)
We looked into it last night, but weren't able to find the cause. We do know that the last instance to leave a reverse DNS entry behind was deleted around 2016-09-08 22:46 (@madhuvishy ran the following query for me, which eventually returned 2016-09-08 22:46:45 after running for 10-15 minutes):

select max(nova.instances.deleted_at)
from records
join recordsets on records.recordset_id = recordsets.id
left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id
where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60'
  and recordsets.name like '%.10.in-addr.arpa.'
  and recordsets.type = 'PTR'
  and nova.instances.deleted_at is not null;

You'd expect the DNS entry to be deleted within a minute of that.
Unfortunately @madhuvishy found nothing useful in the logs around that time, and I'm not sure whether this is now a historical thing that needs a one-off cleanup, or an ongoing issue.
@AlexMonk-WMF With the offsite coming up and the need to add exec nodes to handle load, do you think we can do a cleanup now so we can pool these, and then look into the underlying causes?
Cleanup script has been run for existing cases.
krenair@bastion-01:~$ host tools-webgrid-lighttpd-1418
tools-webgrid-lighttpd-1418.eqiad.wmflabs has address 10.68.20.200
krenair@bastion-01:~$ host tools-webgrid-lighttpd-1416
tools-webgrid-lighttpd-1416.eqiad.wmflabs has address 10.68.19.50
krenair@bastion-01:~$ host 10.68.20.200
200.20.68.10.in-addr.arpa domain name pointer tools-webgrid-lighttpd-1418.tools.eqiad.wmflabs.
krenair@bastion-01:~$ host 10.68.19.50
50.19.68.10.in-addr.arpa domain name pointer tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs.
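The same forward/reverse consistency check can be scripted if more instances need sweeping (a minimal sketch using only the Python standard library; the hostnames are the two from this task):

import socket

# Instances repooled after the DNS cleanup; forward and reverse records
# should now exist and agree for each of them.
for host in ("tools-webgrid-lighttpd-1416.eqiad.wmflabs",
             "tools-webgrid-lighttpd-1418.eqiad.wmflabs"):
    ip = socket.gethostbyname(host)                        # forward lookup
    ptr_name, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse lookup
    print("%s -> %s -> %s" % (host, ip, ptr_name))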
Mentioned in SAL (#wikimedia-labs) [2016-09-21T18:42:45Z] <madhuvishy> Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
Mentioned in SAL (#wikimedia-labs) [2016-09-21T18:56:41Z] <madhuvishy> Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup