
Add 3 webgrid-lighttpd trusty nodes to tools project
Closed, Resolved (Public)

Description

The job count in the webgrid-lighttpd queue among trusty nodes has been climbing (https://graphite-labs.wikimedia.org/render/?width=879&height=548&_salt=1474394266.622&target=tools.tools-services-01.sge.webgrid-lighttpd.job_count&from=-30d), so we need to add a few instances to handle the load.

Event Timeline

Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:14Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1415 (T146212)

Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:23Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1416 (T146212)

Mentioned in SAL (#wikimedia-labs) [2016-09-20T20:34:42Z] <madhuvishy> Created new instance tools-webgrid-lighttpd-1418 (T146212)

Mentioned in SAL (#wikimedia-labs) [2016-09-20T21:17:06Z] <madhuvishy|food> Pooled new sge exec node tools-webgrid-lighttpd-1415 (T146212)

Mentioned in SAL (#wikimedia-labs) [2016-09-20T21:23:48Z] <madhuvishy|food> Pooled new sge exec node tools-webgrid-lighttpd-1416 (T146212)

I depooled these - they were running into issues wrt https://phabricator.wikimedia.org/T115194, causing problems when people started webservices:

  File "/usr/bin/webservice-runner", line 26, in <module>
    proxy.register(port)
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/proxy.py", line 31, in register
    current_ip = socket.gethostbyname(socket.getfqdn())
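
The last frame of the traceback can be exercised on its own to see what a node resolves for itself. This is a minimal sketch, assuming the failure comes from the name lookup; `resolve_own_ip` is a hypothetical helper and not part of the toollabs package, and the injectable lookup functions exist only so it can be tried without touching DNS.

```python
import socket


def resolve_own_ip(getfqdn=socket.getfqdn, gethostbyname=socket.gethostbyname):
    """Mirror the lookup in the traceback: FQDN first, then a forward lookup.

    Returns (fqdn, ip). ip is None when the forward lookup fails, so the
    caller can see which name could not be resolved instead of crashing.
    """
    fqdn = getfqdn()
    try:
        return fqdn, gethostbyname(fqdn)
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses.
        return fqdn, None
```

Calling `resolve_own_ip()` with no arguments on an exec node shows the name `socket.getfqdn()` picks and whether it resolves back to an address.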
chasemp added subscribers: Andrew, Krenair.

thanks @yuvipanda

@madhuvishy you'll probably have to sync w/ @Krenair or @Andrew on some DNS leak cleanup here :)

We looked into it last night, but weren't able to find the cause. We do know that the last instance to leave a reverse DNS entry behind was deleted around 2016-09-08 22:46. @madhuvishy ran the following query for me, which returned 2016-09-08 22:46:45 after running for 10-15 minutes:

select max(nova.instances.deleted_at) from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null;

You'd expect the DNS entry to be deleted within a minute of that.
Unfortunately, @madhuvishy found nothing useful in the logs around that time, and I'm not sure whether this is now a historical problem that needs a one-off cleanup or an ongoing issue.

@AlexMonk-WMF With the offsite coming up and the need to add exec nodes to handle load, do you think we can do a cleanup now so we can pool these, and then look into the underlying causes?

Cleanup script has been run for existing cases.

krenair@bastion-01:~$ host tools-webgrid-lighttpd-1418 
tools-webgrid-lighttpd-1418.eqiad.wmflabs has address 10.68.20.200
krenair@bastion-01:~$ host tools-webgrid-lighttpd-1416
tools-webgrid-lighttpd-1416.eqiad.wmflabs has address 10.68.19.50
krenair@bastion-01:~$ host 10.68.20.200
200.20.68.10.in-addr.arpa domain name pointer tools-webgrid-lighttpd-1418.tools.eqiad.wmflabs.
krenair@bastion-01:~$ host 10.68.19.50
50.19.68.10.in-addr.arpa domain name pointer tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs.
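
The same forward and reverse check can be scripted for several nodes. This is a rough sketch, not the cleanup script mentioned above: `check_forward_reverse` is a hypothetical helper, and comparing only the first DNS label is an assumption made here because the forward name (`.eqiad.wmflabs`) and the PTR name (`.tools.eqiad.wmflabs`) differ in the output above.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then check the PTR record points back.

    Returns (ip, ptr_name, matches). matches is True when the first label
    of the PTR name equals the first label of the requested hostname.
    """
    ip = forward(hostname)
    # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
    ptr_name = reverse(ip)[0].rstrip(".").lower()
    short_name = hostname.rstrip(".").lower().split(".")[0]
    matches = ptr_name.split(".")[0] == short_name
    return ip, ptr_name, matches
```

A result with `matches` set to False would suggest the address still carries a PTR record from an earlier instance.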

Mentioned in SAL (#wikimedia-labs) [2016-09-21T18:42:45Z] <madhuvishy> Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup

Mentioned in SAL (#wikimedia-labs) [2016-09-21T18:56:41Z] <madhuvishy> Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup

This is all done - tools-webgrid-lighttpd-1415, 1416, 1418 are up and running.