kill_job function in remote execution module of transfer framework does not close the ports instantly
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Privacybatm
	May 16 2020, 7:05 PM

Description

The transfer.py is a framework used for database backup and recovery. The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket is to enquire whether it is the expected behaviour or not? If yes, could you please explain a little bit about it?

Details

	Subject	Repo	Branch	Lines +/-
	Firewall.py: Add function to kill process by its port number	operations/software/wmfmariadbpy	master	+38 -3

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	jcrespo	T246435 Create or improve a tool for monitoring or automating tasks for Wikimedia databases
Resolved	Privacybatm	T248256 GSoC 2020 Proposal: Improve the framework to transfer files over the LAN
Resolved	Privacybatm	T252171 Automate the detection of netcat listen port in transfer.py
Resolved	Privacybatm	T252950 kill_job function in remote execution module of transfer framework does not close the ports instantly

Event Timeline

Privacybatm created this task.May 16 2020, 7:05 PM

• Marostegui moved this task from Triage to In progress on the DBA board.May 18 2020, 5:09 AM

The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket is to enquire whether it is the expected behaviour or not? If yes, could you please explain a little bit about it?

I am not sure what you mean. Netcat is executed so that it finishes when an EOF is reached, so it just ends up. Do you mean if the job should be killed if an error is found? Sure, if you can detect one and see a case where that can happen, yes- but I don't understand yet which case you are assuming (process error? other error)? I would be of course ok if you can correct any cases were a failure leads to the port being open after an error/exception. I think either it should be closed or an error should be given to user noting that ports are left open.

kill_job is a method of RemoteExecution. Transfer uses Remote Execution to run commands. You can use that if you need it, just I don't understand which is the case/error that you refer too. Feel free to add a patch/mockup (doesn't have to be a complete solution) giving more context to your question.

In other words, feel free to provide an example of how to improve error handling :-D, I am of course open for you to correct it.

[edited] Actually I'm talking about the kill_job function. I will give you a context to understand my question correctly.

def _kill_use_ports(self, host, jobs):
        """
        Kill the jobs on the given host.

        :param host: host on which the jobs to be killed
        :param jobs: list of jobs to be killed
        """
        for job in jobs:
            self.transferer.remote_executor.kill_job(host, job)

def test_find_available_port(self):
        """Test find available port."""
        use_ports = [4400, 4401]
        # Use ports 4400 and 4401
        jobs = self._use_ports(self.host, use_ports)

        # Close ports
        self._kill_use_ports(self.host, jobs)

        # In the test running machine expect no other
        # process uses port between 4400 and 4500.
        # Since 4400 and 4401 are used by this test,
        # port opened for transfer should be 4402.
        self.assertEqual(self.options['port'], 4402)

        use_ports = [4401]
        # Use port 4401
        jobs = self._use_ports(self.host, use_ports)

        # Close ports
        self._kill_use_ports(self.host, jobs)

        # In the test running machine expect no other
        # process uses port between 4400 and 4500.
        # Since 4401 is used by this test,
        # port opened for transfer should be 4400.
        self.assertEqual(self.options['port'], 4400)

If we run the above test, it fails at the second assert. The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine).

The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine).

Interesting, I was about to test it for myself, but my suspicion is that you run into a famous TCP "feature", not a bug- WAIT_CLOSED. I will check if that is the case. If it is, there is not really a good fix, except either randomize the port usage (not ideal) or just comment to wait those 30 seconds between every test.

I will test it to make sure it is that and not an different bug.

I've checked and both a manual "kill -9" and a "kill -15" should make the port available almost instanatly, so probably it is not that. Maybe kill_job doesn't work properly, will research it on my testing and report back.

jcrespo edited parent tasks, added: T252171: Automate the detection of netcat listen port in transfer.py; removed: T248256: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN.May 20 2020, 4:06 PM

jcrespo mentioned this in T252171: Automate the detection of netcat listen port in transfer.py.

what we do now is, we start a nc-listen command in the target machine with the start_job which makes a new process in the framework running machine (with netcat listen waiting) and kill_job function uses terminate function inside multiprocessing/process.py (given below) to kill that job.

def terminate(self):
        '''
        Terminate process; sends SIGTERM signal or uses TerminateProcess()
        '''
        self._popen.terminate()

To solve this ticket, we will have to send a kill directly to the target process using Cumin before we kill the job in the script-running machine. But only available information about the process in the target machine is the exact-command. I don't have knowledge about any Linux way to kill process by the exact-command! Please let me know your suggestion.

I see- our command does some piping- this means that it generates a few subprocesses with a single command. I think the right way would be to use netcat to know which processes are listening on a specific port- something that could be a method within the Firewall class, and then send a kill to the pid obtained from ~~netcat. netcat -tlpn~~ netstat -tlpn should give us a numeric PID so we don't have to work with commands.

Starting to work a bit on PIDs and ports for our library of methods would not be a waste of time, as we may want to reuse it later for concurrency handling and error states, not just the integration test. Of course, the immediate need is the error in the test.

Thank you for the suggestion, I will try with netcat.

Sorry, when I said netcat before, I meant netstat. =:-D

fuser will actually be more elegant than netstat:

fuser 4040/tcp

and can kill processes too:

fuser -k 4040/tcp

~~We could also use this better than netstat for the "check open ports" than my initial oneliner.~~ Actually we cannot for that as we need a range of ports, but it will be helpful for this ticket.

jcrespo moved this task from In progress to GSOC2020 on the DBA board.May 25 2020, 1:48 PM

I will try with fuser then! Thank you!

Change 598560 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/wmfmariadbpy@master] Firewall.py: Add function to kill process by its port number

https://gerrit.wikimedia.org/r/598560

gerritbot added a project: Patch-For-Review.May 26 2020, 1:31 AM

Change 598560 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] Firewall.py: Add function to kill process by its port number

https://gerrit.wikimedia.org/r/598560

Maintenance_bot removed a project: Patch-For-Review.Jun 4 2020, 9:10 AM

This is resolved as defined, but more tickets may be open in the future to control concurrency and better leftover process after failure.

Privacybatm mentioned this in rOSTPcc58a878335d: Firewall.py: Add function to kill process by its port number.Jun 4 2020, 10:25 AM

Privacybatm mentioned this in rOSWBcc58a878335d: Firewall.py: Add function to kill process by its port number.Sep 1 2020, 8:36 AM

kill_job function in remote execution module of transfer framework does not close the ports instantlyClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

kill_job function in remote execution module of transfer framework does not close the ports instantly
Closed, ResolvedPublic
Actions

Related Objects
Search...