Page MenuHomePhabricator

kill_job function in remote execution module of transfer framework does not close the ports instantly
Closed, ResolvedPublic

Description

The transfer.py is a framework used for database backup and recovery. The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket is to enquire whether it is the expected behaviour or not? If yes, could you please explain a little bit about it?

Event Timeline

The remote execution module of this framework has kill_job function and it does not kill/close the port used by the netcat instantly. This ticket is to enquire whether it is the expected behaviour or not? If yes, could you please explain a little bit about it?

I am not sure what you mean. Netcat is executed so that it finishes when an EOF is reached, so it just ends up. Do you mean if the job should be killed if an error is found? Sure, if you can detect one and see a case where that can happen, yes- but I don't understand yet which case you are assuming (process error? other error)? I would be of course ok if you can correct any cases were a failure leads to the port being open after an error/exception. I think either it should be closed or an error should be given to user noting that ports are left open.

kill_job is a method of RemoteExecution. Transfer uses Remote Execution to run commands. You can use that if you need it, just I don't understand which is the case/error that you refer too. Feel free to add a patch/mockup (doesn't have to be a complete solution) giving more context to your question.

In other words, feel free to provide an example of how to improve error handling :-D, I am of course open for you to correct it.

[edited] Actually I'm talking about the kill_job function. I will give you a context to understand my question correctly.

def _kill_use_ports(self, host, jobs):
        """
        Kill the jobs on the given host.

        :param host: host on which the jobs to be killed
        :param jobs: list of jobs to be killed
        """
        for job in jobs:
            self.transferer.remote_executor.kill_job(host, job)

def test_find_available_port(self):
        """Test find available port."""
        use_ports = [4400, 4401]
        # Use ports 4400 and 4401
        jobs = self._use_ports(self.host, use_ports)

        # Close ports
        self._kill_use_ports(self.host, jobs)

        # In the test running machine expect no other
        # process uses port between 4400 and 4500.
        # Since 4400 and 4401 are used by this test,
        # port opened for transfer should be 4402.
        self.assertEqual(self.options['port'], 4402)

        use_ports = [4401]
        # Use port 4401
        jobs = self._use_ports(self.host, use_ports)

        # Close ports
        self._kill_use_ports(self.host, jobs)

        # In the test running machine expect no other
        # process uses port between 4400 and 4500.
        # Since 4401 is used by this test,
        # port opened for transfer should be 4400.
        self.assertEqual(self.options['port'], 4400)

If we run the above test, it fails at the second assert. The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine).

The reason behind that is, the remote_executor.kill_job() does not close the port instantly (takes more than 30s in my machine).

Interesting, I was about to test it for myself, but my suspicion is that you run into a famous TCP "feature", not a bug- WAIT_CLOSED. I will check if that is the case. If it is, there is not really a good fix, except either randomize the port usage (not ideal) or just comment to wait those 30 seconds between every test.

I will test it to make sure it is that and not an different bug.

I've checked and both a manual "kill -9" and a "kill -15" should make the port available almost instanatly, so probably it is not that. Maybe kill_job doesn't work properly, will research it on my testing and report back.

what we do now is, we start a nc-listen command in the target machine with the start_job which makes a new process in the framework running machine (with netcat listen waiting) and kill_job function uses terminate function inside multiprocessing/process.py (given below) to kill that job.

def terminate(self):
        '''
        Terminate process; sends SIGTERM signal or uses TerminateProcess()
        '''
        self._popen.terminate()

To solve this ticket, we will have to send a kill directly to the target process using Cumin before we kill the job in the script-running machine. But only available information about the process in the target machine is the exact-command. I don't have knowledge about any Linux way to kill process by the exact-command! Please let me know your suggestion.

I see- our command does some piping- this means that it generates a few subprocesses with a single command. I think the right way would be to use netcat to know which processes are listening on a specific port- something that could be a method within the Firewall class, and then send a kill to the pid obtained from netcat. netcat -tlpn netstat -tlpn should give us a numeric PID so we don't have to work with commands.

Starting to work a bit on PIDs and ports for our library of methods would not be a waste of time, as we may want to reuse it later for concurrency handling and error states, not just the integration test. Of course, the immediate need is the error in the test.

Thank you for the suggestion, I will try with netcat.

Sorry, when I said netcat before, I meant netstat. =:-D

fuser will actually be more elegant than netstat:

fuser 4040/tcp

and can kill processes too:

fuser -k 4040/tcp

We could also use this better than netstat for the "check open ports" than my initial oneliner. Actually we cannot for that as we need a range of ports, but it will be helpful for this ticket.

I will try with fuser then! Thank you!

Change 598560 had a related patch set uploaded (by Privacybatm; owner: Privacybatm):
[operations/software/wmfmariadbpy@master] Firewall.py: Add function to kill process by its port number

https://gerrit.wikimedia.org/r/598560

Change 598560 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] Firewall.py: Add function to kill process by its port number

https://gerrit.wikimedia.org/r/598560

This is resolved as defined, but more tickets may be open in the future to control concurrency and better leftover process after failure.