Page MenuHomePhabricator

Get rid of portgranter
Closed, ResolvedPublic

Description

It doesn't actually grant any ports, since the kernel decides this. This complicates the stack for no particular reason, and adds more complex recovery scenarios (this is what went wrong during the reboot during the GHOST update).

the uwsgi and node tools simply use the uid of the tool user as port, and this works well since one tool can have only one webservice. Let's get rid of portgranter and just use that instead. I verified that all our tools have uid in 5xxxx range, and should be good.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description. (Show Details)
yuvipanda added a project: Toolforge.
yuvipanda added subscribers: yuvipanda, coren, scfc, Joe.

The problem with the uid = port approach is that if that one port is taken, you're stuck for good.

Ideally, web servers would bind to a free port (0) and then report to the proxies the actual port they allocated.

I haven't checked whether this technique could be used in a wrapper, i. e. bind to a free port, note the actual port, close the socket, call the web server with the port number that was free a millisecond ago, or if Linux recycles port numbers so aggressively that that race condition is hit every time.

@scfc: Right, so the only 'true' way to find out if a port is free from userspace that I could find is actually bind to a free port, but I don't know if closing it will release it immediately... We could do something lame like look at netstat output to pick an empty port?

There is no "safe" way to do so, except passing an open file descriptor to the web server as it starts (which very, very few servers support - uwsgi being the only one I know of).

That said, if you retain the "registry" system of portgranter (which is what it does), it becomes reasonable for the granter to *test* that the socket is actually free before giving it. It's not foolproof but unless a job goes out of its way to try to steal ports it will work flawlessly.

(Right now, the only real issue is that it's possible for a port to be allocated while portgranter doesn't know about it. If it tests first and discards the port before granting it, then we're golden).

Portgranter testing has the same effect as the wrapper script testing, which is the race condition. And doing it in the wrapper script reduces the number of moving parts.

No, it's not the same effect: the granter testing means that if the port is found use after all it can be /removed/ from the list entirely and another one granted to the client.

I.e.: the client doesn't fail, it gets a good port.

And also, there are no race condition possible if there is a single granter given it will never *try* to allocate the same port twice.

You can do the exact same looping in the wrapper script too :)

Again, the problem is that the portgranter doesn't actually allocate anything. The 'race condition' is jobs not going through webservice getting on those queues and binding on random ports.

Basically, I'm trying to figure out what exactly porgranter adds here, and I can not. It is another running service with a dependency on NFS, and I do not see what it actually gives us over simple uid based port binding. I am just trying to remove moving parts from the webservice system, which is already fairly complex. And portgranter did fail during the GHOST reboot by basically giving webservices duplicate ports, and I'd like to eliminate this point of failure unless there's a very good reason not to.

You're looking at it the wrong way: it doesn't allocate ports, it allocates and registers from a pool. Think of it like a DHCP server, not a bind() system call.

Its missing feature right now is removing collisions from that pool (as a DHCP server would). Adding a simple test before it responds of "is this port actually free as I expect" and removing it from the pool if it isn't will solve the "rogue job stealing a port" problem.

The race condition I refer to is two jobs starting near enough each other that "test port n, it is free, release it, start process" will both test n for a brief time, likely both find it free, then one of the two processes will randomly fail because the other ended up starting faster.

Right, but I'm saying this doesn't require a DHCP server....

Imagine I get rid of portgranter, and just set the PORT to be the uid. What goes wrong? What i the negative of that? The positive, as I said, is a simpler system with one less moving part.

Exatcly the same things as static IP allocation. Nothing when all goes well, and hard-to-track issues that need manual intervention when there is a collision. If your wrapper doesn't test the port then that blocks the job from starting entirely. If it does, then *what* alternate port does it offer? What prevents collisions on that one?

Since we have only one webservice running per user, there would be no collissions from webservice itself. And if there is an external collision, that is going to come from something not using webservice (and consequently portgranter either).

Hmm, zombie webservice processes that have refused to die would cause problems, maybe? (with uid -> port mapping)

They would, and with no way to allocate a different port that would stall the service. Whereas if a new webservice job is started it'd get a fresh port and would override the redis mapping with the new one.

Seriously, the problem set is exactly the same as "allocate IPs over a subnet to dynamic hosts" which is why I went with a conceptually similar solution. :-)

@coren: If portgranter would test each port before giving it out, you could replace the pool with a random function (or binding to 0), this also would mean that you could move that logic to the wrapper (portgrabber) and remove the daemon from the equation. I think that's what @yuvipanda meant with fewer moving parts.

So here's the problem as I see it:

Our webservice setup has the same 'state' maintained across different places, and they all need to be kept in sync.

Currently, 'state' is kept in the following places:

  1. portgranter
  2. open sockets to proxylistener
  3. redis on proxy hosts
  4. OGE master (this knows about the processes that are actually running)

So any time any of these disagree, we have a problem. I just want to reduce the amount of state as much as possible. Does *that* sound sane?

It does, very much so.

How about we set up a hangout to work out plausible solutions and report the result of that discussion back here before we settle on a specific implementation?

Not yet :) I'm still deep in betalabs stuff, and I guess @scfc would like to be involved as well :) Let's keep noodling around here (maybe retitle if current title is too presemptuous?).

I tested putting:

# Bind to a free port and note the port number.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', 0))
port = s.getsockname()[1]
s.close()

into portgrabber and it worked as the spawned lighttpd could bind to the port number without problem, so I propose doing that.

But labs/toollabs's www/content/list.php refers to /data/project/.system/dynamic for a list of active web services that is managed by portgranter, so before we can move into this direction, we need to implement T88216, then update www/content/list.php to refer to that list, and then portgranter can be retired.

+1 to ^, and we can just loop around if needed (and actually add instrumentation too to detect how often this causes problems).

scfc triaged this task as Medium priority.Apr 6 2015, 11:04 AM
scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

Change 204009 had a related patch set uploaded (by Yuvipanda):
tools: Have uwsgi and nodejs attempt to pick a random port

https://gerrit.wikimedia.org/r/204009

Change 204009 merged by Yuvipanda:
tools: Have uwsgi and nodejs attempt to pick a random port

https://gerrit.wikimedia.org/r/204009

Alright, so uwsgi and nodejs tools now pick up from random port :)

As for list.php, an accurate way to do it now would be to read service.manifest file in that tool's home directory.

Change 204012 had a related patch set uploaded (by Yuvipanda):
tools: Don't use portgranter anywhere

https://gerrit.wikimedia.org/r/204012

Change 204012 merged by Yuvipanda:
tools: Don't use portgranter anywhere

https://gerrit.wikimedia.org/r/204012

Bit the bullet and did it. list.php is going to be possibly out of date for a bit now. That's ok, I think.

Change 204013 had a related patch set uploaded (by Yuvipanda):
Temporarily remove 'web' link URLs from main page listing

https://gerrit.wikimedia.org/r/204013

Change 204014 had a related patch set uploaded (by Yuvipanda):
tools: Remove remnants of portgranter code

https://gerrit.wikimedia.org/r/204014

Change 204014 merged by Yuvipanda:
tools: Remove remnants of portgranter code

https://gerrit.wikimedia.org/r/204014

yuvipanda claimed this task.

It is gone now.

Change 204013 abandoned by Yuvipanda:
Do not depend on portgranter

Reason:
No

https://gerrit.wikimedia.org/r/204013