Page MenuHomePhabricator

Parsoid restarts not completely reliable with upstart
Closed, DuplicatePublic

Description

Today during the deploy, several of the parsoid hosts remained down for quite a long time. Their logs were filled with:

{"name":"parsoid","hostname":"wtp1001","pid":14985,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綉","oldId":26153377,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153
377","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%89?oldid=26153377","time":"2014-11-1
3T21:15:42.711Z","v":0}
{"name":"parsoid","hostname":"wtp1001","pid":14984,"level":40,"logType":"warning","wiki":"enwiktionary
","title":"綊","oldId":17538871,"msg":"Cache MISS: http://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538
871","longMsg":"Cache MISS:\nhttp://10.2.2.29/enwiktionary/%E7%B6%8A?oldid=17538871","time":"2014-11-1
3T21:15:42.713Z","v":0}
[warning][worker][14995] shutting down
[warning][worker][16672] shutting down
[warning][worker][14987] shutting down
[info][master][14981] shutting down, killing workers
[warning][worker][14996] shutting down
[warning][worker][15004] shutting down
[warning][worker][14990] shutting down
[warning][worker][14998] shutting down
[warning][worker][14988] shutting down
[warning][worker][14992] shutting down
[warning][worker][14985] shutting down
[warning][worker][15011] shutting down
[warning][worker][14984] shutting down
[warning][worker][15001] shutting down
[warning][worker][15007] shutting down
{"name":"parsoid","hostname":"wtp1001","pid":15003,"level":30,"logType":"info","wiki":"rowiki","title"
:"2012","oldId":8889091,"msg":"completed parsing in 7469 ms","longMsg":"completed parsing in\n7469\nms
","time":"2014-11-13T21:15:43.087Z","v":0}
[warning][worker][15003] shutting down
[info][master][14981] exiting
[info][master][16784] initializing 15 workers
[info][worker][16793] loading ...
[info][worker][16790] loading ...
[info][worker][16788] loading ...
[info][worker][16798] loading ...
[info][worker][16795] loading ...
[info][worker][16801] loading ...
[info][worker][16806] loading ...
[info][worker][16793] ready on :8000
[error][worker][16793] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.5:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16790] ready on :8000
[info][worker][16798] ready on :8000
[info][worker][16795] ready on :8000
[error][worker][16790] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.3:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[error][worker][16798] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.7:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16801] ready on :8000
[error][worker][16795] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.6:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[error][worker][16801] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.9:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16806] ready on :8000
[info][worker][16791] loading ...
[error][worker][16806] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.11:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16787] loading ...
[info][worker][16788] ready on :8000
[error][worker][16788] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.2:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16799] loading ...
[info][worker][16804] loading ...
[info][worker][16810] loading ...
[info][worker][16808] loading ...
[info][worker][16812] loading ...
[info][worker][16814] loading ...
[info][worker][16791] ready on :8000
[error][worker][16791] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.4:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[info][worker][16787] ready on :8000
[error][worker][16787] Port 8000 is already in use. Exiting.
Stack:

HTTPServer.<anonymous> (/srv/deployment/parsoid/deploy/src/api/ParsoidService.js:61:18)
HTTPServer.EventEmitter.emit (events.js:95:17)
net.js:1081:16
Object.1:1 (cluster.js:592:5)
handleResponse (cluster.js:171:41)
respond (cluster.js:192:5)

[..etc...]

Two things to note here: first, logs from the master aren't going through bunyan, and thus not ending up in logstash. There's is some discussion of this on https://gerrit.wikimedia.org/r/171618.

Secondly, we need to use SO_REUSEADDR when we open port 800 so that we don't have to wait for the full linger time to bring up the service after a deploy.
See http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t


Version: unspecified
Severity: normal

Details

Reference
bz73395

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:56 AM
bzimport added a project: Parsoid-Web-API.
bzimport set Reference to bz73395.
cscott created this task.Nov 13 2014, 10:16 PM

Filed bug 73396 for the "logs don't make it to logstash" issue.

Node's docs claim that they already use SO_REUSEADDR:

http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback

This makes our EADDRINUSE errors very mysterious.

However, the 'cluster' package does mysterious things: http://nodejs.org/api/cluster.html#cluster_how_it_works

It's possible that's what's going wrong here, somehow?

Given that we have had this code for a long time, and also given that we have also had a couple deploys post node 0.10 upgrade, and that this is the first deploy after we enabled timeouts, I would say this is related to timeouts (5 mins).

Possibly we're not shutting down cleanly when the service is stopped, and the zombie process continues holding on to the port until the timeout expires.

So, this is the code that we use to cleanly terminate the workers (see api/server.js)

cluster.disconnect(function() {
        logger.log( "info", "exiting" );
        process.exit(0);
});

This waits for worker to shut down. However, in the common case, the title should get processed fairly quicly and clear the timeout => all the workers should shut down in a timely fashion. I don't think (or at least, I would be surprised if it were) all those cases of hung processes that do indeed take 5 mins and eventually get killed by the timeout.

So, one question I now have is if we are holding onto timeouts after request processing and not clearing those out cleanly in all cases.

Well, it was 3-4 of the parsoid hosts which had this problem. So it's possible that these were hung processes which needed to wait for the timeout.

But in this case I would expect 'service parsoid stop' not to actually complete until the timeout had finished and the process had actually stopped. The problem was that the service was restarted while the old parsoid was still running (or so it would seem).

Ah, if 3-4 then maybe that is what it is then. Ping gwicke about the service restart part then. He might have a clue what is going on.

marcoil set Security to None.
Arlolra triaged this task as High priority.Nov 25 2014, 11:23 PM
GWicke added a subscriber: GWicke.EditedDec 3 2014, 2:15 AM

This might actually be the same problem I was dealing with quite a while ago.

There are some race conditions in upstart that are related to its hacky handling of forks & it's use of ptrace rather than cgroups. IIRC it sometimes thinks one of the workers is the leader process if a bunch of workers were re-forked since the last full restart. The same problem doesn't happen with the init script (start-stop-daemon really) & systemd. Fortunately, we are moving to Debian & are getting rid of upstart.

We have a fine init script that works both on Ubuntu & Debian in the deploy repo (debian/parsoid.init IIRC). We could plug that into puppet if this gets too annoying.

Can someone test this by using the init script on one instance to see if this still happens?

ssastry moved this task from Needs Triage to In Progress on the Parsoid board.Dec 4 2014, 8:23 AM

Can someone test this by using the init script on one instance to see if this still happens?

@yuvipanda: We don't have root on those boxes, so that would need to be somebody in ops.

@yuvipanda: We don't have root on those boxes, so that would need to be somebody in ops.

See also: https://gerrit.wikimedia.org/r/#/c/182585/

chasemp added a subscriber: chasemp.Jan 7 2015, 5:11 PM

@yuvipanda: We don't have root on those boxes, so that would need to be somebody in ops.

See also: https://gerrit.wikimedia.org/r/#/c/182585/

@GWicke, now that the perms have landed can you poke at this?

GWicke renamed this task from Parsoid should use SO_REUSEADDR when it binds to its port to Parsoid restarts not completely reliable with upstart.Jan 7 2015, 5:26 PM
GWicke claimed this task.

cool, so since it looks like ops was roped in for perms stuff and that is resolved. I'm removing us, but only so we don't keep bugging you. If something comes up let us know :)

I think this problem may have been mitigated with this: https://gerrit.wikimedia.org/r/#/c/182644/ .. Last few deploys, I haven't see any stuck processes that needed explicit kills.

ssastry lowered the priority of this task from High to Medium.Feb 3 2015, 10:23 PM

The eventual move to jessie & systemd will further improve this, as systemd systematically tracks process groups with cgroups.

I hope that we can create a generic systemd unit for node services, leveraging the default interface provided by https://github.com/wikimedia/service-runner. See T89901.

GWicke removed GWicke as the assignee of this task.Feb 20 2015, 10:21 PM
ssastry moved this task from In Progress to Needs Triage on the Parsoid board.Mar 3 2015, 9:09 PM