Replace custom server.js with service-runner
Closed, ResolvedPublic

Description

Missing features in service-runner before we can use it for Parsoid:

  • request timeout logic: T90362

The move would involve:

  • config format change (yaml instead of JS): Pretty straightforward except for the parsoidConfig.setInterwiki() calls, which would need translating to a static syntax.

Benefits:

  • get heap monitoring & other future goodies for free
  • ability to run several sevices in a single process for small installs
  • ability to use the standard node service puppet module; possibly generic packaging in the future

Related Objects

There are a very large number of changes, so older changes are hidden. Show Older Changes

It looks like the Parsoid side is now ready, and the next step is to prepare puppet changes / the move to the service puppet module. @mobrovac, could you prioritize the puppet work this week?

mobrovac claimed this task.

Taking this over. I have started work on the service::node patch, and it is likely I'll need to make some smaller refinements to https://gerrit.wikimedia.org/r/#/c/289508/

Change 289508 merged by Mobrovac:
T90668: Replace custom server.js with service-runner

https://gerrit.wikimedia.org/r/289508

Change 298436 had a related patch set uploaded (by Mobrovac):
Parsoid: Move to service::node

https://gerrit.wikimedia.org/r/298436

The above patch has been cherry-picked in beta and deployment-parsoid07 has been provisioned with it without problems. Next steps:

  • direct all BetaCluster traffic to that instance and let it run for a while
  • schedule and go through the move in production (for wtp* hosts)

Note that the patch does not touch ruthenium for the time being, but its puppet modules will likely need some rework too. I propose to do it in a second pass.

Change 298780 had a related patch set uploaded (by Mobrovac):
[Beta] Parsoid: direct traffic to deployment-parsoid07

https://gerrit.wikimedia.org/r/298780

Change 298782 had a related patch set uploaded (by Mobrovac):
[Beta] Parsoid: direct traffic to deployment-parsoid07

https://gerrit.wikimedia.org/r/298782

  • schedule and go through the move in production (for wtp* hosts)

How this is all going to work since you both need new deployed code as well as new puppet code? Will this be done one node at a time where you manually pull fresh code onto that node and then run new puppet there?

Change 298782 merged by jenkins-bot:
[Beta] Parsoid: direct traffic to deployment-parsoid07

https://gerrit.wikimedia.org/r/298782

Change 298780 merged by Rush:
[Beta] Parsoid: direct traffic to deployment-parsoid07

https://gerrit.wikimedia.org/r/298780

All BetaCluster is now being received by deployment-parsoid07. Verified VE there, it works as advertised.

  • schedule and go through the move in production (for wtp* hosts)

How this is all going to work since you both need new deployed code as well as new puppet code? Will this be done one node at a time where you manually pull fresh code onto that node and then run new puppet there?

The usual way is to disable puppet on the concerned nodes, then merge the ops/puppet change, do the deploy from tin (without service restarts), and then go through the nodes one by one, enabling puppet and running it. That will cause the new config to be pulled in, and the service will be restarted because of it. After a certain amount of nodes has been converted and we have verified they work, then puppet can simply be enabled on the rest of the nodes.

ssastry added a comment.EditedJul 13 2016, 4:07 PM

All BetaCluster is now being received by deployment-parsoid07. Verified VE there, it works as advertised.

No, it doesn't yet. I get a http 500 (got http 400 a couple mins earlier) when I try to review changes (which does the same thing more or less as save as far as Parsoid is concerned). You need to actually make edits and review changes since VE loads with content from RESTBase and doesn't actually hit Parsoid directly.

And, where are the parsoid logs on deployment-parsoid07?

All BetaCluster is now being received by deployment-parsoid07. Verified VE there, it works as advertised.

No, it doesn't yet. I get a http 500 (got http 400 a couple mins earlier) when I try to review changes (which does the same thing more or less as save as far as Parsoid is concerned). You need to actually make edits and review changes since VE loads with content from RESTBase and doesn't actually hit Parsoid directly.

There seems to be an issue with Cassandra there, it's throwing a lot of 500s.

And, where are the parsoid logs on deployment-parsoid07?

/srv/log/parsoid/main.log. You can use the tail-parsoid command to tail them (it accepts the same args as tail).

All BetaCluster is now being received by deployment-parsoid07. Verified VE there, it works as advertised.

No, it doesn't yet. I get a http 500 (got http 400 a couple mins earlier) when I try to review changes (which does the same thing more or less as save as far as Parsoid is concerned). You need to actually make edits and review changes since VE loads with content from RESTBase and doesn't actually hit Parsoid directly.

There seems to be an issue with Cassandra there, it's throwing a lot of 500s.

This has calmed down now, it was due to Cassandra restarts.

Still a couple of issues:

  • Kibana logs shows http 400s that I saw when I tried to edit the African_Linsang page and hit 'Review changes'. "Content-type of original html is missing." is the error. This is likely a result of https://gerrit.wikimedia.org/r/#/c/293351/ and possibly some mismatch between Parsoid & RESTBase expectations. I might just revert that for now to eliminate that moving part from the equation during this switch.
  • There are no log entries in /srv/log/parsoid/main.log on deployment-parsoid07 which is a problem since we rely on disk logs regularly.
ssastry added a comment.EditedJul 13 2016, 5:09 PM

Still a couple of issues:

  • Kibana logs shows http 400s that I saw when I tried to edit the African_Linsang page and hit 'Review changes'. "Content-type of original html is missing." is the error. This is likely a result of https://gerrit.wikimedia.org/r/#/c/293351/ and possibly some mismatch between Parsoid & RESTBase expectations. I might just revert that for now to eliminate that moving part from the equation during this switch.

Reverted now. After merging https://gerrit.wikimedia.org/r/#/c/298804/, deployment-parsoid07 should be updated. Can you also document the process of updating beta cluster? Is it just a git pull on the beta cluster node? Or a git deploy from somewhere?

Note that the patch does not touch ruthenium for the time being, but its puppet modules will likely need some rework too. I propose to do it in a second pass.

Is it possible to do this as part of the upgrade as well? It would be good to run rt-tests for a bit at least. For now, a simple tweak might be sufficient to pass the yaml config instead of the localsettings config.

ssastry@ruthenium:/srv/parsoid/src$ /usr/bin/nodejs bin/server.js -n 8 -c /srv/parsoid/src/localsettings.js
Error while reading config file: YAMLException: end of the stream or a document separator is expected at line 17, column 10:
    			prefix: 'customwiki',

Reverted now. After merging https://gerrit.wikimedia.org/r/#/c/298804/, deployment-parsoid07 should be updated. Can you also document the process of updating beta cluster? Is it just a git pull on the beta cluster node? Or a git deploy from somewhere?

Deployed in beta. The procedure is exactly the same as for prod, the difference is that you log onto deployment-tin.deployment-prep.eqiad.wmflabs instead of tin.eqiad.wmnet.

Note that the patch does not touch ruthenium for the time being, but its puppet modules will likely need some rework too. I propose to do it in a second pass.

Is it possible to do this as part of the upgrade as well? It would be good to run rt-tests for a bit at least. For now, a simple tweak might be sufficient to pass the yaml config instead of the localsettings config.

I understand that rt-tests are important, but I'm leaning towards minimising the amount of changes made in prod at one time. I will take a closer look at its manifests and try to understand what exactly would be needed to do the switch there too. In general, we have these options:

  1. Switch wtp* and ruthenium at the same time
  2. Switch wtp* first, and then ruthenium
  3. Switch ruthenium first

I will report back on this after further inspection as at present I cannot assess how much work is needed to get ruthenium working with the new way of things.

ssastry@ruthenium:/srv/parsoid/src$ /usr/bin/nodejs bin/server.js -n 8 -c /srv/parsoid/src/localsettings.js
Error while reading config file: YAMLException: end of the stream or a document separator is expected at line 17, column 10:
    			prefix: 'customwiki',

Yup, service-runner expects a YAML config, not a JS one.

  • There are no log entries in /srv/log/parsoid/main.log on deployment-parsoid07 which is a problem since we rely on disk logs regularly.

Yeah, this is worrisome. What is happening here is that Parsoid now seems to be using two different loggers - one provided by service-runner, and another one instantiated by Parsoid itself based on the settings loaded from localsettings.js. This would need to be consolidated to use exclusively the logger provided by service-runner. I think we should consider this a strong blocker for the migration - going blind into prod does not sound like a tale I'd like to read :)

Well, actually, technically we do have the logs available. They're in /var/log/upstart/parsoid.log which is readable only by root. My idea was to simply redirect stdout and stderr for the time being. Alas, this needs trickery for Upstart (as has been doen previously for Parsoid's case), which I'm not comfortable with, TBH. Note that this is not the case for SystemD (on Jessie) - we can redirect stdout/err easily there.

A possible work-around we can think about at this point is to tell Parsoid to write to /srv/log/parsoid/parsoid.log in localsettings.js, but I'm not happy with hard-coding the path in the deploy repo either. We may live with it since we plan to switch to Jessie soon(TM), but that does not address the real issue at hand (Parsoid having using two loggers).

PS7 of Gerrit 298436 makes changes to ruthenium too. The puppet compiler output is pretty decent, especially for ruthenium. For the wtp* hosts however, there are a bunch of admin rule removals, which I'm not sure that should happen (nor that those should be there in the first place, IMHO, as admin access is handled differently IIRC).

ssastry added a comment.EditedJul 14 2016, 3:59 PM

Well, actually, technically we do have the logs available. They're in /var/log/upstart/parsoid.log which is readable only by root. My idea was to simply redirect stdout and stderr for the time being. Alas, this needs trickery for Upstart (as has been doen previously for Parsoid's case), which I'm not comfortable with, TBH. Note that this is not the case for SystemD (on Jessie) - we can redirect stdout/err easily there.

A possible work-around we can think about at this point is to tell Parsoid to write to /srv/log/parsoid/parsoid.log in localsettings.js, but I'm not happy with hard-coding the path in the deploy repo either. We may live with it since we plan to switch to Jessie soon(TM), but that does not address the real issue at hand (Parsoid having using two loggers).

I think you can have the 2 loggers write to 2 different files (parsoid.log and servicerunner.log) in the same dir. servicerunner logs will only be service-level events (workers start, stop, timeout). Parsoid logger has some functionality that we rely on that service runner logger doesn't provide yet. I don't think migration to service-runner should block on consolidation into a single logger.

Change 299004 had a related patch set uploaded (by Mobrovac):
Logger conf: output to a file instead of stdout

https://gerrit.wikimedia.org/r/299004

I think you can have the 2 loggers write to 2 different files (parsoid.log and servicerunner.log) in the same dir. servicerunner logs will only be service-level events (workers start, stop, timeout). Parsoid logger has some functionality that we rely on that service runner logger doesn't provide yet. I don't think migration to service-runner should block on consolidation into a single logger.

It's rather non-standard, but ok. Gerrit 299004 alters Parsoid's configuration to output to /srv/log/parsoid/parsoid.log. Note that this is a temporary measure until the hosts are switched to Jessie, at which point we can revert it.

Change 299004 merged by jenkins-bot:
Logger conf: output to a file instead of stdout

https://gerrit.wikimedia.org/r/299004

OK, Parsoid in beta should now be ready. @ssastry @Arlolra mind testing it and playing with it to be sure we're on the right track? On my side, I'll try to dig into why those admin rights are being removed from the wtp* hosts with this patch.

If all is good, I propose to start moving the production hosts, so adding Operations .How about this: convert a couple of hosts on Tuesday, 2016-07-19 and let them run for a day, then turn the rest on Wednesday?

OK, Parsoid in beta should now be ready. @ssastry @Arlolra mind testing it and playing with it to be sure we're on the right track? On my side, I'll try to dig into why those admin rights are being removed from the wtp* hosts with this patch.

If all is good, I propose to start moving the production hosts, so adding Operations .How about this: convert a couple of hosts on Tuesday, 2016-07-19 and let them run for a day, then turn the rest on Wednesday?

Works for me.

mobrovac added a subscriber: Joe.Jul 19 2016, 7:25 AM

@Joe, @ssastry and I will move the first couple of wtp nodes to the new version today @ 14:30 UTC.

Change 299715 had a related patch set uploaded (by Giuseppe Lavagetto):
service::node: add 'entrypoint' and 'heartbeat_to'

https://gerrit.wikimedia.org/r/299715

Change 299716 had a related patch set uploaded (by Giuseppe Lavagetto):
parsoid: add role based on service::node, apply to two hosts

https://gerrit.wikimedia.org/r/299716

Change 299717 had a related patch set uploaded (by Giuseppe Lavagetto):
parsoid::testing: move to use the parsoid class

https://gerrit.wikimedia.org/r/299717

Change 299718 had a related patch set uploaded (by Giuseppe Lavagetto):
parsoid: move to role::parsoid for all production nodes

https://gerrit.wikimedia.org/r/299718

Change 299720 had a related patch set uploaded (by Giuseppe Lavagetto):
parsoid: add transition cleanup role

https://gerrit.wikimedia.org/r/299720

Change 299720 merged by Giuseppe Lavagetto:
parsoid: add transition cleanup role

https://gerrit.wikimedia.org/r/299720

Change 299715 merged by Giuseppe Lavagetto:
service::node: add 'entrypoint' and 'heartbeat_to'

https://gerrit.wikimedia.org/r/299715

Change 299716 merged by Giuseppe Lavagetto:
parsoid: add role based on service::node, apply to two hosts

https://gerrit.wikimedia.org/r/299716

Change 299774 had a related patch set uploaded (by Mobrovac):
Parsoid: Increase heap limit to 600 MB

https://gerrit.wikimedia.org/r/299774

Change 299774 merged by Giuseppe Lavagetto:
Parsoid: Increase heap limit to 600 MB

https://gerrit.wikimedia.org/r/299774

Change 299775 had a related patch set uploaded (by Mobrovac):
Parsoid: Lower the heartbeat timeout to 3 mins

https://gerrit.wikimedia.org/r/299775

Change 299775 merged by Giuseppe Lavagetto:
Parsoid: Lower the heartbeat timeout to 3 mins

https://gerrit.wikimedia.org/r/299775

Change 299717 merged by Giuseppe Lavagetto:
parsoid::testing: move to use the parsoid class

https://gerrit.wikimedia.org/r/299717

Change 299786 had a related patch set uploaded (by Mobrovac):
Parsoid: testreduce: correct gitRepoPath

https://gerrit.wikimedia.org/r/299786

Change 299786 merged by Giuseppe Lavagetto:
Parsoid: testreduce: correct gitRepoPath

https://gerrit.wikimedia.org/r/299786

Change 299795 had a related patch set uploaded (by Mobrovac):
Config: expect "port" and "interface" as config option names

https://gerrit.wikimedia.org/r/299795

Change 299795 merged by jenkins-bot:
Config: expect "port" and "interface" as config option names

https://gerrit.wikimedia.org/r/299795

Change 299960 had a related patch set uploaded (by Giuseppe Lavagetto):
service::node: add git as deployment method

https://gerrit.wikimedia.org/r/299960

Mentioned in SAL [2016-07-20T12:17:55Z] <_joe_> disabling puppet on all parsoid hosts for the transition to service-runner T90668

Change 299718 merged by Giuseppe Lavagetto:
parsoid: move to role::parsoid for all production nodes

https://gerrit.wikimedia.org/r/299718

Change 298436 abandoned by Mobrovac:
Parsoid: Move to service::node

Reason:
Split into multiple patches, cf. https://gerrit.wikimedia.org/r/#/q/project:operations/puppet branch:production topic:parsoid_service_node,n,z

https://gerrit.wikimedia.org/r/298436

Change 299960 merged by Giuseppe Lavagetto:
service::node: add git as deployment method

https://gerrit.wikimedia.org/r/299960

Change 300067 had a related patch set uploaded (by Mobrovac):
Parsoid: clean up the manifests and files

https://gerrit.wikimedia.org/r/300067

ssastry closed this task as Resolved.Jul 20 2016, 6:37 PM

This is now done and the production cluster has migrated over. There is a bunch of followup to do including vagrant updates, debian package updates, more tweaking and cleanup, but they can all be new tasks.

GWicke added a comment.EditedJul 20 2016, 8:46 PM

Congratulations, @Arlolra, @ssastry, @mobrovac, @Joe!

As they say: "That's one small step for [a] man, one giant leap for mankind."

Change 300067 merged by Giuseppe Lavagetto:
Parsoid: clean up the manifests and files

https://gerrit.wikimedia.org/r/300067