Migrate Parsoid cluster to Jessie / node 4.x
Closed, Resolved · Public

Description

We have been RT testing and Travis testing Parsoid code on 4.x for a while and everything seems good. It is time to start taking steps to move Parsoid to Jessie / node 4.x.

Given the change in GC behavior in node v4, the last blocker is a memory usage evaluation under concurrent loads similar to production. Given that the migration will have to happen one node at a time anyway, one simple / lazy way of doing this would be to migrate a single node over and monitor it over a 24-hour period. Is this recommended, or should I do load tests separately?

Separately, what is involved in getting this process going?

Event Timeline

I installed Siege and ran a 10-minute test on my laptop with node 0.10 and node 4.3 (command line: siege -v -i -t 10m -f urls.txt). I picked a random set of 10K titles from the visual diff database (which was itself randomly generated from production dbs) and fed it to siege.
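
For anyone who wants to reproduce this, the setup was roughly as follows (the urls.txt layout and the local port below are illustrative, not the exact values used):

  # urls.txt: one Parsoid request per line for the 10K sampled titles, e.g.
  #   http://localhost:8000/<domain>/v3/page/html/<title>
  # start Parsoid under the node version being tested, then run the same siege against it:
  siege -v -i -t 10m -f urls.txt    # verbose, hit URLs from the file in random order, for 10 minutes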

Looks like Parsoid is about 20% faster with node 4.3 (1.45s vs 1.82s response time, and 4540 vs 3817 completed transactions).

As for memory usage, with node 4.3, peak RES memory on any Parsoid worker was about 600M whereas on node 0.10, it was about 400M. This doesn't sound alarming.
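
For reference, per-worker RES can be checked with something like this (the process name is an assumption; it may be node or nodejs depending on the install):

  # list node processes sorted by resident memory (RSS, in KB)
  ps -C node -o pid,rss,etime,args --sort=-rss | head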

The production memory usage graph over the past year shows fairly stable usage, with 25-30% of available memory in use. Scaling that by the roughly 1.5x per-worker increase seen above, node v4 might push this to 40-45% of available memory, and hence potentially leave less margin for memory usage spikes -- although it might not work out that way in practice given the different GC behavior.

So, this still looks okay in terms of moving ahead, but I think we should still do a test migration of one or two servers, monitor them for at least 24 hours, and then make a call on the full migration. Separately, we should audit Parsoid code to see if there is any low-hanging fruit wrt memory usage that we can optimize.

ssastry triaged this task as Medium priority. May 12 2016, 10:50 PM

The load on my laptop was also lower with node v4.3 (load average: 3.10, 2.45, 1.90) vs node v0.10.26 (load average: 5.11, 4.33, 2.98). So, that is a win as well.

After moving to 4.x, you might also be able to get back more performance & memory by removing shims: T129598

Considering that the RT testing infra has been running on Node 4 for quite a while now and the data you collected, I think this is ready to move on.

Given that the migration will have to happen one node at a time anyway, one simple / lazy way of doing this would be to migrate a single node over and monitor it over a 24-hour period.

Yes. But given the number of Parsoid nodes, we might want to consider switching 2 of them right away and let them run for a while.

Separately, what is involved in getting this process going?

We need to depool the nodes we want to convert, re-image them, and bring them back up. Adding @akosiaris, who has been running point on the Ops side for this. @akosiaris, can we settle on a timeline for this?
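
Roughly, per node, that amounts to something like the following; the conftool tags and exact invocation below are guesses on my part, Alex will know the canonical commands:

  # take the node out of the LVS pool before reimaging (tags/flags assumed)
  confctl --tags dc=eqiad,cluster=parsoid,service=parsoid --action set/pooled=no wtp1001.eqiad.wmnet
  # reimage with the jessie installer, let puppet converge and sanity-check the service, then repool
  confctl --tags dc=eqiad,cluster=parsoid,service=parsoid --action set/pooled=yes wtp1001.eqiad.wmnet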

When would you like this done? A proposal on my part is the end of next week (week of the 16th, say Thursday?) for the first 2 boxes, and early in the week of the 23rd for the rest of the boxes. Does this sound OK?

For what it's worth, given that binary node_modules are not compatible between nodejs versions, no parsoid deploys should happen during the migration period. Correct?

When would you like this done? A proposal on my part is the end of next week (week of the 16th, say Thursday?) for the first 2 boxes, and early in the week of the 23rd for the rest of the boxes. Does this sound OK?

Seems like we have ourselves a deal! OK with you, @ssastry?

For what it's worth, given that binary node_modules are not compatible between nodejs versions, no parsoid deploys should happen during the migration period. Correct?

I don't think holding normal deploys will be needed, even though it would certainly make our lives easier. @ssastry, are there any important feature deploys planned in the next couple of weeks? Would having no deploys during this period be a problem?

As for the node binaries, the Parsoid and RT clusters have been running on node 0.10 and node 4.x respectively with the same binaries without problems, so I don't anticipate any issues in that regard. That said, caution needs to be exercised on the first Node 4.x deploy, which will give us the real picture.

For what it's worth, given that binary node_modules are not compatible between nodejs versions, no parsoid deploys should happen during the migration period. Correct?

Parsoid doesn't have any binary dependencies.

What @Arlolra said and @akosiaris's proposed timeline looks good to me. Since we usually deploy on Mondays and Wednesdays, there shouldn't be any issue with deploys anyway if we are migrating the cluster between Thursday and early week of 23rd. But, we can hold the Monday deploy if necessary.

Separately, we should check if puppet has hardcoded references to upstart, and whether we need to make any updates to production puppet code wrt service restarts. @mobrovac, @akosiaris, can you help with this?
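
A quick way to spot the hardcoded bits is to grep the puppet tree; the paths below are assumptions, the parsoid bits may live in a role class rather than a module:

  # clone operations/puppet and look for upstart-only assumptions around parsoid
  git clone https://gerrit.wikimedia.org/r/operations/puppet
  grep -rn -e upstart -e '/etc/init/' puppet/modules/parsoid puppet/manifests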

Good catch. I forgot parsoid does not yet use service-runner. It definitely has upstart stuff (and misses systemd stuff) from what I see in the role class.

So, this complicates the migration a bit, since it requires the puppet code to install upstart or systemd scripts depending on trusty/jessie. Shouldn't this be done prior to embarking on the re-imaging of the 2 nodes on Thursday?

Indeed. All these niceties that service-runner/service::node provide and parsoid doesn't yet have :-(.

I see a couple of paths forward (along with pros and cons):

  • Just do the absolutely necessary migrations to get Parsoid working on jessie/systemd.
    • Pros:
      • Uncouples the migration to nodejs 4.x from the migration to service-runner and allows it to proceed
    • Cons:
      • Repeats part of the work done for service-runner, which supports all of this already and much more
      • Leaves the puppet manifests for parsoid in the same state, without a much-needed overhaul. The overhaul has stalled anyway, awaiting the migration to service-runner (T86633).
  • Couple this with the migration of parsoid to service-runner, which to me seems to have stalled; T90668 seems to be the blocker there.
    • Pros:
      • We get all the niceties of service-runner along with support for both systemd and upstart, like better monitoring checks and so on
      • Cleans up the puppet manifests, which is a long-awaited item as mentioned above
      • Avoids repeating work we've already done in a generalized form to accommodate all the nodejs services
    • Cons:
      • Couples this with the migration to service-runner, which effectively means this task gets Stalled and blocked on T90668

FWIW, I had hoped that parsoid would be on service-runner by now (as is obvious in T86633#1514459). I am still in favor of that approach, but I am starting to doubt it will happen anytime soon (will it happen at all?). @mobrovac, @ssastry, please correct me if I am wrong.

@Arlolra had been evaluating the migration to service-runner, and this seems like a very good reason to consider it. At this time, most of the significant blockers seem to have been addressed as far as I know. We'll chat about this in today's team meeting.

In practice supporting both upstart and systemd can be done by using the base::service_unit Puppet define, and AFAIK that is the only thing which would need to change in the current state as far as the migration goes, which is a big pro for going in this direction right now.

That said, I really think it's time to move Parsoid to service-runner, as it is the last big Node.js service not based on it.

In practice supporting both upstart and systemd can be done by using the base::service_unit Puppet define, and AFAIK that is the only thing which would need to change in the current state as far as the migration goes, which is a big pro for going in this direction right now.

I am unfamiliar with this, so, will defer to you to iron this out, but if it is as simple as that, I think we should decouple the two migrations.

That said, I really think it's time to move Parsoid to service-runner, as it is the last big Node.js service not based on it.

Right .. but Parsoid has had special requirements compared to other services, and those needed to be addressed before the migration could happen. Let us continue that discussion on the other ticket.

This pull request leads me to believe that service-runner has a hard dependency on node v4.x, which kind of puts us in a bind in the upgrade path.

@Arlolra, if you look at the dates & the discussion you'll notice that a) there is no dependency on node v4.x, and b) there are no plans to introduce such a dependency in the foreseeable future.

We are considering to drop 0.10 support in RB (see T134200), but the same is not true for service-runner at this point.

If you say so, but that's not the impression I got from reading it.

In that case, I'll write a patch for T90668 to ease the work here.

@Arlolra, if you look at the dates & the discussion you'll notice that a) there is no dependency on node v4.x, and b) there are no plans to introduce such a dependency in the foreseeable future.

@GWicke: can you clarify? Your comment on that pull request says:
"We are starting to use ES2015 features available in the current node 4 LTS release, but would still like to be backwards-compatible with node 0.10 and 0.12 if that is reasonably possible." which is why you seem to be introducing babeljs there. So, I am reading it just like @Arlolra. If we migrate to service-runner now (when parsoid cluster is still running node 0.10), won't this break? That PR hasn't yet been merged, and it looks like if we go the babeljs route, it requires more perf. opt work.

service-runner loads some global-level goodies, such as corejs ES6 shims, which in this way become available to the service's code automagically. PR #73 proposes to do the same thing for ES6 syntax for services that still want to keep the compatibility with node 0.1x and to use the new language features at the same time.

TL;DR that PR tries to address issues in services using service-runner, not the package itself.

Ah, okay. That is clearer now. Thanks.

service-runner loads some global-level goodies, such as corejs ES6 shims, which in this way become available to the service's code automagically.

That seems potentially problematic. Parsoid currently shims with core-js@1.2.6 but would now be getting core-js@2.0.3 (and, for the sake of argument, the service-runner-launched HTTP API isn't the only entry point into the codebase; we also have scripts like parse.js). So we're unwittingly getting a core-js upgrade, which we'll have to keep in lockstep with service-runner to remain consistent.
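
One low-tech way to keep an eye on that is to check which copies of core-js actually end up in the tree after an install (the layout in the comment below is just the shape I'd expect, not verified output):

  # show every copy of core-js npm resolved, and which package pulled it in
  npm ls core-js
  # expected shape: parsoid's own pin plus service-runner's transitive copy, roughly
  #   ├── core-js@1.2.6
  #   └─┬ service-runner@x.y.z
  #     └── core-js@2.0.3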

My recommendation would be to separate moving to service-runner from moving to Jessie and Node 4.x, because the former implies testing and making sure all of Parsoid's needs have been satisfied, which makes it hard to put an ETA on.

I'd say let's go ahead and move to Jessie and Node 4.x now, and then as a next step move to service-runner and clean up the mess in ops/puppet around Parsoid's role.

Given that Parsoid already supports running on service-runner, and so far the discussion in T90668 doesn't indicate any major remaining issues, my recommendation would be to make a time-boxed effort to finish the migration to service-runner. If it can be done in a week or so, then we can save some time by switching straight to service::node.

If it takes longer, then we can still do the puppet tweaks to move to 4.x without service-runner, and finish up the move to service-runner afterwards.

We would potentially save some work, and in the worst case we'd end up about even.

Arlo's patch is about ready for final review. Separately, we still need to resolve a few details, about which I've left a comment in T90668#2324484 ... how long do we anticipate it will take to address those concerns?

ssastry changed the task status from Open to Stalled. Jun 15 2016, 5:20 PM

Addendum to my earlier performance numbers: on a bunch of pages, it looks like DOM post-processing is about 2x faster on v4.3 vs v0.10 on my laptop.

ssastry changed the task status from Stalled to Open. Jul 21 2016, 4:43 PM

Okay, 2 months later, we are now ready to pick this up again. @akosiaris, @mobrovac .. is the first week of August a good time?

Looping @Joe and @Dzahn in too.

August works for me. @Joe, @akosiaris, @Dzahn ?

The plan is the following: convert 2 to 3 machines in eqiad, monitor them for a couple of days, and then continue with the rest once we are confident there are no problems. There should be no coding needed, only reimaging, as we already know Parsoid works on ruthenium, which is on Jessie and Node 4.

Added the T125003 subtask, but that might be the wrong one. Basically, we need to make sure the Beta Cluster is updated first.

1st week of August is fine for me as well.

Proposed timeline:

  • 2016-08-02: take out wtp100[12] and reimage them, potentially start them up the same day
  • 2016-08-03: have the two nodes running on Jessie and Node 4 first thing in the EU morning
  • 2016-08-04: start transitioning other nodes all through next week

@mobrovac what's the plan regarding upgrading the services in Beta Cluster? Seems unwise to ignore our only holistic testing cluster and do this in prod first, no?

Change 302300 had a related patch set uploaded (by Mobrovac):
[Beta] Parsoid: Use deployment-parsoid09

https://gerrit.wikimedia.org/r/302300

Change 302301 had a related patch set uploaded (by Mobrovac):
[Beta] Parsoid: Switch to using deployment-parsoid09

https://gerrit.wikimedia.org/r/302301

Change 302301 merged by jenkins-bot:
[Beta] Parsoid: Switch to using deployment-parsoid09

https://gerrit.wikimedia.org/r/302301

@mobrovac what's the plan regarding upgrading the services in Beta Cluster? Seems unwise to ignore our only holistic testing cluster and do this in prod first, no?

{{done}} - Parsoid in BC is on Jessie / Node 4 and that host (deployment-parsoid09) is used by RESTBase, MW and the parsoid-beta.wmflabs.org proxy.

Change 302300 merged by Andrew Bogott:
[Beta] Parsoid: Use deployment-parsoid09

https://gerrit.wikimedia.org/r/302300

@mobrovac what's the plan regarding upgrading the services in Beta Cluster? Seems unwise to ignore our only holistic testing cluster and do this in prod first, no?

{{done}} - Parsoid in BC is on Jessie / Node 4 and that host (deployment-parsoid09) is used by RESTBase, MW and the parsoid-beta.wmflabs.org proxy.

Thanks! :)

Mentioned in SAL [2016-08-02T07:55:00Z] <akosiaris> T135176 depool wtp100[12]

Change 302389 had a related patch set uploaded (by Alexandros Kosiaris):
wtp100[12] to jessie

https://gerrit.wikimedia.org/r/302389

Change 302389 merged by Alexandros Kosiaris:
wtp100[12] to jessie

https://gerrit.wikimedia.org/r/302389

Mentioned in SAL [2016-08-02T12:34:07Z] <akosiaris> T135176 repool wtp100[12] with a weight of 1 instead of 15

wtp100[12] have been reinstalled with jessie and temporarily repooled with a very low weight, which will be increased within the day assuming no problems arise.

Mentioned in SAL [2016-08-02T14:26:22Z] <akosiaris> T135176 set weight for wtp100[12] to 8

Mentioned in SAL [2016-08-02T18:21:41Z] <akosiaris> T135176 set weight for wtp100[12] to 15

Mentioned in SAL [2016-08-03T12:31:00Z] <akosiaris> T135176 depool wtp100[34567]

Mentioned in SAL [2016-08-03T16:01:02Z] <akosiaris> T135176 pool wtp100[3457] with weight=15. wtp1006 does not look so good

Mentioned in SAL [2016-08-03T18:53:33Z] <akosiaris> T135176 pool wtp100[34567] with weight=15

Change 302880 had a related patch set uploaded (by Alexandros Kosiaris):
wtp10XX: Set all installers to jessie

https://gerrit.wikimedia.org/r/302880

Change 302880 merged by Alexandros Kosiaris:
wtp10XX: Set all installers to jessie

https://gerrit.wikimedia.org/r/302880

Mentioned in SAL [2016-08-05T07:18:06Z] <akosiaris> T135176 depool wtp1013-wtp1018

Mentioned in SAL [2016-08-05T10:23:38Z] <akosiaris> T135176 pool wtp1013-wtp1018

Mentioned in SAL [2016-08-05T10:24:04Z] <akosiaris> T135176 depool wtp1019-wtp1024

Mentioned in SAL [2016-08-05T15:16:13Z] <akosiaris> T135176 pool wtp1019-wtp1024

So, Parsoid @ EQIAD is now fully migrated, and most of this is basically done. What is still left to do is parsoid @ CODFW, which is way easier since it serves no traffic, and hence almost all nodes can be done at pretty much the same time. pybal might not like it too much of course, so I'll try to not trigger the threshold.

Change 303528 had a related patch set uploaded (by Alexandros Kosiaris):
parsoid: Set all wtp20XX boxes as jessie

https://gerrit.wikimedia.org/r/303528

Mentioned in SAL [2016-08-08T10:05:28Z] <akosiaris> T135176 depool wtp2001-wtp2020

Change 303528 merged by Alexandros Kosiaris:
parsoid: Set all wtp20XX boxes as jessie

https://gerrit.wikimedia.org/r/303528

Mentioned in SAL [2016-08-09T13:34:07Z] <akosiaris> all of wtp2001-wtp2020 except wtp2002 have been pooled. T135176

Mentioned in SAL [2016-08-09T14:37:42Z] <akosiaris> T135176 pool wtp2002

akosiaris claimed this task.

Finally, all hosts have been reimaged and pooled. Resolving.