Move AQS to nodejs 10
Closed, Resolved | Public | 5 Estimated Story Points

Description

As the parent task explains, it would be great to move AQS to nodejs 10 if possible.

Event Timeline

elukey triaged this task as Medium priority. Nov 29 2018, 8:58 AM
elukey created this task.
Milimetric raised the priority of this task from Medium to High. Nov 29 2018, 6:30 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Sorryyyy just realized that this task has been sitting here due to me!!!

So the plan should be the following:

  • review/merge https://gerrit.wikimedia.org/r/#/c/477475/ (rebased today, going to ask the SRE team whether it is ok to merge)
  • the above will allow us to deploy nodejs10 in the cloud instances of deployment-prep (that are still running Jessie, so we need to upgrade them to Stretch first)
  • test AQS
  • deploy in production

Change 491276 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Deployment-prep: add cassandra/twcs scap repository

https://gerrit.wikimedia.org/r/491276

Created the new cluster in deployment-prep:

elukey@deployment-aqs01:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UJ  172.16.1.50   71.82 KB   256          ?                 482e2b4a-1865-4bf7-9e95-b482b5a1011b  rack1
UN  172.16.1.5    88.81 KB   256          100.0%            aa7d44ce-f19d-481f-a2fb-0bcf7ec894fe  rack1
UN  172.16.1.190  101.9 KB   256          100.0%            9f188a57-eba0-405b-a9e8-4df4c5932e28  rack1

Change 491282 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aqs: add the possibily to deploy nodejs 10

https://gerrit.wikimedia.org/r/491282

Change 491282 merged by Elukey:
[operations/puppet@production] aqs: add the possibily to deploy nodejs 10

https://gerrit.wikimedia.org/r/491282

So next steps:

  1. Add some data to Cassandra in deployment prep (deployment-aqs0[1,2,3].deployment-prep.eqiad.wmflabs)
  2. Verify that AQS works as expected
  3. Add profile::aqs::use_nodejs10: true to puppet in Horizon and apt-get install nodejs 10 (will do it, should take 5 mins)
  4. Come up with a patch to upgrade the aqs deploy code to support nodejs 10
  5. Cherry pick it on deployment-deploy01.deployment-prep.eqiad.wmflabs (you can deploy aqs from there via scap deploy -e aqs-deployment-prep)
  6. Test in deployment prep
  7. Go to prod

Marko mentioned on IRC that if we have a .travis file in the repo then he can enable Travis on the GH mirror and run nodejs10 there too.

Change 491276 merged by Elukey:
[operations/puppet@production] Deployment-prep: add cassandra/twcs scap repository

https://gerrit.wikimedia.org/r/491276

Started trying to help on this, ran into two problems:

  1. keyspaces aren't created yet on deployment-aqs01. That means aqs didn't really start successfully but systemd says it did.
  2. deployment-aqs01 is running node 6.11, should I install 10.15.1, or will you do that? Or are we going with a different version?

Started trying to help on this, ran into two problems:

  1. keyspaces aren't created yet on deployment-aqs01. That means aqs didn't really start successfully but systemd says it did.

I created the user aqs in cassandra, which was definitely missing. IIRC the keyspaces needed to be created by hand, and then some data can be added to them via https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Load_data_into_cassandra_in_beta, but I might be wrong. @JAllemandou do you remember?

  2. deployment-aqs01 is running node 6.11, should I install 10.15.1, or will you do that? Or are we going with a different version?

I didn't install 10.x since I wanted to have a stable cluster (resembling production) first, so that we have a good testing baseline (it worked before, it doesn't now, etc.). So I'd try to get the deployment-prep cluster working with 6.11 first, and then I'll upgrade. What do you think?

I think the aqs user was the missing part here. AQS should create the Cassandra keyspaces itself, so data insertion becomes possible once AQS has started successfully at least once.

So something is not working, I don't see the aqs keyspaces:

aqs@cqlsh> describe keyspaces;

system_traces  system_auth  system  system_distributed

Joseph found the answer: I forgot to add the GRANTs for the aqs user. Now I can see:

aqs@cqlsh> describe keyspaces;

"local_group_default_T_pageviews_per_project_v2"
"local_group_default_T_lgc_pagecounts_per_project"
"local_group_default_T_unique_devices"
system_auth
system
"local_group_default_T_top_pageviews"
system_distributed
system_traces
"local_group_default_T_top_bycountry"
"local_group_default_T_pageviews_per_article_flat"

@Milimetric you should be free to load data and test if everything works as expected!

@mobrovac hello! Today I had an interesting debugging session on the AQS deployment-prep hosts. I was bootstrapping the cluster with Stretch instances, cassandra configured, etc., but I made the mistake of creating the aqs user without the related GRANTs. aqs/service-runner was behaving in a weird way, namely not binding the 7323 listen port (probably because it was stuck trying to create the keyspaces in cassandra?) and not emitting anything in the logs that could have helped track the problem down (not even at trace level). Is there a chance to improve this? Maybe adding some logging that says "I am trying to create keyspaces now..". It would have helped a lot in narrowing down the problem :)

Hm, that sounds weird. Usually when you set logging to trace, you should see the queries it is trying to complete. Oh, but it couldn't because of the missing GRANTs. Hmmmm, in that case an error should have been visible in the logs. Was systemd restarting AQS or did it just hang?

Just hanging, basically not reaching the point of binding the listen port..
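For reference, the "systemd says it started but the port is not bound" symptom can be caught from outside the service with a plain TCP probe. This is only a minimal sketch in Python, not part of AQS or service-runner: the function name is made up for illustration, and probing 127.0.0.1 is an assumption (7323 is the listen port mentioned above).

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for illustration only. A refused connection or a
    timeout both count as "not listening".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError and socket timeouts.
        return False
```

Calling port_is_listening("127.0.0.1", 7323) on an AQS host would return False in the hanging state described above, even though the unit shows as active.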

Ok, @elukey I loaded fake data in the deployment cluster and verified AQS is working well. Base case done. You can do the profile::aqs::use_nodejs10: true magic. And I'll take it from there testing AQS and making any changes necessary.

@Milimetric done! I haven't done a rolling restart of aqs yet (it is still using the old nodejs interpreter), so we can test (if you are ok) a deployment via scap that should 1) deploy the new code and 2) restart aqs in one go (depooling/pooling). This would be an optimal test for the procedure that we'll use in production. Lemme know!

Restarted aqs in deployment-prep, we are using nodejs 10 in there. @Milimetric let's chat about the next steps whenever you have time, I think we are really close to deploying!

Change 495255 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/aqs@master] Bump up the referenced node version

https://gerrit.wikimedia.org/r/495255

Confirmed deployment-aqs servers are behaving normally with Node 10.4

Next step: deploy to prod. I pushed a change here to bump up the node version in package.json. We could build with that and test again if we want to be extra cautious. Let me know what you think @elukey
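A quick way to double check which interpreter a host would pick up is to parse the output of node --version, which prints a string such as v10.15.1. The sketch below is an illustration in Python with hypothetical function names, not something from the AQS repo. Note that it only describes the installed binary: as mentioned earlier in this task, a running aqs process keeps using the old interpreter until it is restarted.

```python
import re
from typing import Tuple


def parse_node_version(text: str) -> Tuple[int, int, int]:
    """Parse a string like 'v10.15.1' (as printed by `node --version`).

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised node version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def runs_node_major(text: str, major: int) -> bool:
    """Return True if the version string has the given major version."""
    return parse_node_version(text)[0] == major
```

For example, runs_node_major("v10.15.1", 10) is True, while a 6.x version string gives False.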

We could rebuild and deploy the new aqs version with the nodejs10 packages. What I'd like to avoid is the next person who deploys AQS after this also having to pick up the changes in https://gerrit.wikimedia.org/r/495255. What do you think?

Change 495255 merged by Milimetric:
[analytics/aqs@master] Bump up the referenced node version

https://gerrit.wikimedia.org/r/495255

ok, agreed, will build and deploy to staging now

Change 496110 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::aqs: use nodejs-10 for the aqs service

https://gerrit.wikimedia.org/r/496110

@Milimetric if the test went fine I'd say to proceed with production :)

What I'd do is the following:

  • Merge https://gerrit.wikimedia.org/r/496110 (puppet) and run puppet on the AQS nodes. This should enable the nodejs-10 repo but not upgrade the nodejs package
  • Install the nodejs package on aqs100* (apt-get install nodejs). That should not trigger a restart of the AQS service (namely, aqs will keep running on nodejs-6).
  • Start the deployment, targeting the aqs1004 canary. Check traffic, errors, etc. for a bit.
  • If the above is good, proceed with the rest, otherwise roll back.

What do you think? @mforns was also interested in participating, we could do it one of these EU evenings?
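The canary-first part of the plan above can be sketched as a small control loop. This is a generic illustration in Python under stated assumptions, not how scap implements canaries: the deploy, healthy and rollback callables are hypothetical placeholders for whatever actually performs the deployment, the traffic/error checks and the rollback.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    hosts: Sequence[str],
    canary: str,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to the canary first, then to the remaining hosts one by one.

    If any host is unhealthy after its deploy, roll back every host
    deployed so far (most recent first) and stop.
    Returns True if all hosts were deployed, False if a rollback happened.
    """
    order: List[str] = [canary] + [h for h in hosts if h != canary]
    deployed: List[str] = []
    for host in order:
        deploy(host)
        deployed.append(host)
        if not healthy(host):
            for done in reversed(deployed):
                rollback(done)
            return False
    return True
```

With aqs1004 as the canary, a failed health check on it means no other host is touched and only aqs1004 is rolled back.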

@elukey

we could do it one of these EU evenings?

Sure! I'll be on vacation, though, starting this Fri 15th (inclusive), and will be back on Mon 25th.

Ok, everything looks good in the deployment-aqs cluster. Ready to deploy whenever y'all want tomorrow.

Change 496110 merged by Elukey:
[operations/puppet@production] role::aqs: use nodejs-10 for the aqs service

https://gerrit.wikimedia.org/r/496110

elukey set the point value for this task to 5. Mar 14 2019, 6:07 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.