Visual-diff testreduce makes ruthenium unresponsive
Closed, Resolved (Public)

Description

Today, starting at 2017-01-24 18:44 UTC, the visual-diff testreduce on ruthenium.eqiad.wmnet started spawning processes until RAM and swap were full, leaving the host almost completely hung.

I had to stop both the parsoid-vd-client and parsoid-vd processes and disable Puppet to prevent it from restarting them. If they are restarted, the test starts spawning processes again and ends up hanging the host.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-01-24T20:49:18Z] <volans> disabled puppet on ruthenium to avoid the restart of parsoid-vd and parsoid-vd-client processes T156177

Change 334452 had a related patch set uploaded (by Volans):
Testreduce: allow to decide the state of the services

https://gerrit.wikimedia.org/r/334452

Change 334452 merged by Volans:
Testreduce: allow to decide the state of the services

https://gerrit.wikimedia.org/r/334452

Mentioned in SAL (#wikimedia-operations) [2017-01-27T00:12:14Z] <volans> re-enabled puppet (with a temporary fix to keep parsoid-vd and parsoid-vd-client stopped) on ruthenium T156177

@ssastry @mobrovac Puppet re-enabled with a temporary patch to allow this. Let us know once the issue is fixed so that we can revert the patch and have Puppet keep the services running.

If you need to test it on ruthenium, you can still start the parsoid-vd and parsoid-vd-client services, but they will be stopped at the next puppet run (in at most 30 minutes).

Thnx @Volans for taking care of this and keeping tabs on it :) @ssastry, please let us know once you rebuild the VD repos and redeploy them on ruthenium so that things can get back to normal.

Okay, I rebuilt the node modules and redeployed on ruthenium. I fixed some issues, but it looks like I need to fix some other code for visual diffing to function properly with the rebuilt code. At least the clients aren't respawning endlessly now .. they are just stuck at the point where phantom actually generates the screenshots. Let us leave this ticket unresolved until I am able to investigate that.

This has turned into a rabbit hole but a good one :-) .. I've started updating node modules to newer versions, migrating code to use promises, and fixing problems. Anyway, at this point, this is all unrelated to the node version update (even though that triggered it). So, I am going to resolve this ticket.

ssastry claimed this task.

@ssastry: does this mean that https://gerrit.wikimedia.org/r/#/c/334452 can be reverted and the 2 services restarted?

Ah, I forgot about that. Not yet. The service is still broken .. it won't do harm to restart the services, but I'll ping you once I fix the remaining issues.

Okay .. I have this almost working now. But the old proxy settings I used for nodejs as well as phantom to contact services are getting in the way: after fixing everything else in the code, I find that I have to disable the proxy settings there. As far as I can tell, this doesn't seem related to my code. So, did something change on ruthenium wrt proxy settings?

So, did something change on ruthenium wrt proxy settings?

No, there shouldn't have been any significant proxy changes for quite a while. Can you narrow this down time-wise? When do you start seeing problems?

This is all mixed up with the node upgrade from v4 to v6. This ticket was created when that upgrade happened .. but it is possible that the tests were failing for a while before then. Since I stopped referring to these test results earlier last year while we focused on other projects, any failures likely slipped by till the node upgrade. So, this proxy business has probably been broken for some time now, but was only caught now that I started updating the code to work with newer libraries and a newer node version.

Change 337553 had a related patch set uploaded (by Volans):
Testreduce: use address instead of IP for web proxy

https://gerrit.wikimedia.org/r/337553

Change 337553 merged by Volans:
Testreduce: use address instead of IP for web proxy

https://gerrit.wikimedia.org/r/337553

For reference, the problem was the HTTP proxy, which was hardcoded with carbon's IP in the systemd unit file. The above patch replaces the IP with the host name. (@Dzahn FYI )
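
To catch this kind of thing earlier, a check along the following lines could flag a proxy setting that points at a bare IP instead of a name. This is only a minimal sketch, not what the patch above does: the function name is hypothetical, the addresses in the usage note are documentation placeholders rather than the real proxy, and it assumes the proxy value is a full URL including the scheme.

```python
import ipaddress
from urllib.parse import urlsplit


def proxy_host_is_ip_literal(proxy_url: str) -> bool:
    """Return True if the host part of proxy_url is an IP literal.

    Hypothetical helper for illustration only. Assumes proxy_url
    includes a scheme (e.g. "http://..."); without one, urlsplit
    finds no host and the function returns False.
    """
    host = urlsplit(proxy_url).hostname
    if host is None:
        return False
    try:
        # Accepts both IPv4 and IPv6 literals; raises ValueError for names.
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True
```

For example, `proxy_host_is_ip_literal("http://192.0.2.10:8080")` reports a hardcoded IP, while `proxy_host_is_ip_literal("http://webproxy.example.org:8080")` does not.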

Change 337556 had a related patch set uploaded (by Volans):
Revert "Testreduce: allow to decide the state of the services"

https://gerrit.wikimedia.org/r/337556

Change 337557 had a related patch set uploaded (by Volans):
Testreduce: renamed environmental variable

https://gerrit.wikimedia.org/r/337557

Change 337557 merged by Volans:
Testreduce: renamed environmental variable

https://gerrit.wikimedia.org/r/337557

Change 337556 merged by Volans:
Revert "Testreduce: allow to decide the state of the services"

https://gerrit.wikimedia.org/r/337556