Page MenuHomePhabricator

parsoid-vd on scandium randomly died
Closed, ResolvedPublic

Description

Today parsoid-vd service started alerting from scandium. There is a traceback in the log:

Apr  2 20:26:22 scandium nodejs[100833]: events.js:167
Apr  2 20:26:22 scandium nodejs[100833]:       throw er; // Unhandled 'error' event
Apr  2 20:26:22 scandium nodejs[100833]:       ^
Apr  2 20:26:22 scandium nodejs[100833]: Error: Connection lost: The server closed the connection.
Apr  2 20:26:22 scandium nodejs[100833]:     at Protocol.end (/srv/testreduce/node_modules/mysql/lib/protocol/Protocol.js:109:13)
Apr  2 20:26:22 scandium nodejs[100833]:     at Socket.<anonymous> (/srv/testreduce/node_modules/mysql/lib/Connection.js:102:28)
Apr  2 20:26:22 scandium nodejs[100833]:     at Socket.emit (events.js:187:15)
Apr  2 20:26:22 scandium nodejs[100833]:     at endReadableNT (_stream_readable.js:1081:12)
Apr  2 20:26:22 scandium nodejs[100833]:     at process._tickCallback (internal/process/next_tick.js:63:19)
Apr  2 20:26:22 scandium nodejs[100833]: Emitted 'error' event at:
Apr  2 20:26:22 scandium nodejs[100833]:     at Connection._handleProtocolError (/srv/testreduce/node_modules/mysql/lib/Connection.js:426:8)
Apr  2 20:26:22 scandium nodejs[100833]:     at Protocol.emit (events.js:182:13)
Apr  2 20:26:22 scandium nodejs[100833]:     at Protocol._delegateError (/srv/testreduce/node_modules/mysql/lib/protocol/Protocol.js:393:10)
Apr  2 20:26:22 scandium nodejs[100833]:     at Protocol.end (/srv/testreduce/node_modules/mysql/lib/protocol/Protocol.js:113:8)
Apr  2 20:26:22 scandium nodejs[100833]:     at Socket.<anonymous> (/srv/testreduce/node_modules/mysql/lib/Connection.js:102:28)
Apr  2 20:26:22 scandium nodejs[100833]:     [... lines matching original stack trace ...]
Apr  2 20:26:22 scandium nodejs[100833]:     at process._tickCallback (internal/process/next_tick.js:63:19)
Apr  2 20:26:22 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr  2 20:26:22 scandium systemd[1]: parsoid-vd.service: Unit entered failed state.
Apr  2 20:26:22 scandium systemd[1]: parsoid-vd.service: Failed with result 'exit-code'.

I restarted the service to clear the alert. I was told that this may be of interest from the perspective of similar errors in production.

Event Timeline

Could you please associate at least one active project with this task (via the Add Action...Change Project Tags dropdown)? This will allow others to get notified and see this task when looking at the corresponding project workboard. Thanks!

crusnov triaged this task as Medium priority.Apr 4 2019, 4:23 AM
crusnov added a project: SRE.

Earlier i had re-enabled monitoring and started the service. I think it's known to @ssastry that it works for a while and then failed before.

08:39 <+icinga-wm> PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.

See logs below from scandium. Seems to reliably shut down after 1 hour of no activity. Must be some config setting in some nodejs library.

$ sudo journalctl -n 1000 -u parsoid-vd | egrep "8011|FAIL"
....
Apr 05 00:56:13 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 00:56:25 scandium nodejs[188711]: RT test server listening on: 8011
Apr 05 01:56:25 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 02:25:57 scandium nodejs[206954]: RT test server listening on: 8011
Apr 05 03:25:57 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 03:26:06 scandium nodejs[219146]: RT test server listening on: 8011
Apr 05 04:26:06 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 04:26:24 scandium nodejs[231176]: RT test server listening on: 8011
Apr 05 05:26:24 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 05:56:19 scandium nodejs[248641]: RT test server listening on: 8011
Apr 05 06:56:19 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 07:25:59 scandium nodejs[5197]: RT test server listening on: 8011
Apr 05 08:25:59 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 08:26:11 scandium nodejs[17336]: RT test server listening on: 8011
Apr 05 09:26:11 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 09:56:19 scandium nodejs[35225]: RT test server listening on: 8011
Apr 05 10:56:19 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 11:25:49 scandium nodejs[53294]: RT test server listening on: 8011
Apr 05 12:25:50 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE
Apr 05 12:26:23 scandium nodejs[65416]: RT test server listening on: 8011
Apr 05 13:26:23 scandium systemd[1]: parsoid-vd.service: Main process exited, code=exited, status=1/FAILURE

@Arlolra @cscott Any insights here? Why would express terminate after an hour of being idle? (testreduce codebase for parsoid-vd service on scandium).

This is not critical. I asked @Dzahn to turn off these alerts for now. Meanwhile, I started a test run to confirm that the service doesn't terminate when there is activity against the service.

Change 501586 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/parsoid: disable systemd monitoring on scandium test host

https://gerrit.wikimedia.org/r/501586

Change 501586 merged by Dzahn:
[operations/puppet@production] icinga/parsoid: disable systemd monitoring on scandium test host

https://gerrit.wikimedia.org/r/501586

systemd monitoring has been removed from icinga entirely for scandium. that's what was flapping due to this. all other base monitoring checks are still here and enabled (again).

did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks

root@scandium:~# systemctl status parsoid-vd
● parsoid-vd.service - parsoid-vd: Testreduce HTTP service for visual-diff results
   Loaded: loaded (/lib/systemd/system/parsoid-vd.service; static; vendor preset: enabled)
   Active: active (running) since Mon 2019-04-15 22:07:01 UTC; 3 weeks 1 days ago
 Main PID: 57596 (nodejs)
    Tasks: 10 (limit: 39321)
   CGroup: /system.slice/parsoid-vd.service
           └─57596 /usr/bin/nodejs server.js --config /etc/testreduce/parsoid-vd.settings.js

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@scandium:~#  sudo journalctl -n 1000 -u parsoid-vd | egrep "8011|FAIL"
root@scandium:~#
Dzahn claimed this task.

did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks

Thanks! Based on this and also on:

This is not critical. I asked @Dzahn to turn off these alerts for now. Meanwhile, I started a test run to confirm that the service
doesn't terminate when there is activity against the service.

I would say we can close this. @ssastry if you disagree we can of course reopen it right away.

did this issue get fixed? i just checked scandium and parsoid-vd has been running for 3 weeks

Thanks! Based on this and also on:

This is not critical. I asked @Dzahn to turn off these alerts for now. Meanwhile, I started a test run to confirm that the service
doesn't terminate when there is activity against the service.

I would say we can close this. @ssastry if you disagree we can of course reopen it right away.

I didn't do anything to fix it ... but, if it was something transient and it got fixed .. then that works for me.