Page MenuHomePhabricator

etherpad showing transient connection issues
Closed, ResolvedPublic

Description

Experiencing issues with etherpad today - seeing a lot of connect/disconnect (this morning) and experienced failure to load this afternoon, with error "upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused"

Reuven observed etherpad has died with "JavaScript heap out of memory" 521 times today

Event Timeline

From the logs:

May 28 23:29:03 etherpad1004 nodejs[1110841]: <--- Last few GCs --->
May 28 23:29:03 etherpad1004 nodejs[1110841]: [1110841:0x55e14adad140]    51077 ms: Mark-sweep 916.0 (1005.2) -> 915.2 (1005.2) MB, 199.8 / 0.0 ms  (average mu = 0.646, current mu = 0.232) allocation failure; scavenge might not succeed
May 28 23:29:03 etherpad1004 nodejs[1110841]: [1110841:0x55e14adad140]    51382 ms: Mark-sweep 923.0 (1005.2) -> 922.2 (1021.2) MB, 251.2 / 0.0 ms  (average mu = 0.477, current mu = 0.177) allocation failure; scavenge might not succeed
May 28 23:29:03 etherpad1004 nodejs[1110841]: <--- JS stacktrace --->
May 28 23:29:03 etherpad1004 nodejs[1110841]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
May 28 23:29:03 etherpad1004 nodejs[1110841]:  1: 0x7fa0ab61ab08 node::Abort() [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  2: 0x7fa0ab50e0e4  [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  3: 0x7fa0aba17e10 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  4: 0x7fa0aba181cc v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  5: 0x7fa0abbea475  [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  6: 0x7fa0abbfd859 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  7: 0x7fa0abbdb512 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  8: 0x7fa0abbdc8a5 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]:  9: 0x7fa0abbbcb8c v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 10: 0x7fa0abbb5744 v8::internal::FactoryBase<v8::internal::Factory>::NewRawTwoByteString(int, v8::internal::AllocationType) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 11: 0x7fa0abff5d7d v8::internal::IncrementalStringBuilder::Extend() [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 12: 0x7fa0abd0582b v8::internal::JsonStringifier::SerializeString(v8::internal::Handle<v8::internal::String>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 13: 0x7fa0abd09041 v8::internal::JsonStringifier::Result v8::internal::JsonStringifier::Serialize_<true>(v8::internal::Handle<v8::internal::Object>, bool, v8::internal::Handle<v8::internal::Object>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 14: 0x7fa0abd0a730 v8::internal::JsonStringifier::Result v8::internal::JsonStringifier::Serialize_<true>(v8::internal::Handle<v8::internal::Object>, bool, v8::internal::Handle<v8::internal::Object>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 15: 0x7fa0abd0adb9 v8::internal::JsonStringifier::SerializeJSReceiverSlow(v8::internal::Handle<v8::internal::JSReceiver>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 16: 0x7fa0abd07f88  [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 17: 0x7fa0abd08545 v8::internal::JsonStringifier::Stringify(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 18: 0x7fa0abd085ab v8::internal::JsonStringify(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 19: 0x7fa0aba9dc25 v8::internal::Builtin_JsonStringify(int, unsigned long*, v8::internal::Isolate*) [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 nodejs[1110841]: 20: 0x7fa0ab9218f9  [/lib/x86_64-linux-gnu/libnode.so.108]
May 28 23:29:03 etherpad1004 systemd[1]: etherpad-lite.service: Main process exited, code=killed, status=6/ABRT
May 28 23:29:03 etherpad1004 systemd[1]: etherpad-lite.service: Failed with result 'signal'.
May 28 23:29:03 etherpad1004 systemd[1]: etherpad-lite.service: Consumed 29.555s CPU time.
May 28 23:29:03 etherpad1004 systemd[1]: etherpad-lite.service: Scheduled restart job, restart counter is at 1048.
May 28 23:29:03 etherpad1004 systemd[1]: Stopped etherpad-lite.service - Etherpad-lite daemon.
May 28 23:29:03 etherpad1004 systemd[1]: etherpad-lite.service: Consumed 29.555s CPU time.
May 28 23:29:03 etherpad1004 systemd[1]: Started etherpad-lite.service - Etherpad-lite daemon.

And the source for that number was just:

rzl@etherpad1004:~$ sudo journalctl -u etherpad-lite.service --since today --grep "out of memory" | wc -l
521

That's mostly a drive-by; leaving any deeper investigation for collaboration-services.

Mentioned in SAL (#wikimedia-operations) [2026-05-29T08:46:54Z] <jelto> gnt-instance modify -B memory=4g,vcpus=1 etherpad1004.eqiad.wmnet - T427588

VM etherpad1004.eqiad.wmnet rebooted by jelto@cumin1003 with reason: memory increased

Mentioned in SAL (#wikimedia-operations) [2026-05-29T08:59:38Z] <jelto> gnt-instance modify -B memory=4g,vcpus=1 etherpad2002.codfw.wmnet - T427588

VM etherpad2002.codfw.wmnet rebooted by jelto@cumin1003 with reason: memory increased

Jelto triaged this task as Medium priority.Fri, May 29, 9:07 AM

Thank you @RLazarus for the initial research and thank you @BLiviero-WMF for reporting the issue. Yes the etherpad host is quite small in terms of memory and CPU.

I resized both VMS (the active one and the replica) from 2GB to 4GB of memory. I'll check the number of out of memory errors tomorrow.

The out of memory errors on etherpad1004 were significantly decreased with 4GB of memory:

jelto@etherpad1004:~$ sudo journalctl -u etherpad-lite.service --since "2026-05-29 09:00:00" --grep "out of memory" | wc -l
29

Compared to over 500 errors for a single day. I'll resolve the task. Long term a new etherpad version and/or infrastructure will improve the situation even further.