Maniphest T232007

Restbase: significant increase of outbound dropped packets
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	ayounsi
	Sep 4 2019, 4:34 PM

Description

See https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1566785022195&to=1566898273780

Which mean that for *some* reason the Linux kernel doesn't understand packets going out of a process, and drops them.

Timing matches:

13:15 mobrovac@deploy1001: Started deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953

from SAL

While in normal situation, exposing a new port would not cause this issue, @akosiaris identified the source process to be nodejs, and temporarily rolling back this change caused the issue to stop.

Deploying the matching iptables policy didn't solve the issue.

I don't know Restbase so I can't tell if it's causing any prod issue, but it's the sign of something wrong going on (bug in restbase?) and should be addressed.

Related Objects

Mentioned Here: T224553: Migrate remaining Restbase servers to Stretch
T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes

Event Timeline

ayounsi triaged this task as Medium priority.Sep 4 2019, 4:34 PM

ayounsi created this task.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 4 2019, 4:34 PM

• Pchelolo added projects: Platform Team Workboards (Clinic Duty Team), User-mobrovac, RESTBase.Sep 4 2019, 4:38 PM

Indeed after commenting out the change in /etc/restbase/config.yaml and restarting the restbase process the behavior stopped. That being said, it took something like 3minutes which is way more than expected.

The way I identified the process to be nodejs is

akosiaris@restbase1022:~$ sudo perf record -g -a -e skb:kfree_skb
# let it run 30s, Ctrl-C
akosiaris@restbase1022:~$ sudo perf script
nodejs 98995 [000] 5537747.513730: skb:kfree_skb: skbaddr=0xffff9c281c8b1000 protocol=2048 location=0xffffffffbb1042b2
	    7fffbb0fef2f kfree_skb+0x80004540206f ([kernel.kallsyms])
	    7fffbb1062b2 sk_stream_kill_queues+0x800045402052 ([kernel.kallsyms])
	    7fffbb0fef2f kfree_skb+0x80004540206f ([kernel.kallsyms])
	    7fffbb1062b2 sk_stream_kill_queues+0x800045402052 ([kernel.kallsyms])
	    7fffbb160949 inet_csk_destroy_sock+0x800045402059 ([kernel.kallsyms])
	    7fffbb16eae6 tcp_fin+0x8000454020f6 ([kernel.kallsyms])
	    7fffbb16f0d0 tcp_data_queue+0x800045402570 ([kernel.kallsyms])
	    7fffbb17048a tcp_rcv_state_process+0x80004540236a ([kernel.kallsyms])
	    7fffbb12be8d sk_filter_trim_cap+0x80004540202d ([kernel.kallsyms])
	    7fffbb17aaa7 tcp_v4_do_rcv+0x8000454020a7 ([kernel.kallsyms])
	    7fffbb17c1d9 tcp_v4_rcv+0x800045402889 ([kernel.kallsyms])
	    7fffbb154d23 ip_local_deliver_finish+0x800045402093 ([kernel.kallsyms])
	    7fffbb154feb ip_local_deliver+0x80004540206b ([kernel.kallsyms])
	    7fffbb17b902 tcp_v4_early_demux+0x8000454020f2 ([kernel.kallsyms])
	    7fffbb154c90 ip_local_deliver_finish+0x800045402000 ([kernel.kallsyms])
	    7fffbb155304 ip_rcv+0x800045402294 ([kernel.kallsyms])
	    7fffbb154890 ip_rcv_finish+0x800045402000 ([kernel.kallsyms])
	    7fffbb11343d __netif_receive_skb_core+0x80004540251d ([kernel.kallsyms])
	    7fffbadeb3bc kmem_cache_alloc+0x80004540211c ([kernel.kallsyms])
	    7fffbb191288 inet_gro_receive+0x8000454021f8 ([kernel.kallsyms])
	    7fffbb1139df netif_receive_skb_internal+0x80004540202f ([kernel.kallsyms])
	    7fffbb1147d8 napi_gro_receive+0x8000454020b8 ([kernel.kallsyms])
	    7fffc04bad38 tg3_poll_work+0x800040002a58 ([kernel.kallsyms])
	    7fffbacb754d task_tick_fair+0x80004540252d ([kernel.kallsyms])
	    7fffc04bb22a tg3_poll_msix+0x80004000203a ([kernel.kallsyms])
	    7fffbb114226 net_rx_action+0x800045402246 ([kernel.kallsyms])
	    7fffbb22045d __do_softirq+0x80004540210d ([kernel.kallsyms])
	    7fffbac82c62 irq_exit+0x8000454020c2 ([kernel.kallsyms])
	    7fffbb21f4e7 do_IRQ+0x800045402057 ([kernel.kallsyms])
	    7fffbb21d256 ret_from_intr+0x800045402000 ([kernel.kallsyms])
	          143fa4 [unknown] (/lib/x86_64-linux-gnu/libc-2.24.so)
	          681f06 v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::FindEntry+0xffffffffffc00086 (/usr/bin/nodejs)
	          688a55 v8::internal::StringTable::LookupKey+0xffffffffffc00035 (/usr/bin/nodejs)
	          690e8a v8::internal::StringTable::LookupString+0xffffffffffc0009a (/usr/bin/nodejs)
	          669438 v8::internal::LookupIterator::PropertyOrElement+0xffffffffffc004e8 (/usr/bin/nodejs)
	          7bd65f v8::internal::Runtime::GetObjectProperty+0xffffffffffc0004f (/usr/bin/nodejs)
	          7bd740 [unknown] (/usr/bin/nodejs)
	          7c0d88 v8::internal::Runtime_KeyedGetProperty+0xffffffffffc00118 (/usr/bin/nodejs)
	    1ea2482060c7 [unknown] (/tmp/perf-98995.map)
	    1ea248b84a2b [unknown] (/tmp/perf-98995.map)
	    1ea248af9326 [unknown] (/tmp/perf-98995.map)
	    1ea248af4164 [unknown] (/tmp/perf-98995.map)
	    1ea24a0585f0 [unknown] (/tmp/perf-98995.map)
	    1ea248af3860 [unknown] (/tmp/perf-98995.map)
	    1ea248af2fc3 [unknown] (/tmp/perf-98995.map)
	    1ea248af2e3c [unknown] (/tmp/perf-98995.map)
	    1ea248b6367f [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248af2cbd [unknown] (/tmp/perf-98995.map)
	    1ea248af25a0 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248aeedf9 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248aeea28 [unknown] (/tmp/perf-98995.map)
	    1ea248797ad5 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248d0ffa2 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248aee5ef [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248aee07c [unknown] (/tmp/perf-98995.map)
	    1ea24a05a181 [unknown] (/tmp/perf-98995.map)
	    1ea248aed2b4 [unknown] (/tmp/perf-98995.map)
	    1ea248aed192 [unknown] (/tmp/perf-98995.map)
	    1ea248aec924 [unknown] (/tmp/perf-98995.map)
	    1ea24a0594de [unknown] (/tmp/perf-98995.map)
	    1ea248aeadf9 [unknown] (/tmp/perf-98995.map)
	    1ea248aeabe2 [unknown] (/tmp/perf-98995.map)
	    1ea248aea5ca [unknown] (/tmp/perf-98995.map)
	    1ea248aea32b [unknown] (/tmp/perf-98995.map)
	    1ea248ae9435 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248b63ab8 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248ae9105 [unknown] (/tmp/perf-98995.map)
	    1ea249bd7685 [unknown] (/tmp/perf-98995.map)
	    1ea248d0fc6b [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea249be4860 [unknown] (/tmp/perf-98995.map)
	    1ea2487c06e8 [unknown] (/tmp/perf-98995.map)
	    1ea248ad962e [unknown] (/tmp/perf-98995.map)
	    1ea248adaa00 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea2487b8718 [unknown] (/tmp/perf-98995.map)
	    1ea248d1006b [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248ae0ba9 [unknown] (/tmp/perf-98995.map)
	    1ea248839e80 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea2487d6f82 [unknown] (/tmp/perf-98995.map)
	    1ea2487d45f0 [unknown] (/tmp/perf-98995.map)
	    1ea2487be1be [unknown] (/tmp/perf-98995.map)
	    1ea2487bd1c0 [unknown] (/tmp/perf-98995.map)
	    1ea248ad37b9 [unknown] (/tmp/perf-98995.map)
	    1ea248ad962e [unknown] (/tmp/perf-98995.map)
	    1ea248adaa00 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea2487b8718 [unknown] (/tmp/perf-98995.map)
	    1ea248d1006b [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248ae0be5 [unknown] (/tmp/perf-98995.map)
	    1ea248e8f67a [unknown] (/tmp/perf-98995.map)
	    1ea2487b92a8 [unknown] (/tmp/perf-98995.map)
	    1ea248ad387d [unknown] (/tmp/perf-98995.map)
	    1ea248ad962e [unknown] (/tmp/perf-98995.map)
	    1ea248adaa00 [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea2487b8718 [unknown] (/tmp/perf-98995.map)
	    1ea248d1006b [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea248ae0b6d [unknown] (/tmp/perf-98995.map)
	    1ea248adc6cf [unknown] (/tmp/perf-98995.map)
	    1ea24821c7d5 [unknown] (/tmp/perf-98995.map)
	    1ea24821ec1a [unknown] (/tmp/perf-98995.map)
	    1ea24821948f [unknown] (/tmp/perf-98995.map)
	          57df70 [unknown] (/usr/bin/nodejs)
	          57e2e0 v8::internal::Execution::Call+0xffffffffffc00070 (/usr/bin/nodejs)
	          2fdceb v8::Function::Call+0xffffffffff80012b (/usr/bin/nodejs)
	          2fded1 v8::Function::Call+0xffffffffff800041 (/usr/bin/nodejs)
	          960aee node::AsyncWrap::MakeCallback+0xffffffffffc0015e (/usr/bin/nodejs)
	          9b74c4 node::StreamBase::EmitData+0xffffffffffc000d4 (/usr/bin/nodejs)
	          9eb101 node::TLSWrap::OnReadSelf+0xffffffffffc00041 (/usr/bin/nodejs)
	          9ed985 node::TLSWrap::ClearOut+0xffffffffffc00115 (/usr/bin/nodejs)
	          9ee140 node::TLSWrap::DoRead+0xffffffffffc000d0 (/usr/bin/nodejs)
	          9ba7ea node::StreamWrap::OnReadCommon+0xffffffffffc0007a (/usr/bin/nodejs)
	           137cd [unknown] (/usr/lib/x86_64-linux-gnu/libuv.so.1.0.0)
	    7fa5e67d9900 [unknown] ([unknown])
[snip]

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:32:07Z] <akosiaris> depool restbase1022 T232007

Depooling the server from the restbase service also minimized the amount of discards. https://grafana.wikimedia.org/d/VHtefZcZk/t232007?orgId=1&from=1567670685741&to=1567689425861. From 08:31 to 11:48 restbase1022 was depooled. While that's way more easy to explain, it still is weird.

Ah, this could be an issue with the rate limiter and having multiple services in service-runner that use it. We need to take a deep dive into service-runner and the rate limiter here to validate this hypothesis.

This is back to much better values as of 16:50 UTC for codfw and 18:40 for eqiad. Still higher than before the main increase though.
From SAL it matches:
16:39 urandom: decommissioning Cassandra, restbase1018-a -- T224553
18:43 urandom: decommissioning Cassandra, restbase1018-b -- T224553
/cc @Eevans

And it's alerting again...
https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1568895095253&to=1568912301774

13:01 mobrovac@deploy1001: Finished deploy [restbase/deploy@7f4b7f7]: Start using RESTBase built on Stretch - T224553 (duration: 21m 38s)
12:39 mobrovac@deploy1001: Started deploy [restbase/deploy@7f4b7f7]: Start using RESTBase built on Stretch - T224553

This is very weird. The SAL entry you mention did not change anything code-wise, only rebuilt some native node modules (which are not used in normal operation).

In T232007#5478163, @mobrovac wrote:

Ah, this could be an issue with the rate limiter and having multiple services in service-runner that use it. We need to take a deep dive into service-runner and the rate limiter here to validate this hypothesis.

I looked at this. I thought that perhaps a new rate limiter is being instatiated for every service service-runner runs (it has 2 instances per process now), but that's not the case - the rate-limiting DHT is created oly once and then passed onto the individual services.

I'm at a loss here... We'll need to investigate other avenues.

daniel moved this task from Inbox to Backlog on the Platform Team Workboards (Clinic Duty Team) board.Oct 8 2019, 9:57 AM

Back to normal on 10-07
https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=21&fullscreen&edit&tab=alert&orgId=1&from=1570407765208&to=1570486140039

This matches Restbase deploys from SAL.
@mobrovac Does those deploys gives clues on what went wrong? Would be great to know in case the same event happen in the future. Then it's fine to close the task.

I don't have a definitive answer here. The deploys that coincide with the drop did streamline the start-up sequence and increase the allowed memory per worker to avoid worker deaths, so either of these would help:

the start-up sequence now does not issue a high number of queries to Cassandra on start-up
increasing the memory limit stopped worker deaths

My money would be on #2, since it is possible that the master was trying to send messages to a dying worker, hence causing dropped packets.

There were no more worker deaths after 2019-10-07 and no increase in dropped packets either. I'm calling this done.

Restbase: significant increase of outbound dropped packetsClosed, ResolvedPublicActions

Description

Related Objects

Event Timeline

Restbase: significant increase of outbound dropped packets
Closed, ResolvedPublic
Actions