Page MenuHomePhabricator

Restbase: significant increase of outbound dropped packets
Open, NormalPublic



Which mean that for *some* reason the Linux kernel doesn't understand packets going out of a process, and drops them.

Timing matches:

13:15 mobrovac@deploy1001: Started deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953

from SAL

While in normal situation, exposing a new port would not cause this issue, @akosiaris identified the source process to be nodejs, and temporarily rolling back this change caused the issue to stop.

Deploying the matching iptables policy didn't solve the issue.

I don't know Restbase so I can't tell if it's causing any prod issue, but it's the sign of something wrong going on (bug in restbase?) and should be addressed.

Event Timeline

ayounsi triaged this task as Normal priority.Wed, Sep 4, 4:34 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptWed, Sep 4, 4:34 PM

Indeed after commenting out the change in /etc/restbase/config.yaml and restarting the restbase process the behavior stopped. That being said, it took something like 3minutes which is way more than expected.

The way I identified the process to be nodejs is

akosiaris@restbase1022:~$ sudo perf record -g -a -e skb:kfree_skb
# let it run 30s, Ctrl-C
akosiaris@restbase1022:~$ sudo perf script
nodejs 98995 [000] 5537747.513730: skb:kfree_skb: skbaddr=0xffff9c281c8b1000 protocol=2048 location=0xffffffffbb1042b2
	    7fffbb0fef2f kfree_skb+0x80004540206f ([kernel.kallsyms])
	    7fffbb1062b2 sk_stream_kill_queues+0x800045402052 ([kernel.kallsyms])
	    7fffbb0fef2f kfree_skb+0x80004540206f ([kernel.kallsyms])
	    7fffbb1062b2 sk_stream_kill_queues+0x800045402052 ([kernel.kallsyms])
	    7fffbb160949 inet_csk_destroy_sock+0x800045402059 ([kernel.kallsyms])
	    7fffbb16eae6 tcp_fin+0x8000454020f6 ([kernel.kallsyms])
	    7fffbb16f0d0 tcp_data_queue+0x800045402570 ([kernel.kallsyms])
	    7fffbb17048a tcp_rcv_state_process+0x80004540236a ([kernel.kallsyms])
	    7fffbb12be8d sk_filter_trim_cap+0x80004540202d ([kernel.kallsyms])
	    7fffbb17aaa7 tcp_v4_do_rcv+0x8000454020a7 ([kernel.kallsyms])
	    7fffbb17c1d9 tcp_v4_rcv+0x800045402889 ([kernel.kallsyms])
	    7fffbb154d23 ip_local_deliver_finish+0x800045402093 ([kernel.kallsyms])
	    7fffbb154feb ip_local_deliver+0x80004540206b ([kernel.kallsyms])
	    7fffbb17b902 tcp_v4_early_demux+0x8000454020f2 ([kernel.kallsyms])
	    7fffbb154c90 ip_local_deliver_finish+0x800045402000 ([kernel.kallsyms])
	    7fffbb155304 ip_rcv+0x800045402294 ([kernel.kallsyms])
	    7fffbb154890 ip_rcv_finish+0x800045402000 ([kernel.kallsyms])
	    7fffbb11343d __netif_receive_skb_core+0x80004540251d ([kernel.kallsyms])
	    7fffbadeb3bc kmem_cache_alloc+0x80004540211c ([kernel.kallsyms])
	    7fffbb191288 inet_gro_receive+0x8000454021f8 ([kernel.kallsyms])
	    7fffbb1139df netif_receive_skb_internal+0x80004540202f ([kernel.kallsyms])
	    7fffbb1147d8 napi_gro_receive+0x8000454020b8 ([kernel.kallsyms])
	    7fffc04bad38 tg3_poll_work+0x800040002a58 ([kernel.kallsyms])
	    7fffbacb754d task_tick_fair+0x80004540252d ([kernel.kallsyms])
	    7fffc04bb22a tg3_poll_msix+0x80004000203a ([kernel.kallsyms])
	    7fffbb114226 net_rx_action+0x800045402246 ([kernel.kallsyms])
	    7fffbb22045d __do_softirq+0x80004540210d ([kernel.kallsyms])
	    7fffbac82c62 irq_exit+0x8000454020c2 ([kernel.kallsyms])
	    7fffbb21f4e7 do_IRQ+0x800045402057 ([kernel.kallsyms])
	    7fffbb21d256 ret_from_intr+0x800045402000 ([kernel.kallsyms])
	          143fa4 [unknown] (/lib/x86_64-linux-gnu/
	          681f06 v8::internal::HashTable<v8::internal::StringTable, v8::internal::StringTableShape, v8::internal::HashTableKey*>::FindEntry+0xffffffffffc00086 (/usr/bin/nodejs)
	          688a55 v8::internal::StringTable::LookupKey+0xffffffffffc00035 (/usr/bin/nodejs)
	          690e8a v8::internal::StringTable::LookupString+0xffffffffffc0009a (/usr/bin/nodejs)
	          669438 v8::internal::LookupIterator::PropertyOrElement+0xffffffffffc004e8 (/usr/bin/nodejs)
	          7bd65f v8::internal::Runtime::GetObjectProperty+0xffffffffffc0004f (/usr/bin/nodejs)
	          7bd740 [unknown] (/usr/bin/nodejs)
	          7c0d88 v8::internal::Runtime_KeyedGetProperty+0xffffffffffc00118 (/usr/bin/nodejs)
	    1ea2482060c7 [unknown] (/tmp/
	    1ea248b84a2b [unknown] (/tmp/
	    1ea248af9326 [unknown] (/tmp/
	    1ea248af4164 [unknown] (/tmp/
	    1ea24a0585f0 [unknown] (/tmp/
	    1ea248af3860 [unknown] (/tmp/
	    1ea248af2fc3 [unknown] (/tmp/
	    1ea248af2e3c [unknown] (/tmp/
	    1ea248b6367f [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248af2cbd [unknown] (/tmp/
	    1ea248af25a0 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248aeedf9 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248aeea28 [unknown] (/tmp/
	    1ea248797ad5 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248d0ffa2 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248aee5ef [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248aee07c [unknown] (/tmp/
	    1ea24a05a181 [unknown] (/tmp/
	    1ea248aed2b4 [unknown] (/tmp/
	    1ea248aed192 [unknown] (/tmp/
	    1ea248aec924 [unknown] (/tmp/
	    1ea24a0594de [unknown] (/tmp/
	    1ea248aeadf9 [unknown] (/tmp/
	    1ea248aeabe2 [unknown] (/tmp/
	    1ea248aea5ca [unknown] (/tmp/
	    1ea248aea32b [unknown] (/tmp/
	    1ea248ae9435 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248b63ab8 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248ae9105 [unknown] (/tmp/
	    1ea249bd7685 [unknown] (/tmp/
	    1ea248d0fc6b [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea249be4860 [unknown] (/tmp/
	    1ea2487c06e8 [unknown] (/tmp/
	    1ea248ad962e [unknown] (/tmp/
	    1ea248adaa00 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea2487b8718 [unknown] (/tmp/
	    1ea248d1006b [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248ae0ba9 [unknown] (/tmp/
	    1ea248839e80 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea2487d6f82 [unknown] (/tmp/
	    1ea2487d45f0 [unknown] (/tmp/
	    1ea2487be1be [unknown] (/tmp/
	    1ea2487bd1c0 [unknown] (/tmp/
	    1ea248ad37b9 [unknown] (/tmp/
	    1ea248ad962e [unknown] (/tmp/
	    1ea248adaa00 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea2487b8718 [unknown] (/tmp/
	    1ea248d1006b [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248ae0be5 [unknown] (/tmp/
	    1ea248e8f67a [unknown] (/tmp/
	    1ea2487b92a8 [unknown] (/tmp/
	    1ea248ad387d [unknown] (/tmp/
	    1ea248ad962e [unknown] (/tmp/
	    1ea248adaa00 [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea2487b8718 [unknown] (/tmp/
	    1ea248d1006b [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea248ae0b6d [unknown] (/tmp/
	    1ea248adc6cf [unknown] (/tmp/
	    1ea24821c7d5 [unknown] (/tmp/
	    1ea24821ec1a [unknown] (/tmp/
	    1ea24821948f [unknown] (/tmp/
	          57df70 [unknown] (/usr/bin/nodejs)
	          57e2e0 v8::internal::Execution::Call+0xffffffffffc00070 (/usr/bin/nodejs)
	          2fdceb v8::Function::Call+0xffffffffff80012b (/usr/bin/nodejs)
	          2fded1 v8::Function::Call+0xffffffffff800041 (/usr/bin/nodejs)
	          960aee node::AsyncWrap::MakeCallback+0xffffffffffc0015e (/usr/bin/nodejs)
	          9b74c4 node::StreamBase::EmitData+0xffffffffffc000d4 (/usr/bin/nodejs)
	          9eb101 node::TLSWrap::OnReadSelf+0xffffffffffc00041 (/usr/bin/nodejs)
	          9ed985 node::TLSWrap::ClearOut+0xffffffffffc00115 (/usr/bin/nodejs)
	          9ee140 node::TLSWrap::DoRead+0xffffffffffc000d0 (/usr/bin/nodejs)
	          9ba7ea node::StreamWrap::OnReadCommon+0xffffffffffc0007a (/usr/bin/nodejs)
	           137cd [unknown] (/usr/lib/x86_64-linux-gnu/
	    7fa5e67d9900 [unknown] ([unknown])

Mentioned in SAL (#wikimedia-operations) [2019-09-05T08:32:07Z] <akosiaris> depool restbase1022 T232007

Depooling the server from the restbase service also minimized the amount of discards. From 08:31 to 11:48 restbase1022 was depooled. While that's way more easy to explain, it still is weird.

Ah, this could be an issue with the rate limiter and having multiple services in service-runner that use it. We need to take a deep dive into service-runner and the rate limiter here to validate this hypothesis.

This is back to much better values as of 16:50 UTC for codfw and 18:40 for eqiad. Still higher than before the main increase though.
From SAL it matches:
16:39 urandom: decommissioning Cassandra, restbase1018-a -- T224553
18:43 urandom: decommissioning Cassandra, restbase1018-b -- T224553
/cc @Eevans