Page MenuHomePhabricator

Investigate segfaults on ats-tls running on cp5001
Closed, ResolvedPublic

Description

During the last weekend we had two segfaults:

vgutierrez@cp5001:/srv/trafficserver/tls/var/log$ ls /srv/trafficserver/tls/var/log/crash* -alh
-r-------- 1 root trafficserver 3.2M Sep  5 15:54 /srv/trafficserver/tls/var/log/crash-2019-09-05-155443.log
-r-------- 1 root trafficserver 3.2M Sep  8 03:24 /srv/trafficserver/tls/var/log/crash-2019-09-08-032410.log

Further investigation is required

Event Timeline

Change 535121 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable hardening for the TLS instance on cp5001

https://gerrit.wikimedia.org/r/535121

Both crashes seems to be related:

crash-2019-09-05-155443.log
Thread 12661, [ET_NET 32]:
0    0x000000000049d9f0 crash_logger_invoke(int, siginfo_t*, void*) + 0x80
1    0x00002b8df29540e0 __restore_rt + (nil)
2    0x00002b8df34a5fff gsignal + 0xcf
3    0x00002b8df34a742a abort + 0x16a
4    0x00002b8df09e2d0a ink_abort(char const*, ...) + 0x9a
5    0x00002b8df09e8a6d ats_malloc + 0x2d
6    0x00000000005aa5ba HdrHeap::coalesce_str_heaps(int) + 0x5a
7    0x00000000005aa470 HdrHeap::allocate_str(int) + 0x220
8    0x00000000005aa832 HdrHeap::duplicate_str(char const*, int) + 0xa2
9    0x00000000005b2620 mime_field_name_set(HdrHeap*, MIMEHdrImpl*, MIMEField*, short, char const*, int, bool) + 0x40
10   0x000000000055c1a8 HpackDynamicTable::add_header_field(MIMEField const*) + 0x158
11   0x000000000055c6d6 encode_literal_header_field_with_indexed_name(unsigned char*, unsigned char const*, MIMEFieldWrapper const&, unsigned int, HpackIndexingTable&, HpackField) + 0x56
12   0x000000000055deb8 hpack_encode_header_block(HpackIndexingTable&, unsigned char*, unsigned long, HTTPHdr*, int) + 0x2f8
13   0x00000000005526af http2_encode_header_blocks(HTTPHdr*, unsigned char*, unsigned int, unsigned int*, HpackIndexingTable&, int) + 0x4f
14   0x00000000005670cb Http2ConnectionState::send_headers_frame(Http2Stream*) + 0x16b
15   0x0000000000554d9d Http2Stream::update_write_request(IOBufferReader*, long, bool) + 0x75d
16   0x0000000000555cae Http2Stream::do_io_write(Continuation*, long, IOBufferReader*, bool) + 0xce
17   0x000000000053b1c4 HttpTunnel::producer_run(HttpTunnelProducer*) + 0xb14
18   0x000000000053a506 HttpTunnel::tunnel_run(HttpTunnelProducer*) + 0x46
19   0x00000000004eeabd HttpSM::handle_api_return() + 0x51d
20   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
21   0x0000000000501570 HttpSM::set_next_state() + 0xa70
22   0x00000000004e772f HttpSM::call_transact_and_set_next_state(void (*)(HttpTransact::State*)) + 0x10f
23   0x00000000004ee5e0 HttpSM::handle_api_return() + 0x40
24   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
25   0x00000000004f364f HttpSM::state_read_server_response_header(int, void*) + 0x5af
26   0x00000000004e4162 HttpSM::main_handler(int, void*) + 0x1b2
27   0x000000000068995e UnixNetVConnection::net_read_io(NetHandler*, EThread*) + 0xace
28   0x000000000067f57f NetHandler::process_ready_list() + 0x9f
29   0x000000000067fe1c NetHandler::waitForActivity(long) + 0x69c
30   0x000000000067fefa non-virtual thunk to NetHandler::waitForActivity(long) + 0xa
31   0x00000000006ae5f2 EThread::execute_regular() + 0x4f2
32   0x00000000006ad1f9 Thread::start(char const*, void*, unsigned long, std::__1::function<void ()> const&) + 0x1a9
33   0x00002b8df294a4a4 start_thread + 0xc4
34   0x00002b8df355bd0f clone + 0x3f
35   0x0000000000000000 0x0 + 0x3f
crash-2019-09-08-032410.log
Thread 31499, [ET_NET 17]:
0    0x000000000049d9f0 crash_logger_invoke(int, siginfo_t*, void*) + 0x80
1    0x00002ae32eda10e0 __restore_rt + (nil)
2    0x00002ae32f8f2fff gsignal + 0xcf
3    0x00002ae32f8f442a abort + 0x16a
4    0x00002ae32ce2fd0a ink_abort(char const*, ...) + 0x9a
5    0x00002ae32ce35a6d ats_malloc + 0x2d
6    0x00000000005aa3df HdrHeap::allocate_str(int) + 0x18f
7    0x00000000005aa832 HdrHeap::duplicate_str(char const*, int) + 0xa2
8    0x00000000005b5788 mime_field_value_set(HdrHeap*, MIMEHdrImpl*, MIMEField*, char const*, int, bool) + 0x38
9    0x000000000055c1ce HpackDynamicTable::add_header_field(MIMEField const*) + 0x17e
10   0x000000000055c8ed encode_literal_header_field_with_new_name(unsigned char*, unsigned char const*, MIMEFieldWrapper const&, HpackIndexingTable&, HpackField) + 0x4d
11   0x000000000055de90 hpack_encode_header_block(HpackIndexingTable&, unsigned char*, unsigned long, HTTPHdr*, int) + 0x2d0
12   0x00000000005526af http2_encode_header_blocks(HTTPHdr*, unsigned char*, unsigned int, unsigned int*, HpackIndexingTable&, int) + 0x4f
13   0x00000000005670cb Http2ConnectionState::send_headers_frame(Http2Stream*) + 0x16b
14   0x0000000000554d9d Http2Stream::update_write_request(IOBufferReader*, long, bool) + 0x75d
15   0x0000000000555cae Http2Stream::do_io_write(Continuation*, long, IOBufferReader*, bool) + 0xce
16   0x000000000053b1c4 HttpTunnel::producer_run(HttpTunnelProducer*) + 0xb14
17   0x000000000053a506 HttpTunnel::tunnel_run(HttpTunnelProducer*) + 0x46
18   0x00000000004eeabd HttpSM::handle_api_return() + 0x51d
19   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
20   0x0000000000501570 HttpSM::set_next_state() + 0xa70
21   0x00000000004e772f HttpSM::call_transact_and_set_next_state(void (*)(HttpTransact::State*)) + 0x10f
22   0x00000000004ee5e0 HttpSM::handle_api_return() + 0x40
23   0x00000000004edfd3 HttpSM::state_api_callout(int, void*) + 0x1233
24   0x00000000004f364f HttpSM::state_read_server_response_header(int, void*) + 0x5af
25   0x00000000004e4162 HttpSM::main_handler(int, void*) + 0x1b2
26   0x000000000068995e UnixNetVConnection::net_read_io(NetHandler*, EThread*) + 0xace
27   0x000000000067f57f NetHandler::process_ready_list() + 0x9f
28   0x000000000067fe1c NetHandler::waitForActivity(long) + 0x69c
29   0x000000000067fefa non-virtual thunk to NetHandler::waitForActivity(long) + 0xa
30   0x00000000006ae5f2 EThread::execute_regular() + 0x4f2
31   0x00000000006ad1f9 Thread::start(char const*, void*, unsigned long, std::__1::function<void ()> const&) + 0x1a9
32   0x00002ae32ed974a4 start_thread + 0xc4
33   0x00002ae32f9a8d0f clone + 0x3f
34   0x0000000000000000 0x0 + 0x3f

Change 535121 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable hardening for the TLS instance on cp5001

https://gerrit.wikimedia.org/r/535121

Mentioned in SAL (#wikimedia-operations) [2019-09-09T08:38:43Z] <vgutierrez> disabling systemd hardening for ats-tls on cp5001 - T232298

service log shows memory issues:

Sep 05 15:54:43 cp5001 traffic_server[12607]: Fatal: couldn't allocate 32768 bytes
Sep 08 03:24:10 cp5001 traffic_server[31479]: Fatal: couldn't allocate 65536 bytes

and it seems to be related to the memory leak described on https://github.com/apache/trafficserver/pull/5695

Vgutierrez triaged this task as Medium priority.Sep 9 2019, 9:08 AM
Vgutierrez moved this task from Backlog to Caching on the Traffic board.

Change 535139 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm6

https://gerrit.wikimedia.org/r/535139

Change 535139 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm6

https://gerrit.wikimedia.org/r/535139

Mentioned in SAL (#wikimedia-operations) [2019-09-09T13:39:49Z] <vgutierrez> uploaded trafficserver 8.0.5-1wm6 to apt.wikimedia.org (stretch) - T232298

Mentioned in SAL (#wikimedia-operations) [2019-09-09T13:51:00Z] <vgutierrez> upgrading ats to 8.0.5-1wm6 on cp5001 - T232298

So it looks like we are still leaking memory with ATS 8.0.5-1wm6:

(gdb) bt
#0  0x00002adfd1a16fff in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002adfd1a1842a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00002adfcef53d0a in ink_abort (message_format=<optimized out>) at ink_error.cc:99
#3  0x00002adfcef59a6d in ats_malloc (size=128408) at ink_memory.cc:60
#4  0x00000000005c6b3f in ProcessManager::signalManager (this=0xf99760, msg_id=16, data_raw=0x2adffab3e010 "\003", data_len=-777949185) at ProcessManager.cc:245
#5  0x00000000006aa68b in RecMessageSend (msg=0x0) at RecProcess.cc:325
#6  0x000000000069e4bd in send_push_message () at P_RecCore.cc:123
#7  0x00000000006aa974 in sync_cont::sync (this=0xfc5fb0) at RecProcess.cc:194
#8  0x00000000006add11 in Continuation::handleEvent (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ./I_Continuation.h:160
#9  EThread::process_event (this=0x2adfd3a46010, e=0xfd44e0, calling_code=2) at UnixEThread.cc:131
#10 0x00000000006ae520 in EThread::execute_regular (this=0x2adfd3a46010) at UnixEThread.cc:244
#11 0x00000000006ad369 in spawn_thread_internal (a=0xfd05b0) at Thread.cc:85
#12 0x00002adfd0ebb4a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00002adfd1accd0f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Mentioned in SAL (#wikimedia-operations) [2019-09-10T04:46:37Z] <vgutierrez> depool cp5001 for memory leak debugging on ATS - T232298

Mentioned in SAL (#wikimedia-operations) [2019-09-10T10:32:42Z] <vgutierrez> repool cp5001 with ats-tls collecting memory usage details every hour - T232298

Change 535995 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Shrink SSL session cache size

https://gerrit.wikimedia.org/r/535995

Change 535995 merged by Vgutierrez:
[operations/puppet@production] ATS: Shrink SSL session cache size

https://gerrit.wikimedia.org/r/535995

Mentioned in SAL (#wikimedia-operations) [2019-09-12T06:51:31Z] <vgutierrez> restarting ATS-TLS on cp4021 and cp2002 to get the new SSL session cache size - T232298

Change 536302 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.5-1wm7

https://gerrit.wikimedia.org/r/536302

Mentioned in SAL (#wikimedia-operations) [2019-09-13T10:13:44Z] <vgutierrez> disable ATS-TLS debug options on cp5001 - T232298

Change 536302 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.5-1wm7

https://gerrit.wikimedia.org/r/536302

Mentioned in SAL (#wikimedia-operations) [2019-09-17T05:56:22Z] <vgutierrez> uploaded trafficserver 8.0.5-1wm7 to apt.wikimedia.org (stretch) - T232298 T232724

Change 538580 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Reenable systemd hardening on cp5001 for ats-tls

https://gerrit.wikimedia.org/r/538580

Change 538580 merged by Vgutierrez:
[operations/puppet@production] ATS: Reenable systemd hardening on cp5001 for ats-tls

https://gerrit.wikimedia.org/r/538580

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks