mailman3-web got stuck on lists1001, possible DoS
Closed, ResolvedPublicSecurity

Description

Screenshot_2021-05-15 Host overview - Grafana.png (attached screenshot: Grafana host overview graph)

The most recent dropoff is from when I manually restarted the mailman3-web service; it looks like it might have been killed at least twice before that?

16:45:01 <+icinga-wm> PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
16:46:31 <+icinga-wm> PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:00:39 <+icinga-wm> PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:02:35 <+icinga-wm> RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:02:39 <+icinga-wm> RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2021-06-30 09:00:48 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:04:07 <+icinga-wm> RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

But even after Icinga reported a recovery, https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ wouldn't load for me until I restarted mailman3-web.

Event Timeline

The first HTTP 5xx was served at 2021-05-16T00:32:07. I looked in the logs for any requests before that which could be especially expensive, but didn't really turn up anything outside of some requests to export mboxes for a month, and those seemed rather quick. Still looking...

May 14 03:20:05 lists1001 kernel: [7820374.836036] Out of memory: Kill process 24859 (uwsgi) score 676 or sacrifice child
...
May 15 23:00:02 lists1001 kernel: [7977570.188641] Out of memory: Kill process 30920 (uwsgi) score 675 or sacrifice child
May 15 23:00:02 lists1001 kernel: [7977570.189717] Killed process 30920 (uwsgi) total-vm:9922044kB, anon-rss:4461024kB, file-rss:0kB, shmem-rss:36kB
May 15 23:00:03 lists1001 kernel: [7977571.027737] oom_reaper: reaped process 30920 (uwsgi), now anon-rss:0kB, file-rss:0kB, shmem-rss:36kB
...
May 16 00:13:02 lists1001 kernel: [7981950.173210] Out of memory: Kill process 11127 (uwsgi) score 606 or sacrifice child
May 16 00:13:02 lists1001 kernel: [7981950.174212] Killed process 11127 (uwsgi) total-vm:9003928kB, anon-rss:3318964kB, file-rss:0kB, shmem-rss:24kB

Might not be a DoS but just a memory leak?

Restarted again. Grafana shows the CPU spike to be at ~01:59. Here are the logs from that time for (accounts|hyperkitty|postorius), with requests for static assets removed:

{P15970}

The export endpoint looks suspicious.

According to the logs, 149.224.100.157 is making a lot of requests, sometimes several times per second. Shall we block?

Well, most of the /top-posters and similar requests happen automatically when you load hyperkitty; JS lazy-loads those components. Besides that I don't see any other real requests.

Since we don't have an external memcached configured, Django defaults to using an in-process memory cache, which could explain the OOMs.
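
For illustration, here's a minimal sketch of pointing Django at an external memcached instead of the implicit in-process LocMemCache. This is hypothetical, not the actual mailman3-web settings file:

# Hypothetical Django settings snippet, not the real mailman3-web config.
# With no CACHES setting, Django falls back to locmem.LocMemCache, which
# lives inside each uwsgi worker's address space and grows with it.
CACHES = {
    "default": {
        # An external memcached keeps cached fragments out of the workers' RSS.
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        "TIMEOUT": 300,  # illustrative entry lifetime in seconds
    }
}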

I'm inclined to call it UBN

Maybe we should disable some caching we enabled before (I'm not sure, I should check)

Can we add some memory for now? https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM

6GB is really small for a web service.

Yeah, I think we can do this tomorrow.

But I'm not sure that alone will fix things. The pattern seems to be: 1) something happens, 2) an immediate spike in CPU usage to the maximum, 3) memory gradually increases in a staircase-like graph until it maxes out, causing the whole VM to freeze up. Maybe we can look at it with gdb...
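
(For comparison, a minimal sketch of looking at it from inside the process with the stdlib tracemalloc module; purely illustrative, this was not run on lists1001.)

# Illustrative only: snapshot-diffing allocations to hunt a staircase leak.
import tracemalloc

tracemalloc.start(25)            # keep 25 frames per allocation
baseline = tracemalloc.take_snapshot()

# ... let the suspect endpoints serve some traffic, then:
current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)                  # the biggest growth sites since the baseline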

In the meantime, one could restart the service once in a while, in case it's just a memory leak. We did that with Gerrit, too...

Nvm, ignore those pastes. Those were load spikes at midnight, which are a different thing.

(gdb) py-bt
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python3.7/email/_encoded_words.py", line 93, in <genexpr>
    return sum(len(_q_byte_map[x]) for x in bstring)
  <built-in method sum of module object at remote 0x7fdd0c74dc28>
  File "/usr/lib/python3.7/email/_encoded_words.py", line 93, in len_q
    return sum(len(_q_byte_map[x]) for x in bstring)
  File "/usr/lib/python3.7/email/_encoded_words.py", line 226, in encode
    qlen = _cte_encode_length['q'](bstring)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 2730, in _fold_as_ew
    ew = _ew.encode(first_part, charset=encode_as)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 2652, in _refold_parse_tree
    part.ew_combine_allowed, charset)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 144, in fold
    return _refold_parse_tree(self, policy=policy)
  File "/usr/lib/python3.7/email/headerregistry.py", line 258, in fold
    return header.fold(policy=policy)
  File "/usr/lib/python3.7/email/policy.py", line 205, in _fold
    return value.fold(policy=self)
  File "/usr/lib/python3.7/email/policy.py", line 199, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
  File "/usr/lib/python3.7/email/generator.py", line 418, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
  File "/usr/lib/python3.7/email/generator.py", line 195, in _write
    self._write_headers(msg)
  File "/usr/lib/python3.7/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.7/email/message.py", line 178, in as_bytes
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3/dist-packages/hyperkitty/views/mlist.py", line 336, in stream_mbox
    yield compressor.compress(msg.as_bytes(unixfrom=True))
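
To make the traceback concrete: the worker is re-serializing every archived message for the mbox export, and with the modern email policy each header gets refolded and RFC 2047-encoded in pure Python (the _header_value_parser/_encoded_words frames above). A small self-contained illustration of that cost, using made-up data rather than anything from the incident:

# Self-contained illustration (not incident data) of why as_bytes() on the
# export path is CPU-bound: every call re-folds and re-encodes the headers.
import time
from email.message import EmailMessage

msg = EmailMessage()  # EmailMessage uses email.policy.default
msg["Subject"] = "Zürich meetup agenda and notes " * 4  # long, non-ASCII header
msg["From"] = "someone@example.org"
msg.set_content("body text " * 200)

start = time.perf_counter()
for _ in range(10_000):
    msg.as_bytes(unixfrom=True)  # the same call hyperkitty's stream_mbox makes
print(f"10k serializations took {time.perf_counter() - start:.2f}s")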

Yep, it's the export endpoint. Apache patch coming to just disable it for now.

The export endpoint is disabled via Apache now. It looks to me like the endpoint has no limits whatsoever, sigh; will file that upstream. We might also want to examine what the server-side timeouts are for Apache and uwsgi/mailman3-web.
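
As a rough sketch of the kind of limit that could be proposed upstream (hypothetical code, not HyperKitty's actual API; the count_messages callback and the cap are invented for illustration):

# Hypothetical guard for an unbounded export view; not HyperKitty code.
from functools import wraps
from django.http import HttpResponse

MAX_EXPORT_MESSAGES = 10_000  # illustrative cap


def limit_export_size(count_messages):
    """Reject exports whose requested range would stream too many messages.

    count_messages(request, *args, **kwargs) is a caller-supplied callback
    returning how many messages the export would cover.
    """
    def decorator(view_func):
        @wraps(view_func)
        def wrapper(request, *args, **kwargs):
            if count_messages(request, *args, **kwargs) > MAX_EXPORT_MESSAGES:
                return HttpResponse("Export range too large.", status=400)
            return view_func(request, *args, **kwargs)
        return wrapper
    return decorator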

Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)".May 17 2021, 7:14 PM
Legoktm changed the edit policy from "Custom Policy" to "All Users".