mailman3-web got stuck on lists1001, possible DoS
Closed, ResolvedPublicSecurity

Description

Screenshot_2021-05-15 Host overview - Grafana.png (attached screenshot: Grafana host overview graph)

The most recent dropoff is from when I manually restarted the mailman3-web service; it looks like it might have been killed at least twice before that?

16:45:01 <+icinga-wm> PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
16:46:31 <+icinga-wm> PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:00:39 <+icinga-wm> PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:02:35 <+icinga-wm> RECOVERY - mailman archives on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:02:39 <+icinga-wm> RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2021-06-30 09:00:48 +0000 (expires in 45 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
17:04:07 <+icinga-wm> RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 30 Jun 2021 09:00:48 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring

But even after Icinga reported a recovery, https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ wouldn't load for me until I restarted mailman3-web.

Event Timeline

The first HTTP 5xx was served at 2021-05-16T00:32:07. I looked in the logs for any requests before that which could be especially expensive, but didn't really turn up anything outside of some requests to export mboxes for a month, and those seemed rather quick. Still looking...

May 14 03:20:05 lists1001 kernel: [7820374.836036] Out of memory: Kill process 24859 (uwsgi) score 676 or sacrifice child
...
May 15 23:00:02 lists1001 kernel: [7977570.188641] Out of memory: Kill process 30920 (uwsgi) score 675 or sacrifice child
May 15 23:00:02 lists1001 kernel: [7977570.189717] Killed process 30920 (uwsgi) total-vm:9922044kB, anon-rss:4461024kB, file-rss:0kB, shmem-rss:36kB
May 15 23:00:03 lists1001 kernel: [7977571.027737] oom_reaper: reaped process 30920 (uwsgi), now anon-rss:0kB, file-rss:0kB, shmem-rss:36kB
...
May 16 00:13:02 lists1001 kernel: [7981950.173210] Out of memory: Kill process 11127 (uwsgi) score 606 or sacrifice child
May 16 00:13:02 lists1001 kernel: [7981950.174212] Killed process 11127 (uwsgi) total-vm:9003928kB, anon-rss:3318964kB, file-rss:0kB, shmem-rss:24kB

Might not be a DoS but just a memory leak?

Restarted again. Grafana shows the CPU spike to be at ~01:59. Here are the logs from that time for (accounts|hyperkitty|postorius), with requests for static assets removed:

{P15970}

The export endpoint looks suspicious.

According to the logs, 149.224.100.157 is making a lot of requests, sometimes several times per second. Shall we block?

Well, most of the /top-posters and similar requests happen automatically when you load hyperkitty; JS lazy-loads those components. Besides that I don't see any other real requests.

Since we don't have an external memcached configured, Django defaults to using an in-process memory cache, which could explain the OOMs.
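
For illustration, here's a minimal sketch of pointing Django at an external memcached instead of the implicit in-process LocMemCache. This is hypothetical, not the actual mailman3-web settings file:

# Hypothetical Django settings snippet, not the real mailman3-web config.
# With no CACHES setting, Django falls back to locmem.LocMemCache, which
# lives inside each uwsgi worker's address space and grows with it.
CACHES = {
    "default": {
        # An external memcached keeps cached fragments out of the workers' RSS.
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
        "TIMEOUT": 300,  # illustrative entry lifetime in seconds
    }
}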

I'm inclined to call it UBN

Maybe we should disable some caching we enabled before (I'm not sure, I should check)

Can we add some memory for now? https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM

6GB is really small for a web service.

Yeah, I think we can do this tomorrow.

But I'm not sure that alone will fix things. The pattern seems to be: 1) something happens, 2) an immediate spike in CPU usage to the maximum, 3) memory gradually increases in a staircase-like graph until it maxes out, causing the whole VM to freeze up. Maybe we can look at it with gdb...
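
(For comparison, a minimal sketch of looking at it from inside the process with the stdlib tracemalloc module; purely illustrative, this was not run on lists1001.)

# Illustrative only: snapshot-diffing allocations to hunt a staircase leak.
import tracemalloc

tracemalloc.start(25)            # keep 25 frames per allocation
baseline = tracemalloc.take_snapshot()

# ... let the suspect endpoints serve some traffic, then:
current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)                  # the biggest growth sites since the baseline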

In the meantime, one could restart the service once in a while, in case it's just a memory leak. We did that with Gerrit, too...

Nvm, ignore those pastes. Those were load spikes at midnight, which are a different thing.

(gdb) py-bt
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python3.7/email/_encoded_words.py", line 93, in <genexpr>
    return sum(len(_q_byte_map[x]) for x in bstring)
  <built-in method sum of module object at remote 0x7fdd0c74dc28>
  File "/usr/lib/python3.7/email/_encoded_words.py", line 93, in len_q
    return sum(len(_q_byte_map[x]) for x in bstring)
  File "/usr/lib/python3.7/email/_encoded_words.py", line 226, in encode
    qlen = _cte_encode_length['q'](bstring)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 2730, in _fold_as_ew
    ew = _ew.encode(first_part, charset=encode_as)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 2652, in _refold_parse_tree
    part.ew_combine_allowed, charset)
  File "/usr/lib/python3.7/email/_header_value_parser.py", line 144, in fold
    return _refold_parse_tree(self, policy=policy)
  File "/usr/lib/python3.7/email/headerregistry.py", line 258, in fold
    return header.fold(policy=policy)
  File "/usr/lib/python3.7/email/policy.py", line 205, in _fold
    return value.fold(policy=self)
  File "/usr/lib/python3.7/email/policy.py", line 199, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
  File "/usr/lib/python3.7/email/generator.py", line 418, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
  File "/usr/lib/python3.7/email/generator.py", line 195, in _write
    self._write_headers(msg)
  File "/usr/lib/python3.7/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.7/email/message.py", line 178, in as_bytes
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3/dist-packages/hyperkitty/views/mlist.py", line 336, in stream_mbox
    yield compressor.compress(msg.as_bytes(unixfrom=True))
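
To make the traceback concrete: the worker is re-serializing every archived message for the mbox export, and with the modern email policy each header gets refolded and RFC 2047-encoded in pure Python (the _header_value_parser/_encoded_words frames above). A small self-contained illustration of that cost, using made-up data rather than anything from the incident:

# Self-contained illustration (not incident data) of why as_bytes() on the
# export path is CPU-bound: every call re-folds and re-encodes the headers.
import time
from email.message import EmailMessage

msg = EmailMessage()  # EmailMessage uses email.policy.default
msg["Subject"] = "Zürich meetup agenda and notes " * 4  # long, non-ASCII header
msg["From"] = "someone@example.org"
msg.set_content("body text " * 200)

start = time.perf_counter()
for _ in range(10_000):
    msg.as_bytes(unixfrom=True)  # the same call hyperkitty's stream_mbox makes
print(f"10k serializations took {time.perf_counter() - start:.2f}s")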

Yep, it's the export endpoint. Apache patch coming to just disable it for now.

The export endpoint is disabled via Apache now. It looks to me like the endpoint has no limits whatsoever, sigh; will file that upstream. We might also want to examine what the server-side timeouts are for Apache and uwsgi/mailman3-web.
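
As a rough sketch of the kind of limit that could be proposed upstream (hypothetical code, not HyperKitty's actual API; the count_messages callback and the cap are invented for illustration):

# Hypothetical guard for an unbounded export view; not HyperKitty code.
from functools import wraps
from django.http import HttpResponse

MAX_EXPORT_MESSAGES = 10_000  # illustrative cap


def limit_export_size(count_messages):
    """Reject exports whose requested range would stream too many messages.

    count_messages(request, *args, **kwargs) is a caller-supplied callback
    returning how many messages the export would cover.
    """
    def decorator(view_func):
        @wraps(view_func)
        def wrapper(request, *args, **kwargs):
            if count_messages(request, *args, **kwargs) > MAX_EXPORT_MESSAGES:
                return HttpResponse("Export range too large.", status=400)
            return view_func(request, *args, **kwargs)
        return wrapper
    return decorator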

Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)".May 17 2021, 7:14 PM
Legoktm changed the edit policy from "Custom Policy" to "All Users".