Page MenuHomePhabricator

Frequent `429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange` errors in SULWatcher
Open, Needs TriagePublic

Description

The SULWatchers will semi-frequently stop monitoring recent changes because of a 429 error from EventStreams. Manual intervention is then required to restart them. Sometimes, this will immediately fail with another 429, requiring a delay (5 minutes has worked for me) before restarting the RC reader again.

I don't know if the root cause is in SULWatcher specifically, or if it is a low IP-based ratelimit being applied to Toolforge.

2023-02-09T03:53:55Z SULWatcher MainThread INFO: starting SULWatcher
2023-02-09T03:53:55Z SULWatcher MainThread INFO: starting SULWatcher2
2023-02-09T03:53:55Z SULWatcher MainThread INFO: starting SULWatcher3
2023-02-09T03:53:55Z SULWatcher MainThread INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 48, in __init__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T03:53:59Z ib3.auth SULWatcher2 INFO: Connected to server erbium.libera.chat
2023-02-09T03:53:59Z ib3.mixins SULWatcher2 INFO: Joining #wikimedia-unifications
2023-02-09T03:54:01Z ib3.nick SULWatcher WARNING: Requested nick "SULWatcher" in use
2023-02-09T03:54:01Z ib3.nick SULWatcher3 WARNING: Requested nick "SULWatcher3" in use
2023-02-09T03:54:03Z ib3.auth SULWatcher3 INFO: Connected to server silver.libera.chat
2023-02-09T03:54:03Z ib3.mixins SULWatcher3 INFO: Joining #wikimedia-unifications
2023-02-09T03:54:03Z ib3.auth SULWatcher INFO: Connected to server silver.libera.chat
2023-02-09T03:54:03Z ib3.mixins SULWatcher INFO: Joining #wikimedia-unifications
2023-02-09T03:55:16Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T03:55:24Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T03:55:24Z SULWatcher SULWatcher INFO: AntiComposite is opped - restarting...
2023-02-09T03:55:24Z SULWatcher SULWatcher INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 48, in __init__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T04:00:51Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T04:00:51Z SULWatcher SULWatcher INFO: AntiComposite is opped - restarting...
2023-02-09T04:00:51Z SULWatcher SULWatcher INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 92, in __next__
    next_chunk = next(self.resp_iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 100, in __next__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T06:25:53Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T06:25:53Z SULWatcher SULWatcher INFO: Bsadowski1 is opped - restarting...
2023-02-09T06:25:53Z SULWatcher SULWatcher INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 92, in __next__
    next_chunk = next(self.resp_iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 100, in __next__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T13:48:51Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T13:49:07Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T13:49:07Z SULWatcher SULWatcher INFO: mahler is opped - restarting...
2023-02-09T13:49:07Z SULWatcher SULWatcher INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 92, in __next__
    next_chunk = next(self.resp_iterator)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 100, in __next__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T18:54:33Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T18:54:36Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T18:54:36Z SULWatcher SULWatcher INFO: AntiComposite is opped - restarting...
2023-02-09T18:54:36Z SULWatcher SULWatcher INFO: starting EventStream
Exception in thread EventsteamsThread:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 879, in run
    self.listener.start()
  File "/data/project/stewardbots/stewardbots/SULWatcher/SULWatcher.py", line 795, in start
    for event in EventStream(url):  # Listen to EventStream
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 48, in __init__
    self._connect()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/data/project/stewardbots/venv-k8s-py39/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange
2023-02-09T19:01:56Z SULWatcher SULWatcher INFO: do_command(self, e, 'restart rc', '#wikimedia-unifications')
2023-02-09T19:01:56Z SULWatcher SULWatcher INFO: AntiComposite is opped - restarting...
2023-02-09T19:01:56Z SULWatcher SULWatcher INFO: starting EventStream

Event Timeline

This is now also affecting StewardBot.

This happpens when an IP opens up too many simultaneous connections to the service. The client IP is passed on to the eventstreams service from the Varnish frontend http servers, and then a running count of connections per eventstreams application instance is kept. If too many connections are made to that application instance, a 429 is returned. The current limit is connections 2 per application instance, of which there are 8 in each DC. So you should be able to open a total of 16 connections from your IP to a single DC (you will be routed to one of the two automatically based on your location). If you get this error, and are opening fewer than 16 connections, I'd expect a reconnect to route eventually route you to an application instance where you have fewer than 2 connections open.

I suppose, the more connections you have open at once, the more likely this is to happen.

Based on this explanation of the rate limiting implementation I am very much wondering if EventStreams is seeing all traffic from Cloud VPS as coming from a single IP, specifically 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org). If so, EventStreams would be mostly unusable by Toolforge tools and other Cloud VPS projects with potentially hundreds of tools fighting over 16 slots.

@Ottomata any idea how we could test that hypothesis?

aborrero subscribed.

Based on this explanation of the rate limiting implementation I am very much wondering if EventStreams is seeing all traffic from Cloud VPS as coming from a single IP, specifically 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org). If so, EventStreams would be mostly unusable by Toolforge tools and other Cloud VPS projects with potentially hundreds of tools fighting over 16 slots.

I confirm this is not the case. The destination address of stream.wikimedia.org, as seen from within Toolforge, is exempt from the egress NAT on Cloud VPS. This means EventStreams gets to see the source IP address of the Toolforge kubernetes worker node.

However, there could be multiple pods within the same Toolforge k8s worker node connecting to the same endpoint, thus consuming the available slots.

This already happened, see T363296: toolforge: explore options to introduce egress network quotas.

This is probably not helpful, but EventStreams naive IP based local rate limiting is pretty dumb. If there was a smarter more global solution in WMF prod (maybe there is these days), we'd much prefer to use that.

I confirm this is not the case. The destination address of stream.wikimedia.org, as seen from within Toolforge, is exempt from the egress NAT on Cloud VPS. This means EventStreams gets to see the source IP address of the Toolforge kubernetes worker node.

However, there could be multiple pods within the same Toolforge k8s worker node connecting to the same endpoint, thus consuming the available slots.

This already happened, see T363296: toolforge: explore options to introduce egress network quotas.

If this is the case I have an idea about how to help, but it will require hiding a feature of stream.wikimedia.org. We can create a fanout proxy that subscribes to the SSE feed from stream.wikimedia.org and rebroadcasts it to anyone who subscribes to an SSE feed from the proxy. The thing this necessarily removes (modulo some potential limited local caching within the proxy) is the ability for a client to send a Last-Event-ID to catch up on events missed while offline.

Mentioned in SAL (#wikimedia-cloud) [2024-04-24T18:37:10Z] <wmbot~anticomposite@tools-bastion-13> Deploy 2f528f2..c8354da (T329327)

Or, could we just avoid rate limiting Cloud VPS / Toolforge IPs in EventStreams code? Or at least increase the limit by a lot for those IPs?

Would require a code change with nice config params to provide the IPs and limit settings.

Relevant place to look: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/eventstreams/+/refs/heads/master/routes/stream.js#342

Oh, another piece of info: WMF traffic frontends set a timeout for all connections of 15 minutes. This causes connected SSE clients to reconnect every 15 minutes. The disconnect should decrement the client IP. But, if there are enough connections from the same IP, I think the reconnect would be more likely to end up at a worker that is already at the limit for that IP.

E.g. if there are 8 workers, and the limit per IP is 2, and you have close to 16 connections from that IP, the periodic reconnect after 15 minutes is pretty likely to get blocked (and have to retry several times) before it finds the empty slot.