
rate limited etherpad
Closed, ResolvedPublic

Description

When I'm simply typing in an etherpad on the Wikimedia install (mac, chrome), and while not being the most prolific typer at all, I get an error saying that I'm being rate limited. See attached. Other users were experiencing the same (but at higher speeds?).

I tested myself here: https://etherpad.wikimedia.org/p/B6GSr_Ip3YNBf0KFDl6G

Screen Shot 2020-10-14 at 8.02.40 AM.png (292×1 px, 57 KB)

Event Timeline

hashar triaged this task as Unbreak Now! priority.Oct 14 2020, 5:06 PM
hashar subscribed.

The rate limiting makes it impossible to write past a few keystrokes.

Might be unrelated, Firefox also gives me an error establishing the websocket connection:

Firefox can’t establish a connection to the server at wss://etherpad.wikimedia.org/socket.io/?EIO=3&transport=websocket&sid=XXXXXX

Phabricator's websockets have also recently stopped working. May be related?

akosiaris changed the task status from Open to Stalled.Oct 15 2020, 9:07 AM
akosiaris lowered the priority of this task from Unbreak Now! to Medium.
akosiaris subscribed.

Lowering priority as the service isn't broken, and setting as Stalled as we are waiting for upstream to release the new version that fixes this.

This is an upstream bug (see https://github.com/ether/etherpad-lite/issues/4340). The bug was introduced in 1.8.5 (we are at 1.8.6) and has already been fixed upstream; the fix should be in 1.8.7 (or some equivalent version) when it is released.

hashar changed the task status from Stalled to Open.Oct 15 2020, 12:26 PM

I set it to Unbreak Now! because the rate limiting makes it impossible to write in a pad past a few keystrokes. If one can't write to it, surely that defeats the whole purpose of the pad. The application is disrupted, and I find it unfair to stall this ticket until upstream cuts a new release that we can deploy.

The issue you mentioned leads to https://github.com/ether/etherpad-lite/pull/4373, which has a commit to support the XFF header ( https://github.com/ether/etherpad-lite/pull/4373/commits/e8605effea29e5e44f07b09959f6a4411fcad7e6 ). That seems to imply that the rate limiting is keyed on the address of the Varnish/ATS cache instead of the client IP, which would explain why it fails in our setup.
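To illustrate why keying the limiter on the cache address would hurt, here is a minimal Python sketch of a fixed-window rate limiter. It is not Etherpad's implementation (Etherpad is a Node.js application); the class `FixedWindowLimiter`, the helper `count_rejected`, the addresses, and the scenario of 5 users sending 4 commits each within one second are all assumptions made for illustration. Only the budget of 10 points comes from the discussion above.

```python
class FixedWindowLimiter:
    """Hypothetical limiter: at most `points` events per `duration` seconds per key."""

    def __init__(self, points, duration):
        self.points = points
        self.duration = duration
        self._windows = {}  # key -> (window_start, events_counted)

    def allow(self, key, now):
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.duration:
            # The previous window has expired: start a fresh one.
            start, count = now, 0
        allowed = count < self.points
        if allowed:
            count += 1
        self._windows[key] = (start, count)
        return allowed


def count_rejected(keys, points=10, duration=1.0):
    """Feed one event per key, all at the same instant, and count rejections."""
    limiter = FixedWindowLimiter(points, duration)
    return sum(1 for key in keys if not limiter.allow(key, now=0.0))


# Assumed scenario: 5 users, 4 commits each, all within the same second.
users = [f"198.51.100.{i}" for i in range(1, 6)]
events_by_client = [ip for ip in users for _ in range(4)]
# Behind a cache without XFF support, every event appears to come from one address.
events_by_proxy = ["10.0.0.1"] * len(events_by_client)

rejected_client = count_rejected(events_by_client)  # 0: each user stays under 10
rejected_proxy = count_rejected(events_by_proxy)    # 10: 20 events share one budget
print(rejected_client, rejected_proxy)
```

Under these assumptions nobody is limited when the key is the client address, while half of all commits are rejected when every user is counted against the same cache address. That would match the observation that the errors show up mostly when many people use pads at once.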

The issue you mentioned suggests raising commitRateLimiting.points in settings.json (from 10 to 100). But given the lack of XFF header support, I don't think that is going to help much.

I guess we can either:

  • rollback EtherPad to the previous version (1.8.4?)
  • raise commitRateLimiting.points to an insanely large value
  • try to cherry pick the patches from PR 4373

Change 635098 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] etherpad: add commitRateLimiting config, set duration 10, points 100

https://gerrit.wikimedia.org/r/635098

Change 635098 merged by Dzahn:
[operations/puppet@production] etherpad: add commitRateLimiting config, set higher values

https://gerrit.wikimedia.org/r/635098

Mentioned in SAL (#wikimedia-operations) [2020-10-19T23:02:22Z] <mutante> etherpad got restarted with new config options related to rate limiting - hopefully this fixed T265490

https://grafana.wikimedia.org/d/000000193/etherpad?viewPanel=16&orgId=1&from=now-24h&to=now

^ Uhm.. but this went up. I could not reproduce the issue manually, neither before nor after the change. It probably needs more users at once.

Change 635094 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] etherpad: add 'trustProxy' config setting and enable it

https://gerrit.wikimedia.org/r/635094

@hashar Please let me know whether you still see the issue. I am hopeful it is fixed and I can't reproduce it right now, but it also seems we only had spikes during meetings when many people used it at once, like today's quarterly review meeting.

Works for me now (but my reproduction seemed to be much less compelling than those from other reporters). Thanks!

Thanks for adding the Prometheus probe, looks like it could be helpful in the future :)

I guess raising commitRateLimiting addressed it. Once upstream releases a new version that contains ( https://github.com/ether/etherpad-lite/pull/4373 ), we can lower it back again.

Works for me now

I guess raising commitRateLimiting addressed it.

Yay, thanks! If people see it again I will try to raise the numbers more. I increased both the window and the "points" by a factor of 10, but maybe I should have done that only to the points and not the duration.

Thanks for adding the Prometheus probe, looks like it could be helpful in the future :)

Alex did or found that. I see some spikes, but the large one I was referring to above seems to have just been the restart after the config change.

Once upstream releases a new version that contains ( https://github.com/ether/etherpad-lite/pull/4373 ), we can lower it back again.

There is also still https://gerrit.wikimedia.org/r/c/operations/puppet/+/635094 and "The other effect will be that the logs will contain the real client's IP, instead of the reverse proxy's IP."
which is what Alex patched for Wikimedia before it existed upstream, afaict.
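As a rough illustration of what trusting the proxy changes, the sketch below resolves the address used for logging and rate limiting. It is written in Python purely for illustration and is not Etherpad's code; the function name `resolve_client_ip` and the sample addresses are hypothetical, and taking the left-most `X-Forwarded-For` entry is an assumption about how a trust-everything proxy setting behaves.

```python
from typing import Optional


def resolve_client_ip(remote_addr: str,
                      x_forwarded_for: Optional[str],
                      trust_proxy: bool) -> str:
    """Return the address to attribute a request to.

    remote_addr     -- peer address of the TCP connection (the reverse proxy)
    x_forwarded_for -- raw X-Forwarded-For header value, if any
    trust_proxy     -- whether the header may be believed at all
    """
    if trust_proxy and x_forwarded_for:
        # Assumption: the left-most entry is the original client.
        first = x_forwarded_for.split(",")[0].strip()
        if first:
            return first
    # Without trust (or without the header) only the proxy address is known.
    return remote_addr


# Hypothetical request that passed through a cache at 10.0.0.1.
header = "203.0.113.7, 10.0.0.2"
print(resolve_client_ip("10.0.0.1", header, trust_proxy=False))  # 10.0.0.1
print(resolve_client_ip("10.0.0.1", header, trust_proxy=True))   # 203.0.113.7
```

The header should only be trusted when every request is guaranteed to arrive through the proxy, since clients can otherwise set `X-Forwarded-For` themselves.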

Let's keep it open for a few more days I guess. Reports still welcome if it's gone or still happening for others during certain events.

Phabricator's websockets have also recently stopped working. May be related?

Probably not related but Aphlict has been fixed now and the cause was: https://gerrit.wikimedia.org/r/c/operations/puppet/+/635298/

Change 635094 abandoned by Dzahn:
[operations/puppet@production] etherpad: add 'trustProxy' config setting and enable it

Reason:

https://gerrit.wikimedia.org/r/635094

Looks good to me now. thank you!

Same here, I am not being rate limited anymore.

Dzahn claimed this task.

Thank you for confirming.

ssastry subscribed.

I ran into this repeatedly a few mins back.

Change 636458 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] etherpad: reduce rate limiting window from 10s to 1s

https://gerrit.wikimedia.org/r/636458

Change 636458 merged by Dzahn:
[operations/puppet@production] etherpad: reduce rate limiting window from 10s to 1s

https://gerrit.wikimedia.org/r/636458

I ran into this repeatedly a few mins back.

Apparently it still happens on Monday mornings but not at other times, because these are the busiest times, with many teams using it during meetings.

There are 2 values that can be changed here. The length of the window it is looking at and the number of "points" during that window. I left the points at "100" but lowered the window size from 10s to just 1s. Hopefully that did it.
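The effect of the two values can be checked with a little arithmetic. The sketch below is in Python for illustration only; the helper `describe` is hypothetical, and the original setting of 10 points per 1 s is inferred from the earlier comments (points raised "from 10 to 100", both values raised by a factor of 10) rather than read from the actual configuration.

```python
def describe(points, duration_s):
    """Summarise a window setting: burst size and sustained rate per key."""
    return {
        "burst": points,                      # events allowed inside one window
        "per_second": points / duration_s,    # average allowed events per second
    }


original = describe(points=10, duration_s=1)     # inferred, see note above
first_change = describe(points=100, duration_s=10)
second_change = describe(points=100, duration_s=1)

for name, setting in [("original", original),
                      ("first change", first_change),
                      ("second change", second_change)]:
    print(f"{name}: burst={setting['burst']}, rate={setting['per_second']}/s")

speedup = second_change["per_second"] / first_change["per_second"]
print(f"second change vs first change: {speedup}x")
```

Under these assumptions the first change left the sustained rate at 10 points per second and only allowed larger bursts, while shrinking the window to 1 s raised the sustained rate to 100 points per second.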

Has anyone had this issue today, since it was another Monday? I hope not, given that the last change was effectively a 10x increase in allowed points/sec.

I will be off for about 2 weeks. Please feel free to close, or ping others to adjust the settings further if needed, which I hope won't be the case.

Etherpad worked well before/during our team meeting today, which wasn't the case last week. Thanks for the fix!