
UDP traffic throughput to instances in the "meet" Cloud VPS project not meeting expectations
Open, Low, Public

Description

Part of Wikimedia Meet. Lots of Wikimedians use it to hold meetings, but they have reported several times that it fails really badly when the number of participants exceeds roughly 15 people. A notable case was a Wikimedia Clinic run by @Asaf. From what I can find online, it seems Jitsi can easily handle large crowds (at least up to a much higher number). Looking at the graphs, the incidents happen when the flow is around 3 MB/s, and I think (I might be wrong, sorry if this is a false alarm) that the cloud proxy or the WAN transport can't handle such a flow. The public IP associated with the VM is 185.15.56.72 and the port is 10000/UDP. Is there a way to improve this situation?

Event Timeline

@Ladsgroup what is your supported traffic goal? Scaling this project was a concern that we all discussed in T249159: Request creation of "meet" VPS project. I'm not saying that there are no potential gains/changes to be had in the Cloud VPS networking layer or possibly even in the instances you are using, but there are going to be hard limits that are reached at some point.

I'm also not quite sure what "cloud proxy" you are referencing as a potential bottleneck. Your project is using its own public IP address and not using the shared HTTPS proxy layer, in my understanding.

aborrero added a subscriber: aborrero.

I would be interested in knowing more about this. Will try to allocate some time to take a look "soon".

@Ladsgroup what is your supported traffic goal? Scaling this project was a concern that we all discussed in T249159: Request creation of "meet" VPS project. I'm not saying that there are no potential gains/changes to be had in the Cloud VPS networking layer or possibly even in the instances you are using, but there are going to be hard limits that are reached at some point.

Indeed, my goal is meeting the community's demands. Currently it works to some degree, but it can definitely be improved to accommodate our users' needs. There are also two aspects to this:

  • Horizontal scaling: Currently we are on only one videobridge, meaning two meetings at the same time are going to cause issues for each other (if the number of people goes up).
  • Vertical scaling: We can't have large meetings.

We don't need to work on horizontal scaling at the moment, since usually only one meeting happens per day (https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?viewPanel=57&orgId=1&var-project=meet&var-server=All&from=now-7d&to=now), so the chance of overlap is low. But we need to work on vertical scaling a bit so we can have meetings with large crowds. Once we get there, the next level of scaling (simultaneous meetings) can be handled by simply adding more videobridge VMs. After that we should slowly work on productionizing this (maybe?), and then we can use LDAP for it.

I'm also not quite sure what "cloud proxy" you are referencing as a potential bottleneck. Your project is using its own public IP address and not using the shared HTTPS proxy layer, in my understanding.

Yeah we are not using the HTTPS proxy. I thought there might be some proxies stripping IPs and such. If not, then it's fine. I'm sorry I don't know much about OpenStack and how it handles traffic.

I would be interested in knowing more about this. Will try to allocate some time to take a look "soon".

Thank you so much, greatly appreciated.

@Ladsgroup what is your supported traffic goal? Scaling this project was a concern that we all discussed in T249159: Request creation of "meet" VPS project. I'm not saying that there are no potential gains/changes to be had in the Cloud VPS networking layer or possibly even in the instances you are using, but there are going to be hard limits that are reached at some point.

I'm also not quite sure what "cloud proxy" you are referencing as a potential bottleneck. Your project is using its own public IP address and not using the shared HTTPS proxy layer, in my understanding.

Thanks everyone for looking into this. I'm actually surprised that at 3 MB/second we were even supporting up to 20 people, as I thought we would need 5-10 MB/sec.

In terms of reasonable use cases, it would be nice to support on the order of 50-100 people, but even 50 would be a great improvement for our daily needs and for the user groups out there, which don't usually exceed 50 people per meeting. I don't know exactly how the scalability would work, but perhaps raising the bandwidth to support up to 10 MB/second would be a good next step? There would not be an avalanche of users, and we could test this on a small scale first. We have an event on December 11-13 where this would come in handy, in case there's a chance it's easy to raise the bandwidth quickly; otherwise we can use another platform.
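For a rough sense of scale, here is a hedged back-of-envelope estimate (assuming bandwidth grows roughly linearly with participants, which Jitsi's simulcast makes only an approximation):

3 MB/s at ~20 participants  →  ~0.15 MB/s (~1.2 Mbit/s) per participant
50 participants  ≈ 50 × 0.15 MB/s ≈ 7.5 MB/s
100 participants ≈ 15 MB/s

So the ~10 MB/s figure above is roughly the right order of magnitude for a 50-person meeting.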

aborrero moved this task from Needs discussion to Watching on the cloud-services-team (Kanban) board.

I just checked your VMs. They are running on hypervisors that are currently running ~50 other virtual machines.
This means the network interface card is being shared by 50 VMs, each of which could potentially want to use the full throughput. The same goes for the CPU.

The neutron virtual router currently serves between ~1 Gbps and ~2 Gbps of network traffic while remaining basically CPU-idle, so I would rule out the edge network as the problem here.

However, there are many other potential spots for throughput contention / bottlenecks. Off the top of my head, I can think of, for example:

  • the software itself (jitsi meet?)
  • the docker layer configured in the VM. I see many iptables rules created by docker in jitsi.meet.eqiad1.wikimedia.cloud
  • the network configuration in the VM. For example, I can see some tc stuff configured in jitsi.meet.eqiad1.wikimedia.cloud (see the inspection commands sketched after this list)
  • CPU scheduling / kernel tuning in the VM.
  • the virtual machine network driver, virtio_net
  • the KVM layer, and all the software-defined-network components created by openstack neutron and openstack nova
  • the physical CPU scheduling / kernel tuning in the hypervisor
  • the hypervisor network driver
  • etc
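For the Docker and tc items above, a hedged starting point for inspection on the VM could be something like the following (the interface name eth0 is an assumption):

# list the docker-generated firewall rules together with their packet/byte counters
sudo iptables -L -n -v
sudo iptables -t nat -L -n -v

# show any traffic-control queueing disciplines on the interface, with statistics to spot drops/overlimits
tc qdisc show dev eth0
tc -s qdisc show dev eth0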

Unfortunately we cannot spend much more time investigating all these issues. I would appreciate it if you researched this to the extent you can and pointed us to a specific thing you would like to see fixed.

Let me mention that we have VMs receiving 200 Mbps of traffic with no special tuning whatsoever, for example the Toolforge front proxy (running nginx).

bd808 renamed this task from "Cloud network not being able to handle large UDP traffic" to "UDP traffic throughput to instances in the "meet" Cloud VPS project not meeting expectations". Dec 5 2020, 12:17 AM

Thank you for the thorough investigation. Given that there are VMs that handle much more traffic, it seems the bottleneck is not in the network (and it's really hard to find out what the bottleneck is, unless we put 16 people in a meeting and go check the VM).

It happened again: https://meta.wikimedia.org/wiki/Talk:Wikimedia_Meet#Very_poor_server_and_videconference_quality

I have been thinking of de-dockerizing for a while now (T262747: De-dockerize jitsi) and gave it a try this weekend (it failed spectacularly, though). One thing from the manual stood out to me:

Systemd/Limits: Default deployments on systems using systemd will have low default values for maximum processes and open files. If the used bridge will expect higher number of participants the default values need to be adjusted (the default values are good for less than 100 participants).

To update the values edit /etc/systemd/system.conf and make sure you have the following values if values are smaller, if not do not update.

DefaultLimitNOFILE=65000
DefaultLimitNPROC=65000
DefaultTasksMax=65000

Could this be the reason? Even though this is not mentioned in the Docker guide, I assume it would apply there too (otherwise it means I have no idea how computers work, which should not come as a surprise to anyone). I applied these changes and reloaded the daemon config; let's see how it unfolds.
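For what it's worth, a hedged sketch of how one might verify whether those defaults actually apply (note that /etc/systemd/system.conf is read by the systemd manager itself, so a daemon-reexec or a reboot may be needed rather than a plain daemon-reload, and containerized processes normally get their limits from the Docker daemon rather than from these defaults):

systemctl daemon-reexec                        # re-read /etc/systemd/system.conf
systemctl show --property=DefaultLimitNOFILE   # what the manager now advertises as the default
cat /proc/$(pidof -s java)/limits              # what the running videobridge JVM actually got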

Great.

I'm sorry I have little idea how to tune this particular software/setup for better performance.

I'm un-claiming this task, given there is nothing actionable for me at the moment, but will keep an eye here for future updates.

I rebuilt Jitsi on a new VM, dockerized it, and people still have trouble. Could it be due to the size of the VM and the number of cores / amount of memory? The load on the CPU doesn't look crazy, but maybe it needs more parallel cores? I also really should try to de-dockerize Jitsi, but unfortunately it's really complicated (and doubly so on Buster, given that it doesn't have OpenJDK 8).
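One hedged way to check whether a single core is the limit during a busy meeting (this assumes the sysstat package, which may need to be installed first):

mpstat -P ALL 1                      # per-core utilization; one pegged core with the rest idle suggests a single-threaded bottleneck
pidstat -t -p "$(pidof -s java)" 1   # per-thread CPU usage of the videobridge JVM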

Yes, increasing the specs of the setup is worth trying.

Another test you can run is with iperf (a sketch of the commands is included after this list):

  • install iperf on the server
  • stop jitsi
  • run iperf in server mode on the UDP port that jitsi would use
  • run iperf in client mode from several places: your laptop, another VM in the project, etc.
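A hedged sketch of those commands (note that iperf's UDP mode defaults to a 1 Mbit/s target, so an explicit -b is needed to actually load the link):

# on the jitsi VM, with jitsi stopped so that port 10000/udp is free
iperf -s -u -p 10000

# from the laptop or another VM; -b sets the UDP send rate, -t the test duration in seconds
iperf -u -c 185.15.56.72 -p 10000 -b 100M -t 30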

Over the weekend I tested it with some network perf tools, but nothing out of the ordinary jumped out :/

This happened with the research showcase again. I built it on a new VM and a new docker image (with more up-to-date software). Can I have a bigger quota (just a tiny bit, like 10 more vCPUs and 10 GB of RAM) so I can put it on a bigram? That would help with the issues.

If that doesn't help, we can shrink it later (and I'm fine with that)

This happened with the research showcase again. I built it on a new VM and a new docker image (with more up-to-date software). Can I have a bigger quota (just a tiny bit, like 10 more vCPUs and 10 GB of RAM) so I can put it on a bigram? That would help with the issues.

Please create a quota change request using the instructions and template at Cloud-VPS (Quota-requests).

"A tiny bit" is relative, the standard Cloud VPS project quota is 8 vcpus and 16G ram so your increase is functionally "+1 Cloud VPS project of additional compute power". ;)

Over the weekend I tested it with some network perf tools, but nothing out of the ordinary jumped out :/

This happened with the research showcase again. I built it on a new VM and a new docker image (with more up-to-date software). Can I have a bigger quota (just a tiny bit, like 10 more vCPUs and 10 GB of RAM) so I can put it on a bigram? That would help with the issues.

Could you please share the results of your iperf tests?

From another VM:

------------------------------------------------------------
Client connecting to 185.15.56.72, UDP port 10000
Sending 1470 byte datagrams, IPG target: 11215.21 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 172.16.0.141 port 32888 connected with 185.15.56.72 port 10000
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec
[  3] Sent 892 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.040 ms    0/  892 (0%)

Server-side:

^Cladsgroup@jitsi03:/srv/jitsi$ iperf -s -u -p 10000
------------------------------------------------------------
Server listening on UDP port 10000
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 172.16.2.46 port 10000 connected with 185.15.56.244 port 51468
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.040 ms    0/  892 (0%)

From my laptop:

amsa@amsa-Latitude-7480:~$ iperf -u -c 185.15.56.72 -p 10000 
------------------------------------------------------------
Client connecting to 185.15.56.72, UDP port 10000
Sending 1470 byte datagrams, IPG target: 11215.21 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.127 port 38772 connected with 185.15.56.72 port 10000
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec
[  3] Sent 892 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.000 ms 2147481864/2147482756 (1e+02%)

Server-side:

ladsgroup@jitsi03:/srv/jitsi$ iperf -s -u -p 10000
------------------------------------------------------------
Server listening on UDP port 10000
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 172.16.2.46 port 10000 connected with <redacted> port <redacted>
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-10.0 sec  1.25 MBytes  1.05 Mbits/sec   0.083 ms 2147481864/2147482756 (1e+02%)

I rebuilt it with a bigram. Let's see how that goes.

And it still has issues with 12 people, even with audio only. The weird bit is that when I first built the VM it worked fine; then, after one large meeting, it just imploded. So I added a regular restart of Docker and its services, and it has no effect. Maybe I should add a restart of the network manager? Can I have a regular restart of the VM? I'm not sure de-dockerizing it would help.

Those iperf numbers look slow, but in UDP mode iperf usually reports something really low (it defaults to a 1 Mbit/s target unless -b is passed, which matches the 1.05 Mbits/sec above). I get 5.94 Gbits/sec in TCP mode on the same VM (in another project, and a much smaller VM) where I get exactly your UDP numbers. So I agree your numbers look fine.

I popped in and checked the ulimit thing, and the processes that are running java have vastly higher limits right now than the values in T268393#6686651. To check that, I used cat /proc/9615/limits, for instance, after getting a PID from ps, which you can still do with containers.

Also, to check the container that I noticed using that UDP port, I ran docker inspect on it; the PID was 9052:

root@jitsi04:/proc/9052# cat limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       64037                64037                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

In case you want to have a look: you can change ulimits for docker runs with CLI args (and in compose files and such too, I think). Ultimately, I don't know what to test on this thing. Just thought that might help if you are considering de-dockerizing.
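For reference, a hedged sketch of both forms (the jvb service name follows the compose naming seen elsewhere in this task, and the numbers are just the ones quoted from the manual above):

# docker run: per-container override (other options omitted)
docker run --ulimit nofile=65000:65000 --ulimit nproc=65000 <image>

# docker-compose.yml: the equivalent per-service setting
services:
  jvb:
    ulimits:
      nproc: 65000
      nofile:
        soft: 65000
        hard: 65000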

Thanks for the tip. I looked at the logs and searched for them on the internet, and the issue does seem to come from this. Looking at the config, NAT_HARVESTER_LOCAL_ADDRESS was set to 172.18.0.5, which is not the IP of the VM. I changed it to 172.16.3.251 and it worked fine until it got overridden with 172.19.0.4 (reverse DNS lookup says these IPs are not in use; might they be IPs inside the docker network?).

Yup. It's the videobridge container inside the docker network:

root@f312a7cc2278:/# nslookup 172.19.0.4
4.0.19.172.in-addr.arpa	name = docker-jitsi-meet_jvb_1.docker-jitsi-meet_meet.jitsi.
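For reference, on a non-dockerized videobridge those mappings are traditionally pinned via the standard ice4j properties in the bridge's sip-communicator.properties; the exact file path and how docker-jitsi-meet templates these values are assumptions here, but the idea would be something like:

# /etc/jitsi/videobridge/sip-communicator.properties (path assumed for a package install)
org.ice4j.ice.harvest.NAT_HARVESTER_LOCAL_ADDRESS=172.16.3.251    # the VM's private address
org.ice4j.ice.harvest.NAT_HARVESTER_PUBLIC_ADDRESS=185.15.56.72   # the project's public/floating IP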

So some other issues have been resolved, including the ICE checks, but there's one big issue that persists, and I think it might be related to the cloud's network infra. Jitsi works just fine most of the time, but once every ten minutes or so, several people just get cut off immediately for a minute or two. The logs suddenly storm with "packet discarded" for two minutes until a whole new connection is made:

web_1      | 145.14.245.130 - - [27/Feb/2021:21:27:30 +0000] "POST /http-bind?room=fgyfh HTTP/2.0" 200 324 "-" "okhttp/3.12.1"
jvb_1      | Feb 27, 2021 9:27:36 PM org.jitsi.utils.logging2.LoggerImpl log
jvb_1      | INFO: Discarded 1 packets. Last remote address:/[reducted]
jvb_1      | Feb 27, 2021 9:27:37 PM org.jitsi.utils.logging2.LoggerImpl log
jvb_1      | INFO: Discarded 101 packets. Last remote address:/[reducted]
jvb_1      | Feb 27, 2021 9:27:37 PM org.jitsi.utils.logging2.LoggerImpl log
jvb_1      | INFO: Discarded 201 packets. Last remote address:/[reducted]
jvb_1      | Feb 27, 2021 9:27:38 PM org.jitsi.utils.logging2.LoggerImpl log
jvb_1      | INFO: Discarded 301 packets. Last remote address:/[reducted]
jvb_1      | Feb 27, 2021 9:27:38 PM org.jitsi.utils.logging2.LoggerImpl log
jvb_1      | INFO: Discarded 401 packets. Last remote address:/[reducted]
jvb_1      | Feb 27, 2021 9:27:38 PM org.jitsi.utils.logging2.LoggerImpl log

I put the full error in P14517 (I redacted as much as possible; let me know if you want the original data). There's no error or warning beforehand or afterwards, it's just health checks (all okay), but suddenly it starts dropping packets.

Do we have some sort of rate limit in the cloud infra that might have affected this? Or is it behind Cloudflare and it's being aggressive?

[..]
I put the full error in P14517 (I redacted as much as possible; let me know if you want the original data). There's no error or warning beforehand or afterwards, it's just health checks (all okay), but suddenly it starts dropping packets.

Do we have some sort of rate limit in the cloud infra that might have affected this? Or is it behind Cloudflare and it's being aggressive?

No, we don't have such a rate limit today, nor is the cloud network in any way associated with Cloudflare.

As a next step, I would try to understand why this is happening:
jvb_1 | INFO: Discarded 401 packets. Last remote address:/[reducted]

What is discarding packets and for what reason?
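One hedged way to narrow that down could be to capture on the UDP port during one of the two-minute windows, both on the VM's interface and inside the videobridge container's network namespace, to see where the packets stop (the container name follows the compose naming seen above; the interface name is an assumption):

# on the VM, during a drop window
sudo tcpdump -ni eth0 udp port 10000 -c 200

# the same capture from inside the jvb container's network namespace, to check whether traffic crosses the docker bridge
sudo nsenter -t "$(docker inspect --format '{{.State.Pid}}' docker-jitsi-meet_jvb_1)" -n tcpdump -ni any udp port 10000 -c 200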

No, we don't have such a rate limit today, nor is the cloud network in any way associated with Cloudflare.

I believe that transit into eqiad is, at least at times, passed through Cloudflare's "Magic Transit" DDoS filtering: https://wikitech.wikimedia.org/wiki/Cloudflare

As a next step, I would try to understand why this is happening:
jvb_1 | INFO: Discarded 401 packets. Last remote address:/[reducted]

What is discarding packets and for what reason?

It doesn't say why the packets get discarded.

From asking SRE, it doesn't seem there is an easy way to exclude this IP from the CF DDoS filtering. If that's the case, we might have to shut down Wikimedia Meet :( I just need to make 100% sure the CF filter is the reason behind this issue.

Here's the netflow to the IP: https://w.wiki/33kA

It works fine and then suddenly drops for two minutes, goes back to normal, then drops again after a bit. It really looks like it's CF, but there's nothing in the CF dashboard for Wikimedia Meet's IP.

As far as I can tell we're not advertising 185.15.56.0/22 via CF, so it should be a red herring.

So I de-dockerized Jitsi; let's see how it goes now.

It seems it's not fixed yet, so the issue is not Docker either :( I was thinking it could be a too-small MTU somewhere. TCP has path MTU discovery, but UDP can't be bothered with it.

But having de-dockerized Jitsi, I honestly don't think I can do anything more about this problem, and I don't think it's on my side.
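A hedged way to check the path MTU hypothesis (assuming a standard 1500-byte Ethernet MTU; 1472 = 1500 minus 20 bytes of IPv4 header and 8 bytes of ICMP header):

# from a client outside the cloud; -M do sets "don't fragment", -s the payload size
ping -M do -s 1472 185.15.56.72   # only succeeds if the path carries full 1500-byte packets
ping -M do -s 1400 185.15.56.72   # if only the smaller size gets through, something on the path has a lower MTU

# tracepath reports the discovered path MTU hop by hop
tracepath 185.15.56.72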

I was thinking it could be a too-small MTU somewhere. TCP has path MTU discovery, but UDP can't be bothered with it.
I honestly don't think I can do anything more about this problem, and I don't think it's on my side.

Just a random idea, but I have sometimes observed that Jitsi clients may suddenly decide to reconnect because of some imaginary connection problem. Sometimes you can see in the logs that a client experiences an absurdly long latency, and apparently that's enough for everyone in the call to also have problems. I had such issues with a relatively small group of people who were meeting on a regular basis, where a single IP address had those issues. The latency as logged would often be an incredibly high number of seconds, suggesting either a satellite connection, extremely high packet loss, or a severely broken ping (which could be caused by firewalls). I never managed to actually debug the issue live and find a permanent solution for it, although at some point people stopped complaining about it, as if it had gone away.

I don't know if it's relevant here, but in the last few weeks I have had a couple of meetings with 3 coworkers lasting 4 hours without any interruption, all of us screaming into our microphones at once and sharing boring videos about weird Free software stuff. No government interruption.

Some months ago, by contrast, we were interrupted every 5-15 minutes.

This is fixed from my perspective! 🎉 I hope the same holds for the other use cases.