Following Brendan Gregg's examples: http://www.brendangregg.com/blog/2017-12-31/reinvent-netflix-ec2-tuning.html
I've added these settings for now:
vm.swappiness = 0
kernel.numa_balancing = 0
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
net.core.somaxconn = 1000
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
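Because these settings will probably be reverted, it may help to record the current kernel values before applying them. Below is a minimal sketch, assuming a Linux host where each dotted sysctl key maps to a file under /proc/sys. The helper names (parse_sysctl, proc_path, snapshot, render) are hypothetical and not part of any existing tooling.

```python
from pathlib import Path

# The settings from this task, one "key = value" per line (sysctl.conf syntax).
SETTINGS = """
vm.swappiness = 0
kernel.numa_balancing = 0
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
net.core.somaxconn = 1000
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535
"""


def parse_sysctl(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and '#'/';' comments."""
    result: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        # Normalise whitespace so multi-value keys (tcp_rmem etc.) compare cleanly.
        result[key.strip()] = " ".join(value.split())
    return result


def proc_path(key: str) -> Path:
    """Map a dotted key to its /proc/sys file, e.g. vm.swappiness -> /proc/sys/vm/swappiness."""
    return Path("/proc/sys") / key.replace(".", "/")


def snapshot(keys) -> dict[str, str]:
    """Read the current value of each key that exists on this host (read-only, no root needed)."""
    current: dict[str, str] = {}
    for key in keys:
        path = proc_path(key)
        if path.exists():
            current[key] = " ".join(path.read_text().split())
    return current


def render(settings: dict[str, str]) -> str:
    """Render settings back into sysctl.conf syntax."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    wanted = parse_sysctl(SETTINGS)
    # Save this output (e.g. as revert.conf) before applying the new values.
    print(render(snapshot(wanted)), end="")
```

The saved file could later be re-applied as root with `sysctl -p revert.conf` to restore the previous values. Writing the new values is left to `sysctl` itself rather than done in the script.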
It didn't make our WebPageReplay metrics more stable; instead, it increased the standard deviation (the blue vertical line marks when I made the change):
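To put a number on "more or less stable" rather than reading it off the graph, one option is to compare the standard deviation of a metric before and after the change time. This is a minimal sketch, assuming the series is fetched from Graphite's render API with `format=json`, which returns datapoints as `[value, timestamp]` pairs with `null` for missing values. The function name `split_stdev` and the sample numbers are hypothetical and illustrative only.

```python
from statistics import stdev


def split_stdev(datapoints, change_ts):
    """Return (stdev before change_ts, stdev from change_ts onwards).

    datapoints: [[value, timestamp], ...] as in Graphite's JSON render output;
    None values (gaps in the series) are skipped.
    """
    before = [v for v, t in datapoints if v is not None and t < change_ts]
    after = [v for v, t in datapoints if v is not None and t >= change_ts]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two datapoints on each side of the change")
    return stdev(before), stdev(after)


if __name__ == "__main__":
    # Made-up values purely to show the call shape; not real WebPageReplay data.
    points = [[100, 0], [102, 60], [None, 120], [100, 180],
              [90, 240], [110, 300], [95, 360]]
    print(split_stdev(points, change_ts=240))
```

Running this per URL would also show whether the increase is limited to some URLs.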
It's not as visible for all URLs, though. I'll revert the change later today and leave this to the experts in the future.
It would be super useful to get your eyes on this, @dpifke! First as a sanity check of how things look now, and then we could set up a new instance for you to experiment with. We could run the same tests we run for enwiki and send the metrics to a separate Graphite namespace.
Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 6, 2022 (and T295729).
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.
I tried this again, but saw no improvement. Redeploying on a new AWS server makes a bigger difference than changing these settings.