Current state: We're running 1-2 experiments at a time with no Varnish issues, but with the current limit of enrolling 0.1% of enwiki we are still constrained on sample size for experiments with low engagement rates. As I understand it, we would need to increase enrolment carefully, since we won't know the impact until we actually hit a limit. The risk is that if we run up against Varnish capacity limits, it starts evicting cache objects, which increases backend load and slows response times. If the hot/warm cache layers get evicted, we lose DDoS protection.
Working Hypothesis: If we gradually raise enrolment from 0.1% -> 0.2% -> 0.4% -> 0.8% -> 1.6% of enwiki and monitor system health at each step, we will get a clearer sense of the impact on the system at each increment.
This not only lets us increase enrolment, but also stress-tests the system and teaches us what performance impacts look like, so that we can detect them more quickly in the future.
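The ramp described above can be sketched as a simple schedule generator. This is a hypothetical illustration only: the start date, function name, and two-day interval are assumptions, not the actual experiment configuration.

```python
# Hypothetical sketch of the enrolment ramp: start at 0.1% of enwiki
# traffic and double every two days until reaching 1.6%. The start date
# and step interval are illustrative assumptions.
from datetime import date, timedelta


def ramp_schedule(start_rate=0.001, max_rate=0.016,
                  step_days=2, start=date(2025, 1, 6)):
    """Yield (date, enrolment_rate) pairs for each doubling step."""
    rate, day = start_rate, start
    while rate <= max_rate:
        yield day, rate
        rate *= 2  # double enrolment at each step
        day += timedelta(days=step_days)


for day, rate in ramp_schedule():
    print(f"{day}: {rate:.1%}")
```

Run once, this prints five steps (0.1%, 0.2%, 0.4%, 0.8%, 1.6%), one every two days; at each step we would pause to check the dashboards listed below before doubling again.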
Acceptance Criteria
- Remove the limit (shh!): we finally decided to tweak the database before turning the related experiments on (see https://phabricator.wikimedia.org/T407570#11331696)
- Run one experiment, starting enrolment at 0.1% and doubling the traffic every two days
- Track health on the relevant dashboards
Notes
Relevant dashboards to check performance:
- Varnish Caching
- Cache Host Drilldown
- MediaWiki vs. the Database
- Navigation Timing Breakdown
- Navigation timing by continent
- EventGate dashboard
Experiment details
- The experiment configuration is available in https://test-kitchen.wikimedia.org/experiment/synth-aa-test-traffic-impact
- Number of Impact Test Events (Superset chart)