
Test the impact of incremental increase in traffic for cache splitting experiments
Open, High · Public · 3 Estimated Story Points

Description

Current state: We're running 1-2 experiments at a time with no Varnish issues, but with the current limit of enrolling 0.1% of enwiki, we are still constrained on sample size for experiments with low engagement rates. As I understand it, we would need to increase enrolment carefully, since we won't know the impact until we actually hit it. The risk is that if we run up against Varnish capacity limits, it starts evicting cache objects, which increases backend load and slows response time. If hot/warm cache layers get evicted, we lose DDoS protection.

Working Hypothesis: If we gradually raise enrolment from 0.1% -> 0.2% -> 0.4% -> 0.8% -> 1.6% of enwiki and monitor system health at each step, we will get a better sense of impact on the system at each step.

This helps us not only increase enrolment, but also stress test the system and learn what performance impacts might look like, so that we can better detect them in the future.
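The doubling ramp described above can be sketched as a small schedule generator. The 0.1% starting point, doubling factor, and 1.6% cap come from this task; the two-day step interval matches the acceptance criteria, and the function itself is purely illustrative:

```python
from datetime import date, timedelta

def ramp_schedule(start: date, initial_pct: float = 0.1,
                  cap_pct: float = 1.6, step_days: int = 2):
    """Yield (date, enrolment %) pairs, doubling until the cap is reached."""
    pct = initial_pct
    day = start
    while pct <= cap_pct:
        yield day, round(pct, 4)
        pct *= 2
        day += timedelta(days=step_days)

# Five steps over eight days: 0.1 -> 0.2 -> 0.4 -> 0.8 -> 1.6
schedule = list(ramp_schedule(date(2026, 1, 12)))
```

In practice each step would only be taken after checking the relevant dashboards, so the real dates would slip; this just shows the intended shape of the ramp.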

Acceptance Criteria

  • Remove the limit (update: we finally decided to tweak the database directly before turning the related experiments on; see https://phabricator.wikimedia.org/T407570#11331696)
  • Run one experiment of increasing enrolment starting at 0.1% and doubling the traffic every two days
  • Track health on relevant dashboards

Notes

Relevant dashboard to check the performance

Experiment details

Event Timeline


Thanks for filing the task, @JVanderhoop-WMF. As per the discussion on Slack, the above sounds good.

Please let us know if Traffic can help with this in any way, or if further input is required.

JVanderhoop-WMF moved this task from Radar to READY TO GROOM on the Test Kitchen board.
JVanderhoop-WMF updated the task description.
JVanderhoop-WMF set the point value for this task to 3.

Remove the limit

@JVanderhoop-WMF I guess that, instead of removing the limit completely, we could just increase it a bit, right? Is 1.6% the exact limit we want to test? We don't want to test 40% of the traffic for enwiki, right? That way we can prevent someone from accidentally assigning too much traffic to that wiki while configuring something real.

Based on our slack conversation:

@Sfaci
I was wondering if we could just update the database directly before those test experiments start. That way we would avoid changing the limit, merging, deploying a new release, and so on. I don't like tweaking the database, but maybe it's better than the whole deployment process this time. And once something is in the database, there are no validations: the experiments could run perfectly.
@JVanderhoop-WMF
That's a great idea to reduce some of the front-end work. @Milimetric @cjming what do you think?
@Milimetric
uh... sure, as long as it happens before the experiments start, that works. I would just hate for Varnish to grab some bad config and then we'd have inconsistent sampling. But we can test for that in the resulting data too, so I guess that's ok.
@Sfaci
We can be fully sure about that, because Varnish won't know anything about the experiment before we turn it on; the API doesn't include those experiments in the response.
We can register the experiment, tweak the traffic, and then turn it on, and everything will be fine.
@cjming
agree - sounds good to me - it's just us for us

We have decided to tweak the database, instead of removing the validation rule, to increase the traffic once we register the experiments and before turning them on. That way Varnish won't know anything about them before they have the increased traffic we want to test.
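The agreed flow (register within the validated limit, raise the stored traffic directly, then activate) can be sketched with an in-memory SQLite table as a stand-in. The table and column names here are assumptions for illustration; the real xLab schema will differ:

```python
import sqlite3

# Hypothetical schema: the real xLab table/column names will differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiments (
    name        TEXT PRIMARY KEY,
    wiki        TEXT,
    traffic_pct REAL,   -- % of the wiki's traffic enrolled
    active      INTEGER -- 0 = registered but not yet visible to Varnish
)""")

# 1. Register through the normal path, within the validated 0.1% limit.
conn.execute("INSERT INTO experiments VALUES (?, ?, ?, ?)",
             ("synth-aa-test-traffic-impact", "enwiki", 0.1, 0))

# 2. Tweak the stored traffic directly, skipping application-level
#    validation. This is safe only while active = 0, since the config
#    API does not expose inactive experiments to Varnish.
conn.execute("UPDATE experiments SET traffic_pct = ? "
             "WHERE name = ? AND active = 0",
             (0.2, "synth-aa-test-traffic-impact"))

# 3. Only now turn the experiment on.
conn.execute("UPDATE experiments SET active = 1 WHERE name = ?",
             ("synth-aa-test-traffic-impact",))
conn.commit()
```

The key design point is the ordering: because the tweak happens before activation, Varnish never sees an intermediate config, which addresses the inconsistent-sampling concern raised above.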

@ssingh I have a couple of questions about this:

  • How long should an experiment run at a specific traffic level for us to see something relevant in the metrics? I'm wondering when we should move each experiment to the next stage (0.1% -> 0.2%, for example). Would day by day be enough? One week?
  • I don't really know how many experiments we should/can launch to test this, but in case we consider launching 4 or 5 experiments at the same time with 0.1% enwiki traffic as the initial value, can we increase the traffic to the next phase (0.2%, for example) for all of them at once, or should we increase them one by one with some delay, in case it is too much for Varnish?

I had a meeting with @BBlack from the Traffic team and we talked about the questions above as well as other related topics, some of them prompted by the great ideas and comments that Brandon raised. The following is a summary of the notes I took during that meeting, covering the answers to my previous questions and the related topics:

How long should an experiment run at a specific traffic level for us to see something relevant in the metrics? I'm wondering when we should move each experiment to the next stage (0.1% -> 0.2%, for example). Would day by day be enough? One week?

Two days should be enough to see changes in performance after starting any test or making any change to an existing one (increasing the traffic, for example).

I don't really know how many experiments we should/can launch to test this, but in case we consider launching 4 or 5 experiments at the same time with 0.1% enwiki traffic as the initial value, can we increase the traffic to the next phase (0.2%, for example) for all of them at once, or should we increase them one by one with some delay, in case it is too much for Varnish?

The answer is not easy. In short, there is no single right number. It's more about what's an acceptable tradeoff in performance, for example what performance drop we are willing to accept. I guess this test can also be useful to see how overall performance degrades as we increase the number of experiments or the total assigned traffic. Once we have seen that, we could define some numbers and/or that acceptable tradeoff.

We were also discussing the following:

  • We have to think about aggregation. It's not the number of experiments or the traffic assigned to each one that matters; it's the aggregate of all of them
  • Regarding Varnish and the related infrastructure, there is no performance limitation: Varnish could run a lot of experiments. It's more about the implementation of edge-unique, because at that point we were talking about 100 experiments as the limit to consider. That implementation could be improved and optimized
  • Regarding the metrics to monitor, RUM (Real User Monitoring) metrics would be the most interesting/relevant
  • We should let the Traffic team know when we start a test like this

We also talked about privacy (this is probably for a different ticket)

  • The same percentage means a different number of devices depending on the wiki (I think this was discussed previously, but we never implemented it and I would say we didn't even create a ticket). Less popular wikis should have a lower traffic limit to protect users' privacy
  • We could use https://stats.wikimedia.org/#/en.wikipedia.org to map % of traffic to number of devices, to give advice and set validation rules in the xLab UI
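The %-to-devices mapping suggested above could back a simple validation rule in the xLab UI. The device counts and the privacy floor below are placeholders, not real figures (real numbers would come from stats.wikimedia.org), and the rule itself is one possible interpretation of the privacy concern:

```python
# Placeholder monthly unique-device counts; real figures would come
# from https://stats.wikimedia.org. These are NOT real numbers.
UNIQUE_DEVICES = {
    "enwiki": 800_000_000,
    "smallwiki": 50_000,  # hypothetical low-traffic wiki
}

MIN_ENROLLED_DEVICES = 1_000  # hypothetical privacy floor

def enrolled_devices(wiki: str, traffic_pct: float) -> int:
    """Approximate number of devices enrolled at a given traffic %."""
    return int(UNIQUE_DEVICES[wiki] * traffic_pct / 100)

def passes_privacy_floor(wiki: str, traffic_pct: float) -> bool:
    """One possible UI validation rule: reject configurations whose
    enrolled audience is too small to protect user privacy."""
    return enrolled_devices(wiki, traffic_pct) >= MIN_ENROLLED_DEVICES
```

With placeholders like these, 0.1% of enwiki is a large audience while 0.1% of a small wiki enrolls only a handful of devices, which is the asymmetry the bullet above describes.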

That said, I would say that, at the very least, we could start doing the following for this task:

  • Register an experiment and increase its traffic (starting at 0.1%) gradually (every 48 hours) to see how that affects performance (according to the mentioned RUM metrics). I guess we could use something like the synthetic experiments we ran at the beginning to test the whole thing for logged-out users
  • Let the Traffic team know about the above
  • Once we have some results, we will know something about how aggregate traffic affects overall performance and what the limit would be for the current implementation. Then I guess we could start discussing what performance drop we would be willing to accept in order to set a limit. From there, it could be that we have to improve/optimize the implementation if we need/want to increase the number of experiments or the traffic without a significant performance drop
  • I'll file a separate ticket about the privacy issue

Thank you for summarizing and sharing your conversations! This is great.

I think we should have some analytics instrumentation. At minimum we should have page visit events and we should monitor EventGate capacity too.

This is excellent. Thank you @Sfaci and @BBlack

Register an experiment and increase its traffic (starting at 0.1%) gradually (every 48 hours)

Would we still follow the increase by doubling?

I think we should have some analytics instrumentation. At minimum we should have page visit events and we should monitor EventGate capacity too.

Good idea @mpopov! I'll prepare it to track page visit events and I'll add the dashboard to monitor EventGate as another relevant metric to consider here.

Would we still follow the increase by doubling?

@JVanderhoop-WMF I guess so. All of this is an unknown for us, so any criterion, this one for example, is a good starting point. Let's see what happens. If the metrics end up showing something relevant, we can always reconsider.

Sfaci updated the task description.

Change #1203513 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[mediawiki/extensions/WikimediaEvents@master] ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument

https://gerrit.wikimedia.org/r/1203513

The change above creates the experiment-specific instrument for the experiment we will register and launch for the traffic-impact test, with the following configuration:

  • Experiment machine-readable name: synth-aa-test-traffic-impact
  • It will be a synthetic A/A test: there won't be any difference between the control and treatment groups
  • The experiment will send PageVisit events
  • There will be traffic only for enwiki, with 0.1% as its initial value
  • We will double the traffic every 2-3 days after monitoring the relevant metrics: RUM metrics for Varnish + EventGate
  • Stream: product_metrics.web_base (by default)
  • SchemaID: /analytics/product_metrics/web/base/1.4.2 (by default)
  • Risk Level: Low risk
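For reference, the configuration above could be collected into a single registration payload. The field names here are illustrative only, not the actual MPIC/xLab API; the values are taken from the bullets above:

```python
# Hypothetical registration payload mirroring the configuration above.
# Field names are assumptions; the real MPIC/xLab fields will differ.
experiment_config = {
    "name": "synth-aa-test-traffic-impact",
    "type": "A/A",                      # synthetic: control == treatment
    "events": ["PageVisit"],
    "traffic": {"enwiki": 0.1},         # initial % of enwiki traffic
    "stream": "product_metrics.web_base",
    "schema_id": "/analytics/product_metrics/web/base/1.4.2",
    "risk_level": "low",
    "ramp": {"factor": 2, "interval_days": (2, 3)},  # double every 2-3 days
}
```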

This setup looks good to me @Sfaci! A good question you raised is: Do we want to stop at 1.6% (so that we can say that 1% of enwiki is safe for experimentation), or do we want to keep going until we see an impact?

@BBlack would you want to see an impact (i.e. do a true stress test - 3.2% enwiki, beyond), or are you happy to stop at 1.6%? We will have two experiments running on enwiki as of Monday 11/24.

Change #1203513 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument

https://gerrit.wikimedia.org/r/1203513

Change #1215214 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[mediawiki/extensions/WikimediaEvents@wmf/1.46.0-wmf.5] ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument

https://gerrit.wikimedia.org/r/1215214

The experiment (https://mpic.wikimedia.org/experiment/synth-aa-test-traffic-impact) has already been registered according to the configuration mentioned above. Its activation is planned for December 9th; the implementation will have been backported early that morning.

Change #1215214 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.46.0-wmf.5] ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument

https://gerrit.wikimedia.org/r/1215214

Mentioned in SAL (#wikimedia-operations) [2025-12-09T08:21:22Z] <wmde-fisch@deploy2002> Started scap sync-world: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-09T08:23:23Z] <wmde-fisch@deploy2002> wmde-fisch, sfaci: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-09T08:30:19Z] <wmde-fisch@deploy2002> Finished scap sync-world: Backport for [[gerrit:1215214|ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument (T407570)]], [[gerrit:1216553|VE: Don't create a synth ref when there's a LDR main ref (T411245)]] (duration: 08m 56s)

The ticket is BLOCKED because there was an incident related to the cache. We have pre-emptively disabled this experiment until the incident is resolved.

@ssingh - when might your team be comfortable with us running the test? Has the incident been resolved or mitigated?

Hi @JVanderhoop-WMF: The Dec 8 incident has been resolved and we can resume this again. Let us know when you plan to run it so that we are aware. Thanks.

Thanks @ssingh for letting us know!

We are now configuring the instrument to start next Monday (January 12th). The initial load will be 0.1% for enwiki, which is the value we currently allow for any other experiment, so nothing special should happen initially. As planned in https://phabricator.wikimedia.org/T407570#11361528, we will double the traffic for that wiki every 2 or 3 days after checking that everything looks good. In any case, we'll reach out to you before doubling to make sure we all agree.

Change #1226276 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] Test Kitchen UI: Deploying v1.1.5 release to staging

https://gerrit.wikimedia.org/r/1226276

Change #1226276 merged by jenkins-bot:

[operations/deployment-charts@master] Test Kitchen UI: Deploying v1.1.5 release to staging

https://gerrit.wikimedia.org/r/1226276

Change #1226281 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] Test Kitchen UI: Deploying v1.1.5 release to staging

https://gerrit.wikimedia.org/r/1226281

Change #1226281 merged by jenkins-bot:

[operations/deployment-charts@master] Test Kitchen UI: Deploying v1.1.5 release to production

https://gerrit.wikimedia.org/r/1226281

@ssingh FYI: After a couple of days running this experiment with 0.1% of enwiki traffic (our current limit per experiment), we are going to double that amount for the rest of the week. The plan is to double the traffic again next Monday.

Hi again @ssingh!
We are going to double the traffic again. It will be 0.4% from now on.

We'll also extend the experiment a while, because we have been doubling the traffic more slowly than expected and two weeks won't be enough to reach the limit we planned at the beginning.

Thanks @Sfaci. Do you have a date for going to 0.4%? And can you remind me which experiment this is under https://test-kitchen.wikimedia.org/?

Hi again @ssingh!
We have just doubled the traffic again. It's 0.8% from now on.

@ssingh Now that we are running with 0.8% traffic and everything seems to be OK, we are wondering if we could/should extend this experiment, both in duration and beyond the limit we established at the beginning (1.6%), in case we reach it and everything is still fine. What do you think? We think it would be useful to know more about the performance and limits of the whole system for logged-out experiments, in preparation for the near/long term, when we expect more experiments to be running at the same time. We don't mean to run this experiment forever until we break something, but we'd at least like to try something like 5% for enwiki, probably increasing more slowly once we reach 1.6% (1% at a time, for example, instead of doubling).

I also wanted to take the opportunity to ask whether you have noticed any performance loss because of this experiment. I'm looking at the related dashboard and, as far as I can tell, everything seems unchanged. Is that right?

Thank you very much!

Hi @Sfaci. Most of the team is out this week for the SRE offsite so we will follow up on your question and the ask for observation during the week of Feb 2. Thanks.

@ssingh That's ok!
In the meantime we are going to double the traffic right now from 0.8% to 1.6%, as initially planned. That amount is the limit we set when we defined this ticket.

Sounds good, thank you. We still have not gone through the dashboards in detail but so far everything looks fine on a quick check.