Background
While comparing the results of T397143: Run a logged-in synthetic A/A test using the PHP SDK, as analyzed by the Test Kitchen automated analytics MVP versus GrowthBook, I found a mistake in the degrees-of-freedom calculation for the t-distribution in our frequentist engine implementation. When I copied the Welch–Satterthwaite equation before implementing it, I missed a small detail: the sample sizes in the denominator must be squared. Here is the change that should be made in the implementation:
  nu_approx =
    # Numerator:
    (((q.s2_C / q.n_C) + (q.s2_T / q.n_T))**2) /
    # Denominator:
-   ( (q.s2_C**2 / (q.n_C * (q.n_C - 1))) + (q.s2_T**2 / (q.n_T * (q.n_T - 1))) )
+   ( (q.s2_C**2 / ((q.n_C**2) * (q.n_C - 1))) + (q.s2_T**2 / ((q.n_T**2) * (q.n_T - 1))) )
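For reference, the corrected line matches the standard Welch–Satterthwaite approximation. The notation below assumes that `s2_C` and `s2_T` hold the sample variances of the control and treatment groups and that `n_C` and `n_T` are the group sizes. That reading is inferred from the field names, not confirmed by the schema.

```latex
\nu \approx
\frac{\left( \dfrac{s_C^2}{n_C} + \dfrac{s_T^2}{n_T} \right)^2}
     {\dfrac{s_C^4}{n_C^2 \, (n_C - 1)} + \dfrac{s_T^4}{n_T^2 \, (n_T - 1)}}
```

Each denominator term is the square of $s^2 / n$ divided by $n - 1$, which is where the $n^2$ factor comes from.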
This issue produces incorrect confidence intervals and p-values in the frequentist part of the analysis. The Bayesian engine is unaffected, and the estimates themselves are also unaffected.
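To illustrate the size of the error, here is a minimal Python sketch of the degrees-of-freedom calculation before and after the fix. The function names and the example inputs are hypothetical and are not taken from the automated analytics job. The arguments are assumed to be sample variances and group sizes, mirroring `s2_C`, `n_C`, `s2_T`, `n_T` above.

```python
def welch_df_buggy(s2_c, n_c, s2_t, n_t):
    """Degrees of freedom as originally implemented (n is not squared)."""
    numerator = (s2_c / n_c + s2_t / n_t) ** 2
    denominator = (
        s2_c**2 / (n_c * (n_c - 1))
        + s2_t**2 / (n_t * (n_t - 1))
    )
    return numerator / denominator


def welch_df(s2_c, n_c, s2_t, n_t):
    """Welch-Satterthwaite degrees of freedom (n squared in the denominator)."""
    numerator = (s2_c / n_c + s2_t / n_t) ** 2
    denominator = (
        s2_c**2 / (n_c**2 * (n_c - 1))
        + s2_t**2 / (n_t**2 * (n_t - 1))
    )
    return numerator / denominator


if __name__ == "__main__":
    # Made-up inputs: equal variances and equal group sizes.
    s2, n = 4.0, 100
    print("buggy df:    ", welch_df_buggy(s2, n, s2, n))
    print("corrected df:", welch_df(s2, n, s2, n))
```

With equal variances and equal group sizes of n, the corrected value reduces to 2(n - 1), while the original calculation is smaller by a factor of n. A degrees-of-freedom value that is too small gives the t-distribution heavier tails, so the affected confidence intervals would be expected to be too wide and the p-values too large.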
Acceptance criteria
- Stakeholders are notified of the incident and plan (posted in #talk-to-experiment-platform in WMF Slack)
- Automated analytics job fixed (patched, fix deployed)
- relax R package updated
- Previously analyzed experiments are re-analyzed and corrected results are made available in a spreadsheet (e.g. previous confidence interval & p-value, new CI and p-value for each experiment)
- Corrected results are loaded into the database
- Change Log tab added to the Superset dashboard, where the correction is detailed
- A temporary (month-long) notice is added to the top of the Superset dashboard, noting which experiments were affected and corrected and referring readers to the Change Log tab for more details
- Incident report published at https://wikitech.wikimedia.org/wiki/Test_Kitchen/Incident_reports