
Adjust Discussion Tools' sampling rates
Closed, Resolved (Public)

Description

Per the point @DLynch raised in T265099#6784140, this task is about adjusting the sampling rate within the Discussion Tools to ensure we have the data required to fulfill T263053 and T263054.

Requirements

  • Adjust event sampling within the Reply and New Discussion Tools such that 1 out of every 5 events is logged.
  • This change in the rate at which events are sampled should be applied on every wiki where the New Discussion Tool and/or Reply Tool is available.

Open questions

  • Should the sampling rates for the New Discussion Tool and Reply Tool be the same? If no, additional work will need to be done before this will be possible (currently, the two tools' sampling rates cannot be set differently).
    • Yes, for the time being, it is fine for the sampling rates for both the New Discussion and Reply Tools to be the same. We will revisit this decision before offering the Reply Tool as an opt-out feature on more Wikipedias. See: T274471.
  • On which wikis should events be oversampled? Currently, events are only oversampled at the Arabic, Dutch, French and Hungarian Wikipedias and mediawiki.
    • Events on all wikis where the New Discussion Tool and/or Reply Tool are available should be oversampled.

Done

  • All Open questions are answered
  • Sampling rates are adjusted to meet the Requirements above

Event Timeline

Explanation of sampling rate quirks:

The base sampling rate is set through wgDTSchemaEditAttemptStepSamplingRate, which is the variable mentioned above as being set explicitly to 0.2 on those five wikis (the Arabic, Dutch, French and Hungarian Wikipedias and mediawiki).

If that value isn't set explicitly, we fall back to the value of wgWMESchemaEditAttemptStepSamplingRate from the WikimediaEvents extension. This defaults to 1/16 and isn't overridden anywhere. This is the config variable that also controls the sampling rate for VisualEditor, WikiEditor, and MobileFrontend's use of EditAttemptStep.
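To make the fallback concrete, here is a minimal sketch of the lookup described above. It is an illustration only, not the extension's code: the function name resolve_base_rate and the use of None to mean "not set explicitly" are assumptions; only the 0.2 override and the 1/16 default come from the discussion.

```python
from typing import Optional

# Default stated above for wgWMESchemaEditAttemptStepSamplingRate.
WME_DEFAULT_RATE = 1 / 16


def resolve_base_rate(dt_rate: Optional[float],
                      wme_rate: float = WME_DEFAULT_RATE) -> float:
    """Return the DT-specific rate when one is set, else the WikimediaEvents rate.

    `dt_rate` stands in for wgDTSchemaEditAttemptStepSamplingRate; None is a
    hypothetical marker for "not set explicitly on this wiki".
    """
    return dt_rate if dt_rate is not None else wme_rate


if __name__ == "__main__":
    print(resolve_base_rate(0.2))   # wiki with an explicit override -> 0.2
    print(resolve_base_rate(None))  # wiki without one -> 0.0625 (1/16)
```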

Separately, we have a concept called oversampling. An event is oversampled when it is logged even though the normal rate wouldn't have caused it to be logged. As part of the EventLogging platform, we log whether an event was oversampled, so analysis can check that flag to avoid over-representing things.
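Put another way, the definition above boils down to two booleans. The sketch below is one reading of it, with hypothetical names (logging_decision, in_sample, oversample_requested); how the real code decides whether a session falls inside the normal sample is not covered here and is simply passed in.

```python
from typing import Tuple


def logging_decision(in_sample: bool,
                     oversample_requested: bool) -> Tuple[bool, bool]:
    """Return (should_log, is_oversample) for one event.

    in_sample: whether the normal sampling rate selected this session.
    oversample_requested: whether some oversampling rule applies to it.
    """
    should_log = in_sample or oversample_requested
    # Only flag events that the normal rate would NOT have logged.
    is_oversample = should_log and not in_sample
    return should_log, is_oversample


if __name__ == "__main__":
    for in_sample in (True, False):
        for requested in (True, False):
            print(in_sample, requested, logging_decision(in_sample, requested))
```

Note that under this reading an event that the normal rate selected is never flagged as oversampled, even when oversampling is switched on, so the unflagged events still form a sample at the base rate.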

For DT there's a config variable called wgDTSchemaEditAttemptStepOversample; if we set it to true, it'll log 100% of the events. We don't currently override this anywhere, but it parallels a similar feature in MobileFrontend, which is used to oversample EAS logging for either specific editors or all editors, depending on how it's set. (On MobileFrontend we currently always oversample VE sessions, and also oversample all sessions for 20 specific wikis.) If we want to artificially bump sampling for one specific feature, this is probably where we should implement it.

There's then a config variable called wgWMESchemaEditAttemptStepOversample from WikimediaEvents which can also trigger oversampling. That's never set by config, but is rather done programmatically -- it automatically oversamples all sessions from users whose account is less than a day old. (Or anyone who has the editingStatsOversample=1 URL parameter set.) There's also a hook which can trigger this route, but I don't think it's used anywhere.
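The oversampling triggers described in the last two paragraphs can be summarised as a single predicate. This is a rough sketch under stated assumptions: the function name wants_oversample and its parameters are hypothetical, the hook route is left out, and treating a missing account age (for example a logged-out user) as "no trigger" is a guess, since the discussion doesn't say how that case is handled.

```python
from datetime import timedelta
from typing import Optional
from urllib.parse import parse_qs


def wants_oversample(dt_oversample: bool,
                     account_age: Optional[timedelta],
                     query_string: str) -> bool:
    """Return True if any of the described oversampling triggers applies.

    dt_oversample: stands in for wgDTSchemaEditAttemptStepOversample.
    account_age: age of the user's account, or None if there is no account
                 (assumption: no account means this trigger does not fire).
    query_string: URL query string without the leading "?".
    """
    if dt_oversample:
        return True
    if account_age is not None and account_age < timedelta(days=1):
        return True
    params = parse_qs(query_string)
    return params.get("editingStatsOversample", [""])[0] == "1"


if __name__ == "__main__":
    print(wants_oversample(False, timedelta(hours=3), ""))
    print(wants_oversample(False, timedelta(days=400), "editingStatsOversample=1"))
    print(wants_oversample(False, timedelta(days=400), "action=edit"))
```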

Outcomes from the conversations @MNeisler and I had today, 10-Feb

Open questions

  • Should the sampling rates for the New Discussion Tool and Reply Tool be the same? If NO, additional work will need to be done before this will be possible (currently, the two tools' sampling rates cannot be set differently).

Yes, for the time being, it is fine for the sampling rates for both the New Discussion and Reply Tools to be the same. We will revisit this decision before offering the Reply Tool as an opt-out feature on more Wikipedias. See: T274471.

  • On which wikis should events be oversampled? Currently, events are only oversampled at the Arabic, Dutch, French and Hungarian Wikipedias and mediawiki.

Events on all wikis where the New Discussion Tool and/or Reply Tool are available should be oversampled.


The task description has been updated to reflect the above.

Events on all wikis where the New Discussion Tool and/or Reply Tool are available should be oversampled.

Do you mean "oversampled" as in "sampled at 100%", or are we thinking of some other rate like the 20% at the current five wikis?

@MNeisler Actually, question, for data-integrity purposes do you (a) actually use EventLogging's oversampled attribute, and (b) if so should we be logging things as oversampled if they're being disproportionately sampled between tools? (Since it presumably gets more complicated to analyze comparisons if the rates are different?)

@DLynch

Do you mean "oversampled" as in "sampled at 100%", or are we thinking of some other rate like the 20% at the current five wikis?

Per discussions with @ppelberg yesterday, we'd like to sample both the New Discussion and Reply Tool events at 100%. Based on the number of daily Reply Tool events, this shouldn't be a concern, but we should revisit the sampling rate before offering the Reply Tool as an opt-out on additional Wikipedias.

@MNeisler Actually, question, for data-integrity purposes do you (a) actually use EventLogging's oversampled attribute,

Yes, I use the is_oversample field of EditAttemptStep during analysis, especially when I'm comparing data across multiple features or tools where different sampling rates may have been applied (for the exact reason you mentioned: it's difficult to make comparisons if the rates are different). It's also helpful during QA to identify whether a bucket imbalance in an A/B test, or another large discrepancy in the number of logged events, is due to oversampling.

and (b) if so should we be logging things as oversampled if they're being disproportionately sampled between tools? (Since it presumably gets more complicated to analyze comparisons if the rates are different?)

Yes, anything that is sampled at a different rate than the standard sampling rate should ideally be logged as oversampled.
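As an illustration of why the flag matters for comparisons, here is one way an analysis might use it: drop the oversampled events so that every tool is counted at the same base rate. This is a sketch, not the actual analysis code. The record layout, the "tool" key, and the function name comparable_counts are hypothetical, and the oversample flag is called is_oversample here.

```python
from typing import Dict, Iterable, Mapping


def comparable_counts(events: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count events per tool, ignoring those flagged as oversampled.

    Each event is assumed to be a mapping with a hypothetical "tool" key and
    a boolean "is_oversample" key.
    """
    counts: Dict[str, int] = {}
    for event in events:
        if event["is_oversample"]:
            continue  # logged only because of oversampling; skip for parity
        tool = str(event["tool"])
        counts[tool] = counts.get(tool, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        {"tool": "discussiontools", "is_oversample": True},
        {"tool": "discussiontools", "is_oversample": False},
        {"tool": "visualeditor", "is_oversample": False},
        {"tool": "discussiontools", "is_oversample": True},
    ]
    print(comparable_counts(sample))
```

Without the filter, the oversampled tool would appear three times as active as the other one in this made-up sample; with it, both are counted on the same footing.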

Change 663672 had a related patch set uploaded (by DLynch; owner: DLynch):
[operations/mediawiki-config@master] Oversample DiscussionTools EditAttemptStep logging

https://gerrit.wikimedia.org/r/663672

That patch changes the current state of affairs: the existing config setting 5 wikis to sample at 0.2 is removed, and replaced by oversampling all logging from DiscussionTools.

Change 663672 merged by jenkins-bot:
[operations/mediawiki-config@master] Oversample DiscussionTools EditAttemptStep logging

https://gerrit.wikimedia.org/r/663672

Mentioned in SAL (#wikimedia-operations) [2021-02-12T00:32:48Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: a022f2b506089ab518b74c1dfca78924c06dc80f: Oversample DiscussionTools EditAttemptStep logging (T273946) (duration: 01m 08s)

ppelberg claimed this task.

Regarding QA...
@MNeisler and I just talked; QA of this task will happen in T273096.