
QA Revise Tone Instrumentation & UX
Open, High, Public, 2 Estimated Story Points

Description

User story & summary:

As the Growth team, I want instrumentation tested for the Revise Tone task, so that we can ensure our A/B test will accurately measure what we have defined in our Measurement Plan and Instrumentation Specs.

As the Growth team, I want the full user experience of the Revise Tone task to be tested, so that we can release to newcomers with confidence.

Acceptance criteria:

Details

Other Assignee
Iflorez

Event Timeline

KStoller-WMF moved this task from Inbox to Up Next (estimated tasks) on the Growth-Team board.
KStoller-WMF set the point value for this task to 2.
KStoller-WMF renamed this task from Product Analytics: QA Revise Tone Instrumentation to QA Revise Tone Instrumentation & UX. Nov 21 2025, 7:31 PM
KStoller-WMF reassigned this task from Iflorez to Etonkovidova.
KStoller-WMF updated Other Assignee, added: Iflorez.
KStoller-WMF updated the task description.

@Michael, @Sgs - for review. The following issues might be specific to the testwiki environment. I listed them in order from most impactful to least impactful.

Testing results - Summary

Desktop
(1) Suggested: revise tone tag is added for a normal edit.

  • on an article with Revise Tone suggestions, make a simple edit (not related to Revise Tone section) and publish
  • that edit will have the Newcomer task and Suggested: revise tone tags

Examples:
https://test.wikipedia.org/w/index.php?title=Shovelware&diff=685865&oldid=655204
https://test.wikipedia.org/w/index.php?title=Whetting_Your_Appetite&curid=121461&diff=685868&oldid=562977
(2) Revise Tone label is displayed twice

Screenshot 2025-11-26 at 11.19.21 AM.png (624×2 px, 320 KB)

(3) No scrolling to the Revise Tone section
(4) Revise tone on testwiki has placed a suggestion in the reference sections

https://test.wikipedia.org/wiki/Rushing_game_development

Screenshot 2025-11-26 at 9.45.51 AM.png (638×2 px, 374 KB)

(5) The feed of Revise Tone articles keeps the articles that were revised or marked as "the tone is ok"

@Sgs and @Michael
Live testing on Test wiki while reviewing Console event data shows discrepancies.

Testing Performed:

Issues Identified:

  • Three events appear in Console but were not sent (due to experiment override rules):
a) `treatment-exposure`
b) `page-visted` for Task Rejection Rate
c) `click` on the Get-Started button at the end of the onboarding quiz
  • The `click` on the Get-Started button at the end of the onboarding quiz contains an incorrect action_source; expected: Quiz10, actual: Quiz-step5.
  • Remaining events were neither triggered nor sent:
d) `Experiment-exposure`
e) `edit_saved` 
f) decline `click` for Task Rejection Rate
g) the `click` on the RT card itself which launches the RT onboarding

Expected vs. Actual

| Item | Expected | Actual |
| --- | --- | --- |
| a) treatment-exposure; b) page-visted for Task Rejection Rate; c) click on Get-Started button at the end of the onboarding quiz | Triggered + Sent | Triggered only (not sent) |
| action_source of the click on the Get-Started button at the end of the onboarding quiz | Quiz10 | Quiz-step5 |
| d) Experiment-exposure; e) edit_saved; f) decline click for Task Rejection Rate; g) click on the RT card itself which launches the RT onboarding | Triggered + Sent | Not triggered, not sent |

Impact: Incorrect or missing events break alignment with the instrumentation spec.

Next Steps

  • Review and fix instrumentation logic so events trigger and send correctly for d) Experiment-exposure, e) edit_saved, f) decline click for Task Rejection Rate, and g) the click on the RT card itself which launches the RT onboarding.
  • Correct action_source for click on the Get-Started button at the end of the onboarding quiz.
  • Confirm experiment override conditions and adjust if needed for a) treatment-exposure, b) page-visted for Task Rejection Rate, c) click on Get-Started button at the end of the onboarding quiz.
  • Once fixes are deployed, re-run validation on Test wiki and/or pilot wikis to confirm resolution.
  • Review data on the experiment stream for counts, null checks, uniqueness checks, and schema validation.
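The count / null / uniqueness checks in the last step could be sketched as a small validation pass over exported event records. This is only an illustration: the field names (`action`, `dt`, `event_id`) are placeholders, and the real required fields come from the instrumentation spec.

```python
def validate_events(events, required_fields=("action", "dt"), id_field="event_id"):
    """Run basic count, null, and uniqueness checks over a list of event dicts.

    Field names are illustrative stand-ins for the real schema fields.
    """
    report = {"count": len(events), "null_violations": [], "duplicate_ids": []}
    seen = set()
    for i, ev in enumerate(events):
        # Null check: every required field must be present and non-null.
        for field in required_fields:
            if ev.get(field) is None:
                report["null_violations"].append((i, field))
        # Uniqueness check on the (hypothetical) event id field.
        ev_id = ev.get(id_field)
        if ev_id in seen:
            report["duplicate_ids"].append(ev_id)
        seen.add(ev_id)
    return report
```

Schema validation proper would additionally check each event against its JSON Schema, which the sketch above deliberately leaves out.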


First of all, apologies, I think there's been some miscommunication about the suitability of using testwiki as a testing stage for this task. While it is true that after globally enabling the Revise tone feature in testwiki (T407029) the UX of the feature was fully testable by overriding the assigned group for any user, as documented in Test_Kitchen/Conduct_an_experiment#Enrollment_override, the instrumentation and data collection are not testable in an end-to-end manner. I will try to describe at a high level why this is the case and will provide answers to the specific RT instrumentation issues at the end (Remaining clarifications); hopefully this will clarify most things. There are two main aspects that were/are flawed in the way this test was performed: (1) the TestKitchen platform has some levels of testing indirection, and (2) MediaWiki/GrowthExperiments instrumentation is implemented with a mixture of client-side and server-side events.

TestKitchen testing indirection

  • Shared config between testwiki and experiment target wikis: TestKitchen requires creating an experiment configuration in https://mpic.wikimedia.org/ in order to set up the experiment name, its machine-readable name, the experiment start and end dates, sampling units, etc. Since the machine-readable experiment name is used in code to get the right user assignments, we're locked into one of two options: a shared config between testwiki and the pilots, which makes it impossible to start the experiment in testwiki without starting it in the rest of the wikis; or two separate configs, which then requires different code to handle both machine-readable experiment names. For the Revise tone experiment we opted for a single config, and since it was never enabled for any wiki, data collection was never enabled in testwiki. Looking back, maybe we could have set up 0% sampling traffic for the pilot wikis and 100% for testwiki, assuming the sampling could be increased for the pilot wikis after the experiment had already started. This also shifts the start date from what we see in the MPIC dashboard, which may or may not be confusing for some.
  • Overridden users do not emit data: validating all events for an experiment requires using a user in the treatment group to fully complete the funnel, and it is difficult to get a user that falls into that group by creating accounts or reusing any previously created one. The change was introduced around August, and at the time I was not yet opinionated about this (see also the Slack thread).

Testing GrowthExperiments instrumentation

As I understand it, an instrumentation end-to-end test for GrowthExperiments requires checking that the specified events are emitted, that they are properly ingested (valid events), and that the valid events provide the necessary and relevant information for the metric calculation. To check that events are indeed emitted while interacting with the feature from the browser, looking at the JS dev console or network tab is not enough, because some of these events are sent by the server and the browser has no awareness of that data being collected. This is the case for d) Experiment-exposure, e) edit_saved, and g) the click on the RT card itself which launches the RT onboarding. The reason why some events are sent from the server and others from the client is both historical and sometimes/often down to technical limitations or convenience. This is a decision I've always found unergonomic, and we could revisit it internally.

To check event emission (and ingestion) one could use kafkacat from a stat server, but I've always used the EventStreams beta web app or the one for the production cluster (Event_Platform/Instrumentation_How_To), which shows the full ingested event regardless of whether it was emitted on the client or the server. This lets me check the full content of the event, including contextual attributes, which I find useful.
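EventStreams exposes ingested events over Server-Sent Events, so a stream can also be tailed from a script instead of the web app. A minimal sketch of parsing SSE `data:` lines into event dicts follows; the stream URL in the comment is illustrative (the public recentchange stream, not the experiment stream), and the connection code is left commented out since it needs network access.

```python
import json


def parse_sse_events(lines):
    """Parse Server-Sent Events lines, returning the JSON payloads of 'data:' lines.

    Comment lines (starting with ':') and blank keep-alive lines are skipped.
    """
    events = []
    for line in lines:
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events


# To tail a live stream (requires network access), something like:
# import urllib.request
# resp = urllib.request.urlopen(
#     "https://stream.wikimedia.org/v2/stream/recentchange")  # illustrative stream
# for raw in resp:
#     for ev in parse_sse_events([raw.decode("utf-8")]):
#         print(ev.get("meta", {}).get("stream"), ev.get("dt"))
```

Because the events are read from the ingestion side, this shows client-emitted and server-emitted events alike, which is the property that matters for this QA.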

In summary, testwiki is not yet a suitable environment for testing the Revise tone experiment. I think there's room to refine some TestKitchen and Growth decisions/processes so we can simplify instrumentation QA for the next experiments. Orchestrating code changes and config changes through our heterogeneous deployment system has proved challenging, and it would be nice to streamline this process as much as we can if we want an experiment velocity of one experiment per month. I think this is something for us in Growth (cc @DMburugu @KStoller-WMF) to discuss further with the Test Kitchen maintainers, to come up with a more structured QA/deployment plan for the next experiments.

Remaining clarifications

  • The `click` on the Get-Started button at the end of the onboarding quiz contains an incorrect action_source; expected: Quiz10, actual: Quiz-step5.
click {
  "instrument_name": "Revise tone onboarding dialog end click",
  "action_subtype": "get-started",
  "action_source": "Quiz-step-5",
  "action_context": "unanswered,unanswered,unanswered,unanswered,unanswered"
}

This was an implementation decision I deliberately made. I considered Quiz10 not very meaningful since, as far as I could understand, it stands for some Figma ID and not something that's meaningful in a MediaWiki UI context. Since the interface the interaction comes from (the Get started button) is only shown in the 5th step of the onboarding dialog, this will always be the same. But if we were to track clicks on the I already know this button, the value would contain the step number at which the user bailed out, e.g.: "action_source": "Quiz-step-3".

f) decline `click` for Task Rejection Rate
click {
  "action_subtype": "decline",
  "action_source": "EditCheck-1",
  "instrument_name": "Click on decline revise tone"
}

I'm not sure why you didn't get this event in the console; I can trigger it correctly after clicking on one of the radio button options and clicking on submit. One thing that could have happened is that the page gets reloaded right after submission, and with default settings that would flush any logs in your JS console. You could try enabling Preserve log in the JS console settings and see if the log is then visible.

@JVanderhoop-WMF & @mpopov looping you both in just so you are aware of some of our testing challenges. I don't think any immediate action is needed, but perhaps this is a challenge you have ideas on how to improve in the future? Or perhaps there should be Test Kitchen recommendations on how to check instrumentation before starting an experiment?

@Sgs Thank you for the detailed clarification.

Items a-c (events triggered with an override setting but not sent): That matches what I observed: overrides trigger the flows but don’t result in emitted or ingested experiment events. I’ll note this as a limitation of override-based QA on testwiki for instrumentation validation purposes here.

Items d-g (events not triggered or sent): Same as above per your notes; thank you for confirming.

Item c, Get-Started click: Thanks for flagging this. As discussed on Slack, the main requirement is that the Get-Started click is correctly instrumented; the exact action_source listed is flexible. I appreciate your opting for an action_source title that works in the longer term.

Item f: Thank you for the explanation. That makes sense, a page reload on submission could explain why the event wasn’t visible in the console. I’ll recheck with Preserve logs enabled to confirm the client-side emission, with the understanding that this still wouldn’t validate ingestion under override conditions.

At present, experiment data QA isn't possible until the experiment is enabled (unless we run a separate experiment pilot for data validation). I'm excited about Test Kitchen's work on experiment phases in the next quarter, which may close this gap. To address this for this experiment, I will review the raw data in mediawiki.product_metrics.contributors.experiments soon after the experiment starts, so that if there is an issue it is quickly identified and, if needed, we can stop the experiment, address/fix the issue, and restart it.

Testing results - Summary

(Revised on Dec 16, 2025, testwiki wmf.7)

The following items were re-tested:

  • (1) Suggested: revise tone tag is added for a normal edit. - ✅ Non-reproducible
  • on an article with Revise Tone suggestions, make a simple edit (not related to Revise Tone section) and publish
  • that edit will have Newcomer task Suggested: revise tone tags

Examples:
https://test.wikipedia.org/w/index.php?title=Shovelware&diff=685865&oldid=655204
https://test.wikipedia.org/w/index.php?title=Whetting_Your_Appetite&curid=121461&diff=685868&oldid=562977

There are some user workflows limitations when switching between editors on Revise Tone articles - filed as T412832: [QA task] Revise Tone - user editing workflows limitations

(2) Revise Tone label is displayed twice - ✅ Fixed

Screenshot 2025-11-26 at 11.19.21 AM.png (624×2 px, 320 KB)

(3) No scrolling to the Revise Tone section - ✅ probably needs more testing
I tested it only on testwiki wmf.7 - the structure of the articles there is often different from that of actual wiki articles.

(4) Revise tone on testwiki has placed a suggestion in the reference sections - ✅ addressed in a separate task

https://test.wikipedia.org/wiki/Rushing_game_development

Screenshot 2025-11-26 at 9.45.51 AM.png (638×2 px, 374 KB)

(5) The feed of Revise Tone articles keeps the articles that were revised or marked as "the tone is ok" - ✅ addressed in a separate task

Regarding the instrumentation QA, as of today at 15h UTC the experiment is enabled in testwiki and the events can be inspected through the browser network tab or using the EventStreams app for the ones produced in the backend.


Thank you for this update!