Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment
Closed, Resolved · Public · 5 Estimated Story Points

Description

Background

GrowthExperiments currently sends a Getting Started notification 48 hours after a user account is registered, provided the user has not yet made enough(*) edits. This delay is implemented using a delayed job. Since T394957, the delay of the job can be configured per wiki and user variant, which makes it possible to run an experiment.

There are several configuration options that need to be changed to make this possible:

  • wgGELevelingUpGetStartedNotificationSendAfterSeconds, to request the job to be executed with an appropriate jobReleaseTimestamp,
  • reenqueue_delay in changeprop, to ensure the jobs are actually executed at a time reasonably matching the requested jobReleaseTimestamp.

Because of uncertainties regarding changeprop's behaviour (which were not sufficiently clarified during the T393955 research spike), the Growth engineers are unsure about the appropriate value of reenqueue_delay. This makes the change riskier than a normal deployment.

Checklist
Acceptance Criteria
  • No change is observed outside of the pilot wikis (eswiki, arwiki)
  • On the pilot wikis, 50% of new users are assigned to the control variant and 50% to the get-started-notification variant.
  • Users in the get-started-notification variant receive the Getting started notification between 19 and 21 hours after their registration (desired delay is 20 hours, with an error margin of 1 hour on both sides).
  • Users in the control variant receive the Getting started notification 48 hours after their registration.
  • The conditions for the notification to be fired remain unchanged (no suggested edits + fewer than GELevelingUpGetStartedMaxTotalEdits edits).
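
The timing criteria above can be expressed as a small check, which may be handy when comparing registration and notification timestamps during QA. This is only a sketch: the function name is hypothetical, the variant names are taken from the criteria above, and the control criterion is interpreted here as "at least 48 hours", since no upper bound is stated.

```python
from datetime import datetime, timedelta, timezone

# Variant names as written in the acceptance criteria.
TREATMENT = "get-started-notification"
CONTROL = "control"


def delivery_within_criteria(variant, registered_at, notified_at):
    """Return True if the observed delay satisfies the acceptance criteria.

    Hypothetical helper, not part of GrowthExperiments.
    """
    delay = notified_at - registered_at
    if variant == TREATMENT:
        # Desired delay is 20 hours, with a 1 hour margin on both sides.
        return timedelta(hours=19) <= delay <= timedelta(hours=21)
    if variant == CONTROL:
        # Assumption: "after 48 hours" means at least 48 hours.
        return delay >= timedelta(hours=48)
    raise ValueError(f"unknown variant: {variant}")


registered = datetime(2025, 6, 24, 12, 0, tzinfo=timezone.utc)
print(delivery_within_criteria(TREATMENT, registered, registered + timedelta(hours=20)))
```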
Possible risks

The risks of this change are confined to the job queue. The following (undesired) behaviours have a particularly significant chance of happening (ordered by impact from lowest to highest):

  1. the getting started notification is delivered, but not at the right time (see A/C for details on when the notification is supposed to arrive)
  2. the getting started notification is delivered, but only on some wikis (it should be sufficient to test this at both pilot wikis and an arbitrary non-pilot wiki)
  3. the getting started notification is not being delivered at all
  4. jobs stop getting executed reliably

Event Timeline

Restricted Application added subscribers: hubaishan, Aklapper.

Blocked on T394957.

@KStoller-WMF, I wrote this task and the A/Cs based on our discussion during backlog refinement. Please feel free to double check and rewrite as needed!

@Etonkovidova, once we get to this, it will probably be more demanding on QA than regular Growth tasks. I tried to cover the risks I can foresee. Note that the behaviour might depend on the volume of notifications, so it might make sense to test this at various times of the day.

Michael moved this task from Blocked to Up Next (estimated tasks) on the Growth-Team board.

While this is indeed waiting on T394957: Support delaying NotificationGetStartedJob differently based on user variant being completed, ideally we want to get it done early in the upcoming sprint anyway.

Some things are not even blocked yet:

  • shorten the reenqueue_delay for notificationGetStartedJob to 30 minutes -> this has to happen globally for this job anyway, so we might as well learn early whether shortening it is a problem by trying it out
  • the config change can already be prepared (but not merged until T394957 is ready)
  • Adding the code to track the difference between when the delivery was intended and when it actually happened. Having some baseline data for this would not be bad either.

Change #1150699 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] changeprop: Decrease reenqueue_delay for Getting Started notif job

https://gerrit.wikimedia.org/r/1150699

Agreed. I went ahead and uploaded the changeprop config change, review appreciated.

  • Adding the code to track the difference between when the delivery was intended and when it actually happened. Having some baseline data for this would not be bad either.

IMO, this is a separate thing, not depending on the reenqueue_delay change or even the Getting started notification project. I boldly split that to a separate task: T395260.

Urbanecm_WMF set the point value for this task to 5. May 26 2025, 4:19 PM

Thank you, @Urbanecm_WMF! All test scenarios (listed as Possible risks) are in the scope of testing. Via db I can reliably see when an edit happens and when a notification was sent.

Regarding the last risk scenario:

jobs stop getting executed reliably

This Grafana board (https://grafana.wikimedia.org/d/onyD7cOMk/echo-extension-notification-baseline-track?orgId=1&from=now-2d&to=now&timezone=utc) may be useful for keeping an eye on overall Echo performance. I also looked into the Logstash dashboards, but I am not sure there are any specific to the Echo service. Some regression testing of notification delivery should be done on testwiki.

KStoller-WMF lowered the priority of this task from High to Medium. May 27 2025, 4:49 PM

Change #1159465 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] [Growth] Prepare for the Get Started notification experiment

https://gerrit.wikimedia.org/r/1159465

Change #1150699 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Decrease reenqueue_delay for Getting Started notif job

https://gerrit.wikimedia.org/r/1150699

Mentioned in SAL (#wikimedia-releng) [2025-06-19T09:10:28Z] <urbanecm> deployment-prep: Update changeprop config per https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1150699 using [[wikitech:Changeprop#To_deployment-prep]] (T394958)

Change #1161443 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] changeprop beta: Decrease reenqueue_delay for Getting Started notif job

https://gerrit.wikimedia.org/r/1161443

Change #1161443 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop beta: Decrease reenqueue_delay for Getting Started notif job

https://gerrit.wikimedia.org/r/1161443

Mentioned in SAL (#wikimedia-releng) [2025-06-19T09:18:53Z] <urbanecm> deployment-prep: Update changeprop config per https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1161443 using [[wikitech:Changeprop#To_deployment-prep]] (T394958; this time actually changing the beta config)

I went ahead and deployed the changeprop change to production:

SAL log
10:58 <+logmsgbot> !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
10:59 <+logmsgbot> !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
10:59 <+logmsgbot> !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
11:01 <+logmsgbot> !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
11:01 <+logmsgbot> !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
11:02 <+logmsgbot> !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply

I discovered beta has a separate values file, so I changed the delay there as well.

Change #1159465 merged by jenkins-bot:

[operations/mediawiki-config@master] [Growth] Prepare for the Get Started notification experiment

https://gerrit.wikimedia.org/r/1159465

Mentioned in SAL (#wikimedia-operations) [2025-06-23T16:04:23Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-23T16:06:19Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-23T16:14:59Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1159465|[Growth] Prepare for the Get Started notification experiment (T394958)]] (duration: 10m 36s)

Change #1163005 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] Revert "[Growth] Prepare for the Get Started notification experiment"

https://gerrit.wikimedia.org/r/1163005

Change #1163005 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "[Growth] Prepare for the Get Started notification experiment"

https://gerrit.wikimedia.org/r/1163005

Mentioned in SAL (#wikimedia-operations) [2025-06-23T18:30:55Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1163005|Revert "[Growth] Prepare for the Get Started notification experiment" (T394958)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-23T18:40:21Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1163005|Revert "[Growth] Prepare for the Get Started notification experiment" (T394958)]] (duration: 09m 25s)

Change #1163022 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] Revert^2 "[Growth] Prepare for the Get Started notification experiment"

https://gerrit.wikimedia.org/r/1163022

Change #1163022 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert^2 "[Growth] Prepare for the Get Started notification experiment"

https://gerrit.wikimedia.org/r/1163022

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:23:19Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:25:26Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:34:58Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1163022|Revert^2 "[Growth] Prepare for the Get Started notification experiment" (T394958)]] (duration: 11m 39s)

Change #1163292 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] [Growth] testwiki: Enable the get-started-experiment

https://gerrit.wikimedia.org/r/1163292

Change #1163292 merged by jenkins-bot:

[operations/mediawiki-config@master] [Growth] testwiki: Enable the get-started-experiment

https://gerrit.wikimedia.org/r/1163292

Mentioned in SAL (#wikimedia-operations) [2025-06-24T12:24:32Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-24T12:27:28Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Experiment deployed at test.wikipedia.org. I briefly tested while at mwdebug by:

  1. Listening to the job events by running kafkacat -b kafka-jumbo1007.eqiad.wmnet:9092 -o -1 -t "eqiad.mediawiki.job.notificationGetStartedJob" | jq 'select(.meta.domain == "test.wikipedia.org")' on a statbox
  2. Creating a test account at test.wikipedia, called MU test 202506241434
  3. Using ge.utils.getUserVariant() in my console to see what variant it was assigned
  4. Waiting for the job event to appear
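
For anyone who prefers to post-process the event stream in a script rather than with jq, the filter from step 1 can be sketched in Python. This assumes the consumer prints one JSON event per line; the function name select_domain is hypothetical.

```python
import json


def select_domain(lines, domain):
    """Yield parsed job events whose meta.domain equals `domain`.

    Mirrors jq 'select(.meta.domain == "...")' for newline-delimited JSON.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if event.get("meta", {}).get("domain") == domain:
            yield event


# Two made-up sample events, only the first should pass the filter.
sample = [
    '{"meta": {"domain": "test.wikipedia.org"}, "type": "notificationGetStartedJob"}',
    '{"meta": {"domain": "es.wikipedia.org"}, "type": "notificationGetStartedJob"}',
]
for event in select_domain(sample, "test.wikipedia.org"):
    print(event["type"], event["meta"]["domain"])
```

In practice the lines argument could be sys.stdin, with the consumer's output piped into the script.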

The scheduled job is as follows:

{
  "$schema": "/mediawiki/job/1.0.0",
  "meta": {
    "uri": "https://placeholder.invalid/wiki/Special:Badtitle",
    "request_id": "ce1ed8ec-80e2-9503-bd99-fbfce632cf54",
    "id": "736c850f-95cf-4464-822e-067776f42f07",
    "dt": "2025-06-24T12:34:52Z",
    "domain": "test.wikipedia.org",
    "stream": "mediawiki.job.notificationGetStartedJob"
  },
  "database": "testwiki",
  "type": "notificationGetStartedJob",
  "delay_until": "2025-06-25T08:34:51Z",
  "params": {
    "userId": 69257,
    "jobReleaseTimestamp": 1750840491,
    "requestId": "ce1ed8ec-80e2-9503-bd99-fbfce632cf54"
  },
  "mediawiki_signature": "redacted"
}

The requested release time (delay_until) is indeed ~20 hours after the account creation.
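
The arithmetic can be double-checked from the event itself. The sketch below uses only the values shown in the job event above and treats meta.dt as the enqueue time, which is an assumption (the account was created shortly before that).

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a timestamp such as 2025-06-24T12:34:52Z as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


enqueued = parse_utc("2025-06-24T12:34:52Z")  # meta.dt
release = parse_utc("2025-06-25T08:34:51Z")   # delay_until

delay = release - enqueued
print(delay)  # 19:59:59

# delay_until and params.jobReleaseTimestamp describe the same moment.
print(int(release.timestamp()) == 1750840491)  # True
```

The delay comes out one second short of 20 hours, which is well inside the 19 to 21 hour window from the acceptance criteria.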

I think all the A/Cs are fulfilled here. This is now ready for QA. Over to you, @Etonkovidova!

Mentioned in SAL (#wikimedia-operations) [2025-06-24T12:42:51Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]] (duration: 18m 18s)