
Measure the difference between the intended and actual execution of delayed notification jobs
Closed, Resolved · Public · 3 Estimated Story Points

Description

Background

GrowthExperiments currently sends a Getting Started notification 48 hours after a user account is registered, provided the user has not yet made enough(*) edits. This delay is implemented using a delayed job: notificationGettingStartedJob. The same approach is used for notificationKeepGoingJob.

Problem

As the Growth-Team, we have limited visibility into the actual execution time of those jobs. We do not know whether they are executed on time, too late, or never.

Checklist
  • track the time delay between the job being intended to run and it actually running
  • create a Grafana panel that shows that data
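
The first checklist item boils down to subtracting the intended run time from the actual run time. A minimal sketch of that calculation follows, in Python for illustration only; the function name and the timestamps are hypothetical, and this is not the GrowthExperiments implementation:

```python
from datetime import datetime, timedelta, timezone


def execution_delay_seconds(intended_run_at: datetime, actual_run_at: datetime) -> float:
    """Return how many seconds after its intended time a job actually ran.

    A positive value means the job ran late; a negative value means it ran early.
    """
    return (actual_run_at - intended_run_at).total_seconds()


# Hypothetical example: an account registered at noon UTC, with the
# notification intended 48 hours later and the job actually running
# five minutes after that.
registered_at = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
intended_run_at = registered_at + timedelta(hours=48)
actual_run_at = intended_run_at + timedelta(minutes=5)

delay = execution_delay_seconds(intended_run_at, actual_run_at)
```

A job that never runs produces no data point at all, so a delay metric alone cannot reveal the "never" case; that would need a separate comparison of scheduled versus executed jobs.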

Event Timeline

Restricted Application added a subscriber: Aklapper.
Urbanecm_WMF moved this task from Inbox to Up Next (estimated tasks) on the Growth-Team board.
KStoller-WMF set the point value for this task to 3. · Jun 9 2025, 4:07 PM

Change #1158487 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] feat(LevelingUp): Measure the delay between actual and intended notification timestamp

https://gerrit.wikimedia.org/r/1158487

Once the metric data starts flowing in, we will need to create the panel itself.

Change #1158487 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] feat(LevelingUp): Measure the delay between actual and intended notification timestamp

https://gerrit.wikimedia.org/r/1158487

Change #1159927 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.45.0-wmf.5] feat(LevelingUp): Measure the delay between actual and intended notification timestamp

https://gerrit.wikimedia.org/r/1159927

Change #1159927 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.45.0-wmf.5] feat(LevelingUp): Measure the delay between actual and intended notification timestamp

https://gerrit.wikimedia.org/r/1159927

Mentioned in SAL (#wikimedia-operations) [2025-06-17T07:54:30Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-17T07:56:54Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-17T08:04:53Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1159927|feat(LevelingUp): Measure the delay between actual and intended notification timestamp (T395260)]] (duration: 10m 22s)

The deployment in this task caused a production error: T397135: Error: Call to undefined method Wikimedia\Stats\Metrics\GaugeMetric::observe(). There is also a follow-up task to avoid running into similar issues in the future: T397155: CI should prevent attempts to call GaugeMetric::observe().

Created an initial Grafana dashboard with visualisation of this new data: https://grafana.wikimedia.org/goto/Wz56zGPNR?orgId=1. With this, I think this can go to QA. I also asked @Michael for feedback on the dashboard itself, if he has any.