
Research Spike: Get Started Notification
Closed, Resolved · Public · 1 Estimated Story Points

Description

User story & summary:

As a newcomer, I want to receive a well-timed and engaging notification that guides me toward my first edit, so I can quickly understand what to do next and feel encouraged to contribute. By receiving clear guidance at the right moment, I will be more likely to take my first editing step and continue participating in Wikipedia.

Task specific user story:

  • As the Growth team, I want to understand how to best test different delivery times for the "Get Started Notification", because I want to see the impact of sending a notification earlier.
Background & research:

Hypothesis: If new accounts that have not yet edited receive a supportive notification* with a Suggested Edit recommendation within 24 hours of creating an account, then they will be more likely to activate constructively.
*An Echo notification and an email if the account has an associated email address.

Supporting Data & Insights:

  • The “Get Started” notification has already been shown to increase newcomer editing when sent 48 hours after account creation (1). This suggests that well-timed interventions can positively impact newcomer activation. By sending the notification earlier, we may further improve activation rates by reaching users while their interest is still high.
  • Prior studies show that positive reinforcement, such as the "Thanks" feature, leads to increased editor engagement (2). This suggests that notifications framed as encouragement rather than just instructions may yield better results.
Research Spike:

It might be complex to run the notificationGetStartedJob at 48 hours on non-pilot wikis, and at x hours on our pilot wikis.
Related code:
https://gerrit.wikimedia.org/g/operations/deployment-charts/+/803477de656a10327ac31f3197878487bae86a34/helmfile.d/services/changeprop-jobqueue/values.yaml#68

How should we handle this?

  1. A second Get Started job for the pilot wiki experiment group?
  2. Change the existing job for all wikis during this experiment phase?
  3. Something else?
Acceptance Criteria:
  • Research options, consider pros and cons, discuss with Growth engineers, and post a recommendation in this task.

Event Timeline

KStoller-WMF triaged this task as High priority.
KStoller-WMF moved this task from Inbox to Up Next (estimated tasks) on the Growth-Team board.

Thanks for the research spike, @KStoller-WMF. The most important question here is the purpose of reenqueue_delay in changeprop's config. Jobs are submitted in the following way (docs):

  1. MediaWiki sends an event to an appropriate Kafka topic via EventBus
  2. Change-Propagation (ChangeProp) sees the event and transforms it into a POST request to https://mw-jobrunner.discovery.wmnet:4448/rpc/RunSingleJob.php (source)
  3. Within that request (which has a much higher timeout), a MediaWiki jobrunner processes the job
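
For orientation, such a job event looks roughly like this (delay_until, discussed below, is the field that matters here; the other field names are illustrative rather than the exact mediawiki/job event schema):

// Rough shape of a delayed job event as produced by EventBus.
// Apart from delay_until, field names are illustrative, not the
// exact mediawiki/job event schema.
$event = [
	'type' => 'notificationGetStartedJob',
	'database' => 'testwiki',          // wiki the job belongs to
	'params' => [ 'userId' => 12345 ],
	'delay_until' => 1747137600,       // desired execution time (exact format illustrative)
];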

Within the job event, delay_until may be set. If that is defined, ChangeProp waits for that timestamp before triggering the POST request, effectively delaying the job. As far as I can see, ChangeProp supports two waiting mechanisms if the delay_until timestamp is in the future (source code):

  1. Wait for reenqueue_delay seconds (optionally defined on a per-job basis, defaults to 20 seconds), then put the event back into the queue of events ChangeProp is picking from, eventually getting back to it
  2. Wait for the delay_until timestamp directly, then send the POST to MediaWiki to actually execute the job

ChangeProp decides whether to use (1) or (2) by checking how many seconds of waiting remain. If more than reenqueue_delay seconds remain, it uses (1); otherwise, it uses (2).
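
To make the two mechanisms concrete, here is a minimal sketch of the per-slot logic (PHP-flavoured pseudocode; ChangeProp itself is Node.js, and requeue()/runSingleJob() are hypothetical helpers standing in for the real internals):

// Sketch of the per-slot wait logic described above.
function handleDelayedJob( array $job, int $reenqueueDelay ): void {
	$remaining = $job['delay_until'] - time();
	if ( $remaining > $reenqueueDelay ) {
		// Mechanism (1): hold the slot for reenqueue_delay seconds,
		// then put the event back so jobs behind it can be inspected.
		sleep( $reenqueueDelay );
		requeue( $job ); // hypothetical: back to the end of the topic
	} else {
		// Mechanism (2): wait out the remaining delay, then execute.
		sleep( max( 0, $remaining ) );
		runSingleJob( $job ); // hypothetical: POST to RunSingleJob.php
	}
}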

The confusing part is how several of the delayed jobs are configured in values.yaml. The reenqueue_delay value is set (purposefully?) to a higher value than the job's usual delay, making it impossible for the reenqueueing feature to kick in. For example, notificationGetStartedJob is scheduled so that delay_until is 48 hours _after_ the job was fired; the reenqueue_delay value for that job is set to 72 hours.

This seems to be the intention. For example, the fetchGoogleCloudVisionAnnotations job (delayed 48 hours, reenqueue_delay set to 72 hours) has the following comment in the configuration:

# All the jobs of this kind are delayed exactly 48 hours, so we don't want
# the reenqueue feature to kick in.

The question is: why do we want to disable the reenqueueing feature? Why do we even have two waiting mechanisms in ChangeProp, and when is it a good idea to use one versus the other? One hypothesis that occurred to me is that (2) is more precise (as it waits for the exact number of seconds), while (1) might be affected by the health of the job queue (if too many jobs are waiting to be processed, the actual waiting time might be longer than needed, delaying the job more than intended). It also seems that (1) is more expensive on ChangeProp's side (as it occupies a slot while waiting).

Since Growth-Team's intention is to decrease the waiting period for notificationGetStartedJob to a couple of hours (rather than 48 hours), and we're seemingly OK waiting 48 hours without reenqueueing, this probably doesn't block us. However, I would like to understand the reasoning better, so that we do not accidentally cause any issues.

Based on git blame, it seems that fetchGoogleCloudVisionAnnotations was the first delayed job to do this, with the other jobs following its example. @hnowlan @kostajh As authors of the relevant changes to changeprop config, do you have any idea why reenqueue_delay is set the way it is? Is there something I'm missing in my description above?

KStoller-WMF set the point value for this task to 1. · May 13 2025, 12:27 PM
KStoller-WMF edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

I looked into this from a different angle and am confused about a different thing. From reading the code, it seems we are sending both a "Get Started" notification and a "Keep Going" notification 48 hours after registration (both after 5 minutes on beta). Is this actually the case? Is it intentional? Looking further, each of the two jobs decides whether its respective notification should be sent, so we are not sending two at the same time.

HomepageHooks.php
$this->jobQueueGroup->lazyPush(
	new JobSpecification( NotificationKeepGoingJob::JOB_NAME, [
		'userId' => $user->getId(),
		// Process the job X seconds after account creation (default: 48 hours)
		'jobReleaseTimestamp' => (int)wfTimestamp() +
			$this->config->get( 'GELevelingUpKeepGoingNotificationSendAfterSeconds' )
	] )
);
$this->jobQueueGroup->lazyPush(
	new JobSpecification( NotificationGetStartedJob::JOB_NAME, [
		'userId' => $user->getId(),
		// Process the job X seconds after account creation (configured in extension.json)
		'jobReleaseTimestamp' => (int)wfTimestamp() +
			$this->config->get( 'GELevelingUpGetStartedNotificationSendAfterSeconds' )
	] )
);

Glancing at the above code, it would seem straightforward to add a different (smaller) jobReleaseTimestamp for the notification based on a user variant, unless I'm missing something?

@Michael The jobReleaseTimestamp can indeed be changed very simply (by changing GELevelingUpGetStartedNotificationSendAfterSeconds on the pilot wikis). The question is whether that can be done safely, and whether reenqueue_delay being set to 72 hours plays a role here.
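
For illustration, such a per-wiki override could look like this in wmf-config (a sketch in InitialiseSettings.php style; the wiki name and the five-hour value are placeholders):

// Hypothetical per-wiki override; wiki name and values are placeholders.
'wgGELevelingUpGetStartedNotificationSendAfterSeconds' => [
	'default' => 48 * 3600,   // status quo: 48 hours
	'pilotwiki' => 5 * 3600,  // pilot wikis: e.g. five hours
],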


Summary of the goal: Currently, notificationGetStartedJob is scheduled 48 hours after user signup. Growth-Team's goal here is to run notificationGetStartedJob on a few pilot wikis much sooner (e.g. five hours after registration), while leaving all other wikis to execute their notificationGetStartedJob at the 48-hour mark (status quo). This is done to verify the impact of such a change on users (via instrumentation). If the experiment is successful, we might decide to decrease the delay on all wikis in the future.


My understanding in the rest of this comment is largely based on the following comment:

In T241072#5806588, Pchelolo wrote:

The kafka job queue delayed jobs work in the following way: The message is read from the queue. If it's a delayed message, it's held in memory up to reenqueue_delay seconds. If the reenqueue_delay limit was reached, the message is put back in the queue and a new message is fetched. This is done mostly to support queues with variable delays - we don't want a set of messages with a huge delay block the queue for possibly next messages with a small delay. While the message is held in memory, one concurrency slot of the queue is being blocked. So, up to concurrency messages can be held in the queue simultaneously.

as well as on the rest of the conversation on that task.


Based on my research, I'm now convinced that the answer is no: decreasing the delay on a per-wiki basis cannot be done safely with the current change-prop configuration. As far as I can see, change-prop currently assumes all instances of notificationGetStartedJob are delayed by the same amount of time.

With the reenqueuing mechanism disabled (by setting reenqueue_delay to a very high value), the jobs are dispatched in the following way:

  1. ChangeProp reads 10 notificationGetStartedJob instances from the queue (as it has 10 execution slots for notificationGetStartedJob, cf. concurrency in values.yaml).
  2. Each job is kept in memory, waiting for its jobReleaseTimestamp to be reached (48 hours).
  3. ChangeProp does not load any new jobs from the queue, because all its execution slots are occupied (by waiting).
  4. Once the release timestamp is reached, the job executes. This frees an execution slot and makes ChangeProp read another job from the queue, returning to step (1).

This works because the job release timestamp is set in the same way on all wikis. If we start producing a small portion of notificationGetStartedJob instances with a much shorter delay, those jobs would likely get stuck in the queue (at step 3): all execution slots would be occupied by the (much more frequent) jobs waiting 48 hours before executing, leaving no slots for the (infrequent) jobs that wait only 5 hours (as an example).

Essentially, at step (3) ChangeProp needs to check whether there is a job with a shorter delay that could be processed earlier. This is why the reenqueueing mechanism exists: whenever reenqueue_delay is reached, ChangeProp puts the job at the end of the queue, which frees the execution slot without executing anything and allows ChangeProp to fetch another job instance from the queue (if it has a smaller delay, it is processed earlier). In other words: with reenqueueing enabled, ChangeProp periodically checks whether there is something else in the queue that could execute earlier than the jobs it is currently looking at.

I presume the reason reenqueueing is disabled for this job is that it is not necessary (given the delays are not variable) and it results in a little extra traffic. Since reenqueueing is necessary to support variable delays, we need to set reenqueue_delay to something shorter. Since the shorter delay will be a couple of hours, I propose setting reenqueue_delay to 30 minutes. That should give ChangeProp enough chances to spot jobs with a shorter delay, while not reenqueueing too often (as would happen with the default delay of 20 seconds).
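
Concretely, that would mean something like this in the job's values.yaml stanza (a sketch; the surrounding keys approximate the existing layout and may not match the file exactly):

# Sketch only; surrounding keys approximate the existing values.yaml layout.
notificationGetStartedJob:
  concurrency: 10
  # 30 minutes: short enough to notice queued jobs with a ~5 h delay,
  # long enough to avoid constant reenqueueing (default is 20 s).
  reenqueue_delay: 1800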

@Joe @hnowlan, would one of you be able to verify my understanding? Am I missing something that might be useful to know to be able to meet Growth's goal?

This is essentially Blocked from a Growth perspective; I'm waiting for someone more experienced with job infrastructure to confirm my understanding before taking action on it.

I've been looking at this today, and the unfortunate news about the historical use of reenqueue_delay is that it has been cargo-culted forward for various jobs since well before my time at the foundation. Your assessment of how changing this behaviour might impact things seems accurate based on my reading of the source, but unfortunately I can't give a more authoritative judgement on what will happen. However, my main bit of advice would be to experiment on the jobqueue in beta, which should behave in a similar fashion with the same rules. If you need help modifying the configuration or anything similar, please let me know.

Mh. We should probably add some tracking that records the difference between the intended time for these notifications to be delivered and the actual time that they were sent out, so that we spot issues as they arise. (Bigger picture: Why can't that queue be sorted by delivery time?)

Another idea to at least partially mitigate this issue: We could add another job, for example EarlyNotificationGetStartedJob, and based on the user's variant either schedule that or the classic 48h one.
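
A sketch of how that could look in HomepageHooks.php (the early job class, the config key, and the variant check are hypothetical, not existing code):

// Hypothetical: choose the job (and thus the delay) by user variant.
$early = $this->isUserInEarlyVariant( $user ); // hypothetical helper
$this->jobQueueGroup->lazyPush(
	new JobSpecification(
		$early ? EarlyNotificationGetStartedJob::JOB_NAME
			: NotificationGetStartedJob::JOB_NAME,
		[
			'userId' => $user->getId(),
			'jobReleaseTimestamp' => (int)wfTimestamp() + $this->config->get(
				$early
					? 'GEEarlyGetStartedNotificationSendAfterSeconds' // hypothetical
					: 'GELevelingUpGetStartedNotificationSendAfterSeconds'
			),
		]
	)
);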

Mh. We should probably add some tracking that records the difference between the intended time for these notifications to be delivered and the actual time that they were sent out, so that we spot issues as they arise.

Good idea!

(Bigger picture: Why can't that queue be sorted by delivery time?)

Assuming "delivery time" means "desired execution time": Because that is not something message brokers support (unless I am missing something). Just like TCP guarantees packets arrive in the order they were sent in, Kafka guarantees messages (=jobs in this context) are produced in the order they are enqueued in (within one partition).

Another idea to at least partially mitigate this issue: We could add another job, for example EarlyNotificationGetStartedJob, and based on the user's variant either schedule that or the classic 48h one.

Personally, I'd prefer getting a reasonable reenqueue_delay (as that is the mechanism that is supposed to make this possible) and using a single job. But yes, if we cannot make that work, we might want to go this route too.

Based on the advice here so far (and considering the lack of subject matter experts), my suggestion is to decrease the reenqueue_delay as I proposed in T393955#10821936 and make the change. Since we are uncertain how the system will respond, this should be considered a risky change, and we should sync with @Etonkovidova before making it. The expected risks are (ordered by impact, smallest to highest):

  1. the getting started notification is being delivered, but the intended delay is not followed "precisely enough" (definition of "precisely enough" TBC by @KStoller-WMF)
  2. the getting started notification is being delivered, but only for some wikis (distinguishing pilot wikis from everything else should be a good enough determination)
  3. the getting started notification is not being delivered at all
  4. jobs stop getting executed altogether

If we run into issues, we can also split the jobs into two. However, I'd like to avoid having to do that unless proven necessary.

With that, I think this research spike can be considered resolved.

  1. the getting started notification is being delivered, but the intended delay is not followed "precisely enough" (definition of "precisely enough" TBC by @KStoller-WMF)

IMO "precisely enough" is less than an hour.
If delays are over an hour, then we may want to reconsider the send time.