Track impressions, success and abandonment rate on the signup form
Open, In Progress, Medium · Public · 4 Estimated Story Points

Description

Following https://phabricator.wikimedia.org/T300273, ensure we can track impressions, success, and abandonment rate on Special:CreateAccount.

Event Timeline

Cyndymediawiksim renamed this task from Instrumentation to Track impressions, success and abandonment rate on the signup form. (Sep 14 2023, 12:42 PM)
Urbanecm_WMF changed the task status from Open to In Progress. (Oct 4 2023, 5:43 PM)
Urbanecm_WMF triaged this task as Medium priority.
Urbanecm_WMF awarded a token.

Hi @nettrom_WMF, there's a patch uploaded for the event schemas by Cyndy: https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/962569. I'm not sure if you have any thoughts on the schema, but if you do, feel free to put them on the patch or the task.

Morten reviewed the schema patch; it needs improvements, so moving back to Doing.

Change 962569 had a related patch set uploaded (by Urbanecm; author: Cyndywikime):

[schemas/event/secondary@master] Add analytics for Impressions, Success and Abandonment rate for temporary Users

https://gerrit.wikimedia.org/r/962569

hi! As part of this work, will you also track these things?

  • abandonment due to failure to solve the CAPTCHA
  • if a user is blocked from creating an account, and if so what type of block

cc @Tchanders

Hello!

  • abandonment due to failure to solve the CAPTCHA

I feel like the reason why a user navigated away would be pretty much impossible to guess (I could see a CAPTCHA, say "this is too hard to read" and leave without even trying). Would logging whether there was any CAPTCHA-related error in the form be sufficient for you?

  • if a user is blocked from creating an account, and if so what type of block

I think this is already instrumented as analytics/mediawiki/accountcreation/block (from T306018: Instrument blocked account registration, see schema specification), but I might be missing something. Is there some information that's important for you, but not included in this schema?

If you want to measure captcha fallout, I'd include a variant field, and then you can A/B test against more readable captcha, non-latin captcha etc. If you do the logging on the client side, you could also include how many times the user requested a new captcha; on the server side it's more tricky to track that.

Probably should have some kind of "where did the user come from?" field as well. (Signup link in the user menu, signup link on the login form are the two very common ones but there are a number of other ways.)

Probably should have at least a flag to differentiate temp user signing up for permanent account vs. anon signing up for permanent account vs. account created by a logged-in user.

Bot noise is a big problem with authentication metrics so probably worth including at least the IP and user agent, and whether the API or the web form was used.

In theory there is the question of how to log multi-step signup but I don't think any Wikimedia wikis use that today.
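A minimal sketch of how the fields suggested in this comment could sit together in one event record. Every field name and value here is hypothetical and is not taken from the schema patch; it only illustrates the shape of the suggestion.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical set of values for the "who is signing up" flag.
PERFORMER_KINDS = {"anon", "temp", "registered"}


@dataclass
class SignupEvent:
    """Illustrative event record; all field names are assumptions."""
    event_type: str                        # e.g. "impression", "success", "failure"
    performer_kind: str                    # anon vs. temp user vs. logged-in user
    source: Optional[str] = None           # where the user came from, e.g. "user_menu"
    captcha_variant: Optional[str] = None  # which CAPTCHA variant was shown (for A/B tests)
    captcha_reloads: int = 0               # client-side count of "new CAPTCHA" requests
    via_api: bool = False                  # API vs. web form
    ip: Optional[str] = None               # to help filter bot noise
    user_agent: Optional[str] = None       # to help filter bot noise

    def __post_init__(self):
        # Reject values outside the hypothetical enum and negative counters.
        if self.performer_kind not in PERFORMER_KINDS:
            raise ValueError(f"unknown performer_kind: {self.performer_kind}")
        if self.captcha_reloads < 0:
            raise ValueError("captcha_reloads must not be negative")


event = SignupEvent(
    event_type="impression",
    performer_kind="temp",
    source="user_menu",
    captcha_variant="control",
)
print(asdict(event))
```

Keeping the performer flag as a small closed set rather than free text would make it easier to split the funnel by user type later.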

  • abandonment due to failure to solve the CAPTCHA

I feel like the reason why a user navigated away would be pretty much impossible to guess (I could see a CAPTCHA, say "this is too hard to read" and leave without even trying). Would logging whether there was any CAPTCHA-related error in the form be sufficient for you?

I think it would be useful to know if the user attempted to solve the CAPTCHA, failed, and then either abandoned or eventually succeeded. (Bonus if we recorded how many CAPTCHA attempts were involved.)
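As a rough sketch of that request: given the ordered event types logged for one signup session, the outcome and the number of failed CAPTCHA attempts can be derived afterwards. The event type names ("impression", "captcha_failure", "success") are assumptions for illustration, not names from the schema.

```python
from typing import Iterable, Tuple


def captcha_outcome(events: Iterable[str]) -> Tuple[str, int]:
    """Return (outcome, failed_captcha_attempts) for one signup session.

    `events` is the ordered sequence of hypothetical event types
    recorded for the session.
    """
    failures = 0
    succeeded = False
    for event_type in events:
        if event_type == "captcha_failure":
            failures += 1
        elif event_type == "success":
            succeeded = True
            break  # anything after a success is not part of this attempt

    if succeeded:
        outcome = "succeeded_after_captcha_failure" if failures else "succeeded"
    else:
        outcome = "abandoned_after_captcha_failure" if failures else "abandoned"
    return outcome, failures


print(captcha_outcome(["impression", "captcha_failure", "captcha_failure", "success"]))
print(captcha_outcome(["impression", "captcha_failure"]))
```

This only works if the failure events can be tied to a single session or form token, which is itself an assumption about what the instrumentation records.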

  • if a user is blocked from creating an account, and if so what type of block

I think this is already instrumented as analytics/mediawiki/accountcreation/block (from T306018: Instrument blocked account registration, see schema specification), but I might be missing something. Is there some information that's important for you, but not included in this schema?

Ah, that looks good. We'll have the user ID of the temporary account in those events. We might need to update fragment/mediawiki/common/ to include a user_is_temp property, though.

We haven't yet figured out the best way to instrument impressions, success, and failure rate for the create account form. In particular, we are looking for an approach that avoids instrumentation calls from core.

The existing hooks onUserLoginComplete, onLocalUserCreated, and onAuthChangeFormFields each have quirks that make them unsuitable. For instance, onUserLoginComplete would be called after account creation AND after logging the user in. We also thought of a dedicated onAccountCreationAudit hook, but there are already quite a few hooks in the authentication workflows, so we are hesitant to add yet another one.

Alternatively, a secondary or pre-authentication provider (or both) could be used to make the tracking calls, similar to the approach taken by the Campaigns extension in ServerSideAccountCreation; see source.

@Tgr, since you've worked with authentication flows, do you have any suggestion on which approach better fits the given use case? Thank you!

You can either create a preauthentication provider and use its postAccountCreation callback (for login we have AuthManagerLoginAuthenticateAudit for the hook equivalent, for account creation we don't have one; we could make one), or use the authevents log channel like WikimediaEvents\AuthManagerStatsdHandler. The first is probably cleaner.

Thanks for the insights, we're gonna explore the preauthentication provider path. However it remains unclear to me what would be a fair "impression" event of the form.

Urbanecm_WMF set the point value for this task to 4. (Nov 28 2023, 11:56 AM)

If you want to log on the server side, use a rendering-related hook like BeforePageDisplay. If on the client side, add logging code in that hook or AuthChangeFormFields.

One limitation of authentication providers is that you won't see errors that happen outside the authentication layer (that is, form and session validation errors). For session validation, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972401/ might be relevant. For form validation, I think you'd have to do it on the client side or do a core change. (Those are rare though - we enforce few things via form validators, mostly required fields; and those tend to have matching HTML validators, so in most clients it's impossible to submit invalid input.)

Change 979997 had a related patch set uploaded (by Cyndywikime; author: Cyndywikime):

[mediawiki/extensions/WikimediaEvents@master] Track Impressions, Success and Failure on the Special:CreateAccount and Special:Login pages.

https://gerrit.wikimedia.org/r/979997

Change 984508 had a related patch set uploaded (by Cyndywikime; author: Cyndywikime):

[mediawiki/core@master] Add AccountCreationEventProvider and AccountCreationLogger

https://gerrit.wikimedia.org/r/984508

Change 984508 abandoned by Cyndywikime:

[mediawiki/core@master] Add AccountCreationEventProvider and AccountCreationLogger

Reason:

Put in the wrong place

https://gerrit.wikimedia.org/r/984508

@nettrom_WMF would you mind taking a second look at the analytics schema change 962569? I've added some questions. Mainly, are we instrumenting all account creations or just those from temporary users? And how should the auto-login after account creation be logged? Currently it would produce a "success"-Special:CreateAccount event which duplicates the account creation success. cc @Cyndymediawiksim

I'll respond here since I'm way more confident with phab than Gerrit, sorry. Let me try to take these in order:

  1. If it doesn't add a lot of complexity, I would prefer that we instrument all account creations. This is so we have a comparison when it comes to folks dropping off in this funnel. If we find that X% of temp users who start registration complete it, it helps if we know that Y% of non-temp users do (and preferably X% is higher than Y%). From what I can tell, that means we might want to rename the schema?
  2. I'm definitely in favour of not having duplicate events. Just so I understand the event generation here, we're instrumenting both account creation and login, but the auto-login doesn't trigger a login event through this instrumentation? It's instead that it creates a second account creation "success" event?
  3. About the page title and special pages comment I saw on Gerrit. If we have page_namespace = -1 with page_title = "CreateAccount" and page_title = "UserLogin" then I'm happy. If page_title is localized then we'll run into issues, which is why we might want to use special_page and make sure it uses the canonical English name.
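The X% versus Y% comparison in the first point can be sketched as follows. The input shape, pairs of a performer kind and an event type, is a hypothetical simplification of the real events.

```python
from collections import defaultdict


def completion_rates(events):
    """Return {performer_kind: successes / impressions}.

    `events` is an iterable of (performer_kind, event_type) pairs,
    a hypothetical simplification of the logged events.
    """
    impressions = defaultdict(int)
    successes = defaultdict(int)
    for kind, event_type in events:
        if event_type == "impression":
            impressions[kind] += 1
        elif event_type == "success":
            successes[kind] += 1
    # Only report kinds that have at least one impression.
    return {kind: successes[kind] / count for kind, count in impressions.items() if count}


sample = (
    [("temp", "impression")] * 4 + [("temp", "success")] * 2
    + [("anon", "impression")] * 5 + [("anon", "success")] * 1
)
print(completion_rates(sample))
```

Note that this counts events rather than sessions, so repeated form views by the same person would lower the apparent rate; a per-session identifier would be needed to correct for that.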

I'm also in the process of thinking about Growth-related IP Masking instrumentation in general (T341651), but I think these can be separate tasks as so far we're only looking at schema modifications rather than creating whole new instrumentation. Wanted to mention it in case it would come up while you're working on this.

  1. If it doesn't add a lot of complexity, I would prefer that we instrument all account creations. This is so we have a comparison when it comes to folks dropping off in this funnel. If we find that X% of temp users who start registration complete it, it helps if we know that Y% of non-temp users do (and preferably X% is higher than Y%). From what I can tell, that means we might want to rename the schema?

Schema can be renamed.

  2. I'm definitely in favour of not having duplicate events. Just so I understand the event generation here, we're instrumenting both account creation and login, but the auto-login doesn't trigger a login event through this instrumentation? It's instead that it creates a second account creation "success" event?

@nettrom_WMF, yes, the event is recorded twice. IMO, there may be no easy way to tackle this issue without, for example, creating other event types, e.g. "success"-Special:CreateAccount-AutoCreate, and keeping the "success"-Special:CreateAccount event for account creation. Or we could add a new event subtype property that specifies the type of account creation event. @Sgs, thoughts?

  1. If it doesn't add a lot of complexity, I would prefer that we instrument all account creations. This is so we have a comparison when it comes to folks dropping off in this funnel. If we find that X% of temp users who start registration complete it, it helps if we know that Y% of non-temp users do (and preferably X% is higher than Y%). From what I can tell, that means we might want to rename the schema?

Alright, the schema is renamed to accountcreation/account_conversion.

  2. I'm definitely in favour of not having duplicate events. Just so I understand the event generation here, we're instrumenting both account creation and login, but the auto-login doesn't trigger a login event through this instrumentation? It's instead that it creates a second account creation "success" event?

That's correct. Both events (account creation success and auto-login success) are (wrongly?) defined as:

"event_type": "success",
"performer": {
  "user_id": 63,
  "user_text": "Registered49"
},
"page_title": "CreateAccount",
"page_namespace": -1

But at this point I'm wondering if the tracking of the login (UserLogin page and auto-login) is part of this task, or just a misunderstanding of the T300273 description because of the login form screenshot and cosmetic requirements.
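To make the duplication concrete: built from the fields quoted above, the two payloads compare equal, so nothing downstream can tell them apart. The sketch also shows the subtype idea floated earlier in the thread; the `event_subtype` property and its values are hypothetical.

```python
def make_event(event_type, user_id, user_text, page_title, subtype=None):
    """Build an event with the fields quoted above.

    `subtype` maps to a hypothetical `event_subtype` property that is
    not part of the schema as discussed.
    """
    event = {
        "event_type": event_type,
        "performer": {"user_id": user_id, "user_text": user_text},
        "page_title": page_title,
        "page_namespace": -1,
    }
    if subtype is not None:
        event["event_subtype"] = subtype
    return event


# As currently defined, both events are identical.
creation = make_event("success", 63, "Registered49", "CreateAccount")
auto_login = make_event("success", 63, "Registered49", "CreateAccount")
print(creation == auto_login)

# With a hypothetical subtype, they become distinguishable.
creation_tagged = make_event("success", 63, "Registered49", "CreateAccount", "account_creation")
auto_login_tagged = make_event("success", 63, "Registered49", "CreateAccount", "auto_login")
print(creation_tagged == auto_login_tagged)
```

A subtype only helps if the code emitting the event knows which flow it is in, which is the open question for the login path.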

  3. About the page title and special pages comment I saw on Gerrit. If we have page_namespace = -1 with page_title = "CreateAccount" and page_title = "UserLogin" then I'm happy. If page_title is localized then we'll run into issues, which is why we might want to use special_page and make sure it uses the canonical English name.

Alright, we will be adding page_namespace and page_title.
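A small sketch of the canonical-name concern: whatever title the form was reached under, the event should carry the canonical English name with namespace -1. The alias table below is purely illustrative; MediaWiki resolves localized special page names itself, and these strings were not checked against any wiki's configuration.

```python
SPECIAL_NAMESPACE = -1

# Illustrative alias table only (keys are lowercased, underscores for spaces).
ALIASES = {
    "createaccount": "CreateAccount",
    "crear_una_cuenta": "CreateAccount",
    "userlogin": "UserLogin",
    "anmelden": "UserLogin",
}


def canonical_page_fields(title):
    """Map a possibly localized special page title to canonical event fields."""
    key = title.replace(" ", "_").lower()
    if key not in ALIASES:
        raise KeyError(f"not a known signup or login page: {title}")
    return {"page_namespace": SPECIAL_NAMESPACE, "page_title": ALIASES[key]}


print(canonical_page_fields("Crear una cuenta"))
print(canonical_page_fields("UserLogin"))
```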

I'm also in the process of thinking about Growth-related IP Masking instrumentation in general (T341651), but I think these can be separate tasks as so far we're only looking at schema modifications rather than creating whole new instrumentation. Wanted to mention it in case it would come up while you're working on this.

Thanks for sharing this; related to point (2), is there overlap between this task and T341650? Or is T341650 for updating existing Grafana metrics, while this task is for long-term analysis/instrumentation of account creation?

@nettrom_WMF, yes, the event is recorded twice. IMO, there may be no easy way to tackle this issue without, for example, creating other event types, e.g. "success"-Special:CreateAccount-AutoCreate, and keeping the "success"-Special:CreateAccount event for account creation. Or we could add a new event subtype property that specifies the type of account creation event. @Sgs, thoughts?

The problem, as far as I can see, is that the postAuthentication method does not provide information about which flow is being performed, so we cannot differentiate between auto-login and login (and other authentication flows; see Extension:CentralAuth/authentication#Central_session).

  2. I'm definitely in favour of not having duplicate events. Just so I understand the event generation here, we're instrumenting both account creation and login, but the auto-login doesn't trigger a login event through this instrumentation? It's instead that it creates a second account creation "success" event?

See also my comment above (T346327#9446991). One way of solving this could be to have a different, dedicated schema for login attempts (and impressions?). Would that be appropriate, @nettrom_WMF?

I don't know the architecture particularly well so this is a guess, but can we use the same schema and define different streams to separate account creations from logins and get around it that way? (ref wt:Event Platform/Stream Configuration) The reason I ask is that if the UX flow is almost identical in both cases, it wouldn't make sense to have two almost identical schemas to instrument it. But maybe I'm missing something about how the architecture works, and we'd still have to find a way around this?
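A toy sketch of the "one schema, two streams" idea, not the actual Event Platform stream configuration. The stream names are invented, and the full schema title is an assumption based on the rename mentioned earlier in the thread.

```python
from collections import defaultdict

# Assumed full title of the renamed schema; the thread only gives
# "accountcreation/account_conversion".
SHARED_SCHEMA = "analytics/mediawiki/accountcreation/account_conversion"

# Hypothetical stream names: two streams, one schema.
STREAMS = {
    "example.account_creation": SHARED_SCHEMA,
    "example.login": SHARED_SCHEMA,
}

received = defaultdict(list)


def submit(stream, event):
    """Record an event under a named stream, tagging it with the stream's schema."""
    if stream not in STREAMS:
        raise KeyError(f"unknown stream: {stream}")
    received[stream].append({"schema": STREAMS[stream], **event})


submit("example.account_creation", {"event_type": "success"})
submit("example.login", {"event_type": "success"})

for stream, events in received.items():
    print(stream, events)
```

The events have the same shape but land in separate streams, so analysts can query them independently. This does not by itself solve the problem of telling auto-login from login inside postAuthentication; the producer still has to know which stream to submit to.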

Let me also respond to this other question while I'm at it:

I'm also in the process of thinking about Growth-related IP Masking instrumentation in general (T341651), but I think these can be separate tasks as so far we're only looking at schema modifications rather than creating whole new instrumentation. Wanted to mention it in case it would come up while you're working on this.

Thanks for sharing this; related to point (2), is there overlap between this task and T341650? Or is T341650 for updating existing Grafana metrics, while this task is for long-term analysis/instrumentation of account creation?

I think T341650 is specifically about the Grafana metrics, whereas what I've been working on is thinking about what data we'd prefer to log to be able to better understand the newcomer journey based on what they did before account signup.

Something to be aware of is that the login form will very likely move over to loginwiki (T348388: Use central login wiki for login (SUL3)). It shouldn't affect things much, other than that it will need an explicit "source wiki" field as the wiki field built into the metrics framework will become useless.

Hi @nettrom_WMF, currently, the code for generating the analytics events produces P59765 for account creation and P59766 for login attempts. If you have any thoughts about the events (such as "is anything missing"), that'd be appreciated.