
[REQUEST] Investigate decrease in New Registered Users
Closed, Resolved, Public

Description

Name for main point of contact and contact preference
Kate Zimmerman, Phab

What teams or departments is this for?
Product

What are the details of your request? Include relevant timelines or deadlines
@MMiller_WMF noted that "New Registered Users", as reported on Wikistats (https://stats.wikimedia.org/#/all-wikipedia-projects/contributing/new-registered-users/normal|bar|2016-07-02~2021-08-26|~total|monthly), have declined substantially in recent months (see image below). We need to understand what's behind the decrease in reported numbers:

  • are declining registration numbers due to an issue with registration forms or some other part of the registration user flow?
  • is this an early warning sign of declining user engagement? (note: we regularly monitor new active editors, and those numbers do not indicate cause for concern)
  • is this an artifact of bugs in event logging, data pipelines, bots, or some other issue that needs to be addressed to improve the accuracy of our numbers?

Screen Shot 2021-08-26 at 9.58.09 AM.png (1×2 px, 252 KB)

How will you use this data or analysis?
To inform Product or Engineering decisions (for example, if a bug fix is needed)

Is this request urgent or time sensitive?
Yes, given potential implications

Event Timeline

kzimmerman created this task.
kzimmerman renamed this task from [REQUEST] to [REQUEST] Investigate decrease in New Registered Users. Aug 26 2021, 5:05 PM

Thanks for creating this, @kzimmerman. One thing that might be part of this story is a change to the Create:Account page on English Wikipedia, which added a warning at the top about usernames. Perhaps this is deterring some users. I know that it began in the last year, but I'm not sure when. And I don't think other wikis have done similar things.

image.png (1×2 px, 505 KB)

From https://meta.wikimedia.org/wiki/Research:Newly_registered_user#Data_sources:

Newly registered users are logged globally via Schema:ServerSideAccountCreation or locally on a per-project basis via MediaWiki's logging table. Note that the user table is not a reliable source of new registration data as it includes both attached and proxy-registered users.

Looking at the EventGate dashboard for eventlogging_ServerSideAccountCreation stream it appears there are quite a few HTTP errors:

Screen Shot 2021-09-02 at 3.12.43 PM.png (1×3 px, 673 KB)

(According to Andrew the decimal events/second such as 0.6 is because Grafana is taking the average of 1 minute.)

The only events going to that stream should be those produced by this instrumentation, so it looks like in some cases that instrument is making bad requests.

Looking at Logstash, there are MANY errors:

Screen Shot 2021-09-02 at 3.31.41 PM.png (646×3 px, 171 KB)

…it makes me wonder if any of them are from the SSAC instrument. Since SSAC is being used for the global account creation count, we should try using it to count on a per-wiki basis. I recommend picking some wikis for investigation and comparing counts from ServerSideAccountCreation against counts from the logging table. If there is a problem with the instrument, we should see the count via logging be greater than the count via SSAC across all selected wikis. If there is no problem with the instrument, the counts will be the same and the issue is something else.
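The comparison described above could be sketched roughly as follows. This is only an illustration: the function names, the `tolerance_pct` threshold, and the wiki names and counts are all hypothetical, and the per-wiki counts are assumed to have already been pulled from the logging table and from ServerSideAccountCreation.

```python
def compare_counts(logging_counts, ssac_counts):
    """Build one row per wiki with the gap between logging and SSAC counts."""
    rows = []
    for wiki in sorted(set(logging_counts) | set(ssac_counts)):
        log_n = logging_counts.get(wiki, 0)
        ssac_n = ssac_counts.get(wiki, 0)
        diff = log_n - ssac_n  # positive means SSAC undercounts
        # Percent difference relative to the logging table; None if no baseline.
        pct = round(100.0 * diff / log_n, 2) if log_n else None
        rows.append(
            {"wiki": wiki, "logging": log_n, "ssac": ssac_n,
             "diff": diff, "pct_diff": pct}
        )
    return rows


def undercounts_everywhere(rows, tolerance_pct=1.0):
    """True only if logging exceeds SSAC by more than the tolerance on every wiki."""
    return bool(rows) and all(
        r["pct_diff"] is not None and r["pct_diff"] > tolerance_pct for r in rows
    )


# Hypothetical counts, for illustration only (not real data).
logging_counts = {"wiki_a": 1000, "wiki_b": 450, "wiki_c": 200}
ssac_counts = {"wiki_a": 1000, "wiki_b": 448, "wiki_c": 200}

rows = compare_counts(logging_counts, ssac_counts)
for row in rows:
    print(row)
print("instrument undercounts everywhere:", undercounts_everywhere(rows))
```

A consistent, sizeable gap on every wiki would point at the instrument; matching or near-matching counts would point elsewhere.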

An account gets autocreated for all new users on metawiki and loginwiki (although this is also something that could break in theory, but stewards use the latter for antispam and tend to notice issues) so account autocreation logs there can be compared with global metrics.

Hi all,
I am looking at counts from ServerSideAccountCreation and comparing those to counts from the logging table for swwiki, bnwiki, idwiki, and enwiki to start.
I will check in with @mpopov tomorrow about this and will post an update here by Thursday.

Hi All,
I'm investigating this. Following my recent check-in with @mpopov, my next step is to touch base with @nettrom_WMF when he returns on Monday to review queries/results.

Thanks for creating this, @kzimmerman. One thing that might be part of this story is a change to the Create:Account page on English Wikipedia, which added a warning at the top about usernames. Perhaps this is deterring some users. I know that it began in the last year, but I'm not sure when. And I don't think other wikis have done similar things.

image.png (1×2 px, 505 KB)

If this is deemed the issue, T282494 has been open since May; resolving that would give us the option to move the message to a more appropriate spot on the page where it wouldn't be so scary.

@Sdkb -- thanks for pointing out that task. I actually noticed today that the message has been slimmed down (see below). Do you know of a conversation on enwiki where this change was discussed? I would be interested to see how the community is thinking about it.

image.png (1×2 px, 483 KB)

When comparing logging table records where log_action = 'create' to SSAC table records where event.isselfmade = true, I see very similar counts in the two tables for the wikis queried. Out of 77 wikis, 66 showed the same number of new registrations in July in both tables; the other 11 showed a 0.02% to 0.44% difference in registration counts, with the SSAC table undercounting by a few registrants in each of the 11 cases.

A bigger difference is seen when comparing SSAC table records where event.isselfmade = true to logging table records where log_action IN ('create', 'create2'). The next step is to look at how Wikistats is computing the number of New Registered Users.

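-- Self-created accounts in one wiki's logging table, July 2021
SELECT count(log_id)
FROM logging
WHERE log_action IN ('create')
    AND log_timestamp BETWEEN 20210701000000 AND 20210801000000
    AND log_type = 'newusers';

-- Self-made account creation events per wiki in the SSAC table, July 2021
-- (list) stands for the set of wiki database codes being compared
SELECT wiki AS database_code,
    count(event.userid) AS reg_count_ssac
FROM event_sanitized.serversideaccountcreation
WHERE year = 2021
    AND month = 7
    AND event.isselfmade = true
    AND wiki IN (list)
GROUP BY wiki;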

@MMiller_WMF, sorry for the delayed reply. There was discussion first here, where I pushed to make it more concise. Some editors had strong views about whether to use "suggest"/"strongly suggest"/"consider", so I threw that to an RfC here. The RfC got CENT-listed, so it's ended up with a ton of comments. Some underlying context is here.

Wikistats numbers for New Registered Users are in line with the numbers seen in the logging table (where log_action = 'create') and SSAC table (where event.isselfmade = true). See how wikistats defines New Registered Users.
These numbers were checked against counts of user accounts created on each wiki within the same time period, using the query below:

user_count_query = """
    SELECT
        count(1) AS num_users
    FROM user
    JOIN actor
    ON user_id = actor_user
    JOIN logging
    ON actor_id = log_actor
    WHERE log_timestamp BETWEEN 20210701000000 AND 20210802000000
    AND log_action = "create"
    AND log_type = "newusers"
    AND user_registration BETWEEN 20210701000000 AND 20210801000000
"""

Drop in New Registered Users:

  • is this an artifact of bugs in event logging, data pipelines, bots, or some other issue that needs to be addressed to improve the accuracy of our numbers? No.
  • are declining registration numbers due to an issue with registration forms or some other part of the registration user flow? No. The registration changes discussed on this ticket were made on enwiki, but the drops in New Registered Users are seen across wikis. If this question is also meant to cover other registration issues across wikis, such as UX or design, follow-up investigations are recommended.
  • is this an early warning sign of declining user engagement? (note: we regularly monitor new active editors, and those numbers do not indicate cause for concern) We see drops in New Registered Users across wikis, and the drops coincide with typical seasonal declines. For some wikis the drop is steep: dewiki had 5,853 NRU in Jul 2021, compared to its next two lowest figures on record, 7,316 in Aug 2016 and 7,207 in Aug 2020. For other wikis the drop is less steep: swwiki saw a drop in July, but that number is higher than many of its previous lows.

We can expect the number of New Registered Users to continue its decrease in August. An investigation into the relative year-over-year (YOY) drop in New Registered Users by market may be worthwhile.
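One possible way to compute the relative YOY drop mentioned above is sketched here. The function name, the data layout (a dict keyed by (year, month)), and the sample numbers are assumptions made for illustration; they are not taken from Wikistats.

```python
def yoy_change(monthly_counts, year, month):
    """Relative change versus the same month one year earlier.

    monthly_counts maps (year, month) -> number of new registered users.
    Returns None when either month is missing or the baseline is zero.
    """
    current = monthly_counts.get((year, month))
    previous = monthly_counts.get((year - 1, month))
    if current is None or not previous:
        return None
    return (current - previous) / previous


# Hypothetical monthly series per market, for illustration only.
series_by_market = {
    "market_a": {(2020, 7): 1000, (2021, 7): 800},
    "market_b": {(2020, 7): 500, (2021, 7): 525},
    "market_c": {(2021, 7): 300},  # no prior-year baseline
}

for market, series in sorted(series_by_market.items()):
    change = yoy_change(series, 2021, 7)
    label = "n/a" if change is None else f"{change:+.1%}"
    print(market, label)
```

Comparing the same calendar month across years keeps the typical seasonal decline from being read as a trend.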

@Tgr Do you know of recent bot detection deployments that should be considered here? Or insights about changes to bot activity that I should review? Another potential component is whether bot-created new registered user accounts are part of the picture. If there have been changes in bot activity/behavior or in bot detection, I will include that in this investigation.

Irene and I reviewed the data, and one thing that stood out is that, within wikis, trends in registration over the past few years don't seem to match with trends in new editors.

Looking at the totals for all wikis:

Screen Shot 2021-09-29 at 4.37.58 PM.png (1×1 px, 98 KB)
Screen Shot 2021-09-29 at 4.35.25 PM.png (972×1 px, 124 KB)

This is why we think the change may reflect bots being suppressed from registering, perhaps going back to January 2020.

New editors who have some amount of edits (say 10+ in total) and aren't blocked would be a good noise-free signal for human activity. I'd expect registrations to be noisy - some of them are spambots, we aren't sure how many, the numbers might fluctuate randomly based on e.g. whether bot operators consider spamming Wikipedia lucrative at the moment. (Here's a query for number of spambots blocked - not the same as number of spambot accounts created, just to give you an idea how much these numbers change.)

In general we don't really have any kind of bot detection during registration (other than the captcha + throttling, and those don't change much). There have been changes to password requirements in early 2019 (T208441: 👩‍👦‍👦 AHT password strengthing work, 2018/19), maybe those could break particularly stupid bots. I can't think of anything happening early 2020.
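The "10+ edits and not blocked" signal suggested above could be expressed as a simple filter over newly registered accounts. This is a sketch only: the `NewAccount` record and its field names are hypothetical and do not correspond to an actual table schema.

```python
from dataclasses import dataclass


@dataclass
class NewAccount:
    """Hypothetical record for a newly registered account."""
    name: str
    edit_count: int
    blocked: bool


def productive_new_editors(accounts, min_edits=10):
    """Keep accounts with at least min_edits edits that are not blocked."""
    return [a for a in accounts if a.edit_count >= min_edits and not a.blocked]


# Made-up accounts, for illustration only.
accounts = [
    NewAccount("editor_one", 25, False),
    NewAccount("spam_account", 40, True),
    NewAccount("one_time_visitor", 1, False),
    NewAccount("editor_two", 10, False),
]

kept = productive_new_editors(accounts)
print([a.name for a in kept])
print(f"{len(kept)} of {len(accounts)} registrations count as productive")
```

Counting only the accounts that pass this filter would give a registration series that is less sensitive to fluctuations in spambot activity.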

New editors who have some amount of edits (say 10+ in total) and aren't blocked would be a good noise-free signal for human activity.

I think it could be interesting to take one last look using that criterion, but even without it I believe @Iflorez has done all she can here to investigate the proposed hypotheses and this should be marked as resolved. Sometimes the answer is "we don't know" and that's okay because there are so many different factors and systems (many of which are outside of our control or ability to observe/quantify).

We should continue monitoring, but I would also recommend reconsidering how much importance we give this metric as-is. I also want to point out this pattern isn't especially new in the grander view:

Screen Shot 2021-10-01 at 5.02.47 PM.png (948×1 px, 91 KB)

Thanks @Tgr and @mpopov.

Summary and recommendations:
Irene did not find evidence that there are bugs that need to be corrected or changes to user flows that negatively impacted registrations. Moreover, changes in registrations do not consistently correlate with changes in new editors, and do not seem to be a reliable indicator of productive engagement (productive meaning that the registrants go on to edit).

My hypothesis for the "steep decline" in registrations: beginning around early 2020, there was a decline in non-productive (bot?) registrations that was obscured by the large increases in productive registrations around the pandemic. As those large pandemic-related increases wane, the reduction in registrations looks more alarming (whereas, with new editors, we're still seeing increases compared to pre-pandemic numbers).

Based on the findings, I'm resolving this task. For high-level tracking, we'll continue to focus on new editors; registrations are probably more helpful when considered in the context of specific projects or communities where local effects can be taken into account.