Page MenuHomePhabricator

[REQUEST] How many users will receive The Wikipedia Library eligibility notification?
Closed, ResolvedPublic

Description

Name for main point of contact and contact preference
@Samwalton9, contact in this task is fine!

What teams or departments is this for?
The-Wikipedia-Library

What are the details of your request? Include relevant timelines or deadlines
In Q4 we're planning to roll out a notification for editors who become eligible to access content in The Wikipedia Library (T132084). This will retroactively notify everyone who already meets the criteria. We only have a ballpark estimate of 20,000-30,000 as the number of editors who will receive that retroactive notification, and would like to have a more accurate number.

The criteria for access is as follows:

  • At least 500 global edits to Wikimedia projects
  • An account that is at least 6 months old

The notification will be given to eligible users retroactively after their next edit, so we're not concerned about inactive users.

  1. How many users will receive this notification in the first week if we simply turned it on?
  2. With thresholds of 10,000, 4,000, 1,000, and 500 edits, reducing the threshold on a weekly basis (gradual rollout - T271921), how many users would receive the notification each week?
  3. Bonus question: After the initial deployment, how many new users will meet the threshold and therefore receive the notification each week?

How will you use this data or analysis?
This will inform our release thresholds and give us a realistic sense for how many users we expect to visit the tool. It will also help us understand what analytics software to use (T265001) based on our expected user base.

Is this request urgent or time sensitive?
We plan to deploy the notification in mid-Q4. Any time between now and then would be great.

Event Timeline

LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Current Quarter on the Product-Analytics board.

@Samwalton9 we'll plan on taking this on mid-to-late Q3.

Moving to triage so we can prioritize this work in our team meeting next week.

Hey @Samwalton9, we have to push this back because we won't have bandwidth this quarter. Let me know if this becomes more urgent, or, alternatively, if your ballpark number is sufficient and we can close this.

Hi @kzimmerman - no problem. I don't expect that we'll be deploying the notification until near the end of Q4, which is the only hard deadline on this. I do really want some deeper insight than our back-of-the-envelope calculation though. Does it seem feasible for there to be bandwidth early Q4?

Alternatively, a smaller ask could be some guidance on the steps for calculating this number more accurately. I'm happy to roll up my sleeves and write some scripts and queries if someone could spend a little time pointing me to the right data sources and/or suggest potential approaches.

Hey @Samwalton9 - we definitely could help point you in the right direction during office hours! I'm sorry I didn't bring that up as an option earlier. Here's more info about office hours: https://www.mediawiki.org/wiki/Product_Analytics/Office_Hours

You can book any time slots available with people from my team (you might want to book 2 30-minute slots back-to-back). Here's a direct link to our calendar to book appointments (also listed on our page, above).

Thanks - I've booked a time with Neil tomorrow.

Spoke to @nshahquinn-wmf at his office hours today :) Notes:

  • I should be able to get this data from the internal mediawiki_history table. Generated monthly a few days after the start of the month.
  • Can run queries over SSH or at https://hue-next.wikimedia.org/
    • Hive > wmf database > mediawiki_history table
    • event_entity = revision
    • group by event_user_text
    • event_user_is_bot_by to filter bots, if size > 0, it's a bot. This is an array because there are different ways a user can be a bot.
    • Minimum of event_user_first_edit_timestamp and event_user_creation_timestamp to find registration date. Might not be precisely the same as the date in CentralAuth, but mostly due to old account edge cases.
    • revision_is_deleted_by_page_deletion for removing deleted edits (this is just a bool), if that's how CentralAuth counts (I should double check this)
    • Hive has date functions for > 6 months ago, date might be a string and need converting in some way.
    • Need to use snapshot field (YYYY-MM)
  • Can answer bonus question by using event_dt and a date one week ago. Compare the number from above and this to see how many people became eligible in the last week.

I'm going to give this a go this week and will report back!

kzimmerman moved this task from Upcoming Quarter to Tracking on the Product-Analytics board.
kzimmerman subscribed.

Awesome, thanks so much @Samwalton9 and @nshahquinn-wmf! Reassigning to Sam :)

We weren't sure if CentralAuth counts deleted edits or not. From some testing it doesn't seem clear - the numbers don't match total edits with or without deleted contribs, so I think it's being calculated differently. In my query, looking at one test user with ~9000 edits, keeping deleted contribs gave a number that was close - but not exactly - the figure CentralAuth has. Removing deleted contribs gave a substantially different number. I'll move forward including deleted edits, since this is going to count more users rather than less, which is I think preferable.

Here's where I've got to, where I'm running into a problem with duplicated users:

SELECT event_user_text, least(event_user_creation_timestamp, event_user_first_edit_timestamp), COUNT(*)
FROM mediawiki_history
WHERE event_entity = 'revision'
AND snapshot = '2021-02'
AND size(event_user_is_bot_by) = 0 -- Don't count bot edits
AND event_user_text in ( -- Find all users who made an edit in the last week of February, only consider these.
  SELECT DISTINCT event_user_text
  FROM mediawiki_history
  WHERE snapshot = '2021-02'
  AND event_entity = 'revision'
  AND event_timestamp between '2021-02-08' and '2021-02-15'
)
GROUP BY event_user_text, least(event_user_creation_timestamp, event_user_first_edit_timestamp)
HAVING COUNT(*) > 500
AND least(event_user_creation_timestamp, event_user_first_edit_timestamp) < '2020-08-08' -- Account creation or earliest edit is > 6 months ago

In my results I'm seeing some users showing up as many as 5 or 6 times, each with a different value for least(event_user_creation_timestamp, event_user_first_edit_timestamp). What would cause this value to change for a particular user?

As an example, user !Silent has, according to CentralAuth, a registration timestamp of 21:40, 23 January 2011. In the data, they show up with four different least(event_user_creation_timestamp, event_user_first_edit_timestamp) - 2008-03-08 18:02:54.0, 2011-01-29 20:07:11.0, 2011-02-14 19:57:44.0, and 2013-03-09 02:24:23.0.

We weren't sure if CentralAuth counts deleted edits or not. From some testing it doesn't seem clear - the numbers don't match total edits with or without deleted contribs, so I think it's being calculated differently. In my query, looking at one test user with ~9000 edits, keeping deleted contribs gave a number that was close - but not exactly - the figure CentralAuth has. Removing deleted contribs gave a substantially different number. I'll move forward including deleted edits, since this is going to count more users rather than less, which is I think preferable.

I looked at the CentralAuth code, and the per-wiki edit counts it shows are pulled from user_editcount in each wiki's database. According to the documentation for that field, it generally includes deleted edits but is only roughly accurate.

So, the numbers you're getting from mediawiki_history should be more accurate, but I'm pretty sure Echo will be using the less-accurate numbers to decide who to notify 😂

Theoretically, one could do what MediaWiki does and add up all the values of user_editcount, but it sounds like that's unnecessary since you're finding the numbers are close enough.

In my results I'm seeing some users showing up as many as 5 or 6 times, each with a different value for least(event_user_creation_timestamp, event_user_first_edit_timestamp). What would cause this value to change for a particular user?

As an example, user !Silent has, according to CentralAuth, a registration timestamp of 21:40, 23 January 2011. In the data, they show up with four different least(event_user_creation_timestamp, event_user_first_edit_timestamp) - 2008-03-08 18:02:54.0, 2011-01-29 20:07:11.0, 2011-02-14 19:57:44.0, and 2013-03-09 02:24:23.0.

In mediawiki_history, the user creation time is when the user's account at that wiki (their local account) was created. Your first local account is created when you sign up; other local accounts are created when you first view a page on that wiki when logged in. So it's expected that they are all different.

The CentralAuth registration timestamp is the time when the central account was created. For new users, this should be the same as when they created their first local account, but for old users like !Silent, this will be long afterwards, when SUL unification took place.

So, in your query, I would group only by event_user_text. Then select min(least(event_user_creation_timestamp, event_user_first_edit_timestamp)), which will give the lowest value the user had at any wiki, which would be when they registered their first local account.

Hope that helps! Please ask more questions if they come up 😊

Thanks so much - this got me unstuck and I'm now getting what I think is the right data! Saving the query here for posterity:

SELECT COUNT(*) FROM (
  SELECT event_user_text, min(least(event_user_creation_timestamp, event_user_first_edit_timestamp)), COUNT(*)
  FROM mediawiki_history
  WHERE event_entity = 'revision'
  AND snapshot = '2021-02'
  AND size(event_user_is_bot_by) = 0 -- Don't count bot edits
  AND event_user_text in ( -- Find all users who made an edit in the last week of February, only consider these.
    SELECT DISTINCT event_user_text
    FROM mediawiki_history
    WHERE snapshot = '2021-02'
    AND event_entity = 'revision'
    AND event_timestamp between '2021-02-22' and '2021-03-01'
  )
  GROUP BY event_user_text
  HAVING COUNT(*) > 500
  AND min(least(event_user_creation_timestamp, event_user_first_edit_timestamp)) < '2020-08-22' -- Account creation or earliest edit is > 6 months ago
) as x;

I ran this for three different weeks in February and spot checked the resulting list of users. I got an average figure for the number of editors who would have received this notification within the space of a week as 39,275. Over the course of a month the figure is ~57,000.

I played around with edit count thresholds for a particular February week and came up with the following potential distribution plan:

Edit countNumber of editorsSegment size% of total
>50,0005,2495,24913%
>10,00015,1119,86225%
>2,00027,37012,25931%
>50039,14111,77130%

Yes, that looks right to me! Great work figuring this out yourself; if you ever get tired of product management, you can certainly switch to data science 😊