Update fixUserRegistration.php to use newuserlog (where available, prior to r12207), and gaussian estimates for the fossils


Author: herd

Users created before r12207 who have no edits until after r12207 are not assigned guessed data for user.user_registration. This is not critical, but it is often very confusing and sometimes wildly inaccurate. There are several months' worth of data on Wikimedia wikis in the new user log from the extension (see r10573) that could populate this field.

Also, for users who predate even the extension, a Gaussian curve could be plotted from the data of available edits and log entries (all of which would be after the creation date) and normalized to a curve or wave of user creation date/ID.
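
The Gaussian fit itself is not spelled out in this report. As a simpler stand-in (not the proposed curve fit), here is a minimal sketch of how the "everything is after the creation date" observation can bound the missing values. It assumes user IDs are handed out in registration order, so a user cannot have registered later than the earliest known action of any user with a higher ID. The function name and the input layout are hypothetical.

```python
from typing import List, Optional


def estimate_registration(first_actions: List[Optional[int]]) -> List[Optional[int]]:
    """Upper-bound each user's registration time.

    first_actions[i] is the Unix timestamp of the earliest known edit or
    log entry for the i-th user in ascending user_id order, or None when
    the user has no recorded action at all.

    Assumption (not from the report): user IDs increase with registration
    time, so registration(i) <= first_action(j) for every j >= i.
    """
    estimates: List[Optional[int]] = [None] * len(first_actions)
    bound: Optional[int] = None
    # Walk from the newest user to the oldest, keeping the running minimum
    # of first-action timestamps seen so far (a suffix minimum).
    for i in range(len(first_actions) - 1, -1, -1):
        ts = first_actions[i]
        if ts is not None and (bound is None or ts < bound):
            bound = ts
        estimates[i] = bound
    return estimates


# The second user first acted at t=900, but a later-registered user acted
# at t=300, so the second user must have registered by t=300.
print(estimate_registration([100, 900, None, 300, None, 400, None]))
```

The result is only an upper bound that never decreases with user ID; users with no later neighbour to borrow from stay None.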

Awaiting WONTFIX!

Version: 1.15.x
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=Special:Log&dir=prev&type=newusers

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz18638.
bzimport created this task. Via Legacy, May 1 2009, 2:49 AM
bzimport added a comment. Via Conduit, May 1 2009, 10:51 AM

happy.melon.wiki wrote:

Go on, then, let's see this gaussian curve of yours :D Might as well work for your wontfix!!

The other suggestion, however, is good: that extension provided accurate log data. A quick check on the toolserver suggests that there are at least 290,000 entries in the relevant period; a substantial fraction of these could be recovered in this fashion. It should probably be a separate script, though; there's no guarantee that wikis needing to populate the column would have had the extension installed, and no point in the script trying to use that data if it's not present.
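
A rough sketch of that "only use the log where it exists" rule, working on plain in-memory mappings rather than a real database. The function name and both dictionaries are hypothetical; timestamps are written as 14-digit strings purely for illustration.

```python
from typing import Dict, Optional


def fill_from_newuser_log(
    registration: Dict[int, Optional[str]],
    newuser_log: Dict[int, str],
) -> Dict[int, Optional[str]]:
    """Fill missing registration timestamps from new user log entries.

    registration maps user_id -> existing registration timestamp or None.
    newuser_log maps user_id -> timestamp of that user's creation log entry.
    Both structures are illustrative assumptions, not a real schema.
    """
    # A wiki that never had the extension has no such log entries:
    # return the data unchanged instead of doing any work.
    if not newuser_log:
        return dict(registration)

    filled: Dict[int, Optional[str]] = {}
    for user_id, current in registration.items():
        if current is None:
            # Only missing values are touched; users absent from the log stay None.
            filled[user_id] = newuser_log.get(user_id)
        else:
            # Never overwrite a value that is already recorded.
            filled[user_id] = current
    return filled


existing = {1: None, 2: "20060101000000", 3: None}
log = {1: "20050908101500", 2: "20050909000000"}
print(fill_from_newuser_log(existing, log))
```

Users that remain None after this pass are the ones that would still need an estimate from edits and other log entries.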

bzimport added a comment. Via Conduit, May 2 2009, 5:21 AM

herd wrote:

Go on, then, let's see this gaussian curve of yours :D

Too slow a query to do it for everyone without actually, y'know, DOING it, as in populating the data. But here are 5000 users from en.wp. Note there isn't much curve to it, and it skips all users with double NULLs, but there is definitely a trend line:

bzimport added a comment. Via Conduit, May 2 2009, 8:30 AM

herd wrote:

Sampling of normalizable user first-contribution curve

Here is a more distributed sampling, of all users from 1k-750k (1:1000).

Copied from http://test.wikipedia.org/wiki/File:Example_of_user_first_actions_for_en.wp_1-750000_(by_thousand).gif


bzimport added a comment. Via Conduit, May 2 2009, 10:16 AM

happy.melon.wiki wrote:

Wow, that's a much better fit than I was expecting, TBH. And the outliers tell their own story; particularly interesting the ones on the second graph that were registered in 2001-03, but not used until around 2008... More ammunition (as if it were needed) against deleting old accounts.

Still not entirely sure how you'd convert that data into registration timestamps, or are you going to assume that the curve approximately follows the registration time; that is, the average delay between registering and editing is zero? Seems a justifiable assumption, but I notice the curve gets a bit wobbly at the top; lots of double NULLs in the data...

Chad added a comment. Via Conduit, May 6 2011, 5:53 PM
  • Bug 22097 has been marked as a duplicate of this bug.
