Page MenuHomePhabricator

Sharp dip in MediaWiki.site.users metric since wmf11 roll-out
Closed, DeclinedPublic

Assigned To
None
Authored By
ori
Jan 28 2016, 9:05 AM
Referenced Files
F3522580: download (2).png
Mar 4 2016, 9:46 PM
F3522576: download (1).png
Mar 4 2016, 9:46 PM
F3517110: download.png
Mar 4 2016, 1:52 AM
F3288327: Capture d’écran 2016-01-28 à 11.32.58.png
Jan 28 2016, 10:52 AM
F3288334: Capture d’écran 2016-01-28 à 11.36.47.png
Jan 28 2016, 10:52 AM
F3288330: Capture d’écran 2016-01-28 à 11.34.11.png
Jan 28 2016, 10:52 AM
F3288340: wm-users.png
Jan 28 2016, 10:43 AM
F3288307: render-5.png
Jan 28 2016, 10:27 AM

Description

The roll-out of wmf11 coincides with a sharp decline in new user registrations as recorded by the MediaWiki.site.users graphite metric. This is true of the second roll-out as well, so whatever is depressing the account creation rate is still not fixed.

accts.png (308×586 px, 56 KB)

http://graphite.wikimedia.org/S/s

Related SAL entries for Jan 27th around that time (time is UTC):

21:26 logmsgbot: ori@mira Synchronized docroot and w: (no message) (duration: 02m 26s)
19:48 YuviPanda: started nfs-exports daemon on labstore1001, had been dead for a few days
19:32 mutante: stat1002 - redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out.
19:31 mutante: stat1002 - running puppet, was reported as last run about 4 hours ago but not deactivated
19:14 logmsgbot: dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11
19:07 ejegg: set donation queue consumer time limit back to 90 sec
18:49 logmsgbot: jynus@mira Synchronized wmf-config/db-eqiad.php: Repool pc1006 after cloning (duration: 02m 25s)
18:48 bd808: HHVM on mw1019 still dying on a regular basis with "Lost parent, LightProcess exiting"
18:00 csteipp: deploy patch for T103239
17:50 csteipp: deploy patch for T97157

Related Objects

Event Timeline

ori raised the priority of this task from to Unbreak Now!.
ori updated the task description. (Show Details)
ori added subscribers: ori, Anomie, bd808, Tgr.

I don't see an obvious difference in successful registrations (which of course might simply mean that the logging code is wrong).

web

auth-graphite-1.png (308×586 px, 64 KB)

api

auth-graphite-2.png (308×586 px, 47 KB)

Where does MediaWiki.site.users come from, exactly?

I don't see an obvious difference in successful registrations (which of course might simply mean that the logging code is wrong).

Yes, that seems likely, given the graph in your comment.

Where does MediaWiki.site.users come from, exactly?

It is incremented in SiteStatsUpdate::doUpdateContextStats in MediaWiki.

Another graph of MediaWiki.site.rate.users centered on Jan 27th shows the drop occurred between 19:10 and 19:20 :

render-5.png (600×1 px, 76 KB)

@Tgr wrote:

Where does MediaWiki.site.users come from, exactly?

I asked the same to ori, quoting him from IRC:

the rate at which new accounts are being created, I believe
it is incremented in SiteStatsUpdate::doUpdateContextStats() whenever an account is created
(and on some other conditions)
by which i mean that method gets called for other reasons too. but the user count is only incremented when an account is created.

hashar set Security to None.

Where does MediaWiki.site.users come from, exactly?

Ah, found it, SiteStatsUpdate.

Seems like the stats update is broken, not registrations:

tgr@mw1152:~$ sql centralauth -e 'select substr(gu_registration, 1, 11) hour, count(*) users from globaluser where gu_registration > '20160120000000' group by hour order by hour desc' > reg_hourly.tsv

https://docs.google.com/spreadsheets/d/1Kja3tnTZucXurCpYrT-z_WO2aP_J0DHqxnFc4PIU0HI/edit?usp=sharing

wm-users.png (474×754 px, 43 KB)

From the https://grafana.wikimedia.org/dashboard/db/authentication-metrics authentication dashboard. The error rate corresponds to login need_token. The first spike being 19:23 - 19:30 , the second 19:47 - 19:51.

Login:

Capture d’écran 2016-01-28 à 11.32.58.png (593×866 px, 124 KB)

Error rate (relative)

Capture d’écran 2016-01-28 à 11.36.47.png (466×867 px, 160 KB)

The authentication metric dashboard is based on MediaWiki.authmanager.login.<entry point>.failure. The need_token spikes are related to the api entry point.


Account creation doesn't show a drop. Seems to indicate the new session manager does not invoke SiteStatsUpdate::doUpdateContextStats() the same way (i.e. it would skip calling it for some reason. That would explain the MediaWiki.site.users.rate drop.

Graph based on MediaWiki.authmanager.accountcreation.<entry point>.{success,failure}:

Capture d’écran 2016-01-28 à 11.34.11.png (464×867 px, 104 KB)

hashar lowered the priority of this task from Unbreak Now! to High.Jan 28 2016, 10:57 AM

No more unbreak now. @Tgr has shown the # of accounts created per hour is stable.

bd808 renamed this task from Sharp dip in new accounts since wmf11 roll-out to Sharp dip in MediaWiki.site.users metric since wmf11 roll-out.Jan 28 2016, 4:01 PM
bd808 updated the task description. (Show Details)

I updated the description and title to be a bit less inflammatory based on database analysis in T125054#1973610.

This is probably caused by T125133 (and possibly a similar effect on other sites) - the previous graph counts central accounts while Ori's counts local accounts over all wikis.

I suspect this is caused by T70012. The two main sources of autocreation are CentralAuthCreateLocalAccountJob running on loginwiki, mediawiki and meta, and edge login that preemptive logs you in (and thus autocreates) on all second level domains when you use the login page. Loginwiki autocreation was fixed in T125133 which brought the pre-and post-SessionManager graphs closer but not quite overlapping; the remainder is probably due to the edge login requests getting into an autocreation race on loginwiki and thus getting aborted and not autocreating on their eventual target wiki. Given that loginwiki has an accont creation job that should prevent such problems, I wonder if something is wrong with that. (Or this happens to people who register and then immediately log in to another wiki, but I wouldn't expect that to happen often enough to visibly affect stats.)

I don't see any current artifacts as badly misaligned as the 2016-01-22 through 2016-01-23 delta highlighted in the original report. Do we actually have any historical confidence in the periodicity of this statistic?

Six weeks later: Anybody knows how to proceed here? Is this still an issue to investigate?
I'd like to avoid having this task just rotting... Thanks!

Six weeks later: Anybody knows how to proceed here? Is this still an issue to investigate?

Try to reproduce T18864; probably that behavior has been affected somehow and there are less account autocreations on never-visited wikis.

Not sure if it's worth the effort to look into, though. That behavior was always unwanted, and any real bug (e.g. account not autocreated when the user visits the wiki) would probably have been reported by now.

T137261 might be related (not terribly likely though).

Anyone still investigating here, half a year later? Or is this task realistically "declined"?

No replies for ages so I'm going to decline this to reflect reality.
Feel free to reopen if you still plan to investigate.