Page MenuHomePhabricator

[betalabs-Regression] Incorrect count when only cross-wiki notifications are present
Closed, ResolvedPublic

Description

The issue is present in 1.28.0-wmf.10 and betalabs 1.28.0-alpha (ab05a0e).
Several cases of incorrect counter behavior. Since it's a regression bug, all cases, most likely, have the same reason.

  1. Login as a user who has unread notificaitons.
  2. As the same user, log in to another wiki - the wiki where a user does not have any local notifications, i.e. never was logged before.

Note:

  • all wikis have "Cross-wiki notifications" enabled (by default)
  • Special:Notifications page's counters and the page itself display correct info
  • the issue is relatively mild - often goes away upon re-login, or refreshing manually the page.
  1. There are several outcomes

(A) The badge counter shows 0, but the expanded panel shows a different number.

Screen Shot 2016-07-13 at 4.16.37 PM.png (251×693 px, 39 KB)

Below is the case when the badge counter and the panel counter agree, but according to Special:Notifications page, there are cross-wiki messages.
Screen Shot 2016-07-13 at 4.01.37 PM.png (423×1 px, 73 KB)

(B) The badge counters show correct numbers, but the expanded panel shows "You have no notifications" message.

Screen Shot 2016-07-13 at 4.53.13 PM.png (582×997 px, 73 KB)

Screen Shot 2016-07-13 at 3.51.33 PM.png (232×539 px, 25 KB)

(C) Quite often 'Failed to fetch notifications" message is displayed.

Screen Shot 2016-07-13 at 4.03.32 PM.png (298×691 px, 30 KB)

Event Timeline

As a consequence of 298417 and 298569, getting foreign notifications count bails early and returns 0, which is then cached and the only way to bust it is to generate new notifications activity (receive a notification of mark as read/unread).

It seems to be happening in a race condition. I was able to reproduce it 3/8 of the times locally but it happens almost all the times in production.

The following abbreviated stack trace gives an idea of what is going on:

  1. EchoHooks::onPersonalUrls ( line 840 or 841 )
  2. NotifUser::getMessageCount
  3. NotifUser::getNotificationCount ( caches the result )
  4. NotifUser::getForeignCount ( returns 0 )
  5. NotifUser::getForeignData
  6. ForeignWikiRequest::execute ( returns empty array )
  7. ForeignWikiRequest::canUseCentralAuthl ( returns false )

When the transition flags are OFF, this data will be read from echo_unread_wikis so there won't be any http request during auto-creation but we may need to invalidate the cache to get rid of those false zeros.

Now that the transition flags are off, I don't see the issue in production anymore.

Re-checked in 1.28.0-alpha (7e65727) -after the transition scripts run. There is some odd persistence of the described errors for two categories of users

  • users that haven't been logged in to any of wikis for a long time
  • users that are logging to a wiki that they never visited before.

For the above categories of users, all error scenarios are still present.

For example, (C) scenario, the request (user ET1 logs first time to arwiki)

GET /w/api.php?action=query&format=json&meta=notifications&notsections=alert&notformat=model&notlimit=25
&notprop=list%7Ccount&uselang=ar&notcrosswikisummary=1&notbundle=true&notunreadfirst=1&_=1469040223759 HTTP/1.1
Host: ar.wikipedia.beta.wmflabs.org
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Referer: http://ar.wikipedia.beta.wmflabs.org/wiki/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ru;q=0.6
Cookie: arwikimwuser-sessionId=0f1789d9843b3f61; GeoIP=US%3ACA%3ACampbell%3A37.280300%3A-121.956703%3Av4; arwikiUserName=ET1

there will be

{servedby: "deployment-mediawiki01", error: {code: "notlogin-required", info: "Login is required",…}}
error
:
{code: "notlogin-required", info: "Login is required",…}
*
:
"See http://ar.wikipedia.beta.wmflabs.org/w/api.php for API usage"
code
:
"notlogin-required"
info
:
"Login is required"
servedby
:
"deployment-mediawiki01"

And the flyout will display

Screen Shot 2016-07-20 at 11.55.19 AM.png (441×806 px, 107 KB)

Change 301382 had a related patch set uploaded (by Sbisson):
[WIP] Moderate notifications

https://gerrit.wikimedia.org/r/301382

This has been marked for inclusion in Tech News, so asking to be on the safe side: Is this just beta cluster, or still affecting content wikis too?

This has been marked for inclusion in Tech News, so asking to be on the safe side: Is this just beta cluster, or still affecting content wikis too?

This probably affected some users on production wikis too

Could not find any instances of the issue in production - 1.28.0-wmf.14 (ba1c692)

The issue is still quite easily reproducible in betalabs - see the attached screen recording.
It seems logical to suspect that it's due to some cetnralauth problems there based on the error that was present. T142543: [beta cluster] Persistently kicked out from a user session was filed.

error
:
{code: "notlogin-required", info: "Login is required",…}

In the screen recording, the flyout counter displays some numbers, but Special:Notifications page is empty.


Two consecutive API calls (for the case in the screenrecording); the first one will state correctly the number of cross-wiki notifications and then, the second one, will report count as 0.

{"batchcomplete":"","query":{"notifications":{"list":[{"wiki":"cawiki","id":-1,"type":"foreign","category":"other","timestamp":{"utciso8601":"2016-08-05T21:38:39Z","utcunix":"1470433119","unix":"1470433119","utcmw":"20160805213839","mw":"20160805213839","date":"5 agost"},"agent":{"id":137,"name":"ET19"},"targetpages":[],"*":{"header":"M\u00e9s notificacions de 2 altres wikis","compactHeader":"M\u00e9s notificacions de 2 altres wikis","body":"English Wikipedia i Meta-Wiki","icon":"global","links":{"primary":[],"secondary":[]},"iconUrl":"/w/extensions/Echo/modules/icons/global.svg"},"section":"all","count":32,"sources":{"enwiki":{"title":"English Wikipedia","url":"https://en.wikipedia.beta.wmflabs.org/w/api.php","base":"https://en.wikipedia.beta.wmflabs.org/wiki/$1","ts":"2016-08-05T21:38:39Z"},"metawiki":{"title":"Meta-Wiki","url":"https://meta.wikimedia.beta.wmflabs.org/w/api.php","base":"https://meta.wikimedia.beta.wmflabs.org/wiki/$1","ts":"2015-09-15T21:38:53Z"}}}],"continue":null,"rawcount":32,"count":"32","seenTime":{"message":"2016-08-12T20:59:37Z","alert":"2016-08-12T20:57:44Z"}}}}


{"batchcomplete":"","query":{"notifications":{"alert":{"rawcount":0,"count":"0"},"message":{"rawcount":0,"count":"0"},"rawcount":0,"count":"0"}}}

Re-checked all scenarios in betalabs - all seem to be fixed. Occasionally, there is 'Failed to fetch notifications' due to spontaneous loss of session data which is specific to betalabs environment (at least, it seems to be specific to betalabs)

Change 301382 merged by jenkins-bot:
Moderate notifications

https://gerrit.wikimedia.org/r/301382