
Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes)
Open, Needs Triage
Public

Assigned To
Authored By
Tgr
May 15 2019, 8:23 PM
Referenced Files
F47226947: image.png
Apr 17 2024, 9:37 PM
F47226788: image.png
Apr 17 2024, 9:37 PM
F47226404: image.png
Apr 17 2024, 9:37 PM
F47226287: image.png
Apr 17 2024, 9:37 PM
F47225858: image.png
Apr 17 2024, 9:37 PM
F40521583: image2.png
Oct 27 2023, 2:16 PM
F40521581: image1.png
Oct 27 2023, 2:16 PM
F38231499: Screenshot 2023-10-14 at 09.08.37.png
Oct 14 2023, 8:08 AM

Description

empty cross-wiki notification.png (277×563 px, 27 KB)

On Meta I have an unread notification count of 1, but no unread notifications.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I still see this a lot. (Always with OAuth admin notifications, but that might just be because that's the type of notification I usually get from meta.)

An example of the action API request on the non-Meta wiki (mediawiki.org, in this case):

https://www.mediawiki.org/w/api.php?action=query&format=json&formatversion=2&meta=notifications&notsections=alert&notformat=model&notlimit=25&notprop=list%7Ccount%7CseenTime&uselang=en&notwikis=metawiki&notfilter=!read&notbundle=true&_=1689094065889

{
    "batchcomplete": true,
    "query": {
        "notifications": {
            "list": [],
            "rawcount": 0,
            "count": "0"
        }
    }
}

An example of the request on Meta:

https://meta.wikimedia.org/w/api.php?action=query&format=json&meta=notifications&notsections=alert&notgroupbysection=1&notmessageunreadfirst=1&notlimit=25&notprop=count&uselang=en&notcrosswikisummary=1&_=1689094179431

{
    "batchcomplete": "",
    "query": {
        "notifications": {
            "alert": {
                "rawcount": 1,
                "count": "1"
            },
            "rawcount": 1,
            "count": "1"
        }
    }
}

(The unread notification in this case was ID 4722181, type oauth-app-propose.)
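The mismatch between the two responses above can be checked mechanically. The following is a minimal Python sketch, not part of Echo: the helper name `is_phantom_count` is made up for illustration, and the two JSON bodies are the ones quoted in this comment.

```python
import json

# Response from Meta (notprop=count, notcrosswikisummary=1), as quoted above.
meta_summary = json.loads("""
{"batchcomplete": "", "query": {"notifications": {
    "alert": {"rawcount": 1, "count": "1"},
    "rawcount": 1, "count": "1"}}}
""")

# Response from mediawiki.org with notwikis=metawiki, as quoted above.
foreign_fetch = json.loads("""
{"batchcomplete": true, "query": {"notifications": {
    "list": [], "rawcount": 0, "count": "0"}}}
""")


def is_phantom_count(summary, fetched):
    """True when the source wiki reports unread items but the
    cross-wiki fetch for that wiki returns an empty list."""
    reported = summary["query"]["notifications"]["rawcount"]
    listed = fetched["query"]["notifications"]["list"]
    return reported > 0 and len(listed) == 0


print(is_phantom_count(meta_summary, foreign_fetch))  # True
```

A `True` result corresponds to the symptom in the screenshot: a non-zero badge with nothing to show when the cross-wiki section is expanded.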

If I send a more exact equivalent of the first API request to Meta:

https://meta.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&meta=notifications&notsections=alert&notformat=model&notlimit=25&notprop=list%7Ccount%7CseenTime&uselang=en&notfilter=!read&notbundle=true&_=1689094065889

I get the notification as expected.
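To make the difference between the failing cross-wiki request and the working direct request explicit, here is a small Python sketch that diffs their query parameters. The helper names are hypothetical; the URLs are copied from this comment.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Return the query string of a URL as a flat dict (last value wins)."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: values[-1] for key, values in parsed.items()}


def diff_params(url_a, url_b):
    """Map each differing parameter to (value in url_a, value in url_b).
    None means the parameter is absent from that URL."""
    a, b = query_params(url_a), query_params(url_b)
    return {k: (a.get(k), b.get(k))
            for k in sorted(a.keys() | b.keys())
            if a.get(k) != b.get(k)}


cross_wiki = ("https://www.mediawiki.org/w/api.php?action=query&format=json"
              "&formatversion=2&meta=notifications&notsections=alert"
              "&notformat=model&notlimit=25&notprop=list%7Ccount%7CseenTime"
              "&uselang=en&notwikis=metawiki&notfilter=!read&notbundle=true"
              "&_=1689094065889")
direct = ("https://meta.wikimedia.org/w/api.php?action=query&format=json"
          "&formatversion=2&meta=notifications&notsections=alert"
          "&notformat=model&notlimit=25&notprop=list%7Ccount%7CseenTime"
          "&uselang=en&notfilter=!read&notbundle=true&_=1689094065889")

print(urlsplit(cross_wiki).hostname, "vs", urlsplit(direct).hostname)
print(diff_params(cross_wiki, direct))  # {'notwikis': ('metawiki', None)}
```

Apart from the host, the only difference is `notwikis=metawiki`, so the empty result appears to come from the foreign-wiki fetch path rather than from the filter or format parameters.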

I'm also seeing this bug today, but for a Thanks notification at Meta-wiki. However, it seems to affect all types, and it only fails for me at Mediawiki-wiki.
That is, if I mark a "mention" or "reply" on Frwiki and Enwiki as unread, then the cross-wiki notifications at Mediawiki-wiki still fail to load (screenshot), but I can see them fine at Meta-wiki or Swwiki.
That doesn't really help narrow it down, but does eliminate the possibility of it just being from "Notifications from optional extensions" (OAuth and Thanks).

image.png (273×659 px, 27 KB)

matmarex added subscribers: Reedy, matmarex.

Using my crystal ball, I have determined that this happens when $wgLocalHTTPProxy is set, and the wiki's URL is not *.wikipedia.org. I don't understand why the second part is true, but it is (unless you can prove me wrong).

As of today, $wgLocalHTTPProxy is set in two cases: https://codesearch.wmcloud.org/search/?q=LocalHTTPProxy&files=&excludeFiles=&repos=operations%2Fmediawiki-config

matmarex renamed this task from Broken (empty) cross-wiki notification to Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes). Oct 10 2023, 10:17 PM
matmarex added a project: MW-on-K8s.

I am not sure how this is related to T342201, but it seems obviously related.

> I am not sure how this is related to T342201, but it seems obviously related.

That issue seems to be caused by an mcrouter failure and is very random, while this one is deterministic. At most, both are related to T347781: MWHttpRequest should not route read requests to the primary DC, but even that's unlikely given $wgLocalHTTPProxy was introduced in T288848: Make HTTP calls work within mediawiki on kubernetes in 2021 and this bug is way older than that (or cross-DC MediaWiki requests). I guess it could have two different causes...

I'm also seeing this - notifications rarely if ever load when they're from a wiki that's different to the one I'm on.

Screenshot 2023-10-14 at 09.08.37.png (554×1 px, 67 KB)

I wanted to create a new task, but I think that this one is actually the same. If this isn't the same, let me know, and I'll open a new one; I didn't want to create a duplicate.

I just opened mediawiki.org, and there are notifications from Serbian Wikipedia. When I click on alerts and then expand, it doesn't display anything on mediawiki.org, although it is supposed to.
When I go on Serbian Wikipedia, it works normally.

Having the same issue and it's occurring on the meta-wiki side as well.

image1.png (231×519 px, 10 KB)

image2.png (409×522 px, 30 KB)

I can't imagine why calling the primary datacenter would be a problem in this case, unless there is a logical race condition or, worse, we completely rely on data being in memcached and/or any other datastore that's not replicated.

I don't think T347781 is the right way to go, honestly. I'd rather not duplicate the logic used at the edge inside of mediawiki itself if at all possible.

So let me rewind a bit and ask: why would calling the primary datacenter cause an issue? What data pertaining to Echo notifications can't be found in the primary DC?

We need to dig a bit more to determine what triggers this:

  • Does this happen only when people are normally directed to the secondary datacenter? Or is this an issue even when people hit the current primary?
  • What URL gets requested by MediaWiki? What changes in the request between the two cases: just the destination datacenter, or something else? For instance, we go via the service mesh when using the local proxy; it would be interesting to see what happens if we go to the edge caches via the webproxy instead.

I suspect the fix I made for T342201 actually might have solved this issue as well. Not sure how to verify it though.

@matmarex do you have a way to verify if the bug still presents itself? It's slightly hard for me as I mostly edit mediawiki.org or wikitech.

matmarex assigned this task to Joe.

Cross-wiki notifications reliably show up for me now when testing on https://www.mediawiki.org/ and https://test.wikidata.org/, and previously they never worked (T223413#9240938), so unless someone else can still reproduce, I think we can consider this fixed. Thank you!

> I can't imagine why calling the primary datacenter would be a problem in this case, unless there is a logical race condition or, worse, we completely rely on data being in memcached and/or any other datastore that's not replicated.

This issue doesn't look like a memcached failure to me. T342201 & co are related to the CentralAuth token store, one of the few components which use memcached not as a cache but as a way to exchange information (in that case, a proof-of-authenticity token) between servers. If the read fails, the entire interaction fails.

That's not the case here: the Echo request doesn't fail, it just has incorrect results. Unless there is some very broken error handling somewhere in Echo (possible, I guess), I don't see how it could be caused by an object cache lookup failure. And as I said, this error seems completely deterministic and independent of which DC you are being routed to.
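The distinction drawn above, between memcached used as a cache and memcached used to hand a token from one server to another, can be illustrated with a toy Python sketch. Everything here is hypothetical (the `Store` class, key names, and functions are stand-ins, not CentralAuth or Echo code); it only shows why a lost entry is harmless in the first pattern and fatal in the second.

```python
class Store:
    """Stand-in for memcached: entries may disappear at any time."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def evict(self, key):
        self._data.pop(key, None)


def cached_lookup(store, key, recompute):
    """Cache pattern: a miss only costs a recomputation."""
    value = store.get(key)
    if value is None:
        value = recompute()
        store.set(key, value)
    return value


def redeem_token(store, token):
    """Exchange pattern: the store is the only copy, so a miss is fatal."""
    user = store.get("token:" + token)
    if user is None:
        raise LookupError("token not found; the interaction cannot continue")
    return user


store = Store()
store.set("count:ExampleUser", 1)
store.set("token:abc", "ExampleUser")
store.evict("count:ExampleUser")
store.evict("token:abc")

print(cached_lookup(store, "count:ExampleUser", lambda: 1))  # recomputed: 1
try:
    redeem_token(store, "abc")
except LookupError as err:
    print("failed:", err)
```

In the symptom reported on this task the request succeeds but returns an empty list, which matches neither branch cleanly; that is the reason given above for doubting a plain lookup failure.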

> I don't think T347781 is the right way to go, honestly. I'd rather not duplicate the logic used at the edge inside of mediawiki itself if at all possible.

T347781 is a bug in its own right and should be fixed, regardless of whether it causes this one (I don't think so - this issue long precedes the existence of $wgLocalHTTPProxy). Duplicating logic is not great. Not applying the logic to operations it should apply to is definitely worse.

> That's not the case here: the Echo request doesn't fail, it just has incorrect results. Unless there is some very broken error handling somewhere in Echo (possible, I guess), I don't see how it could be caused by an object cache lookup failure. And as I said, this error seems completely deterministic and independent of which DC you are being routed to.

I think the possible reason is that in the case of Kubernetes, you'd get the information from two different memcached clusters between requests, while you'd consistently go to the primary-DC one on bare metal, due to the misconfiguration I described. Basically, I'm saying that this problem should only have been present if your request was sent to the secondary DC (which is currently true for anyone in, say, Europe or Africa).

It's perfectly possible there are multiple bugs that can cause cross-wiki notifications to fail, but I can completely see failure scenarios in which we get the different results due to caches being written inconsistently between datacenters.
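As a rough illustration of that failure scenario, here is a toy Python model of two per-datacenter cache clusters that do not replicate to each other. The cluster names, key, and helper functions are assumptions made for the sketch and do not reflect the actual mcrouter or service-mesh configuration.

```python
# One independent, non-replicated cache per datacenter (toy model).
clusters = {"primary": {}, "secondary": {}}


def cache_write(dc, key, value):
    """Write to the cache cluster local to the given datacenter."""
    clusters[dc][key] = value


def cache_read(dc, key):
    """Read from the cache cluster local to the given datacenter."""
    return clusters[dc].get(key)


# The user's request is served in the secondary DC and stores a value there.
cache_write("secondary", "example-key", "value-for-ExampleUser")

# An internal follow-up request that is routed to the primary DC cannot see it,
# while one that stays in the secondary DC can.
print(cache_read("primary", "example-key"))    # None
print(cache_read("secondary", "example-key"))  # value-for-ExampleUser
```

Under this model, users whose traffic already lands in the primary DC would never hit the mismatch, which is consistent with the suggestion that only requests sent to the secondary DC should have been affected.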

I am in the USA, and for me the bug with MediaWiki.org failing to display enwiki notifications went away about a month ago. I was seeing it consistently before, and haven't seen it since. Hope this info helps.

> I am in the USA, and for me the bug with MediaWiki.org failing to display enwiki notifications went away about a month ago. I was seeing it consistently before, and haven't seen it since. Hope this info helps.

Yes, that would fit with the timeline of the switchover (which happens on the week of the equinox) and would further confirm my suspicion that my asinine error was the main cause.

I still don't get how a fix to the networking of k8s could have fixed a bug that precedes k8s, but I can confirm that the bug is not happening anymore.

(Except when notifications get removed, e.g. because the corresponding page gets deleted, so by the time you click to expand the "More alerts from another wiki" line, there is nothing to show. That still happens sometimes, but is unrelated and more of a UX issue than a problem with Echo's notification loading logic.)

Got this again today. I recently turned back on mediawikiwiki notifications, which I had forgotten I turned off. Perhaps that's why I thought this had gone away.

Steps to reproduce:

  • log out
  • go to mediawikiwiki
  • make a talk page section
  • log in
  • subscribe to the talk page section
  • log out
  • make a post
  • log in
  • go to enwiki
  • click open the notification in a new tab
  • click the notification tray icon again twice, to close and reopen the tray, which refreshes the notification tray
  • click the "more notices from another wiki mediawiki" arrow to flip it from collapsed to expanded

Hard refreshing the page does not fix the bug. Clicking the blue dot next to "more notices from another wiki mediawiki" does not fix the bug.

image.png (815×1 px, 66 KB)

image.png (583×755 px, 20 KB)

image.png (439×711 px, 15 KB)

The notification counts between mediawikiwiki and enwiki have diverged. Mediawikiwiki says 22, enwiki oscillates between 22 and 23.

image.png (96×95 px, 1 KB)

Note how this enwiki screenshot says 22 on the tray icon, but there are actually 23 notifications. The tray icon said 23 a few minutes ago but recently changed.

image.png (615×1 px, 48 KB)