
MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error
normalized_message: MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki}
api_response:
data.error.*: See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.
data.error.code: badtoken
data.error.info: The centralauthtoken is not valid.
Impact

Unknown, unable to reproduce.

Notes

Happens when ApiEchoUnreadNotificationPages is invoked in the secondary DC: it stores a centralauthtoken and makes a cross-DC request to the Echo API on another wiki, signed with that token; reading the token back from the store in the other DC then fails.
(Ideally the API request should be made within the current DC, but the routing logic in MWHttpRequest is naive and always sends MediaWiki requests to the primary: T347781: MWHttpRequest should not route read requests to the primary DC)
Mostly happens for requests initiated from MW-on-k8s, although a few such errors occur on non-k8s servers as well (~99% of all errors are on k8s).
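
For reference, the flow is roughly the following (a hedged sketch, not Echo's actual code; the meta=unreadnotificationpages module name and exact parameters are assumptions):

use MediaWiki\MediaWikiServices;

// 1) The local wiki (handling the request in the secondary DC) issues a one-time
//    token via action=centralauthtoken, which writes it to the shared token store.
// 2) The cross-wiki API call to the other wiki carries that token.
// 3) The other wiki's CentralAuthTokenSessionProvider reads the token back from
//    the store; if that read fails, the API responds with badtoken.
$token = '...'; // one-time token obtained in step 1
$http = MediaWikiServices::getInstance()->getHttpRequestFactory()->createMultiClient();
$resp = $http->runMulti( [
    'dewiki' => [
        'method' => 'GET',
        'url' => 'https://de.wikipedia.org/w/api.php',
        'query' => [
            'action' => 'query',
            'format' => 'json',
            'meta' => 'unreadnotificationpages',
            'centralauthtoken' => $token,
        ],
    ],
] );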

Event Timeline


$wgCentralAuthTokenCacheType is mcrouter-primary-dc. Maybe something wrong with the k8s mcrouter setup?

FTR, it is now possible to get a k8s shell.php session in production:

[urbanecm@deploy1002 ~]$ mw-debug-repl cswiki
Becoming root...
Finding a mw-debug pod in eqiad...
Now running shell.php for cswiki inside pod/mw-debug.eqiad.pinkunicorn-76755b7fbc-djchl...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
>

Might be useful when looking into this.

Logstash: 531 instances in the last 30 days, 499 on k8s, 32 on appservers. Many different servers, multiple wikis; second most common wiki is mediawiki.org, which is unusual given the low traffic there. Started during 1.41.0-wmf.13 but isn't an obvious train-related error: it's very low volume for a week or so, then becomes more frequent. There is a huge spike (132 events) on July 14 at 13:15; no correlation in SAL.

Now running shell.php for enwiki inside pod/mw-debug.eqiad.pinkunicorn-6bb999f46c-28lpr...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> $wgCentralAuthTokenCacheType
= "mcrouter-primary-dc"

> $srv = \MediaWiki\MediaWikiServices::getInstance()
> $casm = $srv->get('CentralAuth.CentralAuthSessionManager')
> $cats = $casm->getTokenStore()
> dump -a $cats
MemcachedPeclBagOStuff {#3483
  // ...snip...
  #client: Memcached {#6022
    servers: [
      [
        "host" => "127.0.0.1",
        "port" => 11213,
        "type" => "TCP",
      ],
    ],
  },
  #routingPrefix: "/eqiad/mw/",
  #keyspace: "enwiki",
}

Same result on a non-k8s host.

Echo is the only source of "The centralauthtoken is not valid." log entries. Not sure if anything else uses a centralauthtoken-based backend API call.
CentralAuthTokenSessionProvider returns the same error code for absolutely everything and logs at debug level. We could (probably should) crank up the log level to see how often various errors are happening.

Change 941036 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/CentralAuth@master] Raise logging of token-based session failures to info

https://gerrit.wikimedia.org/r/941036

second most common wiki is mediawiki.org, which is unusual given the low traffic there.

I think that happens because mediawiki.org is one of the wikis (and apart from testing/closed wikis, the only CA-connected one) that's served fully by MW-on-K8s, cf. docs and config. All other wikis (with few exceptions) are on 1% K8S traffic. Assuming this is indeed somehow k8s related, mediawiki.org being fully k8s-served disproportionally affects the number of log entries it has.

CentralAuthTokenSessionProvider returns the same error code for absolutely everything and logs at debug level. We could (probably should) crank up the log level to see how often various errors are happening.

Agreed; +2'ed the patch and moving to In Progress, as this seems to need additional investigation based on the new log entries.

Change 941036 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Raise logging of token-based session failures to info

https://gerrit.wikimedia.org/r/941036

We've encountered spurious memcached timeouts (likely due to packet loss) with MW on k8s since forever. Are there any memcached errors logged as part of the same request?

@Sgs and I did some synchronous investigation of the issue. The logged error is "CentralAuthTokenSessionProvider::provideSessionInfo: centralauthtoken is invalid". I tried reproducing in shell.php and mw-debug-repl, but I wasn't able to determine the reason for the token being invalid. I was able to make cross-wiki API calls using the centralauthtoken, even including the token-providing wiki being on k8s (mediawikiwiki) and the token-consuming wiki not (testwiki / cswiki). I'm increasingly convinced the issue is actually at the code level rather than the infrastructure level, but only shows itself within k8s for some reason.

centralauthtoken is invalid is a token store lookup miss (or a malformed object in the store, but that seems unlikely). The store is mcrouter-primary-dc (and has been for years). The lookup is done via WaitConditionLoop so it shouldn't be a race condition (unless the delay is over 3 sec, which again seems unlikely).
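
For illustration, the lookup is essentially a WaitConditionLoop poll like this (a simplified sketch, not the actual CentralAuthUtilityService code):

use Wikimedia\WaitConditionLoop;

$value = false;
$loop = new WaitConditionLoop(
    static function () use ( $tokenStore, $key, &$value ) {
        $value = $tokenStore->get( $key );
        return $value !== false
            ? WaitConditionLoop::CONDITION_REACHED
            : WaitConditionLoop::CONDITION_CONTINUE;
    },
    3.0 // give up after ~3 seconds
);
if ( $loop->invoke() !== WaitConditionLoop::CONDITION_REACHED ) {
    // surfaces as "centralauthtoken is invalid"
}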

We've encountered spurious memcached timeouts (likely due to packet loss) with MW on k8s since forever. Are there any memcached errors logged as part of the same request?

That would make sense. Does mcrouter-primary-dc on k8s route to memcached servers which are themselves on k8s, though?

Echo Unexpected API response errors in last 90 days (2000 in total):

Screenshot Capture - 2023-09-03 - 21-23-35.png (478×1 px, 39 KB)

CentralAuth centralauthtoken is invalid errors in last 90 days (1,200,000 in total; the start of the errors matches the log level change):

Screenshot Capture - 2023-09-03 - 21-23-15.png (482×1 px, 48 KB)

So, no obvious correlation.
There are also ~250 other kinds of CentralAuthTokenSessionProvider::provideSessionInfo errors in total, not enough to be relevant.

The error distribution by server group:

api_appserver: 1,227,062
kube-mw-api-ext: 19,238
appserver: 787

I'm not convinced the centralauthtoken is invalid errors are related to the k8s-only Echo errors. We should probably figure out why there are ~25 centralauthtoken lookup failures per minute, though.

All centralauthtoken is invalid errors happen on meta.wikimedia.org (1,118,528 errors in total), commons.wikimedia.org (34,922), www.wikidata.org (23,799), it.wikipedia.org (6,648) or hi.wikipedia.org (5,700). Again, I don't know what to make of that. There are extension features that use centralauthtoken for Commons (ForeignUpload) and probably Wikidata (the interwiki editor thingy?), and there might be some gadget that does the same for meta... but what's special about itwiki and hiwiki?

[…] The store is mcrouter-primary-dc (and has been for years). The lookup is done via WaitConditionLoop so it shouldn't be a race condition (unless the delay is over 3 sec, which again seems unlikely).

For most of our history, tokens like this were stored in the local-dc Redis ("redis_sessions", aka MainStash).

This changed last year. The short-lived tokens were the last user of the legacy non-Multi-DC Redis, and were moved to Memcached via the newly created mcrouter-primary-dc route in mcrouter. I believe this is the first time we store these tokens in Memcached. Details at T267270, T278392: Storage solution for cross-datacenter tokens and wikitech:Multi-DC MediaWiki.

Having said that, while a major change, I don't think it explains the issue here since it clearly started with, and remains specific to, mw-on-k8s.

We've encountered spurious memcached timeouts (likely due to packet loss) with MW on k8s since forever. Are there any memcached errors logged as part of the same request?

That would make sense. Does mcrouter-primary-dc on k8s route to memcached servers which are themselves on k8s, though?

I think Máté means that reaching memcached (outside k8s) from within mw-on-k8s has shown to be less reliable.

We know that at WMF we've had issues with the inverse at least (reaching a k8s service from baremetal), where packet loss / timeouts seem to be induced by k8s' networking features. E.g. T292663, T341634, and T249700: k8s/mw: traffic to eventgate dropped by iptables.

[…] while a major change, I don't think it explains the issue here since it clearly started with, and remains specific to, mw-on-k8s.

Actually, it's more specific than that. Searching for message:"Unexpected API response from" on Logstash: mediawiki dashboard over the past 7 days:

Screenshot 2023-09-07 at 22.03.12.png (1×1 px, 191 KB)

The issue happens only in mw-on-k8s inside Codfw. Even the tiny proportion of errors from baremetal servers is limited to Codfw:

Screenshot 2023-09-07 at 22.03.26.png (646×2 px, 106 KB)

I am gonna add one more data point. In all of these errors, the data.servedby stanza refers to an *eqiad* API server. I looked a bit at the distribution of those API servers to see if there is any pattern that would identify one or more specific ones; (thankfully?) that is not the case: apparently any eqiad API server can appear in this dataset.

So we are making cross-DC API calls here. MW-on-K8s could somehow exhibit a behavior that makes what appears in this task more probable to happen (e.g. by increasing latencies in some edge cases), but it is probably not the only factor at play here. Since the secondary DC (codfw at the time of this writing) isn't theoretically supposed to be getting RW requests, is it possible to just shift Echo to use the DC-local API cluster?

Great find @akosiaris! This sounds very much like the cause here. I investigated the request flow in Echo, and Echo is doing something like this:

[urbanecm@deploy1002 ~]$ mw-debug-repl --datacenter codfw --wiki cswiki
Becoming root...
Finding a mw-debug pod in codfw...
Now running shell.php for cswiki inside pod/mw-debug.codfw.pinkunicorn-7b8b5bf4c7-2j6x4...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> use MediaWiki\MediaWikiServices
> $http = MediaWikiServices::getInstance()->getHttpRequestFactory()->createMultiClient()
= MultiHttpClient {#3502}

> $resp = $http->runMulti(['dewiki' => ['method' => 'GET', 'url' => 'https://de.wikipedia.org/w/api.php', 'query' => ['action' => 'query', 'format' => 'json', 'meta' => 'siteinfo', 'centralauthtoken' => 'invalid']]]) 
= [
    "dewiki" => [
      "method" => "GET",
      "url" => "http://localhost:6501/w/api.php",
      "query" => [
        "action" => "query",
        "format" => "json",
        "meta" => "siteinfo",
        "centralauthtoken" => "invalid",
      ],
      "response" => [
        "code" => 200,
        "reason" => "OK",
        "headers" => [
          "date" => "Fri, 15 Sep 2023 09:54:52 GMT",
          "server" => "envoy",
          "x-powered-by" => "PHP/7.4.33",
          "x-content-type-options" => "nosniff",
          "mediawiki-api-error" => "badtoken",
          "x-frame-options" => "DENY",
          "content-disposition" => "inline; filename=api-result.json",
          "cache-control" => "private, must-revalidate, max-age=0",
          "vary" => "Accept-Encoding",
          "x-request-id" => "36861369-e4cb-4181-acde-23252adcfea6",
          "content-length" => "358",
          "backend-timing" => "D=3052661 t=1694771692407212",
          "content-type" => "application/json; charset=utf-8",
          "x-envoy-upstream-service-time" => "3187",
        ],
        "body" => "{"error":{"code":"badtoken","info":"The centralauthtoken is not valid.","*":"See https://de.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"mw1361"}",
        "error" => "",
        0 => 200,
        1 => "OK",
        2 => [
          "date" => "Fri, 15 Sep 2023 09:54:52 GMT",
          "server" => "envoy",
          "x-powered-by" => "PHP/7.4.33",
          "x-content-type-options" => "nosniff",
          "mediawiki-api-error" => "badtoken",
          "x-frame-options" => "DENY",
          "content-disposition" => "inline; filename=api-result.json",
          "cache-control" => "private, must-revalidate, max-age=0",
          "vary" => "Accept-Encoding",
          "x-request-id" => "36861369-e4cb-4181-acde-23252adcfea6",
          "content-length" => "358",
          "backend-timing" => "D=3052661 t=1694771692407212",
          "content-type" => "application/json; charset=utf-8",
          "x-envoy-upstream-service-time" => "3187",
        ],
        3 => "{"error":{"code":"badtoken","info":"The centralauthtoken is not valid.","*":"See https://de.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"mw1361"}",
        4 => "",
      ],
      "headers" => [
        "host" => "de.wikipedia.org",
        "content-length" => 0,
        "user-agent" => "MediaWiki/1.41.0-wmf.26",
      ],
      "proxy" => false,
      "body" => "",
      "flags" => [],
    ],
  ]

> ^D

   INFO  Ctrl+D.

[urbanecm@deploy1002 ~]$

The request gets routed through localhost:6501, an Envoy proxy for api-rw.discovery.wmnet, which resolves to api.svc.eqiad.wmnet in both DCs. This explains how the request ends up in eqiad.
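
For context, the relevant settings look roughly like this (values as described above; this is a sketch, not the literal wmf-config):

$wgLocalHTTPProxy = 'localhost:6501'; // Envoy listener for api-rw.discovery.wmnet
$wgLocalVirtualHosts = [ 'de.wikipedia.org' /* , ...all locally hosted domains */ ];
// MWHttpRequest routes any request whose host is in $wgLocalVirtualHosts
// through $wgLocalHTTPProxy, which currently always means the primary DC.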

is it possible to just shift Echo to use the DC-local API cluster?

For the read-only requests, I don't see why not. From a MW perspective, I only need a different proxy to route the request through. I guess we're talking about api-ro.discovery.wmnet. If so, I can't seem to find an Envoy proxy that would connect me there. Am I missing something? Is there a different way to route the request locally?

The writing requests (coming e.g. from ApiEchoMarkRead) would definitely need to reach eqiad. However, as @akosiaris rightly says, "the secondary DC isn't theoretically supposed to be getting RW requests". This leads me to a much bigger question...

...why is wgLocalHTTPProxy (which configures the proxy to use when MW talks to itself) set to localhost:6501 (aka api-rw.discovery.wmnet) to begin with? If a request is a writing one, it should be in the primary DC anyway, so api-ro and api-rw would resolve to the same DC. If it is a reading request, it can be in either DC, but so can the subrequests. Granted, we might have "reading" requests that fire writing subrequests, and then things would break, but I'd argue both requests should really be classified in the same way (either both the user-facing and internal requests are writing, or neither of them is).

ApiEchoUnreadNotificationPages is a readonly API, so as @Urbanecm_WMF says, the root problem here is that MediaWiki core doesn't have a concept of making a read vs. a write request to another MediaWiki, and always routes through $wgLocalHTTPProxy when the target domain is in $wgLocalVirtualHosts. We could probably duplicate some routing logic in the MediaWiki HTTP request layer so it sends to a read or write proxy depending on whether the request is a POST / has a Promise-Non-Write-API-Action header.
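
Something like the following, perhaps (purely a hypothetical sketch: chooseLocalProxy() and $wgLocalHTTPProxyReadOnly don't exist, the names are made up for illustration):

// Pick a proxy for an internal MediaWiki-to-MediaWiki request depending on
// whether the subrequest can write.
function chooseLocalProxy( string $method, array $headers ): string {
    global $wgLocalHTTPProxy, $wgLocalHTTPProxyReadOnly;

    $isWrite = $method === 'POST'
        && !isset( $headers['Promise-Non-Write-API-Action'] );

    // Writes must reach the primary DC (api-rw); reads could stay DC-local (api-ro).
    return $isWrite ? $wgLocalHTTPProxy : $wgLocalHTTPProxyReadOnly;
}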

Would be nice to know what the direct issue was, though. A race condition between writing the CentralAuth token to the memcached cluster of the other DC and making the followup request? Shouldn't memcached be much faster?

[…] A race condition between writing the CentralAuth token to the memcached cluster of the other DC and making the followup request? Shouldn't memcached be much faster?

I don't think such a race is possible since the token store is intentionally set to mcrouter-primary-dc. Not only does it not rely on replication to happen fast enough, there is no replication.

Why is Echo making internal HTTP requests in the first place? For the badge, it afaik reads the global database directly.

For the client-side JS interactions, if I recall correctly, the problem is that we don't store notifications in their rendered form, but instead render them on demand and thus need the original wiki's extension config and local settings to render them, which is why we talk to the origin wiki directly. But why does that API request need to "proxy" through the local wiki's API first?

Why is Echo making internal HTTP requests in the first place? For the badge, it afaik reads the global database directly.

Good question... Using the global database seems like something that's supposed to work. @Tgr, do you know why it is implemented in this way?

I don't know if that was the original thinking around it, but there are potentially many other wikis to gather notifications from. Since centralauthtoken is single-use, you would have to do them sequentially, and precede each with a request to the local wiki to get the token. It would be extremely inefficient.

Ugh, ForeignWikiRequest actually does parallelize these requests. So probably all but the first fail when the user has unread notifications from more than one foreign wiki.

Maybe we should introduce a per-wiki version of centralauthtoken. Or something server-side-only that works more like an access token in that it remains valid for some period of time, regardless of how much it is used.

Nevermind, I got myself confused about how those tokens work. Getting many centralauthtokens in parallel should be fine as long as each one is only used once.

413 hits in the last 7 days (150 in the last day; seems to start with wmf.28). Logstash link:

Screen Shot 2023-09-28 at 1.27.52 PM.png (706×2 px, 117 KB)

That's funny… it's not new, but it disappeared for a couple of days. Zooming out a little on your logstash chart: https://logstash.wikimedia.org/goto/1252abf420fa68343312fe3427c8f4ba

image.png (537×2 px, 128 KB)

That's funny… it's not new, but it disappeared for a couple of days.

As Alex said, it's a cross-DC issue, and for the first week of the DC switchover we were running from one DC only.

So to recap, there are two issues here:

  • When MediaWiki makes a request to itself (or, presumably, to RESTBase or similar services), that's always routed to the primary DC, which in the case of read requests made from the secondary DC is unnecessary, bad for performance, and in the specific case of Echo also triggers some sort of infrastructure(?) bug. Split out into T347781: MWHttpRequest should not route read requests to the primary DC.
  • The Echo-specific bug: centralauthtoken not found in the primary DC memcache when written from the secondary and read from the primary.

Either writing fails in the first place and that's not properly handled (ApiCentralAuthToken doesn't even check for success; even if it did, mcrouter uses AllAsyncRoute cross-DC, so PHP doesn't learn of failures), or reading fails (in theory CentralAuthUtilityService::getKeyValueUponExistence() differentiates memcached errors from the key simply not being there and logs "not found due to I/O error", which is almost nonexistent in Logstash, but I'm not sure how reliably the memcached client detects errors).
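
For illustration, the read-side distinction looks roughly like this (a sketch; the real getKeyValueUponExistence() additionally wraps the lookup in the retry loop mentioned earlier):

// Sketch: separating "key not there" from "memcached request failed". Whether
// the pecl memcached client reliably reports errors here is exactly the open question.
$value = $tokenStore->get( $key );
if ( $value !== false ) {
    // token found
} elseif ( $tokenStore->getLastError() !== BagOStuff::ERR_NONE ) {
    $logger->info( 'not found due to I/O error', [ 'key' => $key ] );
} else {
    // genuine miss: never written, expired, or already consumed
}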

MW-on-K8s could somehow exhibit a behavior that makes what appears in this task more probable to happen (e.g. by increasing latencies in some edge cases), but it is probably not the only factor at play here.

The timeout for the memcached lookup is 3 seconds (CentralAuthUtilityService::getKeyValueUponExistence() continuously retries memcached reads and gives up after 3s). A delay of over three seconds seems implausibly high to be caused either by k8s or by the cross-DC write. So I'd guess @TK-999 is right and this is simply caused by lost/failed memcached requests. Which could be either reads or writes, but for reads I don't see why it would make a difference that the MediaWiki API request which involved the memcache lookup was initiated from the other DC. So I'd guess we have a high failure rate for cross-DC memcached writes via mcrouter AllAsyncRoute when those writes are initiated from k8s. Is there a way to verify that?

So I'd guess we have a high failure rate for cross-DC memcached writes via mcrouter AllAsyncRoute when those writes are initiated from k8s. Is there a way to verify that?

Tagging SRE because I'm not sure what the next step would be here. mcrouter doesn't seem to log errors to logstash and its grafana board does not have error breakdowns.

Krinkle triaged this task as High priority. · Oct 9 2023, 2:46 PM

When mcrouter-primary-dc is selected, we set the routing prefix to /$wmgMasterDatacenter/mw/, so writes are consistently directed there. I just found a bug in the configuration of mcrouter on k8s that should solve the issue once I've fixed it.
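
For context, the token store definition is roughly shaped like this (guessed from the shell dump earlier in this task; a sketch, not the literal wmf-config):

$wgObjectCaches['mcrouter-primary-dc'] = [
    'class' => MemcachedPeclBagOStuff::class,
    'servers' => [ '127.0.0.1:11213' ],             // local mcrouter instance
    'routingPrefix' => "/$wmgMasterDatacenter/mw/", // always routes to the primary DC
];
$wgCentralAuthTokenCacheType = 'mcrouter-primary-dc';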

Change 971943 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: add remote-dc mcrouter pools

https://gerrit.wikimedia.org/r/971943

Change 971943 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add remote-dc mcrouter pools

https://gerrit.wikimedia.org/r/971943

Change 971948 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug fix mcrouter routes there as well

https://gerrit.wikimedia.org/r/971948

Change 971948 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] mw-debug fix mcrouter routes there as well

https://gerrit.wikimedia.org/r/971948

Since my deployment of this change, the MediaWiki\Extension\Notifications\Api\ApiEchoUnreadNotificationPages::getUnreadNotificationPagesFromForeign: Unexpected API response from {wiki} messages seem to have disappeared. I'll wait until this evening and then resolve the task.