
wikifeeds in codfw seems to be failing health checks intermittently
Closed, Resolved · Public

Description

Icinga is reporting failures like:

19:32 +<icinga-wm> PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL:
                   /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL:
                   Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200):
                   /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL:
                   Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200):
                   /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
                   Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200)
                   https://wikitech.wikimedia.org/wiki/Wikifeeds

I spent some time with @hnowlan trying to debug but we didn't find a root cause. Here's what we discovered:

  • wikifeeds pods in codfw are in an evicted state (not all of them), tracked in T290444. This has been happening for a while and, as far as I can see, it doesn't correlate with the alerts.
  • the wikifeeds tls-proxy containers of several pods show a lot of HTTP 503s for various Restbase endpoints.
  • the wikifeeds dashboard doesn't show anything unusual happening, and endpoints like https://en.wikipedia.org/api/rest_v1/#/Feed appear to work.
  • from the dashboard it seems that the wikifeeds pods are CPU-throttled. The throttling has been happening for a while and doesn't correlate with the alert either.
  • Hugh was able to find a way to reproduce the 50x errors, but only intermittently:
elukey@deploy2002:~$ curl https://wikifeeds.discovery.wmnet:4101/en.wikipedia.org/v1/page/random/title
{"status":503,"type":"internal_error","detail":"upstream connect error or disconnect/reset before headers. reset reason: connection failure","method":"GET","uri":"/en.wikipedia.org/v1/page/random/title"}elukey@deploy2002

Last deployment-chart change for wikifeeds: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/715452

Here is a sample error in the Restbase Logstash showing the UA being WikipediaApp (we see the error for both Android and iOS).

Event Timeline

In logstash there is a huge number of 503/504 upstream errors reported by wikifeeds (the app, not tls-proxy): https://logstash.wikimedia.org/goto/6f2cd8f9fe54e4fb897be0689a81f4f8. Unfortunately it does not reveal what the upstream actually is in that context.

I'm pretty tired already, but I feel stuck on two points: wikifeeds' envoy keeps failing with UF (upstream connection failure) towards restbase (if I'm not reading this wrong), and the question of why we had so many more requests from wikifeeds to restbase over the last few days (grafana link from above).

[2021-09-06T21:09:00.733Z] "GET /krc.wikipedia.org/v1/page/summary/%D0%A2%D0%B8%D0%BB HTTP/1.1" 503 UF 0 91 600 - "-" "Wikifeeds/WMF" "68d21d61-5590-407d-990d-a6b9bc8674bd" "localhost:6503" "10.2.1.17:7443"
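
As a rough sketch (assuming the default Envoy access-log format shown above, where the status code and response flags follow the quoted request line), the tls-proxy access logs can be tallied by status code and flag to see how much of the traffic is UF versus timeouts:

// tally-flags.js (illustrative name): count "status responseFlags" pairs (e.g. "503 UF")
// from Envoy access-log lines piped on stdin
const readline = require('readline');

const rl = readline.createInterface({ input: process.stdin });
const counts = {};

rl.on('line', (line) => {
  const m = line.match(/" (\d{3}) (\S+) /);      // e.g. matches: " 503 UF
  if (m) {
    const key = `${m[1]} ${m[2]}`;
    counts[key] = (counts[key] || 0) + 1;
  }
});

rl.on('close', () => console.log(counts));

Fed with something like kubectl logs <wikifeeds-pod> -c tls-proxy | node tally-flags.js (container name as used earlier in this task), it gives a quick per-flag breakdown without going through logstash.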

Since 2021-09-04 restbase has been reporting a lot of connection errors (to what seems to be Wikifeeds, judging from the URI): https://logstash.wikimedia.org/goto/0cea85618c372a01e6a515a767e8ca21

The source of the failure could be this error in Wikifeeds, https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.09.07?id=jmADv3sB9aenX452CN2-, which has occurred a lot in the observed timeframe.

TitleError: title-invalid-utf8
    at Function.newFromText (/srv/service/node_modules/mediawiki-title/lib/index.js:422:19)
    at pageviewsSlice.map (/srv/service/lib/most-read.js:118:34)
    at Array.map (<anonymous>)
    at BBPromise.join (/srv/service/lib/most-read.js:116:43)
    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)
    at Holder$2._callFunction (eval at generateHolderClass (/srv/service/node_modules/bluebird/js/release/join.js:91:16), <anonymous>:14:44)
    at Holder$2.checkFulfillment (eval at generateHolderClass (/srv/service/node_modules/bluebird/js/release/join.js:91:16), <anonymous>:29:30)
    at Promise.eval (eval at thenCallback (/srv/service/node_modules/bluebird/js/release/join.js:13:16), <anonymous>:6:20)
    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:601:21)
    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)
    at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)
    at runCallback (timers.js:705:18)

The weird part is that there's no user input on these requests. Somehow, wikifeeds is retrieving titles with invalid UTF-8 encoding from mwapi. I'm submitting a patch with the fix and looking for a way to properly test it.
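
For illustration, a minimal sketch of the kind of guard such a patch could add (not necessarily the actual change; pageviewsSlice and siteinfo are stand-ins for the values used around lib/most-read.js:118 in the stack trace above): skip entries whose titles mediawiki-title rejects, instead of letting the TitleError fail the whole most-read response.

const Title = require('mediawiki-title').Title;

// Normalize article titles coming back from the pageviews data, dropping the ones
// that mediawiki-title cannot parse (e.g. TitleError: title-invalid-utf8).
function normalizeTitles(pageviewsSlice, siteinfo) {
  return pageviewsSlice
    .map((entry) => {
      try {
        const title = Title.newFromText(entry.article, siteinfo);
        return Object.assign({}, entry, { article: title.getPrefixedDBKey() });
      } catch (err) {
        return null;                             // invalid title: drop this entry below
      }
    })
    .filter((entry) => entry !== null);
}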

Thanks @MSantos

Update from IRC: @JMeybohm and I noticed that in the k8s wikifeeds graphs, the rise in errors (Sept 4th ~02:30 UTC) corresponded to a big spike in timeouts/errors towards the MW API, which corresponds to T290374. @JMeybohm restarted all wikifeeds pods, and we observed an immediate reduction in restbase timeouts and in the envoy error metrics.

So the current leading theory is that the MW API outage left some tls-proxy state (restbase -> wikifeeds, wikifeeds -> restbase) inconsistent, possibly due to old/stale TCP sockets (broken by the MW API outage) that Envoy kept reusing until we forced a restart of the pods.

Change 719221 had a related patch set uploaded (by MSantos; author: MSantos):

[mediawiki/services/wikifeeds@master] prevent titles with invalid utf8 enconding

https://gerrit.wikimedia.org/r/719221

To everyone involved: should we have an incident doc about this? Given the number of people involved and the amount of time that went into this, it might be worth it, especially if there are actionables.

We didn't have time to follow up, but it may be worth an incident doc. The envoy issue that we faced may be either something that could be tuned via config (best-case scenario) or some weird corner case/bug that is difficult to repro (worst-case scenario). I am available to kick off an incident page if we think it is worth it, but I still don't have a complete understanding of why this happened :)

"but I still don't have a complete understanding of why this happened :)"

I think that's the best justification for having an incident doc. With the help of others we might end up figuring out what transpired.

I'd be happy to help and shed some light on this.

Change 719221 merged by jenkins-bot:

[mediawiki/services/wikifeeds@master] prevent titles with invalid utf8 enconding

https://gerrit.wikimedia.org/r/719221

@akosiaris: @JMeybohm and I drafted https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-06_Wikifeeds, which should be in a reasonably good state for a first review. If you have time in the next few days, let us know your thoughts :)

Adding notes in the task before I start moving things into the incident doc itself.

  • Docs are at https://wikitech.wikimedia.org/wiki/Wikifeeds. That being said, wikifeeds is one of the more difficult services to reason about in terms of traffic flows, as it is highly reliant on restbase while restbase is in turn reliant on it. An actionable should be filed to clarify the relationship between the two in the docs.
  • During the 2-day degradation (I am not sure I would call it an outage, but let's not go into semantics here) we had > 33M log lines in logstash from the wikifeeds container itself (errors and otherwise). The mean rate seems to be just north of 0.5M per hour, but many peaks are at 0.8M.
  • A quick distribution visualization (pic below), excluding the very first category (page or revision not found; I have no idea how that ends up there), says:

wikifeeds.png (screenshot: distribution of wikifeeds log messages by type)

Of those:

  • The big culprit has already been addressed by @MSantos, thanks for that.
  • The second one, Page was deleted, makes sense, although I am wondering a bit why it would exist in the first place (as in, how are those pages suggested and how do they make it into the featured or on this day parts?). Plus it's a 400, not a 5xx.

But these are not 5xx!!! Filtering just for 5xx (502, 503, 504 and 500) we get:

wikifeeds.png (screenshot: distribution of 5xx-only wikifeeds log messages)

but timing-wise those are ALL clustered (see https://logstash.wikimedia.org/goto/06a7bf162ebeaad82d900f584129bdc0) around the MediaWiki API outage (we don't have an incident doc for that yet; a Grafana link, though, is https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1630719198066&to=1630730546512).

And this doesn't add up. In grafana https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=15&orgId=1&from=1630528663468&to=1631089519734 we clearly see 5xx lasting for up to 40h at high rates (~7-8 per sec) and then subsiding to lower rates (~1 per sec) for another 37h. There is a discrepancy between logstash and grafana from what I can tell.

We can also tell that from envoy's logging, which shows the same pattern as wikifeeds' grafana. See https://logstash.wikimedia.org/goto/b0cc81bc78b5e8bbf661543c0fd5beef. However, for envoy the first 40h of high rates are followed by what seems like business as usual (it's worth pointing out that envoy only logs errors right now).

As far as the actual spike of log errors in logstash (as well as grafana) around the API outage goes, these are somewhat easy to explain. The UF errors are expected during the API outage. The timeouts are a result of envoy being throttled by the Linux kernel during the outage and not being able to reply in time.

That being said, the envoy throttling itself is perplexing. It shows throttling behavior I haven't been fully able to explain yet. We could bump the limit, but it's worth pointing out that during typical usage system+user CPU is ~1/10 of the limit. If we weren't running kernel 4.19 I would attribute it to 512ac999[1] and co.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=198197

"And this doesn't add up. In grafana https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=15&orgId=1&from=1630528663468&to=1631089519734 we clearly see 5xx lasting for up to 40h at high rates (~7-8 per sec) and then subsiding to lower rates (~1 per sec) for another 37h. There is a discrepancy between logstash and grafana from what I can tell."

Thankfully this is human error (yours truly). The correct logstash link is https://logstash.wikimedia.org/goto/9edce1c96b9703450dec2823ba37e8ca, and there we can see that the patterns match.

A couple more points:

  • https://grafana.wikimedia.org/goto/RY75JPHnz points out that wikifeeds envoy did indeed see the errors, both downstream (the local wikifeeds app trying to connect to it) and upstream (mwapi AND restbase).
  • The errors, aside from a 25-minute window at the beginning of the outage, are exclusively the result of the restbase-for-services-egress listener.
  • Restbase envoy itself issued an order of magnitude more errors than wikifeeds envoy consumed. It peaked at 400 rps at the time of the mwapi outage and continued to issue ~100 errors per second for the next 40 hours, gradually dropping to 50 rps and then 25 rps, and finally dropping to single-digit error rates another 24 hours later.

I am starting to think this outage is a consequence of some issue in restbase that caused lasting effects long after the initial trigger, which was the MediaWiki API outage.

I've gone ahead and created https://grafana-rw.wikimedia.org/d/Y1UyyEH7z/t290445?orgId=1 to depict the findings in grafana a bit more. It covers not the entirety of the incident, but rather an arbitrary window towards the end of it.

The interesting points here:

  • Wikifeeds itself reports having issued at most 11 5xx per second.
  • Wikifeeds envoy reports having issued at most 242 5xx per second. It does so in both the upstream and the downstream stats, and they match.
  • Restbase envoy reports having issued at most 242 5xx per second. It does so in both the upstream and the downstream stats, and they match.
  • Restbase and wikifeeds envoys' 5xx reports match each other.
  • We lack connection statistics for restbase envoy due to our prometheus scraping config.
  • Connection statistics for wikifeeds envoy show at most some 12 connections failing due to timeouts reaching restbase, with the remaining 349 connections idling.
  • At most, 10.2 connections failed due to failure ejections for envoy in the wikifeeds pod.
  • Very, very few wikifeeds requests were terminated because a connection was destroyed while it had active requests. This is consistent with the failed requests above being due to timeouts.
  • Nothing of note for pool overflows, protocol errors, or max requests per connection.

"I am starting to think this outage is a consequence of some issue in restbase that caused lasting effects long after the initial trigger, which was the MediaWiki API outage."

I take that back. This is starting to look like some weird feedback loop between the restbase -> wikifeeds and wikifeeds -> restbase paths.

akosiaris claimed this task.

And all of the above is mostly irrelevant; I was mostly blind and chasing ghosts (on the plus side, I got more acquainted with envoy).

The incident is end-user generated, and I have doubts that the restart of all wikifeeds pods fixed anything; things were already becoming much quieter due to the organic dissipation of requests. As the graph shows, requests to the /page/random/title endpoint more than quadrupled during the peak of the incident. That endpoint is exposed via RESTBase at /api/rest_v1/page/random/summary. Turnilo data points out that during that timeframe, WikipediaApp/6.8.1.1815 iOS apps hit that endpoint a total of ~20M times (Turnilo is sampled; numbers reported by Wikifeeds place it at ~30M). That's considerably more than the usual traffic, especially during peak times.

Other endpoints' traffic increased as well (e.g. the onthisday feed or the featured image ones), albeit not so dramatically.

That led to increased overall traffic as well as increased 50x responses (both 504s and 503s). The traffic dissipated back to closer-to-normal, yet still elevated, levels by 2021-09-06 08:00 UTC, dropping even more by 2021-09-07, with a few more spikes every now and then.

By the 13th we had switched DCs, and codfw no longer sees that amount of traffic since it has moved to eqiad. The patterns follow the distribution we were seeing before the incident.

I'll move my findings to the incident doc, add some actionables, and close this.

Thanks to all who contributed.