
ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
Closed, Resolved (Public)

Description

At around 2024-03-06 01:20 UTC we had a page: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad). This seems to be the result of an increase in the 500 error rate that started some 12 hours earlier:

[Attachment: image.png (93 KB), graph showing the increase in 500 errors]

A typical error looks like this:

TypeError [ERR_HTTP_INVALID_HEADER_VALUE]: Invalid value "undefined" for header "content-language"
    at storeHeader (_http_outgoing.js:474:5)
    at processHeader (_http_outgoing.js:469:3)
    at ServerResponse._storeHeader (_http_outgoing.js:368:11)
    at ServerResponse.writeHead (_http_server.js:312:8)
    at handleResponse (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/hyperswitch/lib/server.js:220:22)
    at /srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/hyperswitch/lib/server.js:356:16
    at tryCatcher (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:102:5)
    at Immediate.Async.drainQueues [as _onImmediate] (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:15:14)
    at processImmediate (internal/timers.js:461:21)
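
For context, this failure mode is easy to reproduce locally: Node's http layer validates header values when the response head is written and throws ERR_HTTP_INVALID_HEADER_VALUE for an undefined value, which surfaces to the client as a 500. A minimal standalone sketch follows; it is illustrative only, not RESTBase/hyperswitch code, and the "cached headers" object is made up:

    // repro.js -- minimal illustration: writing an undefined header value makes
    // Node's http layer throw ERR_HTTP_INVALID_HEADER_VALUE before anything is sent.
    const http = require('http');

    const server = http.createServer((req, res) => {
      // Simulates a stored/cached response that is missing content-language.
      const cachedHeaders = { 'content-language': undefined };
      try {
        res.writeHead(200, cachedHeaders); // throws TypeError [ERR_HTTP_INVALID_HEADER_VALUE]
        res.end('ok');
      } catch (err) {
        console.error(err.code, err.message);
        res.writeHead(500);
        res.end();
      }
    });

    server.listen(0, () => {
      http.get({ port: server.address().port }, (clientRes) => {
        console.log('client saw status', clientRes.statusCode); // 500, like the spike above
        clientRes.resume();
        server.close();
      });
    });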

I think there is some precedent here; this seems indicative of an aberrant cached value, one that is missing the content-language header.

What isn't clear is what changed. Perhaps these are previously cached values that are only now being requested (the majority of the associated requests seem to come from bots: GuzzleHttp/7 and Wiktionary Wizard definition populator (contact: cxs6174@gmail.com))?
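
If that is what happened, one defensive option at the serving layer (independent of fixing the source of the bad values) would be to drop undefined header values before writing the response head, so an aberrant cached entry degrades into a response without content-language rather than a 500. This is a rough sketch only, not actual RESTBase/hyperswitch code; sanitizeHeaders is a made-up helper:

    // Hypothetical guard: strip undefined/null header values before writing them out,
    // so a cached response missing content-language doesn't blow up writeHead().
    function sanitizeHeaders(headers) {
      return Object.fromEntries(
        Object.entries(headers || {}).filter(([, value]) => value !== undefined && value !== null)
      );
    }

    // e.g. inside a response handler:
    //   res.writeHead(response.status, sanitizeHeaders(response.headers));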

Event Timeline

Eevans triaged this task as High priority. Mar 6 2024, 3:11 AM
Joe subscribed.

Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.

I would assume either Traffic on the SRE side, or Content-Transform-Team on the development side should be able to help.

Retagged accordingly, please let me know if we can help in any way.

It looks like the path that is causing the increase in errors is: /v1/page/definition
https://logstash.wikimedia.org/goto/aa7c8ff41ebc18641236e0ff2099c9ab

Change 1009201 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Add missing content-language headers

https://gerrit.wikimedia.org/r/1009201
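
The patch itself isn't reproduced here, but the general shape of such a fix in an Express-style service like mobileapps would be to always set content-language on the definition response, falling back to the wiki's language when the payload doesn't carry one. The sketch below is an assumption about the approach, not the actual contents of change 1009201; getDefinition is a hypothetical stand-in for the real lookup:

    // Sketch of the general approach only -- not the actual contents of change 1009201.
    // Ensure /v1/page/definition responses always carry a content-language header.
    const express = require('express');
    const app = express();

    // Hypothetical stand-in for the service's real definition lookup.
    async function getDefinition(domain, title) {
      return { title, lang: undefined /* some stored payloads lack a language */ };
    }

    app.get('/:domain/v1/page/definition/:title', async (req, res) => {
      const definition = await getDefinition(req.params.domain, req.params.title);
      // Fall back to the subdomain language code, e.g. 'en' for en.wiktionary.org,
      // so the header is never set to undefined.
      const lang = definition.lang || req.params.domain.split('.')[0];
      res.set('content-language', lang);
      res.json(definition);
    });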

@hnowlan similar issue. I just sent a patch that fixes it.

Change 1009201 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Add missing content-language headers

https://gerrit.wikimedia.org/r/1009201

I just deployed a patch that should improve things for this issue.
From production:

jgiannelos@deploy2002 curl -v -o /dev/null https://mobileapps.svc.eqiad.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar 2>&1 | grep content-language
< content-language: en
jgiannelos@deploy2002 curl -v -o /dev/null https://mobileapps.svc.codfw.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar 2>&1 | grep content-language
< content-language: en
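
For reference, the same check as a small Node script (assumes Node 18+ for global fetch; the .wmnet service hostnames are only reachable from production hosts, so this is illustrative rather than something to run externally):

    // check-content-language.js -- mirrors the curl checks above.
    const endpoints = [
      'https://mobileapps.svc.eqiad.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar',
      'https://mobileapps.svc.codfw.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar',
    ];

    async function main() {
      for (const url of endpoints) {
        const res = await fetch(url);
        console.log(url, '->', res.headers.get('content-language')); // expect "en" in both DCs
      }
    }

    main().catch((err) => { console.error(err); process.exit(1); });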

> Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.
>
> I would assume either Traffic on the SRE side, or Content-Transform-Team on the development side should be able to help.
>
> Retagged accordingly, please let me know if we can help in any way.

For the sake of posterity/learning, my (admittedly thin) rationale went something like:

  • ownership of the impacting service seemed err...murky
  • I had a vague recollection of a similar issue that claime and hnowlan had helped work on (which turned out to be T356369)
  • it was late here; I'd made the call that it didn't warrant waking anyone, but I wasn't confident it should wait beyond start-of-day in Europe
  • Jgiannelos seemed like the right person to tap, but I didn't want to rely on assignment

The error rate seems to be back at previous levels after deploying the fix.