
ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad)
Closed, Resolved (Public)

Description

At around 2024-03-06 01:20 UTC we had a page: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad). This seems to be the result of an increase in the 500 error rate that started some 12 hours earlier:

[Attachment: image.png (93 KB), graph showing the increase in 500 errors]

A typical error looks like this:

TypeError [ERR_HTTP_INVALID_HEADER_VALUE]: Invalid value "undefined" for header "content-language"
    at storeHeader (_http_outgoing.js:474:5)
    at processHeader (_http_outgoing.js:469:3)
    at ServerResponse._storeHeader (_http_outgoing.js:368:11)
    at ServerResponse.writeHead (_http_server.js:312:8)
    at handleResponse (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/hyperswitch/lib/server.js:220:22)
    at /srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/hyperswitch/lib/server.js:356:16
    at tryCatcher (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:102:5)
    at Immediate.Async.drainQueues [as _onImmediate] (/srv/deployment/restbase/deploy-cache/revs/e5ed8d0f95671701df291f786f4c0972d2e72142/node_modules/bluebird/js/release/async.js:15:14)
    at processImmediate (internal/timers.js:461:21)
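
For context, this failure mode is easy to reproduce locally: Node's http layer validates header values when the response head is written and throws ERR_HTTP_INVALID_HEADER_VALUE for an undefined value, which surfaces to the client as a 500. A minimal standalone sketch follows; it is illustrative only, not RESTBase/hyperswitch code, and the "cached headers" object is made up:

    // repro.js -- minimal illustration: writing an undefined header value makes
    // Node's http layer throw ERR_HTTP_INVALID_HEADER_VALUE before anything is sent.
    const http = require('http');

    const server = http.createServer((req, res) => {
      // Simulates a stored/cached response that is missing content-language.
      const cachedHeaders = { 'content-language': undefined };
      try {
        res.writeHead(200, cachedHeaders); // throws TypeError [ERR_HTTP_INVALID_HEADER_VALUE]
        res.end('ok');
      } catch (err) {
        console.error(err.code, err.message);
        res.writeHead(500);
        res.end();
      }
    });

    server.listen(0, () => {
      http.get({ port: server.address().port }, (clientRes) => {
        console.log('client saw status', clientRes.statusCode); // 500, like the spike above
        clientRes.resume();
        server.close();
      });
    });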

I think there is some precedent here; this seems indicative of an aberrant cached value, one that is missing the content-language header.

What isn't clear is what changed. Perhaps these are previously cached values that are only now being requested (the majority of the associated requests seem to come from bots: GuzzleHttp/7 and Wiktionary Wizard definition populator (contact: cxs6174@gmail.com))?
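
If that is what happened, one defensive option at the serving layer (independent of fixing the source of the bad values) would be to drop undefined header values before writing the response head, so an aberrant cached entry degrades into a response without content-language rather than a 500. This is a rough sketch only, not actual RESTBase/hyperswitch code; sanitizeHeaders is a made-up helper:

    // Hypothetical guard: strip undefined/null header values before writing them out,
    // so a cached response missing content-language doesn't blow up writeHead().
    function sanitizeHeaders(headers) {
      return Object.fromEntries(
        Object.entries(headers || {}).filter(([, value]) => value !== undefined && value !== null)
      );
    }

    // e.g. inside a response handler:
    //   res.writeHead(response.status, sanitizeHeaders(response.headers));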

Event Timeline

Eevans triaged this task as High priority. Mar 6 2024, 3:11 AM
Joe subscribed.

Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.

I would assume either Traffic on the SRE side, or Content-Transform-Team on the development side should be able to help.

Retagged accordingly, please let me know if we can help in any way.

It looks like the path that is causing the increase in errors is: /v1/page/definition
https://logstash.wikimedia.org/goto/aa7c8ff41ebc18641236e0ff2099c9ab

Change 1009201 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Add missing content-language headers

https://gerrit.wikimedia.org/r/1009201
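
The patch itself isn't reproduced here, but the general shape of such a fix in an Express-style service like mobileapps would be to always set content-language on the definition response, falling back to the wiki's language when the payload doesn't carry one. The sketch below is an assumption about the approach, not the actual contents of change 1009201; getDefinition is a hypothetical stand-in for the real lookup:

    // Sketch of the general approach only -- not the actual contents of change 1009201.
    // Ensure /v1/page/definition responses always carry a content-language header.
    const express = require('express');
    const app = express();

    // Hypothetical stand-in for the service's real definition lookup.
    async function getDefinition(domain, title) {
      return { title, lang: undefined /* some stored payloads lack a language */ };
    }

    app.get('/:domain/v1/page/definition/:title', async (req, res) => {
      const definition = await getDefinition(req.params.domain, req.params.title);
      // Fall back to the subdomain language code, e.g. 'en' for en.wiktionary.org,
      // so the header is never set to undefined.
      const lang = definition.lang || req.params.domain.split('.')[0];
      res.set('content-language', lang);
      res.json(definition);
    });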

@hnowlan similar issue. I just sent a patch that fixes it.

Change 1009201 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Add missing content-language headers

https://gerrit.wikimedia.org/r/1009201

I just deployed a patch that should improve things for this issue.
From production:

jgiannelos@deploy2002 curl -v -o /dev/null https://mobileapps.svc.eqiad.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar 2>&1 | grep content-language
< content-language: en
jgiannelos@deploy2002 curl -v -o /dev/null https://mobileapps.svc.codfw.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar 2>&1 | grep content-language
< content-language: en
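
For reference, the same check as a small Node script (assumes Node 18+ for global fetch; the .wmnet service hostnames are only reachable from production hosts, so this is illustrative rather than something to run externally):

    // check-content-language.js -- mirrors the curl checks above.
    const endpoints = [
      'https://mobileapps.svc.eqiad.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar',
      'https://mobileapps.svc.codfw.wmnet:4102/en.wiktionary.org/v1/page/definition/saltar',
    ];

    async function main() {
      for (const url of endpoints) {
        const res = await fetch(url);
        console.log(url, '->', res.headers.get('content-language')); // expect "en" in both DCs
      }
    }

    main().catch((err) => { console.error(err); process.exit(1); });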

> Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.
>
> I would assume either Traffic on the SRE side, or Content-Transform-Team on the development side should be able to help.
>
> Retagged accordingly, please let me know if we can help in any way.

For the sake of posterity/learning, my (admittedly thin) rationale went something like:

  • ownership of the impacting service seemed err...murky
  • I had a vague recollection of a similar issue that claime and hnowlan had helped work on (which turned out to be T356369)
  • it was late here; I'd made the call that it didn't warrant waking anyone, but I wasn't confident it should wait beyond start-of-day in Europe
  • Jgiannelos seemed like the right person to tap, but I didn't want to rely on assignment

The error rate seems to be back at previous levels after deploying the fix.