
Parsercache issues in codfw causing large-scale outage
Closed, Resolved | Public | PRODUCTION ERROR

Description

https://www.wikimediastatus.net/incidents/b406lmnx5s57

I repeatedly got the following error while trying to access https://commons.wikimedia.org/wiki/Commons:Featured_picture_candidates/candidate_list and https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard/Blocks_and_protections (not editing, just reading):

Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes.
See the error message at the bottom of this page for more information.

Request from 78.242.214.114 via cp6013.drmrs.wmnet, ATS/9.2.5
Error: 502, Broken pipe at 2024-10-24 10:56:45 GMT

Event Timeline

And now, while editing https://commons.wikimedia.org/w/index.php?title=Commons:Featured_picture_candidates/File:Calaveras_of_day_of_the_dead_in_mexico.jpg&action=submit

Request from 89.248.174.2 via cp3067.esams.wmnet, ATS/9.2.5
Error: 502, Broken pipe at 2024-10-24 11:09:38 GMT

https://commons.wikimedia.org/w/index.php?title=Commons:Featured_picture_candidates/File:Narrentag-Oberndorf_2024-Nachtumzug-07985.jpg&action=submit

Original error: upstream connect error or disconnect/reset before headers. reset reason: connection termination

This happens when opening any page on any wiki while logged in.

CDanis renamed this task from Error: 502, Broken pipe via cp6013.drmrs.wmnet to Parsercache issues in codfw causing large-scale outage. Oct 24 2024, 1:10 PM
CDanis triaged this task as High priority.
CDanis updated the task description.
CDanis added projects: DBA, SRE.

Change #1082778 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] changeprop: parsoidCachePrewarm: halve concurrency

https://gerrit.wikimedia.org/r/1082778

Change #1082778 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: parsoidCachePrewarm: halve concurrency

https://gerrit.wikimedia.org/r/1082778

I linked the above task, T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes), as the mitigations may have slowed down JobQueue processing. We may need to take a look: while the spike was almost surely a consequence of the reduced concurrency, it could also show some interesting signal, as the bottleneck or increased load could potentially have started before the incident.
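
To illustrate why halving concurrency can turn a sub-second backlog into minutes of delay, here is a minimal sketch using a simple fluid approximation of a job queue. The function name and every number in the usage note are made up for illustration; they are not measurements from JobQueue or changeprop.

```python
def backlog_delay(arrival_rate, service_time, concurrency, duration):
    """Approximate queueing delay in seconds after `duration` seconds.

    Hypothetical fluid model: jobs arrive at `arrival_rate` per second and
    each of `concurrency` workers finishes one job every `service_time` seconds.
    """
    # Jobs per second the workers can process in total.
    capacity = concurrency / service_time
    # Jobs that pile up when arrivals exceed capacity (never negative).
    backlog = max(0.0, (arrival_rate - capacity) * duration)
    # Time needed to work through the accumulated backlog.
    return backlog / capacity
```

With invented inputs of 100 jobs/s, 0.5 s per job, and 10 minutes of load, `backlog_delay(100, 0.5, 60, 600)` gives no delay, while halving the workers with `backlog_delay(100, 0.5, 30, 600)` gives a delay of 400 seconds. The point is only that once capacity drops below the arrival rate, the backlog grows with time instead of staying flat.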

Ladsgroup moved this task from Refine to Done on the DBA board.
Ladsgroup subscribed.

The issue is over (and has been for weeks now). We are drastically changing how parsercache works, which would reduce the impact of such issues and outages. It doesn't make sense to keep this ticket open.

Change #1108141 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] ParserCache: Set connect and recieve timeouts

https://gerrit.wikimedia.org/r/1108141

Change #1108141 merged by jenkins-bot:

[operations/mediawiki-config@master] ParserCache: Set connect and recieve timeouts

https://gerrit.wikimedia.org/r/1108141
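
For readers unfamiliar with the distinction between connect and receive timeouts, here is a generic sketch using a plain TCP socket. It is not the MediaWiki configuration from the change above; the function name and parameters are hypothetical and only show the general idea of failing fast when a backend is unreachable or slow to answer.

```python
import socket


def fetch_with_timeouts(host, port, payload, connect_timeout, receive_timeout):
    """Send `payload` and return the first chunk of the reply.

    Raises socket.timeout if connecting or receiving takes too long.
    """
    # The timeout passed here bounds each connection attempt.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        # From here on, each blocking send/recv call is bounded by receive_timeout.
        sock.settimeout(receive_timeout)
        sock.sendall(payload)
        return sock.recv(4096)
```

Without a receive timeout, a caller can block for as long as an overloaded backend holds the connection open, so requests pile up behind it; with one, the caller gets an error quickly and can decide how to proceed.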

Mentioned in SAL (#wikimedia-operations) [2025-01-06T12:40:55Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-06T12:46:34Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-06T12:54:34Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1108141|ParserCache: Set connect and recieve timeouts (T378076 T373037)]] (duration: 13m 39s)