
Investigate job queue retries for Parsoid jobs
Closed, Duplicate | Public

Description

In https://wikitech.wikimedia.org/wiki/Incident_documentation/20150103-Parsoid, requests to a specific page seemed to be retried a large number of times. Since requests for this page locked up Parsoid workers, the Parsoid cluster was quickly overloaded.

We should check why so many retries are happening. Things to look into:

  • Varnish backend retries on timeout (both frontend and backend)
  • Parsoid job retries

Possibly related: T73853

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke subscribed.

This happened again today, with another Parsoid bug causing an infinite loop on 2 enwiki pages (and yesterday on 3 plwiki pages). The job queue relentlessly retried the two pages over the last 8 hours; this required 2 restarts and a hotfix deployment to fix the infinite loop.

ssastry triaged this task as Medium priority. Mar 24 2015, 4:47 PM

I thought this had been looked into and that there was nothing odd about job retries (actually, I recall finding and fixing a bug that made the retry count 1 less than it should be).
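
For anyone revisiting this, here is a minimal sketch of how a retry limit can end up off by one, depending on whether the limit counts total attempts or retries after the first attempt. This is not the actual job queue code; `run_with_retries`, `max_retries`, and `always_fails` are hypothetical names used only for illustration.

```python
from typing import Callable


def run_with_retries(job: Callable[[], bool], max_retries: int) -> int:
    """Run `job` once, then retry up to `max_retries` more times on failure.

    Returns the total number of attempts made (hypothetical helper).
    """
    attempts = 0
    # One initial attempt plus `max_retries` retries. Writing the bound as
    # `attempts < max_retries` instead would silently allow one fewer run.
    while attempts < 1 + max_retries:
        attempts += 1
        if job():
            break
    return attempts


def always_fails() -> bool:
    # Stand-in for a page whose parse never completes successfully.
    return False


# A job that never succeeds is attempted 1 + 3 = 4 times, then dropped.
print(run_with_retries(always_fails, max_retries=3))
```

Whichever convention is used, the point is that a job that keeps failing stops being run after a bounded number of attempts, which is the behavior to verify when checking the retry counts seen in these incidents.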