
Investigate job queue retries for Parsoid jobs
Closed, Duplicate (Public)

Description

In the incident documented at https://wikitech.wikimedia.org/wiki/Incident_documentation/20150103-Parsoid, requests to a specific page appear to have been retried a large number of times. Because requests for this page locked up Parsoid workers, the Parsoid cluster was quickly overloaded.

We should check why such a large number of retries are happening. Things to look into:

  • Varnish backend retries on timeout (both frontend and backend)
  • Parsoid job retries

Possibly related: T73853

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke added a subscriber: GWicke.

This happened again today, with another bug in Parsoid causing an infinite loop on 2 enwiki pages (and yesterday on 3 plwiki pages). The job queue relentlessly retried the two pages over the last 8 hours, which required 2 restarts and a hotfix deployment to fix the infinite loop.

ssastry triaged this task as Medium priority. Mar 24 2015, 4:47 PM

I thought this was looked into and there was nothing odd about job retries (actually, I recall finding and fixing a bug that made the retry count 1 less than it should be).
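
For illustration only, here is a minimal sketch of how a cap on attempts bounds retries of a job that never succeeds, and how checking the cap one step too early yields one fewer attempt than configured. All names here (`run_with_retries`, `run_with_retries_off_by_one`, `max_attempts`, `always_fails`) are hypothetical and are not taken from the MediaWiki job queue or Parsoid code; the actual bug mentioned above may have looked quite different.

```python
from typing import Callable


def run_with_retries(job: Callable[[], bool], max_attempts: int = 3) -> int:
    """Run `job` until it succeeds or `max_attempts` attempts were made.

    Returns the number of attempts actually made.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if job():  # job() returns True on success
            break
    return attempts


def run_with_retries_off_by_one(job: Callable[[], bool], max_attempts: int = 3) -> int:
    """Same loop, but the cap is checked one step too early (hypothetical bug)."""
    attempts = 0
    while attempts < max_attempts - 1:
        attempts += 1
        if job():
            break
    return attempts


def always_fails() -> bool:
    """Stand-in for a job that can never complete, e.g. a page that hangs the parser."""
    return False
```

With `max_attempts=3`, the first function makes three attempts on `always_fails` and the second makes two. Either way, a per-job cap like this only limits retries of a single job; it would not by itself prevent the same page from being enqueued again by a new job.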