Socket timeout on wdqs.svc.eqiad.wmnet
Closed, Resolved (Public)

Description

On March 4 at 13:54 UTC, Icinga alerted: PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds. Restarting wdqs-blazegraph resolved the issue (it was a rolling restart, which took a few minutes to complete).

Looking at the Grafana dashboards, it seems that only wdqs1004 and wdqs1005 were affected (see the banned requests and lag graphs), from ~13:45 UTC to ~14:05 UTC.

My best guess is that this is related to specific user-generated load that evaded throttling, but I have not found the specific problematic requests. I don't have a great idea of how to prevent this from happening again, but I'm open to suggestions.

Note that this again raises the question of what SLO we want for WDQS (T199228). Since we don't have a great way to ensure this never happens again, we should manage expectations.

Event Timeline

Gehel created this task. Mar 4 2019, 2:57 PM
Restricted Application added a project: Wikidata. Mar 4 2019, 2:57 PM
Restricted Application added a subscriber: Aklapper.
Gehel updated the task description. Mar 4 2019, 2:59 PM
jbond triaged this task as Normal priority. Mar 5 2019, 1:24 PM

Not sure what is to be done for this task. Do we want to investigate what caused it (i.e. which queries were involved and why the socket timeout happened: was it OOM, CPU exhaustion, too many threads, etc.)? I think it needs to be clearer what the actionable part of this task is.

In general, I think if we want to investigate such cases, we should have a data collection runbook which describes which logs, metrics, values, etc. we should collect to facilitate the investigation.
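To make the runbook idea a bit more concrete, here is a minimal sketch of what an automated collection step could look like. Everything in it is an assumption for illustration: the function name `collect_snapshot`, the collector names, and the placeholder values are hypothetical, and nothing here queries real WDQS, Blazegraph, or Grafana sources. The collector names simply mirror the hypotheses listed above (OOM, CPU exhaustion, too many threads, problematic queries).

```python
import json
import time
from typing import Any, Callable, Dict


def collect_snapshot(collectors: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Run every collector and record either its value or its error.

    A failing collector does not abort the others: during an incident,
    partial data is more useful than no data.
    """
    snapshot: Dict[str, Any] = {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "items": {},
    }
    for name, collect in collectors.items():
        try:
            snapshot["items"][name] = {"ok": True, "value": collect()}
        except Exception as exc:  # record the failure and keep going
            snapshot["items"][name] = {"ok": False, "error": repr(exc)}
    return snapshot


def _unavailable() -> Any:
    # Stands in for a collector whose data source cannot be reached.
    raise RuntimeError("source unavailable")


# Hypothetical collectors with placeholder values. A real runbook would
# replace each one with a call that reads the actual log or metric.
EXAMPLE_COLLECTORS: Dict[str, Callable[[], Any]] = {
    "heap_used_bytes": lambda: 0,    # for the OOM hypothesis
    "cpu_load": lambda: 0.0,         # for the CPU exhaustion hypothesis
    "thread_count": lambda: 0,       # for the "too many threads" hypothesis
    "recent_queries": _unavailable,  # simulates a source that is down
}

if __name__ == "__main__":
    print(json.dumps(collect_snapshot(EXAMPLE_COLLECTORS), indent=2))
```

The point of the design is that the snapshot is taken before the restart and each item succeeds or fails independently, so whoever is on call can attach whatever was collected to the task even if some sources were unreachable at the time.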

@Gehel any input on this?

Gehel closed this task as Resolved. Apr 11 2019, 7:26 AM
Gehel claimed this task.

I don't think there is anything actionable at this point. Let's close.