
wdqs1012 flatlined after page for wdqs.svc.eqiad.wmnet timing out
Closed, Duplicate · Public

Description

03:25:57 <+icinga-wm> PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
03:27:43 <+icinga-wm> RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

@RLazarus noticed that wdqs1012 appears to have dropped to 0 or disappeared in all metrics, per https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1629775199997&to=1629775911861&var-cluster_name=wdqs
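For anyone reading the dashboard link later: the `from` and `to` query parameters look like Unix epoch timestamps in milliseconds, which is the usual Grafana convention. The sketch below decodes them into UTC so the graph window can be lined up with the alert times above. The helper name `grafana_window` is made up for illustration and is not part of any Wikimedia tooling.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_window(url):
    """Return (start, end) of a Grafana URL's from/to range as UTC datetimes.

    Assumes both parameters are absolute epoch-millisecond values,
    not relative expressions such as "now-1h".
    """
    params = parse_qs(urlparse(url).query)

    def to_utc(ms):
        # Integer milliseconds via timedelta avoids float rounding.
        return EPOCH + timedelta(milliseconds=int(ms))

    return to_utc(params["from"][0]), to_utc(params["to"][0])


url = (
    "https://grafana.wikimedia.org/d/000000489/wikidata-query-service"
    "?orgId=1&from=1629775199997&to=1629775911861&var-cluster_name=wdqs"
)
start, end = grafana_window(url)
print(start.isoformat(), end.isoformat())
```

Under that assumption, the window decodes to roughly 03:20 to 03:32 UTC on 2021-08-24, which brackets the 03:25 page and the 03:27 recovery.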

I found some exceptions related to the prometheus exporters and restarted both, without success (see P17064).

@RLazarus has depooled wdqs1012 for now, until it can be investigated by a WDQS expert, since something does appear to be wrong with it.

Event Timeline

The depool happened at 03:51. At 03:52, systemd(?) restarted wdqs-blazegraph.service, which seems to have brought stuff like the updater back to life, though there's some updater lag it's churning through.

Thanks for depooling this machine!

Seeing this in the graph is generally a symptom of T242453, so I'm tentatively closing this task as a duplicate.
As you noted, systemd restarted blazegraph, which had terminated with:

Aug 24 03:52:31 wdqs1012 wdqs-blazegraph[23826]: Terminating due to java.lang.OutOfMemoryError: Java heap space

I'll update the runbook to mention these symptoms.

I'm repooling this machine as the lag is back to normal.
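As a possible aid for spotting this symptom next time, here is a minimal sketch that scans syslog-style lines for the "Terminating due to java.lang.OutOfMemoryError" message quoted above. The regular expression is modelled only on that single log line, and the function name `find_oom_terminations` is hypothetical; the real journal format on the WDQS hosts may differ.

```python
import re

# Pattern modelled on the one log line quoted in this task (an assumption,
# not a documented format): "<Mon DD HH:MM:SS> <host> <unit>[<pid>]: <message>"
OOM_PATTERN = re.compile(
    r"^(?P<when>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<unit>[\w.-]+)\[(?P<pid>\d+)\]: "
    r"Terminating due to java\.lang\.OutOfMemoryError: (?P<reason>.+)$"
)


def find_oom_terminations(lines):
    """Return one dict per line that reports a JVM exit caused by OutOfMemoryError."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


sample = [
    "Aug 24 03:52:31 wdqs1012 wdqs-blazegraph[23826]: "
    "Terminating due to java.lang.OutOfMemoryError: Java heap space",
]
for hit in find_oom_terminations(sample):
    print(hit["when"], hit["host"], hit["unit"], hit["reason"])
```

The input could be any iterable of lines, for example the saved output of a journal query for the blazegraph unit on the affected host.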