wdqs-categories prometheus exporter failing on select wdqs instances
Closed, Resolved, Public

Description

The exporter is currently failing on wdqs1011 and wdqs2006 with the following:

ryankemper@wdqs1011:~$ sudo systemctl status prometheus-blazegraph-exporter-wdqs-categories.service
● prometheus-blazegraph-exporter-wdqs-categories.service - Prometheus Blazegraph Exporter (wdqs-categories)
   Loaded: loaded (/lib/systemd/system/prometheus-blazegraph-exporter-wdqs-categories.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2020-12-10 17:10:22 UTC; 12min ago
  Process: 21672 ExecStart=/usr/local/bin/prometheus-blazegraph-exporter --listen :9194 --port 9990 (code=exited, status=1/FAILURE)
 Main PID: 21672 (code=exited, status=1/FAILURE)

Dec 10 17:10:22 wdqs1011 systemd[1]: prometheus-blazegraph-exporter-wdqs-categories.service: Service RestartSec=100ms expired, scheduling restart.
Dec 10 17:10:22 wdqs1011 systemd[1]: prometheus-blazegraph-exporter-wdqs-categories.service: Scheduled restart job, restart counter is at 5.
Dec 10 17:10:22 wdqs1011 systemd[1]: Stopped Prometheus Blazegraph Exporter (wdqs-categories).
Dec 10 17:10:22 wdqs1011 systemd[1]: prometheus-blazegraph-exporter-wdqs-categories.service: Start request repeated too quickly.
Dec 10 17:10:22 wdqs1011 systemd[1]: prometheus-blazegraph-exporter-wdqs-categories.service: Failed with result 'exit-code'.
Dec 10 17:10:22 wdqs1011 systemd[1]: Failed to start Prometheus Blazegraph Exporter (wdqs-categories).

The exporter serves its metrics on localhost:9194 (the --listen address above), which is currently refusing connections. It seems the actual categories service on the "other side" is running but logging errors for a query against the wdq namespace:

ryankemper@wdqs1011:~$ sudo systemctl status wdqs-categories.service
● wdqs-categories.service - Query Service - Blazegraph - wdqs-categories
   Loaded: loaded (/lib/systemd/system/wdqs-categories.service; static; vendor preset: enabled)
   Active: active (running) since Thu 2020-12-10 17:04:48 UTC; 19min ago
 Main PID: 18590 (java)
    Tasks: 114 (limit: 10000)
   Memory: 2.3G
   CGroup: /system.slice/wdqs-categories.service
           └─18590 java -server -XX:+UseG1GC -Xmx8g -Xloggc:/var/log/wdqs/wdqs-categories_jvm_gc.%p-%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -X

Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:                           SELECT ( COUNT( * ) AS ?count ) { ?s ?p ?o }
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:                         } UNION {
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:                           SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?y }
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:                         } } req.requestURI=/bigdata/namespace/wdq/sparql, req.xForwardedFor=null, req.queryString=format=json&query=+prefix+schema%3A+%3Cht
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]: com.bigdata.rdf.sail.webapp.DatasetNotFoundException: namespace=wdq
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:         at com.bigdata.rdf.sail.BigdataSail$BigdataSailConnection.<init>(BigdataSail.java:2071)
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]: Wrapped by: java.lang.RuntimeException: com.bigdata.rdf.sail.webapp.DatasetNotFoundException: namespace=wdq
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:         at com.bigdata.rdf.sail.BigdataSail.getReadOnlyConnection(BigdataSail.java:1506)
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]: Wrapped by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: com.bigdata.rdf.sail.webapp.DatasetNotFoundException: namespace=wdq
Dec 10 17:10:21 wdqs1011 wdqs-categories[18590]:         at java.util.concurrent.FutureTask.report(FutureTask.java:122)

As a separate issue, wdqs2002 returns a 500 response code for curl localhost:9194, which leads to a status of Unknown rather than Critical.
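
To tell the two failure modes apart (connection refused on wdqs1011 and wdqs2006 versus a 500 on wdqs2002), something like the following could be run on each host. This is only a rough sketch: `probe` is a hypothetical helper, not part of the exporter or of the monitoring checks, and it says nothing about how the alerting maps results to Unknown or Critical.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, timeout=5.0):
    """Report whether `url` answered over HTTP or could not be reached.

    `opener` is injectable so the function can be exercised without a
    live exporter; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return f"http {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Nothing answered: connection refused, timeout, DNS failure, etc.
        return f"unreachable: {exc}"
```

For example, `probe("http://localhost:9194")` would presumably return `http 500` on wdqs2002 and an `unreachable: ...` string on the hosts where the exporter is down.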

Event Timeline

Restricted Application added a subscriber: Aklapper.
RKemper renamed this task from "wdqs-categories prometheus exporter failing on wdqs1011" to "wdqs-categories prometheus exporter failing on select wdqs instances". Dec 10 2020, 6:47 PM
RKemper updated the task description.

The prometheus blazegraph exporter appears to hardcode the "wdq" namespace, which does not exist on the categories endpoint. I don't know why this worked earlier. The namespace needs to be parameterized.

Note that categories namespaces are aliased by nginx.
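
As a rough illustration of the parameterization suggested above, the namespace could become a command-line option that is interpolated into the SPARQL endpoint path. This is a minimal sketch and not the actual exporter code: the `--namespace` flag, the `sparql_url` helper, and the `categories` value used in the example are assumptions. Only `--listen`, `--port`, the `/bigdata/namespace/<name>/sparql` path, and the `format=json&query=` query string come from the output in the description.

```python
import argparse
from urllib.parse import urlencode


def sparql_url(port, namespace, query):
    """Build the Blazegraph SPARQL URL for the given namespace."""
    base = f"http://localhost:{port}/bigdata/namespace/{namespace}/sparql"
    return f"{base}?{urlencode({'format': 'json', 'query': query})}"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Blazegraph exporter (sketch)")
    parser.add_argument("--listen", required=True,
                        help="address the exporter serves metrics on")
    parser.add_argument("--port", type=int, required=True,
                        help="port of the Blazegraph instance to query")
    # Hypothetical flag: defaults to the previously hardcoded "wdq" so
    # existing invocations keep their current behaviour.
    parser.add_argument("--namespace", default="wdq",
                        help="Blazegraph namespace to query")
    return parser.parse_args(argv)
```

With this shape, the categories unit would pass something like `--listen :9194 --port 9990 --namespace categories`, where `categories` stands in for whatever name the nginx alias exposes.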

Change 647774 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] categories: fix prom exporter's broken namespace

https://gerrit.wikimedia.org/r/647774

Gehel triaged this task as High priority. Dec 11 2020, 7:42 AM
Gehel moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Change 647774 merged by Ryan Kemper:
[operations/puppet@production] categories: fix prom exporter's broken namespace

https://gerrit.wikimedia.org/r/647774

Change 649473 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: fix typo breaking prom blazegraph exporter

https://gerrit.wikimedia.org/r/649473

Change 649473 merged by Ryan Kemper:
[operations/puppet@production] wdqs: fix typo breaking prom blazegraph exporter

https://gerrit.wikimedia.org/r/649473

The above two patches (really one patch plus a follow-up to fix a typo) seem to have fixed the problem. I'll want to circle back to verify we're getting all the metrics we expect, but at a minimum the prometheus blazegraph exporter for wdqs-categories is working again.
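
For the follow-up verification, one option is to compare the metric names the exporter exposes against a list of expected names. The sketch below parses the Prometheus text exposition format. The helper names are hypothetical, and the expected-metrics list would have to come from the exporter itself; none is assumed here.

```python
import urllib.request


def metric_names(exposition):
    """Return the set of metric names in Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(exposition, expected):
    """Return the expected metric names that are absent from the output."""
    return set(expected) - metric_names(exposition)


def fetch(url, timeout=5.0):
    """Fetch the exporter output as text (the URL is supplied by the caller)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running `missing_metrics(fetch(url), expected)` against the exporter on a healthy wdqs host and then on each categories exporter should make any gaps visible; an empty set means every expected metric is present.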