
Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service
Closed, Resolved · Public · 2 Estimated Story Points

Description

Per this comment, the prometheus-blazegraph-exporter service emits a lot of log spam while the data-transfer.py cookbook is running. The root cause is that the exporter keeps trying to talk to blazegraph, which is down during the transfer.

AC

  • Change the systemd dependency of prometheus-blazegraph-exporter such that it requires the blazegraph service to be running
  • Verify the log spam stops (see linked comment for how to do that).

Event Timeline

MPhamWMF triaged this task as Medium priority. Oct 31 2022, 4:42 PM
RKemper removed the point value for this task.
RKemper set the point value for this task to 2.
RKemper renamed this task from "Data-transfer.py cookbook: stop prometheus-blazegraph-exporter service" to "Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service". Nov 1 2022, 7:35 PM

Change 851711 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/851711

Mentioned in SAL (#wikimedia-operations) [2022-11-01T20:06:43Z] <ryankemper> T322037 Disabled puppet across A:wdqs-all and A:wcqs-public

Change 851711 merged by Bking:

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/851711

Mentioned in SAL (#wikimedia-operations) [2022-11-01T20:56:15Z] <ryankemper> T322037 Re-enabled puppet across A:wdqs-all and A:wcqs-public

We tried https://gerrit.wikimedia.org/r/851711 out on wdqs1009 (giving the exporter Requires= and After= on the blazegraph instance), but it didn't behave as we expected. For example, we thought restarting blazegraph would restart the exporter, but that wasn't the case.


I found a question on Stack Overflow that maps very well to what we're trying to do: https://stackoverflow.com/q/47253020

Looks like PartOf= might be a good option for us. I need to do some reading, but this article looks really good at explaining how PartOf= and BindsTo= differ:

https://pychao.com/2021/02/24/difference-between-partof-and-bindsto-in-a-systemd-unit
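
For reference, here is a rough sketch of the two variants discussed in this comment: Requires= plus After= (what was tried on wdqs1009) and PartOf= plus After= (the candidate). Per the systemd.unit documentation, PartOf= is a one-way dependency that propagates a stop or restart of the listed unit to the unit carrying the directive. The helper name render_dropin and the unit name blazegraph.service are hypothetical and used only for illustration; the real unit names on the query service hosts may differ, and the real change is made through puppet rather than a script like this.

```python
def render_dropin(upstream_unit: str, directive: str = "PartOf") -> str:
    """Return the text of a systemd drop-in that ties a unit to upstream_unit.

    upstream_unit is a hypothetical unit name, e.g. "blazegraph.service".
    directive selects which dependency type to emit.
    """
    allowed = {"Requires", "BindsTo", "PartOf"}
    if directive not in allowed:
        raise ValueError(f"unsupported directive: {directive}")
    lines = [
        "[Unit]",
        f"{directive}={upstream_unit}",
        # After= only orders startup and shutdown; it is not a dependency by itself.
        f"After={upstream_unit}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Variant tried first on wdqs1009.
    print(render_dropin("blazegraph.service", "Requires"), end="")
    # Variant suggested by the linked article.
    print(render_dropin("blazegraph.service", "PartOf"), end="")
```

Whichever variant is chosen, it is probably worth checking on a single host what `systemctl restart` and `systemctl stop` of the blazegraph unit actually do to the exporter before rolling it out.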

Change 852885 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/852885

Change 852885 merged by Ryan Kemper:

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/852885

Mentioned in SAL (#wikimedia-operations) [2022-11-03T19:56:07Z] <ryankemper> FOO Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885; disabled puppet on query service fleet via ryankemper@cumin1001:~$ sudo -E cumin 'A:wcqs-public or A:wdqs-all' 'sudo disable-puppet "T322037"'; testing change on wdqs1009

Change 853006 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] query_service: make blazegraph exporter sleep before starting

https://gerrit.wikimedia.org/r/853006

Change 853006 merged by Ryan Kemper:

[operations/puppet@production] query_service: make blazegraph exporter sleep before starting

https://gerrit.wikimedia.org/r/853006
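
This timeline does not show how change 853006 implements the delay. One common way to make a service wait before starting is an ExecStartPre= line that runs sleep, so the sketch below assumes that approach. The helper name render_start_delay, the /bin/sleep path, and the 10 second value in the example are all assumptions for illustration, not taken from the patch.

```python
def render_start_delay(seconds: int) -> str:
    """Return a systemd drop-in that sleeps before the service's main process starts.

    Assumes the delay is done with ExecStartPre= and /bin/sleep; the real patch may differ.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return "[Service]\n" f"ExecStartPre=/bin/sleep {seconds}\n"


if __name__ == "__main__":
    # 10 is a made-up value; pick whatever gives blazegraph time to come up.
    print(render_start_delay(10), end="")
```

A fixed sleep is simple but only approximates "blazegraph is ready"; it trades a slightly slower exporter start for fewer failed scrapes right after a restart.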

Mentioned in SAL (#wikimedia-operations) [2022-11-03T20:35:28Z] <ryankemper> T322037 Rolling changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885 and https://gerrit.wikimedia.org/r/853006 out to query service fleet, 4 hosts at a time: ryankemper@cumin1001:~$ sudo -E cumin -b 4 'A:wcqs-public or A:wdqs-all' 'run-puppet-agent --force'

Gehel claimed this task.