
Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service
Closed, Resolved | Public | 2 Estimated Story Points

Description

Per this comment, the prometheus-blazegraph-exporter service spews lots of log spam while the data-transfer.py cookbook is running. The root cause is that the blazegraph exporter keeps trying to talk to blazegraph, which is down during the transfer.

AC

  • Change the systemd dependency of prometheus-blazegraph-exporter such that it requires the blazegraph service to be running
  • Verify the log spam stops (see linked comment for how to do that).

Event Timeline

RKemper removed the point value for this task.
RKemper set the point value for this task to 2.
RKemper renamed this task from Data-transfer.py cookbook: stop prometheus-blazegraph-exporter service to Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service. Nov 1 2022, 7:35 PM

Change 851711 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/851711

Mentioned in SAL (#wikimedia-operations) [2022-11-01T20:06:43Z] <ryankemper> T322037 Disabled puppet across A:wdqs-all and A:wcqs-public

Change 851711 merged by Bking:

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/851711

Mentioned in SAL (#wikimedia-operations) [2022-11-01T20:56:15Z] <ryankemper> T322037 Re-enabled puppet across A:wdqs-all and A:wcqs-public

We tried https://gerrit.wikimedia.org/r/851711 out on wdqs1009 (giving the exporter Requires= and After= on the blazegraph instance); it didn't behave how we expected. For example, we thought restarting blazegraph would also restart the exporter, but that wasn't the case.


I found a question on Stack Overflow that maps very well to what we're trying to do: https://stackoverflow.com/q/47253020

Looks like PartOf= might be a good option for us. I need to do some reading, but this article looks really good at explaining how the behaviors differ:

https://pychao.com/2021/02/24/difference-between-partof-and-bindsto-in-a-systemd-unit
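
To make the PartOf= idea concrete, here is a minimal sketch of a drop-in for the exporter unit. The unit names (blazegraph.service, prometheus-blazegraph-exporter.service) and the drop-in path are assumptions for illustration; the real unit names on the query service hosts may differ, and this is not necessarily what the merged patches do.

```ini
# Hypothetical drop-in path:
# /etc/systemd/system/prometheus-blazegraph-exporter.service.d/partof.conf
[Unit]
# Ordering only: when both units are started together, start the exporter
# after blazegraph.
After=blazegraph.service
# Propagate an explicit stop or restart of blazegraph to the exporter.
# This is one-way: stopping the exporter does not affect blazegraph.
PartOf=blazegraph.service
```

Per systemd.unit(5), PartOf= is limited to stopping and restarting, so on its own it won't start the exporter again after blazegraph has been stopped and later started; something else (for example a Wants= on the blazegraph side) would have to pull it in. BindsTo= combined with After= is the stricter alternative, where the exporter can only be active while blazegraph is active.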

Change 852885 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/852885

Change 852885 merged by Ryan Kemper:

[operations/puppet@production] query_service: Ensure prometheus exporter depends on blazegraph service

https://gerrit.wikimedia.org/r/852885

Mentioned in SAL (#wikimedia-operations) [2022-11-03T19:56:07Z] <ryankemper> FOO Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885; disabled puppet on query service fleet via ryankemper@cumin1001:~$ sudo -E cumin 'A:wcqs-public or A:wdqs-all' 'sudo disable-puppet "T322037"'; testing change on wdqs1009

Change 853006 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] query_service: make blazegraph exporter sleep before starting

https://gerrit.wikimedia.org/r/853006

Change 853006 merged by Ryan Kemper:

[operations/puppet@production] query_service: make blazegraph exporter sleep before starting

https://gerrit.wikimedia.org/r/853006
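
The contents of this change aren't shown in the task, but one common way to delay a unit's start is an ExecStartPre= sleep. A minimal sketch, assuming a hypothetical drop-in path and an arbitrary 10-second delay (neither is taken from change 853006):

```ini
# Hypothetical drop-in path:
# /etc/systemd/system/prometheus-blazegraph-exporter.service.d/delay.conf
[Service]
# Wait before launching the exporter so blazegraph has time to come up.
# The 10-second value is an illustrative assumption, not the deployed setting.
ExecStartPre=/bin/sleep 10
```

A fixed sleep is a blunt tool: it only helps if blazegraph reliably becomes ready within the delay, so the value would need tuning against real startup times.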

Mentioned in SAL (#wikimedia-operations) [2022-11-03T20:35:28Z] <ryankemper> T322037 Rolling changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/852885 and https://gerrit.wikimedia.org/r/853006 out to query service fleet, 4 hosts at a time: ryankemper@cumin1001:~$ sudo -E cumin -b 4 'A:wcqs-public or A:wdqs-all' 'run-puppet-agent --force'

Gehel claimed this task.