
Create alerts for https://query.wikidata.org/bigdata/ldf
Closed, Resolved (Public)

Description

Per T347284, we lost the LDF endpoint for a few days. Creating this ticket to add alerts for the URL https://query.wikidata.org/bigdata/ldf

Details

Repo                Branch       Lines +/-
operations/puppet   production   +28 -1
operations/puppet   production   +1 -0
operations/puppet   production   +2 -0
operations/puppet   production   +21 -0
operations/puppet   production   +9 -0
operations/puppet   production   +1 -1
operations/puppet   production   +0 -14
operations/puppet   production   +19 -0
operations/puppet   production   +4 -13
operations/puppet   production   +10 -11
operations/puppet   production   +2 -2
operations/puppet   production   +1 -6
operations/puppet   production   +5 -1
operations/puppet   production   +30 -0
operations/puppet   production   +34 -1
operations/puppet   production   +1 -1
operations/puppet   production   +1 -1
operations/puppet   production   +0 -19
operations/puppet   production   +2 -1
operations/puppet   production   +18 -0
operations/puppet   production   +0 -14
operations/puppet   production   +0 -2
operations/puppet   production   +1 -1
operations/dns      master       +1 -1
operations/puppet   production   +2 -0
operations/puppet   production   +2 -0
operations/dns      master       +1 -0
operations/puppet   production   +1 -1
operations/puppet   production   +15 -0
operations/puppet   production   +14 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 978118 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] miscweb: fix typo in wdqs ldf endpoint check

https://gerrit.wikimedia.org/r/978118

Change 978118 merged by Bking:

[operations/puppet@production] miscweb: fix typo in wdqs ldf endpoint check

https://gerrit.wikimedia.org/r/978118

Change 978131 had a related patch set uploaded (by Bking; author: Bking):

[operations/dns@master] wdqs: add CNAME for wdqs-ldf endpoint

https://gerrit.wikimedia.org/r/978131

Change 978131 merged by Bking:

[operations/dns@master] wdqs: add CNAME for wdqs-ldf endpoint

https://gerrit.wikimedia.org/r/978131

Change 978134 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: point wdqs ldf endpoint to new CNAME

https://gerrit.wikimedia.org/r/978134

Change 978140 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] WIP: do not merge

https://gerrit.wikimedia.org/r/978140

Change 978140 abandoned by Bking:

[operations/puppet@production] WIP: do not merge

Reason:

just testing CI

https://gerrit.wikimedia.org/r/978140

Change 978142 had a related patch set uploaded (by Bking; author: Bking):

[operations/dns@master] The wdqs ldf endpoint (query.wikidata.org/bigdata/ldf) is hosted from a single server. Create a CNAME under discovery services so we don't have to update multiple places (monitoring, ATS, etc) when we update hosts.

https://gerrit.wikimedia.org/r/978142

Change 978142 merged by Bking:

[operations/dns@master] The wdqs ldf endpoint (query.wikidata.org/bigdata/ldf) is hosted from a single server. Create a CNAME under discovery services so we don't have to update multiple places (monitoring, ATS, etc) when we update hosts.

https://gerrit.wikimedia.org/r/978142

I've created another 24-hour silence for this alert, UUID 59b5ca30-1aeb-4d06-b083-7023a373ccb3.

Change 978134 merged by Bking:

[operations/puppet@production] query_service: point wdqs ldf endpoint to new CNAME

https://gerrit.wikimedia.org/r/978134

Change 978700 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] miscweb: change wdqs ldf endpoint blackbox check

https://gerrit.wikimedia.org/r/978700

Change 978700 merged by Bking:

[operations/puppet@production] miscweb: change wdqs ldf endpoint blackbox check

https://gerrit.wikimedia.org/r/978700

We've silenced the alert for another 24 hours. The network probes Grafana dashboard is still showing 0% availability for our ldf probe.

After some thought, I think the problem is the blackbox check's association with miscweb. We actually bypass miscweb when we access the ldf endpoint, so the blackbox check should live outside modules/profile/manifests/microsites/query_service.pp, which creates that association.

Change 979149 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] miscweb: remove wdqs ldf endpoint check

https://gerrit.wikimedia.org/r/979149

Change 979149 merged by Bking:

[operations/puppet@production] miscweb: remove wdqs ldf endpoint check

https://gerrit.wikimedia.org/r/979149

Change 979388 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Add blackbox check for LDF endpoint

https://gerrit.wikimedia.org/r/979388

Change 979388 merged by Bking:

[operations/puppet@production] wdqs: Add blackbox check for LDF endpoint

https://gerrit.wikimedia.org/r/979388

Change 979401 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: fix blackbox check for ldf endpoint

https://gerrit.wikimedia.org/r/979401

Change 979401 merged by Bking:

[operations/puppet@production] wdqs: fix blackbox check for ldf endpoint

https://gerrit.wikimedia.org/r/979401

Change 979408 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Monitor ldf endpoint

https://gerrit.wikimedia.org/r/979408

Change 979408 merged by Bking:

[operations/puppet@production] wdqs: remove ldf endpoint monitoring

https://gerrit.wikimedia.org/r/979408

Change 979983 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Monitor LDF endpoint

https://gerrit.wikimedia.org/r/979983

Change 979983 merged by Bking:

[operations/puppet@production] wdqs: Monitor LDF endpoint

https://gerrit.wikimedia.org/r/979983

Change 980460 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Improve regex for ldf check

https://gerrit.wikimedia.org/r/980460

Change 980460 merged by Bking:

[operations/puppet@production] wdqs: Improve regex for ldf check

https://gerrit.wikimedia.org/r/980460

Mentioned in SAL (#wikimedia-operations) [2023-12-05T20:53:50Z] <inflatador> bking@prometheus1006 reload prometheus-blackbox service T347355

Mentioned in SAL (#wikimedia-operations) [2023-12-05T20:58:36Z] <inflatador> bking@prometheus1006 disable puppet for troubleshooting T347355

Reverted the last change after we noticed some alerts for the following hosts: 1008 1009 1010 1011 2008 2014

None of these hosts are in the wdqs-public tier, but I set the ldf_host hiera var in public.yaml. We need to set this somewhere the non-public hosts can find it.
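The lookup problem described above can be pictured with a minimal sketch, assuming a simplified first-match hierarchy. The layer names, the host value and the `lookup` helper are hypothetical illustrations, not the actual Hiera layout in operations/puppet.

```python
from typing import Dict, List, Optional

# Hypothetical data layers: the key is defined only in the public-role layer.
LAYERS: Dict[str, Dict[str, str]] = {
    "role/wdqs/public": {"ldf_host": "ldf-host.example"},
    "role/wdqs/internal": {},
    "common": {},
}


def lookup(key: str, hierarchy: List[str],
           layers: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the first value found for key, walking the hierarchy in order."""
    for layer in hierarchy:
        data = layers.get(layer, {})
        if key in data:
            return data[key]
    # No layer visible to this host defines the key.
    return None


# A public host sees the value; an internal host does not.
public_value = lookup("ldf_host", ["role/wdqs/public", "common"], LAYERS)
internal_value = lookup("ldf_host", ["role/wdqs/internal", "common"], LAYERS)
```

In this model, moving the key into a layer that every host's hierarchy includes (here, `common`) would make it resolvable for the non-public hosts as well.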

Change 980499 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: monitor ldf endpoint

https://gerrit.wikimedia.org/r/980499

Change 980503 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] trafficserver: revert to using hostname for wdqs ldf endpoint

https://gerrit.wikimedia.org/r/980503

Change 980503 merged by Bking:

[operations/puppet@production] trafficserver: revert to using hostname for wdqs ldf endpoint

https://gerrit.wikimedia.org/r/980503

Change 980499 merged by Bking:

[operations/puppet@production] wdqs: monitor ldf endpoint

https://gerrit.wikimedia.org/r/980499

Change 981387 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: monitor ldf endpoint

https://gerrit.wikimedia.org/r/981387

Change 981387 merged by Bking:

[operations/puppet@production] wdqs: monitor ldf endpoint

https://gerrit.wikimedia.org/r/981387

Change 981551 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add ldf endpoint logic

https://gerrit.wikimedia.org/r/981551

Change 981551 merged by Bking:

[operations/puppet@production] wdqs: add ldf endpoint logic

https://gerrit.wikimedia.org/r/981551

Mentioned in SAL (#wikimedia-operations) [2023-12-08T16:19:28Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1015.eqiad.wmnet with reason: T347355

Mentioned in SAL (#wikimedia-operations) [2023-12-08T16:19:44Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1015.eqiad.wmnet with reason: T347355

Change 981563 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] miscweb: Notify search platform for sites they own

https://gerrit.wikimedia.org/r/981563

Change 981563 merged by Bking:

[operations/puppet@production] miscweb: Notify search platform for sites they own

https://gerrit.wikimedia.org/r/981563

Change 981578 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: move params from body to path

https://gerrit.wikimedia.org/r/981578

Change 981578 merged by Bking:

[operations/puppet@production] wdqs: move params from body to path

https://gerrit.wikimedia.org/r/981578

Change 981591 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] query_service: duplicate monitoring checks for sre-collab team

https://gerrit.wikimedia.org/r/981591

Change 981624 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: simplify ldf endpoint check

https://gerrit.wikimedia.org/r/981624

Change 981624 merged by Bking:

[operations/puppet@production] wdqs: simplify ldf endpoint check

https://gerrit.wikimedia.org/r/981624

Change 982138 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Try icinga-based check instead of blackbox

https://gerrit.wikimedia.org/r/982138

Change 981591 merged by Dzahn:

[operations/puppet@production] query_service: duplicate monitoring checks for sre-collab team

https://gerrit.wikimedia.org/r/981591

Change 982138 merged by Bking:

[operations/puppet@production] wdqs: Try icinga-based check instead of blackbox

https://gerrit.wikimedia.org/r/982138

Mentioned in SAL (#wikimedia-operations) [2023-12-11T22:09:40Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355

Mentioned in SAL (#wikimedia-operations) [2023-12-11T22:09:59Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on wdqs1015.eqiad.wmnet with reason: T347355

Change 982172 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Change LDF monitoring URI

https://gerrit.wikimedia.org/r/982172

There is a way to test this manually from the icinga host; see Daniel Zahn's comment here. This should help us narrow down the issue.

Packet captures from the wdqs1015 host are here

Change 983260 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove ldf check

https://gerrit.wikimedia.org/r/983260

Change 983260 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove ldf check

https://gerrit.wikimedia.org/r/983260

Change 982172 abandoned by Bking:

[operations/puppet@production] wdqs: Change LDF monitoring URI

Reason:

We know the root cause now and we'll approach this a different way.

https://gerrit.wikimedia.org/r/982172

We've figured out the cause of the issue (thanks @Stevemunene!). The pollers (prometheus blackbox and icinga) do not send an Accept: header, and without that header Blazegraph always returns a 500. As of now, the Puppet modules for prometheus blackbox and icinga don't expose a way to add headers, but it should be easy enough to add this to the Blackbox http puppet module. In the meantime, we can configure nginx to add Accept: */* to requests that don't already have an Accept header set.
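The intended "add the header only if it is missing" behavior can be sketched as follows. This is an illustration of the logic in Python, not the nginx configuration that was deployed; the function name `with_default_accept` is hypothetical.

```python
from typing import Dict

DEFAULT_ACCEPT = "*/*"


def with_default_accept(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of headers, adding Accept: */* only when none is present."""
    patched = dict(headers)
    # HTTP header names are case-insensitive, so compare in lower case.
    if not any(name.lower() == "accept" for name in patched):
        patched["Accept"] = DEFAULT_ACCEPT
    return patched


# A poller-style request without Accept gets the default added.
bare = with_default_accept({"Host": "query.wikidata.org"})
# A client that already states a preference is left untouched.
explicit = with_default_accept({"accept": "text/turtle"})
```

The key design point is that an existing Accept value is never overwritten, so clients that already negotiate a format see no change.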

@bking - I had an idea, but I'm not sure whether or not it will work.
I looked at the API spec for blazegraph here: https://github.com/blazegraph/database/wiki/REST_API#get-or-post

It turns out that there is a format parameter, which is supposed to override any Accept header that is set.

image.png (306×1 px, 93 KB)

However, I tried a simple GET request with this and it didn't affect the outcome. e.g.

btullis@alert1001:~$ /usr/lib/nagios/plugins/check_http -H query.wikidata.org --sni  -u '/bigdata/ldf?format=json' -f follow  -I 10.64.132.7
HTTP CRITICAL: HTTP/1.1 500 Server Error - 8847 bytes in 0.018 second response time |time=0.018444s;;;0.000000;10.000000 size=8847B;;;0

Then I tried a POST request with 'format=json' as the payload, but that didn't work either. The error states that POST isn't permitted for this URL.

btullis@alert1001:~$ /usr/lib/nagios/plugins/check_http -H query.wikidata.org --sni  -u '/bigdata/ldf' -P 'format=json' -f follow  -I 10.64.132.7
HTTP WARNING: HTTP/1.1 400 HTTP method POST is not supported by this URL - 710 bytes in 0.015 second response time |time=0.015386s;;;0.000000;10.000000 size=710B;;;0

I'm not sure why the query string wouldn't be working, but I thought it worth sharing, in case you had any ideas. I think that adding custom header support to the prometheus blackbox probe in puppet is a great idea though, as soon as we get a chance.
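For comparison, a probe that does carry an explicit Accept header could look like the sketch below. It uses only the Python standard library; the helper names `build_ldf_probe` and `probe_status` are hypothetical, and this is not how the blackbox or icinga checks are implemented.

```python
import urllib.error
import urllib.request

LDF_URL = "https://query.wikidata.org/bigdata/ldf"


def build_ldf_probe(url: str = LDF_URL, accept: str = "*/*") -> urllib.request.Request:
    """Build (but do not send) a GET request with an explicit Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept}, method="GET")


def probe_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code


probe = build_ldf_probe()
```

Calling `probe_status(probe)` performs the actual network request, so it is left out of the example above.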

Change 983415 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Set default Accept: header

https://gerrit.wikimedia.org/r/983415

I've one-offed wdqs2010 and, based on a packet capture, I believe the above change will not affect requests that already have an Accept header set.

Change 983415 merged by Bking:

[operations/puppet@production] wdqs: Set default Accept: header

https://gerrit.wikimedia.org/r/983415

Change 983438 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: New LDF endpoint check

https://gerrit.wikimedia.org/r/983438

> @bking - I had an idea, but I'm not sure whether or not it will work.
> I looked at the API spec for blazegraph here: https://github.com/blazegraph/database/wiki/REST_API#get-or-post

Interesting! I could stand to read that a little more closely ;)...

> It turns out that there is a format parameter, which is supposed to override any Accept header that is set.
>
> image.png (306×1 px, 93 KB)

In our case, we don't want to override any header, just add one if it's not present. Apologies for not making it clear that that is actually the desired behavior in my last statement.

Change 983438 merged by Bking:

[operations/puppet@production] wdqs: New LDF endpoint check

https://gerrit.wikimedia.org/r/983438

Change 983893 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Enable ipv6 for envoy tls_terminator

https://gerrit.wikimedia.org/r/983893

Change 983893 merged by Bking:

[operations/puppet@production] wdqs: Enable ipv6 for envoy tls_terminator

https://gerrit.wikimedia.org/r/983893

Change 984212 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Add Accept: header to LDF endpoint check

https://gerrit.wikimedia.org/r/984212

Change 984212 merged by Bking:

[operations/puppet@production] wdqs: Add Accept: header to LDF endpoint check

https://gerrit.wikimedia.org/r/984212

I've rolled out the Puppet patches and confirmed they are working as expected. Closing...