
Wikidata Query Service unstable in codfw
Open, HighPublicBUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:
Returns a 502 Bad Gateway error.

(confirmed from Toolforge, PAWS, and my local computer via 2620:0:863:ed1a::1)


Event Timeline

Bugreporter triaged this task as Unbreak Now! priority.Fri, Sep 3, 3:48 PM
Bugreporter added a project: Traffic.
Bugreporter updated the task description. (Show Details)
Ladsgroup lowered the priority of this task from Unbreak Now! to Needs Triage.Fri, Sep 3, 3:50 PM
Ladsgroup added a subscriber: Ladsgroup.

Is your port wrong? The URL works for me:

amsa@C382:~$ curl --resolve query.wikidata.org:443 "https://query.wikidata.org/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&nocache=27178056"
<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
		<variable name='y'/>
	</head>
	<results>
		<result>
			<binding name='y'>
				<literal datatype='http://www.w3.org/2001/XMLSchema#dateTime'>2021-09-03T15:50:38Z</literal>
			</binding>
		</result>
	</results>
</sparql>

Please use ulsfo to access the server.

Note that there is no problem accessing https://query.wikidata.org/ itself; it only returns a 502 when a query is executed.
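For anyone else reproducing this per-datacenter: curl's `--resolve` flag takes a `HOST:PORT:ADDRESS` triple, so the invocation above needs an edge IP as the third field. A minimal sketch, using a placeholder documentation IP rather than a real ulsfo or codfw address:

```shell
# Build a --resolve argument that pins query.wikidata.org to a specific
# edge IP, so the request hits a chosen datacenter regardless of GeoDNS.
# 198.51.100.1 is a placeholder (TEST-NET-2), not a real Wikimedia address.
HOST=query.wikidata.org
PORT=443
EDGE_IP=198.51.100.1
RESOLVE="${HOST}:${PORT}:${EDGE_IP}"

# Print the command rather than executing it, for illustration:
echo curl --resolve "$RESOLVE" \
  "https://${HOST}/sparql?query=SELECT%20*%20WHERE%20%7B%7D%20LIMIT%201"
```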

Ladsgroup renamed this task from 502 Bad Gateway on WDQS to 502 Bad Gateway on WDQS on ulsfo.Fri, Sep 3, 3:57 PM
Gehel triaged this task as High priority.Fri, Sep 3, 4:04 PM
Gehel edited projects, added Discovery-Search (Current work); removed Traffic.
Gehel added a subscriber: Zbyszko.

We are experiencing overload issues on the WDQS cluster in codfw. We suspect some specific queries are taking blazegraph down, but we haven't been able to identify (or block) those yet. The best workaround at the moment seems to be to regularly restart blazegraph (every hour).
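The hourly restart is later implemented in puppet as a systemd timer ("restart hourly w random"); the idea can be sketched roughly as below. The jitter window and the `wdqs-blazegraph` service name are assumptions here, not the actual puppet code:

```shell
# Rough sketch of an hourly blazegraph restart with random jitter.
# A random splay keeps the codfw hosts from restarting in lockstep,
# which would take the whole cluster down at once.
SPLAY=$(( RANDOM % 900 ))   # up to 15 min of jitter (window is an assumption)
echo "Sleeping ${SPLAY}s before restart"
# sleep "${SPLAY}"                       # commented out for illustration
# sudo systemctl restart wdqs-blazegraph # service name is an assumption
```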

See @Zbyszko's email to the wikidata mailing list: https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/SO7JKEGEO4AT7EDUYGYTVRGFHRLU6SB2/

Gehel renamed this task from 502 Bad Gateway on WDQS on ulsfo to Wikidata Query Service unstable in codfw.Fri, Sep 3, 4:05 PM

Mentioned in SAL (#wikimedia-operations) [2021-09-03T16:10:13Z] <gehel> blazegraph (public codfw cluster) will now restart every hour - T290330

For higher availability, you may want to depool each server before restarting it.
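A rolling restart along those lines might look like the sketch below; `depool`/`pool` are the conftool wrapper scripts available on WMF hosts, while the host list, sleep durations, and service name are illustrative assumptions:

```shell
# Build the per-host command: drain traffic from the load balancer,
# restart blazegraph, wait for it to settle, then repool.
restart_cmd() {
  echo "depool && sleep 60 && sudo systemctl restart wdqs-blazegraph && sleep 60 && pool"
}

# Echo rather than execute, for illustration; hostnames are examples.
for host in wdqs2001.codfw.wmnet wdqs2002.codfw.wmnet; do
  echo ssh "$host" "$(restart_cmd)"
done
```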

Change 717494 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717494

Change 717494 merged by Ryan Kemper:

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717494

Change 717508 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717508

Change 717508 merged by Ryan Kemper:

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717508

Mentioned in SAL (#wikimedia-operations) [2021-09-03T17:17:55Z] <ryankemper> T290330 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/717508 across wdqs fleet; codfw wdqs hosts will restart on average once per hour now to address ongoing availability issues for wdqs codfw

Mentioned in SAL (#wikimedia-operations) [2021-09-03T19:04:43Z] <ryankemper> T290330 ryankemper@cumin1001:~$ sudo -E cumin 'P{wdqs2*}' 'sudo rm -fv /etc/cron.hourly/restart-blazegraph' (Cleaned up manually created crons now that we have [somewhat hacky] systemd timers doing the same job)

Change 720102 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove codfw hourly restarts

https://gerrit.wikimedia.org/r/720102

Change 720102 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove codfw hourly restarts

https://gerrit.wikimedia.org/r/720102

Mentioned in SAL (#wikimedia-operations) [2021-09-17T01:47:59Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] sudo cumin 'C:query_service::crontasks' 'sudo disable-puppet "Stop doing wdqs codfw ~hourly restarts - T290330"'

Mentioned in SAL (#wikimedia-operations) [2021-09-17T01:55:10Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] Testing on arbitrary codfw host: ryankemper@wdqs2001:~$ sudo run-puppet-agent

Mentioned in SAL (#wikimedia-operations) [2021-09-17T02:22:00Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] wdqs2001 and wdqs2004 look fine after running sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer to clean up dangling timer

Mentioned in SAL (#wikimedia-operations) [2021-09-17T02:28:39Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] Successfully rolled out to rest of fleet sudo cumin 'C:query_service::crontasks' 'sudo run-puppet-agent --force && sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer'

We're no longer doing hourly restarts; we should monitor availability over the weekend to make sure we can handle current load without the crutch of the hourly restarts. The useragent we banned during the last full outage (dailymotion) is still banned, so load should be okay, but I'll be checking Grafana over the next couple of days.

We can/should unban dailymotion early next week - perhaps Tuesday - if WDQS has been stable over the weekend and during Monday's load.