
Wikidata Query Service unstable in codfw
Open, HighPublicBUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:
Returns a 502 Bad Gateway error.

(confirmed from Toolforge, PAWS, and my local computer via 2620:0:863:ed1a::1)


Event Timeline

Bugreporter triaged this task as Unbreak Now! priority.Fri, Sep 3, 3:48 PM
Bugreporter added a project: Traffic.
Bugreporter updated the task description. (Show Details)
Ladsgroup lowered the priority of this task from Unbreak Now! to Needs Triage.Fri, Sep 3, 3:50 PM
Ladsgroup added a subscriber: Ladsgroup.

Is your port wrong? The URL works for me:

amsa@C382:~$ curl --resolve query.wikidata.org:443 "https://query.wikidata.org/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&nocache=27178056"
<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
		<variable name='y'/>
	</head>
	<results>
		<result>
			<binding name='y'>
				<literal datatype='http://www.w3.org/2001/XMLSchema#dateTime'>2021-09-03T15:50:38Z</literal>
			</binding>
		</result>
	</results>
</sparql>

Please use ulsfo to access the server.

Note that there is no problem accessing https://query.wikidata.org/ itself; it only returns a 502 when a query is executed.
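For anyone else reproducing this per-datacenter: curl's `--resolve` flag takes a `HOST:PORT:ADDRESS` triple, so the invocation above needs an edge IP as the third field. A minimal sketch, using a placeholder documentation IP rather than a real ulsfo or codfw address:

```shell
# Build a --resolve argument that pins query.wikidata.org to a specific
# edge IP, so the request hits a chosen datacenter regardless of GeoDNS.
# 198.51.100.1 is a placeholder (TEST-NET-2), not a real Wikimedia address.
HOST=query.wikidata.org
PORT=443
EDGE_IP=198.51.100.1
RESOLVE="${HOST}:${PORT}:${EDGE_IP}"

# Print the command rather than executing it, for illustration:
echo curl --resolve "$RESOLVE" \
  "https://${HOST}/sparql?query=SELECT%20*%20WHERE%20%7B%7D%20LIMIT%201"
```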

Ladsgroup renamed this task from 502 Bad Gateway on WDQS to 502 Bad Gateway on WDQS on ulsfo.Fri, Sep 3, 3:57 PM
Gehel triaged this task as High priority.Fri, Sep 3, 4:04 PM
Gehel edited projects, added Discovery-Search (Current work); removed Traffic.
Gehel added a subscriber: Zbyszko.

We are experiencing overload issues on the WDQS cluster in codfw. We suspect some specific queries are taking blazegraph down, but we haven't been able to identify (or block) those yet. The best workaround at the moment seems to be to regularly restart blazegraph (every hour).
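The hourly restart is later implemented in puppet as a systemd timer ("restart hourly w random"); the idea can be sketched roughly as below. The jitter window and the `wdqs-blazegraph` service name are assumptions here, not the actual puppet code:

```shell
# Rough sketch of an hourly blazegraph restart with random jitter.
# A random splay keeps the codfw hosts from restarting in lockstep,
# which would take the whole cluster down at once.
SPLAY=$(( RANDOM % 900 ))   # up to 15 min of jitter (window is an assumption)
echo "Sleeping ${SPLAY}s before restart"
# sleep "${SPLAY}"                       # commented out for illustration
# sudo systemctl restart wdqs-blazegraph # service name is an assumption
```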

See @Zbyszko's email to the wikidata mailing list: https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/SO7JKEGEO4AT7EDUYGYTVRGFHRLU6SB2/

Gehel renamed this task from 502 Bad Gateway on WDQS on ulsfo to Wikidata Query Service unstable in codfw.Fri, Sep 3, 4:05 PM

Mentioned in SAL (#wikimedia-operations) [2021-09-03T16:10:13Z] <gehel> blazegraph (public codfw cluster) will now restart every hour - T290330

For higher availability, you may want to depool each server before restarting it.
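A rolling restart along those lines might look like the sketch below; `depool`/`pool` are the conftool wrapper scripts available on WMF hosts, while the host list, sleep durations, and service name are illustrative assumptions:

```shell
# Build the per-host command: drain traffic from the load balancer,
# restart blazegraph, wait for it to settle, then repool.
restart_cmd() {
  echo "depool && sleep 60 && sudo systemctl restart wdqs-blazegraph && sleep 60 && pool"
}

# Echo rather than execute, for illustration; hostnames are examples.
for host in wdqs2001.codfw.wmnet wdqs2002.codfw.wmnet; do
  echo ssh "$host" "$(restart_cmd)"
done
```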

Change 717494 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717494

Change 717494 merged by Ryan Kemper:

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717494

Change 717508 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717508

Change 717508 merged by Ryan Kemper:

[operations/puppet@production] wdqs: temp mitigation => restart hourly w random

https://gerrit.wikimedia.org/r/717508

Mentioned in SAL (#wikimedia-operations) [2021-09-03T17:17:55Z] <ryankemper> T290330 Deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/717508 across wdqs fleet; codfw wdqs hosts will restart on average once per hour now to address ongoing availability issues for wdqs codfw

Mentioned in SAL (#wikimedia-operations) [2021-09-03T19:04:43Z] <ryankemper> T290330 ryankemper@cumin1001:~$ sudo -E cumin 'P{wdqs2*}' 'sudo rm -fv /etc/cron.hourly/restart-blazegraph' (Cleaned up manually created crons now that we have [somewhat hacky] systemd timers doing the same job)

Change 720102 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: remove codfw hourly restarts

https://gerrit.wikimedia.org/r/720102

Change 720102 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove codfw hourly restarts

https://gerrit.wikimedia.org/r/720102

Mentioned in SAL (#wikimedia-operations) [2021-09-17T01:47:59Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] sudo cumin 'C:query_service::crontasks' 'sudo disable-puppet "Stop doing wdqs codfw ~hourly restarts - T290330"'

Mentioned in SAL (#wikimedia-operations) [2021-09-17T01:55:10Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] Testing on arbitrary codfw host: ryankemper@wdqs2001:~$ sudo run-puppet-agent

Mentioned in SAL (#wikimedia-operations) [2021-09-17T02:22:00Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] wdqs2001 and wdqs2004 look fine after running sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer to clean up dangling timer

Mentioned in SAL (#wikimedia-operations) [2021-09-17T02:28:39Z] <ryankemper> T290330 [Remove WDQS codfw ~hourly restarts] Successfully rolled out to rest of fleet sudo cumin 'C:query_service::crontasks' 'sudo run-puppet-agent --force && sudo systemctl reset-failed wdqs-restart-hourly-w-random-delay.timer'

We're no longer doing hourly restarts; we should monitor availability over the weekend to make sure we can handle current load without the crutch of the hourly restarts. The useragent we banned during the last full outage (dailymotion) is still banned, so load should be okay, but I'll be checking Grafana over the next couple of days.

We can/should unban dailymotion early next week - perhaps Tuesday - if WDQS has been stable over the weekend and during Monday's load.