
runUpdate.sh script in wikidata stand-alone has abruptly started incurring numerous 429 errors.
Closed, Resolved · Public

Description

I've been using the wikidata stand-alone facility (https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Standalone_service) for several months now, largely without incident.

Starting Tuesday Jun 13 '17 at about 11am Pacific time, the update script began issuing a large number of stack traces like the one excerpted below.

Did some configuration parameter change, such that this script is suddenly generating too many queries?

Thanks,

EXCERPT FROM STACK TRACE:

06:34:18.196 [main] INFO  org.wikidata.query.rdf.tool.Update - Polled up to Tue Jun 13 21:28:07 PDT 2017 (next: 20170614042807|531838996) at (11.2, 3.2, 1.1) updates per second and (3579.0, 1024.6, 364.4) milliseconds per second
06:34:18.196 [main] INFO  o.w.q.r.t.w.WikibaseRepository - useBackoff is false or rccontinue is not empty
06:34:18.196 [main] INFO  o.w.q.r.t.w.WikibaseRepository - starting fetchRecentChanges with rccontinue 20170614042807|531838996
06:34:18.196 [main] INFO  o.w.q.r.t.w.WikibaseRepository - Polling for changes from https://www.wikidata.org/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7Cids%7Ctimestamp&rcnamespace=0%7C120&rclimit=100&continue=&rccontinue=20170614042807%7C531838996&rcstart=20170614042807
06:34:18.464 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 100 changes, from Q21473187@500681175@20170614042807|531838996 to Q21476526@500681275@20170614042834|531839096
06:34:18.465 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - nextContinue creating new Batch:  {"rccontinue":"20170614042834|531839097","continue":"-||"}
06:34:19.114 [update 4] WARN  org.wikidata.query.rdf.tool.Update - Contained error syncing.  Giving up on Q13873716
org.wikidata.query.rdf.tool.exception.ContainedException: Unexpected status code fetching RDF for https://www.wikidata.org/wiki/Special:EntityData/Q17577314.ttl?nocache=1497533660340&flavor=dump:  429
	at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRdfForEntity(WikibaseRepository.java:267) ~[wikidata-query-tools-0.2.3-SNAPSHOT-jar-with-dependencies.jar:na]
	at org.wikidata.query.rdf.tool.Update.handleChange(Update.java:474) ~[wikidata-query-tools-0.2.3-SNAPSHOT-jar-with-dependencies.jar:na]
	at org.wikidata.query.rdf.tool.Update.access$0(Update.java:472) ~[wikidata-query-tools-0.2.3-SNAPSHOT-jar-with-dependencies.jar:na]
	at org.wikidata.query.rdf.tool.Update$1.run(Update.java:370) ~[wikidata-query-tools-0.2.3-SNAPSHOT-jar-with-dependencies.jar:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_131]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
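
For context, the failing call above is a plain HTTP GET against Special:EntityData. Below is a minimal sketch of such a fetch, assuming java.net.HttpURLConnection; it mirrors the status check that surfaces as the ContainedException in the log, and is not the actual WikibaseRepository.fetchRdfForEntity code.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class EntityDataFetch {
    // Sketch only: GET the Turtle dump for an entity via Special:EntityData
    // and fail on unexpected status codes, as the updater does above.
    static String fetchRdf(String entityId) throws IOException {
        URL url = new URL("https://www.wikidata.org/wiki/Special:EntityData/"
                + entityId + ".ttl?flavor=dump");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Wikidata Query Service Updater");
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            // A 429 here is what the log reports as "Unexpected status code".
            throw new IOException("Unexpected status code fetching RDF for "
                    + url + ": " + status);
        }
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }
}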

Event Timeline

Restricted Application added a subscriber: Aklapper.

Could you please collect the log with the -v option and link it here?

After the limit was raised as above, I tested the script again on my machine. It ran for about an hour and ten minutes, then hit a three-minute burst of the same errors cited above (sorry, I neglected to invoke the -v option on my most recent run). It then ran for about another half hour without incident.

This is a significant improvement. I wonder if raising the limit by, say, half again would solve the problem.

ema triaged this task as Medium priority. · Jun 16 2017, 9:14 AM

I took a look at rate-limited requests with User-Agent 'Wikidata Query Service Updater'. They come from non-WMF IPs (WMF addresses are currently excluded from rate limiting), and the uri_path is under /wiki/Special:EntityData/.

For instance, here are the per-second counts of WDQS Updater requests from a single IP this morning at 8AM (each output row is [minute, second, count]):

scala> val df = sqlContext.sql("select * from wmf.webrequest where webrequest_source = 'text' and year = 2017 and month = 6 and day = 16 and hour = 8 and user_agent = 'Wikidata Query Service Updater'")
scala> df.registerTempTable("t1") // expose the hour of webrequest logs as table t1 for the query below
scala> // Per-second request counts from a single (redacted) IP against Special:EntityData:
scala> sqlContext.sql("select minute(ts) as m, second(ts) as s, count(1) as c from t1 where ip = 'XXX.XXX.XXX.XXX' and uri_path like '/wiki/Special:EntityData/%' group by minute(ts), second(ts) order by m, s limit 10").collect().foreach(println)
[0,0,35]
[0,3,40]
[0,4,30]
[0,5,23]
[0,7,20]
[0,8,30]
[0,9,35]
[0,10,10]
[0,12,20]
[0,13,31]

That request rate is above the current rate limit of 20/s, resulting in some 429 responses.

So, two things:

  1. I'm gonna add /wiki/Special:EntityData/ to the list of endpoints with higher rate limits: up to 1000 requests per 10 seconds (100/s, with a burst of 1000). That should be enough given the stats above.
  2. Wikidata Query Service Updater raises an exception for any non-404 response with status code >= 300. In the specific case of 429 responses, it should definitely not crash but rather wait a bit and retry the request later. 429 "Too Many Requests" responses can contain a Retry-After header, specifying either a number of seconds to wait (an integer) or an HTTP-date after which the request can be retried; see https://tools.ietf.org/html/rfc6585#section-4 and https://tools.ietf.org/html/rfc7231#section-7.1.3. We currently return Retry-After: 1 in our 429 responses, and that should ideally be honored (a sketch of such retry handling follows below).
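
A minimal sketch of the retry behavior described in point 2, assuming java.net.HttpURLConnection and a hypothetical Fetcher seam; this is an illustration, not the updater's actual implementation. parseRetryAfter handles both the delta-seconds and HTTP-date forms allowed by RFC 7231.

import java.net.HttpURLConnection;
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class RetryAfterSupport {
    // Parse a Retry-After value: either delta-seconds ("1") or an HTTP-date
    // ("Fri, 16 Jun 2017 08:00:00 GMT"). Default to 1 second, the value
    // currently returned with Wikimedia 429 responses.
    static Duration parseRetryAfter(String value) {
        if (value == null) {
            return Duration.ofSeconds(1);
        }
        try {
            return Duration.ofSeconds(Long.parseLong(value.trim()));
        } catch (NumberFormatException notSeconds) {
            try {
                ZonedDateTime date = ZonedDateTime.parse(value.trim(),
                        DateTimeFormatter.RFC_1123_DATE_TIME);
                Duration wait = Duration.between(ZonedDateTime.now(date.getZone()), date);
                return wait.isNegative() ? Duration.ZERO : wait;
            } catch (RuntimeException notDate) {
                return Duration.ofSeconds(1);
            }
        }
    }

    // Wait and retry on 429 instead of raising an exception; give up after maxRetries.
    static HttpURLConnection fetchWithRetry(Fetcher fetcher, String url, int maxRetries)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            HttpURLConnection conn = fetcher.open(url);
            if (conn.getResponseCode() != 429) {
                return conn; // let the caller handle other status codes as before
            }
            if (attempt >= maxRetries) {
                throw new Exception("still rate-limited after " + maxRetries + " retries: " + url);
            }
            Thread.sleep(parseRetryAfter(conn.getHeaderField("Retry-After")).toMillis());
        }
    }

    // Hypothetical seam so the sketch stays self-contained; not part of the updater.
    interface Fetcher {
        HttpURLConnection open(String url) throws Exception;
    }
}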

Change 359401 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: apply API rate limits to wikidata too

https://gerrit.wikimedia.org/r/359401

Change 359401 merged by Ema:
[operations/puppet@production] VCL: apply API rate limits to wikidata too

https://gerrit.wikimedia.org/r/359401
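
For reference, the merged limit ("1000 requests per 10 seconds", i.e. 100/s sustained with a burst of 1000) behaves roughly like a token bucket. Here is a minimal Java sketch of those semantics, purely for illustration; the production change is Varnish VCL, not Java.

// Token-bucket illustration of "100/s with a burst of 1000": each request
// drains one token; tokens refill at 100 per second up to a cap of 1000.
public class TokenBucket {
    private final double capacity;     // burst size (e.g. 1000)
    private final double refillPerSec; // sustained rate (e.g. 100/s)
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the request is allowed, false if it should get a 429.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity,
                tokens + (now - lastRefillNanos) / 1e9 * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}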

> In the specific case of 429 responses, it should definitely not crash but rather wait a bit and retry the request later.

That makes a lot of sense; I will implement it.

ema claimed this task.

No Wikidata Query Service Updater requests have been rate-limited over the course of the past 3 days. Closing.

Everything is running smoothly on our end. Thanks @ema and @Smalyshev !