
WikidataSPARQLPageGenerator Swallows Failures
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Swallows syntax exception
> from pywikibot import pagegenerators as pg
> generator = pg.WikidataSPARQLPageGenerator("SELECT INVALID SPARQL")
WARNING: Http response status 400
> count = 0
> for _ in generator:
>   count += 1
> print(f"we fetched {count} rows")
we fetched 0 rows

There is no way to detect programmatically that the WARNING was issued; it is only printed to stderr. This gets a lot worse when a timeout happens only intermittently.

  • Swallows timeout exception (for some query timeoutQuery that times out)
> generator = pg.WikidataSPARQLPageGenerator(timeoutQuery)
WARNING: Http response status 500
> count = 0
> for _ in generator:
>   count += 1
> print(f"we fetched {count} rows")
we fetched 0 rows

What happens?:

The SPARQL query fails, but there is no way to detect this. Code consuming the returned generator will believe that the query returned no results when in reality the query never finished executing.

What should have happened instead?:

Optimally, an exception would be thrown and could be caught. This would likely break backwards compatibility, so either a new method or a new flag should be added. I need to be able to retry queries that time out, and as currently implemented there's no way to do that.
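For illustration, here is a minimal sketch of the retry loop the reporter is asking for, assuming the failure is surfaced as an exception the caller can catch. The names `retry_query`, `run_query`, `retry_on`, `max_attempts`, and `delay_seconds` are hypothetical and not part of pywikibot.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_query(
    run_query: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Call run_query(), retrying when it raises one of retry_on.

    Hypothetical helper: run_query is any zero-argument callable that
    executes the query and returns its results.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            time.sleep(delay_seconds)
    # The loop always returns or raises before reaching this point.
    raise AssertionError("unreachable")
```

A caller would wrap the query in a callable that consumes the generator fully and pass the exception types it considers retryable. This only helps if the library raises on failure instead of returning an empty result.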

Software version : 7.7.0-87-gbff4621b4

Other information : python-3.8.10

Event Timeline

The cause of the bug is this line in sparql.py. It's not clear to me how big the blast radius from fixing it there would be. Probably big.

@BrokenSegue: The Timeout is raised after several retries. The maximum number of retries is set in config.max_retries. You can either change this value in your user-config.py or set it within your bot script like:

from pywikibot import config
config.max_retries = 3

Or you can set this parameter when invoking your script like:

pwb -max_retries:3 <your script> [<script options>]

Does this help for the second case of this issue?

I am using requests version "2.28.1".

And unfortunately your solution doesn't work. That retry only happens if the HTTP request times out. But often when using wikidata's SPARQL the server itself times out and returns non-JSON as its response. This causes the offending code I linked to not retry at all no matter what max_retries is set to. Arguably this is another bug (it should detect/retry in this situation).
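The detection step described above could look roughly like the following sketch. It assumes the caller already has the HTTP status code and the response body as text; `SparqlResponseError` and `parse_sparql_response` are hypothetical names, not pywikibot APIs, and this is not how sparql.py is actually written.

```python
import json


class SparqlResponseError(Exception):
    """Hypothetical error for a response that carries no usable results."""


def parse_sparql_response(status_code: int, body: str) -> dict:
    """Return the decoded JSON results, or raise instead of hiding a failure."""
    if status_code >= 400:
        # Covers both the 400 (bad query) and 500 (server timeout) cases.
        raise SparqlResponseError(f"HTTP status {status_code}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        # The server answered, but with something other than JSON.
        raise SparqlResponseError("response body is not JSON") from exc
    if not isinstance(data, dict):
        raise SparqlResponseError("unexpected JSON structure")
    return data
```

Raising here, instead of returning nothing, is what lets calling code distinguish "the query matched no items" from "the query never completed".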

@BrokenSegue: Ah, that means the Timeout isn't raised in such a case. Can you give me your timeoutQuery example so I can find a solution for such a 500 status response?

sure. here's a query that reliably times out:

SELECT distinct ?item
WHERE
{
  VALUES ?goodRanks { wikibase:NormalRank wikibase:PreferredRank }
  ?item p:P856 ?url.
  ?url wikibase:rank ?goodRanks.
  # don't look at dead urls
  FILTER NOT EXISTS { ?url pq:P582 ?endTime. }
  FILTER NOT EXISTS { ?url pq:P8554 ?endTime. }
  FILTER NOT EXISTS { ?url pq:P1534 ?endCause. }

  ?url ps:P856 ?urlString.

  FILTER (STRSTARTS(STR(?urlString), "http://"))

  FILTER (!(contains(str(?item), "L" ))).
}

Wasn't able to reproduce the response 500 issue. The Timeout was raised for me (after max_retries tries):

WARNING: Waiting 20 seconds before retrying.
...
Traceback (most recent call last):
  File "D:\pwb\GIT\core\pywikibot\data\sparql.py", line 151, in query
    self.last_response = http.fetch(url, headers=headers)
  File "D:\pwb\GIT\core\pywikibot\comms\http.py", line 393, in fetch
    callback(response)
  File "D:\pwb\GIT\core\pywikibot\comms\http.py", line 283, in error_handling_callback
    raise response from None
  File "D:\pwb\GIT\core\pywikibot\comms\http.py", line 384, in fetch
    response = session.request(method, uri,
  File "C:\Python311\Lib\site-packages\requests\sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python311\Lib\site-packages\requests\sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python311\Lib\site-packages\requests\adapters.py", line 578, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='query.wikidata.org', port=443): Read timed out. (read timeout=45)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    generator = pg.WikidataSPARQLPageGenerator(timeout)
  File "D:\pwb\GIT\core\pywikibot\pagegenerators\_generators.py", line 1068, in WikidataSPARQLPageGenerator
    data = query_object.get_items(query,
  File "D:\pwb\GIT\core\pywikibot\data\sparql.py", line 199, in get_items
    res = self.select(query, full_data=True)
  File "D:\pwb\GIT\core\pywikibot\data\sparql.py", line 115, in select
    data = self.query(query, headers=headers)
  File "D:\pwb\GIT\core\pywikibot\data\sparql.py", line 153, in query
    self.wait()
  File "D:\pwb\GIT\core\pywikibot\data\sparql.py", line 166, in wait
    raise TimeoutError('Maximum retries attempted without success.')
pywikibot.exceptions.TimeoutError: Maximum retries attempted without success.

That's probably because the timeout in your user-config.py is too low. The default value is probably also too low. If you set

socket_timeout = 240

then you get the behavior I described. I tried commenting that line out in my config and I got what you saw. Basically your client was giving up too early and the server didn't yet throw a timeout.

Change 842363 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Raise a generic ServerError if requests response is a ServerError

https://gerrit.wikimedia.org/r/842363

Xqt triaged this task as High priority.

@BrokenSegue: I made the patch for the ServerError. Can you test it?

yeah this works better. now it throws an exception on server timeouts. but it isn't throwing exceptions on malformed SPARQL. e.g.

>from pywikibot import pagegenerators as pg
>generator = pg.WikidataSPARQLPageGenerator("select blah")
WARNING: Http response status 400

that said it's preferable to merge the current fix

Change 842363 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] Raise a generic ServerError if requests response is a ServerError

https://gerrit.wikimedia.org/r/842363

@Xqt: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

Xqt claimed this task.