Page MenuHomePhabricator

ORES service erroring, in a way that throws exceptions in Extension:ORES
Closed, ResolvedPublic

Description

https://ores.wikimedia.org/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=584

{
  "error": {
      "code": "internal server error",
          "message": "Traceback (most recent call last):
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 378, in _make_request
    httplib_response = conn.getresponse()
  File \"/usr/lib/python3.4/http/client.py\", line 1172, in getresponse
    response.begin()
  File \"/usr/lib/python3.4/http/client.py\", line 351, in begin
    version, status, reason = self._read_status()
  File \"/usr/lib/python3.4/http/client.py\", line 313, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), \"iso-8859-1\")
  File \"/usr/lib/python3.4/socket.py\", line 371, in readinto
    return self._sock.recv_into(b)
  File \"/usr/lib/python3.4/ssl.py\", line 745, in recv_into
    return self.read(nbytes, buffer)
  File \"/usr/lib/python3.4/ssl.py\", line 617, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/adapters.py\", line 376, in send
    timeout=timeout
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 609, in urlopen
    _stacktrace=sys.exc_info()[2])
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py\", line 247, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/packages/six.py\", line 310, in reraise
    raise value
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 559, in urlopen
    body=body, headers=headers)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 380, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py\", line 308, in _raise_timeout
    raise ReadTimeoutError(self, url, \"Read timed out. (read timeout=%s)\" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read timeout=5.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/mwapi/session.py\", line 101, in _request
    auth=auth)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/sessions.py\", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/sessions.py\", line 576, in send
    r = adapter.send(request, **kwargs)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/requests/adapters.py\", line 449, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read timeout=5.0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"./ores/wsgi/routes/v3/scores.py\", line 24, in process_score_request
    score_response = scoring_system.score(score_request)
  File \"./ores/scoring_systems/scoring_system.py\", line 42, in score
    response = self._score(request)
  File \"./ores/scoring_systems/celery_queue.py\", line 192, in _score
    return super()._score(*args, **kwargs)
  File \"./ores/scoring_systems/scoring_system.py\", line 84, in _score
    request, missing_model_set_revs)
  File \"./ores/scoring_systems/scoring_system.py\", line 131, in _extract_root_caches
    model_set, rev_ids, injection_caches=request.injection_caches)
  File \"./ores/scoring_context.py\", line 175, in extract_root_dependency_caches
    for rev_id, (error, _) in zip(rev_ids, error_root_vals):
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api/extractor.py\", line 123, in _extract_many
    rev_docs = self.get_rev_doc_map(revids_to_lookup, rvprop=rvprop)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api/extractor.py\", line 230, in get_rev_doc_map
    return {rd['revid']: rd for rd in rev_docs}
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api/extractor.py\", line 230, in <dictcomp>
    return {rd['revid']: rd for rd in rev_docs}
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/revscoring/extractors/api/extractor.py\", line 240, in query_revisions_by_revids
    revids=batch_ids, **params)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/mwapi/session.py\", line 309, in get
    continuation=continuation)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/mwapi/session.py\", line 171, in request
    files=files)
  File \"/srv/deployment/ores/venv/lib/python3.4/site-packages/mwapi/session.py\", line 103, in _request
    raise TimeoutError(str(e)) from e
mwapi.errors.TimeoutError: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read timeout=5.0)
"
            }
            }

Event Timeline

awight triaged this task as Unbreak Now! priority.Oct 26 2017, 7:01 PM

Seems that the root cause is:

requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='wikidata.org', port=443): Read timed out. (read timeout=5.0)

Ideally we would be more robust to this.

Change 386688 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Blindly choose a timeout of 15 seconds

https://gerrit.wikimedia.org/r/386688

Change 386688 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Blindly choose a timeout of 15 seconds

https://gerrit.wikimedia.org/r/386688

Change 386691 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@STABLE_REVSCORING_1] Blindly choose a timeout of 15 seconds

https://gerrit.wikimedia.org/r/386691

Change 386691 merged by Awight:
[mediawiki/services/ores/deploy@STABLE_REVSCORING_1] Blindly choose a timeout of 15 seconds

https://gerrit.wikimedia.org/r/386691

Mentioned in SAL (#wikimedia-operations) [2017-10-26T19:33:09Z] <awight@tin> Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107

Mentioned in SAL (#wikimedia-operations) [2017-10-26T19:33:20Z] <awight@tin> Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2017-10-26T19:34:00Z] <awight@tin> Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107

Mentioned in SAL (#wikimedia-operations) [2017-10-26T19:41:25Z] <awight@tin> Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 07m 25s)

Change 386698 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] UBN! disbale ores for wikidata

https://gerrit.wikimedia.org/r/386698

Change 386698 merged by jenkins-bot:
[operations/mediawiki-config@master] UBN! disbale ores for wikidata

https://gerrit.wikimedia.org/r/386698

Mentioned in SAL (#wikimedia-operations) [2017-10-26T20:00:54Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)

After a quick discussion on IRC, the consensus is that Extension:ORES should continue to fatal when it receives fatals from the ORES service. We've disabled Extension:ORES for wikidata until the slow API responses can be diagnosed.

Mentioned in SAL (#wikimedia-operations) [2017-10-26T20:02:56Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)

One thing we can do in the future is to throw a more specific exception, which passes through the ORES service error. The service error needs to be improved as well, the root cause is buried in some misleading stuff about a "buffering" argument.

I hear that Wikidata API performance has recovered, yet I can still get a timeout some of time, by using the uncached URL:
https://ores.wikimedia.org/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=584&features=

The number of timeouts varies, so I'm not entirely comfortable with the theory that the API has recovered.

Nonetheless, the working theory is that we'll reenable Extension:ORES for wikidata on Monday, before wmf.5 is redeployed to that wiki, so that we have a realistic system state. Please give @BBlack and IRC#wikimedia-operations some warning before pushing the config change.

Mentioned in SAL (#wikimedia-operations) [2017-10-30T14:07:43Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: Revert "UBN! disbale ores for wikidata" (T179107) (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2017-10-30T14:09:14Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: Revert "UBN! disbale ores for wikidata" (T179107) (duration: 00m 50s)

status?

Status? What is next here? A week without updates doesn't suggest UBN! status...

Thanks for the ping--I think the ORES piece of this has calmed down, we were exacerbating T179156 and now that the cause has been alleviated, there's nothing for us to do. Closing.