
ORES worker icinga tests complain during testwiki deployments
Closed, Resolved · Public

Description

Here's an icinga notice from today:

[20:00:16] <icinga-wm> PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 INTERNAL SERVER ERROR - 6520 bytes in 5.053 second response time https://wikitech.wikimedia.org/wiki/ORES

This happens periodically. It looks like it is related to deployments to test.wikipedia.org.

The following error shows up on our web nodes:

Traceback (most recent call last):
  File "./ores/wsgi/routes/v1/scores.py", line 32, in process_score_request
    score_response = scoring_system.score(score_request)
  File "./ores/scoring_systems/scoring_system.py", line 59, in score
    response = self._score(request)
  File "./ores/scoring_systems/celery_queue.py", line 195, in _score
    return super()._score(*args, **kwargs)
  File "./ores/scoring_systems/scoring_system.py", line 104, in _score
    request, missing_model_set_revs)
  File "./ores/scoring_systems/scoring_system.py", line 151, in _extract_root_caches
    model_set, rev_ids, injection_caches=request.injection_caches)
  File "./ores/scoring_context.py", line 173, in extract_root_dependency_caches
    for rev_id, (error, _) in zip(rev_ids, error_root_vals):
  File "/srv/ores/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 126, in _extract_many
    rev_docs = self.get_rev_doc_map(revids_to_lookup, rvprop=rvprop)
  File "/srv/ores/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 233, in get_rev_doc_map
    return {rd['revid']: rd for rd in rev_docs}
  File "/srv/ores/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 233, in <dictcomp>
    return {rd['revid']: rd for rd in rev_docs}
  File "/srv/ores/venv/lib/python3.5/site-packages/revscoring/extractors/api/extractor.py", line 244, in query_revisions_by_revids
    **params)
  File "/srv/ores/venv/lib/python3.5/site-packages/mwapi/session.py", line 309, in get
    continuation=continuation)
  File "/srv/ores/venv/lib/python3.5/site-packages/mwapi/session.py", line 171, in request
    files=files)
  File "/srv/ores/venv/lib/python3.5/site-packages/mwapi/session.py", line 103, in _request
    raise TimeoutError(str(e)) from e
mwapi.errors.TimeoutError: HTTPSConnectionPool(host='test.wikipedia.org', port=443): Read timed out. (read timeout=5.0)
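
The traceback shows the read timeout travelling uncaught from the API lookup up to the route handler, which is consistent with the 500 that icinga reports. As a minimal sketch of that propagation (not ORES code; `UpstreamTimeout`, `fetch_rev_docs` and `handle_score_request` are hypothetical stand-ins), a handler could catch the upstream timeout and report it explicitly instead of failing with a generic 500:

```python
class UpstreamTimeout(Exception):
    """Hypothetical stand-in for mwapi.errors.TimeoutError."""


def fetch_rev_docs(rev_ids, wiki_is_responsive):
    """Stand-in for the MediaWiki API lookup; raises when the wiki is too slow."""
    if not wiki_is_responsive:
        raise UpstreamTimeout("Read timed out. (read timeout=5.0)")
    return {rev_id: {"revid": rev_id} for rev_id in rev_ids}


def handle_score_request(rev_ids, wiki_is_responsive):
    """Return (status_code, body), roughly as a web route might."""
    try:
        docs = fetch_rev_docs(rev_ids, wiki_is_responsive)
    except UpstreamTimeout as e:
        # Assumption: 504 is one reasonable way to signal "upstream was too slow".
        return 504, {"error": {"type": "TimeoutError", "message": str(e)}}
    return 200, {"revids": sorted(docs)}


if __name__ == "__main__":
    print(handle_score_request([1, 2], wiki_is_responsive=True))
    print(handle_score_request([1, 2], wiki_is_responsive=False))
```

This only illustrates the failure mode. It would not stop the check from alerting while test.wikipedia.org is unresponsive.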

Event Timeline

Halfak created this task. Apr 2 2019, 8:22 PM
Restricted Application added a subscriber: Aklapper. Apr 2 2019, 8:22 PM
Halfak added a comment. Apr 2 2019, 8:24 PM

I pulled up a browser and tried to load https://test.wikipedia.org as soon as I saw this error and my browser also reported a timeout. When I refreshed, the site loaded and icinga began reporting recoveries for ORES.

Harej triaged this task as High priority. Apr 2 2019, 9:09 PM
Harej moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.
Ladsgroup added a subscriber: Ladsgroup.

The reason for this is that ./modules/nagios_common/files/check_commands/check_ores_workers in puppet checks against testwiki:
$pluginpath/check_http -f follow -H $host -I $host -u "http://${urlhost}/scores/testwiki/reverted/${timestamp}/"

We changed testwiki to actually point to test.wikipedia.org, which doesn't have all the revisions (for example: https://ores.wikimedia.org/scores/testwiki/reverted/4567890987654356789).
The simplest solution, in my view, is to add a "fakewiki" and point the checks at that instead. I'll do that.
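
To make the check concrete, here is a rough Python sketch of what the check_http call above does, with the wiki name as a parameter so that testwiki can be swapped for fakewiki. The function names are hypothetical, and filling `${timestamp}` with the current Unix time is an assumption; only the URL shape comes from the check command.

```python
import time
import urllib.error
import urllib.request


def build_check_url(urlhost, wiki, rev_id=None):
    """Build the scoring URL in the shape used by the check command."""
    if rev_id is None:
        # Assumption: a fresh ID per run, mirroring ${timestamp} in the check.
        rev_id = int(time.time())
    return "http://{0}/scores/{1}/reverted/{2}/".format(urlhost, wiki, rev_id)


def check_workers(urlhost, wiki="fakewiki", timeout=10.0):
    """Request one score and return (ok, message), loosely like check_http."""
    url = build_check_url(urlhost, wiki)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, "HTTP OK: {0}".format(resp.status)
    except urllib.error.HTTPError as e:
        # A 500 from the service lands here.
        return False, "HTTP CRITICAL: {0} {1}".format(e.code, e.reason)
    except OSError as e:
        # Connection failures and timeouts (URLError is a subclass of OSError).
        return False, "CRITICAL: {0}".format(e)


if __name__ == "__main__":
    print(build_check_url("ores.wikimedia.org", "fakewiki", rev_id=6789))
```

With a wiki backed by fake models and extractors, a check like this would no longer depend on test.wikipedia.org answering within the read timeout.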

Restricted Application added a project: User-Ladsgroup. Apr 17 2019, 10:06 PM

Change 504796 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Add fakewiki to wikis for fake models and extractors

https://gerrit.wikimedia.org/r/504796

Works fine in staging: http://ores-staging.wmflabs.org/v3/scores/fakewiki/6789
Will push it to all of labs tomorrow.

Looks good to me.

Change 504796 merged by Halfak:
[mediawiki/services/ores/deploy@master] Add fakewiki to wikis for fake models and extractors

https://gerrit.wikimedia.org/r/504796

Change 506127 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] nagios: Migrate ores checks from testwiki to fakewiki

https://gerrit.wikimedia.org/r/506127

The new wiki is deployed; now we need the puppet change to be merged and deployed.

Change 506127 merged by Alexandros Kosiaris:
[operations/puppet@production] nagios: Migrate ores checks from testwiki to fakewiki

https://gerrit.wikimedia.org/r/506127

Halfak closed this task as Resolved. Jun 18 2019, 1:39 PM