
ORES in eqiad is unhappy
Closed, Resolved · Public

Description

As of 1600 UTC on June 13th, errors have increased. These are not overload errors and not timeout errors. The increase seems to have affected only eqiad and is not correlated with a deployment.

See https://grafana.wikimedia.org/dashboard/db/ores?panelId=2&fullscreen&orgId=1&from=1497364774089&to=1497375574089

Event Timeline

Halfak triaged this task as Unbreak Now! priority. Jun 13 2017, 5:40 PM

I'm seeing lots of these in the logs:

Traceback (most recent call last):
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/deployment/ores/venv/lib/python3.4/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./ores/wsgi/preprocessors.py", line 27, in nocache_route
    response = make_response(route(*args, **kwargs))
  File "./ores/wsgi/preprocessors.py", line 14, in minifiable_route
    response = route(*args, **kwargs)
  File "./ores/wsgi/routes/v1/scores.py", line 52, in score_model_revisions
    score_request = build_score_request(scoring_system, request, context)
  File "./ores/wsgi/util.py", line 87, in build_score_request
    rev_ids = parse_rev_ids(request, rev_id)
  File "./ores/wsgi/util.py", line 108, in parse_rev_ids
    return read_bar_split_param(request, "revids", type=int)
  File "./ores/wsgi/util.py", line 57, in read_bar_split_param
    .format(param, str(e)))
ores.wsgi.util.ParamError: Could not interpret revids. invalid literal for int() with base 10: 'last'
Exception on /scores/itwiki/ [GET]
Traceback (most recent call last):
  File "./ores/wsgi/util.py", line 54, in read_bar_split_param
    return [type(value) for value in values.split("|")]
  File "./ores/wsgi/util.py", line 54, in <listcomp>
    return [type(value) for value in values.split("|")]
ValueError: invalid literal for int() with base 10: 'last'
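
The failing path boils down to splitting the revids parameter on "|" and casting each piece to int. A minimal sketch that reproduces the failure (simplified stand-ins for the helpers in ores.wsgi.util, not the production code):

# Simplified stand-ins for ores.wsgi.util.ParamError / read_bar_split_param.
class ParamError(Exception):
    pass

def read_bar_split_param(values, param, type=str):
    # Mirrors the list comprehension in the traceback: split on "|" and cast each value.
    try:
        return [type(value) for value in values.split("|")]
    except ValueError as e:
        raise ParamError("Could not interpret {0}. {1}".format(param, str(e)))

read_bar_split_param("85958727|last", "revids", type=int)
# ParamError: Could not interpret revids. invalid literal for int() with base 10: 'last'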

It seems the number of requests suddenly increased at 1600 UTC on the graph.

Looks like it's a ton of requests for itwiki.

This looks like user error. For some reason, we're getting requests with "?revids=last". I'm downgrading this to "high" priority.

Halfak lowered the priority of this task from Unbreak Now! to High. Jun 13 2017, 5:51 PM

I can't seem to get a single request to go through on scb1001.

halfak@scb1001:~$ wget 'localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json' -O-
--2017-06-13 18:22:21--  http://localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8081... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8081... connected.
HTTP request sent, awaiting response... 500 INTERNAL SERVER ERROR
2017-06-13 18:22:21 ERROR 500: INTERNAL SERVER ERROR.

halfak@scb1001:~$ wget 'localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json' -O-
--2017-06-13 18:22:23--  http://localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8081... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8081... connected.
HTTP request sent, awaiting response... 200 OK
Length: 436 [application/json]
Saving to: ‘STDOUT’

{"scores": {"ruwiki": {"damaging": {"scores": {"85958727": {"error": {"message": "Timed out after 15 seconds.", "type": "TimeoutError"}}}, "version": "0.3.0"}, "goodfaith": {"scores": {"85958727": {"error": {"message": "Timed out after 15 seconds.", "type": "TimeoutError"}}}, "version": "0.3.0"}, "reverted": {"scores": {"85958727": {"error": {"message": "Timed out after 15 seconds.", "typ

2017-06-13 18:23:08 (69.0 MB/s) - written to stdout [436/436]

The logs in /srv/log/ores/app.log don't show anything for the 500 response.

The log hasn't been written to in an hour:

~$ ls -alh /srv/log/ores/app.log 
-rw-r--r-- 1 www-data www-data 6.3K Jun 13 17:54 /srv/log/ores/app.log
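
For spot-checking the endpoint outside of wget, a rough probe along these lines would flag per-revision error objects like the TimeoutError above (a sketch using the Python requests library; the URL, parameters, and response shape are copied from the v2 output captured in this task, and this is not any existing tooling):

import requests

URL = "http://localhost:8081/v2/scores/ruwiki/"
PARAMS = {"models": "reverted|damaging|goodfaith",
          "revids": "85958727",
          "format": "json"}

# Hit the same endpoint as the wget calls above and report per-revision errors.
resp = requests.get(URL, params=PARAMS, timeout=30)
print(resp.status_code)
resp.raise_for_status()

# Walk the v2 response shape: scores -> wiki -> model -> scores -> rev_id.
for wiki, models in resp.json().get("scores", {}).items():
    for model, payload in models.items():
        for rev_id, score in payload.get("scores", {}).items():
            if "error" in score:
                print("{0}/{1}/{2}: {3}: {4}".format(
                    wiki, model, rev_id,
                    score["error"]["type"], score["error"]["message"]))
            else:
                print("{0}/{1}/{2}: ok".format(wiki, model, rev_id))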
Halfak raised the priority of this task from High to Unbreak Now!. Jun 13 2017, 6:31 PM

I can get a request to go through:

ladsgroup@scb1001:~$ wget 'localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json' -O-
--2017-06-13 18:31:37--  http://localhost:8081/v2/scores/ruwiki/?models=reverted%7Cdamaging%7Cgoodfaith&revids=85958727&format=json
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:8081... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:8081... connected.
HTTP request sent, awaiting response... 200 OK
Length: 489 [application/json]
Saving to: ‘STDOUT’

{"scores": {"ruwiki": {"damaging": {"scores": {"85958727": {"prediction": false, "probability": {"false": 0.6761220469324349, "true": 0.3238779530675651}}}, "version": "0.3.0"}, "goodfaith": {"scores": {"85958727": {"prediction": true, "probability": {"false": 0.2027760571206214, "true": 0.7972239428793786}}}, "version": "0.3.0"},

2017-06-13 18:31:39 (102 MB/s) - written to stdout [489/489]

ladsgroup@scb1001:~$

I genuinely have no idea what's going on here.

Mentioned in SAL (#wikimedia-operations) [2017-06-13T19:37:19Z] <Amir1> restarting ores-related services in scb1001 (T167819)

Everything is getting back up. The only node having problems (and putting strain on the other nodes) was scb1001, and it's very likely that Electron-PDFs on that node, which was taking too much CPU, drove the node to full capacity; killing it let things recover.
This is the number of edits it was able to work on:

image.png (808×1 px, 129 KB)

Electron got killed at 19:24.
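
If this recurs, something like the following could confirm which process is eating the CPU before anything gets killed (a sketch using the psutil library; not part of any existing tooling on scb1001):

import time
import psutil

# Prime the per-process CPU counters, then sample over one second.
for proc in psutil.process_iter():
    try:
        proc.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
time.sleep(1.0)

usage = []
for proc in psutil.process_iter(attrs=["pid", "name"]):
    try:
        usage.append((proc.cpu_percent(interval=None),
                      proc.info["pid"], proc.info["name"]))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

# Print the ten biggest CPU consumers.
for cpu, pid, name in sorted(usage, reverse=True)[:10]:
    print("{0:6.1f}%  {1:>7}  {2}".format(cpu, pid, name))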

Dzahn claimed this task.
Dzahn lowered the priority of this task from Unbreak Now! to Medium.

12:59 < mutante> is it ok to lower from UBN then?
12:59 < Amir1> mutante: it's okay to resolve it I guess

Also see T167834, "Limit resources used by pdfrender service".