Maniphest T193417

ReadingLists performance degradation
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Tgr
	Apr 30 2018, 5:30 PM

Description

ReadingLists API response times have skyrocketed last week:

median	p99

https://grafana.wikimedia.org/dashboard/db/reading-list-service

Average response time is up from 60ms to 400ms. p75 is up from 80ms to 1s. p95 is up from 250ms to 3.5s. p99 is up from 0.5s to 5s. Median is unchanged (around 50ms) so this only affects 30-40% of the requests.

The jump coincides with the MW train deploy:

There was traffic growth during the period but not nearly so abrupt as the response time change (and the traffic comes from mobile apps so traffic patterns are not affected by the train).

Action API logs claim almost all request finish well under 50ms, but might be inaccurate due to T193484: API request time logs should happen after DB commit.

The time increase seems to be happening in increments of 500ms, with a ceiling of 5s:

GET request times, p75	GET request times, p95

That suggest some operation times out and is retried 10 times. There are no timeout (or other) errors though and none of the requests seem to be failing.

Related Objects

Mentioned In: T187951: Intermittent "Error loading data from server" error using VE on officewiki
T190082: 5-second latency for certain API calls?
T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending
T193484: API request time logs should happen after DB commit
Mentioned Here: T190082: 5-second latency for certain API calls?
T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending
T193484: API request time logs should happen after DB commit
T189360: Consider logging 4xx action API errors in RESTBase
T192896: Frequent client-side errors for reading list service

Event Timeline

Tgr created this task.Apr 30 2018, 5:30 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 30 2018, 5:30 PM

Tgr moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Apr 30 2018, 5:42 PM

Tgr claimed this task.Apr 30 2018, 5:44 PM

Tgr added subscribers: • Fjalapeno, LGoto.

• Imarlier moved this task from Inbox, needs triage to Radar on the Performance-Team board.Apr 30 2018, 7:48 PM

• Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Most of the change is from the sync endpoint producing lotsof 5s responses, but the others also show slower responses, event though their request count didn't really change.

`GET /changes/since/...`
number of calls	median response time	p99 response time

`POST /lists`

`POST /lists/{id}/entries/`

@Fjalapeno have not had the time to do much useful about this, will continue tomorrow. At a glance though, times on the MediaWiki side (as reported by ApiAction) don't seem long.

Queried MW backend times from ApiAction with

select day, params['action'], params['command'], !params['list'], !params['meta'], avg(timeSpentBackend) from ApiAction where year = 2018 and month = 4 and day >= 20 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries') group by day, params['action'], params['command'], !params['list'], !params['meta'];

(results and some stats here); those are suspiciously fast (20-40ms), not sure how much that can be trusted.

@mobrovac @Pchelolo does RESTBase log anything that I could use to cross-correlate MW API request durations?

@Dbrant @JoeWalsh did you notice anything strange or any error reports from users? Most sync requests result in a 4xx error, but with T192896: Frequent client-side errors for reading list service and lack of more detailed logging of those kinds of errors ATM (T189360) it's hard to tell if that's normal or not.

Another ApiAction query with p99 times:

select day, case when params['list'] is not null then 'readinglists' when params['meta'] is not null then 'readinglistentries' else params['command'] end action, hadError, avg(timeSpentBackend) avg_time, percentile_approx(timeSpentBackend, 0.99) p99_time from ApiAction where year = 2018 and month = 4 and day >= 20 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries') group by day, case when params['list'] is not null then 'readinglists' when params['meta'] is not null then 'readinglistentries' else params['command'] end, hadError

(added to the same sheet)

The processing times are tiny and don't change over time. Either the ApiAction numbers are completely wrong or the delays are after the API has generated the response. The API logs these times before committing the open DB transaction, so maybe that's where the delays happen.

Tgr mentioned this in T193484: API request time logs should happen after DB commit.May 1 2018, 11:31 AM

In T193417#4170727, @Tgr wrote:

with T192896: Frequent client-side errors for reading list service and lack of more detailed logging of those kinds of errors ATM (T189360) it's hard to tell if that's normal or not.

Breakdown is in T192896#4170798, seems all normal (unless ApiAction does not log certain errors at all... RESTBase statsd logs say approx. 200K successful requests and 165K 4xx errors since the 27th, ApiAction logs say 700K successful requests and 560K 4xx errors, which both give a ~40% error rate, so while statsd sampling probably changes the request counts somehow, it doesn't seem like a large class of failed requests are missing from one or the other log.)

(Also, looks like T189360 was fixed already, cool!)

Tgr updated the task description. (Show Details)May 1 2018, 11:59 AM

There are a few 504 errors from MediaWiki but not too many (260 for the last few days), and they seem to correlate with the sync request count but not with the sudden jump in response times: https://logstash.wikimedia.org/goto/8009262e31d5a1a6eb6a520125a8c292

Sync (/data/lists/changed/since/...) stats for the last 7 days (choosing sync as it's about 80% of total ReadingLists traffic):

sync request count, 2xx	sync request count, 4xx	sync response time, median	sync response time, p99	504 response count

@mobrovac @Pchelolo does RESTBase log anything that I could use to cross-correlate MW API request durations?

We do log slow requests with a 1% probability, and it's not per-request type, but across the board. There is a couple of log entries for reading lists here: https://logstash.wikimedia.org/goto/6a66a622663f986a90221b0f8d35fadf

• Pchelolo added a project: Services (watching).May 1 2018, 1:21 PM

Also, playing with latencies dashboard on RESTBase level, we have the ability to separate latencies by response code and I can see the same degradation for 2xx as for 4xx:

2XX Responses:

Screen Shot 2018-05-01 at 10.26.23 AM.png (714×1 px, 286 KB)

4xx Responses:

Screen Shot 2018-05-01 at 10.26.51 AM.png (714×1 px, 455 KB)

3xx and 5xx are negligible in rate, so no way to properly say whether their latencies changed or not.

Thanks @Pchelolo, that doesn't really tell me anything about where the time is spent though. What I would like to confirm is that only a trivial amount of time is spent on the RESTBase side and it's the action API subrequest that's slow.

@Tgr we don't have stats JUST for the Action API request itself, but we do have stats for calls to the action.js module. Since it's such a thin wrapper over a pure request to the Action API, an since for reading lists we only use rawquery that is even thinner, I guess the stats for restbase.external.sys_action_rawquery.ALL.ALL.p95 woul be a decent representation of the actual latencies we see in requests from RESTBase to Action API:

Screen Shot 2018-05-01 at 10.31.42 AM.png (762×1 px, 424 KB)

Thanks, that make it pretty clear that the issue is on the MediaWiki side.

T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending is one possible culprit.

One thing that jumps out is that the long requests typically take some multiple of 500ms to finish:

GET request times, p75	GET request times, p95

That suggests something times out and gets retried but I have no idea what that something might be.

That suggests something times out and gets retried but I have no idea what that something might be.

The RESTBase -> Action API the connect timeout is 5s and request timeout is 2 minutes, so it's not initiated by RESTBase.

GET and POST queries show the same pattern of behavior, so it probably is not related to the bug above or to transaction commit times:

Tgr updated the task description. (Show Details)May 1 2018, 3:16 PM

Tgr updated the task description. (Show Details)

Tgr updated the task description. (Show Details)May 1 2018, 6:41 PM

FWIW I don't believe we've seen any direct user complaints about slowdowns or breakage of syncing...

https://gerrit.wikimedia.org/r/c/430116/ might fix this.

In T193417#4175536, @Tgr wrote:

https://gerrit.wikimedia.org/r/c/430116/ might fix this.

Unfortunately it didn't.

Tgr mentioned this in T190082: 5-second latency for certain API calls?.May 14 2018, 10:15 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.May 21 2018, 8:34 PM

ssastry mentioned this in T187951: Intermittent "Error loading data from server" error using VE on officewiki.May 29 2018, 3:41 PM

Response times have been slowly improving since May 25 or so, but still way worse than in the past:

median	p99

Possibly coordinated with getting rid of the invalid domain names, might be coincidental though:

median response times

grafana.wikimedia.org_dashboard_db_reading-list-service_orgId=1&from=1527015202000&to=1527674002620 (1).png (218×469 px, 156 KB)

non-batch list entry creation count

grafana.wikimedia.org_dashboard_db_reading-list-service_orgId=1&from=1527015202000&to=1527674002620.png (362×528 px, 90 KB)

After T190082#4277676, load times now seem to be back to their original levels. Thanks @aaron!

• Vvjjkkii renamed this task from ReadingLists performance degradation to mydaaaaaaa.Jul 1 2018, 1:12 AM

• Vvjjkkii reopened this task as Open.

• Vvjjkkii removed aaron as the assignee of this task.

• Vvjjkkii triaged this task as High priority.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description. (Show Details)

• Vvjjkkii removed a subscriber: Aklapper.

• bearND renamed this task from mydaaaaaaa to ReadingLists performance degradation.Jul 1 2018, 4:34 AM

• bearND closed this task as Resolved.

• bearND assigned this task to aaron.

• bearND raised the priority of this task from High to Needs Triage.

• bearND updated the task description. (Show Details)

• bearND removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

• bearND added a subscriber: Aklapper.

	F18733669: grafana.wikimedia.org_dashboard_db_reading-list-service_refresh=5m&orgId=1&from=now-45d&to=now.png
	Jun 4 2018, 10:00 AM

	F18734463: grafana.wikimedia.org_dashboard_db_reading-list-service_orgId=1&from=1527015202000&to=1527674002620 (1).png
	Jun 4 2018, 10:00 AM

	F18734470: grafana.wikimedia.org_dashboard_db_reading-list-service_orgId=1&from=1527015202000&to=1527674002620.png
	Jun 4 2018, 10:00 AM

	F18733758: grafana.wikimedia.org_dashboard_db_reading-list-service_refresh=5m&orgId=1&from=now-45d&to=now (1).png
	Jun 4 2018, 10:00 AM

	F17489197: readinglists-dashboard-p75.png
	May 1 2018, 2:00 PM

	F17489200: readinglists-dashboard-p95.png
	May 1 2018, 2:00 PM

	F17489010: Screen Shot 2018-05-01 at 10.31.42 AM.png
	May 1 2018, 1:34 PM

ReadingLists performance degradationClosed, ResolvedPublicActions

Description

Related Objects

Event Timeline

ReadingLists performance degradation
Closed, ResolvedPublic
Actions