
5-second latency for certain API calls?
Closed, ResolvedPublic

Description

I've observed serious latency on some of our API calls, and several of our users seem to be reporting the same thing. We need to track this down and fix it right away.

Here are some observations and theories:

  • This seems to be somewhat device-specific. It happens on my real device (Nexus 6P, API 26) but not on my emulators or on my other real device (Galaxy S7, API 25).
  • It seems to be specific to certain MediaWiki API calls, but not all of them. For example, the search API (action=query&generator=prefixsearch) is affected, but others are not.
  • The latency, when it happens, is almost exactly 5 seconds (or slightly more), which is suspicious. It suggests that something on the client side is deliberately blocking the request for 5 seconds, or something on the server side is deliberately waiting 5 seconds before processing it (see the timing sketch below).
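One way to confirm the delay from the app side is a simple timing interceptor along these lines (a debug-only sketch, not part of the app; the class and log tag names are just illustrative):

```java
// Debug-only sketch: log how long each request takes end to end,
// so the ~5-second stalls show up clearly in logcat.
import android.util.Log;
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

public final class RequestTimingInterceptor implements Interceptor {
    @Override public Response intercept(Interceptor.Chain chain) throws IOException {
        long start = System.nanoTime();
        Response response = chain.proceed(chain.request());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        Log.d("RequestTiming", chain.request().url().encodedPath() + " took " + elapsedMs + " ms");
        return response;
    }
}
```

Registering it with OkHttpClient.Builder().addInterceptor(new RequestTimingInterceptor()) is enough to see which calls are affected.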

Event Timeline

Dbrant triaged this task as High priority. Mar 19 2018, 7:11 PM

I think I might have a meaningful lead here. On my device (the only one on which I can reproduce the latency), the 5-second delay disappears when I disable our CookieJar in the OkHttp client. It looks like something in the cookie-handling logic is delaying requests by 5 seconds, and I'm now digging further into it.
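For anyone who wants to try the same experiment, here's a minimal sketch of what I mean by disabling the cookie jar (not our actual client wiring; the factory method and parameter names are just illustrative):

```java
// Debug sketch: swapping in CookieJar.NO_COOKIES bypasses cookie handling entirely,
// which is enough to tell whether the cookie path is what introduces the delay.
import okhttp3.CookieJar;
import okhttp3.OkHttpClient;

public final class DebugClientFactory {
    public static OkHttpClient newClient(CookieJar normalCookieJar, boolean bypassCookies) {
        return new OkHttpClient.Builder()
                // With bypassCookies=true, no cookies are stored or sent; requests go out "clean".
                .cookieJar(bypassCookies ? CookieJar.NO_COOKIES : normalCookieJar)
                .build();
    }
}
```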

Almost as soon as @Krinkle subscribed to this task, I came across T182322, which seems very promising in explaining the latency.
I also found that if we suppress the cpPosIndex cookie from being passed back to the server, the latency disappears.
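One way to do that for testing is a network interceptor that rewrites the outgoing Cookie header (debug-only sketch; it has to be registered via addNetworkInterceptor so it runs after OkHttp's cookie jar has attached the header):

```java
// Debug-only sketch: drop the cpPosIndex pair from the outgoing Cookie header.
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

public final class StripCpPosIndexInterceptor implements Interceptor {
    @Override public Response intercept(Interceptor.Chain chain) throws IOException {
        Request request = chain.request();
        String cookieHeader = request.header("Cookie");
        if (cookieHeader != null && cookieHeader.contains("cpPosIndex=")) {
            StringBuilder kept = new StringBuilder();
            for (String pair : cookieHeader.split("; ")) {
                if (!pair.startsWith("cpPosIndex=")) {
                    if (kept.length() > 0) {
                        kept.append("; ");
                    }
                    kept.append(pair);
                }
            }
            Request.Builder builder = request.newBuilder();
            if (kept.length() == 0) {
                builder.removeHeader("Cookie");
            } else {
                builder.header("Cookie", kept.toString());
            }
            request = builder.build();
        }
        return chain.proceed(request);
    }
}
```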

@Krinkle Can you suggest whom else to tag here who might be more familiar with how this cookie works and/or how we should use it more properly from apps?

Krinkle added a subscriber: aaron.

I've cc-ed @aaron, who might know whether the issue behind T182322, or something similar, may have resurfaced. It's also possible that it is one of the issues recently fixed in master. If so, then @aaron might know which one to backport for this particular task.

@Dbrant Could you meanwhile document an example of steps to reproduce this when starting with a fresh cookie jar? E.g. which API request(s) to make, and which cpPosIndex cookie was set by each request. That might help narrow down the issue.

Unfortunately at the moment I can't seem to reproduce a case where cpPosIndex gets set at all (starting from an empty cookie jar), and in general this condition has been very sporadic, occurring on only one of my devices, and only a few times. However, when I was observing the latency earlier, I was making the following request:

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&redirects=&converttitles=&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pilicense=any&generator=prefixsearch&gpsnamespace=0&list=search&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&gpslimit=20&pithumbsize=320&gpssearch=a&srsearch=a

with the following cookies:

forceHTTPS=true
cpPosIndex=1
UseDC=master
WMF-Last-Access=10-May-2018
UseCDNCache=false
CP=H2
WMF-Last-Access-Global=10-May-2018

...but I didn't have a record of which request had initially set the cpPosIndex cookie.
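If it helps anyone reproduce this, a replay along these lines (sketch only; the URL is abbreviated from the full request above, and the cookies are a subset of the values captured on the affected device) should show whether the response stalls for ~5 seconds:

```java
// Sketch: replay the request with the captured cookies and time the round trip.
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public final class ReplaySearchRequest {
    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/w/api.php?action=query&format=json"
                + "&formatversion=2&generator=prefixsearch&gpssearch=a&gpslimit=20";
        Request request = new Request.Builder()
                .url(url)
                // cpPosIndex is the suspect; the other cookies are included for fidelity.
                .header("Cookie", "forceHTTPS=true; cpPosIndex=1; UseDC=master; UseCDNCache=false")
                .build();
        long start = System.nanoTime();
        try (Response response = new OkHttpClient().newCall(request).execute()) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("HTTP " + response.code() + " in " + elapsedMs + " ms");
        }
    }
}
```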

Change 432541 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Give ChronologyProtector cookies a lower TTL

https://gerrit.wikimedia.org/r/432541

@Dbrant
Do you think this is likely Android-specific, or are there changes we should consider on iOS?

@Mhurd Although I'm still not fully clear about the cookie's function, it seems to be independent of the requesting user-agent, so I imagine it's possible that iOS would be affected by it, too. Have any of you seen a sudden latency of 5+ seconds? Based on our eventlogging, our search latency shot up around April 27 (if that helps the Performance team, cc @aaron):

Untitled5.png (374×651 px, 32 KB)

In any case, it sounds like this is being addressed on the server side, so we'll see if it resolves itself once the change is deployed.

Change 433080 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] [WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433080

Change 433080 abandoned by Dbrant:
[WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433080

Change 433081 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] [WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433081

Per the above patch, I've also uncovered a condition in our app where it can potentially send expired cookies. Since the lifetime of the cpPosIndex cookie seems to be quite short, this is especially applicable to that cookie, and may in fact be the root cause here...
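In the abstract, the kind of check that's needed looks something like this (a simplified sketch, not the actual patch linked above):

```java
// Sketch: never hand back cookies whose expiry has already passed.
import java.util.ArrayList;
import java.util.List;
import okhttp3.Cookie;
import okhttp3.CookieJar;
import okhttp3.HttpUrl;

public final class ExpiryAwareCookieJar implements CookieJar {
    private final CookieJar delegate;

    public ExpiryAwareCookieJar(CookieJar delegate) {
        this.delegate = delegate;
    }

    @Override public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
        delegate.saveFromResponse(url, cookies);
    }

    @Override public List<Cookie> loadForRequest(HttpUrl url) {
        long now = System.currentTimeMillis();
        List<Cookie> fresh = new ArrayList<>();
        for (Cookie cookie : delegate.loadForRequest(url)) {
            // A short-lived cookie like cpPosIndex must not be sent past its expiry.
            if (cookie.expiresAt() > now) {
                fresh.add(cookie);
            }
        }
        return fresh;
    }
}
```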

Per the above patch, I've also uncovered a condition in our app where it can potentially send expired cookies. Since the lifetime of the cpPosIndex cookie seems to be quite short, this is especially applicable to that cookie, and may in fact be the root cause here...

Sending the cookie when it should have expired is very bad in this case, as it will cause blocking delays of POS_STORE_WAIT_TIMEOUT, which is 5 seconds.

Well, that settles it, then! Thanks @aaron for confirming, and sorry for the confusion.
It looks like this faulty logic has been part of the app since the very beginning, but only now is it being brought to light. :(

It looks like this faulty logic has been part of the app since the very beginning

That might be so, but the performance issues only started two weeks ago:

restbase-rawquery.png (407×1 px, 193 KB)

so it is almost certainly somehow triggered by rMW628a3a9b2676: rdbms: make sure cpPosIndex cookie is applied to LBFactory in time. Given that many bots mishandle cookies, if it's not too hard to undo whatever that patch did to affect cookie handling, it would be worthwhile IMO.

Note also that while 90% of the slow requests (for ReadingLists, since that's what's easy to measure) come from Android, 10% come from iOS, so something weird might be going on there as well:

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select user_agent_map['wmf_app_version'] app_version, count(*) cnt
   from webrequest
   where year = 2018 and month = 5 and day = 10
     and uri_host = 'en.wikipedia.org'
     and uri_path like '/api/rest_v1/data/lists%'
     and time_firstbyte > 5
   group by user_agent_map['wmf_app_version'];

app_version                cnt
-                          4
5.8.0.1393                 627
2.7.228-beta-2018-03-26    8
2.7.232-alpha-2018-05-10   1
2.7.227-beta-2018-03-23    3
2.7.232-dev-2018-05-09     1
2.7.232-dev-2018-05-10     1
2.7.232-beta-2018-04-17    481
2.7.232-r-2018-04-17       6671
2.7.230-beta-2018-03-29    9
2.7.231-beta-2018-04-05    14
5.8.1.1403                 11
2.7.231-r-2018-04-09       37
r/2.7.232-r-2018-04-17     9
2.7.226-beta-2018-03-06    4
r/2.7.231-r-2018-04-09     1
2.7.232-alpha-2018-04-24   1
17 rows selected (205.03 seconds)

Agreed that if it's possible to mitigate this on the server side, that would be a big help. But also note that I myself started experiencing the latency symptoms as early as March (when this task was created), and we've had user complaints about latency earlier than that. iOS should probably double-check their cookie handling code as well, and check whether they're passing cookies that might be expired.

Change 433081 merged by jenkins-bot:
[apps/android/wikipedia@master] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433081

Change 432541 merged by jenkins-bot:
[mediawiki/core@master] Avoid unnecessary WaitConditionLoop delays in ChronologyProtector

https://gerrit.wikimedia.org/r/432541

Change 435122 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

Change 435122 merged by jenkins-bot:
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

reading-list-service.png (500×1 px, 94 KB)

Seems to have gotten better after https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/440031/ was deployed, and likewise for logstash (+"ChronologyProtector::initPositions").

I suggest that cookie set/receive round-tripping be tested in these apps for encoding/truncation issues with @ or #, as well as for letter-case changes and the like. The above patch simply discards cpPosIndex cookie headers that are botched.
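For the apps, a round-trip test along these lines would cover the basic case (hypothetical sketch; it exercises OkHttp's Cookie parsing/serialization, and the value is only illustrative of the @/# shape, not a real cpPosIndex value):

```java
// Hypothetical test sketch: verify cookie values containing '@' or '#' survive
// a serialize/parse round trip without truncation or case changes.
import static org.junit.Assert.assertEquals;

import okhttp3.Cookie;
import okhttp3.HttpUrl;
import org.junit.Test;

public class CookieRoundTripTest {
    @Test public void specialCharactersSurviveRoundTrip() {
        HttpUrl url = HttpUrl.parse("https://en.wikipedia.org/");
        String value = "5@master#abc123";  // illustrative only
        Cookie original = Cookie.parse(url, "cpPosIndex=" + value + "; Path=/");
        // Serialize the way it would be written to storage, then parse it back.
        Cookie restored = Cookie.parse(url, original.toString());
        assertEquals(value, restored.value());
        assertEquals(original.name(), restored.name());
    }
}
```

The same shape of test could be pointed at the app's own cookie persistence layer instead of OkHttp's parser.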

Looks like something related to this ticket might be reproducible in my MediaWiki docker dev setup: https://github.com/addshore/mediawiki-docker-dev/issues/57#issuecomment-396919300