
5-second latency for certain API calls?
Closed, ResolvedPublic

Description

I've observed serious latency on some of our API calls, and several of our users seem to be reporting the same thing. We need to track this down and fix it right away.

Here are some observations and theories:

  • This seems to be somewhat device-specific. It happens on my real device (Nexus 6P, API 26) but not on my emulators or on my other real device (Galaxy S7, API 25).
  • It seems to be specific to certain MediaWiki API calls, but not all of them. For example, the search API (action=query&generator=prefixsearch) is affected, but others are not.
  • The latency, when it happens, is almost exactly 5 seconds (or slightly more), which is suspicious. It suggests that something on the client side is deliberately blocking the request for 5 seconds, or something on the server side is deliberately waiting 5 seconds before processing it (see the timing sketch below).
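One way to confirm the delay from the app side is a simple timing interceptor along these lines (a debug-only sketch, not part of the app; the class and log tag names are just illustrative):

```java
// Debug-only sketch: log how long each request takes end to end,
// so the ~5-second stalls show up clearly in logcat.
import android.util.Log;
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Response;

public final class RequestTimingInterceptor implements Interceptor {
    @Override public Response intercept(Interceptor.Chain chain) throws IOException {
        long start = System.nanoTime();
        Response response = chain.proceed(chain.request());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        Log.d("RequestTiming", chain.request().url().encodedPath() + " took " + elapsedMs + " ms");
        return response;
    }
}
```

Registering it with OkHttpClient.Builder().addInterceptor(new RequestTimingInterceptor()) is enough to see which calls are affected.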

Event Timeline

Dbrant triaged this task as High priority. Mar 19 2018, 7:11 PM

I think I might have a meaningful lead here. On my device (the only one on which I can reproduce the latency), the 5-second delay disappears when I disable our CookieJar in the OkHttp client. It looks like something in the cookie-handling logic is delaying requests by 5 seconds, and I'm now digging further into it.
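For anyone who wants to try the same experiment, here's a minimal sketch of what I mean by disabling the cookie jar (not our actual client wiring; the factory method and parameter names are just illustrative):

```java
// Debug sketch: swapping in CookieJar.NO_COOKIES bypasses cookie handling entirely,
// which is enough to tell whether the cookie path is what introduces the delay.
import okhttp3.CookieJar;
import okhttp3.OkHttpClient;

public final class DebugClientFactory {
    public static OkHttpClient newClient(CookieJar normalCookieJar, boolean bypassCookies) {
        return new OkHttpClient.Builder()
                // With bypassCookies=true, no cookies are stored or sent; requests go out "clean".
                .cookieJar(bypassCookies ? CookieJar.NO_COOKIES : normalCookieJar)
                .build();
    }
}
```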

Almost as soon as @Krinkle subscribed to this task, I came across T182322, which seems very promising in explaining the latency.
I also found that if we suppress the cpPosIndex cookie from being passed back to the server, the latency disappears.
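One way to do that for testing is a network interceptor that rewrites the outgoing Cookie header (debug-only sketch; it has to be registered via addNetworkInterceptor so it runs after OkHttp's cookie jar has attached the header):

```java
// Debug-only sketch: drop the cpPosIndex pair from the outgoing Cookie header.
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

public final class StripCpPosIndexInterceptor implements Interceptor {
    @Override public Response intercept(Interceptor.Chain chain) throws IOException {
        Request request = chain.request();
        String cookieHeader = request.header("Cookie");
        if (cookieHeader != null && cookieHeader.contains("cpPosIndex=")) {
            StringBuilder kept = new StringBuilder();
            for (String pair : cookieHeader.split("; ")) {
                if (!pair.startsWith("cpPosIndex=")) {
                    if (kept.length() > 0) {
                        kept.append("; ");
                    }
                    kept.append(pair);
                }
            }
            Request.Builder builder = request.newBuilder();
            if (kept.length() == 0) {
                builder.removeHeader("Cookie");
            } else {
                builder.header("Cookie", kept.toString());
            }
            request = builder.build();
        }
        return chain.proceed(request);
    }
}
```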

@Krinkle Can you suggest whom else to tag here who might be more familiar with how this cookie works and/or how we should use it more properly from apps?

Krinkle added a subscriber: aaron.

I've cc-ed @aaron, who might know whether the issue behind T182322, or something similar, may have resurfaced. It's also possible that it is one of the issues recently fixed in master. If so, then @aaron might know which one to backport for this particular task.

@Dbrant Could you meanwhile document an example of steps to reproduce this when starting with a fresh cookie jar? E.g. which API request(s) to make, and which cpPosIndex cookie was set by each request. That might help narrow down the issue.

Unfortunately at the moment I can't seem to reproduce a case where cpPosIndex gets set at all (starting from an empty cookie jar), and in general this condition has been very sporadic, occurring on only one of my devices, and only a few times. However, when I was observing the latency earlier, I was making the following request:

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&redirects=&converttitles=&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pilicense=any&generator=prefixsearch&gpsnamespace=0&list=search&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&gpslimit=20&pithumbsize=320&gpssearch=a&srsearch=a

with the following cookies:

forceHTTPS=true
cpPosIndex=1
UseDC=master
WMF-Last-Access=10-May-2018
UseCDNCache=false
CP=H2
WMF-Last-Access-Global=10-May-2018

...but I didn't have a record of which request had initially set the cpPosIndex cookie.
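If it helps anyone reproduce this, a replay along these lines (sketch only; the URL is abbreviated from the full request above, and the cookies are a subset of the values captured on the affected device) should show whether the response stalls for ~5 seconds:

```java
// Sketch: replay the request with the captured cookies and time the round trip.
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public final class ReplaySearchRequest {
    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/w/api.php?action=query&format=json"
                + "&formatversion=2&generator=prefixsearch&gpssearch=a&gpslimit=20";
        Request request = new Request.Builder()
                .url(url)
                // cpPosIndex is the suspect; the other cookies are included for fidelity.
                .header("Cookie", "forceHTTPS=true; cpPosIndex=1; UseDC=master; UseCDNCache=false")
                .build();
        long start = System.nanoTime();
        try (Response response = new OkHttpClient().newCall(request).execute()) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("HTTP " + response.code() + " in " + elapsedMs + " ms");
        }
    }
}
```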

Change 432541 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Give ChronologyProtector cookies a lower TTL

https://gerrit.wikimedia.org/r/432541

@Dbrant
Do you think this is likely Android-specific, or are there changes we should consider on iOS?

@Mhurd Although I'm still not fully clear about the cookie's function, it seems to be independent of the requesting user-agent, so I imagine it's possible that iOS would be affected by it, too. Have any of you seen a sudden latency of 5+ seconds? Based on our eventlogging, our search latency shot up around April 27 (if that helps the Performance team, cc @aaron):

Untitled5.png (374×651 px, 32 KB)

In any case, it sounds like this is being addressed on the server side, so we'll see if it resolves itself once the change is deployed.

Change 433080 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] [WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433080

Change 433080 abandoned by Dbrant:
[WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433080

Change 433081 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] [WIP] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433081

Per the above patch, I've also uncovered a condition in our app where it can potentially send expired cookies. Since the lifetime of the cpPosIndex cookie seems to be quite short, this is especially applicable to that cookie, and may in fact be the root cause here...
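In the abstract, the kind of check that's needed looks something like this (a simplified sketch, not the actual patch linked above):

```java
// Sketch: never hand back cookies whose expiry has already passed.
import java.util.ArrayList;
import java.util.List;
import okhttp3.Cookie;
import okhttp3.CookieJar;
import okhttp3.HttpUrl;

public final class ExpiryAwareCookieJar implements CookieJar {
    private final CookieJar delegate;

    public ExpiryAwareCookieJar(CookieJar delegate) {
        this.delegate = delegate;
    }

    @Override public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
        delegate.saveFromResponse(url, cookies);
    }

    @Override public List<Cookie> loadForRequest(HttpUrl url) {
        long now = System.currentTimeMillis();
        List<Cookie> fresh = new ArrayList<>();
        for (Cookie cookie : delegate.loadForRequest(url)) {
            // A short-lived cookie like cpPosIndex must not be sent past its expiry.
            if (cookie.expiresAt() > now) {
                fresh.add(cookie);
            }
        }
        return fresh;
    }
}
```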

Per the above patch, I've also uncovered a condition in our app where it can potentially send expired cookies. Since the lifetime of the cpPosIndex cookie seems to be quite short, this is especially applicable to that cookie, and may in fact be the root cause here...

Sending the cookie when it should have expired is very bad in this case, as it will cause blocking delays of POS_STORE_WAIT_TIMEOUT, which is 5 seconds.

Well, that settles it, then! Thanks @aaron for confirming, and sorry for the confusion.
It looks like this faulty logic has been part of the app since the very beginning, but only now is it being brought to light. :(

It looks like this faulty logic has been part of the app since the very beginning

That might be so, but the performance issues only started two weeks ago:

restbase-rawquery.png (407×1 px, 193 KB)

so it is almost certainly somehow triggered by rMW628a3a9b2676: rdbms: make sure cpPosIndex cookie is applied to LBFactory in time. Given that many bots mishandle cookies, if it's not too hard to undo whatever that patch did to affect cookie handling, it would be worthwhile IMO.

Note also that while 90% of the slow requests (for ReadingLists, since that's what's easy to measure) come from Android, 10% come from iOS, so something weird might be going on there as well:

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select user_agent_map['wmf_app_version'] app_version, count(*) cnt
   from webrequest
   where year = 2018 and month = 5 and day = 10
     and uri_host = 'en.wikipedia.org'
     and uri_path like '/api/rest_v1/data/lists%'
     and time_firstbyte > 5
   group by user_agent_map['wmf_app_version'];

app_version                cnt
-                          4
5.8.0.1393                 627
2.7.228-beta-2018-03-26    8
2.7.232-alpha-2018-05-10   1
2.7.227-beta-2018-03-23    3
2.7.232-dev-2018-05-09     1
2.7.232-dev-2018-05-10     1
2.7.232-beta-2018-04-17    481
2.7.232-r-2018-04-17       6671
2.7.230-beta-2018-03-29    9
2.7.231-beta-2018-04-05    14
5.8.1.1403                 11
2.7.231-r-2018-04-09       37
r/2.7.232-r-2018-04-17     9
2.7.226-beta-2018-03-06    4
r/2.7.231-r-2018-04-09     1
2.7.232-alpha-2018-04-24   1
17 rows selected (205.03 seconds)

Agreed that if it's possible to mitigate this on the server side, that would be a big help. But also note that I myself started experiencing the latency symptoms as early as March (when this task was created), and we've had user complaints about latency earlier than that. iOS should probably double-check their cookie handling code as well, and check whether they're passing cookies that might be expired.

Change 433081 merged by jenkins-bot:
[apps/android/wikipedia@master] Rework/improve handling and persistence of cookies.

https://gerrit.wikimedia.org/r/433081

Change 432541 merged by jenkins-bot:
[mediawiki/core@master] Avoid unnecessary WaitConditionLoop delays in ChronologyProtector

https://gerrit.wikimedia.org/r/432541

Change 435122 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

Change 435122 merged by jenkins-bot:
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

reading-list-service.png (500×1 px, 94 KB)

Seems to have gotten better after https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/440031/ was deployed, and likewise for logstash (+"ChronologyProtector::initPositions").

I suggest that cookie set/receive round-tripping be tested in these apps for encoding/truncation issues with @ or #, as well as for letter-case changes and the like. The above patch simply discards cpPosIndex cookie headers that are botched.
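For the apps, a round-trip test along these lines would cover the basic case (hypothetical sketch; it exercises OkHttp's Cookie parsing/serialization, and the value is only illustrative of the @/# shape, not a real cpPosIndex value):

```java
// Hypothetical test sketch: verify cookie values containing '@' or '#' survive
// a serialize/parse round trip without truncation or case changes.
import static org.junit.Assert.assertEquals;

import okhttp3.Cookie;
import okhttp3.HttpUrl;
import org.junit.Test;

public class CookieRoundTripTest {
    @Test public void specialCharactersSurviveRoundTrip() {
        HttpUrl url = HttpUrl.parse("https://en.wikipedia.org/");
        String value = "5@master#abc123";  // illustrative only
        Cookie original = Cookie.parse(url, "cpPosIndex=" + value + "; Path=/");
        // Serialize the way it would be written to storage, then parse it back.
        Cookie restored = Cookie.parse(url, original.toString());
        assertEquals(value, restored.value());
        assertEquals(original.name(), restored.name());
    }
}
```

The same shape of test could be pointed at the app's own cookie persistence layer instead of OkHttp's parser.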

Looks like something related to this ticket might be reproducible in my MediaWiki docker dev setup: https://github.com/addshore/mediawiki-docker-dev/issues/57#issuecomment-396919300