Page MenuHomePhabricator

Do not fire/abort AJAX request when it is not necessary
Closed, ResolvedPublic3 Estimated Story Points

Description

The PagePreviews will fire AJAX call every time PagePreviews switch to LINK_DWELL state (more information: https://phabricator.wikimedia.org/T193055#4277206)

Yes, there is always a value in aborting requests:

  • from the user side - fewer data sent to the user
  • from the service side - there is some value, but it will not make any considerable difference. 99% of summary requests are stored in varnish and never reach restbase. But for remaining 1% aborting request will reduce the server load (first restbase checks Cassandra cache, and if the summary is not present it calls MCS).

We should prevent unnecessary AJAX calls to the service, and additionally, if it's possible - code should abort the ongoing requests. When there is a new interaction code should:

  • clear the timed AJAX request
  • if there is ongoing AJAX request it should abort it

This change should include a unit test.

QA

  1. Visit a desktop page in Chrome with many links such as https://readers-web-master.wmflabs.org/wiki/Orange_(colour)
  2. Open Chrome's DevTools and select the network tab
  3. Glide the mouse cursor over links at varying rates

AC

  1. For links that are super quickly passed over, you should observe that no network requests are made to https://en.wikipedia.org/api/rest_v1/page
  2. For links that are passed over quickly, you should observe that the network request is issued but then canceled; these will show in red in DevTools
  3. For all links that are lingered upon, page previews should operate as per usual: a network request is issued and then a preview is shown!

Developer Notes

  1. The current implementation does not store timedAction nor the request object in the state tree, we need to preserve those and on another LINK_DWELL we use these variables to clear the state.
  2. @phuedx: I expect this number to go down and these numbers to all increase.

Event Timeline

Jdlrobson updated the task description. (Show Details)

from the service side - there is some value, but it will not make any considerable difference. 99% of summary requests are stored in varnish and never reach restbase. But for remaining 1% aborting request will reduce the server load (first restbase checks Cassandra cache, and if the summary is not present it calls MCS).

There's nuance here:

The handling of requests that are aborted in-flight will vary between server software. Both Varnish and RESTBase (Node.js) will run the request handler code synchronously. I can't speak to Varnish but RESTBase likely won't stop processing if the request is received but later aborted during that processing – it'll likely not send a response instead.

I do agree with your summary and though I do expect a drop in server load I can only really be sure that the drop in bytes sent from the server will be obvious.

Niedzielski set the point value for this task to 3.

Change 442755 had a related patch set uploaded (by Niedzielski; owner: Stephen Niedzielski):
[mediawiki/extensions/Popups@master] Update: cancel unused HTTP requests in flight

https://gerrit.wikimedia.org/r/442755

Vvjjkkii renamed this task from Do not fire/abort AJAX request when it is not necessary to nnaaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii removed pmiazga as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from nnaaaaaaaa to Do not fire/abort AJAX request when it is not necessary.Jul 2 2018, 11:45 AM
CommunityTechBot assigned this task to pmiazga.
CommunityTechBot set the point value for this task to 3.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.

From a data perspective, and as far as I understand the implications here, this change should be fine. The main thing is that we would not want it to impact the EventLogging data about user actions sent to the VirtualPageView schema for the purpose of counting seen previews (and for the time being, Schema:Popups). But I assume that aborting those AJAX requests won't change states or other aspects relevant to the EL data, and of course in case of a seen preview the request will not have been aborted. The Grafana data on the other hand - which is generally geared more towards measuring internal technical processes than user behavior - will of course be impacted, but that's kind of the point of this change.

we would not want it to impact the EventLogging data about user actions sent to the VirtualPageView schema

@Tbayer, that is my understanding as well. I did some testing locally and on enwiki and since virtual pageviews are only logged for "seen" previews, this should have no impact on the number reported.

I wasn't quite clear on your point about Schema:Popups. However, after the proposed changes "dwelledButAbandoned" is still sent for the cases of abandoning before or during / after a network request.

we would not want it to impact the EventLogging data about user actions sent to the VirtualPageView schema

@Tbayer, that is my understanding as well. I did some testing locally and on enwiki and since virtual pageviews are only logged for "seen" previews, this should have no impact on the number reported.

I wasn't quite clear on your point about Schema:Popups. However, after the proposed changes "dwelledButAbandoned" is still sent for the cases of abandoning before or during / after a network request.

Sounds good - just wanted to make sure the change won't affect the data sent by Popups (or its underlying states and state transitions), in case we decide to make use of that instrumentation again.

IMHO we shouldn't count aborted requests as counter.PagePreviewsApiFailure. This will create a noise in our dashboards. This graph supposed to show an API errors (4xx/5xx responses) but now it will be populated with both errors and aborted HTTP calls which makes it worthless
/cc @phuedx

I believe the aborted XHRs have a status of 'abort' so they could be filtered out client side, if wanted.

...

I wasn't quite clear on your point about Schema:Popups. However, after the proposed changes "dwelledButAbandoned" is still sent for the cases of abandoning before or during / after a network request.

Sounds good - just wanted to make sure the change won't affect the data sent by Popups (or its underlying states and state transitions), in case we decide to make use of that instrumentation again.

PS: As Stephen pointed out in standup, that diagram is of course outdated in some details (see also the revision link there), I only brought it up here to remind us that these events also depend on the status changes that precede them.

IMHO we shouldn't count aborted requests as counter.PagePreviewsApiFailure. This will create a noise in our dashboards. This graph supposed to show an API errors (4xx/5xx responses) but now it will be populated with both errors and aborted HTTP calls which makes it worthless

I believe the aborted XHRs have a status of 'abort' so they could be filtered out client side, if wanted.

I actually think it'd be neat to leave the aborted API requests instrumented but count them separately to get a sense of how frequently in-flight requests are aborted.

I think I mentioned this during standup but I'll reiterate here too: You can define a metric in Grafana that corresponds to all subkeys of a key from the data source (here subkeys are delineated by a period). So if we were to construct the counter's key using the type of the failure, e.g. PagePreviewsApiFailure.abort and PagePreviewsApiFailure.generic, we can plot both on the same graph using a single Grafana metric (the metric would be PagePreviewsApiFailure.* in the previous example).

Change 442755 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Update: cancel unused HTTP requests in flight

https://gerrit.wikimedia.org/r/442755

Niedzielski added a subscriber: ABorbaWMF.

@pmiazga done. @ABorbaWMF, over to you! Please see QA notes in the description.

@phuedx: I expect this number to go down and these numbers to all increase.

It looks like this happened..

Per T197700#4421696.

Moreover, per https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?orgId=1&from=1531400400000&to=now&refresh=1m, since the change was deployed:

  • The number of successful API requests has dropped steadily from 400000-600000/min to ~60000/min;
  • Inline with the above, the "API utilization" (the number of successful API requests divided by the number of shown previews) has increased to ~68% from ~23%.

Edit

@phuedx It might be tracked elsewhere already, but it seems "API Request Failures" also increased on that dashboard.

If this is detected server-side then might be an unrelated issue, but if this comes from the client, then it's probably reporting false negative beacons about failures that are presumably not failures but intentional aborts. By default, these trigger the fail/catch handler in jQuery. They can be recognised through textStatus === "abort".

Thanks for the ping, @Krinkle. I didn't address the increase in the API Request Failures in my comment and I should have.

<snip> but if this comes from the client, then it's probably reporting false negative beacons about failures that are presumably not failures but intentional aborts.

You're exactly right and we're aware of this. @pmiazga has submitted follow-on change to fix this as part of T199482: Properly handle the networking errors and aborted requests. When that's deployed, I'll add an annotation to the graph explaining the temporary increase.

  • The number of successful API requests has dropped steadily from 400000-600000/min to ~60000/min;

This is consistent with the data in wmf.webrequest:

[0]
select
  hour,
  count(*) as n
from
  wmf.webrequest
where

  # Thursday, 12th July 2018
  year = 2018 and
  month = 7 and
  day = 12 and

  # Analyse data from 2 hours before the MediaWiki train - European version
  # deployment (11:00 UTC) and everything afterwards.
  hour >= 11 and

  webrequest_source = "text" and
  agent_type = "user" and

  uri_path like "%page/summary%"
group by
  hour
;

+-------+-----------+
| hour  |     n     |
+-------+-----------+
| 11    | 34244249  |
| 12    | 37586379  |
| 13    | 24453987  |
| 14    | 16143586  |
| 15    | 14758791  |
| 16    | 13530555  |
| 17    | 13086279  |
| 18    | 13232265  |
| 19    | 13493663  |
| 20    | 12993924  |
| 21    | 11276235  |
| 22    | 9195177   |
| 23    | 7965494   |
+-------+-----------+