Page MenuHomePhabricator

Do not log errors when they are actually client-initiated XHR cancellations
Closed, ResolvedPublic2 Story Points

Description

During data analysis following the deployment in T134778: Run Hovercards A/B test 1, I noticed that brief dwells, generally below 250 ms, are causing action = error events to be sent to the server.

select event_totalInteractionTime from Popups_15597282 where event_action = 'error' and wiki = 'huwiki' and timestamp > '2016060302' order by event_totalInteractionTime;

event_totalInteractionTime
53
54
55
56
<plenty more rows>
248
251
251
356

As a data analyst, I want these events to not be logged, so that I see less noise and so that I am confident that when error events are observed they are indeed error events.

Acceptance criteria

  • Client induced XHR cancellations from Hovercards ON users do not result in action = error events.

Event Timeline

dr0ptp4kt created this task.Jun 5 2016, 1:24 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 5 2016, 1:24 PM
dr0ptp4kt updated the task description. (Show Details)Jun 5 2016, 1:32 PM

For future reference, queries similar to the following could be useful for trying to identify legitimate server side errors.

Looking at same hour over multiple days for status code trends

select count(1), day, http_status from webrequest
where year = 2016
and (month = 6 or (month = 5 and day > 20))
and hour = 1
and normalized_host.project_class = 'wikipedia'
and access_method = 'desktop'
and uri_path rlike 'api.php$'
and uri_query rlike 'action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions&formatversion=2&redirects=true&exintro=true&exchars=525&explaintext=true&piprop=thumbnail&pithumbsize=300&rvprop=timestamp&titles='
and uri_query rlike '&smaxage=300&maxage=300&uselang=content$'
group by day, http_status;

Looking for empty responses indicative of errors

select http_status, min(response_size), max(response_size), avg(response_size),
percentile(cast(response_size as BIGINT), 0.05),
percentile(cast(response_size as BIGINT), 0.5),
percentile(cast(response_size as BIGINT), 0.95)
 from webrequest
where year = 2016 and month = 6 and day = 4 and hour = 1
and normalized_host.project_class = 'wikipedia'
and access_method = 'desktop'
and uri_path rlike 'api.php$'
and uri_query rlike 'action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions&formatversion=2&redirects=true&exintro=true&exchars=525&explaintext=true&piprop=thumbnail&pithumbsize=300&rvprop=timestamp&titles='
and uri_query rlike '&smaxage=300&maxage=300&uselang=content$'
group by http_status;
dr0ptp4kt renamed this task from Do not log errors when they are client XHR cancellations to Do not log errors when they are actually just client XHR cancellations.Jun 5 2016, 6:39 PM
dr0ptp4kt added a project: Page-Previews.
dr0ptp4kt triaged this task as High priority.Jun 6 2016, 4:45 PM
dr0ptp4kt set the point value for this task to 2.
dr0ptp4kt renamed this task from Do not log errors when they are actually just client XHR cancellations to Do not log errors when they are actually client-initiated XHR cancellations.Jun 21 2016, 5:36 PM
bmansurov moved this task from To Do to Doing on the Reading-Web-Sprint-76-Retrosprekfast board.

Change 297790 had a related patch set uploaded (by Bmansurov):
Do not log XHR cancellations

https://gerrit.wikimedia.org/r/297790

Change 297790 merged by jenkins-bot:
Do not log XHR cancellations

https://gerrit.wikimedia.org/r/297790

bmansurov removed bmansurov as the assignee of this task.Jul 8 2016, 5:06 AM
bmansurov added a subscriber: bmansurov.
phuedx closed this task as Resolved.Jul 8 2016, 9:34 AM
phuedx claimed this task.
phuedx added a subscriber: phuedx.

I've tested this on our staging server with and without @bmansurov's change applied and it LGTM.

@phuedx is there something @Tbayer or I can do to test this live? like apending to the url popups=enabled&el=true? This would allow us to get a more familiar understanding of what is triggering what and why we might be seeing some #strange.

@JKatzWMF: You'll be able to test it live on Thursday, 14th but you should be able to test it on our staging server or on the Beta Cluster.

… Previously, if you hovered over one link and then immediately over another, then an "error" event would be logged as well as a "dwelledButAbandoned" event because the API client was aborting a request, which is typically treated as an error.

… Previously, if you hovered over one link and then immediately over another, then an "error" event would be logged as well as a "dwelledButAbandoned" event because the API client was aborting a request, which is typically treated as an error.

Thanks for the explanation! That this can only happen in the test group (popups enabled) is consistent with the fact that currently almost all error events occur there.
However, the average number of events per link interaction is very close to 1 in the test group currently, so I'm left wondering if both of these events were being logged correctly.

@phuedx, when you say "previously" are you talking before this patch was introduced, or are you referring to the old adjacent links bug we fixed a while back?

phuedx added a comment.EditedJul 12 2016, 10:10 AM

@dr0ptp4kt: Before the change was merged.

Thanks, @phuedx. I take to this to mean that in the past:

if (~250ms or more elapsed && a preview card had not yet rendered && user moused out) {

it sent an error event plus a dwelledbutAbandoned

}

Whereas now, it should
if (~250ms or more elapse && a preview card has not yet been rendered && user moused out) {

it will send only a dwelledButAbandoned event

}

Do I have that right? And was it specifically the case of mousing away from one link to another link? Or did it also occur for mousing away to non-link real estate?

@dr0ptp4kt: As I understand it, it happened whenever the user moused away from a link while the API request was still in flight, i.e. it should not depend on the user mousing away to another link.

phuedx reopened this task as Open.Jul 13 2016, 9:11 AM

Reopening as there's still clarification required and likely more investigation as a result.

phuedx removed phuedx as the assignee of this task.Jul 15 2016, 8:58 AM

I'm having difficulty following the above conversation. Can someone update the description/create a new sub task if this has not been completed as requested?

What can I help clarify @Jdlrobson?

@phuedx The acceptance criteria " Client induced XHR cancellations from Hovercards ON users do not result in action = error events." is verifiable by me. If there are follow ups needed with regards to dwelledButAbandoned they seem unrelated and we should open a new task.

That said you reopened this saying there's still clarification required. It would help to restate this clarification and who you need it from.

dr0ptp4kt closed this task as Resolved.Jul 16 2016, 2:49 AM
dr0ptp4kt claimed this task.

This looks good. Marking as resolved.

@phuedx, your explanation made sense. Notice how the anonymous, Hovercards ON dwelledButAbandoned event counts were similar on 20160708 and 20160715 whereas error counts fell as one would expect from the task & patch.

select
left(timestamp,8) ts,
event_isAnon,
event_popupEnabled,
event_action,
event_totalInteractionTime >= 250 as twofifty_and_up,
count(*)
from Popups_15597282
where
wiki = 'huwiki'
and (
(timestamp > '20160708'
and timestamp < '20160709')
or
(timestamp > '20160715'
and timestamp < '20160716')
)
group by
ts,
event_isAnon,
event_popupEnabled,
event_action,
twofifty_and_up;
tsevent_isAnonevent_popupEnabledevent_actiontwofifty_and_upcount(*)
2016070800dwelledButAbandoned1369
2016070800opened in new tab03
2016070800opened in new tab13
2016070800opened in same tab043
2016070800opened in same tab164
2016070800pageLoadedNULL431
2016070801dismissed01
2016070801dismissed1243
2016070801dwelledButAbandoned1186
2016070801error0266
2016070801opened in new tab04
2016070801opened in new tab14
2016070801opened in same tab058
2016070801opened in same tab1158
2016070801pageLoadedNULL758
2016070810dismissed15
2016070810dwelledButAbandoned125289
2016070810error01
2016070810opened in new tab0201
2016070810opened in new tab1368
2016070810opened in same tab04233
2016070810opened in same tab16356
2016070810pageLoadedNULL23689
2016070811dismissed061
2016070811dismissed110703
2016070811dwelledButAbandoned16144
2016070811error011952
2016070811error143
2016070811opened in new tab026
2016070811opened in new tab1204
2016070811opened in same tab0308
2016070811opened in same tab13075
2016070811pageLoadedNULL21084
2016070811tapped settings cog12
2016071500dwelledButAbandoned1817
2016071500opened in same tab0240
2016071500opened in same tab1335
2016071500pageLoadedNULL1409
2016071501dismissed199
2016071501dwelledButAbandoned1136
2016071501opened in same tab022
2016071501opened in same tab173
2016071501pageLoadedNULL770
2016071510dwelledButAbandoned126337
2016071510opened in new tab0353
2016071510opened in new tab1546
2016071510opened in same tab04501
2016071510opened in same tab16489
2016071510pageLoadedNULL24346
2016071511dismissed047
2016071511dismissed112629
2016071511dwelledButAbandoned16781
2016071511error016
2016071511opened in new tab011
2016071511opened in new tab1226
2016071511opened in same tab0243
2016071511opened in same tab13614
2016071511pageLoadedNULL23158
2016071511tapped settings cog14