Investigate missing dialog close events
Open, Needs Triage, Public

Description

While verifying our open and close metrics, we found that about 17% of close events are systematically missing for the transclusion feature (template dialog) when comparing window-open-* events with dialog-* events. See the comments for the queries.

We would like to know what workflow or client condition leads to this outcome, and whether we should count these as successful, failed, or unknown interactions.

Event Timeline

awight edited projects, added VisualEditor; removed WMDE-QWERTY-Sprint-2021-01-06.
select
  count(1),
  event.action
from VisualEditorFeatureUse
where
  year >= 2020
  and event.feature="transclusion"
  and event.action like "dialog-%"
group by event.action;

count     action
364618  dialog-done
141176  dialog-abort
2 rows selected (209.545 seconds)

This disproves my theory: it shows that we were not in fact missing events. I confirmed this by inserting a new template locally and seeing that a dialog-done event is sent.

Counting the raw opens to compare with the raw closes:

select
  count(1),
  event.action
from VisualEditorFeatureUse
where
  year >= 2020
  and event.feature="transclusion"
  and event.action like "window-open-%"
group by event.action;

count     action
29637   window-open-from-sequence
79505   window-open-from-command
478230  window-open-from-context
21212   window-open-from-tool
4 rows selected (111.365 seconds)

That makes 608584 opens and 505794 closes since Jan 2020. So yes, we receive 17% fewer close events than open events. Either some of the opens are being double-counted, or there is a common workflow that results in no close event. Unfortunately, I have no clue yet what that is. It makes a difference to our statistics, because that group might be either successful or failing dialog users and we don't know where to assign the difference. Recommendation: let's keep the success proportion as-is, "done / (abort + done)", which at least gives us a stable number that can be compared over time. We should also decide whether we want to investigate the difference in follow-up work.
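For anyone who wants to re-check the arithmetic, here is a minimal Python sketch that recomputes the totals, the share of opens without a close, and the proposed "done / (abort + done)" proportion. The counts are copied from the two query results above; the variable names are only illustrative.

```python
# Counts copied from the two query results above (events since 2020).
closes = {"dialog-done": 364618, "dialog-abort": 141176}
opens = {
    "window-open-from-sequence": 29637,
    "window-open-from-command": 79505,
    "window-open-from-context": 478230,
    "window-open-from-tool": 21212,
}

total_opens = sum(opens.values())    # 608584
total_closes = sum(closes.values())  # 505794
missing = total_opens - total_closes

# Share of open events that have no matching close event.
missing_share = missing / total_opens

# Proposed stable metric: done / (abort + done).
success_proportion = closes["dialog-done"] / total_closes

print(f"opens={total_opens} closes={total_closes} missing={missing}")
print(f"missing share: {missing_share:.1%}")
print(f"success proportion: {success_proportion:.1%}")
```

The missing share comes out just under 17%, which matches the figure quoted in the task description.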

I'm converting this task into an investigation, and removing from the sprint.

awight renamed this task from Collect and aggregate missing dialog close events to Investigate missing dialog close events. Jan 14 2021, 11:06 AM
awight updated the task description.

We checked whether dialog-remove or dialog-insert events might explain the missing closes, but they never appear for feature transclusion.

@awight – we're glad you flagged this. Do you have a sense for when you're going to start an analysis that will depend on these events?

We're trying to figure out when we should prioritize investigating this.

It seems to be quite constant over time, so I feel like we can treat it as a known mystery for now. It adds a huge error to the absolute workflow success rate, but we care more about relative changes in the success rate.
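To put a rough size on that error, here is a small Python sketch that bounds the absolute success rate using the transclusion counts from the earlier queries. It assumes the two extremes, where every unpaired open is either a failure or a success. This is only an illustration of the uncertainty, not a proposed metric.

```python
# Transclusion counts from the earlier query results (events since 2020).
done, abort, opens = 364618, 141176, 608584
unpaired = opens - (done + abort)  # opens with no close event

# Reported proportion: ignores the unpaired opens entirely.
reported = done / (done + abort)

# Assumption: every unpaired open was a failed interaction.
lower = done / opens

# Assumption: every unpaired open was a successful interaction.
upper = (done + unpaired) / opens

print(f"reported: {reported:.3f}")
print(f"bounds on the absolute rate: [{lower:.3f}, {upper:.3f}]")
```

The reported value sits between the two bounds. As long as the share of unpaired opens stays roughly constant, the reported proportion should still be comparable over time.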

This seems to affect not just the transclusion dialog, but all of the dialogs.

Query for https://superset.wikimedia.org/superset/sqllab to compare all dialog types: [note that there are some things in these results that are not normal dialogs, please ignore those]

select 
    *, 
    case when opened != 0 then 1.000*(opened-closed)/opened end as "missing close events" 
from (
select
    event.feature,
    sum(case when event.action like 'window-open-%' then 1 else 0 end) as "opened",
    sum(case when event.action like 'dialog-%' then 1 else 0 end) as "closed"
from visualeditorfeatureuse
group by event.feature
) t
order by "missing close events" desc

The transclusion dialog is indeed missing 17% of close events, but also the gallery dialog is missing 14%, the media dialog is missing 13%, and so on.

One possible explanation would be that 17% of users who open the transclusion dialog immediately recoil in horror and close their browser (or more charitably, maybe they just wanted to see how it's done but did not want to edit it). That would result in an open event but no close event.

Another explanation would be that we have an issue somewhere in the dialog closing code that crashes in such a way that the dialog closes but logging does not happen. I had a quick look and don't see anything obvious.


I think we should try to piece together complete sessions from these events, and see what events (if any) happen after the "unpaired" open events.

Getting out without a dialog-whatever event requires that the dialog be closed in such a way that the getTeardownProcess method is never called on it.

My immediate hypothesis would be the "they closed the browser window" (or otherwise navigated away) case.

At least in theory, even in the "they closed the window in horror" case, there should be an abort event in EditAttemptStep for that sessionid. (EventLogging is using the appropriate beacon API calls that should still result in an event being sent even then.)
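A rough Python sketch of that session analysis follows. It assumes the events for one feature have already been exported as (session_id, timestamp, action) tuples, with EditAttemptStep actions merged into the same stream. That layout, the function name, and the sample data are all hypothetical. An open counts as paired if a dialog-* event arrives before the next open in the same session; for each unpaired open we tally the event that comes right after it.

```python
from collections import Counter, defaultdict


def next_action_after_unpaired_opens(events):
    """Tally the event that directly follows each unpaired open.

    events: iterable of (session_id, timestamp, action) tuples
    (a hypothetical export format, not the actual table layout).
    """
    by_session = defaultdict(list)
    for session_id, timestamp, action in events:
        by_session[session_id].append((timestamp, action))

    followers = Counter()
    for session_events in by_session.values():
        session_events.sort()  # order each session by timestamp
        actions = [action for _, action in session_events]
        for i, action in enumerate(actions):
            if not action.startswith("window-open-"):
                continue
            rest = actions[i + 1:]
            paired = False
            for later in rest:
                if later.startswith("dialog-"):
                    paired = True  # a close arrived before any other open
                    break
                if later.startswith("window-open-"):
                    break  # another open arrived first
            if not paired:
                followers[rest[0] if rest else "(end of session)"] += 1
    return followers


# Made-up sample data, for illustration only.
sample = [
    ("s1", 1, "window-open-from-context"),
    ("s1", 2, "dialog-done"),
    ("s2", 1, "window-open-from-tool"),
    ("s2", 2, "abort"),  # stands in for an EditAttemptStep abort
    ("s3", 1, "window-open-from-command"),
]
result = next_action_after_unpaired_opens(sample)
print(result)
```

If most unpaired opens are followed by an abort, that would support the "closed the window" hypothesis. A large "(end of session)" bucket would suggest that no further event was ever sent for those sessions.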

Almost all of the missing close events happen on mobile.

select
    *,
    case when opened != 0 then 1.000*(opened-closed)/opened end as "missing close events",
    case when "opened-phone" != 0 then 1.000*("opened-phone"-"closed-phone")/"opened-phone" end as "missing close events-phone",
    case when "opened-desktop" != 0 then 1.000*("opened-desktop"-"closed-desktop")/"opened-desktop" end as "missing close events-desktop"
from (
select
    event.feature,
    sum(case when event.action like 'window-open-%' then 1 else 0 end) as "opened",
    sum(case when event.action like 'dialog-%' then 1 else 0 end) as "closed",
    sum(case when event.platform='phone' and event.action like 'window-open-%' then 1 else 0 end) as "opened-phone",
    sum(case when event.platform='phone' and event.action like 'dialog-%' then 1 else 0 end) as "closed-phone",
    sum(case when event.platform='desktop' and event.action like 'window-open-%' then 1 else 0 end) as "opened-desktop",
    sum(case when event.platform='desktop' and event.action like 'dialog-%' then 1 else 0 end) as "closed-desktop"
from visualeditorfeatureuse
group by event.feature
) t
order by "missing close events" desc

Mobile is the most likely situation for us not getting that final abort if they abandon the tab: if they leave the browser app or tab, it can get pruned for memory conservation without ever getting to run events again. (But we should still get it if they're e.g. reloading the page.)

I have theories about good/bad potential reasons for this:

Good: people are opening up dialogs on mobile to extract parameters so they can paste them into the same dialog in another tab, and are then abandoning the page they opened to check because mobile encourages not closing tabs.

Bad: there might be a way to wedge a dialog into an un-closeable state on mobile.

Medium: there might be UX confusion on mobile. If you try to swipe-back intending to leave a dialog, it actually navigates back and closes the editor entirely.

...even if that last one isn't the cause of this, I think it's probably an issue and we should intercept those swipes to redirect to "close" when there's a full-screen dialog up.

Probably related to T263470.