
QuickSurveys EventLogging missing ~10% of interactions
Closed, Resolved | Public | Spike

Description

Problem

We have run two external QuickSurveys recently (T217080 and T217576) and for both surveys, about 10% of the responses recorded (i.e. surveys filled out via Qualtrics and Google Forms) do not have associated EventLogging data from QuickSurveyInitiation or QuickSurveysResponses. Some of the responses missing QuickSurveysResponses EL do have associated QuickSurveyInitiation EL.

What have we ruled out:

  • Issue with the queries we are using to gather the EL data. The surveys missing EL data are scattered through the whole survey period so it is not related to incorrectly filtering the data.
  • Issues w/ the URL-encoded EL being too long. I only saw six total instances in the event.eventerror table related to the surveys.
  • Respondents tampering with the pageviewToken that we use to connect EL with the survey responses. In Qualtrics, this is not possible and we still see the missing EL.

Hypotheses

  • Missing all EL is possibly related to browser agent. We have observed that certain browsers are being undersampled by QuickSurveys (T218243#5086923). It could be that these browsers are in fact seeing QuickSurveys, they just are not appropriately logging that.
    • It seems other EventLogging schemas have run into related challenges (T204143)
  • Just missing QuickSurveysResponses EL (but not QuickSurveyInitiation) is likely related to this bug: T217171#4992112
    • Essentially, right-clicking and opening a QuickSurvey link in a new tab is not registered by the extension. Presumably the 91 responses for Reader Trust and 38 for Demographics Pilot that had QuickSurveyInitiation but not QuickSurveysResponses (despite completing the survey) could be the result of this behavior.

Survey Overviews

Reader Trust (T217576):

  • Out of the 1971 survey responses recorded by Qualtrics:
    • 1702 (86%) of those have corresponding QuickSurveysResponses data
    • Another 91 (for a total of 1793 or 91%) can be matched to QuickSurveyInitiation data.

Demographics Pilot (T217080):

  • Out of the 626 survey responses recorded by Google Forms:
    • 514 (82%) of those have corresponding QuickSurveysResponses data
    • Another 38 (for a total of 552 or 88%) can be matched to QuickSurveyInitiation data.
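
The match rates above follow directly from the counts. A minimal sketch of that arithmetic, using the Demographics Pilot figures (the function and key names are illustrative only and not part of any existing tooling):

```python
def match_rates(total, with_responses, initiation_only):
    """Return the share of survey responses at each level of EventLogging coverage."""
    matched_any = with_responses + initiation_only
    return {
        "responses": with_responses / total,      # matched to QuickSurveysResponses
        "any_el": matched_any / total,            # matched to either schema
        "no_el": (total - matched_any) / total,   # no EventLogging at all
    }


# Demographics Pilot: 626 Google Forms responses, 514 with QuickSurveysResponses,
# 38 more with only QuickSurveyInitiation.
rates = match_rates(total=626, with_responses=514, initiation_only=38)
for name, share in rates.items():
    print(f"{name}: {share:.1%}")
```

For the pilot this leaves 74 of the 626 responses with no EventLogging in either schema, which is the gap this task is about.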

For these analyses, I fully skipped EventLogging and instead used webrequest logs, using a query like that below to gather the EL (and then attempting to join it to the survey responses provided by Qualtrics/Google Forms):

SELECT *,
       REFLECT('java.net.URLDecoder', 'decode', SUBSTR(uri_query, 2)) AS json_event
  FROM wmf.webrequest
 WHERE uri_path LIKE '%beacon/event'
       AND uri_query LIKE '%QuickSurvey%'
       AND uri_query LIKE '%<survey-name>%'
       AND year = 2019 AND month = 3 AND day >= 18 AND day < 23
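
The join step that follows the query can be sketched roughly as below. This is a sketch under assumptions, not the analysis code that was actually used: it assumes each decoded event is a JSON object with a top-level "schema" field and the token at event.pageviewToken, and the helper names are hypothetical.

```python
import json
from urllib.parse import quote, unquote


def decode_event(uri_query):
    """Mirror SUBSTR(uri_query, 2) + URLDecoder.decode: drop the leading '?' and URL-decode."""
    raw = unquote(uri_query[1:])
    # Assumption: tolerate a trailing ';' terminator in case the client appends one.
    return json.loads(raw.rstrip(";"))


def join_responses(survey_tokens, uri_queries):
    """Map each survey pageview token to the set of schemas logged for it."""
    seen = {}
    for uri_query in uri_queries:
        event = decode_event(uri_query)
        # Assumed layout: {"schema": ..., "event": {"pageviewToken": ...}}
        token = event.get("event", {}).get("pageviewToken")
        if token is not None:
            seen.setdefault(token, set()).add(event.get("schema"))
    return {token: seen.get(token, set()) for token in survey_tokens}


def make_query(schema, token):
    """Build a hypothetical beacon query string for the example below."""
    payload = {"schema": schema, "event": {"pageviewToken": token}}
    return "?" + quote(json.dumps(payload))


# Hypothetical data: "aaa" has both events, "bbb" only initiation, "ccc" nothing.
queries = [
    make_query("QuickSurveyInitiation", "aaa"),
    make_query("QuickSurveysResponses", "aaa"),
    make_query("QuickSurveyInitiation", "bbb"),
]
joined = join_responses(["aaa", "bbb", "ccc"], queries)
print(joined)
```

Tokens that map to an empty set are the survey responses with no EventLogging at all.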

Event Timeline

Is it possible that the link to the survey is being shared outside a QuickSurvey (e.g. social media)?

@Jdlrobson Good point but I'm pretty sure not. We haven't seen any evidence of links being shared on social media. If the links were shared, we'd expect a bunch of survey responses with blank or duplicate pageview tokens, because the person would share the link either w/o the URL parameters that pass the unique pageview token to Qualtrics/Google Forms or with the URL parameters particular to their own session. The responses that don't have EL associated with them, however, have pageview tokens that are unique and look reasonable (i.e. correct length and random-ish string of characters as expected).

fdans raised the priority of this task from High to Unbreak Now!.
fdans moved this task from Incoming to Data Quality on the Analytics board.
fdans lowered the priority of this task from Unbreak Now! to High. Apr 11 2019, 4:27 PM
fdans raised the priority of this task from High to Unbreak Now!.
fdans moved this task from Data Quality to Ops Week on the Analytics board.

Hey! :]

I've been looking into this for a bit.
Is there any documentation I can read on the flow of the surveys?
Does the user click on a link on-wiki, that opens a Google/Qualtrics form?
And when do events for QuickSurveyInitiation and QuickSurveysResponses trigger?

The browser stats are mysteriously interesting, I think it's worth digging further into that.
Another condition that could be a partial cause of the missing data is disabled JS on the browser, no?
Maybe also the fact that beacon requests are sent only on the unloading of the page causes them to be delayed by a couple of days (if people leave tabs open).
I've seen a (short) long tail of QuickSurveyResponses events for reader-demographics-en-pilot that are outside of the suggested time intervals of the experiment.

I found https://www.mediawiki.org/wiki/Extension:QuickSurveys,
and it explains the code for the survey is loaded dynamically, so JS disabled is not the cause.
DNT is also not the cause, because when it's on, the surveys don't even show.

I don't understand yet how Google/Qualtrics forms can send responses and at the same time we get corresponding QuickSurveysResponses events.
Are the external forms configured to also send beacons?

@ovasileva: This might benefit from some investigation on our side too.

Is there any documentation I can read on the flow of the surveys? Does the user click on a link on-wiki, that opens a Google/Qualtrics form?

Unfortunately no great documentation that I know of but happy to try to sketch it out. You're right about the click on link on-wiki that opens a Google/Qualtrics form. A few general notes:

  • Regardless of internal vs. external survey, pretty much the same criteria are used to determine whether a given reader will see a survey
  • Each external survey will have a corresponding configuration and message pages that provide the information necessary for sampling as well as which URL to provide the reader if they click "yes" to take a survey. Depending on the survey, this URL generally takes them either to a Google Form or Qualtrics survey.
  • Importantly, external surveys can be configured to dynamically add a URL parameter to the survey URL that passes that reader's (unique) pageview token to the external survey. For Google Forms, we set it up so that this pageview token automatically is set as the answer to one of the questions. For Qualtrics, this pageview token is just stored alongside the survey.

For the demographics pilot, the flow would be like this:

And when do events for QuickSurveyInitiation and QuickSurveysResponses trigger?

The pageview tokens and our ability to link them up with EventLogging for the demographics survey are here: https://docs.google.com/spreadsheets/d/10s2U1vHGefd6g8Ev4clT4e-MEO5ThXoX--tjVR9GyzM/edit?usp=sharing

I've seen a (short) long tail of QuickSurveyResponses events for reader-demographics-en-pilot that are outside of the suggested time intervals of the experiment.

Yeah, I tried looking into that but I think I decided that it only explained potentially a very tiny minority, and all of the EL should actually be sent before the user even takes the survey.

And when do events for QuickSurveyInitiation and QuickSurveysResponses trigger?

Unnervingly, this isn't strictly true. Looking at L87 of that same file, there's a test to see if the mw.eventLog property exists. That property is set up in "EventLogging Core" (https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/1db3013946fc0d451d4f7f3fdd5fd7f17cebae02/modules/ext.eventLogging/core.js#L240). QuickSurveys does require EventLogging but it doesn't require the client-side code to be loaded and executed before its client-side code is, i.e. it's unlikely but not impossible that QuickSurveys could be loaded and executed before EventLogging.

I don't think this explains what you're seeing but I think this is an omission in QuickSurveys' design (and a very easy one to fix at that).

i.e. it's unlikely but not impossible that QuickSurveys could be loaded and executed before EventLogging.

Hmm...that would explain a lack of initiation possibly but my (perhaps naive) assumption is that by the time a user clicked on the survey, everything would be properly loaded.

My other current theory is the missing 10% is possibly browsers that don't support sendBeacon (https://developer.mozilla.org/en-US/docs/Web/API/Navigator/sendBeacon#Browser_compatibility). That would potentially block EventLogging but not QuickSurveys if the fallback in EventLogging, which is "create an image with the same URL as the beacon", does not work well.

The QuickSurveyInitiation event does send a value equivalent to !!navigator.sendBeacon here: https://github.com/wikimedia/mediawiki-extensions-QuickSurveys/blob/c40f824c79b4e98993dafe7416a60fd6ad9cce45/resources/ext.quicksurveys.views/QuickSurvey.js#L87

Looking at the EventLogging stats for events that record whether sendBeacon is present, a few browsers stand out as having a high proportion of false values (so the EL was sent via a fake image request):

| Browser Family | sendBeacon is True | sendBeacon is False |
| --- | --- | --- |
| Chrome | 0.998 | 0.002 |
| Chrome Mobile | 0.995 | 0.005 |
| Mobile Safari | 0.904 | 0.096 |
| Firefox | 0.993 | 0.007 |
| Samsung Internet | 0.95 | 0.05 |
| Safari | 0.788 | 0.212 |
| Edge | 0.99 | 0.01 |
| IE | 0.001 | 0.999 |
| Mobile Safari UI/WKWebView | 0.912 | 0.088 |
| UC Browser | 0.951 | 0.049 |
| Opera | 0.998 | 0.002 |
| Chrome Mobile WebView | 0.999 | 0.001 |
| Chrome Mobile iOS | 0.605 | 0.395 |
| Firefox Mobile | 0.994 | 0.006 |
| Amazon Silk | 1 | 0 |
| Opera Mobile | 0.969 | 0.031 |

These browsers also largely match up w/ the ones that seemed underrepresented in the sampling (T218243#5086923), though it doesn't fully explain Safari and Firefox.

My other current theory is the missing 10% is possibly browsers that don't support sendBeacon

Seems unlikely. Rather, it makes sense that if you have a loading issue (per @phuedx's comment above) that is causing events not to be sent (because the EL module is not loaded), that issue will be more prevalent in older browsers, which parse and load JavaScript much more slowly than new ones.
Are you showing surveys in mobile as well as desktop?

FYI that your table above does not take into account browser percentages, for example: the only IE browser you should see is IE11 as the older versions do not receive javascript and thus they cannot execute eventlogging code. "older" versions of IE indicate bots, not users. See: https://www.mediawiki.org/wiki/Compatibility#Modern_(Grade_A)

My other current theory is the missing 10% is possibly browsers that don't support sendBeacon

Seems unlikely. Rather, it makes sense that if you have a loading issue (per @phuedx's comment above) that is causing events not to be sent (because the EL module is not loaded), that issue will be more prevalent in older browsers, which parse and load JavaScript much more slowly than new ones.

Yeah, that makes sense to me. Possibly both issues are at play (i.e. both slower parsing/loading of JS + having to rely on the less robust method of creating a fake image w/ the appropriate URL as opposed to the sendBeacon functionality). Regardless, both hypotheses suggest that the 10% of our survey responses that do not have associated EL are very likely responses submitted via older versions of IE or other, older browsers. Looking at the survey responses that are missing EL: they skew younger (below 40) and are more likely to be male than female but no other trends stand out.

Are you showing surveys in mobile as well as desktop?

Yes, both mobile web and desktop but not the app. The final proportion ends up being ~50% mobile and ~50% desktop.

FYI that your table above does not take into account browser percentages

Yeah, I left out versions because it was a lot of data. The same table, broken out by browser version and limited to versions w/ at least 1000 data points, is below. A few takeaways:

| Browser | sendBeacon is True | sendBeacon is False | % of total |
| --- | --- | --- | --- |
| Chrome (v.72) | 1 | 0 | 27.59% |
| Mobile Safari (v.12) | 0.999 | 0.001 | 18.42% |
| Chrome Mobile (v.72) | 1 | 0 | 18.19% |
| Firefox (v.65) | 0.999 | 0.001 | 3.21% |
| Mobile Safari (v.11) | 0.705 | 0.295 | 2.55% |
| Samsung Internet (v.8) | 1 | 0 | 2.37% |
| IE (v.11) | 0.001 | 0.999 | 2.30% |
| Chrome Mobile (v.71) | 1 | 0 | 2.20% |
| Safari (v.12) | 1 | 0 | 2.19% |
| Chrome (v.71) | 1 | 0 | 2.09% |
| Edge (v.17) | 1 | 0 | 1.65% |
| Mobile Safari (v.10) | 0 | 1 | 0.89% |
| Chrome Mobile (v.70) | 1 | 0 | 0.78% |
| Mobile Safari UI/WKWebView (v.12) | 0.967 | 0.033 | 0.74% |
| Edge (v.18) | 1 | 0 | 0.63% |
| Chrome Mobile (v.68) | 1 | 0 | 0.61% |
| Chrome Mobile (v.69) | 1 | 0 | 0.60% |
| Chrome (v.70) | 1 | 0 | 0.54% |
| UC Browser (v.12) | 0.994 | 0.006 | 0.48% |
| Opera (v.58) | 1 | 0 | 0.46% |
| Samsung Internet (v.9) | 1 | 0 | 0.45% |
| Safari (v.11) | 0.411 | 0.589 | 0.44% |
| Mobile Safari (v.9) | 0.011 | 0.989 | 0.40% |
| Chrome Mobile (v.64) | 1 | 0 | 0.39% |
| Chrome (v.49) | 1 | 0 | 0.37% |
| Chrome Mobile (v.66) | 1 | 0 | 0.34% |
| Amazon Silk (v.72) | 1 | 0 | 0.33% |
| Chrome Mobile (v.67) | 1 | 0 | 0.29% |
| Chrome (v.67) | 1 | 0 | 0.29% |
| Chrome (v.69) | 1 | 0 | 0.28% |
| Chrome Mobile iOS (v.72) | 0.873 | 0.127 | 0.28% |
| Safari (v.10) | 0 | 1 | 0.28% |
| Samsung Internet (v.7) | 1 | 0 | 0.25% |
| Firefox (v.60) | 0.998 | 0.002 | 0.24% |
| Chrome (v.68) | 1 | 0 | 0.23% |
| Chrome Mobile (v.61) | 1 | 0 | 0.22% |
| Chrome Mobile (v.65) | 1 | 0 | 0.20% |
| Firefox Mobile (v.65) | 1 | 0 | 0.19% |
| Edge (v.16) | 1 | 0 | 0.18% |
| Firefox Mobile (v.48) | 1 | 0 | 0.18% |
| Chrome Mobile WebView (v.72) | 0.999 | 0.001 | 0.14% |
| Chrome Mobile (v.55) | 1 | 0 | 0.14% |
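
One way to fold the browser percentages into the sendBeacon figures is to weight each version's "sendBeacon is False" rate by its share of events. The sketch below does that for the rows in the table above whose fallback rate is at least 1%; the cut-off and the function name are choices made for this illustration, and the result only describes how many events went through the image fallback, not how many were lost.

```python
# (browser, share of events with sendBeacon False, % of total events),
# copied from the table above for rows with a fallback rate of at least 1%.
ROWS = [
    ("Mobile Safari (v.11)", 0.295, 2.55),
    ("IE (v.11)", 0.999, 2.30),
    ("Mobile Safari (v.10)", 1.0, 0.89),
    ("Mobile Safari UI/WKWebView (v.12)", 0.033, 0.74),
    ("Safari (v.11)", 0.589, 0.44),
    ("Mobile Safari (v.9)", 0.989, 0.40),
    ("Chrome Mobile iOS (v.72)", 0.127, 0.28),
    ("Safari (v.10)", 1.0, 0.28),
]


def weighted_fallback(rows):
    """Percentage of all events sent via the image fallback, summed over rows."""
    return sum(rate * share for _, rate, share in rows)


total = weighted_fallback(ROWS)
print(f"{total:.2f}% of events used the image fallback")

# Largest contributors first.
for browser, rate, share in sorted(ROWS, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{browser}: {rate * share:.2f}%")
```

For the rows listed, this comes to a little under 5% of events, with IE 11 the largest single contributor.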

Yes, both mobile web and desktop but not the app. The final proportion ends up being ~50% mobile and ~50% desktop.

Ok, makes sense that loading issues will be more prevalent on mobile connections on low end devices.

For these analyses, I fully skipped EventLogging and instead used webrequest logs, using a query like that below to gather the EL (and then attempting to join it to the survey responses provided by Qualtrics/Google Forms):

This query would also give you invalid events; for some schemas the number could be quite significant. Maybe you know this but just an FYI. See errors for the time period: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@e902392&_a=h@324967b

Nuria lowered the priority of this task from Unbreak Now! to Needs Triage. Apr 19 2019, 1:32 PM
Nuria moved this task from Ops Week to Radar on the Analytics board.

Moving to radar, as further steps (code changes to QuickSurveys to fix loading issues with JS) should be done by (I think) @phuedx's team?

👍 Readers Web maintain QuickSurveys and the EventLogging "backend".

ovasileva triaged this task as Medium priority. Oct 15 2019, 4:05 PM
MBinder_WMF changed the subtype of this task from "Task" to "Spike". Oct 16 2019, 5:11 PM
MBinder_WMF subscribed.

Changed subtype based on Kickoff @phuedx

Change 544286 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/QuickSurveys@master] Depend on EventLogging

https://gerrit.wikimedia.org/r/544286

Change 544287 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/QuickSurveys@master] Hygiene: Don't guard access to mw.eventLog

https://gerrit.wikimedia.org/r/544287

I've reviewed these patches. Do we think those were the reason for the missing events? What are the next steps? Confirming via data? Or if we know for sure, a write up explaining what went wrong here?

Change 544286 merged by jenkins-bot:
[mediawiki/extensions/QuickSurveys@master] Depend on EventLogging

https://gerrit.wikimedia.org/r/544286

Change 544287 merged by jenkins-bot:
[mediawiki/extensions/QuickSurveys@master] Hygiene: Don't guard access to mw.eventLog

https://gerrit.wikimedia.org/r/544287

Do we think those were the reason for the missing events? What are the next steps? Confirming via data?

@Jdlrobson I have a survey that's been running for about four weeks right now. The plan was to undeploy it on Thursday but if this change will go live shortly, I'm willing to discuss delaying that a few days to gather some data. Thoughts?

@Isaac having some data post Thursday would be useful!

Additionally, the Performance team is running their Perceived Performance survey right now.

Thanks for pointing this out -- if I'm not mistaken, that's an internal survey so I'll still extend my external survey until next week. The missing data is only evident in external surveys where we somehow have people responding to the survey via Google Forms with reasonable-looking survey codes and responses but no associated initiation/response eventlogging. Presumably the loss of data happens in internal surveys as well, but we have no second source of data that indicates that we're missing responses.

As I go to do this analysis, what UTC day/hour should be my cut-off for when QuickSurveys would have switched from the old approach to the new approach?

According to the Server Admin Log, the change was deployed at 2019/10/24 13:06 UTC.

@Isaac let me know if I can help in any way with identifying whether that missing 10% has been accounted for.

@Jdlrobson I took a look at the survey responses we had and our ability to match them up with EventLogging. High-level is we're still seeing the problem where ~10% of survey responses have no corresponding EventLogging for either the QuickSurveysResponses or QuickSurveyInitiation schemas. It did seem to improve for English but I don't trust that because it got worse for Russian/Polish. I split the survey responses up between before and after the deployment (2019/10/24 13:06 UTC per above). For the three surveys that we were running, this is what we get:

English
% missing before: 8.6%
% missing after: 2.9%

Polish
% missing before: 8.1%
% missing after: 12.7%

Russian
% missing before: 5.4%
% missing after: 8.8%
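
The before/after split can be sketched as below. This is an illustration rather than the analysis code that produced the numbers above: the cut-off is the deployment time from the Server Admin Log, while the function name, input shape, and sample data are hypothetical.

```python
from datetime import datetime, timezone

# Deployment time from the Server Admin Log.
DEPLOYED_AT = datetime(2019, 10, 24, 13, 6, tzinfo=timezone.utc)


def missing_rates(responses, cutoff=DEPLOYED_AT):
    """Share of responses without EventLogging, before and after the cutoff.

    `responses` is an iterable of (submitted_at, has_eventlogging) pairs,
    where submitted_at is a timezone-aware datetime.
    """
    counts = {"before": [0, 0], "after": [0, 0]}  # [missing, total]
    for submitted_at, has_eventlogging in responses:
        period = "before" if submitted_at < cutoff else "after"
        counts[period][1] += 1
        if not has_eventlogging:
            counts[period][0] += 1
    return {
        period: (missing / total if total else None)
        for period, (missing, total) in counts.items()
    }


# Hypothetical responses, for illustration only.
sample = [
    (datetime(2019, 10, 23, 9, 0, tzinfo=timezone.utc), True),
    (datetime(2019, 10, 24, 13, 5, tzinfo=timezone.utc), False),
    (datetime(2019, 10, 24, 13, 6, tzinfo=timezone.utc), True),
    (datetime(2019, 10, 25, 8, 30, tzinfo=timezone.utc), True),
]
rates = missing_rates(sample)
print(rates)
```

Responses submitted exactly at the cut-off are counted as "after" here. Note that a reader may have loaded the page (and the old code) before the deployment and answered afterwards, so a split on submission time is only approximate.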

@Isaac Something else to consider (once loading issues have been solved) is that for any users (on desktop) with adblock installed (or privacy badger or similar) you might not get data either. Adblock is installed in 5 to 10% of desktop browsers, so if your survey is happening all in desktop (as I imagine it is) and surveys display regardless of whether users have adblock installed (not sure on this last point but it is easy to test), then you are likely to have some data loss on desktop.

The URL path that EventLogging sends data to is "beacon/event", which is present on popular adblocker URL lists:
https://easylist.to/easylist/easyprivacy.txt

A data loss of 5% (on desktop; users do not normally run adblock on mobile) doesn't seem outlandish given these numbers.

Thanks @Nuria !

if your survey is happening all in desktop

The survey happens on both desktop and mobile. Unfortunately we can't tell whether the survey responses missing EventLogging were from mobile or desktop :/ But adblocking exists on both platforms.

Something else to consider (once loading issues have been solved) is that for any users (on desktop) with adblock installed (or privacy badger or similar) you might not get data either

I think you might have identified at least a good chunk of the problem. I checked with different settings of AdBlock Plus on Firefox:

  • With default settings, it still allows quicksurveys to appear and be logged.
  • When you just check "Block social media icons tracking", it prevents QuickSurveys from even showing up (I'm not sure I understand why).
  • When you just check "Block additional tracking", it allows QuickSurveys to show up but suppresses EventLogging.

So for lack of better hypotheses, this is where I would summarize we're at:

  • adblocking has settings that allow people to see and take our surveys but suppress EventLogging. We can't verify that this is the source, but adblocking is sufficiently popular as to reasonably be the cause. I'm okay with accepting that loss.
  • right-clicking and opening a survey in a new tab does not generate EventLogging for QuickSurveysResponses (i.e. doesn't register that the Yes button was clicked), which likely explains the people who have QuickSurveyInitiation EventLogging but no QuickSurveysResponses EventLogging despite taking the survey (T217171#4992112). This doesn't fully break our workflow though it's not exactly expected behavior.

Resolving this. Thanks to everyone that spent time analyzing.

Just to note, we have the same problem for the new CentralNotice data pipeline, which uses EventLogging, as compared to the old pipeline, which uses a custom call to beacon/impression not blocked by AdBlock. In case it's useful: see T236834#5696044 (and the two comments after that).

Both this task and T236834 have a lot of rich discussion that I'd love to see on-wiki for the general EventLogging pipeline and QuickSurveys' use of that pipeline. I volunteer to write both during the week of All Hands and would welcome collaborators.

@phuedx yeah, I would second that. I'll let you take the lead but let me know how I can help.

Late entry here--reading through the comments, I think the only hypothesis which stands up is that 10% of people are right-clicking the link and opening in a new tab. This could be measured by adding a parameter to the link when it's reached through the "Submit" button (should be present 90% of the time, if this theory is true).
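
A rough sketch of that measurement follows. The idea is that only the click handler appends a marker to the survey URL, so a link opened via right-click + new tab arrives without it. The parameter name "clicked" and both helper names are hypothetical, and whether the external survey tool records the extra parameter is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

MARKER = ("clicked", "1")  # hypothetical parameter name


def mark_handled_click(url):
    """Append the marker, as the click handler would before opening the survey."""
    parts = urlparse(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(MARKER)
    return urlunparse(parts._replace(query=urlencode(query)))


def handled_share(response_urls):
    """Share of recorded survey URLs that carry the marker."""
    marked = sum(
        1 for url in response_urls
        if MARKER in parse_qsl(urlparse(url).query)
    )
    return marked / len(response_urls)


# Hypothetical survey URLs: three went through the handler, one did not.
base = "https://example.org/survey?token=abc"
urls = [mark_handled_click(base), base, mark_handled_click(base), mark_handled_click(base)]
print(mark_handled_click(base))
print(handled_share(urls))
```

If the right-click theory holds, the share of responses carrying the marker should sit near 90%.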

The other explanations such as adblock would make sense if we were considering server-counted pageviews vs. event-logged pageviews, but in this case we're comparing two eventlogging streams so sendBeacon is going to behave consistently for both usages.

but in this case we're comparing two eventlogging streams so sendBeacon is going to behave consistently for both usages.

This is not correct. sendBeacon will "try" to send the event in both cases; in the adblock case it will be prevented from doing so, as HTTP requests to URLs that match certain patterns are stopped.

I should be clearer: what I meant is that sendBeacon will consistently fail if and only if the browser is ad-blocking. The failure is systematic in a way that both the Initiation and Responses events will not be sent, therefore adblock cannot explain the gap here.

Just to clarify, there were two unexplained logging issues that we were trying to debug:

  • Survey taken but only QuickSurveyInitiation EL data exists and no QuickSurveysResponses EL data -- this seems to be explained by right-click + open in new tab, which doesn't trigger the QuickSurveysResponses eventlogging but would have no effect on the QuickSurveyInitiation eventlogging going through. We can't verify that this fully explains the gap, but it's quite reasonable I think.
  • Survey taken but no eventlogging data exists: this is what the ad-block hypothesis was focused on. I found at least one configuration in AdBlock Plus on Firefox (the only one I checked) where EventLogging was blocked but the JavaScript etc. necessary to surface QuickSurveys continued to function and so someone could still take the survey. This is a possible explanation and no better one arose, so I stick by it. I don't have a great sense of what sorts of ad-block configurations people use, though, so it's hard to say whether my assumptions are reasonable.
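
To keep the two cases apart when counting, each survey response can be bucketed by which schemas were logged for its pageview token. A minimal sketch, assuming a mapping from token to the set of schema names seen for it (the mapping, bucket names, and sample data are hypothetical):

```python
from collections import Counter


def classify(schemas):
    """Bucket one survey response by the EventLogging schemas seen for its token."""
    if "QuickSurveysResponses" in schemas:
        return "matched"
    if "QuickSurveyInitiation" in schemas:
        return "initiation_only"   # consistent with right-click + open in new tab
    return "no_eventlogging"       # what the ad-block hypothesis tries to explain


def summarize(schemas_by_token):
    """Count survey responses per bucket."""
    return Counter(classify(schemas) for schemas in schemas_by_token.values())


# Hypothetical tokens and schema sets.
example = {
    "aaa": {"QuickSurveyInitiation", "QuickSurveysResponses"},
    "bbb": {"QuickSurveyInitiation"},
    "ccc": set(),
    "ddd": {"QuickSurveysResponses"},
}
summary = summarize(example)
print(summary)
```

A response with QuickSurveysResponses but no QuickSurveyInitiation is counted as matched here, since it can still be joined to the survey response.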

It should be possible to test this explanation. We can make QuickSurveys use button tags rather than a tags, removing the ability to right-click + open in new tab. This should be a relatively simple change as OOUI provides consistent styling for both tags when used as buttons.

I'm certainly interested in whether this explains the whole issue, but regardless, this would be desirable if someone has the time and it fixes the right-click issue. I don't see any drawbacks to this approach, and improving logging for QuickSurveys is pretty important to it being useful.