QA Eventlogging to Google Forms set up
Closed, ResolvedPublic

Description

Once the test in T125946 is merged (expected to be merged in the 16:00-17:00 PST SWAT deploy interval), do a complete quality assessment of the survey data being collected.

  • Assess the effect of the change introduced by T127980: There will be a 30 min period between 16:00-17:00 PST on 2016-02-25 during which we increase the sampling rate to 1:500. Use that data to assess the extent to which, by not showing the survey to DNT users, we have been able to close the 30% data loss.
  • Compare the distribution of top user agents that interact with the survey with the distribution of user agents from webrequest logs during the same period of time. Given that we are doing random sampling when we show the survey, putting biases for opting in aside, we expect to see roughly the same user agents in EL, at least for the top ~15 user agents. Are we missing a major user agent in EL?
  • Check QuickSurveysResponses_15266417 table to make sure all the data we are interested in is being collected, correctly.
  • Check if we have responses in the survey spreadsheet for users who have chosen to participate in the survey.
  • Check QuickSurveyInitiation_15278946 table for impressions, number of yes, number of no, and the rest of the funnel.
  • Ping people in T125946 if the survey needs to be turned off immediately in case there is a critical issue.
  • [1]

[1] I think we may have a data loss problem. I took the survey about 15 minutes ago, using the URL parameter override to load the widget from the page I was visiting. I completed both screens of the survey, opting into data collection. My record doesn't show up in log.QuickSurveysResponses_15266417 on analytics-store (and there's virtually no replag). Is the URL override skipping logging or was my record meant to be logged? Do the complete responses with a Yes on the 2nd screen match the EL table so far?

Event Timeline

leila assigned this task to ellery.
leila raised the priority of this task from to High.
leila updated the task description.
leila added a project: Research.
leila moved this task to Staged (ready to execute) on the Research board.
leila added subscribers: leila, DarTar, Cervisiarius.
leila set Security to None.

@ellery can you look into the items in the Description and check/expand those that you have already looked into. If you have a link to the way you did the QA, please share it here. This can help us for future QAs. thanks! :-)

@ellery are you working on this item today? we should close this task before 16:00 PST. Please update the task with what you have done so we know where we are with QA.

@leila You can see the full analysis here.

Everything looks good, except that only 60% of responses in Google Forms are joinable to EL. Dario observed this issue for himself. Maybe it has to do with forcing the survey.

@ellery I request a hold to the survey until we resolve this.

@dr0ptp4kt we need your help here. Please check the Summary at the bottom of this link. At the time of this analysis, a few hours ago, we had 33 responses on Google Forms but only 23 corresponding responses in EL. It seems we are failing to collect all responses. Can you help identify where the problem might be?

Following up on our face to face discussions, when Do Not Track is turned on explicitly or implicitly, Event Logging events will not be sent to the server.

https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/71c566107f8120b4c8c2590ad87a57f9e76dd094/modules/ext.eventLogging.core.js#L249

Note: The referenced code uses the term sendBeacon, but notice it uses the navigator.sendBeacon feature if possible and an <img> tag otherwise (but neither if DNT is turned on).
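
The decision described in the note above can be sketched roughly as follows. This is a minimal illustration, not the EventLogging source: the helper name chooseTransport and the exact Do Not Track check are assumptions made for the example, and it takes a navigator-like object so that it can run outside a browser.

```javascript
// Hypothetical helper (not part of EventLogging): decide how an event URL
// would be delivered for a given navigator-like object.
// Returns 'none' when Do Not Track is on, 'beacon' when sendBeacon is
// available, and 'img' (the <img> fallback) otherwise.
function chooseTransport(nav) {
  // Assumed DNT check: treat '1' or 'yes' as "Do Not Track enabled".
  const dnt = String(nav.doNotTrack);
  if (dnt === '1' || dnt === 'yes') {
    return 'none';
  }
  if (typeof nav.sendBeacon === 'function') {
    return 'beacon';
  }
  return 'img';
}

console.log(chooseTransport({ doNotTrack: '1', sendBeacon() {} })); // none
console.log(chooseTransport({ doNotTrack: null, sendBeacon() {} })); // beacon
console.log(chooseTransport({ doNotTrack: null })); // img
```

The point relevant to this task is the first branch: with DNT on, no event is sent by either transport, so a survey response can exist in Google Forms without any EL row.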

Given the relatively small set of responses, I suspect a nontrivial portion of the responses lacking corresponding Event Logging originated from employees trying out the survey with the forced quicksurvey=true parameter while Do Not Track was turned on in their browsers, artificially bumping up the figure. I also suspect some of the other responses lacking corresponding Event Logging originated from regular users with Do Not Track turned on.

User agent distribution seemed fairly well rounded in the funnel, which suggests things are generally working across browsers.

Furthermore, sendBeacon capability - which I thought at first might be a factor at play in the tap-through action in particular - looked fairly balanced at the start of the funnel.

select event_beaconCapable, count(*) from QuickSurveyInitiation_15278946 where timestamp > '20150218' and event_eventName='eligible' group by event_beaconCapable;
event_beaconCapable	count(*)
0	8227
1	10339

At the button tap part of the funnel, sendBeacon-incapable responses were actually more common than the earlier part of the funnel would have suggested, meaning differences in sendBeacon capabilities were probably not a significant factor influencing the missing EL, either.

select event_beaconCapable, event_surveyResponseValue, count(*) from QuickSurveyInitiation_15278946 inner join QuickSurveysResponses_15266417 on QuickSurveyInitiation_15278946.event_surveyInstanceToken = QuickSurveysResponses_15266417.event_surveyInstanceToken where QuickSurveyInitiation_15278946.event_eventName = 'eligible' group by event_beaconCapable, event_surveyResponseValue;
event_beaconCapable	event_surveyResponseValue	count(*)
0	ext-quicksurveys-external-survey-no-button	67
0	ext-quicksurveys-external-survey-yes-button	26
1	ext-quicksurveys-external-survey-no-button	65
1	ext-quicksurveys-external-survey-yes-button	29
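
To make the comparison between the two query results concrete, here is a small sketch that computes the sendBeacon-incapable share at each stage. The counts are copied from the results above; the variable and function names are made up for the example.

```javascript
// Counts copied from the two query results above.
const eligible = { incapable: 8227, capable: 10339 };
const tapped = { incapable: 67 + 26, capable: 65 + 29 };

// Fraction of sendBeacon-incapable clients in an { incapable, capable } tally.
function incapableShare(tally) {
  return tally.incapable / (tally.incapable + tally.capable);
}

console.log((100 * incapableShare(eligible)).toFixed(1)); // 44.3
console.log((100 * incapableShare(tapped)).toFixed(1)); // 49.7
```

Roughly 44% of eligible events came from sendBeacon-incapable clients, against roughly 50% of button taps, so incapable clients are not under-represented at the tap stage.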

As noted earlier today on videoconference, when EL events were missing for responses they were generally missing in both QuickSurveysResponses and QuickSurveyInitiation events, a hint that the button tap part of the funnel wasn't specifically broken.

My recommendation would be to keep letting the survey run at the lower level, not to do any quicksurvey=true parameter forcing (especially not if DNT is enabled!), and then examine the data from the weekend days (20- and 21-Feb, 2016) specifically. I would not be surprised then if some 20% of the responses lack corresponding EL events when looking at the weekend data.

Now, as for whether the QuickSurvey CTA for a third party survey should be shown in the first place when DNT is enabled, in order to filter away noise, I don't know. Although correlation against the funnel would be strictly out of scope when EL is missing for a response, which is easy enough to do programmatically, it may still be interesting as an incidental matter to determine whether the response tuples without EL roughly mirror the response tuples for those with EL.

From a UX perspective, I guess eventually I would prefer that we stop showing the CTA if DNT is turned on, simply for the reason that we know the JavaScript code isn't going to do EL anyway for the general on-site or garden variety third party surveys; no point interjecting stuff if we won't have the responses for the standard cases. But I don't know that disablement of the CTA when DNT is turned on should be a dependency of this effort. It may make more sense to actually make this CTA suppression tweak after the surveys have run.

thanks @dr0ptp4kt. Just a quick note that, as discussed, the rate of DNT-enabled responses over the weekend will not be the same as during the week, since we'll have a substantially larger volume of mobile traffic (I have no idea what the DNT distribution looks like on mobile browsers).

@leila I talked to @Nuria last night and she said she'll be available this morning to give you a quick overview of the implications of NAT on IP addresses for mobile clients, if you're around.

Agreed, we'd likely see a difference in the DNT between desktop and mobile UAs, and we can use the webHost field to determine which traffic is mobile or desktop. I think if we all stay away from quicksurvey=true for today (Friday) and the weekend and let it run over the weekend we'll have a data set that's more representative (I could be wrong, though!) for Friday through Sunday. What do you think?

ellery renamed this task from Large scale reader survey quality assessment. to QA Eventlogging to Google Forms set up. Feb 19 2016, 9:24 PM

@dr0ptp4kt let's go with what you've suggested. I'll start a discussion in analytics to see how feasible it is to assess the DNT impact more thoroughly.

@dr0ptp4kt's analysis sounds right to me. However, I scanned the JavaScript logging code for obvious bugs, and I have a question: why are the mw.eventLog.logEvent() calls enclosed in an if ( mw.eventLog ) { ... } block, when the module explicitly declares a dependency on EventLogging? I can't think of a case when mw.eventLog would be falsey, unless ResourceLoader's dependency management logic is being subverted.

@dr0ptp4kt FYI, an estimated 11% of the requests have DNT on per early results in T127571.

> @leila I talked to @Nuria last night and she said she'll be available this morning to give you a quick overview of the implications of NAT on IP addresses for mobile clients, if you're around.

I talked to @Nuria. We should still run the survey on both mobile and desktop sites, but we need to keep in mind that

  • we will need to come up with a different strategy for building traces for mobile users;
  • we may be able to only build traces for some parts of the mobile population;
  • It is still useful to compare the distribution of responses between Desktop and Mobile users, even if we can't build traces for mobile users.

As of this morning, we have the following counts:

Yes Clicks: 114
Google Responses: 81
Google Responses tracked in EL: 59

As a comparison, we had the following counts on Thursday:

Yes Clicks: 45
Google Responses: 33
Google Responses tracked in EL: 23

In summary,

  • since the start of the experiment, 73% of Google Forms responses have a matching "Yes Click" event in EL
  • since Thursday, 75% of Google Forms responses have a matching "Yes Click" event in EL
  • the rate of missing data still is not fully explained by the usage of DNT
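
The percentages in this summary can be reproduced from the counts above with a short calculation. This is only a sketch: the names are invented for the example, and the comparison against the DNT share assumes that DNT users answer the survey at the same rate as everyone else, which the thread has not established.

```javascript
// Counts reported above (Thursday snapshot and this morning's snapshot).
const thursday = { yesClicks: 45, forms: 33, formsInEL: 23 };
const today = { yesClicks: 114, forms: 81, formsInEL: 59 };

// Percentage of Google Forms responses that have a matching EL event.
function matchRate(formsInEL, forms) {
  return Math.round((100 * formsInEL) / forms);
}

// Since the start of the experiment.
const overall = matchRate(today.formsInEL, today.forms);
// Only the responses collected since Thursday's snapshot.
const sinceThursday = matchRate(
  today.formsInEL - thursday.formsInEL,
  today.forms - thursday.forms
);

// Estimated share of requests with DNT on (early result from T127571).
const dntShare = 11;
// Percentage points of loss left over if DNT explained exactly its share.
const unexplained = (100 - overall) - dntShare;

console.log(overall, sinceThursday, unexplained); // 73 75 16
```

Under that assumption, DNT would account for about 11 of the 27 percentage points of missing data, leaving about 16 unexplained.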

Thanks, @ellery

@dr0ptp4kt at this point, I do not recommend launching the survey until we figure out what the issue is. Still, 27% of those who responded to the survey don't have a matching "Yes Click" event in EL. Here are two directions to look into the problem:

  • Do you think what @ori mentioned above can give us a lead for where to look for the problem?
  • If there is a slight chance that DNT is causing this issue, the only way to verify it that I can think of is to let EL accept events from DNT users for a short period of time, increase the sampling rate, and see whether the percentage above changes.

@dr0ptp4kt can you help us with figuring out where the problem lies?

Do those responses have QuickSurveyInitiation events with a corresponding surveyInstanceToken value?

We know that DNT results in suppression of EL responses; that's by design. This surely accounts for a significant portion of responses that lack a corresponding EL event. What's in question is what, assuming 11% of the population has DNT, could cause relatively higher rates of missing events.

One thing we could do is not show the CTA if DNT is on and see what happens. @Jdlrobson, @bmansurov, @jhobs, about how hard would it be to make QS not show surveys when DNT is on and then get it SWAT'd?

T127980: Do not show Quick Surveys to Do Not Track User created. @bmansurov, @jhobs, @Jdlrobson:

  • Would you please attend to T127980: Do not show Quick Surveys to Do Not Track User? Let's discuss particulars briefly at standup.
  • Do you happen to have insight into @ori's question above about conditional execution for the non-falsey dependency case? Going to the heart of it, is it possible there was an edge case where mw.eventLog wasn't for some reason present for the developer and the developer was just trying to avoid that edge case? If yes, does that mean we could have QS shown (not just for DNT users, but others) yet no EL events?

> @dr0ptp4kt's analysis sounds right to me. However, I scanned the JavaScript logging code for obvious bugs, and I have a question: why are the mw.eventLog.logEvent() calls enclosed in an if ( mw.eventLog ) { ... } block, when the module explicitly declares a dependency on EventLogging? I can't think of a case when mw.eventLog would be falsey, unless ResourceLoader's dependency management logic is being subverted.

Since the 'mobile.loggingSchemas' module is loaded regardless of the availability of event logging and since the language switcher schema depends on the eventlogging Schema, we need to make sure event logging is available before using it. I'm currently working on T122504 to make things clearer and cleaner.

Edit: Sorry I mixed up my schemas. The above statement is true, but we're talking about QuickSurveys here.

> Going to the heart of it, is it possible there was an edge case where mw.eventLog wasn't for some reason present for the developer and the developer was just trying to avoid that edge case? If yes, does that mean we could have QS shown (not just for DNT users, but others) yet no EL events?

No, it was a precautionary measure not to break the site in case event logging is not available. We know it is in Wikipedias, but not all mediawiki installs will have the event logging extension installed.

But the module has a hard dependency on schema.QuickSurveysResponses, so it is going to break on MediaWiki installs which don't have EventLogging anyway.

@leila, @ellery, do the responses with missing tap events have any QuickSurveyInitiation events with a corresponding surveyInstanceToken value? Or are those missing altogether as well?

I'm going to be out the next few business days - if you can please coordinate with @Jdlrobson, @bmansurov, and @jhobs while I'm out I'll be interested to see where we land upon returning.

> No, it was a precautionary measure not to break the site in case event logging is not available. We know it is in Wikipedias, but not all mediawiki installs will have the event logging extension installed.

> But the module has a hard dependency on schema.QuickSurveysResponses, so it is going to break on MediaWiki installs which don't have EventLogging anyway.

You're right, I made a mistake and talked about the language switcher schema. So the if check is unnecessary in this case.

@ellery I've added two tasks in Description. It would be great if you can work on them today. We can discuss the results late evening tonight or tomorrow (my hope is that we find a convincing answer and we can move to the actual survey on Monday :)

@dr0ptp4kt Most of the google responses with the missing tap events also do not have 'impression' or 'eligible' events in QuickSurveyInitiation. There are only two cases where the click event is missing, but there is both an 'impression' and an 'eligible' event.
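
The breakdown described above amounts to looking up each Google Forms response by its survey instance token in both EL tables. Below is a minimal sketch of that classification. The record shapes (token, eventName) and the sample rows are hypothetical stand-ins for the real columns such as event_surveyInstanceToken and event_eventName, not data from this task.

```javascript
// Classify Google Forms responses by which EL events share their token.
// formTokens: array of tokens taken from the Forms spreadsheet.
// initiationEvents: [{ token, eventName }] rows from QuickSurveyInitiation.
// responseEvents: [{ token }] rows from QuickSurveysResponses.
function classifyResponses(formTokens, initiationEvents, responseEvents) {
  const clicked = new Set(responseEvents.map((e) => e.token));
  const seen = new Map(); // token -> Set of initiation event names
  for (const e of initiationEvents) {
    if (!seen.has(e.token)) {
      seen.set(e.token, new Set());
    }
    seen.get(e.token).add(e.eventName);
  }

  const result = { matched: [], clickMissingOnly: [], missingEntirely: [], partial: [] };
  for (const token of formTokens) {
    const names = seen.get(token) || new Set();
    if (clicked.has(token)) {
      result.matched.push(token);
    } else if (names.has('impression') && names.has('eligible')) {
      // Funnel events exist but the click event does not.
      result.clickMissingOnly.push(token);
    } else if (names.size === 0) {
      // Nothing in EL at all, as expected when DNT suppresses logging.
      result.missingEntirely.push(token);
    } else {
      result.partial.push(token);
    }
  }
  return result;
}

// Made-up sample data, for illustration only.
const sample = classifyResponses(
  ['a', 'b', 'c'],
  [
    { token: 'a', eventName: 'eligible' },
    { token: 'a', eventName: 'impression' },
    { token: 'b', eventName: 'eligible' },
    { token: 'b', eventName: 'impression' },
  ],
  [{ token: 'a' }]
);
console.log(sample.matched, sample.clickMissingOnly, sample.missingEntirely);
// [ 'a' ] [ 'b' ] [ 'c' ]
```

A large missingEntirely bucket next to a small clickMissingOnly bucket is the pattern reported in the comment above, and it points at whole-session suppression rather than a fault in the click logging.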

We checked the data and after disabling the survey for DNT users, we only have 8% data loss. This confirms @dr0ptp4kt's hypothesis. As for our work, we are planning to start the survey on Monday PST, since 8% data loss is within the range we can accept.

We also looked into the most common user agents that appear in webrequest logs and compared those with those collected via our schema. We can share the top 25 that are available in webrequest logs but not EL if someone is interested in looking into them. (I won't share those UAs here unless it's recommended.)
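
One way to run the user agent comparison described above is to take the top N user agents by request count in webrequest and list those that never appear in EL. The following is a sketch under assumptions: the function names are invented, and the inputs are taken to be plain objects mapping a user agent string to a count, which is not necessarily how the original analysis was done.

```javascript
// Return the n keys with the highest counts, highest first.
function topN(counts, n) {
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([ua]) => ua);
}

// User agents among the top n in webrequest that have no EL events at all.
function missingFromEL(webrequestCounts, elCounts, n) {
  return topN(webrequestCounts, n).filter(
    (ua) => !Object.prototype.hasOwnProperty.call(elCounts, ua)
  );
}

// Placeholder user agent labels and counts, for illustration only.
const webrequest = { 'UA-A': 500, 'UA-B': 300, 'UA-C': 200, 'UA-D': 10 };
const eventLogging = { 'UA-A': 40, 'UA-C': 15 };
console.log(missingFromEL(webrequest, eventLogging, 3)); // [ 'UA-B' ]
```

A major user agent showing up in this list would suggest a browser-specific logging gap rather than random sampling noise.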