Test data quality of the test surveys
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	leila
	May 15 2017, 5:43 PM

Description

Do data quality assessment based on the test surveys run in T164960, T164963, T164769, and T164894.

[Leila] Responses
[Florian] Webrequest logs in relation to EL
[Nathaniel] EventLogging

Related Objects
Search...

Status	Assigned	Task
Resolved	leila	T171561 Why we read Wikipedia
Resolved	leila	T131949 Repeat the big English reader survey in one or two more languages
Resolved	None	T165364 Test data quality of the test surveys
Resolved	• schana	T165678 clientIP needs to be collected as part of the schema or ...

Event Timeline

leila created this task.May 15 2017, 5:43 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 15 2017, 5:43 PM

@schana can you pick up the item in the Description which is for you? Florian and I will pick up the other two. Thanks.

leila added a parent task: T131949: Repeat the big English reader survey in one or two more languages .May 15 2017, 5:44 PM

The one thing I've noticed is that clientIp is not being collected for events. See T128407. Last time we blocked that task until the survey was complete. Is this data we need?

Impressions by wiki as of 2017-05-16 12:00 UTC:

nschaaf@stat1002:~$ mysql --host dbstore1002.eqiad.wmnet -e "select count(*),wiki from log.QuickSurveyInitiation_15278946 where timestamp > 20170515130000 and event_eventName='impression' group by wiki"
+----------+--------+
| count(*) | wiki   |
+----------+--------+
|    70423 | dewiki |
|    26325 | hewiki |
|    24574 | jawiki |
|    20341 | rowiki |
+----------+--------+

@schana yup. We need clientIp. How can we proceed to make the change for this schema?

@leila This data isn't specified by the schema that's being logged (https://meta.wikimedia.org/wiki/Schema:QuickSurveyInitiation), but rather the implementation/configuration of EventLogging itself. I sent an email to Analytics to clarify the situation.

got you, @schana. Yup, got the other email and added Bob and Florian to the thread. thanks.

leila mentioned this in T165678: clientIP needs to be collected as part of the schema or ....May 18 2017, 4:07 PM

leila created subtask T165678: clientIP needs to be collected as part of the schema or ....

@leila I've marked the event logging data as tested. Let me know if there was anything else that was needed.

@schana can you share a link to the code that tests the EL data? I can then go over the aspects you have tested to make sure everything that we should look into is tested.

@leila The check over all the data consisted of looking for NULL values where there shouldn't be. The other checks were checking aggregate counts by wiki. Beyond that, I did a few spot checks of joins between the two tables on surveySessionToken and surveyInstanceToken.

mysql:research@analytics-store.eqiad.wmnet [log]> select * from QuickSurveyInitiation_15278946 where timestamp > 20170515000000 and timestamp < 20170515235959 and concat(id, uuid, userAgent, webHost, wiki, event_eventName, event_surveyCodeName, event_surveyInstanceToken, event_surveySessionToken) is null;
Empty set (42.80 sec)

mysql:research@analytics-store.eqiad.wmnet [log]> select * from QuickSurveysResponses_15266417 where timestamp > 20170515000000 and timestamp < 20170515235959 and concat(id, uuid, userAgent, webHost, wiki, event_countryCode, event_editCountBucket, event_isLoggedIn, event_isTablet, event_namespaceId, event_pageId, event_pageTitle, event_platform, event_skin, event_surveyCodeName, event_surveyInstanceToken, event_surveySessionToken, event_surveyResponseValue, event_userLanguage) is null;
Empty set (0.02 sec)

mysql:research@analytics-store.eqiad.wmnet [log]> select count(*),wiki from QuickSurveysResponses_15266417 where timestamp > 20170515000000 and timestamp < 20170516000000 group by wiki;
+----------+--------+
| count(*) | wiki   |
+----------+--------+
|     1676 | dewiki |
|     1117 | hewiki |
|      329 | jawiki |
|      332 | rowiki |
+----------+--------+
4 rows in set (0.32 sec)

mysql:research@analytics-store.eqiad.wmnet [log]> select count(*),wiki from QuickSurveyInitiation_15278946 where timestamp > 20170515000000 and timestamp < 20170516000000 group by wiki;
+----------+--------+
| count(*) | wiki   |
+----------+--------+
|   152064 | dewiki |
|    58793 | hewiki |
|    54258 | jawiki |
|    48556 | rowiki |
+----------+--------+
4 rows in set (15.26 sec)

@schana we should go deeper than this and check pageview counts, impression counts, survey token uniqueness, etc. I suggest you start from the code Ellery used last for doing QA between EL data and survey responses collected. If you check new things, please add them to the code with short documentation so we know later how to repeat it. We may need to breakdown the QA per language to make sure we can spot possible issues.

@leila I'll need access to the survey responses to run the same checks that Ellery used.

Here are results from the event logging data: P5473

I'll make the code available once I have a more stable connection

In T165364#3283422, @schana wrote:

@leila I'll need access to the survey responses to run the same checks that Ellery used.

done. You should have email notifications for all four. If you're missing an access either request it directly or let me know here.

Results: P5478
Code: https://github.com/schana/wikimedia-utils/blob/master/quick_survey_qa.py

@schana thanks. can you write in the output file what the question you asked in every step was and whether the response was healthy or not, and if not, in what sense?

@leila The results have labels for what is being represented. Beyond that, I'm not sure what you mean by whether the response was healthy or not or how to go about determining that.

@schana this is the report of the quality assessment of the data collected. Can you summarize in a few paragraphs what you have tested and what you have observed and whether there are any red flags or if all the outputs you have seen match what you expect to see?

@leila I tested whether there were null values in the collected data and how the two event logging tables and google form responses looked after merging. In event logging, the responses can be pretty reliably mapped to impressions, but there is some loss when tracking the Google form responses. I did not see any red flags with the data; the proportions of the results were similar to what was encountered when the survey was run last. This can be seen in the summaries:

Last Survey

Eligible: 53373
Impressions: 41957
Clicks: 463
Yes: 134 (1 in 300)
Google responses: 81 (60% of Yes)
Google responses with registered click: 59 (74% of Google responses)

DE

Eligible: 81774
Impressions: 70290
Clicks: 1676
Yes: 675 (1 in 100)
Google responses: 501 (74% of Yes)
Google responses with registered click: 397 (79% of Google responses)

HE

Eligible: 32528
Impressions: 26265
Clicks: 1117
Yes: 778 (1 in 33)
Google responses: 640 (82% of Yes)
Google responses with registered click: 490 (77% of Google responses)

JA

Eligible: 29731
Impressions: 24527
Clicks: 329
Yes: 123 (1 in 200)
Google responses: 101 (82% of Yes)
Google responses with registered click: 76 (75% of Google responses)

RO

Eligible: 28241
Impressions: 20315
Clicks: 332
Yes: 164 (1 in 124)
Google responses: 139 (85% of Yes)
Google responses with registered click: 89 (64% of Google responses)

leila closed this task as Resolved.Jun 18 2017, 11:17 AM

leila updated the task description. (Show Details)

leila closed subtask T165678: clientIP needs to be collected as part of the schema or ... as Resolved.Jun 22 2017, 9:39 AM

Test data quality of the test surveysClosed, ResolvedPublicActions