Maniphest T206802

Understanding first day: audit other schema logic
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	MMiller_WMF
	Oct 11 2018, 8:32 PM

Description

Once we have determined in T205762 what other schemas we want use, we need to dig deeper to make sure that the business logic for them will allow us to use them. In particular:

Are they still running and active?
Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?
Are they keyed on IDs that will allow us to join them in?

* Do they have timestamps that we can use for analysis? (Yes, all of them do.)

Are they writing to Hive? We learned that not all schemas are writing to Hive, which has to do with the differences in MariaDB and Hive column names and datatypes. We will need to reconcile or work around this

Here are those questions as a table to fill in like a checklist:

Schema	Active	Hive	Keys	Sampling
Echo	No (see notes)	N/A	`recipientUserId`	100%
EchoInteraction	Yes	Yes	`userId`	100%
Edit	Yes	No (see notes)	`user.id` (see notes)	6.25%, but soon to be configurable to 100%
GettingStartedRedirectImpression	Yes	Yes	`userId`	100% of logged-in users, nothing from anonymous users
GuidedTourButtonClick	Yes	Yes	`userId`	100%
GuidedTourExited	Yes	Yes	`userId`	100%
GuidedTourExternalLinkActivation	Yes	Yes	`userId`	100%
GuidedTourGuiderHidden	Yes	Yes	`userId`	100%
GuidedTourGuiderImpression	Yes	Yes	`userId`	100%
GuidedTourInternalLinkActivation	Yes	Yes	`userId`	100%
PrefUpdate	Yes	Yes	`userId`	100%
ServerSideAccountCreation	Yes	Yes	`userId`	100% in theory, but note "There is no guarantee this will be called in a successful account creation process"

Notes:

Echo schema: the echo_event and echo_notification tables in MW can get us data we need.
Edit schema: not currently in Hive, but we are working on that in T202348. The user.id field will then be renamed.

Related Objects
Search...

Status	Assigned	Task
Resolved	MMiller_WMF	T205754 [EPIC] Growth: Understanding first day
Resolved	MMiller_WMF	T205758 Understanding first day: define questions, metrics, and data
Resolved	nettrom_WMF	T205762 Understanding first day: determine which instrumentation can be re-used
Resolved	nettrom_WMF	T206802 Understanding first day: audit other schema logic
Resolved	nettrom_WMF	T202348 Resume refinement of edit events in Data Lake
Resolved	Catrope	T207803 Update EventLogging code to facilitate move to EditAttemptStep schema
Declined	None	T207526 New account interaction with Echo post registration

Event Timeline

MMiller_WMF created this task.Oct 11 2018, 8:32 PM

Restricted Application added subscribers: Urbanecm, revi. · View Herald TranscriptOct 11 2018, 8:32 PM

MMiller_WMF mentioned this in T205762: Understanding first day: determine which instrumentation can be re-used.Oct 11 2018, 10:14 PM

JTannerWMF moved this task from Inbox to Needs Discussion on the Growth-Team board.Oct 11 2018, 10:48 PM

MMiller_WMF moved this task from Needs Discussion to Upcoming Work on the Growth-Team board.Oct 12 2018, 6:55 PM

Copying over from T205762, the list of schemas we are interested in using are:

Echo
EchoInteraction
Edit
GettingStartedRedirectImpression
GuidedTourButtonClick
GuidedTourExited
GuidedTourExternalLinkActivation
GuidedTourGuiderHidden
GuidedTourGuiderImpression
GuidedTourInternalLinkActivation
PrefUpdate
ServerSideAccountCreation

We'll also be using the page creation EventBus schema (it's a variant of the revision creation schema).

MMiller_WMF added a subtask: T202348: Resume refinement of edit events in Data Lake.Oct 16 2018, 10:02 PM

After discussing with @nettrom_WMF and @kostajh it looks like we can answer the questions in the description of this task for each of the schemas listed above. I think @nettrom_WMF can answer all the questions except for "Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?", which can be answered by @kostajh.

Could you both fill in the table on the task description as a checklist to make sure we've got everything? Please leave notes if appropriate. I know we've already discussed some of this, so the same information can be replicated here.

MMiller_WMF assigned this task to nettrom_WMF.Oct 16 2018, 10:57 PM

MMiller_WMF added a project: Product-Analytics.

MMiller_WMF edited projects, added Growth-Team (Sprint 0 (Growth Team)); removed Growth-Team.

MMiller_WMF moved this task from Incoming to Ready for Development on the Growth-Team (Sprint 0 (Growth Team)) board.

nettrom_WMF moved this task from Triage to Doing on the Product-Analytics board.Oct 18 2018, 8:32 PM

MMiller_WMF updated the task description. (Show Details)Oct 19 2018, 7:08 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 7:27 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 7:55 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 8:05 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 8:09 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 8:13 PM

kostajh updated the task description. (Show Details)Oct 19 2018, 8:16 PM

kostajh updated the task description. (Show Details)

nettrom_WMF added a subtask: T207526: New account interaction with Echo post registration.Oct 19 2018, 11:42 PM

kostajh updated the task description. (Show Details)Oct 22 2018, 3:12 PM

kostajh updated the task description. (Show Details)Oct 22 2018, 3:35 PM

nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 6:37 PM

nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 7:43 PM

nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 8:50 PM

Regarding the note for the ServerSideAccountCreation schema, that "there is no guarantee this will be called in a successful account creation process". Because it was easy, I wrote the HQL query below to check all registered accounts between March and September 2018 on Czech and Korean Wikipedia. In neither of them could I find an account that was not also logged by the schema (zero out of 7,074 in Czech, zero out of 15,378 in Korean). But, we do know that EventLogging isn't designed to be completely reliable (ref the EventLogging vs EventBus documentation).

SELECT sum(if(s.event.userId IS NULL, 1, 0)) AS num_null,
       sum(if(s.event.userId IS NOT NULL, 1, 0)) AS num_not_null
FROM wmf_raw.mediawiki_user AS u
LEFT JOIN event.serversideaccountcreation AS s
ON s.event.userId = u.user_id
   AND s.year = 2018
   AND s.month >= 3
   AND s.month < 9
WHERE u.snapshot = '2018-09'
  AND u.wiki_db = 'cswiki'
  AND s.wiki = 'cswiki'
  AND u.user_registration >= '20180301000000'
  AND u.user_registration < '20180901000000';

nettrom_WMF triaged this task as Medium priority.Oct 22 2018, 9:30 PM

nettrom_WMF moved this task from Ready for Development to Needs PM Review on the Growth-Team (Sprint 0 (Growth Team)) board.

@nettrom_WMF and @kostajh -- thanks so much for doing this thorough audit. Looking at the table, it seems that there is nothing new that we turned up beyond what we already have tasked and are pursuing: getting the Edit schema into Hive, making its sampling rate flexible, and deciding not to worry about the Echo schema for now.

Do you two agree that this task has not surfaced any new information or action items? In that case, we can consider it accomplished.

@nettrom_WMF thanks for checking into that. I consider this task done! :)

@kostajh Of course! And thanks for taking the lead and getting so much of this done before I got around to it! :)

@MMiller_WMF Agree with you here, there doesn't appear to be any new action items.

MMiller_WMF closed this task as Resolved.Oct 23 2018, 9:58 PM

MMiller_WMF moved this task from Needs PM Review to Epics In Progress on the Growth-Team (Sprint 0 (Growth Team)) board.

nshahquinn-wmf closed subtask T202348: Resume refinement of edit events in Data Lake as Resolved.Dec 19 2018, 11:22 PM

Urbanecm edited subscribers, added: Urbanecm_WMF; removed: Urbanecm.Aug 26 2020, 3:25 PM

nettrom_WMF closed subtask T207526: New account interaction with Echo post registration as Declined.Jun 21 2021, 4:53 PM

Aklapper removed a subscriber: Growth-Team.May 16 2023, 10:22 AM

Understanding first day: audit other schema logicClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Understanding first day: audit other schema logic
Closed, ResolvedPublic
Actions

Related Objects
Search...