Page MenuHomePhabricator

Understanding first day: audit other schema logic
Closed, ResolvedPublic

Description

Once we have determined in T205762 what other schemas we want use, we need to dig deeper to make sure that the business logic for them will allow us to use them. In particular:

  • Are they still running and active?
  • Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?
  • Are they keyed on IDs that will allow us to join them in?

* Do they have timestamps that we can use for analysis? (Yes, all of them do.)

  • Are they writing to Hive? We learned that not all schemas are writing to Hive, which has to do with the differences in MariaDB and Hive column names and datatypes. We will need to reconcile or work around this

Here are those questions as a table to fill in like a checklist:

SchemaActiveHiveKeysSampling
EchoNo (see notes)N/ArecipientUserId100%
EchoInteractionYesYesuserId100%
EditYesNo (see notes)user.id (see notes)6.25%, but soon to be configurable to 100%
GettingStartedRedirectImpressionYesYesuserId100% of logged-in users, nothing from anonymous users
GuidedTourButtonClickYesYesuserId100%
GuidedTourExitedYesYesuserId100%
GuidedTourExternalLinkActivationYesYesuserId100%
GuidedTourGuiderHiddenYesYesuserId100%
GuidedTourGuiderImpressionYesYesuserId100%
GuidedTourInternalLinkActivationYesYesuserId100%
PrefUpdateYesYesuserId100%
ServerSideAccountCreationYesYesuserId100% in theory, but note "There is no guarantee this will be called in a successful account creation process"

Notes:

  • Echo schema: the echo_event and echo_notification tables in MW can get us data we need.
  • Edit schema: not currently in Hive, but we are working on that in T202348. The user.id field will then be renamed.

Event Timeline

Restricted Application added subscribers: Urbanecm, revi. · View Herald TranscriptOct 11 2018, 8:32 PM

Copying over from T205762, the list of schemas we are interested in using are:

  • Echo
  • EchoInteraction
  • Edit
  • GettingStartedRedirectImpression
  • GuidedTourButtonClick
  • GuidedTourExited
  • GuidedTourExternalLinkActivation
  • GuidedTourGuiderHidden
  • GuidedTourGuiderImpression
  • GuidedTourInternalLinkActivation
  • PrefUpdate
  • ServerSideAccountCreation

We'll also be using the page creation EventBus schema (it's a variant of the revision creation schema).

MMiller_WMF added a comment.EditedOct 16 2018, 10:56 PM

After discussing with @nettrom_WMF and @kostajh it looks like we can answer the questions in the description of this task for each of the schemas listed above. I think @nettrom_WMF can answer all the questions except for "Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?", which can be answered by @kostajh.

Could you both fill in the table on the task description as a checklist to make sure we've got everything? Please leave notes if appropriate. I know we've already discussed some of this, so the same information can be replicated here.

nettrom_WMF moved this task from Triage to Doing on the Product-Analytics board.Oct 18 2018, 8:32 PM
MMiller_WMF updated the task description. (Show Details)Oct 19 2018, 7:08 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 7:27 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 7:55 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 8:05 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 8:09 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 8:13 PM
kostajh updated the task description. (Show Details)Oct 19 2018, 8:16 PM
kostajh updated the task description. (Show Details)
kostajh updated the task description. (Show Details)Oct 22 2018, 3:12 PM
kostajh updated the task description. (Show Details)Oct 22 2018, 3:35 PM
nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 6:37 PM
nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 7:43 PM
nettrom_WMF updated the task description. (Show Details)Oct 22 2018, 8:50 PM

Regarding the note for the ServerSideAccountCreation schema, that "there is no guarantee this will be called in a successful account creation process". Because it was easy, I wrote the HQL query below to check all registered accounts between March and September 2018 on Czech and Korean Wikipedia. In neither of them could I find an account that was not also logged by the schema (zero out of 7,074 in Czech, zero out of 15,378 in Korean). But, we do know that EventLogging isn't designed to be completely reliable (ref the EventLogging vs EventBus documentation).

SELECT sum(if(s.event.userId IS NULL, 1, 0)) AS num_null,
       sum(if(s.event.userId IS NOT NULL, 1, 0)) AS num_not_null
FROM wmf_raw.mediawiki_user AS u
LEFT JOIN event.serversideaccountcreation AS s
ON s.event.userId = u.user_id
   AND s.year = 2018
   AND s.month >= 3
   AND s.month < 9
WHERE u.snapshot = '2018-09'
  AND u.wiki_db = 'cswiki'
  AND s.wiki = 'cswiki'
  AND u.user_registration >= '20180301000000'
  AND u.user_registration < '20180901000000';
nettrom_WMF triaged this task as Normal priority.Oct 22 2018, 9:30 PM

@nettrom_WMF and @kostajh -- thanks so much for doing this thorough audit. Looking at the table, it seems that there is nothing new that we turned up beyond what we already have tasked and are pursuing: getting the Edit schema into Hive, making its sampling rate flexible, and deciding not to worry about the Echo schema for now.

Do you two agree that this task has not surfaced any new information or action items? In that case, we can consider it accomplished.

@nettrom_WMF thanks for checking into that. I consider this task done! :)

@kostajh Of course! And thanks for taking the lead and getting so much of this done before I got around to it! :)

@MMiller_WMF Agree with you here, there doesn't appear to be any new action items.

MMiller_WMF closed this task as Resolved.Oct 23 2018, 9:58 PM