Page MenuHomePhabricator

Understanding first day: audit other schema logic
Closed, ResolvedPublic

Description

Once we have determined in T205762 what other schemas we want use, we need to dig deeper to make sure that the business logic for them will allow us to use them. In particular:

  • Are they still running and active?
  • Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?
  • Are they keyed on IDs that will allow us to join them in?

* Do they have timestamps that we can use for analysis? (Yes, all of them do.)

  • Are they writing to Hive? We learned that not all schemas are writing to Hive, which has to do with the differences in MariaDB and Hive column names and datatypes. We will need to reconcile or work around this

Here are those questions as a table to fill in like a checklist:

SchemaActiveHiveKeysSampling
EchoNo (see notes)N/ArecipientUserId100%
EchoInteractionYesYesuserId100%
EditYesNo (see notes)user.id (see notes)6.25%, but soon to be configurable to 100%
GettingStartedRedirectImpressionYesYesuserId100% of logged-in users, nothing from anonymous users
GuidedTourButtonClickYesYesuserId100%
GuidedTourExitedYesYesuserId100%
GuidedTourExternalLinkActivationYesYesuserId100%
GuidedTourGuiderHiddenYesYesuserId100%
GuidedTourGuiderImpressionYesYesuserId100%
GuidedTourInternalLinkActivationYesYesuserId100%
PrefUpdateYesYesuserId100%
ServerSideAccountCreationYesYesuserId100% in theory, but note "There is no guarantee this will be called in a successful account creation process"

Notes:

  • Echo schema: the echo_event and echo_notification tables in MW can get us data we need.
  • Edit schema: not currently in Hive, but we are working on that in T202348. The user.id field will then be renamed.

Event Timeline

Copying over from T205762, the list of schemas we are interested in using are:

  • Echo
  • EchoInteraction
  • Edit
  • GettingStartedRedirectImpression
  • GuidedTourButtonClick
  • GuidedTourExited
  • GuidedTourExternalLinkActivation
  • GuidedTourGuiderHidden
  • GuidedTourGuiderImpression
  • GuidedTourInternalLinkActivation
  • PrefUpdate
  • ServerSideAccountCreation

We'll also be using the page creation EventBus schema (it's a variant of the revision creation schema).

After discussing with @nettrom_WMF and @kostajh it looks like we can answer the questions in the description of this task for each of the schemas listed above. I think @nettrom_WMF can answer all the questions except for "Do they have sampling logic that does not record 100% of events for Czech and Korean Wikipedias?", which can be answered by @kostajh.

Could you both fill in the table on the task description as a checklist to make sure we've got everything? Please leave notes if appropriate. I know we've already discussed some of this, so the same information can be replicated here.

kostajh updated the task description. (Show Details)

Regarding the note for the ServerSideAccountCreation schema, that "there is no guarantee this will be called in a successful account creation process". Because it was easy, I wrote the HQL query below to check all registered accounts between March and September 2018 on Czech and Korean Wikipedia. In neither of them could I find an account that was not also logged by the schema (zero out of 7,074 in Czech, zero out of 15,378 in Korean). But, we do know that EventLogging isn't designed to be completely reliable (ref the EventLogging vs EventBus documentation).

SELECT sum(if(s.event.userId IS NULL, 1, 0)) AS num_null,
       sum(if(s.event.userId IS NOT NULL, 1, 0)) AS num_not_null
FROM wmf_raw.mediawiki_user AS u
LEFT JOIN event.serversideaccountcreation AS s
ON s.event.userId = u.user_id
   AND s.year = 2018
   AND s.month >= 3
   AND s.month < 9
WHERE u.snapshot = '2018-09'
  AND u.wiki_db = 'cswiki'
  AND s.wiki = 'cswiki'
  AND u.user_registration >= '20180301000000'
  AND u.user_registration < '20180901000000';

@nettrom_WMF and @kostajh -- thanks so much for doing this thorough audit. Looking at the table, it seems that there is nothing new that we turned up beyond what we already have tasked and are pursuing: getting the Edit schema into Hive, making its sampling rate flexible, and deciding not to worry about the Echo schema for now.

Do you two agree that this task has not surfaced any new information or action items? In that case, we can consider it accomplished.

@nettrom_WMF thanks for checking into that. I consider this task done! :)

@kostajh Of course! And thanks for taking the lead and getting so much of this done before I got around to it! :)

@MMiller_WMF Agree with you here, there doesn't appear to be any new action items.