Page MenuHomePhabricator

Update EventLogging code to facilitate move to EditAttemptStep schema
Closed, ResolvedPublic

Description

In order to get edit data flowing into the Data Lake, we decided in T202348 to create a new schema called Schema:EditAttemptStep. This schema is an evolution of the Edit schema. The property names have been updated to make them Data Lake-compatible. Two properties were also added in order to facilitate joining with Schema:ReadingDepth.

The properties that have their names changed between the old and the new schema are as follows:

Old nameNew name
isOversampleis_oversample
action.init.typeinit_type
action.init.mechanisminit_mechanism
action.init.timinginit_timing
action.ready.timingready_timing
action.loaded.timingloaded_timing
action.saveIntent.timingsave_intent_timing
action.saveAttempt.timingsave_attempt_timing
action.saveSuccess.timingsave_success_timing
action.saveFailure.typesave_failure_type
action.saveFailure.messagesave_failure_message
action.saveFailure.timingsave_failure_timing
action.abort.typeabort_type
action.abort.mechanismabort_mechanism
action.abort.timingabort_timing
editoreditor_interface
mediawiki.versionmw_version
page.idpage_id
page.titlepage_title
page.nspage_ns
page.revidrevision_id
editingSessionIdediting_session_id
user.iduser_id
user.editCountuser_editcount
user.classuser_class

The following two properties have been added in the new schema:

  • page_token
  • session_token

The code points that previously logged using the Edit schema will need to be updated to reflect the new schema, hence this task.

Event Timeline

MMiller_WMF created this task.

We assigned this to @nettrom_WMF until we know for sure what the column names are from T202348.

nettrom_WMF renamed this task from Change EventLogging column names in editor code to Update EventLogging code to facilitate move to Edit2 schema.Oct 25 2018, 8:43 PM
nettrom_WMF reassigned this task from nettrom_WMF to Catrope.
nettrom_WMF updated the task description. (Show Details)

Change 470066 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/WikimediaEvents@master] Update Edit schema for rename to Edit2

https://gerrit.wikimedia.org/r/470066

Change 470067 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/VisualEditor@master] Update EventLogging code for Edit->Edit2 schema migration

https://gerrit.wikimedia.org/r/470067

Change 470068 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/WikiEditor@master] Update EventLogging code for Edit->Edit2 schema migration

https://gerrit.wikimedia.org/r/470068

Change 470069 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/MobileFrontend@master] Update EventLogging code for Edit->Edit2 schema migration

https://gerrit.wikimedia.org/r/470069

@Nuria commented on gerrit:

So we are clear, the moment this is deployed events for the "older" Edit schema will stop "flowing" in the system. I think that is what we want but just double checking.

My understanding is that, yes, that is what we want but I am posting here so subscribers in Phab can see and comment on this if there's any issue.

Update: per T202348#4697469 , the new name is now EditAttemptStep, not Edit2.

nettrom_WMF renamed this task from Update EventLogging code to facilitate move to Edit2 schema to Update EventLogging code to facilitate move to EditAttempStep schema.Oct 29 2018, 6:39 PM
nettrom_WMF renamed this task from Update EventLogging code to facilitate move to EditAttempStep schema to Update EventLogging code to facilitate move to EditAttemptStep schema.
nettrom_WMF updated the task description. (Show Details)

Per T202348#4697469, the page_token and session_token properties have been updated to be optional.

Per a conversation with @kostajh, interface is a reserved keyword in JavaScript, so we renamed that field to editor_interface so as to make the code cleaner.

Change 470066 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Update Edit schema for rename to EditAttemptStep

https://gerrit.wikimedia.org/r/470066

Change 470068 merged by jenkins-bot:
[mediawiki/extensions/WikiEditor@master] Update EventLogging code for Edit->EditAttemptStep schema migration

https://gerrit.wikimedia.org/r/470068

Change 470067 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Update EventLogging code for Edit->EditAttemptStep schema migration

https://gerrit.wikimedia.org/r/470067

Change 470069 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Update EventLogging code for Edit->EditAttemptStep schema migration

https://gerrit.wikimedia.org/r/470069

@Catrope: All four patches are merged, can this task be resolved?
(Or if this still needs QA by the Growth-Team, then maybe T207803#4704163 should be reverted?)

@Catrope - looking at EditAttemptStep_18530416 in betalabs, columns ended in _timing have to many Null values.

  • too many NULL values for init_count:
[log]> select  event_init_type, count(*) from EditAttemptStep_18530416 group by event_init_type;
+-----------------+----------+
| event_init_type | count(*) |
+-----------------+----------+
| NULL            |      166 |
| page            |       48 |
| section         |       15 |
+-----------------+----------+
3 rows in set (0.00 sec)
  • all types of events can have NULL
[log]> select distinct(event_action) from EditAttemptStep_18530416 where event_init_type is NULL;
+--------------+
| event_action |
+--------------+
| loaded       |
| ready        |
| saveAttempt  |
| saveFailure  |
| saveIntent   |
| saveSuccess  |
| abort        |
+--------------+
7 rows in set (0.00 sec)
  • event_init_timing has only NULL value
[log]> select distinct(event_init_timing) from EditAttemptStep_18530416;
+-------------------+
| event_init_timing |
+-------------------+
|              NULL |
+-------------------+
1 row in set (0.00 sec)

Any concerns?

The schema is a bit confusing. init_type, init_mechanism, etc are only set when action='init', ready_timing is only set when action='ready', etc.

So if I break down your queries with that in mind, I get what I expect:

MariaDB [log]> select event_init_type, count(*) from EditAttemptStep_18530416 where event_action='init' group by event_init_type;
+-----------------+----------+
| event_init_type | count(*) |
+-----------------+----------+
| page            |       49 |
| section         |       15 |
+-----------------+----------+
2 rows in set (0.00 sec)

MariaDB [log]> select event_init_type, count(*) from EditAttemptStep_18530416 where event_action!='init' group by event_init_type;
+-----------------+----------+
| event_init_type | count(*) |
+-----------------+----------+
| NULL            |      166 |
+-----------------+----------+
1 row in set (0.00 sec)

However, you're right that event_init_timing is always null, and it wasn't like that in the old schema:

MariaDB [log]> select count(*) from EditAttemptStep_18530416 where event_init_timing is not null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [log]> select count(*) from Edit_17541122 where `event_action.init.timing` is not null;
+----------+
| count(*) |
+----------+
|      105 |
+----------+
1 row in set (0.00 sec)

All the other *_timing fields have more than 0 rows where they're not null, so it looks like something's wrong here.

per @Catrope - need to checked if there's anything else left to clean up

Here's an unsolicited thought! But! This thought should not change your plans or block you in anyway.

It's a bit awkward to conditionally use parts of the schema based on the action type. You are treating this single schema as if it was a bunch of different event types merged together. This might be just a consequence of how EventLogging works; a schema corresponds 1:1 with an event stream. (Modern Event Platform will let you reuse a schema for multiple event streams.)

Instead of having all these different timing fields, of which only one will be used depending on the event type, you could have a single timing field and emit multiple events. E.g.

An init event:

{ 
   action: init
   timing: ...
}

Or a save success event:

{
  action: save_success
  timing: ...
}

Then when you get around to querying this data for analysis, you won't have to change your SQL so much for every timing query. You could even just group by event_type and then average timings, something like:

select avg(timing) from EditAttemptStep group by action

Or even

select sum(timing) from EditAttempStep

However, you're right that event_init_timing is always null, and it wasn't like that in the old schema.

It looks like we stopped logging init_timing due to some other issue which happened in June 2018.

select
left(timestamp, 6) as month,
sum(`event_action.init.timing` is null) / count(*) as null_init_timings
from log.Edit_17541122
where
event_action = "init"
group by left(timestamp, 6);

month	null_init_timings
201712	0.8566
201801	0.8470
201802	0.8463
201803	0.8466
201804	0.8486
201805	0.8492
201806	0.9667
201807	1.0000
201808	1.0000
201809	1.0000
201810	1.0000
201811	1.0000

So it's not related to this change. There's no urgent need to fix it either, since I don't think there's any current use for the data and there are a lot of issues with timing values in general.

@Catrope there's no need to fix the init_timing now, so I don't see anything more than needs to be done here. Am I missing anything?

Instead of having all these different timing fields, of which only one will be used depending on the event type, you could have a single timing field and emit multiple events.

Good idea! I've added this to our laundry list of 'sometime' fixes for this schema (T118063)

Thanks for filing a separate task for the init_timing issue, @Neil_P._Quinn_WMF . With that done, I think we can close this task now.

@Ottomata I think you're right, and schemas that I've made this year do look like that, but this is one of the first large EventLogging schemas ever created, so it does some things that we now know not to do