
Resume refinement of edit events in Data Lake
Closed, Resolved · Public

Description

The Edit log in the Data Lake seems to contain no data at all:

select 
    year,
    month,
    count(*) as events
from edit
where year >= 0
group by year, month
order by year, month asc
limit 1000

Done. 0 results.

The only files in HDFS are a bunch named _REFINE_FAILED, each containing a single timestamp, from a few days in December 2017.

neilpquinn-wmf@stat1005:~$ hdfs dfs -ls -h -R /wmf/data/event/Edit
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 21:01 /wmf/data/event/Edit/year=2017/month=12
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=20
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=20/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=21
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=21/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=22
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=22/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=23
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=16/hour=23/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:59 /wmf/data/event/Edit/year=2017/month=12/day=17
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=17/hour=0
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 20:57 /wmf/data/event/Edit/year=2017/month=12/day=17/hour=0/_REFINE_FAILED
[....]
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=6
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=6/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=7
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=7/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=8
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=8/_REFINE_FAILED
drwxr-xr-x   - hdfs hadoop          0 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=9
-rw-r--r--   3 hdfs hadoop         26 2017-12-20 21:02 /wmf/data/event/Edit/year=2017/month=12/day=20/hour=9/_REFINE_FAILED

However, the data does exist in MariaDB:

select
    left(timestamp, 8) as date,
    count(*) as events
from `Edit_17541122`
where timestamp >= "201808"
group by left(timestamp, 8)

date	events
20180801	169194
20180802	167854
20180803	157676
20180804	130754
20180805	139192
20180806	173113
20180807	174028
20180808	172547
20180809	171772
20180810	157373
20180811	132995
20180812	142766
20180813	173578
20180814	172509
20180815	122250
20180816	109673
20180817	100810
20180818	87112
20180819	95706
20180820	117001

Any idea what's going on here?

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". Aug 20 2018, 11:28 PM
nshahquinn-wmf updated the task description.
nshahquinn-wmf moved this task from Triage to Tracking on the Product-Analytics board.

Yes, the 'Edit' schema is blacklisted from Hive refinement, because the schema is so badly incompatible.

Yes, the 'Edit' schema is blacklisted from Hive refinement, because the schema is so badly incompatible.

Interesting! Where can I see the blacklist, and where are the requirements for schemas documented? We need to do a lot of fixing of the Edit log anyway (T197499), and I should check whether any other schemas owned by my teams are blacklisted.

https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine.pp#L25

We'd have to re-refine to get some error messages to remember what exactly was wrong, but one obvious problem is the dots in the field names, e.g. action.init.mechanism.

@Neil_P._Quinn_WMF so we moved this to radar because we can't ingest the events as they are right now. So the editing team needs to change the schema to comply with the guidelines. Do ask us if you need help crafting the schema.

nshahquinn-wmf renamed this task from "No recent data in the Edit event log in the Data Lake" to "Resume refinement of edit events in Data Lake". Aug 23 2018, 10:54 PM

@Neil_P._Quinn_WMF so we moved this to radar because we can't ingest the events as they are right now. So the editing team needs to change the schema to comply with the guidelines. Do ask us if you need help crafting the schema.

That makes sense, but I'm not exactly sure what the guidelines are! According to this snippet I just found on Wikitech, the only problem is the dots in the field names. Is that the case? I'd hate to ask the Editing team to deploy a new version of the schema only to find out there are still problems and they have to do it again.

We have to run events through refine anyway to see what's wrong, since nobody remembers, but let's do this:

  • take the dots out
  • take another close look at the schema and make sure this is how you want to instrument. To me, events generated by this schema are pretty different from all other schemas because they describe movement through a funnel and are naturally grouped by the session ID. So ingestion into things like Druid would be pretty meaningless except to compare the same *step* across multiple sessions. I always thought it would make more sense to split it up into multiple schemas that each measure the same kind of event, and to give each of them a session ID. It's up to you, but maybe it'd be best to sit down and talk about it. I know @Halfak had thoughts as well.
  • once we think we have a decent schema, have devs generate a few events and we'll run those through refine to see if there's anything else wrong. This is just so we can tell you about any remaining problems as soon as possible; when we review the schema we'll also do this exercise in the abstract. So hopefully this step will be a quick check and then they can deploy.

I tried to run a refine for Edit. There might be further complications, but the error I'm currently getting is:

org.apache.spark.SparkException: Cannot recognize hive type string: struct<action:string,action.abort.mechanism:string,action.abort.timing:bigint,action.abort.type:string,action.init.mechanism:string,action.init.type:string,action.loaded.timing:bigint,action.ready.timing:bigint,action.saveAttempt.timing:bigint,action.saveFailure.message:string,action.saveFailure.timing:bigint,action.saveFailure.type:string,action.saveIntent.timing:bigint,action.saveSuccess.timing:bigint,editingSessionId:string,editor:string,integration:string,mediawiki.version:string,page.id:bigint,page.ns:bigint,page.revid:bigint,page.title:string,platform:string,user.class:string,user.editCount:bigint,user.id:bigint,version:bigint>
...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 27)

So, at the moment we are stuck on the dots in the field names. It is possible that fixing that would allow refine to succeed.
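To make the parse failure concrete, here is a minimal sketch (hypothetical tmp.edit_demo table, assuming current Hive/Spark behavior): HiveQL treats a dot as struct-field access, so a flat field name containing dots has no valid representation inside a struct type, while the same data expressed as genuinely nested structs parses fine.

-- Does NOT parse: inside a struct type, Hive expects name:type pairs,
-- so the dot in action.init.mechanism trips the parser. This is the
-- "mismatched input '.' expecting ':'" error above.
-- create table tmp.edit_demo (
--     event struct<action.init.mechanism: string>
-- );

-- Parses fine: the same information as nested structs.
create table tmp.edit_demo (
    event struct<
        action: string,
        init: struct<mechanism: string, timing: bigint>
    >
);

-- Nested fields are then addressed with real struct access:
select event.init.mechanism from tmp.edit_demo limit 10;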

FYI: Here's the proposal I wrote a while ago: https://meta.wikimedia.org/wiki/Schema_talk:Edit#A_proposal_I_wrote_a_while_ago

I still think it is a good proposal and would address some of the issues that @Milimetric raised, but I don't have much time and energy to drive it forward at the moment.

MMiller_WMF moved this task from Inbox to Upcoming Work on the Growth-Team board.

The Growth team is going to pursue this, since we need the data in Hadoop for our "Understanding first day" project. More updates to come.

@nettrom_WMF is going to lay out the next steps for this, and tag engineers as appropriate.

@nettrom_WMF just nudged me to think about whether there are other field name changes we want to make at the same time. Here's what comes to mind:

  • Do we need the action prefix for so many field names (e.g. action.init.type, action.saveAttempt.timing, etc.)? Both Hive and MariaDB further prefix all the event fields with event, so you end up with really cumbersome names like event.action.saveAttempt.timing (see the sketch after this list).
  • editor should be renamed to interface since the word "editor" is already super overloaded here
  • platform currently has the enum value desktop, which would really be better stated as computer
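To make the cost of those prefixes concrete, here is a hedged sketch of what querying the current names looks like in MariaDB, assuming the usual EventLogging flattening of capsule fields into event_-prefixed columns (the column names are illustrative):

-- Every dotted column needs backquoting on every single reference:
select
    `event_action.init.type` as init_type,
    count(*) as events
from `Edit_17541122`
where timestamp >= "20180801"
group by `event_action.init.type`;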

I've been collecting other longer-term ideas for the schema in T118063—perhaps we should work on some of them now since we're investing heavily in this schema?

Based on conversations with @Neil_P._Quinn_WMF and @Catrope, and going through the schema's current naming conventions, I'd like to propose that the fields listed in the table below get renamed.

Current name                New name
action.init.type            initType
action.init.mechanism       initMechanism
action.init.timing          initTiming
action.ready.timing         readyTiming
action.loaded.timing        loadedTiming
action.saveIntent.timing    saveIntentTiming
action.saveAttempt.timing   saveAttemptTiming
action.saveSuccess.timing   saveSuccessTiming
action.saveFailure.type     saveFailureType
action.saveFailure.message  saveFailureMessage
action.abort.type           abortType
action.abort.mechanism      abortMechanism
action.abort.timing         abortTiming
editor                      interface
mediawiki.version           mwVersion
page.id                     pageId
page.title                  pageTitle
page.ns                     pageNs
page.revid                  revisionId
user.id                     userId
user.editCount              editCount
user.class                  userClass

Using pageId, pageTitle, pageNs, revisionId, userId, and editCount follows the naming conventions found in the EventLogging Data fields documentation. Neil proposed to change "editor" to "interface" above, so I incorporated that. All the rest are my own suggestions for what might be decent names.

I am unsure whether changing the values of the platform enum is a simple change or not, but if it is, I'd be happy to incorporate it.

Happy to hear thoughts on this. I'm also tagging @Jdlrobson for his input, Neil said that would be a good idea.

Since SQL is case insensitive, we recommend using snake_case rather than camelCase. This isn't as relevant for EventLogging schemas, but it will be a requirement for schemas that are part of the WIP Modern Event Platform. Check out https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines.
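As a hedged illustration of why case matters (hypothetical tmp.case_demo table): Hive stores identifiers lowercased, so camelCase names come back mangled, while snake_case names round-trip unchanged.

create table tmp.case_demo (saveAttemptTiming bigint, save_attempt_timing bigint);
describe tmp.case_demo;
-- col_name             data_type
-- saveattempttiming    bigint
-- save_attempt_timing  bigint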

@nettrom_WMF to support joining data with https://meta.wikimedia.org/wiki/Schema:ReadingDepth it might be worth adding a sessionToken and pageToken. That will be useful information (e.g. how do editors read articles before editing)

@Ottomata Switching to snake_case makes sense. It results in a couple more fields being renamed, but that seems reasonable given the large number of renames anyway.

@Jdlrobson Being able to make that connection definitely sounds useful. Is that something we can do this time around, @Catrope? I'll hold off on adding the fields until I know more about the workload.

New table with snake_case naming:

Current name                New name
isOversample                is_oversample
action.init.type            init_type
action.init.mechanism       init_mechanism
action.init.timing          init_timing
action.ready.timing         ready_timing
action.loaded.timing        loaded_timing
action.saveIntent.timing    save_intent_timing
action.saveAttempt.timing   save_attempt_timing
action.saveSuccess.timing   save_success_timing
action.saveFailure.type     save_failure_type
action.saveFailure.message  save_failure_message
action.abort.type           abort_type
action.abort.mechanism      abort_mechanism
action.abort.timing         abort_timing
editor                      interface
mediawiki.version           mw_version
page.id                     page_id
page.title                  page_title
page.ns                     page_ns
page.revid                  revision_id
editingSessionId            editing_session_id
user.id                     user_id
user.editCount              user_editcount
user.class                  user_class

user_editcount doesn't follow the snake_case convention, but instead mirrors the name of that field in the MediaWiki database's user table.

Edit: caught a typo in save_intent_timing.

user_editcount doesn't follow the snake_case convention, but instead mirrors the name of that field in the MediaWiki database's user table.

That's fine! It's mostly about not using capital letters; things can get weird (Hive will lowercase everything).

This is going to be a new schema, right? Not just a new Edit schema revision?

One note: sending page_title can run into issues of message length being too long (especially for languages that are not Western ones). We recommend sending just page_id from the client, as the title can be inferred from the page_id using the MediaWiki API.

We also recommend having a whole new schema rather than reusing the old one and completely revamping the field names.

@Ottomata and @Nuria : This is proposed as a revision of the current schema. Creating a new schema would be neat, but to me that is more associated with T118063, something which appears to be a much larger process. The current goal is to get Edit schema data into the Data Lake with as little disruption and effort as possible.

@Nuria : good point about page_title. Removing it would be fine for my use cases, but I'm unsure how it affects @Neil_P._Quinn_WMF, so I'll let him chime in. Also unsure about how much work it'll be to make the change. @Catrope, do you have concerns about that?

I think you should really consider making this a new schema, as this is a backwards-incompatible schema change. We'll have to do a little bit of manual work to import this into Hive (mostly just deleting any old data and tables). In this case we can do it because we don't have any real data in Hive. If we did, the new data would be rejected and we wouldn't be able to import it into Hive.

Imagine how this will look in MySQL, though. You'll have a table Edit_101 and your new revision table Edit_102 (I just made up these version numbers). If anyone wants to query this data, they'll have to know to use Edit_101 with one set of column names before this change, and Edit_102 with different column names after it.
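A hedged sketch of that pain, using those made-up revision numbers: any query spanning the change has to union the two tables and map the old column names onto the new ones by hand.

-- Old revision: dotted camelCase columns, backquoted.
select timestamp, `event_action.init.type` as init_type
from `Edit_101`
union all
-- New revision: snake_case columns.
select timestamp, event_init_type as init_type
from `Edit_102`;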

@nettrom_WMF FYI, there is no additional work needed on your end to create a new schema, just the thinking of a different name. So, instead of 'Edit' (https://meta.wikimedia.org/wiki/Schema:Edit) you might have https://meta.wikimedia.org/wiki/Schema:EditActivity with your new column names. The edits to Meta to update column names, the versioning, and the changes that developers will need to make to send data according to these new column names are the same whether we create a new schema or not.

Given that we are renaming all columns in the schema and any prior record (as @Ottomata mentioned) would be invalid, our strong recommendation for system happiness (when it comes to backwards compatibility) is to create a new schema.

Also, a question (cc @Neil_P._Quinn_WMF): are we OK with having this data only in Hive? (I assume so, but please let us know otherwise.)

Thanks for the feedback, @Nuria and @Ottomata! Discussed this with the Growth Team, and we all agree that creating a new schema is the better solution in this case.

My inclination is to create a new schema named "Edit2", for three reasons:

  1. The new schema can be considered a version of the Edit schema, as we're mainly renaming the fields, not really creating a new schema.
  2. There's precedent from other schemas (e.g. when running experiments) for appending a version number.
  3. I'd like to not create any new name that might conflict with what could come out of T118063.

The connection between the two schemas can then be documented on each schema's talk page. Feel free to let me know if this naming idea sounds like a terrible thing to do! :)

I'm concerned about removing the page title, as page titles change when pages are moved. However, if the data is in the Data Lake, I suspect we can use the mediawiki_page_history table to resolve the title and/or namespace of the page at the time of editing, if that turns out to be important to know. Not sure how complicated the query would be, but it sounds feasible.
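A hedged sketch of such a lookup (hypothetical event.edit2 Hive table, and assuming the wmf.mediawiki_page_history layout of snapshot, wiki_db, page_id, page_title_historical, and start/end timestamps); in a real query the timestamp formats would also need normalizing:

select
    e.event.page_id,
    p.page_title_historical as title_at_edit_time
from event.edit2 e
join wmf.mediawiki_page_history p
    on p.page_id = e.event.page_id
    and p.wiki_db = e.wiki
where p.snapshot = '2018-12'
    -- pick the history row whose validity interval covers the event time
    and e.dt >= p.start_timestamp
    and (p.end_timestamp is null or e.dt < p.end_timestamp)
limit 10;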

Adding the two tokens that @Jdlrobson suggested is doable. @Catrope noted that those would only be available in JS, so some edit events won't have them.

I'd also like to see @Neil_P._Quinn_WMF chime in with his views.

Great, thanks! Fine with Edit2, as long as there are good docs explaining it, as you say. Thank you!

I've created Schema:Edit2, and its talk page has the standard template as well as documentation of its relationship to the old schema and some information about the properties. Once it's finalized, I'll also update the talk page of Schema:Edit to point to its successor.

In my earlier overview tables I missed the action.saveFailure.timing property; it has been renamed in the new schema to save_failure_timing.

I added the two token properties with their documentation from Schema:ReadingDepth, and added a note to their descriptions that they are only set for JS-created events. Not sure if an empty string is a good choice for what they should be when not set, but it's a draft, so it can be updated based on what our engineers end up implementing.

The description of the page title property mentions that it is useful for understanding redlink editing, so I decided to keep that property in for now.

Thanks for the feedback, @Nuria and @Ottomata! Discussed this with the Growth Team, and we all agree that creating a new schema is the better solution in this case.

I agree; this seems reasonable. Don't forget that the current schema name is on the purging whitelist, so the new one will have to be added too.

And, that reminds me, the whitelist is by field, so we would have needed to update it anyway with the new names.

My inclination is to create a new schema named "Edit2", for three reasons:

  1. The new schema can be considered a version of the Edit schema, as we're mainly renaming the fields, not really creating a new schema.
  2. There's precedent from other schemas (e.g. when running experiments) for appending a version number.
  3. I'd like to not create any new name that might conflict with what could come out of T118063.

I'm a bit torn about this. On one hand, we do need a broader discussion about the purpose and scope of this data stream, which could make a new name picked today inaccurate or lead to another backwards-incompatible schema change requiring another new name. On the other hand, that discussion might not happen for a long time (it's already been three years since I realized we needed to have it!), and the single most likely outcome is to largely retain the status quo anyway.

On balance, considering that the current name doesn't reflect the actual unit of observation and is easy to confuse with the mediawiki_revision_create schema, I think we should just go ahead and change the schema name to EditAttemptStep.

Sorry to chime in with this after you've created the new page, Morten!

I'm concerned about removing the page title, as page titles change when pages are moved. However, if the data is in the Data Lake, I suspect we can use the mediawiki_page_history table to resolve the title and/or namespace of the page at the time of editing, if that turns out to be important to know. Not sure how complicated the query would be, but it sounds feasible.

Yes, that seems feasible to me too, although I currently have some doubts about the reliability of that data (T205594).

I added the two token properties with their documentation from Schema:ReadingDepth, and added a note to their descriptions that they are only set for JS-created events.

I don't object to the addition, but if we start collecting this additional data in the stream, I think that means we need to run the change by Legal. Worth considering since you have a tight timeline.

Not sure if an empty string is a good choice for what they should be when not set, but it's a draft, so it can be updated based on what our engineers end up implementing.

Is there a reason we can't just leave them unset instead? As far as I know, we do this all the time (e.g. with abortMechanism on any event that isn't an abort) and it just results in a null value in the table.
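For what it's worth, a hedged sanity check of that behavior once events start flowing (hypothetical table and snake_case field names): unset optional fields simply surface as NULL in Hive.

select
    event.action,
    sum(if(event.abort_mechanism is null, 1, 0)) as events_without_abort_mechanism,
    count(*) as events
from event.editattemptstep
where year = 2018 and month = 12
group by event.action;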

I don't object to the addition, but if we start collecting this additional data in the stream, I think that means we need to run the change by Legal. Worth considering since you have a tight timeline.

We can collect this data for 90 days without issues; we will drop it after 90 days, as we do not retain long-term records of browsing sessions and pages. So neither of these two pieces will be on the whitelist that specifies data retained for longer than 90 days.

Looks like this is all taken care of!

select
    to_date(dt) as date,
    count(*) as events
from event.editattemptstep
where year = 2018 and month=12
group by to_date(dt)
order by date asc
limit 1000;

  	date		events
1	2018-12-01	475569
2	2018-12-02	518608
3	2018-12-03	552739
4	2018-12-04	540619
5	2018-12-05	536171
6	2018-12-06	513497
7	2018-12-07	496912
8	2018-12-08	502015
9	2018-12-09	555893
10	2018-12-10	545976
11	2018-12-11	532330
12	2018-12-12	522027
13	2018-12-13	517614
14	2018-12-14	481217
15	2018-12-15	452051
16	2018-12-16	485081
17	2018-12-17	366834
18	2018-12-18	405502