
Add an edit attempt identifier to the Wikistories contributor data stream
Closed, Resolved, Public

Description

The wikistories_contribution_event stream captures a variety of events from the Wikistories editing funnel: story_builder_open, add_frame, publish_success, and so on.

However, the stream doesn't have a unique identifier for each edit attempt. Without this, we cannot properly group events from a single journey through the funnel.
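To illustrate what such an identifier would enable, here is a minimal sketch of grouping events into journeys. The field names `event_type` and `attempt_id` are placeholders chosen for this example and are not taken from the actual schema.

```typescript
// Hypothetical event shape: only the two fields this sketch needs.
// `attempt_id` is a placeholder name, not a field from the real schema.
interface FunnelEvent {
  event_type: string;
  attempt_id: string | null;
}

// Collect the ordered list of event types seen for each edit attempt.
function groupByAttempt(events: FunnelEvent[]): Map<string, string[]> {
  const journeys = new Map<string, string[]>();
  for (const e of events) {
    if (e.attempt_id === null) {
      continue; // events without an identifier cannot be assigned to a journey
    }
    const steps = journeys.get(e.attempt_id) ?? [];
    steps.push(e.event_type);
    journeys.set(e.attempt_id, steps);
  }
  return journeys;
}
```

With a shared identifier, a funnel question such as "how many attempts that opened the story builder reached publish_success?" becomes a lookup over each journey's steps.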

Let's add such an ID (call it contribution_attempt_id?). I would suggest using mw.user.generateRandomSessionId; there are existing IDs we could reuse (like the pageview token), but since we have no interest in joining this stream with another one, we should just be explicit about that and create a new one.
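A rough sketch of how the instrumentation could look, assuming the ID is generated once per attempt and attached to every event sent afterwards. The helper `createAttemptLogger`, the `event_type` field name, and the injected `generateId` and `submit` functions are illustrative assumptions, not the extension's actual code; in the browser, `generateId` would presumably be `mw.user.generateRandomSessionId`.

```typescript
type IdGenerator = () => string;
type Submit = (streamName: string, eventData: Record<string, unknown>) => void;

// Hypothetical helper: creates one ID per edit attempt and stamps it on
// every event logged through the returned function.
function createAttemptLogger(generateId: IdGenerator, submit: Submit) {
  const contributionAttemptId = generateId(); // generated once, reused below
  return (eventType: string): void => {
    submit("wikistories_contribution_event", {
      event_type: eventType, // assumed field name
      contribution_attempt_id: contributionAttemptId,
    });
  };
}

// Demonstration with stand-ins for the MediaWiki functions.
const sent: Array<Record<string, unknown>> = [];
const logEvent = createAttemptLogger(
  () => "0123456789abcdef0123", // stand-in for mw.user.generateRandomSessionId
  (_streamName, eventData) => {
    sent.push(eventData); // stand-in for the real event submission call
  }
);
logEvent("story_builder_open");
logEvent("add_frame");
```

Creating a new logger for each trip through the story builder would give each attempt its own ID, independent of the pageview token.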

Event Timeline

@nshahquinn-wmf what is the importance/urgency of this task?

Would you be able to handle the schema part while I make the code change?

> @nshahquinn-wmf what is the importance/urgency of this task?

Low. This will be very useful when we try to use this stream to understand use of the story builder, but I think it will be a fairly long time until we get to that stage. I proposed it in kanban since it seems like we might as well get it done now, but if you want to defer it, I'm fine with that.

> Would you be able to handle the schema part while I make the code change?

Yep, no problem. I'll go ahead and do that, and you can merge the patch and change the code whenever you like.

SBisson edited projects, added Wikistories (R2); removed Wikistories.
SBisson updated Other Assignee, added: SBisson.
SBisson moved this task from Backlog to Dev on the Inuka-Team (Kanban) board.

> [...]
> Yep, no problem. I'll go ahead and do that, and you can merge the patch and change the code whenever you like.

Sounds like a plan

Change 836266 had a related patch set uploaded (by Neil P. Quinn-WMF; author: Neil P. Quinn-WMF):

[schemas/event/secondary@master] Add Wikistories contribution_attempt_id

https://gerrit.wikimedia.org/r/836266

nshahquinn-wmf updated Other Assignee, added: nshahquinn-wmf; removed: SBisson.

Okay, the patch is ready.

Note that, in addition to adding the contribution_attempt_id field, we will need to update the $schema field to '/analytics/mediawiki/wikistories_contribution_event/1.1.0'.

@SBisson actually, please go ahead and merge the patch as soon as you have a chance (you can still do the instrumentation change whenever). I've been working on improving the description fields in the schema and it will simplify things if I can base it on this patch.

Change 836266 merged by jenkins-bot:

[schemas/event/secondary@master] Add Wikistories contribution_attempt_id

https://gerrit.wikimedia.org/r/836266

Change 841953 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/Wikistories@master] Add contribution_attempt_id to contribution events

https://gerrit.wikimedia.org/r/841953

Change 841953 merged by jenkins-bot:

[mediawiki/extensions/Wikistories@master] Add contribution_attempt_id to contribution events

https://gerrit.wikimedia.org/r/841953

This has been added. I will move this task to sign-off.

image.png (642×1 px, 155 KB)

Sorry for the delay!

I just checked the values of this field that we've received so far. The results are very strange! Since this data goes back 90 days, all the nulls are expected, but the many values of codfw and eqiad are definitely wrong. They should have failed validation against the schema since they don't match the specified pattern (^[0-9a-z]{20}$), but there haven't been any validation errors for this schema.

| value | frequency |
| --- | --- |
| null | 1340 |
| codfw | 1215 |
| eqiad | 50 |
| 7d4a499adcd850ec5769 | 3 |
| c8d489b1613f97a9124c | 3 |
| ce00b7c2a11114e5ea50 | 2 |
| 92de0c331e9d2d29fae3 | 2 |
| cdfce631ceeaab2e262c | 2 |
| 03f21bdc208d425b7ffd | 2 |
| 88f3a3f6b6b005c520e8 | 2 |
| ... | ... |
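As a quick illustration of why codfw and eqiad should not pass validation, the following sketch checks values against the pattern quoted above. The function name and the three-way classification are assumptions made for this example; the real validation is done against the schema, not by this code.

```typescript
// Pattern quoted from the schema for contribution_attempt_id.
const ATTEMPT_ID_PATTERN = /^[0-9a-z]{20}$/;

// Hypothetical helper: separates missing values from valid and invalid ones.
function classifyAttemptId(value: string | null): "missing" | "valid" | "invalid" {
  if (value === null) {
    return "missing"; // e.g. events logged before the field existed
  }
  return ATTEMPT_ID_PATTERN.test(value) ? "valid" : "invalid";
}

const examples: Array<string | null> = [null, "codfw", "eqiad", "7d4a499adcd850ec5769"];
const classified = examples.map((v) => classifyAttemptId(v));
```

Here `classified` comes out as missing, invalid, invalid, valid: the two data center names are only five characters long, so they cannot match a pattern that requires exactly 20.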

However, if I only look at the data going back to 2022-10-17 (the start of the train where our instrumentation code was deployed), those weird values disappear and everything looks correct.

| value | frequency |
| --- | --- |
| null | 27 |
| 7d4a499adcd850ec5769 | 3 |
| c8d489b1613f97a9124c | 3 |
| 3bd60ed9a56b8fb2fda0 | 2 |
| cdfce631ceeaab2e262c | 2 |
| 65de010762602c8ea53c | 2 |
| 56e1fecbdd705f7ba006 | 2 |
| a6c128c4b0315732fbf5 | 2 |
| 92de0c331e9d2d29fae3 | 2 |
| 3aa0bd519016f1a5ed58 | 2 |
| ... | ... |

So my guess is that it's some problem with the data ingestion or storage. I'll report that to Data Engineering in case it's a sign of a deeper problem, but from our perspective I think this is done.

nshahquinn-wmf moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

I need to add this new field to the sanitization allowlist.

Change 852308 had a related patch set uploaded (by Neil P. Quinn-WMF; author: Neil P. Quinn-WMF):

[analytics/refinery@master] Retain hashed Wikistories contribution_attempt_id

https://gerrit.wikimedia.org/r/852308

Waiting on review by Data Engineering.

Change 852308 merged by Mforns:

[analytics/refinery@master] Retain hashed Wikistories contribution_attempt_id

https://gerrit.wikimedia.org/r/852308