
Add distinct event to talk_page_edit for when a new section is added
Closed, Resolved, Public

Description

In an attempt to rectify the bucketing imbalances identified in T291308, we will use this ticket to add a distinct event to talk_page_edit when a new section is added.

Event Timeline

Change 762497 had a related patch set uploaded (by DLynch; author: DLynch):

[schemas/event/secondary@master] talk_page_edit: add a new component_type for topics

https://gerrit.wikimedia.org/r/762497

Change 762499 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/DiscussionTools@master] Log talk_page_edit events for adding a new topic

https://gerrit.wikimedia.org/r/762499

Change 762497 merged by jenkins-bot:

[schemas/event/secondary@master] talk_page_edit: add a new component_type for topics

https://gerrit.wikimedia.org/r/762497

Change 762499 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Log talk_page_edit events for adding a new topic

https://gerrit.wikimedia.org/r/762499

DLynch added a project: Skipped QA.

This can't be tested by QA, as it's entirely server-side logging. @MNeisler can watch for talk_page_edit events with component_type=topic to verify it's working.

MNeisler edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

This should be out on all wikis now, so events will be coming in.

MNeisler moved this task from Doing to Needs Review on the Product-Analytics (Kanban) board.

I've confirmed we are now logging component_type = 'topic' events in the talk_page_edit schema as of 17 February 2022. All data associated with these events also appears to be logging as expected.

Before resolving this task, I want to make sure I understand what each event type means and the expected ids that should be logged with them. Here are the four different event types I'm seeing in talk_page_edit (with mappings to how their ids are currently logged). @DLynch Can you confirm if what's documented below is correct/expected? Let me know if it would help to see some examples from the data.

New topic (and first comment)
component_type = ‘topic’
topic_id = unique identifier for the topic
comment_parent_id = topic_id
comment_id = unique identifier of the comment just posted
Definition: New topic posted (also implies the first comment in discussion)

Top-level comment
component_type = ‘comment’
topic_id = unique identifier for the topic the user is commenting to
comment_parent_id = topic_id
comment_id = unique identifier for the comment
Definition: Top-level comment posted but not the first comment in a topic.

Non-top level comment
component_type = ‘comment’
topic_id = unique identifier for the topic the user is commenting to
comment_parent_id = Unique identifier of the comment that the user is responding to. (Does not equal topic_id for these events)
comment_id = unique identifier for the comment
Definition: @DLynch Are these events expected? If so, how are they different from a response event?

Response
component_type = ‘response’
topic_id = unique identifier for the topic the user is responding to
comment_parent_id = Unique identifier of the comment that the user is responding to. (Does not equal topic_id for these events)
comment_id = Unique identifier for the comment
Definition: A response to a comment

@MNeisler You shouldn't be seeing anything that's not a top-level comment with component_type=comment -- comments should always have topic_id = comment_parent_id. Could you give me some examples of those events so I can track them down? (It outright shouldn't be possible to get a comment event when the parent isn't a heading...)

Other than that, everything described is what I would expect.

MNeisler moved this task from Doing to Needs Review on the Product-Analytics (Kanban) board.

Could you give me some examples of those events so I can track them down?

@DLynch

Sure, I provided some examples, logged yesterday, of events where component_type = comment does not have topic_id = comment_parent_id in this Google Doc.

There were only 28 events of this type, so they are not that common, and they seem to occur only for integration = page edits.

Query used to retrieve the above examples:

SELECT
    *
FROM
    event.mediawiki_talk_page_edit
WHERE
    year = 2022
    AND month = 03
    AND day = 01
    AND component_type = 'comment'
    AND comment_parent_id != topic_id

DLynch updated Other Assignee, added: DLynch.

Sampling through those, I think I see what's going on, which is a case I wasn't considering. What you've labeled "Non-top level comment" occurs when a user leaves a comment in response to a less-than-section-level heading -- the topic_id remains the section-level heading, but it's labeled as a comment because it's a direct response to a heading. This is uncommon in our results because subheadings within a discussion on a talk page are themselves pretty uncommon.

e.g. https://en.wikipedia.org/w/index.php?title=Wikipedia:Reliable_sources/Noticeboard&diff=prev&oldid=1074647735&diffmode=source or https://en.wikipedia.org/w/index.php?title=Talk:Sex&diff=prev&oldid=1074586086&diffmode=source

Now I know what's causing it, I think this is probably reasonable as a way for the data to be represented, unless you have any concerns?
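For anyone analysing this data later, the distinction discussed in this thread can be sketched as a small classifier. This is only an illustration, not the DiscussionTools logging code: the field names (component_type, topic_id, comment_parent_id) come from the thread, while the function name and the returned labels are hypothetical, chosen here for readability.

```python
from typing import Mapping


def classify_event(event: Mapping[str, str]) -> str:
    """Label a talk_page_edit event using the rules described in this thread.

    The returned labels are hypothetical names, not values logged in the schema.
    """
    component_type = event["component_type"]
    # A parent equal to the topic means the comment is a direct reply
    # to the section-level heading.
    parent_is_topic = event["comment_parent_id"] == event["topic_id"]

    if component_type == "topic":
        return "new_topic"
    if component_type == "response":
        return "response"
    if component_type == "comment":
        # A comment whose parent is not the topic is a direct reply to a
        # subheading inside the section, per the explanation above.
        return "top_level_comment" if parent_is_topic else "subheading_comment"
    return "unknown"
```

With this split, the 28 events found by the query would land in the subheading bucket, so they can be kept apart from top-level comments without treating them as errors.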

MNeisler updated Other Assignee, added: ppelberg; removed: DLynch.

@DLynch. That makes sense. I don't have any concerns since I can distinguish these events from "top-level" comments in the analysis. I primarily wanted to understand their source and make sure it wasn't a bug. Thanks for looking into it.

I've documented this info in the talk_page_edit spec for future reference.

@ppelberg - Reassigning to you for sign-off.