
Implement a way to relate the components of a conversation
Open, Needs Triage, Public

Description

This task covers the work of relating replies to comments, and comments to topics, in such a way that this data can be used, at scale, to evaluate the impact topic subscriptions (T273920) are having on the rates and speeds with which people receive responses to the topics they start and the comments they post on talk pages.

Requirements

Behavior

  • Talk page edits are categorized and related such that we can answer the questions below. Note: these questions are borrowed from T280895.
    • How much time, on average, elapses between when someone posts on a talk page (e.g. starts a new topic or comments in an existing one) and when another person responds to them?
    • What percentage of comments and headings receive a response from another person within __hours/days of being posted?
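
As a rough illustration of the two questions above, the sketch below computes both figures from a list of related posts. The `Post` record, its field names, and the sample data are all hypothetical; they are not an existing schema, only an assumption about the minimum each record would need (an id, a parent id, an author, and a timestamp).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Post:
    """One talk page post (hypothetical record shape, not an existing schema)."""
    post_id: str
    parent_id: Optional[str]  # None for the post that starts a topic
    author: str
    posted_at: datetime


def first_response_delays(posts: List[Post]) -> Dict[str, Optional[timedelta]]:
    """Map each post id to the delay until the first reply by a different author."""
    by_parent: Dict[str, List[Post]] = {}
    for post in posts:
        if post.parent_id is not None:
            by_parent.setdefault(post.parent_id, []).append(post)
    delays: Dict[str, Optional[timedelta]] = {}
    for post in posts:
        candidates = [
            reply.posted_at - post.posted_at
            for reply in by_parent.get(post.post_id, [])
            # Replies by the same author do not count as a response.
            if reply.author != post.author and reply.posted_at >= post.posted_at
        ]
        delays[post.post_id] = min(candidates) if candidates else None
    return delays


def mean_response_hours(delays: Dict[str, Optional[timedelta]]) -> Optional[float]:
    """Average delay in hours, over posts that received a response."""
    answered = [d.total_seconds() / 3600 for d in delays.values() if d is not None]
    return mean(answered) if answered else None


def share_answered_within(delays: Dict[str, Optional[timedelta]], window: timedelta) -> float:
    """Fraction of all posts that received a response within the window."""
    if not delays:
        return 0.0
    hits = sum(1 for d in delays.values() if d is not None and d <= window)
    return hits / len(delays)


# Made-up sample data for illustration only.
t0 = datetime(2021, 7, 1, 12, 0)
posts = [
    Post("t1", None, "Alice", t0),
    Post("c1", "t1", "Bob", t0 + timedelta(hours=2)),
    Post("c2", "c1", "Alice", t0 + timedelta(hours=6)),
    Post("c3", "t1", "Alice", t0 + timedelta(hours=1)),  # self-reply, ignored
]
delays = first_response_delays(posts)
```

Note that the average only covers posts that received a response, while the percentage uses all posts as its denominator; whether that is the right choice is a definition question for Product Analytics, not something this sketch settles.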

Meta

  • The logic we are implementing as part of this task should act on/be applied to all talk page edits, regardless of the editing interface someone used to publish said edits.
    • Said another way: whether someone posted a comment using the Reply Tool or full-page wikitext editing should *not* affect how said edit is categorized. We met a similar requirement as part of implementing Topic Subscriptions (T263820).

Open questions

  • What other requirements, if any, will need to be met for @MNeisler/Product Analytics to aggregate/query the data we are already tracking?
    • See T280100#7174055 for more information about the data we are already tracking.

To answer the questions identified above, the instrumentation will need to include:

  • A way to distinguish the components of the conversation we decide to track (topic, comment, response)
  • Unique topic identifier to relate comments to topics
  • Unique comment identifier to relate responses to comments
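
One way to picture the three fields listed above is as a single event record. This is a minimal sketch only: the class names, field names, and validation rules are assumptions made for illustration and do not describe an agreed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Component(Enum):
    """The conversation components named in the list above."""
    TOPIC = "topic"
    COMMENT = "comment"
    RESPONSE = "response"


@dataclass(frozen=True)
class ConversationEvent:
    """Hypothetical record: one logged talk page edit."""
    component: Component
    topic_id: str                            # relates comments to topics
    comment_id: Optional[str] = None         # identifies a comment or response
    parent_comment_id: Optional[str] = None  # relates a response to a comment

    def __post_init__(self) -> None:
        # Anything other than the topic itself needs its own identifier.
        if self.component is not Component.TOPIC and self.comment_id is None:
            raise ValueError("comments and responses need a comment_id")
        # A response is only useful if it names the comment it responds to.
        if self.component is Component.RESPONSE and self.parent_comment_id is None:
            raise ValueError("responses need a parent_comment_id")


topic = ConversationEvent(Component.TOPIC, topic_id="h-example")
comment = ConversationEvent(Component.COMMENT, topic_id="h-example", comment_id="c-1")
response = ConversationEvent(
    Component.RESPONSE, topic_id="h-example", comment_id="c-2", parent_comment_id="c-1"
)
```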

Done

  • The instrumentation needed to fulfill the requirements above is implemented
  • Any additional tickets are filed (e.g. a ticket for QA)

Event Timeline

The definition of comment is incorrect, I believe. All initial responses to a topic are indented once. Only the actual topic-starting text is not indented.

To illustrate this, here's some annotated source I copied from Help_talk:Talk_pages.

image.png (368×688 px, 58 KB)
image.png (368×688 px, 71 KB)

Technically our data model treats the text of the topic as just the first comment. You can see this with dtdebug=1:

image.png (372×2 px, 131 KB)

The definition of comment is incorrect, I believe. All initial responses to a topic are indented once. Only the actual topic-starting text is not indented.

@DLynch before traveling further down the path of revising the definitions I proposed in the task description, can you share how you think about these two questions?

  1. What about how the software is currently written would constrain our ability to answer these questions?
    • A) "For all comments and new topics, how much time, on average, elapses between when someone posts on a talk page and another person responds?"
    • B) "For all comments and topics posted after a certain date, what percentage of said "comments" and "topics" receive a response from another person within __hours/days?"
  2. What – if any – changes could we make to help us more accurately/completely answer "A)" and "B)" above?
  1. What about how the software is currently written would constrain our ability to answer these questions?

For both of these, we're not currently storing data in a way that would be helpful. Technically all this is present in the echo_event table, but relating the initial topic-event to the reply-event would be challenging (there's a comment-id and parent-id stored, but it's in the JSON-blob part of the echo_event row). It could also be extracted by analyzing the wikitext on the page at any given moment.
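
To show why relating events through the JSON blob is awkward, here is a minimal sketch of the join it would take. The row shape and the key names (`event_id`, `event_extra`, `comment-id`, `parent-id`) are assumptions for illustration; the actual layout of the blob should be checked before relying on any of them.

```python
import json
from typing import Dict, List, Tuple


def relate_replies(rows: List[dict]) -> List[Tuple[int, int]]:
    """Return (parent_event_id, reply_event_id) pairs found via the JSON blob."""
    by_comment_id: Dict[str, int] = {}
    decoded = []
    # First pass: every row has to be decoded before anything can be joined.
    for row in rows:
        extra = json.loads(row["event_extra"])
        decoded.append((row["event_id"], extra))
        comment_id = extra.get("comment-id")
        if comment_id is not None:
            by_comment_id[comment_id] = row["event_id"]
    # Second pass: look up each row's parent among the decoded comment ids.
    pairs = []
    for event_id, extra in decoded:
        parent_event = by_comment_id.get(extra.get("parent-id"))
        if parent_event is not None:
            pairs.append((parent_event, event_id))
    return pairs


# Made-up rows for illustration only.
rows = [
    {"event_id": 1, "event_extra": json.dumps({"comment-id": "c-alice-1", "parent-id": None})},
    {"event_id": 2, "event_extra": json.dumps({"comment-id": "c-bob-1", "parent-id": "c-alice-1"})},
    {"event_id": 3, "event_extra": json.dumps({"comment-id": "c-carol-1", "parent-id": "c-missing"})},
]
pairs = relate_replies(rows)
```

Because the identifiers are only reachable after decoding each blob, the join cannot use an index, which is the practical obstacle at scale.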

  2. What – if any – changes could we make to help us more accurately/completely answer "A)" and "B)" above?

We should probably implement some logging to a schema. We can do logging at the same time as the echo_event rows are created, just sending the data in a more convenient fashion for us. We could directly store a time-since-parent-comment value, if you don't mind us trusting what the wikitext says about the parent's timestamp. (If not, @MNeisler could presumably look up the logged row for the parent -- assuming that we're doing 100% logging and not sampling.)
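
The two options described here (storing a time-since-parent value directly, or looking up the parent's logged row afterwards) could look roughly like the sketch below. The function names, the event fields, and the in-memory lookup are hypothetical; no schema has been designed yet.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_event(
    comment_id: str,
    parent_id: Optional[str],
    posted_at: datetime,
    parent_posted_at: Optional[datetime] = None,
) -> dict:
    """Build a log payload; store time-since-parent when the parent timestamp is trusted."""
    seconds = None
    if parent_posted_at is not None:
        seconds = int((posted_at - parent_posted_at).total_seconds())
    return {
        "comment_id": comment_id,
        "parent_id": parent_id,
        "posted_at": posted_at.isoformat(),
        "seconds_since_parent": seconds,
    }


def fill_from_log(event: dict, logged_by_comment_id: Dict[str, dict]) -> dict:
    """Fallback: derive time-since-parent from the parent's logged row.

    This only works if every comment is logged (100% logging, no sampling).
    """
    if event["seconds_since_parent"] is not None:
        return event
    parent = logged_by_comment_id.get(event["parent_id"])
    if parent is None:
        return event  # parent was never logged, so the value stays unknown
    delta = datetime.fromisoformat(event["posted_at"]) - datetime.fromisoformat(parent["posted_at"])
    return {**event, "seconds_since_parent": int(delta.total_seconds())}


# Made-up timestamps for illustration only.
noon = datetime(2021, 7, 1, 12, 0, tzinfo=timezone.utc)
later = datetime(2021, 7, 1, 13, 30, tzinfo=timezone.utc)

parent = build_event("c1", None, noon)
direct = build_event("c2", "c1", later, parent_posted_at=noon)
looked_up = fill_from_log(build_event("c2", "c1", later), {"c1": parent})
orphan = fill_from_log(build_event("c3", "c-missing", later), {"c1": parent})
```

The first path depends on the timestamp parsed from the parent's wikitext signature; the second depends on complete logging. Both give the same value when the two sources agree.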

ppelberg updated the task description.

On the definitions:

  • What you called "direct response", I would call "reply"
  • What you called "comment", I would call "top-level comment"

That's how I described the structure in the recent doc page: https://www.mediawiki.org/wiki/Extension:DiscussionTools/How_it_works#Data_structures

Ok, I want to separate out what I see as the two open threads in this ticket:

  1. Definitions: converging on semantic definitions for each of the unique elements/components of a talk page conversation that make up its structure
  2. Implementation: deciding on the approach for storing said "talk page structure"

Note: to avoid confusion and simplify this task, I've revised the task description to describe the behavior being asked for more generically; I've also removed the DEFINITIONS section.


1. Definitions

On the definitions:

  • What you called "direct response", I would call "reply"
  • What you called "comment", I would call "top-level comment"

That's how I described the structure in the recent doc page: https://www.mediawiki.org/wiki/Extension:DiscussionTools/How_it_works#Data_structures

This is a big help, @matmarex.

Can you comment on what, if anything, is inaccurate and/or incomplete about the two statements below?

  • NO changes need to be made to the current data structure to satisfy the task description's Requirements section.
  • In order to expose the number of comments within a conversation, we will likely need to extend the current data structure to include a third "item", topics. That said, the work to implement this change does NOT need to happen as part of this task; it can happen as part of T269950.

2. Implementation

  1. What about how the software is currently written would constrain our ability to answer these questions?

For both of these, we're not currently storing data in a way that would be helpful. Technically all this is present in the echo_event table, but relating the initial topic-event to the reply-event would be challenging (there's a comment-id and parent-id stored, but it's in the JSON-blob part of the echo_event row). It could also be extracted by analyzing the wikitext on the page at any given moment.

Understood.

  2. What – if any – changes could we make to help us more accurately/completely answer "A)" and "B)" above?

We should probably implement some logging to a schema. We can do logging at the same time as the echo_event rows are created, just sending the data in a more convenient fashion for us. We could directly store a time-since-parent-comment value, if you don't mind us trusting what the wikitext says about the parent's timestamp. (If not, @MNeisler could presumably look up the logged row for the parent -- assuming that we're doing 100% logging and not sampling.)

@DLynch + @MNeisler: regarding actually "...implement[ing] some logging to a schema...", does the below seem like the right order of operations to y'all?

  1. Converge on shared definitions for the components of the conversation we are wanting to track.
  2. Agree on/acknowledge any "lossiness" we may need to accept between how we will have semantically defined the components of the conversation and what we will be able to codify (read: represent in the software)
  3. Design the schema that will store the talk page structure we'll have specified in the preceding steps
  4. Implement the schema
  5. QA the schema/event logging