
[SPIKE] Document talk page comment data structure
Closed, Resolved (Public)

Description

As part of T264885, we started storing information about comments posted to talk pages.

This task is about documenting the following:

  1. The information/metadata the software is now logging/storing about every comment posted to a talk page, regardless of the interface used to post said comments.
  2. Where and how this "information"/"metadata" is being logged/stored

...so that we can determine what additional work, if any, is needed before we can use this newly logged data to answer questions like those listed in the "Use cases" section of T262107's task description.

Open questions

  1. What information/metadata is the software now logging/storing about every comment posted to a talk page, regardless of the interface used to post said comments?
  2. Where/how is this information/metadata being logged/stored?

Done

  • All "Open questions" are answered

Event Timeline

This is in relation to our expectation that we will soon generate an echo_event row for every comment posted.

I wrote a documentation page that answers these questions (and others): https://www.mediawiki.org/wiki/Extension:DiscussionTools/How_it_works

Relevant fragment about data in Echo events:

Each event has the following properties:

  • (built-in in Echo) Page title
  • (built-in in Echo) Agent (user who caused the event, by leaving the comment)
  • (built-in in Echo) Section title
  • (built-in in Echo) Page revision
  • Subscription item name. (…)
  • New comment's ID and name. (…)
  • New comment's content, a snippet of which is shown in the notifications
  • List of users who were mentioned in the comment

This data is stored in one of Echo's database tables; however, only the title and agent can be queried directly. Everything else is in a serialized blob.
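To illustrate why the blob limits querying, here is a minimal Python sketch. It is not Echo's real schema: the `EchoEventRow` class, its field names, and the use of JSON as the serialization format are all assumptions made for illustration. The subscription item name and the comment ID/name are left out because their format is elided above.

```python
import json
from dataclasses import dataclass


@dataclass
class EchoEventRow:
    """Hypothetical stand-in for one stored event; not Echo's actual table layout."""
    page_title: str  # directly queryable, per the description above
    agent: str       # directly queryable, per the description above
    extra_blob: str  # everything else, serialized (JSON is an assumed stand-in)


def make_row(page_title, agent, section_title, revision, mentioned_users, snippet):
    # Pack all non-queryable properties into a single serialized value.
    extra = {
        "section_title": section_title,
        "revision": revision,
        "mentioned_users": mentioned_users,
        "content_snippet": snippet,
    }
    return EchoEventRow(page_title, agent, json.dumps(extra))


rows = [
    make_row("Talk:Example", "Alice", "First topic", 101, ["Bob"], "Hello Bob"),
    make_row("Talk:Example", "Bob", "First topic", 102, [], "Hi"),
    make_row("Talk:Other", "Alice", "Second topic", 203, ["Bob", "Carol"], "Ping"),
]

# Filtering on a directly stored field needs no decoding.
by_alice = [r for r in rows if r.agent == "Alice"]


def mentions(row, user):
    # Filtering on a blob field means deserializing the row first.
    return user in json.loads(row.extra_blob)["mentioned_users"]


mentioning_bob = [r for r in rows if mentions(r, "Bob")]
```

In a real database the second filter cannot use an index or a plain `WHERE` clause, since every blob has to be fetched and decoded in application code.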

We generate an event for every new talk page comment, regardless of whether anyone is subscribed to the thread it's in.

Relevant fragment about data in our own data structures:

DiscussionTools recognizes two kinds of items: headings and comments. (…)

Each item has the following properties:

  • ID and name, which are used to identify the item in different contexts
  • Range, referencing the HTML DOM nodes where it was detected. The range may begin or end in the middle of an element, and may span multiple elements in different parent nodes.
  • Indentation level (always 0 for headings, 1 for top-level comments, 2+ for replies)
  • References to parent item and reply items

Comments additionally have:

  • Signature ranges, as above, referencing the HTML DOM nodes of signatures
  • Author name
  • Date and time

Headings additionally have: (…)

This data structure is ephemeral and not stored anywhere. When it's needed, it is constructed from scratch from the page HTML. (The information is encoded back into the HTML in the formatter though, as described below.)
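The item structure described above can be sketched roughly as follows. This is a Python illustration only, not the DiscussionTools implementation: the class names, field names, and ID values are hypothetical. DOM ranges and signature ranges are omitted because they only make sense against a real HTML document, and heading-specific properties are omitted because they are elided above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(eq=False)
class ThreadItem:
    id: str
    name: str
    level: int  # 0 for headings, 1 for top-level comments, 2+ for replies
    parent: Optional["ThreadItem"] = field(default=None, repr=False)
    replies: List["ThreadItem"] = field(default_factory=list)

    def add_reply(self, item):
        # Link the child to its parent and derive its indentation level.
        item.parent = self
        item.level = self.level + 1
        self.replies.append(item)
        return item


@dataclass(eq=False)
class HeadingItem(ThreadItem):
    # Heading-specific properties are elided in the description, so none are modelled.
    pass


@dataclass(eq=False)
class CommentItem(ThreadItem):
    author: str = ""
    timestamp: Optional[datetime] = None


def iter_comments(item):
    """Yield every comment below `item`, depth first."""
    for reply in item.replies:
        if isinstance(reply, CommentItem):
            yield reply
        yield from iter_comments(reply)


# Placeholder IDs and names; the real formats are not given in this task.
heading = HeadingItem(id="heading-1", name="heading-1", level=0)
first = heading.add_reply(CommentItem(
    id="comment-1", name="comment-1", level=1, author="Alice",
    timestamp=datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)))
reply = first.add_reply(CommentItem(
    id="comment-2", name="comment-2", level=2, author="Bob",
    timestamp=datetime(2021, 6, 1, 12, 5, tzinfo=timezone.utc)))

authors = [c.author for c in iter_comments(heading)]
```

Because the tree is rebuilt from the page HTML each time, nothing here persists between requests, which is the limitation discussed next.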

Given the above (Echo data not being queryable, and our own data not being stored at all), it probably can't be used to answer questions like in T262107, or at least not any interesting ones. But we have the data, and we could work on storing it in a different form to allow that.
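As a rough idea of what "a different form" could look like, here is a sketch using Python's built-in `sqlite3` module with one row per comment. The table name, columns, and sample values are hypothetical; no schema has been decided in this task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical flat table: one row per comment, one column per property.
conn.execute("""
    CREATE TABLE comment (
        comment_id TEXT PRIMARY KEY,
        page_title TEXT NOT NULL,
        author     TEXT NOT NULL,
        posted_at  TEXT NOT NULL,
        level      INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO comment VALUES (?, ?, ?, ?, ?)",
    [
        ("comment-1", "Talk:Example", "Alice", "2021-06-01T12:00:00Z", 1),
        ("comment-2", "Talk:Example", "Bob", "2021-06-01T12:05:00Z", 2),
        ("comment-3", "Talk:Other", "Alice", "2021-06-02T09:00:00Z", 1),
    ],
)

# Aggregations become plain SQL once each property has its own column.
per_page = conn.execute(
    "SELECT page_title, COUNT(*) FROM comment "
    "GROUP BY page_title ORDER BY page_title"
).fetchall()

reply_count = conn.execute(
    "SELECT COUNT(*) FROM comment WHERE level >= 2"
).fetchone()[0]
```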

> I wrote a documentation page that answers these questions (and others): https://www.mediawiki.org/wiki/Extension:DiscussionTools/How_it_works

Excellent, Bartosz.

> Given the above (Echo data not being queryable, and our own data not being stored at all), it probably can't be used to answer questions like in T262107, or at least not any interesting ones. But we have the data, and we could work on storing it in a different form to allow that.

Understood.

For the time being, we'll consider T284200 the ticket where we'll spec the work required to store this data in a way that allows us to query/aggregate it.