
Provide an inline discussion feature, "DiscussThis"
Open, LowestPublic

Description

It would be nice to be able to attach a discussion to a particular place or range in a revision. Aspects or implementation details of this feature idea have been discussed before (T147896, T116350), but there is apparently no previous overarching product description.

@Jdforrester-WMF suggests the product name DiscussThis.

UI concept:

  • Discussions are always visible on the talk/flow page associated with the article being commented on. They should be displayed similarly to existing manual solutions, with quoted text followed by comments.
  • Discussions are optionally visible in VE (including the new wikitext mode). There would be a button to mark the discussion "resolved" so as to dismiss it from the VE view, and a way to reverse such dismissal.
  • Discussions can be created from the diff view.
    • A discussion creation link on each wikitext line would provide a degraded non-JS fallback for comment creation.
    • A discussion created in the diff view would remain publicly visible in the diff view.
    • The revision author would be notified.
    • A discussion on the diff view of the current revision would be visible in VE by default.
  • A discussion could also be created from VE. Such discussions would not need to be shown in the diff view in the minimum viable product.

Implementation details:

  • A discussion is canonically associated with a fixed rev_id and a byte offset into the wikitext.
  • The source position can be mapped to a Parsoid DOM node using Parsoid's DSR feature.
  • The source position would be carried forward from one revision to the next by considering a wikitext diff.
  • There would be a table tracking inline discussions, with an autoincremented discussion ID.
  • In the case of discussions hosted on non-Flow talk pages, a parser function would specify the discussion ID. On save, a hook would update a tracking table, enabling the diff and VE views to find the discussion even after archiving of the talk page.
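The "carried forward by considering a wikitext diff" step could be sketched roughly as follows, using Python's difflib as a stand-in for the real diff engine (wikidiff2); function and variable names here are purely illustrative:

```python
from typing import Optional
import difflib

def map_offset(old_text: str, new_text: str, offset: int) -> Optional[int]:
    """Map an offset in old_text to the corresponding offset in new_text,
    or return None if the surrounding text was edited away."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i1 <= offset < i2:
            if tag == 'equal':
                return j1 + (offset - i1)
            return None  # the anchor sat inside a replaced/deleted region
    return None

old = "Cats are mammals. Dogs are mammals too."
new = "Cats are small mammals. Dogs are mammals too."
new_pos = map_offset(old, new, old.index("Dogs"))
assert new_pos is not None and new[new_pos:].startswith("Dogs")
```

A production version would reuse the diff already computed for revision storage and review rather than re-diffing per discussion.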

Credit to @cscott and @ssastry who came up with most of the ideas here.

See also:

Event Timeline

Jdforrester-WMF renamed this task from Inline discussion feature "discuss this" to Provide an inline discussion feature, "DiscussThis".Jan 23 2018, 11:21 PM

Pubpub is experimenting with inline discussion (in addition to sidebar discussions). Discussion threads are all linked either to {the page as a whole} or to a paragraph. Editors can choose to embed a [summary view] thread inline.

Example: https://cursor.pubpub.org/pub/cursor-cursor-2017

This comment was removed by Sj.

See also T149667: Build an article annotation service at 2017 dev summit about a theorised generalised system that would be the underlying part of this, and T103081: Explore ideas for collaborative contribution which could be the starting point for design work on this.

It is rather expensive to track a marked section between revisions if they have several intermingled changes. An alternative to edit distances is to convolve a special locality-sensitive hash over the text. Usually you want a window that is shorter than the text you are looking for, perhaps half the length, and you would realign on word boundaries.

Moving text by copy-pasting it to another location would not cause any problems for an LSH detector. Likewise, text cut out in one revision and pasted back in a later revision would not pose any problems.

For this kind of text detection to fail, the text must be substantially rewritten, and if it fails then the text probably isn't the same text being discussed anyhow.
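The windowing idea above could be sketched crudely like this; plain hashes of word-aligned windows stand in for a real locality-sensitive digest such as Nilsimsa, and all names and thresholds are illustrative assumptions:

```python
import hashlib

def window_hashes(text, window_words):
    """Hashes of word-aligned windows, each window_words words long."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + window_words]).encode()).hexdigest()
        for i in range(max(1, len(words) - window_words + 1))
    }

def fragment_present(fragment, revised_text, threshold=0.5):
    # Window about half the fragment length, realigned on word boundaries,
    # so the fragment is still recognised if parts of it move or change.
    w = max(1, len(fragment.split()) // 2)
    frag = window_hashes(fragment, w)
    page = window_hashes(revised_text, w)
    return len(frag & page) / len(frag) >= threshold
```

With a real LSH the window hashes would also match approximately, tolerating small in-window edits rather than only exact word runs.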

The fragment fingerprint is a simple digest, aka the "autoincremented discussion ID", and can be quite short. I have used a 128 bit digest, but it might be possible to use a shorter digest for smaller text fragments.

I would propose that a special URL fragment could be used; a JS gadget could then try to find the fragment on the page. The found fragment could then be marked, though finding the correct start and end points adds some difficulty.

(This was part of an old project proposal at University of Oslo, but I don't remember if that is described anywhere at Meta or Mediawiki. I don't think so.)

This could be extended into a tool to replace the current maintenance templates. That would be pretty neat, as it would move those templates out of the subject page, thus creating fewer edit conflicts with new editors.

To make this work the discussions should be given a category, which would resemble the current types of maintenance templates.

It is rather expensive to track a marked section between revisions if they have several intermingled changes. An alternative to edit distances is to convolve a special locality-sensitive hash over the text. Usually you want a window that is shorter than the text you are looking for, perhaps half the length, and you would realign on word boundaries.

Do you have a reference for this? What sort of LSH?

Wikidiff2 is probably O(N^3) in the worst case, but appears to be tractable anyway thanks to limits and optimisation of constant factors. We need to do diffs between revisions anyway, for various reasons. Once we have a diff, we can use it for human review and text compression, and potentially blame maps. It seems to me that tracking a marked section between revisions is not significantly more expensive than the O(N^3) cost of computing the diff.

The fragment fingerprint is a simple digest, aka the "autoincremented discussion ID", and can be quite short. I have used a 128 bit digest, but it might be possible to use a shorter digest for smaller text fragments.

You still need an autoincremented discussion ID, even if you have an LSH identifying its location. You can have more than one discussion associated with a given context. If you resolve a discussion, you would be able to create a new one at the same location without reopening the old one. And it wouldn't make sense to merge discussions across multiple pages just because they refer to the same text.

This could be extended into a tool to replace the current maintenance templates. That would be pretty neat, as it would move those templates out of the subject page, thus creating fewer edit conflicts with new editors.

OK, but that's out of scope.

Also the earlier T89575: Associate non-body content such as annotations and talk to a location in the article. (I've linked that and the others mentioned in the description)
Plus my own bodged together explorations in https://imgur.com/a/Kn3HZ in 2013! (Using tiled browser windows to simulate a sidebar, showing various comparisons and use-cases).
I've long thought this is one of the killer-features. Lots of technical and social complexities to navigate, but oh so powerfully useful.

Just wanted to (unhelpfully) chime in and express support for the potentially transformative power of annotations features on wiki. Per @Quiddity this strikes me as a dangerous but likely killer feature that would fundamentally improve editing and curation processes. On an internal note, I also think this would help mitigate the need to use Google docs within the foundation, which is something I'd love to see less of.

@JMinor: +1 on dogfooding our own software, instead of using google docs and etherpads!

If you look at the way google docs anchors their comments, it's quite simple: it stays associated with the given character location (with appropriate shifts if text ahead/behind is inserted/removed) and if the anchor is removed, the annotation disappears from the sidebar but still appears in the list of "all comments".
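That anchoring behaviour is simple enough to sketch in a few lines; this is a toy illustration of the described behaviour, not Google's actual implementation:

```python
def adjust_anchor(anchor, edit_pos, removed, inserted):
    """anchor = (start, end) character range.
    Returns the shifted anchor, or None if the edit overlapped it
    (the annotation is then orphaned but kept in "all comments")."""
    start, end = anchor
    delta = inserted - removed
    if edit_pos + removed <= start:   # edit entirely before the anchor
        return (start + delta, end + delta)
    if edit_pos >= end:               # edit entirely after the anchor
        return (start, end)
    return None                       # edit overlaps the anchor: orphan it
```

Real editors also split the overlap case further (e.g. growing the anchor when text is inserted strictly inside it), but the shift/orphan distinction is the core of it.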

I'd caution against going *too* crazy with the anchoring algorithms. Simple things can work well! It's more important (IMO) to get a general framework for annotations in place. The annotations will always be against a certain specific revision, so they can be very precisely located (xpath in DOM or wikitext character range). Then we'll have a general API for "give me a location in the revision X which corresponds to this location in revision Y", which can be stupid-simple at first (based on line diffs, say), but get better over time if/when that's needed. Note that different users of annotations may have different requirements for the "relocate an annotation" implementation -- for example, content translation and the translate extension prefer to do only "precise" relocations, marking a translation as "fuzzy" if the annotated region has changed at all since the annotation was made. For inline discussions, the allowable relocations are likely to be more flexible. T116350#2772884 contains some further discussion and different use cases; as does https://en.wikipedia.org/wiki/User:Cscott/Ideas/Amazing_Article_Annotations .
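The "stupid-simple at first" relocation API based on line diffs, with the strict-versus-flexible policy distinction, might look something like this (illustrative names, difflib standing in for the real diff engine):

```python
import difflib

def relocate_line(old_lines, new_lines, line_no):
    """Return (new_line_no, exact) for an annotated line.
    Strict consumers (e.g. translation) treat exact=False as "fuzzy";
    flexible consumers (e.g. inline discussions) can keep the anchor."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i1 <= line_no < i2:
            if tag == 'equal':
                return j1 + (line_no - i1), True
            if tag == 'replace' and (line_no - i1) < (j2 - j1):
                return j1 + (line_no - i1), False  # same place, modified text
            return None, False
    return None, False
```

The return shape leaves room for the implementation to get smarter over time without changing the API.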

It is rather expensive to track a marked section between revisions if they have several intermingled changes. An alternative to edit distances is to convolve a special locality-sensitive hash over the text. Usually you want a window that is shorter than the text you are looking for, perhaps half the length, and you would realign on word boundaries.

Do you have a reference for this? What sort of LSH?

Wikidiff2 is probably O(N^3) in the worst case, but appears to be tractable anyway thanks to limits and optimisation of constant factors. We need to do diffs between revisions anyway, for various reasons. Once we have a diff, we can use it for human review and text compression, and potentially blame maps. It seems to me that tracking a marked section between revisions is not significantly more expensive than the O(N^3) cost of computing the diff.

The simplest locality-sensitive hash in use for this kind of purpose is the Nilsimsa hash, but the implementation of that one is a bit backward. You will probably want to use some ideas from spread spectrum coding, and how to find an encoded sequence in a noisy channel (the article).

Nilsimsa was usually used for spam detection, but now I believe most spam detectors use machine learning. Nilsimsa detects specific prose, while machine-learning systems detect words and phrases with specific sentiments.

The lib I wrote was part of Wikibase; I wanted to detect similarity over labels and descriptions, and it had O(n) complexity, where n is the text length on the subject page. The segment length doesn't really matter. Note that FuzzyComparer.php calculates a score for two strings and does not convolve a hashing window. If you take this lib and convolve it, you will get O(mn), which is bad. To do it right you must add and subtract at the ends of the window; then you get O(n).
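The add/subtract trick can be illustrated with a toy windowed profile; here a per-byte histogram stands in for the real feature vector (Nilsimsa uses trigram counts), and the incremental update is what makes sliding the window over the whole text O(n) rather than O(mn):

```python
def sliding_profiles(text, window):
    """Per-window byte histograms, updated incrementally as the window slides."""
    counts = [0] * 256
    data = text.encode()
    profiles = []
    for i, b in enumerate(data):
        counts[b] += 1                        # byte entering the window
        if i >= window:
            counts[data[i - window]] -= 1     # byte leaving the window
        if i >= window - 1:
            profiles.append(tuple(counts))    # one profile per window position
    return profiles
```

Each step touches two histogram cells regardless of the window size, which is exactly the O(n) behaviour described above.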

The fragment fingerprint is a simple digest, aka the "autoincremented discussion ID", and can be quite short. I have used a 128 bit digest, but it might be possible to use a shorter digest for smaller text fragments.

You still need an autoincremented discussion ID, even if you have an LSH identifying its location. You can have more than one discussion associated with a given context. If you resolve a discussion, you would be able to create a new one at the same location without reopening the old one. And it wouldn't make sense to merge discussions across multiple pages just because they refer to the same text.

We are not talking about the same thing. I guess what you are saying is that you want two or more discussions about a text segment, and you want an identifier for each one of those. My idea was to just point to the text segment, and let the user point to a text segment on the subject page. The link to the subject page would just be a special fragment that identifies a text segment. Typically, you would have a link like https://en.wikipedia.org/wiki/Norway#<magic><segment length><lsh> where <magic> is §, <segment length> is the log₂ of the segment length, and <lsh> is the digest you are looking for in the text. A JS script can then pick up the fragment and turn it into a highlighted range. You won't have to track this linking; as long as sufficient text remains, you would hit it with the same link fragment.

Note that this can relocate a segment that is heavily edited, for example when one segment comes from one article and another segment from another article, and sentences are copy-pasted into different positions. This is, however, a bit dangerous, because it is easily confused with copyright violations.
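The proposed fragment scheme could be sketched as follows; a plain shortened hash stands in for the LSH, the single-character log₂ length is a simplifying assumption, and the brute-force scan is only for illustration:

```python
import hashlib
import math

def make_fragment(segment):
    """Encode a segment as §<log2 length><digest> (log2 fits in one char here)."""
    log_len = round(math.log2(len(segment)))
    digest = hashlib.md5(segment.encode()).hexdigest()[:16]  # shortened digest
    return f"§{log_len}{digest}"

def find_fragment(fragment, page_text):
    """Scan the page for a range whose digest matches the fragment."""
    log_len, digest = int(fragment[1]), fragment[2:]
    # Try lengths around 2**log_len, since log2 rounding loses the exact size.
    for length in range(2 ** log_len // 2, 2 ** (log_len + 1) + 1):
        for start in range(len(page_text) - length + 1):
            seg = page_text[start:start + length]
            if hashlib.md5(seg.encode()).hexdigest()[:16] == digest:
                return (start, start + length)
    return None
```

A real LSH would replace the exact-match comparison with a similarity threshold, so the lookup survives small edits to the segment.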

This could be extended into a tool to replace the current maintenance templates. That would be pretty neat, as it would move those templates out of the subject page, thus creating fewer edit conflicts with new editors.

OK, but that's out of scope.

What I want to do is get rid of the maintenance templates, and for that I need a way to point to a troublesome text segment, not just discuss some aspect of the text.

If the use-case is "I want to discuss a particular block of text from a wiki page," wouldn't the simplest and quickest implementation be a JavaScript script that copies the text and inserts it into <blockquote> tags in a new talk page section? It could even include a permalink to the exact version of the wiki page. Maybe we should build that first, see if anyone uses it or likes it, and then discuss this significantly larger engineering undertaking.
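The wikitext such a gadget would produce is easy to sketch; the heading, phrasing, and permalink URL here are made-up examples, shown in Python rather than the JavaScript a gadget would actually use:

```python
def discussion_section(selection, page_title, rev_id,
                       heading="Discussing a passage"):
    """Build a new talk-page section quoting the selection, with a
    permalink to the exact revision it was taken from."""
    permalink = (f"https://en.wikipedia.org/w/index.php"
                 f"?title={page_title}&oldid={rev_id}")
    return (
        f"== {heading} ==\n"
        f"<blockquote>{selection}</blockquote>\n"
        f"Quoted from [{permalink} this version]. ~~~~"
    )
```

The `oldid` parameter is MediaWiki's standard permalink to a specific revision, so the quote stays meaningful even after the article changes.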

If there are other use-cases, maybe those could be fleshed out in this task or elsewhere.

Please consider the learnings from ArticleFeedbackv5 (https://www.mediawiki.org/wiki/Article_feedback/Version_5), which provided a somewhat similar "inline" comment feature. I was working closely with @Fabrice_Florin and his team back then. My personal key learnings are:

  • The feature made readers super happy.
  • The vast majority of the incoming comments were not actionable. This was actually intentional in the product's design, and therefore never fully considered in any of the moderation process designs that were added later.
  • The moderation process that was added later was designed as if all comments were potentially relevant. This created a lot of tedious, unproductive workload that made active editors unhappy very fast.
  • The most significant frustration was experienced by the editors who cared most about the articles they maintained so carefully. These editors wanted feedback, but what they got was nothing they could work with. For example, on an article about a mammal the editor wanted feedback like "here is a paper with new information you can add to the article". Instead, they got requests from children asking how old the mammal gets, which is an unscientific question in the first place, and something science actually does not know for most animals.

Things have changed a bit since then. Those points can now be assisted by predictions like those ORES provides.

Deskana subscribed.

This seems out of scope of the annual plan, and the relevant teams have all deprioritised it, so I doubt this will be worked on for quite some time.