Write a comment parser for PHP HTML
Closed, Resolved · Public

Assigned To

Authored By

	Esanders
	Oct 1 2019, 11:49 PM

Description

Write a parser for existing pages that turns the PHP parser's HTML output into a structured discussion format (threads, indentation levels, etc.). It needs to handle different types of indentation.

- Initially find signatures
- [non-blocker] Expand upwards to find the start of a multi-line comment using the previous signature.

Details

	Subject	Repo	Branch	Lines +/-
	Detect and parse timestamps, signatures, comments and threads	mediawiki/extensions/DiscussionTools	master	+882 -1
	[DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads	mediawiki/core	master	+1 K -0


Related Objects

Status	Subtype	Assigned	Task
Open		None	T233443 [Epic] Reply Tool
Resolved		ppelberg	T235923 Replies v1.0: release replying to specific comments
Resolved		• iamjessklein	T235592 Replies v1.0: Create mockups
Duplicate		None	T236916 Replying v1.1: Create mockups
Resolved		Esanders	T238177 Create mockup for a new approach to previewing a reply
Resolved		ppelberg	T236196 Replies v1.0: Engineering
Resolved		matmarex	T234404 Write a comment parser for PHP HTML
Resolved		• iamjessklein	T236921 Replies v1.0: conduct usability testing
Resolved		ppelberg	T240063 [Spike] How do we have a test instance that supports a relatively larger number of people testing prototypes?
Resolved		Ryasmeen	T239961 QA Replying v1.0 on prototype server
Resolved		ppelberg	T239966 [Pre-deployment testing] Retaining draft comments/replies
Resolved		ppelberg	T239865 [Pre-deployment testing] Same reply can get posted multiple times if you click on "Reply" button multiple times
Resolved		ppelberg	T239873 [Pre-deployment testing] Clicking on edit on "Talk:Reply test" page, gives error "Uncaught TypeError: Cannot read property 'range' of undefined"
Resolved		matmarex	T239864 [Pre-deployment testing] Getting error " Uncaught TypeError: Cannot read property 'then' of undefined" while replying to a nested reply
Resolved		Esanders	T240639 Highlighting of the replies is incorrect on Beta cluster
Resolved		matmarex	T240640 [Feedback] Not supporting adding multiple signatures in a comment
Resolved		matmarex	T240643 Better handle edit conflicts (race conditions) in DiscussionTools
Resolved		ppelberg	T240259 Implement `unload handlers` for compose
Resolved		ppelberg	T240271 Warn contributors before they discard a draft
Resolved		ppelberg	T236951 Deploy Replying v1.0 to target wikis
Resolved		ppelberg	T240607 Finalize Replying deployment strategy
Resolved		ppelberg	T244869 Publish Replying deployment plan on MediaWiki
Resolved		• Whatamidoing-WMF	T244683 Make target wikis aware of replying deployment strategy
Resolved		Ryasmeen	T244432 Evaluate reliability of #DiscussionTools on target wikis
Resolved		matmarex	T244870 Deploy v1.0 via query string parameter to target wikis
Resolved		matmarex	T243621 Create query string parameter to enable/disable DiscussionTools on target wikis
Invalid		None	T245571 Reply link does not appear on the first time loading DiscussionTools (on hu and ar wiki)
Resolved		matmarex	T245574 After closing the reply box on fr.wiki, there is a blue border showing up on the talk page
Resolved		matmarex	T245692 [Investigate] Why does parser detect a comment when a link to a person's user page, talk page or contributions page is missing.
Resolved		matmarex	T245694 Handle signatures within transcluded templates or "subpages"
Resolved		matmarex	T245695 Reply link is inserted not at the end of the line when a signature is followed by an image (or other stuff which is not text)
Declined		None	T245768 Reply link is not showing up for a comment on fr.wiki
Duplicate		None	T245671 Incorrectly formatted bullet list appearing after clicking on Reply for a specific comment on a talk page on nl.wiki
Resolved		ppelberg	T245784 Reply link seems to be missing on a user talk page on nl.wiki
Duplicate		None	T245785 Reply link is not placed at the end of a paragraph on a user talkpage on nl.wiki
Resolved		ppelberg	T244872 Deploy Replying v1.0 as opt-in Beta Feature
Resolved		Esanders	T245539 Implement BetaFeatures integration
Resolved		ppelberg	T245794 Enable config flag to make Replying v1.0 available as Beta Feature (on target wikis)
Resolved		Jdforrester-WMF	T240468 Deploy DiscussionTools to beta cluster
Resolved		None	T240473 Undertake and pass a performance review for the DiscussionTools extension
Resolved		• Gilles	T240474 Performance review for the DiscussionTools extension
Duplicate		None	T240471 Undertake and pass a security review for the DiscussionTools extension
Duplicate		None	T240472 Security review for the DiscussionTools extension
Resolved	Security	sbassett	T242134 Security Review For Talk pages project
Resolved		Esanders	T240257 Add auto-save to the reply widget
Duplicate		ppelberg	T240470 Deploy DiscussionTools to Wikimedia production
Resolved		matmarex	T241391 Unable to respond to specific comments
Resolved		Esanders	T241393 Abandon changes dialog appears unexpectedly
Resolved		matmarex	T241861 [Regression] Reply link disappears from a talk page after editing it from source mode
Resolved		DLynch	T242184 Create a change tag for edits made using DiscussionTools
Duplicate		DLynch	T245069 Hide generic `DiscussionTools` change tag

Event Timeline

Esanders created this task. Oct 1 2019, 11:49 PM

Restricted Application added a subscriber: Aklapper. Oct 1 2019, 11:49 PM

Esanders moved this task from Incoming to Ready for Development on the VisualEditor (Current work) board. Oct 1 2019, 11:49 PM

Depending on what exactly the requirements are, you probably want to have a look at includes/DiscussionParser.php from Echo (and perhaps old versions of that code to see what alternatives there are and why they don't work).

matmarex updated the task description. Oct 9 2019, 7:42 AM

I only vaguely recall what it does, and only skimmed it now, but DiscussionParser can only detect a signature of the current user and the preceding section headings. We want to detect any signature, parse the user and time from it, guess the extent of the associated comment, and then guess the thread structure of the comments.

Change 539305 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] [DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/539305

gerritbot added a project: Patch-For-Review. Oct 9 2019, 8:00 AM

I wrote it!

I put it on Gerrit in the same change as the last part: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/539305. Usage: copy-paste signaturedetector.js into your browser console on a discussion page.

For example: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#span_data-segmentid="59"_class="cx-segment"???

Comment ranges are guessed by simply taking everything between two signatures (or a heading and a signature) and treating it as a comment. This seems to work well. Unsigned comments will be treated as part of the following comment because of this, but I think that's okay (and easily fixable in wikitext).
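A minimal sketch of this range-guessing rule, written in Python for illustration only (the actual demo is JavaScript operating on the DOM). The flattened `Item` list and all names here are hypothetical; the point is just that everything since the last heading or signature becomes one comment, so unsigned text is absorbed into the next signed comment.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One flattened piece of page content (hypothetical representation)."""
    kind: str   # "heading", "text", or "signature"
    value: str


@dataclass
class Comment:
    author: str
    parts: list = field(default_factory=list)


def split_into_comments(items):
    """Treat everything between two signatures (or a heading and a signature) as a comment."""
    comments = []
    pending = []  # content seen since the last heading or signature
    for item in items:
        if item.kind == "heading":
            pending = []  # a heading starts a fresh range
        elif item.kind == "signature":
            comments.append(Comment(author=item.value, parts=pending))
            pending = []
        else:
            pending.append(item.value)
    return comments


page = [
    Item("heading", "Broken template"),
    Item("text", "The infobox is broken."),
    Item("signature", "Alice"),
    Item("text", "I agree."),            # unsigned comment
    Item("text", "Fixed it just now."),
    Item("signature", "Bob"),
]
comments = split_into_comments(page)
```

Here the unsigned "I agree." ends up as the first part of Bob's comment, which is the trade-off described above.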

Section headings are thread "roots". The initial comment underneath the heading is treated as a reply to it. All other comments are then connected based on their indentation level (just counting the li/dl elements), and not DOM structure, because it tends to be a mess.
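The threading rule can be sketched with a simple stack of ancestors. This is an illustrative Python sketch, not the code from the patch: the `(kind, label, indent)` tuples and the `Node` class are assumptions, and `indent` stands for the count of li/dl ancestors, with 0 meaning an unindented comment directly under the heading.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    level: int  # 0 for a heading; comments are always at least 1
    replies: list = field(default_factory=list)


def build_threads(entries):
    """entries: (kind, label, indent) tuples in page order; kind is 'heading' or 'comment'."""
    roots = []
    stack = []  # chain of ancestors of the most recently added node
    for kind, label, indent in entries:
        if kind == "heading":
            node = Node(label, 0)
            roots.append(node)
            stack = [node]
            continue
        if not stack:
            continue  # comment before any heading: ignored in this sketch
        node = Node(label, indent + 1)
        # Pop until the top of the stack is less indented than the new comment.
        # The heading (level 0) is never popped because comments are level >= 1.
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].replies.append(node)
        stack.append(node)
    return roots


threads = build_threads([
    ("heading", "Topic", 0),
    ("comment", "A", 0),
    ("comment", "B", 1),
    ("comment", "C", 2),
    ("comment", "D", 1),
    ("comment", "E", 0),
])
```

In this example A and E are replies to the heading, B and D are replies to A, and C is a reply to B. Only the indentation numbers are used, never the nesting of the markup itself.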

Thoughts:

- It's difficult to guess what the user's intent is when a multi-line comment starts and ends at different indentation levels. I found comments where this was because the comment began with a bullet list (or a quote), comments where this was because someone forgot to indent the second line of their comment, and comments which are actually *two* comments but the first one was not signed.
  - Currently I made it pick the less-indented level. This seems to work out okay on the page I spent the most time testing (https://pl.wikipedia.org/wiki/Wikipedia:Kawiarenka/Kwestie_techniczne_dyskusja/Archiwum/2018-grudzień), but I'm not sure if there isn't a better solution.
- I did not try to handle the "outdent" template yet. It should be straightforward to add detecting and parsing it, but I'm not sure if that would actually be useful for anything, so I didn't.
- Maybe we should try to ignore timestamps/signatures in blockquotes.
- Apparently the heading of the table of contents also counts as a thread. That's harmless and we can filter it out easily later.
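To make the "pick the less-indented level" rule concrete, here is a small Python sketch. It assumes, hypothetically, that each end of a comment is described by the list of its ancestor tag names, and that one indentation step corresponds to one li or dl ancestor, as in the counting approach described earlier.

```python
INDENT_TAGS = {"li", "dl"}  # one of these per indentation step (assumption of this sketch)


def indent_of(ancestor_tags):
    """Count the li/dl ancestors of a node, ignoring everything else."""
    return sum(1 for tag in ancestor_tags if tag in INDENT_TAGS)


def comment_level(start_ancestors, end_ancestors):
    """A comment whose first and last line differ in indentation gets the smaller level."""
    return min(indent_of(start_ancestors), indent_of(end_ancestors))


# A reply that begins with a bullet list and ends with a plain indented line.
level_bullets_first = comment_level(["dl", "dd", "ul", "li"], ["dl", "dd"])

# A reply whose second line was accidentally left one level less indented.
level_forgot_indent = comment_level(["dl", "dd", "dl", "dd"], ["dl", "dd"])
```

Both cases resolve to level 1. The rule cannot tell these apart from two comments where the first was left unsigned, which is the open question raised in the list above.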

UI bugs in the demo which you should ignore:

- The lines connecting replies to their parents cover the bullets when the comments are indented using bullet lists. This would be annoying to work around so I left it in.
- Math formulae and references lists apparently confuse the highlighting (the highlighted area of a comment is much larger than it should be), but this is only a visual bug in the demo (the logic detecting the comments works fine in those cases, only drawing the highlight is wrong, because of some invisible elements not being ignored when computing it).
- Floated images and other boxes also confuse the highlighting.

• marcella assigned this task to matmarex. Oct 9 2019, 4:06 PM

• marcella renamed this task from Writer a comment parser for PHP HTML to Write a comment parser for PHP HTML. Oct 9 2019, 4:15 PM

- I see that there is code for local timezone abbreviations, but it doesn't seem to work: On https://de.wiktionary.org/wiki/Wiktionary:Auskunft (and other pages on de.wikt), which uses MEZ and MESZ instead of CET and CEST, no signatures are found.
- Even minor variations in the timestamp format make the parser fail, e.g. deleting a leading 0 in the hour, missing period after abbreviated month, etc. These variations sometimes happen when a missing or wrong signature is fixed manually afterwards, and for old comments (seems like back in 2006 the format was a little different, which of course isn't a real problem, but probably means the parser will break when the timestamp format is changed again).
- When the name of the user comes after the timestamp, the parser doesn't find it (uncommon, happens mostly when you accidentally click twice to insert 8 tildes instead of 4, https://de.wikipedia.org/wiki/Diskussion:Stra%C3%9Fenbahn_Berlin#Pl%C3%A4ne_f%C3%BCr_Neubaustrecken)
- It's not clear what will happen when a comment has 2 timestamps (E.g. "See my previous comment on <old timestamp> --User <new timestamp>", or "Comment -- User <first timestamp>, modified <second timestamp>"), or even 2 signatures.

Oh, and I just want to say that I admire the magic with perfectly chosen default values that makes the parser even correctly detect signatures by users that only link to their talk page and sign on that talk page. Even though this doesn't generate a real link, the parser works as expected.

Change 539305 abandoned by Bartosz Dziewoński:
[DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

Reason:
Moving to mediawiki/extensions/DiscussionTools repository

https://gerrit.wikimedia.org/r/539305

Change 542993 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

Change 542993 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

Maintenance_bot removed a project: Patch-For-Review. Oct 18 2019, 8:10 PM

In T234404#5562427, @Schnark wrote:

I see that there is code for local timezone abbreviations, but it doesn't seem to work: On https://de.wiktionary.org/wiki/Wiktionary:Auskunft (and other pages on de.wikt), which uses MEZ and MESZ instead of CET and CEST, no signatures are found.

I looked into it and I think it's a bug in the API I used to load the translations: T235978. I only tested it on https://ar.wikipedia.org/, where 'timezone-utc' is localised, and that worked. The final code in the DiscussionTools extension handles localised messages differently and shouldn't be affected by this bug.

Even minor variations in the timestamp format make the parser fail, e.g. deleting a leading 0 in the hour, missing period after abbreviated month, etc. These variations sometimes happen when a missing or wrong signature is fixed manually afterwards, and for old comments (seems like back in 2006 the format was a little different, which of course isn't a real problem, but probably means the parser will break when the timestamp format is changed again).

When the name of the user comes after the timestamp, the parser doesn't find it (uncommon, happens mostly when you accidentally click twice to insert 8 tildes instead of 4, https://de.wikipedia.org/wiki/Diskussion:Stra%C3%9Fenbahn_Berlin#Pl%C3%A4ne_f%C3%BCr_Neubaustrecken)

For the most part, this is intentional. I wanted it to detect timestamps generated by typing '~~~~', but not any other similar-looking text.
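As an illustration of what "strict" means here, the following Python sketch matches only the exact shape that '~~~~' produces with the default English date format (for example "07:42, 9 October 2019 (UTC)"). The regular expression and function name are assumptions made for this example; the real parser derives its pattern from each wiki's localised date format and messages rather than hard-coding English.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Two-digit hour and minute, day without leading zero required, full month name,
# four-digit year, and a literal "(UTC)" suffix. Anything else is not a signature timestamp.
TIMESTAMP_RE = re.compile(
    r"(?<!\d)(?P<hour>\d{2}):(?P<minute>\d{2}), "
    r"(?P<day>\d{1,2}) "
    r"(?P<month>" + "|".join(MONTHS) + r") "
    r"(?P<year>\d{4}) \(UTC\)"
)


def find_timestamps(text):
    """Return every substring that looks exactly like a generated signature timestamp."""
    return [m.group(0) for m in TIMESTAMP_RE.finditer(text)]


generated = find_timestamps("Done. Example 07:42, 9 October 2019 (UTC)")
no_leading_zero = find_timestamps("Done. Example 7:42, 9 October 2019 (UTC)")
short_month = find_timestamps("Done. Example 07:42, 9 Oct 2019 (UTC)")
```

Only the first string yields a match. The hand-edited variants (missing leading zero, abbreviated month) are deliberately rejected, which is the behaviour described above.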

If you had a signing bot or something that added thousands of "invalid" signatures, then maybe we should add a special case (I added a special case for the en.wp signing bot, which signs IPv6 users differently than MediaWiki), but otherwise I'd rather leave this for humans to fix. Maybe we should think about some tools for adding signatures in the correct format to unsigned comments.

I think we would like to handle historical discussions, but I don't think it's a priority right now, given that you're unlikely to want to reply to them today. The biggest obstacle here would be finding the exact date formats that were used historically in each language, and adding them to configuration somewhere.

It's not clear what will happen when a comment has 2 timestamps (E.g. "See my previous comment on <old timestamp> --User <new timestamp>", or "Comment -- User <first timestamp>, modified <second timestamp>"), or even 2 signatures.

Currently this is treated as two separate comments at the same indentation level.

In T234404#5569469, @Schnark wrote:

Oh, and I just want to say that I admire the magic with perfectly chosen default values that makes the parser even correctly detect signatures by users that only link to their talk page and sign on that talk page. Even though this doesn't generate a real link, the parser works as expected.

Ha, I honestly didn't even think about handling that, it works out fine just by accident! :D You should probably credit James for making self-links use actual <a> tags in 2017 (T160480), which is why this works.

ppelberg added a parent task: T236196: Replies v1.0: Engineering. Oct 22 2019, 6:38 PM

matmarex moved this task from In progress to Product owner review on the VisualEditor (Current work) board. Oct 25 2019, 3:45 PM

ppelberg closed this task as Resolved. Dec 24 2019, 7:24 PM

matmarex mentioned this in T244870: Deploy v1.0 via query string parameter to target wikis. Feb 17 2020, 7:40 PM