Write a comment parser for PHP HTML
Open, Needs Triage · Public

Description

Write a parser for existing page: PHP output -> structured discussion format (thread/level etc.). Needs to handle different types of indentation.

  1. Initially find signatures
  2. [non-blocker] Expand upwards to find the start of a multi-line comment using the previous signature.

Event Timeline

Esanders created this task. Tue, Oct 1, 11:49 PM
Restricted Application added a subscriber: Aklapper. Tue, Oct 1, 11:49 PM
Schnark added a subscriber: Schnark. Fri, Oct 4, 8:06 AM

Depending on what exactly the requirements are, you probably want to have a look at includes/DiscussionParser.php from Echo (and perhaps old versions of that code to see what alternatives there are and why they don't work).

matmarex updated the task description. Wed, Oct 9, 7:42 AM

I only vaguely recall what it does, and only skimmed it now, but DiscussionParser can only detect a signature of the current user and the preceding section headings. We want to detect any signature, parse the user and time from it, guess the extent of the associated comment, and then guess the thread structure of the comments.

Change 539305 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] [DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/539305

I wrote it!

I put it on Gerrit in the same change as the last part: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/539305. Usage: copy-paste signaturedetector.js into your browser console on a discussion page.

For example: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#span_data-segmentid="59"_class="cx-segment"???

Comment ranges are guessed by simply taking everything between two signatures (or a heading and a signature) and treating it as a comment. This seems to work well. Unsigned comments will be treated as part of the following comment because of this, but I think that's okay (and easily fixable in wikitext).
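
To illustrate the range-guessing rule, here is a minimal sketch. It is not the code from the Gerrit change: it assumes the page has already been flattened into a hypothetical list of tokens (`heading`, `text`, `signature`), whereas the real parser works on DOM nodes. The names `groupComments`, `type`, `text` and `author` are made up for this example.

```javascript
// Sketch only: groups a flat, hypothetical token stream into comments.
// Everything between two signatures (or a heading and a signature)
// becomes one comment, attributed to the signature that ends it.
function groupComments(tokens) {
  const comments = [];
  let pending = [];
  for (const token of tokens) {
    if (token.type === 'heading') {
      // A heading starts a fresh range; unsigned text before it is dropped here.
      pending = [];
    } else if (token.type === 'signature') {
      comments.push({ author: token.author, parts: pending });
      pending = [];
    } else {
      pending.push(token.text);
    }
  }
  return comments;
}

const sample = groupComments([
  { type: 'heading', text: 'Topic' },
  { type: 'text', text: 'First remark' },
  { type: 'text', text: 'Unsigned remark' },
  { type: 'signature', author: 'Alice' },
  { type: 'text', text: 'Reply' },
  { type: 'signature', author: 'Bob' },
]);
console.log(sample);
```

Note how the unsigned remark ends up inside Alice's comment, which is the behaviour described above.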

Section headings are thread "roots". The initial comment underneath the heading is treated as a reply to it. All other comments are then connected based on their indentation level (just counting the li/dl elements), and not DOM structure, because it tends to be a mess.
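
The threading rule can be sketched roughly like this. Again this is an illustration under assumptions, not the actual implementation: `buildThreads`, `id` and `level` are hypothetical names, `level` stands for the counted li/dl nesting depth, and I assume a comment that jumps more than one level deeper simply attaches to the most recent comment.

```javascript
// Sketch only: connects comments by indentation level, ignoring DOM structure.
// Headings are roots; a level-0 comment is a reply to the current heading.
function buildThreads(items) {
  const roots = [];
  let stack = []; // stack[d] is the most recent item at depth d (heading = depth 0)
  for (const item of items) {
    const node = { ...item, replies: [] };
    if (item.type === 'heading') {
      roots.push(node);
      stack = [node];
      continue;
    }
    if (stack.length === 0) {
      continue; // comment before any heading: skipped in this sketch
    }
    // Clamp the depth so an over-indented comment still finds a parent.
    const depth = Math.min(item.level + 1, stack.length);
    stack[depth - 1].replies.push(node);
    stack.length = depth; // forget deeper comments from the previous branch
    stack.push(node);
  }
  return roots;
}

const threads = buildThreads([
  { type: 'heading', id: 'H' },
  { type: 'comment', id: 'A', level: 0 },
  { type: 'comment', id: 'B', level: 1 },
  { type: 'comment', id: 'C', level: 1 },
  { type: 'comment', id: 'D', level: 0 },
  { type: 'comment', id: 'E', level: 3 },
]);
console.log(JSON.stringify(threads, null, 2));
```

Here B and C are both replies to A, D is a second reply to the heading, and the over-indented E is attached to D.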

Thoughts:

  • It's difficult to guess the user's intent when a multi-line comment starts and ends at different indentation levels. I found comments where this was because the comment began with a bullet list (or a quote), comments where this was because someone forgot to indent the second line of their comment, and comments which are actually *two* comments but the first one was not signed.
  • I did not try to handle the "outdent" template yet. It should be straightforward to add detecting and parsing it, but I'm not sure if that would actually be useful for anything, so I didn't.
  • Maybe we should try to ignore timestamps/signatures in blockquotes.
  • Apparently the heading of the table of contents also counts as a thread. That's harmless and we can filter it out easily later.

UI bugs in the demo which you should ignore:

  • The lines connecting replies to their parents cover the bullets when the comments are indented using bullet lists. This would be annoying to work around so I left it in.
  • Math formulae and reference lists apparently confuse the highlighting (the highlighted area of a comment is much larger than it should be). This is only a visual bug in the demo: the logic detecting the comments works fine in those cases, and only the drawing of the highlight is wrong, because some invisible elements are not ignored when computing it.
  • Floated images and other boxes also confuse the highlighting.

marcella assigned this task to matmarex. Wed, Oct 9, 4:06 PM
marcella renamed this task from Writer a comment parser for PHP HTML to Write a comment parser for PHP HTML. Wed, Oct 9, 4:15 PM

  • I see that there is code for local timezone abbreviations, but it doesn't seem to work: On https://de.wiktionary.org/wiki/Wiktionary:Auskunft (and other pages on de.wikt), which uses MEZ and MESZ instead of CET and CEST, no signatures are found.
  • Even minor variations in the timestamp format make the parser fail, e.g. deleting a leading 0 in the hour, a missing period after an abbreviated month, etc. These variations sometimes happen when a missing or wrong signature is fixed manually afterwards, and in old comments (it seems that back in 2006 the format was a little different, which of course isn't a real problem, but probably means the parser will break when the timestamp format is changed again).
  • When the name of the user comes after the timestamp, the parser doesn't find it (uncommon, happens mostly when you accidentally click twice to insert 8 tildes instead of 4, https://de.wikipedia.org/wiki/Diskussion:Stra%C3%9Fenbahn_Berlin#Pl%C3%A4ne_f%C3%BCr_Neubaustrecken)
  • It's not clear what will happen when a comment has 2 timestamps (E.g. "See my previous comment on <old timestamp> --User <new timestamp>", or "Comment -- User <first timestamp>, modified <second timestamp>"), or even 2 signatures.
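
To make the timestamp points concrete, here is a hedged sketch of a more lenient matcher. It is not taken from the patch. The German month abbreviations, the zone list and the overall format (`HH:MM, D. Mon. YYYY (ZONE)`) are assumptions about what de.wikt timestamps look like, and `findTimestamps` / `lastTimestamp` are hypothetical names. Treating the last timestamp in a line as the signature is also just one possible policy, since it isn't clear what the parser should do there.

```javascript
// Assumed German month abbreviations and time zone abbreviations.
const MONTHS = ['Jan', 'Feb', 'Mär', 'Apr', 'Mai', 'Jun', 'Jul', 'Aug', 'Sep', 'Okt', 'Nov', 'Dez'];
const ZONES = ['CET', 'CEST', 'MEZ', 'MESZ'];

// Lenient on purpose: the hour may lack its leading zero (\d{1,2}) and the
// period after the abbreviated month is optional (\.?).
const TIMESTAMP = new RegExp(
  '\\b(\\d{1,2}):(\\d{2}), (\\d{1,2})\\. (' + MONTHS.join('|') + ')\\.? (\\d{4}) \\((' +
    ZONES.join('|') + ')\\)',
  'g'
);

// Returns every timestamp found in the text, in order of appearance.
function findTimestamps(text) {
  return [...text.matchAll(TIMESTAMP)].map((m) => ({
    hour: Number(m[1]),
    minute: Number(m[2]),
    day: Number(m[3]),
    month: MONTHS.indexOf(m[4]) + 1,
    year: Number(m[5]),
    zone: m[6],
    index: m.index,
  }));
}

// Assumed policy: when a line has several timestamps, the last one is the signature.
function lastTimestamp(text) {
  const all = findTimestamps(text);
  return all.length > 0 ? all[all.length - 1] : null;
}

console.log(lastTimestamp('Siehe 12:00, 1. Jan. 2019 (MEZ) --User 9:05, 3. Okt 2019 (MESZ)'));
```

A name that follows the timestamp would still need separate handling; this sketch only covers the timestamp itself.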

Oh, and I just want to say that I admire the magic with perfectly chosen default values that makes the parser even correctly detect signatures by users that only link to their talk page and sign on that talk page. Even though this doesn't generate a real link, the parser works as expected.

Change 539305 abandoned by Bartosz Dziewoński:
[DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

Reason:
Moving to mediawiki/extensions/DiscussionTools repository

https://gerrit.wikimedia.org/r/539305

Change 542993 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993