Write a comment parser for PHP HTML
Closed, Resolved, Public

Description

Write a parser for existing pages: PHP output -> structured discussion format (thread, level, etc.). It needs to handle different types of indentation.

  1. Initially find signatures
  2. [non-blocker] Expand upwards to find the start of a multi-line comment using the previous signature.

Event Timeline

Depending on what exactly the requirements are, you probably want to have a look at includes/DiscussionParser.php from Echo (and perhaps old versions of that code to see what alternatives there are and why they don't work).

I only vaguely recall what it does, and only skimmed it now, but DiscussionParser can only detect a signature of the current user and the preceding section headings. We want to detect any signature, parse the user and time from it, guess the extent of the associated comment, and then guess the thread structure of the comments.
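As a rough illustration of "parse the user from it", here is a minimal sketch that guesses the signing user from the link targets found just before a timestamp. Everything in it is an assumption made for the example: the function names `userFromHref` and `findSignatureUser`, the hard-coded English namespace names, the `/wiki/` path prefix, and the rule that the link closest to the timestamp wins. The actual patch may work quite differently.

```javascript
// Assumption: English namespace names and the /wiki/ prefix are hard-coded.
// A real wiki would need its localised namespace names and article path.
const USER_LINK = /^\/wiki\/(?:User|User_talk):([^/#?]+)/;
const CONTRIBS_LINK = /^\/wiki\/Special:Contributions\/([^/#?]+)/;

// Returns the user name a link points at, or null if it is not a user link.
function userFromHref(href) {
  const match = USER_LINK.exec(href) || CONTRIBS_LINK.exec(href);
  if (!match) {
    return null;
  }
  // Titles in URLs use underscores and percent-encoding.
  return decodeURIComponent(match[1]).replace(/_/g, ' ');
}

// hrefs: link targets that precede a timestamp, in document order.
// Assumption: the user link closest to the timestamp is the signature.
function findSignatureUser(hrefs) {
  for (let i = hrefs.length - 1; i >= 0; i--) {
    const user = userFromHref(hrefs[i]);
    if (user !== null) {
      return user;
    }
  }
  return null;
}
```

A signature that only links to a talk page still resolves to a user here, because `User_talk:` links are accepted alongside `User:` links.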

Change 539305 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] [DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/539305

I wrote it!

I put it on Gerrit in the same change as the last part: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/539305. Usage: copy-paste signaturedetector.js into your browser console on a discussion page.

For example: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#span_data-segmentid="59"_class="cx-segment"???

Comment ranges are guessed by simply taking everything between two signatures (or a heading and a signature) and treating it as a comment. This seems to work well. Unsigned comments will be treated as part of the following comment because of this, but I think that's okay (and easily fixable in wikitext).
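The range-guessing rule described above can be sketched on a flattened page model. This is only an illustration: the item shapes (`heading`, `text`, `signature`) and the name `splitIntoComments` are hypothetical, and the real code walks DOM nodes instead of a plain array.

```javascript
// Hypothetical flattened page model: each item is a heading, a run of text,
// or a detected signature, in document order.
function splitIntoComments(items) {
  const comments = [];
  let pending = []; // text collected since the last heading or signature

  for (const item of items) {
    if (item.type === 'heading') {
      // A heading starts a new range. Simplification: text left over
      // after the last signature of the previous section is discarded.
      pending = [];
    } else if (item.type === 'text') {
      pending.push(item.value);
    } else if (item.type === 'signature') {
      // Everything since the previous boundary belongs to this signature.
      comments.push({ author: item.author, text: pending.join('\n') });
      pending = [];
    }
  }
  return comments;
}

const exampleItems = [
  { type: 'heading', value: 'Proposal' },
  { type: 'text', value: 'Unsigned remark.' },
  { type: 'text', value: 'I think we should do this.' },
  { type: 'signature', author: 'Alice' },
  { type: 'text', value: 'Agreed.' },
  { type: 'signature', author: 'Bob' },
];
const exampleComments = splitIntoComments(exampleItems);
```

In this example the unsigned remark ends up inside Alice's comment, which is the behaviour described above for unsigned comments.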

Section headings are thread "roots". The initial comment underneath the heading is treated as a reply to it. All other comments are then connected based on their indentation level (just counting the li/dl elements), and not DOM structure, because it tends to be a mess.
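To make the threading rule concrete, here is a minimal sketch that connects comments using nothing but their indentation level. The input shape (`type`, `indent`, `author`) and the name `buildThreads` are made up for the example. It also assumes that a comment's parent is the nearest preceding item with a smaller level, so any comment at indentation 0 is attached to the heading. That rule is a guess at a reasonable reading of the description, not a copy of the patch.

```javascript
// items: headings and comments in document order.
// Comments carry `indent`, the number of li/dl ancestors counted for them.
function buildThreads(items) {
  const threads = [];
  let stack = []; // chain of currently "open" ancestors

  for (const item of items) {
    const node = { ...item, replies: [] };

    if (item.type === 'heading') {
      // A heading is a thread root at level 0.
      node.level = 0;
      threads.push(node);
      stack = [node];
      continue;
    }

    // Comments sit one level below the heading at minimum.
    node.level = item.indent + 1;
    // Close every open item that is not shallower than this comment.
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    if (stack.length === 0) {
      // Comment before any heading: treat it as its own root.
      threads.push(node);
    } else {
      stack[stack.length - 1].replies.push(node);
    }
    stack.push(node);
  }
  return threads;
}

const exampleThreads = buildThreads([
  { type: 'heading', title: 'Proposal' },
  { type: 'comment', author: 'A', indent: 0 },
  { type: 'comment', author: 'B', indent: 1 },
  { type: 'comment', author: 'C', indent: 2 },
  { type: 'comment', author: 'D', indent: 1 },
  { type: 'comment', author: 'E', indent: 0 },
]);
```

Here B and D both become replies to A, C becomes a reply to B, and A and E hang directly off the heading. Because only levels are compared, the result does not depend on how the list markup happens to be nested in the DOM.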

Thoughts:

  • It's difficult to guess the user's intent when a multi-line comment starts and ends at different indentation levels. I found comments where this was because the comment began with a bullet list (or a quote), comments where this was because someone forgot to indent the second line of their comment, and comments which are actually *two* comments but the first one was not signed.
  • I did not try to handle the "outdent" template yet. It should be straightforward to add detecting and parsing it, but I'm not sure if that would actually be useful for anything, so I didn't.
  • Maybe we should try to ignore timestamps/signatures in blockquotes.
  • Apparently the heading of the table of contents also counts as a thread. That's harmless and we can filter it out easily later.

UI bugs in the demo which you should ignore:

  • The lines connecting replies to their parents cover the bullets when the comments are indented using bullet lists. This would be annoying to work around so I left it in.
  • Math formulae and reference lists apparently confuse the highlighting (the highlighted area of a comment is much larger than it should be). This is only a visual bug in the demo: the logic detecting the comments works fine in those cases, and only the drawing of the highlight is wrong, because some invisible elements are not ignored when computing it.
  • Floated images and other boxes also confuse the highlighting.

marcella renamed this task from "Writer a comment parser for PHP HTML" to "Write a comment parser for PHP HTML". Oct 9 2019, 4:15 PM

  • I see that there is code for local timezone abbreviations, but it doesn't seem to work: On https://de.wiktionary.org/wiki/Wiktionary:Auskunft (and other pages on de.wikt), which uses MEZ and MESZ instead of CET and CEST, no signatures are found.
  • Even minor variations in the timestamp format make the parser fail, e.g. deleting a leading 0 in the hour, a missing period after an abbreviated month, etc. These variations sometimes happen when a missing or wrong signature is fixed manually afterwards, and in old comments (it seems that back in 2006 the format was a little different, which of course isn't a real problem, but probably means the parser will break when the timestamp format is changed again).
  • When the name of the user comes after the timestamp, the parser doesn't find it (uncommon, happens mostly when you accidentally click twice to insert 8 tildes instead of 4, https://de.wikipedia.org/wiki/Diskussion:Stra%C3%9Fenbahn_Berlin#Pl%C3%A4ne_f%C3%BCr_Neubaustrecken)
  • It's not clear what will happen when a comment has 2 timestamps (E.g. "See my previous comment on <old timestamp> --User <new timestamp>", or "Comment -- User <first timestamp>, modified <second timestamp>"), or even 2 signatures.

Oh, and I just want to say that I admire the magic with perfectly chosen default values that makes the parser even correctly detect signatures by users that only link to their talk page and sign on that talk page. Even though this doesn't generate a real link, the parser works as expected.

Change 539305 abandoned by Bartosz Dziewoński:
[DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

Reason:
Moving to mediawiki/extensions/DiscussionTools repository

https://gerrit.wikimedia.org/r/539305

Change 542993 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

Change 542993 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

I looked into the timezone abbreviation problem on de.wikt and I think it's a bug in the API I used to load the translations: T235978. I only tested it on https://ar.wikipedia.org/, where 'timezone-utc' is localised, and that worked. The final code in the DiscussionTools extension handles localised messages differently and shouldn't be affected by this bug.

  • Even minor variations in the timestamp format make the parser fail, e.g. deleting a leading 0 in the hour, a missing period after an abbreviated month, etc. These variations sometimes happen when a missing or wrong signature is fixed manually afterwards, and in old comments (it seems that back in 2006 the format was a little different, which of course isn't a real problem, but probably means the parser will break when the timestamp format is changed again).
  • When the name of the user comes after the timestamp, the parser doesn't find it (uncommon, happens mostly when you accidentally click twice to insert 8 tildes instead of 4, https://de.wikipedia.org/wiki/Diskussion:Stra%C3%9Fenbahn_Berlin#Pl%C3%A4ne_f%C3%BCr_Neubaustrecken)

For the most part, this is intentional. I wanted it to detect timestamps generated by typing '~~~~', but not any other similar-looking text.
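A minimal sketch of what such deliberately strict matching could look like, assuming the default English timestamp format (for example `16:15, 9 October 2019 (UTC)`). The pattern is hand-written for illustration and the names `STRICT_TIMESTAMP` and `parseStrictTimestamp` are hypothetical. A real implementation would need to derive the pattern from each wiki's date format and localised month and timezone names.

```javascript
const MONTHS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
];

// Two-digit hour and minute, day without leading zero required,
// full month name, four-digit year, literal "(UTC)".
const STRICT_TIMESTAMP = new RegExp(
  '(\\d{2}):(\\d{2}), (\\d{1,2}) (' + MONTHS.join('|') + ') (\\d{4}) \\(UTC\\)'
);

// Returns a Date for the first strict timestamp in the text, or null.
function parseStrictTimestamp(text) {
  const match = STRICT_TIMESTAMP.exec(text);
  if (!match) {
    return null;
  }
  const [, hour, minute, day, monthName, year] = match;
  return new Date(Date.UTC(
    Number(year),
    MONTHS.indexOf(monthName),
    Number(day),
    Number(hour),
    Number(minute)
  ));
}
```

With this pattern, `Alice (talk) 16:15, 9 October 2019 (UTC)` is recognised, while hand-edited variants such as `9:05, 9 October 2019 (UTC)` (no leading zero in the hour) or `16:15, 9 Oct 2019 (UTC)` (abbreviated month) are rejected, which matches the trade-off described above.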

If you had a signing bot or something that added thousands of "invalid" signatures, then maybe we should add a special case (I added a special case for the en.wp signing bot, which signs IPv6 users differently than MediaWiki), but otherwise I'd rather leave this for humans to fix. Maybe we should think about some tools for adding signatures in the correct format to unsigned comments.

I think we would like to handle historical discussions, but I don't think it's a priority right now, given that you're unlikely to want to reply to them today. The biggest obstacle here would be finding the exact date formats that were used historically in each language, and adding them to configuration somewhere.

  • It's not clear what will happen when a comment has 2 timestamps (E.g. "See my previous comment on <old timestamp> --User <new timestamp>", or "Comment -- User <first timestamp>, modified <second timestamp>"), or even 2 signatures.

Currently this is treated as two separate comments at the same indentation level.

Oh, and I just want to say that I admire the magic with perfectly chosen default values that makes the parser even correctly detect signatures by users that only link to their talk page and sign on that talk page. Even though this doesn't generate a real link, the parser works as expected.

Ha, I honestly didn't even think about handling that, it works out fine just by accident! :D You should probably credit James for making self-links use actual <a> tags in 2017 (T160480), which is why this works.