Write a client-side parser for signatures (username & timestamp)
Open, Needs Triage, Public

Description

Based on engineering discussion

  • The timestamp will be identified using wiki-specific date format information
  • The username will be identified as the owner of the first user, user_talk, or contributions page link to the left of the timestamp
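
The second rule can be sketched roughly as follows. This is only an illustration, not the code from the patch: the item shape (`type`, `title`), the English namespace prefixes, and the function names are all assumptions, and stopping at an earlier timestamp is a guess at sensible behaviour rather than something stated in the task.

```javascript
// Assumed English prefixes; a real wiki would supply its own namespace names.
const USER_PREFIXES = ['User:', 'User talk:'];
const CONTRIBS_PREFIX = 'Special:Contributions/';

// Return the username a page title belongs to, or null if the title is not
// a user page, user talk page, or contributions page.
function ownerOfTitle(title) {
  if (title.startsWith(CONTRIBS_PREFIX)) {
    return title.slice(CONTRIBS_PREFIX.length) || null;
  }
  for (const prefix of USER_PREFIXES) {
    if (title.startsWith(prefix)) {
      // A subpage such as "User:Example/sandbox" still belongs to "Example".
      return title.slice(prefix.length).split('/')[0] || null;
    }
  }
  return null;
}

// Walk left from the timestamp and return the owner of the first matching link.
// Hypothetical item shape: { type: 'text' | 'link' | 'timestamp', title?: string }
function findSignatureUser(items, timestampIndex) {
  for (let i = timestampIndex - 1; i >= 0; i--) {
    const item = items[i];
    // Assumption: never search past the timestamp of an earlier comment.
    if (item.type === 'timestamp') break;
    if (item.type === 'link') {
      const owner = ownerOfTitle(item.title);
      if (owner !== null) return owner;
    }
  }
  return null;
}

const line = [
  { type: 'text' },
  { type: 'link', title: 'Wikipedia:Village pump' },
  { type: 'link', title: 'User:Example' },
  { type: 'link', title: 'User talk:Example' },
  { type: 'timestamp' },
];
console.log(findSignatureUser(line, 4)); // prints "Example"
```

Searching right-to-left means decorative links earlier in the signature are never considered unless nothing closer to the timestamp matches.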

Event Timeline

Esanders created this task. Sep 12 2019, 8:27 PM
Restricted Application added a subscriber: Aklapper. Sep 12 2019, 8:27 PM

Change 539305 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] [DO NOT MERGE] Client-side parser for signatures

https://gerrit.wikimedia.org/r/539305

I wrote it.

I put it on Gerrit: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/539305. It doesn't belong in this repo, but the right repo doesn't exist yet, and I want to let people comment inline easily, so there it is.

Usage: copy-paste signaturedetector.js into your browser console on a discussion page like…

The parser detects timestamps on the page (supports every language, not just English) and signatures that precede them. For demonstration, it then marks them up with garish colours (the real tool would produce "Reply" buttons nearby instead).

For example: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Can't_follow_links_to_.pdf's

Known problems:

  • Won't detect a signature if the timestamp is wrapped in text formatting, e.g. the signatures of Ellsworth on that page
  • Won't mark up parts of the signature before the userpage link, e.g. the "—" in signatures of Xaosflux on that page
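
To illustrate how timestamp detection can work in languages other than English, here is a minimal sketch that turns a date format string into a regular expression. It is not taken from signaturedetector.js. The function names are made up, only a handful of PHP-style format codes are handled, and the two sample formats, month lists, and timezone abbreviations are assumed inputs rather than values read from any wiki's configuration.

```javascript
// Escape characters that are special in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Translate a small subset of PHP-style date format codes into a RegExp.
// Any character that is not one of the handled codes is matched literally.
function timestampRegExp(format, monthNames, tzAbbr) {
  const months = '(?:' + monthNames.map(escapeRegExp).join('|') + ')';
  let source = '';
  for (const ch of format) {
    switch (ch) {
      case 'H': // hour, two digits
      case 'i': // minute, two digits
        source += '\\d{2}';
        break;
      case 'j': // day of month without leading zero
        source += '\\d{1,2}';
        break;
      case 'F': // full month name in the wiki's language
        source += months;
        break;
      case 'Y': // four-digit year
        source += '\\d{4}';
        break;
      default:
        source += escapeRegExp(ch);
    }
  }
  // The timezone abbreviation follows in parentheses, e.g. "(UTC)".
  return new RegExp(source + ' \\(' + escapeRegExp(tzAbbr) + '\\)');
}

const EN_MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
  'August', 'September', 'October', 'November', 'December'];
const FR_MONTHS = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet',
  'août', 'septembre', 'octobre', 'novembre', 'décembre'];

const en = timestampRegExp('H:i, j F Y', EN_MONTHS, 'UTC');
const fr = timestampRegExp('j F Y à H:i', FR_MONTHS, 'CEST');

console.log(en.test('20:27, 12 September 2019 (UTC)'));   // true
console.log(fr.test('12 septembre 2019 à 20:27 (CEST)')); // true
console.log(en.test('12 septembre 2019 à 20:27 (CEST)')); // false
```

A full implementation would need many more format codes, escaped literals in the format string, and localized digits; the sketch only shows the general idea of deriving the pattern from per-wiki data.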

Alsee added a subscriber: Alsee. Sep 29 2019, 1:41 AM

According to Manual:Echo:

The signature must contain a plain wiki link ([[ ]]) to the user's page, user talk page, or contributions page, on the local wiki

I will update the task description here to match, adding 'contributions page' to the list.

Alsee updated the task description. Sep 29 2019, 1:42 AM

Thanks @Alsee, I actually included that in my code. In particular signatures of IP users have a link to their contributions rather than user page.

ppelberg added a comment. Edited Oct 3 2019, 12:39 AM

Great work, @matmarex. Documenting a question that came up during today's planning meeting, which stemmed from the limitations you documented here...

Known problems:

  • Won't detect a signature if the timestamp is wrapped in text formatting, e.g. the signatures of Ellsworth on that page
  • Won't mark up parts of the signature before the userpage link, e.g. the "—" in signatures of Xaosflux on that page

Outside of the two scenarios mentioned above, are there any other edge cases you are aware of?

Related to these edge cases, we also talked during today's meeting about the idea of releasing a gadget that would enable contributors to help find scenarios where the parser – as it's currently written – does not detect signatures...do you have a sense for what would be involved in building something like this?

Outside of the two scenarios mentioned above, are there any other edge cases you are aware of?

None that I know of. (Although there are more than just these two people with similar signatures, I just had these examples handy.)

Related to these edge cases, we also talked during today's meeting about the idea of releasing a gadget that would enable contributors to help find scenarios where the parser – as it's currently written – does not detect signatures...do you have a sense for what would be involved in building something like this?

It’d be easy since I coincidentally wrote the demo script so that it’s a standalone file. But I’m not sure if it’s worth the effort to document it, if we release one. I can do it though.

Some edge cases that came up while I was writing reply-link:

  • A comment's wikitext is copied-and-pasted inside another user's comment in an attempt to quote it.
  • People will modify their comment, then leave a new "edited ~~~~" combo on the same line as the comment to indicate that. Rarely, the second signature will be from another user (e.g. when oversightable content is redacted by an OS). Note that the latter will produce the same markup as the previous example, but has the opposite semantics.
  • The unsigned or undated templates are used, but then someone comes in and tries to fill in a proper signature.
  • Signatures with bizarre capitalization ("[[uSeR:" was an *actual* example I saw one day) or bizarre formatting (e.g. extra whitespace)
  • Manual signatures with malformed timestamps ("UTC)" instead of "(UTC)") - rare but really ruins your day

There are many more - those are just the ones that I thought of immediately.

Change 539305 abandoned by Bartosz Dziewoński:
[DO NOT MERGE] Client-side parser for timestamps, signatures, comments and threads

Reason:
Moving to mediawiki/extensions/DiscussionTools repository

https://gerrit.wikimedia.org/r/539305

Change 542993 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

Change 542993 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Detect and parse timestamps, signatures, comments and threads

https://gerrit.wikimedia.org/r/542993

Some edge cases that came up while I was writing reply-link:

  • A comment's wikitext is copied-and-pasted inside another user's comment in an attempt to quote it.
  • People will modify their comment, then leave a new "edited ~~~~" combo on the same line as the comment to indicate that. Rarely, the second signature will be from another user (e.g. when oversightable content is redacted by an OS). Note that the latter will produce the same markup as the previous example, but has the opposite semantics.

Currently this is treated as two separate comments.

I'm not really sure how we should handle quoted comments. I think we have no way to guess which of the two situations we're dealing with. I was wondering if ignoring all signatures in <blockquote> tags would be a reasonable way to allow users to fix comments which we'd otherwise handle incorrectly.
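
If the blockquote idea were adopted, the check could look something like the sketch below. It is purely hypothetical: the node shape (`tag`, `parent`, `id`) stands in for DOM elements so the example runs outside a browser, and the function names are invented for illustration.

```javascript
// Hypothetical minimal node: { tag: string, parent: node | null, id?: string }.
// Return true if any ancestor of the node has the given tag name.
function isInsideTag(node, tagName) {
  for (let n = node.parent; n !== null; n = n.parent) {
    if (n.tag === tagName) return true;
  }
  return false;
}

// Drop every detected timestamp node that sits inside a <blockquote>.
function signaturesToKeep(timestampNodes) {
  return timestampNodes.filter((node) => !isInsideTag(node, 'blockquote'));
}

const root = { tag: 'div', parent: null };
const quote = { tag: 'blockquote', parent: root };
const quotedSig = { tag: 'span', parent: quote, id: 'quoted' };
const ownSig = { tag: 'span', parent: root, id: 'own' };

console.log(signaturesToKeep([quotedSig, ownSig]).map((n) => n.id)); // [ 'own' ]
```

On real DOM elements the same ancestor test could be written with `element.closest('blockquote')`.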

  • The unsigned or undated templates are used, but then someone comes in and tries to fill in a proper signature.
  • Signatures with bizarre capitalization ("[[uSeR:" was an *actual* example I saw one day) or bizarre formatting (e.g. extra whitespace)
  • Manual signatures with malformed timestamps ("UTC)" instead of "(UTC)") - rare but really ruins your day

My parser is very strict and only accepts signatures/timestamps in the exact format that is used when typing '~~~~'. If someone carefully uses the same format (e.g. copy-pastes from a history page), or if the templates use the same format, it'll be fine. Otherwise, the comment will probably be treated as part of the next comment in that section, and a human will have to fix the markup somehow. (This is not necessarily a big problem, there just won't be a "Reply" button for the malformed comment.)
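
The effect of that strictness on comment boundaries can be shown with a small sketch. It is a simplification and not the actual parser: it works on lines of rendered text instead of HTML nodes, the pattern is an assumed English-only form of the '~~~~' timestamp, and `groupIntoComments` is a made-up name.

```javascript
// Assumed English-only pattern, e.g. "20:27, 12 September 2019 (UTC)".
const MONTHS = 'January|February|March|April|May|June|July|August|' +
  'September|October|November|December';
const TIMESTAMP = new RegExp(
  `\\d{2}:\\d{2}, \\d{1,2} (?:${MONTHS}) \\d{4} \\(UTC\\)`
);

// Close a comment at every line that contains a well-formed timestamp.
// Lines without one stay pending and end up in the next comment.
function groupIntoComments(lines) {
  const comments = [];
  let pending = [];
  for (const line of lines) {
    pending.push(line);
    if (TIMESTAMP.test(line)) {
      comments.push(pending);
      pending = [];
    }
  }
  // Whatever is left has no timestamp after it and belongs to no comment.
  return { comments, unsigned: pending };
}

const section = [
  'First point. A (talk) 10:00, 1 October 2019 (UTC)',
  'Hand-typed reply. B (talk) 10:05, 1 October 2019 UTC)', // missing "("
  'Third reply. C (talk) 10:10, 1 October 2019 (UTC)',
];

const result = groupIntoComments(section);
console.log(result.comments.length);    // 2
console.log(result.comments[1].length); // 2: B's line was merged into C's comment
```

Here B's malformed timestamp is not recognized, so B's text is absorbed into the comment that ends with C's signature.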

Messy links like "[[uSeR:" will actually not be a problem, because our parser runs on the HTML output rather than the wikitext, so the link targets are normalized. (This also makes it work correctly on wikis/languages with many aliases for the 'User' namespace.)
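
As a rough sketch of why the HTML output is easier to work with: the link's `href` already carries the normalized title, so the parser only has to read it back. The code below is an assumption-laden illustration, not the real implementation. `BASE` and `ARTICLE_PATH` are placeholder values, `titleFromHref` is a hypothetical helper, and the `title=` query form is included on the assumption that links to nonexistent pages use it.

```javascript
// Placeholder values; a real script would read these from the wiki's config.
const BASE = 'https://wiki.example';
const ARTICLE_PATH = '/wiki/';

// Turn a link's href into a page title with spaces, or null if the link
// does not point at a page on the local wiki.
function titleFromHref(href) {
  let url;
  try {
    url = new URL(href, BASE);
  } catch (e) {
    return null;
  }
  // Links to other sites (including interwiki targets) are not local pages.
  if (url.origin !== BASE) return null;

  // Form 1: /w/index.php?title=User:Foo&action=edit&redlink=1
  let raw = url.searchParams.get('title');
  if (raw === null) {
    // Form 2: /wiki/User:Foo_bar
    if (!url.pathname.startsWith(ARTICLE_PATH)) return null;
    try {
      raw = decodeURIComponent(url.pathname.slice(ARTICLE_PATH.length));
    } catch (e) {
      return null; // malformed percent-encoding
    }
  }
  return raw.replace(/_/g, ' ');
}

// Wikitext such as [[uSeR:foo bar]] is expected to render with a normalized href.
console.log(titleFromHref('/wiki/User:Foo_bar')); // "User:Foo bar"
console.log(titleFromHref('/w/index.php?title=User:Foo&action=edit&redlink=1')); // "User:Foo"
console.log(titleFromHref('https://other.example/wiki/User:Foo')); // null
```

The returned title can then be checked against the wiki's own namespace names, which is where aliases stop mattering because the rendered link uses a single canonical form.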

This seems like something that could be easily broken by signature customization and possibly page formatting. I imagine matching signature output would work reasonably well if users are cooperative, but that may not always be the case, and it would be preferable to avoid things that don't work with what is currently allowed.

I suggested during the community consultation that it could be better to modify signature output to include a marker (e.g. an extension tag), such that the parser would just look for the marker instead of trying to match the signature's generated HTML. Would this be technically infeasible (e.g. because this would break other MediaWiki sites in certain ways)? Would the benefit of doing so be unjustifiable due to the time and effort needed to make the change?

Alsee added a comment. Tue, Nov 5, 10:29 PM

The community already has guidelines for custom signatures. (Or at least Enwiki does, and I expect any non-minimal wiki will too.) I believe our current rules already require at least one link to user/user_talk/user-contributions, and if we don't already have a rule about not screwing up the timestamp then that rule would certainly be added as soon as anything breaks.

Our sig rule is basically "do anything you want, so long as it doesn't cause inconvenience to other editors".

@Jc86035 We're considering this, the discussion is at T230653: Use a parser function to encapsulate signatures. Please add a comment there if you have an opinion :)

To answer your question here briefly – it is technically feasible; however, adding extra markup would make the discussion comments less readable in the normal wikitext editing mode, and we're trying hard not to make any negative changes for users who will want to keep using the system they are used to. Changing the markup is not required and we don't want anyone to oppose the discussion improvements because of this.

The current code is based mostly on identifying timestamps rather than signatures, and so it's quite resilient against funky customizations. The only parsing of signatures we do is looking for a link to user/usertalk/contributions page preceding the timestamp. Per @Alsee this is indeed codified on en.wp (WP:SIGLINK), and although other wikis don't always have a hard rule like that, it seems like a common sense requirement to me.

The problematic cases I've seen so far are when the user customizes the timestamp as well (so they don't sign with ~~~~, but copy-paste some snippet or subst some template that generates their signature). Those comments will behave similarly to unsigned comments, and won't get a "Reply" button. I am hoping that other editors will notice this and ask them to stop doing that… This problem wouldn't be fixed by changing the signature output, either, since this is not using the signature markup at all, as far as I can tell.

Jc86035 added a comment. Edited Wed, Nov 6, 9:53 AM

Thanks for the link. I hadn't noticed that task.

It doesn't logically make sense to me to allow part of the software to be broken by something that is (implicitly) allowed by another part of the software. If I were implementing this without a parser function or similar, I would either want to restrict signature customizations (similar to T140606) so as to only allow those which could be matched by the signature parser, or make the parser versatile enough to detect any allowed signature.

(Anecdotally, over the years I've noticed a small number of Meta users who use an interwiki link to their Wikipedia user page and/or user talk page in their Meta signature. I'm having trouble finding an example of this which isn't a copy-paste signature, but the software does allow me to set my signature in this way. Meta does not have any policies or guidelines relating to signatures.)

It doesn't logically make sense to me to allow part of the software to be broken by something that is (implicitly) allowed by another part of the software. If I were implementing this without a parser function or similar, I would either want to restrict signature customizations (similar to T140606) so as to only allow those which could be matched by the signature parser, or make the parser versatile enough to detect any allowed signature.

Yeah, this is a good point.

We should probably do something about this as a part of this project. Mismatched HTML also messes up our parser, so thank you for finding T140606: Check user signature for linter errors. I couldn't find a task specifically about requiring the links, so I filed T237700: Require user signatures to contain a link to their user page, talk page or contributions. I hope that we'll find time to look into these two.