archivebot.py should recognize template-based signatures that implicitly use a fixed timezone
Open, Needs Triage, Public

Description

Signatures are sometimes added using templates which are not (and should not be) substituted, instead of the ~~~~ syntax, typically when they are intended to be multilingual: https://commons.wikimedia.org/w/index.php?oldid=250408757&diff=250408884 Archivebot should support these template-based signatures. Currently template-based signatures are ignored by the bot and treated as if there is no signature.

To avoid hardcoding all kinds of signature templates used across different wikis, we could perhaps use rendered HTML. Can Parsoid be used to convert HTML into wikitext to get a "normal"-looking signature text out of a template-based signature? Or perhaps the other way around, centralizing all of the timestamp parsing for HTML instead of wikitext? The former might be easier to implement because most of the existing archivebot can be reused, while the latter approach might be more stable going forward.
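As a rough illustration of the second idea (parsing timestamps from rendered HTML rather than from wikitext), here is a minimal sketch using only the standard library. The HTML fragment is a hypothetical rendering of a signature template, not actual Commons output, and the regular expression is a simplified English-only stand-in for what TimeStripper does. How the rendered HTML would be fetched (Parsoid, the parse API, etc.) is left open.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rendered_text(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Hypothetical rendering of {{unsigned2|18:32, 5 July 2017|Mrspeedybug}};
# the real markup produced on Commons may differ.
SAMPLE_HTML = (
    '<small>Preceding unsigned comment added by '
    '<a href="/wiki/User:Mrspeedybug">Mrspeedybug</a> '
    '(<a href="/wiki/User_talk:Mrspeedybug">talk</a>) '
    '18:32, 5 July 2017 (UTC)</small>'
)

# Simplified, English-only timestamp pattern (an assumption for this sketch).
TIMESTAMP = re.compile(r'\d{2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4} \(UTC\)')

text = rendered_text(SAMPLE_HTML)
match = TIMESTAMP.search(text)
signature = match.group(0) if match else None
```

Once the tags are stripped, the timezone that the template prints by default is part of the text, so the signature looks like a normal one.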

Event Timeline

whym created this task. Jul 8 2017, 2:36 AM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Jul 8 2017, 2:36 AM

It simply expects a timezone:

>>> from pywikibot.textlib import TimeStripper
>>> import pywikibot
>>> site = pywikibot.Site('commons', 'commons')
>>> ts = TimeStripper(site)
>>> ts.timestripper('20:43, 2 July 2017')
>>> ts.timestripper('20:43, 2 July 2017 (UTC)')
Timestamp(2017, 7, 2, 20, 43, tzinfo=tzoneFixedOffset(0, UTC))
>>> ts.timestripper('{{20:43, 2 July 2017 (UTC)}}')
Timestamp(2017, 7, 2, 20, 43, tzinfo=tzoneFixedOffset(0, UTC))
>>> ts.timestripper('{{20:43, 2 July 2017}}')
Dvorapa added a subscriber: Dvorapa.

It simply expects a timezone:

Yes, but in this example SignBot uses {{unsigned2|18:32, 5 July 2017|Mrspeedybug}} that prints (UTC) by default.
Do we need to code an exception for this template on Commons?

Framawiki renamed this task from "archivebot should recognize template-based signatures" to "archivebot.py should recognize template-based signatures that implicitly use a fixed timezone". Jul 9 2017, 9:21 PM
Mpaa added a subscriber: Mpaa. Edited Jul 9 2017, 9:37 PM

IMO, no.
Each wiki might use a different (set of) templates.
Some other workaround should be found.
I will see if I can figure something out.

Each wiki might use a different (set of) templates.

When I wrote this I was thinking about something like the list of per-site category_redirect_templates that exists in pywikibot/families/wikipedia_family.py.
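A per-site table along those lines might look like the sketch below. Everything here is hypothetical: the SIGNATURE_TEMPLATES name, its layout, and the assumption that the first positional parameter of the template is the timestamp. It rewrites the wikitext so that the implied timezone becomes explicit before the text is handed to the existing parser.

```python
import re

# Hypothetical per-site table, in the spirit of category_redirect_templates.
# Maps a site code to {template name: timezone label its rendering implies}.
SIGNATURE_TEMPLATES = {
    'commons': {'unsigned2': 'UTC'},
}


def add_implicit_timezone(text, site_code):
    """Make the timezone explicit in known signature templates.

    Assumes the first positional parameter of each listed template is the
    timestamp. Text for sites or templates not in the table is unchanged.
    """
    for name, tz in SIGNATURE_TEMPLATES.get(site_code, {}).items():
        pattern = re.compile(
            r'(\{\{\s*' + re.escape(name) + r'\s*\|\s*)([^|{}]+?)(\s*[|}])',
            re.IGNORECASE)

        def repl(match, tz=tz):
            stamp = match.group(2)
            if stamp.endswith(')'):
                # A timezone is already written out; leave it alone.
                return match.group(0)
            return '{}{} ({}){}'.format(
                match.group(1), stamp, tz, match.group(3))

        text = pattern.sub(repl, text)
    return text


fixed = add_implicit_timezone(
    '{{unsigned2|18:32, 5 July 2017|Mrspeedybug}}', 'commons')
```

The result still contains the braces, which the examples above suggest TimeStripper tolerates as long as the timezone is present.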

whym added a comment. Edited Jul 9 2017, 11:58 PM

Not sure if we want to remove the timezone requirement from parsing. We want to discriminate timestamps from other mentions of date and time. An explicitly written timezone symbol is an indication that it's probably not a part of a normal sentence.

TimeStripper is flexible (sometimes too flexible) and accepts a few characters in between. E.g.

>>> ts.timestripper('We will meet on 1 Jan 2018 at 12:30 (UTC)')
Timestamp(2018, 1, 1, 12, 30, tzinfo=tzoneFixedOffset(0, UTC))

The parsing itself has always been somewhat heuristic (there are at least some false positives), but if we simply drop the timezone requirement, I fear that false positives will increase. So if we are to do it, the requirement will have to be dropped only for a few selected templates.
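To illustrate the concern with a toy example: the two patterns below are simplified English-only stand-ins, not the expressions TimeStripper actually uses. The one that requires a timezone skips a date mentioned in ordinary prose, while the one without it picks that date up as if it were a signature.

```python
import re

# Simplified stand-ins for the real parser (assumptions for this sketch).
DATE = r'\d{1,2}:\d{2}, \d{1,2} [A-Z][a-z]+ \d{4}'
WITH_TZ = re.compile(DATE + r' \(UTC\)')
WITHOUT_TZ = re.compile(DATE)

signed = 'Thanks for the fix. Example 20:43, 2 July 2017 (UTC)'
prose = 'The meeting was moved to 20:43, 2 July 2017 after a vote.'

# Requiring the timezone: only the real signature matches.
strict_hits = [bool(WITH_TZ.search(s)) for s in (signed, prose)]

# Dropping the requirement: the prose sentence matches too (false positive).
loose_hits = [bool(WITHOUT_TZ.search(s)) for s in (signed, prose)]
```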

Mpaa added a comment. Jul 29 2017, 7:42 PM

I do not think we should remove timezone from parsing.
Without it, we could not reliably compute the difference between a Timestamp and now().
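The same issue shows up with plain datetime objects from the standard library, used here instead of pywikibot's Timestamp so the sketch is self-contained. The "now" value is fixed for reproducibility.

```python
from datetime import datetime, timezone

# A fixed "now" so the result is reproducible (an assumption for this sketch).
now = datetime(2017, 7, 29, 19, 42, tzinfo=timezone.utc)

aware = datetime(2017, 7, 2, 20, 43, tzinfo=timezone.utc)  # timezone known
naive = datetime(2017, 7, 2, 20, 43)                       # timezone unknown

# With a timezone, the age of the comment is well defined.
age = now - aware

# Without one, Python refuses to subtract aware and naive datetimes.
try:
    now - naive
    naive_ok = True
except TypeError:
    naive_ok = False
```

Guessing a timezone for the naive value would make the subtraction work, but the resulting age could be off by several hours.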