
archivebot.py doesn't support unicode month names
Closed, ResolvedPublic

Description

Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1482/
Reported by: Anonymous user
Created on: 2012-06-30 17:50:30
Subject: archivebot.py doesn't support unicode month names
Original description:
archivebot.py doesn't work well with languages such as Turkish, in which some month names contain non-ASCII (Unicode) characters. Namely:

2 Şubat
4 Mayıs
8 Ağustos
9 Eylül
11 Kasım
12 Aralık


Version: unspecified
Severity: normal
See Also:
https://sourceforge.net/p/pywikipediabot/bugs/1482

Details

Reference
bz55186

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:25 AM
bzimport set Reference to bz55186.

Pywikipedia [http] trunk/pywikipedia (r10432, 2012/06/30, 15:47:55)
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = True
unicode test: ok

The command line I used was: archivebot.py -l turkish Archive/config

Could you give us a traceback or further information about this bug? The bot uses the month names from the MediaWiki messages, and I don't know what effect the locale setting has. Could you try running the bot without the --locale=tr setting?

Sure. There is no traceback for me to provide, though, since the code does run; it just ignores some threads.

Run1: archivebot.py -l turkish Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Run2: archivebot.py Archive/config
Fetching template transclusions...
Getting references to [[Sablon:Archive/config]] via API...
Processing [[tr:Kullanici mesaj:??????]]
3 Threads found on [[tr:Kullanici mesaj:??????]]
Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
Processing 3 threads
There are only 0 Threads. Skipping

Note that the Turkish character ı is displayed as i in the CMD window (I run the code on Windows). The ?????? stands for the title of my user talk page, http://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1_mesaj:%E3%81%A8%E3%81%82%E3%82%8B%E7%99%BD%E3%81%84%E7%8C%AB, which CMD cannot display because it contains Unicode characters.

Oh, when I initially ran the bot without -l turkish, it ignored all threads. Since it has already archived 3 of the 6 initial threads, it still reports 0 threads, because it cannot see the ones with the month name "Mayıs".

Looked into this a bit.

I've managed to isolate the problem to around line 237, where all the txt2timestamp calls are. It seems that all of them raise ValueError.

Tried this:
import unicodedata

# at line 237: decompose the matched text and drop combining marks (category 'Mn')
_TM = ''.join(c for c in unicodedata.normalize('NFD', TM.group(0)) if unicodedata.category(c) != 'Mn')

and then called txt2timestamp with _TM instead of TM.group(0).
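
To see what this normalization step does to the month names from the report, here is a minimal standalone sketch (written for Python 3). The helper name strip_combining_marks is hypothetical and not part of archivebot.py; it only isolates the NFD-and-filter expression used above.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose text (NFD) and drop combining marks (category 'Mn').

    Hypothetical helper for illustration; not part of archivebot.py.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


# Turkish month names listed in the original report.
months = ['Şubat', 'Mayıs', 'Ağustos', 'Eylül', 'Kasım', 'Aralık']

for name in months:
    print(name, '->', strip_combining_marks(name))
```

One caveat: the dotless ı (U+0131) has no canonical decomposition, so names such as Mayıs, Kasım and Aralık pass through this step unchanged, while Şubat, Ağustos and Eylül lose their diacritics. Whether that is enough for the archiving to work depends on what txt2timestamp compares the text against, which is not shown in this thread.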