Page MenuHomePhabricator

Page.title(as_filename=True) don't remove "\"" (quotes) forbidden character
Open, MediumPublic

Description

Steps to Reproduce

Run the following code snippet:

import pywikibot
site = pywikibot.Site()
page = pywikibot.Page(site, "Are_You_Experienced")
img_links = list(page.imagelinks())
print(img_links[4].title(as_filename=True, with_ns=False))

Expected Result

'An_excerpt_from__Third_Stone_from_the_Sun__by_the_Jimi_Hendrix_Experience,_1967.ogg'

Actual Result

'An_excerpt_from_"Third_Stone_from_the_Sun"_by_the_Jimi_Hendrix_Experience,_1967.ogg'

Notes

Most likely, the problem is with this line, where we don't account for "\"" character:
https://github.com/wikimedia/pywikibot/blob/23dcbe6b12d6451e6c8e51d9af00c91a9c943f04/pywikibot/page.py#L377

System Info

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · View Herald TranscriptOct 27 2019, 11:32 AM
Xqt triaged this task as Medium priority.Oct 27 2019, 11:55 AM
Mpaa added a subscriber: Mpaa.Oct 27 2019, 6:07 PM

This is OS dependent, on Linux it is acceptable.

HAKSOAT added a subscriber: HAKSOAT.Dec 3 2019, 7:10 PM

I'd like to work on this.

Hello @HAKSOAT! I'm unsure if this task is ready to be fixed as @Mpaa had a valid argument above to be discussed before.

I just want to give my perspective on an issue in order for it to be better triaged:

  1. pywikibot claims to support both Windows and Linux systems, so functionality must have a common denominator of "what is acceptable on all systems". If pywikibot doesn't support Windows anymore, probably we will need to remove the installation instructions for Windows to avoid confusion.
  2. if we don't support Windows anymore, it's still beneficial to restrict filenames to the POSIX portable filename character set, which will guarantee that filename is portable between all Posix-like systems. For example, Kaggle, which is arguably one of the most popular platforms for data storage & processing, relies on the above-mentioned restrictions. Just a few days ago it was explained to me, that Kaggle does not support filenames not following POSIX portable filename character set. And probably, there are a lot of other use cases when this becomes unportable.

To sum up, it was a misleading experience for me to have an API flag .title(as_filename=True), which actually produces valid filepath only in some specific cases. In the end, I end up having a code equivalent to convert_to_portable_filename(img.title(as_filename=True)), which doesn't seem right and also wasted me a lot of time in the process of realizing that I need to implement this and re-download data.

So it would be great to either 1) guarantee that filename is valid under any circumstances (i.e. under POSIX portable filename character set) 2) or add another option e.g. .title(as_portable_filename=True) 3) or remove .title(as_filename=True) flag completely to avoid confusions and make it clear that user is required to convert titles to filenames on their own

Thanks,
Oleh