Page MenuHomePhabricator

Page.title(as_filename=True) don't remove "\"" (quotes) forbidden character
Closed, ResolvedPublicBUG REPORT

Description

Steps to Reproduce

Run the following code snippet:

import pywikibot
site = pywikibot.Site()
page = pywikibot.Page(site, "Are_You_Experienced")
img_links = list(page.imagelinks())
print(img_links[4].title(as_filename=True, with_ns=False))

Expected Result

'An_excerpt_from__Third_Stone_from_the_Sun__by_the_Jimi_Hendrix_Experience,_1967.ogg'

Actual Result

'An_excerpt_from_"Third_Stone_from_the_Sun"_by_the_Jimi_Hendrix_Experience,_1967.ogg'

Notes

Most likely, the problem is with this line, where we don't account for "\"" character:
https://github.com/wikimedia/pywikibot/blob/23dcbe6b12d6451e6c8e51d9af00c91a9c943f04/pywikibot/page.py#L377

System Info

Details

Related Changes in Gerrit:

Event Timeline

Xqt triaged this task as Medium priority.Oct 27 2019, 11:55 AM

This is OS dependent, on Linux it is acceptable.

Hello @HAKSOAT! I'm unsure if this task is ready to be fixed as @Mpaa had a valid argument above to be discussed before.

I just want to give my perspective on an issue in order for it to be better triaged:

  1. pywikibot claims to support both Windows and Linux systems, so functionality must have a common denominator of "what is acceptable on all systems". If pywikibot doesn't support Windows anymore, probably we will need to remove the installation instructions for Windows to avoid confusion.
  2. if we don't support Windows anymore, it's still beneficial to restrict filenames to the POSIX portable filename character set, which will guarantee that filename is portable between all Posix-like systems. For example, Kaggle, which is arguably one of the most popular platforms for data storage & processing, relies on the above-mentioned restrictions. Just a few days ago it was explained to me, that Kaggle does not support filenames not following POSIX portable filename character set. And probably, there are a lot of other use cases when this becomes unportable.

To sum up, it was a misleading experience for me to have an API flag .title(as_filename=True), which actually produces valid filepath only in some specific cases. In the end, I end up having a code equivalent to convert_to_portable_filename(img.title(as_filename=True)), which doesn't seem right and also wasted me a lot of time in the process of realizing that I need to implement this and re-download data.

So it would be great to either 1) guarantee that filename is valid under any circumstances (i.e. under POSIX portable filename character set) 2) or add another option e.g. .title(as_portable_filename=True) 3) or remove .title(as_filename=True) flag completely to avoid confusions and make it clear that user is required to convert titles to filenames on their own

Thanks,
Oleh

I'm being bold and removing the good first task tag; it can be put back when there is more clarity on the next steps :)

and also wasted me a lot of time in the process of realizing that I need to implement this and re-download data.

Obviously developers and maintainers are not happy when bugs slip-through and users are annoyed. Anyhow, it does not seem right to me that you are complaining of having "wasted" your time, when people are donating their free time to carry on this project.

and also wasted me a lot of time in the process of realizing that I need to implement this and re-download data.

Obviously developers and maintainers are not happy when bugs slip-through and users are annoyed. Anyhow, it does not seem right to me that you are complaining of having "wasted" your time, when people are donating their free time to carry on this project.

I apologise if that phrasing looked offensive to you. I didn't have any intent to complain or to blame I just wanted to support my argument with neutral language and a real-world example. I am very grateful to people who donate their time to the project, and because of that, I spent some time writing a detailed description of the problem. That is, to help properly prioritise backlog so that whatever time someone decides to donate, it would be used for the most important things and make the biggest impact.

Sorry again,
Oleh

Change 570867 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [bugfix] Fix as_filename for Windows

https://gerrit.wikimedia.org/r/570867

I did some research and it's correct: Only one Windows illegal filename character (") has been missing from our list. I fixed it then with detailed explanation.

Dvorapa moved this task from Backlog to Needs Review on the Pywikibot board.
Dvorapa changed the subtype of this task from "Task" to "Bug Report".

In my opinion, POSIX compliant filenames would be too much there as modern OSes and software can handle Unicode filenames just well

" is the last character to be added to the list. Be aware, that Pywikibot's aim is not to produce filenames adjusted for your OS and your filesystem needs - this is your own responsibility. So if your OS or filesystem has got some additional rules, you would still have to use convert_to_acceptable_filename() or convert_to_portable_filename() anyway

Change 570867 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Fix as_filename for Windows

https://gerrit.wikimedia.org/r/570867

@OlehOnyshchcak, thanks for the clarification. Appreciated!