Serve files as UTF-8
Closed, Resolved, Public

Description

Author: avarab

Description:
It's annoying to have to manually switch settings when viewing attachments.


Version: unspecified
Severity: minor
URL: http://bugzilla.wikimedia.org/attachment.cgi?id=455&action=view

Details

Reference
bz1972

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:19 PM
bzimport set Reference to bz1972.
bzimport added a subscriber: Unknown Object (MLST).

avarab wrote:

Actually, they aren't being served with any specific character set; changing the
summary to reflect this.

""""
$ printf "GET /attachment.cgi?id=455&action=view HTTP/1.0\nHost: bugzilla.wikimedia.org\n\n"|nc bugzilla.wikimedia.org 80|head
HTTP/1.1 200 OK
Date: Wed, 27 Apr 2005 21:42:38 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.11
Content-disposition: inline; filename="LanguageCs_1.5.php"
Content-length: 104226
Connection: close
Content-Type: text/plain; name="LanguageCs_1.5.php"

<?php
/** Czech (česky)
"""

Regardless, it would be good to explicitly serve them as UTF-8.

river wrote:

i'm not clear what you want to do here.

do you want to set charset=UTF-8 for every (text) attachment served?

or, do you want to auto-convert text files to UTF-8 on upload, and then set
charset=UTF-8?

if the latter, this should probably be reported as a Bugzilla enhancement request.

The uploaded patches are already in UTF-8; they're just not being sent with a charset in the Content-type header.

Bug 609 describes the equivalent issue with bugmail.

river wrote:

yes, but you can't assume all files will be UTF-8, so you either send the wrong
encoding with some files, convert them as needed, or otherwise detect the
encoding to send.
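
One cheap form of the detection mentioned above is to check whether the bytes decode as strict UTF-8. The following is a minimal sketch, not anything Bugzilla is known to do; the function name `is_valid_utf8` is hypothetical. Passing the check does not prove a file was written as UTF-8 (pure ASCII always passes), but text in a legacy 8-bit encoding that contains non-ASCII characters usually fails it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "/** Czech (česky)"
    print(is_valid_utf8(sample.encode("utf-8")))
    print(is_valid_utf8(sample.encode("iso-8859-2")))
```

A server could use such a check to decide whether labelling an attachment as UTF-8 is safe, and fall back to sending no charset otherwise.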

avarab wrote:

(In reply to comment #2 and comment #4)

I want to set charset=utf-8 for every text attachment served.

Practically speaking, the only attachments we get with non-ASCII characters are
patches for Language files, and since we'll be going all-UTF-8 in 1.5, these are
going to be in UTF-8. There's really no need to build a 100% correct character
set detection system (and AFAIK such a thing isn't even possible); serving them
all as UTF-8 is good enough for our purposes.
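
The proposal above (label every text attachment as UTF-8) could look roughly like the following. This is only a sketch under stated assumptions: the helper name `with_utf8_charset` is hypothetical, Bugzilla itself is not written in Python, and the standard library's `email.message.Message` is used here purely to parse the header value.

```python
from email.message import Message


def with_utf8_charset(content_type: str) -> str:
    """Append charset=UTF-8 to a text/* Content-Type that has no charset.

    Hypothetical helper: non-text types and values that already carry a
    charset parameter are returned unchanged.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() != "text":
        return content_type
    if msg.get_content_charset() is not None:
        return content_type
    return content_type + "; charset=UTF-8"


if __name__ == "__main__":
    print(with_utf8_charset('text/plain; name="LanguageCs_1.5.php"'))
```

Leaving an existing charset untouched keeps the change limited to the case described in this report, where no charset is sent at all.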

zigger wrote:

Resolving as FIXED; this was fixed some time ago. The current Content-Type
response header for the example is:

Content-Type: text/plain; name="LanguageCs_1.5.php"; charset=UTF-8
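
To check a response like the ones quoted in this report without reading the headers by eye, the charset can be pulled out of the raw response head. This is a sketch only: `charset_from_response_head` is a hypothetical name, and it assumes the input is the status line followed by the header lines, as printed by the `nc` command earlier in the thread.

```python
from email.parser import Parser


def charset_from_response_head(head: str):
    """Return the lowercased charset from an HTTP response head, or None.

    Hypothetical helper: the first line (the status line) is dropped and
    the remaining lines are parsed as RFC 822 style headers.
    """
    _, _, header_block = head.partition("\n")
    msg = Parser().parsestr(header_block, headersonly=True)
    return msg.get_content_charset()


if __name__ == "__main__":
    before = (
        "HTTP/1.1 200 OK\n"
        "Connection: close\n"
        'Content-Type: text/plain; name="LanguageCs_1.5.php"\n'
        "\n"
    )
    after = before.replace('.php"\n', '.php"; charset=UTF-8\n')
    print(charset_from_response_head(before))
    print(charset_from_response_head(after))
```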