Add charset=utf-8 by default to lighttpd
Closed, Resolved · Public

Description

Our documentation recommends manually overriding every single MIME type to fix mojibake issues. This is nonsensical: Wikimedia has been UTF-8 by default for 10+ years, and all tools assume a full UTF-8 pipeline. Additionally, Chrome 55 (Dec 2016) removed the Character Encoding switching option. Autodetection may fail by falling back to the OS's locale, and statistical heuristics may fail on edge cases.

Since lighttpd has to map extensions to MIME types anyway, it makes sense to set the charset there from the start.
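
For illustration, a minimal Python sketch of the mojibake being described: the same UTF-8 bytes read correctly only when the client decodes them as UTF-8. The helper name `decode_as` is hypothetical, and latin-1 stands in for whatever fallback a browser might pick when no charset is sent.

```python
# UTF-8 bytes as a tool would write them to disk.
raw = "Bühler".encode("utf-8")


def decode_as(data, charset):
    """Decode bytes the way a client would if it assumed `charset`."""
    return data.decode(charset)


# Wrong guess (assumed fallback: latin-1) versus the declared charset.
print(decode_as(raw, "latin-1"))
print(decode_as(raw, "utf-8"))
```

Sending `charset=utf-8` in the Content-Type header removes the guess entirely.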

Details

Related Gerrit Patches:
operations/software/tools-webservice : master · Set custom mime-types

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 19 2017, 5:52 PM

The MIME type mapping is assigned via lighttpdwebservice.py#L44:

include_shell "/usr/share/lighttpd/create-mime.assign.pl"

which has these contents:

#06:01:51 0 ✓ zhuyifei1999@tools-webgrid-lighttpd-1401: ~$ cat /usr/share/lighttpd/create-mime.assign.pl
#!/usr/bin/perl -w
use strict;
open MIMETYPES, "/etc/mime.types" or exit;
print "mimetype.assign = (\n";
my %extensions;
while(<MIMETYPES>) {
  chomp;
  s/\#.*//;
  next if /^\w*$/;
  if(/^([a-z0-9\/+-.]+)\s+((?:[a-z0-9.+-]+[ ]?)+)$/) {
    foreach(split / /, $2) {
      # mime.types can have same extension for different
      # mime types
      next if $extensions{$_};
      $extensions{$_} = 1;
      print "\".$_\" => \"$1\",\n";
    }
  }
}
print ")\n";

Unfortunately, /etc/mime.types does not record whether a MIME type is binary or text:

06:05:54 0 ✓ zhuyifei1999@tools-bastion-02: ~$ grep -P 'javascript|css|html|txt|jpg|png|svg' /etc/mime.types
application/javascript				js
application/xhtml+xml				xhtml xht
application/vnd.ericsson.quickcall
application/vnd.pwg-xhtml-print+xml
#application/x-httpd-eruby			rhtml
#application/x-httpd-php			phtml pht php
image/jp2					jp2 jpg2
image/jpeg					jpeg jpg jpe
image/png					png
image/svg+xml					svg svgz
text/css					css
text/html					html htm shtml
text/plain					asc txt text pot brf srt
text/x-server-parsed-html

If we add charset=utf-8 by default, we should preferably not affect binary files.

Here are the changes I made to that script on my server:

...
      # mime types
      next if $extensions{$_};
      $extensions{$_} = 1;
      if (substr($1, 0, 5) eq "text/" or $1 eq "application/javascript") {
         print "\".$_\" => \"$1; charset=utf-8\",\n";
      } else {
         print "\".$_\" => \"$1\",\n";
      }
    }
...
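
For readers who do not speak Perl, here is a rough Python sketch of the same idea: walk mime.types-style lines, keep the first MIME type seen for each extension, and append `; charset=utf-8` only for `text/*` and `application/javascript`. The function names and the sample lines are assumptions for illustration; this is not the code that was deployed, and it splits on whitespace instead of using the Perl regex.

```python
def needs_charset(mime_type):
    """Text-like types get a charset; everything else is left untouched."""
    return mime_type.startswith("text/") or mime_type == "application/javascript"


def mime_assign_entries(lines):
    """Return lighttpd mimetype.assign entries for mime.types-style lines."""
    seen = set()
    entries = []
    for line in lines:
        # Drop comments, then split into the MIME type and its extensions.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank, comment-only, or a type without extensions
        mime_type, extensions = fields[0], fields[1:]
        for ext in extensions:
            if ext in seen:
                continue  # the first mapping for an extension wins
            seen.add(ext)
            value = mime_type
            if needs_charset(mime_type):
                value += "; charset=utf-8"
            entries.append('".%s" => "%s",' % (ext, value))
    return entries


# Sample lines taken from the grep output quoted earlier in this task.
SAMPLE = [
    "text/css\t\t\t\t\tcss",
    "image/png\t\t\t\t\tpng",
    "#application/x-httpd-php\t\t\tphtml pht php",
    "application/javascript\t\t\t\tjs",
    "text/x-server-parsed-html",
]

print("mimetype.assign = (")
for entry in mime_assign_entries(SAMPLE):
    print(entry)
print(")")
```

Binary types such as image/png pass through unchanged, which is the property asked for above.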

That still leaves out .log, .conf, and README, which are not listed in mime.types but are covered by Bühler's script.

I wonder if there is any downside to slapping ; charset=utf-8 on all files. It should really only affect the character decoding.
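
One way to cover those extra names is to append hand-written entries after the generated ones. The Python sketch below is only an assumption about how that could look: the suffix list and the `extra_entries` helper are hypothetical, and it relies on lighttpd matching `mimetype.assign` keys against the end of the file name, so a key without a leading dot such as "README" is allowed.

```python
# Hypothetical extra suffixes that mime.types does not list.
EXTRA_TEXT_SUFFIXES = [".log", ".conf", "README"]


def extra_entries(suffixes, mime_type="text/plain", charset="utf-8"):
    """Build mimetype.assign entries that serve the given suffixes as text."""
    return ['"%s" => "%s; charset=%s",' % (suffix, mime_type, charset)
            for suffix in suffixes]


for entry in extra_entries(EXTRA_TEXT_SUFFIXES):
    print(entry)
```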

Change 489409 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Set custom mime-types

https://gerrit.wikimedia.org/r/489409

Change 489409 merged by jenkins-bot:
[operations/software/tools-webservice@master] Set custom mime-types

https://gerrit.wikimedia.org/r/489409

Mentioned in SAL (#wikimedia-cloud) [2019-02-20T23:17:29Z] <zhuyifei1999_> begin build new tools-webservice package T178601 T193646 T215683

Mentioned in SAL (#wikimedia-cloud) [2019-02-20T23:30:52Z] <zhuyifei1999_> begin rebuilding all docker images T178601 T193646 T215683

bd808 closed this task as Resolved. · Feb 21 2019, 1:04 AM
bd808 claimed this task.
bd808 added a subscriber: bd808.
$ curl -i https://tools.wmflabs.org/bd808-test2/foo.txt
HTTP/2 200
server: nginx/1.13.6
date: Thu, 21 Feb 2019 01:02:49 GMT
content-type: text/plain; charset=utf-8
content-length: 9
accept-ranges: bytes
etag: "2007568543"
last-modified: Thu, 21 Feb 2019 00:49:32 GMT
strict-transport-security: max-age=86400
x-clacks-overhead: GNU Terry Pratchett
content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: wikibooks.org *.wikibooks.org wikidata.org *.wikidata.org wikimedia.org *.wikimedia.org wikinews.org *.wikinews.org wikipedia.org *.wikipedia.org wikiquote.org *.wikiquote.org wikisource.org *.wikisource.org wikiversity.org *.wikiversity.org wikivoyage.org *.wikivoyage.org wiktionary.org *.wiktionary.org *.wmflabs.org wikimediafoundation.org mediawiki.org *.mediawiki.org wss://tools.wmflabs.org; report-uri https://tools.wmflabs.org/csp-report/collect;

🦄🎉
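
The same check can be scripted. This is a small Python sketch that parses a Content-Type header value like the one in the curl output above and reports the declared charset. It uses the standard library's `email.message.Message` for the parsing; the helper name `content_type_and_charset` is made up for this example, and the header string is passed in directly so that no network access is needed.

```python
from email.message import Message


def content_type_and_charset(header_value):
    """Split a Content-Type header value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# Header value copied from the curl response above.
print(content_type_and_charset("text/plain; charset=utf-8"))
# A binary type without a charset parameter, for comparison.
print(content_type_and_charset("image/png"))
```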