
Add charset=utf-8 by default to lighttpd
Closed, Resolved (Public)


Our documentation recommends manually overriding every single MIME type to fix mojibake issues. This is nonsensical: Wikimedia has been UTF-8 by default for 10+ years, and all tools assume a full UTF-8 pipeline. Additionally, Chrome 55 (Dec 2016) removed the Character Encoding switching option. Autodetection can fail either by falling back to the OS's locale or because its statistical heuristics break on edge cases.

Since lighttpd has to map every extension to a MIME type anyway, it makes sense to set the charset there from the start.
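To illustrate the mojibake in question, here is a minimal Python sketch of what happens when a client gets no charset and guesses a legacy encoding. The helper name `decode_as` is hypothetical, and Latin-1 merely stands in for whatever locale fallback a browser might pick.

```python
def decode_as(raw: bytes, assumed_charset: str) -> str:
    """Decode response bytes the way a client would if it assumed this charset."""
    return raw.decode(assumed_charset, errors="replace")


# The server sends UTF-8 bytes but no charset parameter.
body = "Bühler".encode("utf-8")

print(decode_as(body, "utf-8"))    # Bühler  (charset declared or guessed correctly)
print(decode_as(body, "latin-1"))  # BÃ¼hler (locale fallback guessed wrong)
```

With `charset=utf-8` in the Content-Type header, the client does not have to guess at all.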

Event Timeline

The mimetype is assigned via:

include_shell "/usr/share/lighttpd/"

which has these contents:

#06:01:51 0 ✓ zhuyifei1999@tools-webgrid-lighttpd-1401: ~$ cat /usr/share/lighttpd/
#!/usr/bin/perl -w
use strict;
open MIMETYPES, "/etc/mime.types" or exit;
print "mimetype.assign = (\n";
my %extensions;
while(<MIMETYPES>) {
  next if /^\w*$/;
  if(/^([a-z0-9\/+-.]+)\s+((?:[a-z0-9.+-]+[ ]?)+)$/) {
    foreach(split / /, $2) {
      # mime.types can have same extension for different
      # mime types
      next if $extensions{$_};
      $extensions{$_} = 1;
      print "\".$_\" => \"$1\",\n";
    }
  }
}
print ")\n";

Unfortunately, /etc/mime.types does not say whether a MIME type is binary or text:

06:05:54 0 ✓ zhuyifei1999@tools-bastion-02: ~$ grep -P 'javascript|css|html|txt|jpg|png|svg' /etc/mime.types
application/javascript				js
application/xhtml+xml				xhtml xht
#application/x-httpd-eruby			rhtml
#application/x-httpd-php			phtml pht php
image/jp2					jp2 jpg2
image/jpeg					jpeg jpg jpe
image/png					png
image/svg+xml					svg svgz
text/css					css
text/html					html htm shtml
text/plain					asc txt text pot brf srt

If we make charset=utf-8 the default, we should preferably not affect binary files.

Here are the changes I made to that script on my server:

      # mime types
      next if $extensions{$_};
      $extensions{$_} = 1;
      if (substr($1, 0, 5) eq "text/" or $1 eq "application/javascript") {
         print "\".$_\" => \"$1; charset=utf-8\",\n";
      } else {
         print "\".$_\" => \"$1\",\n";
      }

That still leaves out extensions such as .log and .conf, and files such as README, which are not listed in mime.types but are covered by Bühler's script.
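For readers who do not speak Perl, the rule in the modified script can be sketched in Python as follows. This is an illustration, not the script deployed on Toolforge: the function names are hypothetical, and the line parsing is simplified to a whitespace split instead of the original regex.

```python
def with_charset(mime: str) -> str:
    """Append charset=utf-8 to textual types only, leaving binary types untouched."""
    if mime.startswith("text/") or mime == "application/javascript":
        return f"{mime}; charset=utf-8"
    return mime


def mimetype_assign(mime_types_text: str) -> str:
    """Build a lighttpd mimetype.assign block from mime.types-style text."""
    seen = set()
    out = ["mimetype.assign = ("]
    for line in mime_types_text.splitlines():
        parts = line.split()
        # Skip blank lines and commented-out entries.
        if not parts or parts[0].startswith("#"):
            continue
        mime, extensions = parts[0], parts[1:]
        for ext in extensions:
            # mime.types can list the same extension for several types;
            # the first one wins, as in the Perl script.
            if ext in seen:
                continue
            seen.add(ext)
            out.append(f'".{ext}" => "{with_charset(mime)}",')
    out.append(")")
    return "\n".join(out)


sample = "image/png\t\t\t\t\tpng\ntext/css\t\t\t\t\tcss\n"
print(mimetype_assign(sample))
```

Like the Perl change, this only covers extensions that appear in mime.types, so .log, .conf, and README would still need separate entries.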

I wonder if there is a downside to slapping ; charset=utf-8 on all files. It should really only affect the character decoding.

Change 489409 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] Set custom mime-types

Change 489409 merged by jenkins-bot:
[operations/software/tools-webservice@master] Set custom mime-types

Mentioned in SAL (#wikimedia-cloud) [2019-02-20T23:17:29Z] <zhuyifei1999_> begin build new tools-webservice package T178601 T193646 T215683

Mentioned in SAL (#wikimedia-cloud) [2019-02-20T23:30:52Z] <zhuyifei1999_> begin rebuilding all docker images T178601 T193646 T215683

bd808 claimed this task.
bd808 edited projects, added cloud-services-team (Kanban); removed Patch-For-Review.
bd808 subscribed.
$ curl -i
HTTP/2 200
server: nginx/1.13.6
date: Thu, 21 Feb 2019 01:02:49 GMT
content-type: text/plain; charset=utf-8
content-length: 9
accept-ranges: bytes
etag: "2007568543"
last-modified: Thu, 21 Feb 2019 00:49:32 GMT
strict-transport-security: max-age=86400
x-clacks-overhead: GNU Terry Pratchett
content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: * * * * * * * * * * * * wss://; report-uri;
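
A check like the curl call above can also be scripted. The following is a minimal Python sketch that extracts the charset parameter from a Content-Type header value using the standard library; the helper name `charset_of` is hypothetical, and the header strings are taken from the response shown here rather than fetched live.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/plain; charset=utf-8"))  # utf-8
print(charset_of("image/png"))                  # None
```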