Improve reading encoding from scraped pages

Description

Some websites use less common character encodings, or omit
the encoding from the response headers but declare it in
the HTML of the page. This change lets us read a wider
range of character encodings from the response and, when
the response headers do not specify one, scrape the
encoding directly from the page.

  • Read encoding in content-type response header directly
  • Read encoding in meta tags if not available in response
    headers, preferring charset to http-equiv tags
  • Decode response body using iconv-lite
  • Default to utf-8 if no encoding found in either response
    headers or body of page
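
The lookup order described above can be sketched roughly as follows. This is an illustration only, not the code from this change: the function names (charsetFromContentType, charsetFromHtml, resolveCharset, decodeBody) are hypothetical, the body is assumed to be a Node.js Buffer, and the built-in TextDecoder stands in for iconv-lite so that the sketch has no dependencies.

```javascript
// Hypothetical helpers illustrating the lookup order; not the actual patch.

// Extract the charset parameter from a Content-Type value,
// e.g. "text/html; charset=windows-1251" -> "windows-1251".
function charsetFromContentType(contentType) {
    if (!contentType) {
        return null;
    }
    const match = /charset\s*=\s*["']?([^"';,\s]+)/i.exec(contentType);
    return match ? match[1].toLowerCase() : null;
}

// Look for a charset declared in <meta> tags, preferring
// <meta charset="..."> to <meta http-equiv="Content-Type" content="...">.
function charsetFromHtml(body) {
    // Assumption: the declaration is ASCII and sits near the top of the
    // page, so a latin1 view of the first bytes is enough to find it.
    const head = body.subarray(0, 4096).toString('latin1');
    const attrPattern = /([a-z-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
    let fromCharsetAttr = null;
    let fromHttpEquiv = null;

    for (const tag of head.match(/<meta\b[^>]*>/gi) || []) {
        // Collect the attributes of this meta tag into a plain object.
        const attrs = {};
        for (const m of tag.matchAll(attrPattern)) {
            attrs[m[1].toLowerCase()] = m[2] || m[3] || m[4] || '';
        }
        if (!fromCharsetAttr && attrs.charset) {
            fromCharsetAttr = attrs.charset.toLowerCase();
        }
        if (!fromHttpEquiv && (attrs['http-equiv'] || '').toLowerCase() === 'content-type') {
            fromHttpEquiv = charsetFromContentType(attrs.content);
        }
    }
    return fromCharsetAttr || fromHttpEquiv;
}

// Response header first, then the page itself, then utf-8.
function resolveCharset(contentType, body) {
    return charsetFromContentType(contentType) || charsetFromHtml(body) || 'utf-8';
}

// Decode the raw body; unknown encoding labels fall back to utf-8.
function decodeBody(contentType, body) {
    let decoder;
    try {
        decoder = new TextDecoder(resolveCharset(contentType, body));
    } catch (e) {
        decoder = new TextDecoder('utf-8');
    }
    return decoder.decode(body);
}
```

With iconv-lite, the decoding step would presumably use iconv.encodingExists(charset) to validate the label and iconv.decode(body, charset) to decode, in place of TextDecoder.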

Tests:

  • Add server-based test for non-standard charset in response
  • Add server-based test for no charset in response but
    charset in html http-equiv meta tag
  • Add parsing-based test for meta tag charset in html since
    no convenient examples exist in the wild
  • Move test_files to test/utils/static and remove unused
    json test file

Bug: T95833
Change-Id: I469084d3c0d36c5d43083d329f8c5625b7f0e1ba

Details

Provenance
Mvolz Authored on
Parents
rGCIT0b37b2c6affd: Add the /_info routes and some tests
Branches
Unknown
Tags
Unknown
ChangeId
I469084d3c0d36c5d43083d329f8c5625b7f0e1ba