
http response doesn't contain a charset
Closed, Resolved, Public

Description

reflinks.py seems to be completely broken now. To reproduce, run:

pwb.py reflinks.py -lang:ru -family:wikipedia -start:! -v -ignorepdf

I get Unicode decode errors, and I also got this one:

WARNING: Http response status 404
WARNING : Redirect 404 : http://www.billboard.com/news/richard-s-prayer-stays-atop-u-k-chart-949770.story
Http response doesn't contain a charset.
Traceback (most recent call last):
  File "/srv/paws/pwb/pwb.py", line 263, in <module>
    if not main():
  File "/srv/paws/pwb/pwb.py", line 257, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "/srv/paws/pwb/pwb.py", line 121, in run_python_file
    main_mod.__dict__)
  File "/srv/paws/pwb/scripts/reflinks.py", line 797, in <module>
    main()
  File "/srv/paws/pwb/scripts/reflinks.py", line 793, in main
    bot.run()
  File "/srv/paws/pwb/scripts/reflinks.py", line 605, in run
    linkedpagetext = self.NON_HTML.sub(b'', linkedpagetext)
TypeError: can't use a bytes pattern on a string-like object
Dropped throttle(s).
<class 'TypeError'>
CRITICAL: Closing network session.
Network session closed.

I tried running the script from localhost, from Wikimedia Labs, and from PAWS, but the result is the same.
In this case the error relates to this article:
https://ru.wikipedia.org/wiki/(You_Drive_Me)_Crazy

But it crashes on practically every edit.

Rubinbot@PAWS:~$ pwb.py version
Pywikibot: [https] r-pywikibot-core.git (785fee8, g7905, 2017/03/04, 21:00:12, n/a)
Release version: 3.0-dev
requests version: 2.13.0
  cacerts: /srv/paws/lib/python3.4/site-packages/requests/cacert.pem
    certificate test: ok
Python: 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1]
PYWIKIBOT2_DIR: /srv/paws
PYWIKIBOT2_DIR_PWB: /srv/paws/pwb
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: /srv/paws
Usernames for family "commons":
        *: Rubinbot (no sysop configured)
Usernames for family "mediawiki":
        *: Rubinbot (no sysop configured)
Usernames for family "wikiquote":
        *: Rubinbot (no sysop configured)
Usernames for family "wikimedia":
        *: Rubinbot (no sysop configured)
Usernames for family "wiktionary":
        *: Rubinbot (no sysop configured)
Usernames for family "wikiversity":
        *: Rubinbot (no sysop configured)
Usernames for family "wikiboots":
        *: Rubinbot (no sysop configured)
Usernames for family "wikipedia":
        *: Rubinbot (no sysop configured)
Usernames for family "wikidata":
        *: Rubinbot (no sysop configured)
Usernames for family "meta":
        *: Rubinbot (no sysop configured)
Usernames for family "wikisource":
        *: Rubinbot (no sysop configured)
Rubinbot@PAWS:~$

Event Timeline

Rubin16 triaged this task as High priority. Mar 5 2017, 4:24 PM

The script isn't working at all: I get 5 errors per edit.

This happens only in Python 3 for me.

See T94688.
Seems that linkedpagetext is str, not bytes as expected?
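
A minimal sketch of the mismatch, using only the standard library. The pattern below is illustrative and is not the actual NON_HTML from reflinks.py; strip_non_html and bytes_pattern_fails_on_str are hypothetical helpers, and encoding a str to UTF-8 is an assumption made here for demonstration, not necessarily how the script should be fixed.

```python
import re

# Illustrative bytes pattern only; the real NON_HTML in reflinks.py may differ.
NON_HTML = re.compile(
    br'<script[^>]*>.*?</script>|<!--.*?-->',
    re.DOTALL | re.IGNORECASE,
)


def bytes_pattern_fails_on_str(text):
    """Return True if applying the bytes pattern to `text` raises TypeError."""
    try:
        NON_HTML.sub(b'', text)
    except TypeError:
        # Python 3 refuses to mix a bytes pattern with a str subject.
        return True
    return False


def strip_non_html(page):
    """Hypothetical helper: accept str or bytes and always work on bytes."""
    if isinstance(page, str):
        # Assumption: UTF-8 is an acceptable encoding for this demonstration.
        page = page.encode('utf-8')
    return NON_HTML.sub(b'', page)
```

In Python 3, passing a str to a pattern compiled from bytes raises the TypeError seen in the traceback, so either the page has to stay bytes until after the substitution or the pattern has to be a str pattern.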

Now I also get:

Traceback (most recent call last):
  File "pwb.py", line 262, in <module>
    if not main():
  File "pwb.py", line 255, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "pwb.py", line 121, in run_python_file
    main_mod.__dict__)
  File "./scripts/reflinks.py", line 798, in <module>
    main()
  File "./scripts/reflinks.py", line 793, in main
    bot.run()
  File "./scripts/reflinks.py", line 588, in run
    linkedpagetext = f.content
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 181, in content
    return self.decode(self.encoding)
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 138, in encoding
    if not self.charset and not self.header_encoding:
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 121, in header_encoding
    content_type = self.response_headers['content-type']
  File "/home/user/anaconda3/lib/python3.6/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-type'
Dropped throttle(s).
<class 'KeyError'>
CRITICAL: Closing network session.
Network session closed.

(Pdb) self.response_headers
{'Date': 'Tue, 30 May 2017 20:03:29 GMT', 'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips', 'Last-Modified': 'Wed, 08 Feb 2006 19:49:40 GMT', 'ETag': '"338b0-40c4dcbea7d00"', 'Accept-Ranges': 'bytes', 'Content-Length': '211120', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}

Not all responses include a 'content-type' header.
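
A minimal sketch of a tolerant lookup, assuming the headers are available as a plain mapping. header_charset and CHARSET_RE are hypothetical names for illustration and are not pywikibot's code; the point is only that a missing header should yield None rather than a KeyError.

```python
import re

# Hypothetical pattern for pulling the charset out of a Content-Type value.
CHARSET_RE = re.compile(r'charset\s*=\s*"?([\w.-]+)"?', re.IGNORECASE)


def header_charset(headers):
    """Return the charset named in Content-Type, or None if it is absent."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    content_type = lowered.get('content-type')
    if content_type is None:
        # No Content-Type header at all, as in the response shown above.
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1) if match else None


# Subset of the headers from the failing response; no Content-Type present.
no_type = {'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips',
           'Content-Length': '211120'}
print(header_charset(no_type))
print(header_charset({'Content-Type': 'text/html; charset=UTF-8'}))
```

The caller can then fall back to another way of detecting the encoding whenever the function returns None.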

If this is bypassed, I still get the original bug.

Change 371973 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [bugfix, i18n, PEP8] Make reflinks.py work smoothly

https://gerrit.wikimedia.org/r/371973

Seems that linkedpagetext is str, not bytes as expected?

@Mpaa You're right, there was a mistake in the requests implementation; see T173352.

Change 371973 merged by jenkins-bot:
[pywikibot/core@master] [bugfix, i18n, PEP8] Make reflinks.py work smoothly

https://gerrit.wikimedia.org/r/371973