HttpRequest: not all headers contain 'content-type'.
Closed, Resolved (Public)

Description

pwb.py reflinks.py -lang:ru -family:wikipedia -start:! -v -ignorepdf

Traceback (most recent call last):
  File "pwb.py", line 262, in <module>
    if not main():
  File "pwb.py", line 255, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "pwb.py", line 121, in run_python_file
    main_mod.__dict__)
  File "./scripts/reflinks.py", line 798, in <module>
    main()
  File "./scripts/reflinks.py", line 793, in main
    bot.run()
  File "./scripts/reflinks.py", line 588, in run
    linkedpagetext = f.content
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 181, in content
    return self.decode(self.encoding)
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 138, in encoding
    if not self.charset and not self.header_encoding:
  File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 121, in header_encoding
    content_type = self.response_headers['content-type']
  File "/home/user/anaconda3/lib/python3.6/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-type'
Dropped throttle(s).
<class 'KeyError'>
CRITICAL: Closing network session.
Network session closed.



(Pdb) self.response_headers
{'Date': 'Tue, 30 May 2017 20:03:29 GMT', 'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips', 'Last-Modified': 'Wed, 08 Feb 2006 19:49:40 GMT', 'ETag': '"338b0-40c4dcbea7d00"', 'Accept-Ranges': 'bytes', 'Content-Length': '211120', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}

Event Timeline

@Mpaa Do you have any idea which link or which article caused this trouble?

It happens on this page: Page((31) Евфросина)

@Mpaa I see, it is caused by the following link: https://sbn.psi.edu/pds/asteroid/EAR_A_5_DDR_ALBEDOS_V1_1/data/albedos.tab

$ curl -I https://sbn.psi.edu/pds/asteroid/EAR_A_5_DDR_ALBEDOS_V1_1/data/albedos.tab
HTTP/1.1 200 OK
Date: Tue, 15 Aug 2017 08:41:16 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips
Last-Modified: Wed, 08 Feb 2006 19:49:40 GMT
ETag: "338b0-40c4dcbea7d00"
Accept-Ranges: bytes
Content-Length: 211120
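
The same check can be done offline. Below is a minimal sketch that parses a header block like the one above with Python's standard email parser and tests whether Content-Type is present. The names RAW_HEAD and parse_head are made up for illustration and are not part of pywikibot.

```python
from email.parser import Parser

# Header block copied from the curl -I output for the link in question.
RAW_HEAD = """\
HTTP/1.1 200 OK
Date: Tue, 15 Aug 2017 08:41:16 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips
Last-Modified: Wed, 08 Feb 2006 19:49:40 GMT
ETag: "338b0-40c4dcbea7d00"
Accept-Ranges: bytes
Content-Length: 211120
"""


def parse_head(raw):
    """Split a raw HEAD response into its status line and parsed headers."""
    status_line, _, header_block = raw.partition('\n')
    # headersonly=True stops the parser from looking for a message body.
    headers = Parser().parsestr(header_block, headersonly=True)
    return status_line.strip(), headers


status, headers = parse_head(RAW_HEAD)
# Membership tests on the parsed message ignore header-name case.
has_content_type = 'content-type' in headers
print(status, '| Content-Type present:', has_content_type)
```

The server answers 200 OK but sends no Content-Type at all, which is the case the direct dictionary lookup in header_encoding did not expect.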

But for me this is just a warning and is skipped:

$ python pwb.py reflinks -page:"(31) Евфросина" -lang:ru -user:Dvorapa
Retrieving 1 pages from wikipedia:ru.
No charset found for http://www.psi.edu/pds/asteroid/EAR_A_5_DDR_ALBEDOS_V1_1/data/albedos.tab
No content-type found for http://www.psi.edu/pds/asteroid/EAR_A_5_DDR_ALBEDOS_V1_1/data/albedos.tab
No changes were needed on [[(31) Евфросина]]

It is just a warning, but it should definitely end with "No title found" instead of the current "No content-type found".

Weird, for me it stops here, in both py2 and py3:

File "/home/user/python/core/pywikibot/comms/threadedhttp.py", line 121, in header_encoding
  content_type = self.response_headers['content-type']
File "/usr/local/lib/python2.7/dist-packages/requests/structures.py", line 54, in __getitem__
  return self._store[key.lower()][1]

@Mpaa I can reproduce this error only on a 6-month-old copy of pwb; the current version just skips the link. Which version of pwb are you using?

I am always aligned with master (pull --recurse-submodules). Am I missing something?

user@pc:~/python/core {master}$ python scripts/version.py 
Pywikibot: [ssh] pywikibot-core.git (0fc98a7, g8516, 2017/08/15, 17:28:18, n/a)
Release version: 3.0-dev
requests version: 2.18.3
  cacerts: /usr/local/lib/python2.7/dist-packages/certifi/cacert.pem
    certificate test: ok

@Dvorapa, how does it pass this line?!
Does self.response_headers contain the key?!

File "pywikibot/comms/threadedhttp.py", line 121, in header_encoding
content_type = self.response_headers['content-type']

@Mpaa I see, I forgot one fix in this file that is not in master yet. Guess what?

Change 371973 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [bugfix, i18n, PEP8] Make reflinks.py work smoothly

https://gerrit.wikimedia.org/r/371973

Change 372166 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] threadedhttp: add default for content-type

https://gerrit.wikimedia.org/r/372166

Change 372166 merged by jenkins-bot:
[pywikibot/core@master] threadedhttp: add default for content-type

https://gerrit.wikimedia.org/r/372166
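
For readers landing here, the idea named in the change title can be sketched as below. This is not the merged patch: header_encoding here is a standalone illustrative function, not the pywikibot property, and the lower-casing step only mimics the case-insensitive header dict that requests provides.

```python
import re


def header_encoding(response_headers):
    """Return the charset named in Content-Type, or None if there is none."""
    # Normalise header names so the lookup is case-insensitive.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    # .get() with a default avoids the KeyError when the header is missing.
    content_type = lowered.get('content-type', '')
    match = re.search(r'charset=([^\s;]+)', content_type, re.IGNORECASE)
    if match is None:
        return None
    # Some servers quote the charset value; drop the quotes.
    return match.group(1).strip('"\'')


# Headers from the pdb session in the description: no Content-Type at all.
problem_headers = {
    'Date': 'Tue, 30 May 2017 20:03:29 GMT',
    'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips',
    'Accept-Ranges': 'bytes',
    'Content-Length': '211120',
}

print(header_encoding(problem_headers))
print(header_encoding({'Content-Type': 'text/html; charset=UTF-8'}))
```

With a default in place, a response without Content-Type yields no header encoding instead of raising, so the caller can fall back to its "No charset found" and "No content-type found" warnings and skip the link.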

Xqt reassigned this task from Dvorapa to Mpaa.