
reflinks.py work with old user agent
Closed, Resolved · Public

Description

When reflinks.py runs, some websites return a page whose title says "You use an outdated browser", and the script uses that text as the title of the reference.

For example, http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403 redirects to http://base.consultant.ru/cons/static4012_00_88_184768/invalid_browser.htm?ext=1; when the user-agent is not recognised. (The title of that page is "Устаревший или неподдерживаемый веб-обозреватель", which in Russian means "outdated browser".)

reflinks.py does not set a user agent for the bot. It uses urllib2, which has a default user-agent of "Python-urllib/2.6" (on Python 2.6) according to https://docs.python.org/2/library/urllib2.html.
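The default can be inspected without touching the network. The following is a minimal sketch that assumes Python 3, where urllib.request is the successor of urllib2; the variable names are only illustrative.

```python
import urllib.request

# An opener built with no extra handlers carries the library's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The only default header is the user agent, of the form "Python-urllib/<version>".
default_agent = default_headers["User-agent"]
print(default_agent)
```

Any request sent through such an opener without an explicit User-Agent header announces itself with this string, which is what a browser-sniffing site sees.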

The problem can be seen and tested with the following command, which uses https://en.wikipedia.org/wiki/User:John_Vandenberg/test_T113596 as a test page. Notice the script suggests adding a title "Устаревший или неподдерживаемый веб-обозреватель", which is incorrect.

$ python pwb.py reflinks -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T113596
No handlers could be found for logger "pywiki"
Retrieving 1 pages from wikipedia:en.


>>> User:John Vandenberg/test T113596 <<<
@@ -3 +3 @@
- <ref>http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403</ref>
+ <ref>[http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403 Устаревший или неподдерживаемый веб-обозреватель<!-- Bot generated title -->]</ref>

Edit summary: Bot: Converting bare references, using ref names to avoid duplicates, see [[mw:Manual:Pywikibot/refLinks|FAQ]]
Do you want to accept these changes? ([y]es, [N]o, [a]ll, [q]uit): n

The title of http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403 should be "Постановление Губернатора МО от 03.07.2015 N 282-ПГ "Об объединении рабочего поселка Львовский Подольского района Московской области и города Подольска Московской области" - КонсультантПлюс:"

Regarding fixing the bug

Pywikibot has a function, pywikibot.comms.http.user_agent, which produces a slightly better user agent; however, http://base.consultant.ru does not accept it either.

However http://base.consultant.ru does provide the correct resource when the user-agent is spoofed to be Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8) Gecko/20051128 SUSE/1.5-0.1 Firefox/1.5

It is the Firefox/1.5 token that makes it work; for example:

wget -O 'out.html' --user-agent="Firefox/1.5 Pywikibot/2.0rc4 (g5802) httplib2/0.9.1 Python/2.7.10.final.0" "http://base.consultant.ru/cons/cgi/online.cgi?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403"
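A rough Python counterpart of the wget call might look like the sketch below. It assumes Python 3's urllib.request rather than the urllib2 module that reflinks.py used, and build_request is a hypothetical helper, not a Pywikibot function. The request is only constructed here, not sent.

```python
import urllib.request

URL = ("http://base.consultant.ru/cons/cgi/online.cgi"
       "?req=doc;base=MOB;n=212237;fld=134;from=1-34;rnd=0.16269673267379403")
SPOOFED_AGENT = ("Firefox/1.5 Pywikibot/2.0rc4 (g5802) httplib2/0.9.1 "
                 "Python/2.7.10.final.0")


def build_request(url, user_agent):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(URL, SPOOFED_AGENT)
# Request normalises header names, so the header is read back as "User-agent".
print(request.get_header("User-agent"))
# Passing the request to urllib.request.urlopen() would actually fetch the page.
```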

As a result, a custom/spoofed user agent is needed for reflinks.py, and possibly also weblinkchecker.py (see T71204).

One design problem is that each website may have its own user-agent rejection rules, so to properly solve this problem, pywikibot needs to support a custom user-agent for each website.
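One way to picture per-website support is a lookup keyed on the hostname. This is only a sketch of the idea: HOST_USER_AGENTS and user_agent_for are hypothetical names, and nothing like this is claimed to exist in Pywikibot.

```python
from urllib.parse import urlsplit

# Hypothetical per-host overrides; the single entry comes from this task.
HOST_USER_AGENTS = {
    "base.consultant.ru": (
        "Mozilla/5.0 (X11; U; Linux i686; de; rv:1.8) "
        "Gecko/20051128 SUSE/1.5-0.1 Firefox/1.5"
    ),
}


def user_agent_for(url, default_agent, overrides=HOST_USER_AGENTS):
    """Return the override for the URL's host, or default_agent if none is set."""
    host = urlsplit(url).hostname or ""
    return overrides.get(host, default_agent)


print(user_agent_for("http://base.consultant.ru/cons/cgi/online.cgi?req=doc",
                     "ExampleBot/0.1"))
print(user_agent_for("https://example.org/", "ExampleBot/0.1"))
```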

This could be a configuration item in pywikibot/config2.py; however, a fixed value would become out of date and could cause user-agent sniffers to reject it.
Another approach is to obtain the user-agent from a package such as fake-useragent or browseragents.

An acceptable solution for a good first bug is to have one configuration variable, which defaults to blank. When it is blank and fake-useragent or browseragents is installed, prefill the configuration variable with a spoofed user-agent from either package. If it is blank and neither is installed, use pywikibot.comms.http.user_agent.
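The fallback order described above could be sketched roughly as follows. pick_user_agent and fake_useragent_source are hypothetical names. The default agent is passed in as a plain string rather than taken from pywikibot.comms.http.user_agent, so the sketch runs without Pywikibot. The only third-party call used is fake_useragent's UserAgent().random, and the import is deferred so that a missing package simply falls through.

```python
def fake_useragent_source():
    """Return a spoofed browser user agent from the optional fake-useragent package."""
    from fake_useragent import UserAgent  # raises ImportError if not installed
    return UserAgent().random


def pick_user_agent(configured, default_agent, spoof_sources=()):
    """Choose a user agent: configured value, then a spoof source, then the default."""
    if configured:
        return configured
    for source in spoof_sources:
        try:
            agent = source()
        except Exception:
            # A missing or broken optional package must not stop the script.
            continue
        if agent:
            return agent
    return default_agent


# Stub sources stand in for the optional packages in this demonstration.
print(pick_user_agent("", "ExampleBot/0.1", [lambda: "Spoofed/1.0"]))
print(pick_user_agent("", "ExampleBot/0.1"))
```

In real use the list of sources would contain fake_useragent_source (and an equivalent for browseragents), and the configured argument would be the value of the configuration variable.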


Event Timeline

Rubin16 created this task. · Sep 24 2015, 12:29 PM
Rubin16 raised the priority of this task from to Normal.
Rubin16 updated the task description.
Rubin16 added a project: Pywikibot.
Rubin16 added a subscriber: Rubin16.
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · Sep 24 2015, 12:29 PM
jayvdb added a subscriber: jayvdb.

It is using urllib2, and is not setting a user-agent at all.

weblinkchecker.py sets a user-agent, but the code will need a bit of adapting.

A better approach would be to rip out all of the urllib2 code and do T111300: Convert reflinks to requests.

Nemo_bis set Security to None.
jayvdb added a comment. · Jan 9 2016, 3:25 AM

@Rubin16, can you give an example of where that page title appeared?

https://ru.wikipedia.org/w/index.php?diff=72816694&oldid=72804830
https://ru.wikipedia.org/w/index.php?diff=73531250&oldid=73527279

"Устаревший или неподдерживаемый веб-обозреватель" in Russian means "outdated browser"

@8ohit.dua, please do not create GCI tasks for pywikibot with only yourself as a mentor; you don't have any Pywikibot code experience (please fix some bugs to remedy that!).
@Nemo_bis, could you add me as a mentor so I can edit that task? As the task is currently written, I would -2 any solution that does what the task describes.

jayvdb updated the task description. · Jan 10 2016, 2:47 PM
MtDu claimed this task. · Jan 14 2016, 8:27 PM
MtDu added a subscriber: MtDu.

I have claimed this task on GCI. I will have some questions, so what time is best for me to get on IRC?
Thanks,
MtDu

MtDu added a comment. · Jan 14 2016, 9:55 PM

Hello!
I'm kind of confused on the general bug. Here are my questions:

  1. Where does the user-agent need to actually be added? I looked around pywikibot, and found this. https://dpaste.de/aq6u Is this similar to what I need to do here? (I know there needs to be some logic) A general explanation of this would be appreciated. (What function uses it, why, and where the user-agent needs to be used)
  2. So I would default a variable to be a blank string. How do I check if certain packages are installed in python? In this case, fake-useragent or browseragent.

Thank you for your time and help!
MtDu

https://dpaste.de/aq6u is weblinkchecker.py, which is a separate but related bug : T71204.

Each script in scripts/ is separate from the others.

This task is about scripts/reflinks.py, which does not set a user agent at all.

To check if a package is installed, add the following to the top of the module:

try:
    import browseragent
except ImportError as e:
    # Keep the exception so later code can tell that the import failed.
    browseragent = e

Later in the code:

if not isinstance(browseragent, ImportError):
    # use browseragent package
    pass
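An alternative check that does not import the package is sketched below. It assumes Python 3.4 or later, where importlib.util.find_spec is available. is_installed is a hypothetical helper, and the module names in OPTIONAL_PACKAGES are assumptions about how the two packages are imported.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names for the optional packages discussed in this task.
OPTIONAL_PACKAGES = ("fake_useragent", "browseragents")
available = [name for name in OPTIONAL_PACKAGES if is_installed(name)]
print(available)
```

The try/except form has the advantage of also working on Python 2, which is what this task recommends for testing.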

> I have claimed this task on GCI. I will have some questions, so what time is best for me to get on IRC?

I am available on IRC mostly during the day and evening Australian Eastern time.

@8ohit.dua, will you be joining the #pywikibot IRC channel to help with mentoring?

MtDu added a comment. · Jan 15 2016, 2:13 AM

@jayvdb,
Thanks for clarifying my questions. I will get to work later.
Thanks,
MtDu

Change 264251 had a related patch set uploaded (by MtDu):
[WIP] Set user-agent in reflinks.py

https://gerrit.wikimedia.org/r/264251

jayvdb: I am available on IRC mostly after 8pm IST (UTC+5:30).
I'll be there in the #pywikibot channel.

@MtDu, how are you testing your code? Could you describe your test sequence here please. Then reviewers can

  1. comment on whether the test sequence is good
  2. use the test sequence to confirm your solution is fixing the problem.
MtDu added a comment. · Jan 15 2016, 9:17 PM

@jayvdb,
Could you guide me on how to test it?
Thanks,
MtDu

jayvdb updated the task description. · Jan 15 2016, 9:43 PM

> @jayvdb,
> Could you guide me on how to test it?
> Thanks,
> MtDu

I've added a way to test it in this task's description.

Note that fake-useragent is unusable at the moment on Python 2.7; see https://github.com/hellysmile/fake-useragent/issues/14

MtDu added a comment. · Jan 15 2016, 11:35 PM

@jayvdb,
So... is switching to requests a good idea? When I test now, I get errors like these: https://dpaste.de/Tdio#L552,555,557,648
On line 552, it says that str does not have a decode attribute.
When I delete that line, the error moves to line 648, where the object does not have a close attribute.
When I delete that line too, the error moves to line 557, where the object does not have an info attribute.
Please guide me.
Thanks,
MtDu

@MtDu, As it seems you are investigating using requests, please discuss that problem on T111300: Convert reflinks to requests. It is a separate task. It is not a mandatory part of the solution to this user-agent bug.

Python 2 should be used for this bug, because Python 3 has a bug: T118674

jayvdb updated the task description. · Jan 17 2016, 12:18 AM

> Note that fake-useragent is unusable at the moment on Python 2.7; see https://github.com/hellysmile/fake-useragent/issues/14

Re-installing the package fixed my problem. Quite strange, but anyway it now works well.

Change 264251 merged by jenkins-bot:
Set user-agent and convert reflinks.py to use requests

https://gerrit.wikimedia.org/r/264251

jayvdb closed this task as Resolved. · Jan 19 2016, 5:17 AM
Restricted Application added a subscriber: TerraCodes. · Dec 1 2016, 9:28 AM