
DOI lookup returns a scraped "missing cookie" page instead of desired content
Closed, Resolved · Public · 1 Estimated Story Points

Description

DOI: 10.1056/NEJM200106073442306
Output: "MMS: Error" – http://citoid.wikimedia.org/api?format=mediawiki&search=10.1056%2FNEJM200106073442306

Event Timeline

Elitre raised the priority of this task from to Needs Triage.
Elitre updated the task description. (Show Details)
Elitre added a project: Citoid.
Elitre subscribed.
Jdforrester-WMF renamed this task from DOI in Citoid returns "Cite web" template to DOI lookup failure should return 520 but instead returns a default website. Mar 24 2015, 9:06 PM
Jdforrester-WMF triaged this task as High priority.
Jdforrester-WMF updated the task description. (Show Details)
Jdforrester-WMF set Security to None.
Jdforrester-WMF moved this task from Backlog to IO Tasks on the Citoid board.
Jdforrester-WMF added a subscriber: Mvolz.

What's going on here is that we're following all the redirects from the DOI link (as a way of resolving the actual URL), so we do get directed to the correct link, http://www.nejm.org/doi/full/10.1056/NEJM200106073442306. That page doesn't like citoid's lack of cookie support and redirects us again, to http://www.nejm.org/action/cookieAbsent, where the page is correctly scraped; it just isn't the page we wanted, so there's no 520.
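For illustration, here's a minimal sketch (not Citoid's actual code) of walking a redirect chain one hop at a time with Node's global fetch, assuming Node 18+, where `redirect: 'manual'` exposes the 3xx response:

```typescript
// Hypothetical sketch: walk a redirect chain hop by hop and record each URL.
async function traceRedirects(startUrl: string, maxHops = 10): Promise<string[]> {
  const hops = [startUrl];
  let url = startUrl;
  for (let i = 0; i < maxHops; i++) {
    // redirect: 'manual' returns the 3xx response instead of following it
    const res = await fetch(url, { redirect: 'manual' });
    const location = res.headers.get('location');
    if (res.status < 300 || res.status >= 400 || !location) break;
    url = new URL(location, url).toString(); // resolve relative Location headers
    hops.push(url);
  }
  return hops;
}

// For this DOI the chain ends on the cookie warning page:
// https://doi.org/10.1056/NEJM200106073442306
//   -> http://www.nejm.org/doi/full/10.1056/NEJM200106073442306
//   -> http://www.nejm.org/action/cookieAbsent
```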

One possible solution: only follow ONE redirect for DOI links. This is somewhat risky, as DOIs sometimes point to genuine redirects (I'm thinking of PLOS, which does this quite commonly).

The other bad thing about this is that there was no HTTP error for the cookie-absent redirect; there might have been such an error in an intermediate response, which needs more investigation. Whether to follow redirects or not is something that gets mixed results from site to site. One solution is to follow one redirect at a time and inspect each response, continuing only as long as we don't get HTTP errors, maybe?
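The one-hop-at-a-time idea could look something like this (again a hypothetical sketch, not Citoid code):

```typescript
// Hypothetical sketch: take a single redirect step and report the status,
// so the caller can decide after each hop whether to keep going.
interface Hop {
  status: number;
  nextUrl: string | null;
}

async function followOneRedirect(url: string): Promise<Hop> {
  const res = await fetch(url, { redirect: 'manual' }); // Node 18+ fetch
  const location = res.headers.get('location');
  const isRedirect = res.status >= 300 && res.status < 400 && location !== null;
  return {
    status: res.status,
    nextUrl: isRedirect ? new URL(location!, url).toString() : null,
  };
}

// A caller would loop while nextUrl is set, and stop (or fail the lookup)
// as soon as a hop comes back with a 4xx/5xx status.
```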

Mvolz renamed this task from DOI lookup failure should return 520 but instead returns a default website to RequestFromDOI follows all redirects and sometimes results in a bad page being sent to native scraper instead of good urls to Zotero. Mar 25 2015, 11:53 AM
Mvolz lowered the priority of this task from High to Medium.
Mvolz closed this task as a duplicate of T93876: Restructure requestFromDOI.

> Whether to follow redirects or not is something that gets mixed results from site to site. One solution is to follow one redirect at a time and inspect each response, continuing only as long as we don't get HTTP errors, maybe?

Following only one redirect is too tricky to put into practice, IMHO. There might be various reasons why a redirect happens, even for safe or verified sites.

How about following redirects and setting cookies? Is that undesirable for some reason?

Yeah, I think it's safe to follow one redirect for DOIs, try Zotero, and then follow all redirects after the fact. Or at least that's my current plan; see the task I merged this with (maybe I should unmerge and then mark both as blockers for this one?).
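A rough sketch of that plan (hypothetical code; `tryZotero()` and `nativeScrape()` are placeholder names, not Citoid's actual functions):

```typescript
// Placeholders for this sketch; the real service's interfaces differ.
type Citation = Record<string, unknown>;
declare function tryZotero(url: string): Promise<Citation | null>;
declare function nativeScrape(html: string, url: string): Promise<Citation>;

// Hypothetical sketch of the plan: follow exactly one redirect for a DOI,
// offer the resulting publisher URL to Zotero first, and only follow the
// rest of the chain for the native-scraper fallback.
async function requestFromDOI(doi: string): Promise<Citation> {
  const doiUrl = `https://doi.org/${encodeURIComponent(doi)}`;

  // Step 1: one redirect only, to get the publisher's URL.
  const first = await fetch(doiUrl, { redirect: 'manual' });
  const location = first.headers.get('location');
  const publisherUrl = location ? new URL(location, doiUrl).toString() : doiUrl;

  // Step 2: give Zotero the good URL before any further redirects happen.
  const fromZotero = await tryZotero(publisherUrl);
  if (fromZotero !== null) {
    return fromZotero;
  }

  // Step 3: fall back to following all redirects and scraping natively.
  const res = await fetch(publisherUrl, { redirect: 'follow' });
  return nativeScrape(await res.text(), res.url);
}
```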

Yes, we should add cookie support.
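A minimal sketch of what that could look like while following redirects manually (assumes Node 19.7+ for `Headers.getSetCookie()`; a real implementation would want a proper cookie-jar library such as tough-cookie, with expiry, path, and domain handling):

```typescript
// Hypothetical sketch: remember Set-Cookie values per host while following
// redirects, and send them back on later hops to the same host.
async function fetchWithCookies(startUrl: string, maxHops = 10): Promise<Response> {
  const jar = new Map<string, string[]>(); // host -> "name=value" pairs
  let url = startUrl;
  for (let i = 0; i < maxHops; i++) {
    const host = new URL(url).host;
    const cookies = jar.get(host) ?? [];
    const res = await fetch(url, {
      redirect: 'manual',
      headers: cookies.length > 0 ? { cookie: cookies.join('; ') } : {},
    });
    // Keep only the name=value part of each Set-Cookie header.
    for (const c of res.headers.getSetCookie()) {
      cookies.push(c.split(';')[0]);
    }
    jar.set(host, cookies);
    const location = res.headers.get('location');
    if (res.status < 300 || res.status >= 400 || !location) {
      return res; // not a redirect: this is the final response
    }
    url = new URL(location, url).toString();
  }
  throw new Error(`Too many redirects from ${startUrl}`);
}
```

In principle the nejm.org hop would then see its cookie echoed back and serve the article page instead of bouncing us to /action/cookieAbsent.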

Mvolz renamed this task from RequestFromDOI follows all redirects and sometimes results in a bad page being sent to native scraper instead of good urls to Zotero to DOI lookup returns a scraped "missing cookie" page instead of desired content. Mar 25 2015, 12:50 PM

Change 199921 had a related patch set uploaded (by Mvolz):
Restructure requestFromDOI tests

https://gerrit.wikimedia.org/r/199921

@Jdforrester-WMF, this specific case is resolved by https://gerrit.wikimedia.org/r/199921, but the general case of some pages being redirected to "no cookie" pages is not resolved, so maybe that should be the blocker instead?

Change 199921 merged by Mobrovac:
Restructure requestFromDOI tests

https://gerrit.wikimedia.org/r/199921