
DOI lookup returns a scraped "missing cookie" page instead of desired content
Closed, Resolved · Public · 1 Estimated Story Points

Description

DOI: 10.1056/NEJM200106073442306
Output: "MMS: Error" – http://citoid.wikimedia.org/api?format=mediawiki&search=10.1056%2FNEJM200106073442306

Event Timeline

Elitre raised the priority of this task from to Needs Triage.
Elitre updated the task description. (Show Details)
Elitre added a project: Citoid.
Elitre subscribed.
Jdforrester-WMF renamed this task from DOI in Citoid returns "Cite web" template to DOI lookup failure should return 520 but instead returns a default website. Mar 24 2015, 9:06 PM
Jdforrester-WMF triaged this task as High priority.
Jdforrester-WMF updated the task description. (Show Details)
Jdforrester-WMF set Security to None.
Jdforrester-WMF moved this task from Backlog to IO Tasks on the Citoid board.
Jdforrester-WMF added a subscriber: Mvolz.

What's going on here is that we're following all the redirects from the DOI link (as a way of resolving the actual URL), so we do get directed to the correct link, http://www.nejm.org/doi/full/10.1056/NEJM200106073442306. That page doesn't like citoid's lack of cookie support and redirects us again, to http://www.nejm.org/action/cookieAbsent, where the page is correctly scraped; it just isn't the page we wanted, so there's no 520.
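For illustration, here's a minimal sketch (not Citoid's actual code) of walking a redirect chain one hop at a time with Node's global fetch, assuming Node 18+, where `redirect: 'manual'` exposes the 3xx response:

```typescript
// Hypothetical sketch: walk a redirect chain hop by hop and record each URL.
async function traceRedirects(startUrl: string, maxHops = 10): Promise<string[]> {
  const hops = [startUrl];
  let url = startUrl;
  for (let i = 0; i < maxHops; i++) {
    // redirect: 'manual' returns the 3xx response instead of following it
    const res = await fetch(url, { redirect: 'manual' });
    const location = res.headers.get('location');
    if (res.status < 300 || res.status >= 400 || !location) break;
    url = new URL(location, url).toString(); // resolve relative Location headers
    hops.push(url);
  }
  return hops;
}

// For this DOI the chain ends on the cookie warning page:
// https://doi.org/10.1056/NEJM200106073442306
//   -> http://www.nejm.org/doi/full/10.1056/NEJM200106073442306
//   -> http://www.nejm.org/action/cookieAbsent
```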

One possible solution: only follow ONE redirect for DOI links. This is somewhat risky, as DOIs sometimes point to genuine redirects (I'm thinking of PLOS, which does this quite commonly).

The other bad thing about this is that there was no HTTP error for the cookie-absent redirect; there might have been such an error in an intermediate response, which needs more investigation. Whether to follow redirects or not is something that gets mixed results from site to site. One solution is to follow one redirect at a time and inspect each response, continuing only as long as we don't get HTTP errors, maybe?
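The one-hop-at-a-time idea could look something like this (again a hypothetical sketch, not Citoid code):

```typescript
// Hypothetical sketch: take a single redirect step and report the status,
// so the caller can decide after each hop whether to keep going.
interface Hop {
  status: number;
  nextUrl: string | null;
}

async function followOneRedirect(url: string): Promise<Hop> {
  const res = await fetch(url, { redirect: 'manual' }); // Node 18+ fetch
  const location = res.headers.get('location');
  const isRedirect = res.status >= 300 && res.status < 400 && location !== null;
  return {
    status: res.status,
    nextUrl: isRedirect ? new URL(location!, url).toString() : null,
  };
}

// A caller would loop while nextUrl is set, and stop (or fail the lookup)
// as soon as a hop comes back with a 4xx/5xx status.
```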

Mvolz renamed this task from DOI lookup failure should return 520 but instead returns a default website to RequestFromDOI follows all redirects and sometimes results in a bad page being sent to native scraper instead of good urls to Zotero. Mar 25 2015, 11:53 AM
Mvolz lowered the priority of this task from High to Medium.
Mvolz closed this task as a duplicate of T93876: Restructure requestFromDOI.

> Whether to follow redirects or not is something that gets mixed results from site to site. One solution is to follow one redirect at a time and inspect each response, continuing only as long as we don't get HTTP errors, maybe?

Following only one redirect is too tricky to put into practice, IMHO. There might be various reasons why a redirect happens, even for safe or verified sites.

How about following redirects and setting cookies? Is that undesirable for some reason?

Yeah, I think it's safe to follow one redirect for DOIs, try Zotero, and then follow all redirects after the fact. Or at least that's my current plan; see the task I merged this with (maybe I should unmerge and then mark both as blockers for this one?).
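A rough sketch of that plan (hypothetical code; `tryZotero()` and `nativeScrape()` are placeholder names, not Citoid's actual functions):

```typescript
// Placeholders for this sketch; the real service's interfaces differ.
type Citation = Record<string, unknown>;
declare function tryZotero(url: string): Promise<Citation | null>;
declare function nativeScrape(html: string, url: string): Promise<Citation>;

// Hypothetical sketch of the plan: follow exactly one redirect for a DOI,
// offer the resulting publisher URL to Zotero first, and only follow the
// rest of the chain for the native-scraper fallback.
async function requestFromDOI(doi: string): Promise<Citation> {
  const doiUrl = `https://doi.org/${encodeURIComponent(doi)}`;

  // Step 1: one redirect only, to get the publisher's URL.
  const first = await fetch(doiUrl, { redirect: 'manual' });
  const location = first.headers.get('location');
  const publisherUrl = location ? new URL(location, doiUrl).toString() : doiUrl;

  // Step 2: give Zotero the good URL before any further redirects happen.
  const fromZotero = await tryZotero(publisherUrl);
  if (fromZotero !== null) {
    return fromZotero;
  }

  // Step 3: fall back to following all redirects and scraping natively.
  const res = await fetch(publisherUrl, { redirect: 'follow' });
  return nativeScrape(await res.text(), res.url);
}
```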

Yes, we should add cookie support.
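A minimal sketch of what that could look like while following redirects manually (assumes Node 19.7+ for `Headers.getSetCookie()`; a real implementation would want a proper cookie-jar library such as tough-cookie, with expiry, path, and domain handling):

```typescript
// Hypothetical sketch: remember Set-Cookie values per host while following
// redirects, and send them back on later hops to the same host.
async function fetchWithCookies(startUrl: string, maxHops = 10): Promise<Response> {
  const jar = new Map<string, string[]>(); // host -> "name=value" pairs
  let url = startUrl;
  for (let i = 0; i < maxHops; i++) {
    const host = new URL(url).host;
    const cookies = jar.get(host) ?? [];
    const res = await fetch(url, {
      redirect: 'manual',
      headers: cookies.length > 0 ? { cookie: cookies.join('; ') } : {},
    });
    // Keep only the name=value part of each Set-Cookie header.
    for (const c of res.headers.getSetCookie()) {
      cookies.push(c.split(';')[0]);
    }
    jar.set(host, cookies);
    const location = res.headers.get('location');
    if (res.status < 300 || res.status >= 400 || !location) {
      return res; // not a redirect: this is the final response
    }
    url = new URL(location, url).toString();
  }
  throw new Error(`Too many redirects from ${startUrl}`);
}
```

In principle the nejm.org hop would then see its cookie echoed back and serve the article page instead of bouncing us to /action/cookieAbsent.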

Mvolz renamed this task from RequestFromDOI follows all redirects and sometimes results in a bad page being sent to native scraper instead of good urls to Zotero to DOI lookup returns a scraped "missing cookie" page instead of desired content. Mar 25 2015, 12:50 PM

Change 199921 had a related patch set uploaded (by Mvolz):
Restructure requestFromDOI tests

https://gerrit.wikimedia.org/r/199921

@Jdforrester-WMF, this specific case is resolved by https://gerrit.wikimedia.org/r/199921, but the general case of some pages being redirected to "no cookie" pages is not resolved, so maybe that should be the blocker instead?

Change 199921 merged by Mobrovac:
Restructure requestFromDOI tests

https://gerrit.wikimedia.org/r/199921