Nature articles internal server error page getting scraped by Zotero and returning 200 status from Zotero
Open, Needs Triage · Public · 0 Estimate Story Points

Description

Originally reported by @Josve05a here: T1380#1448434

Event Timeline

Mvolz created this task. Jul 12 2015, 7:43 PM
Mvolz raised the priority of this task from to Needs Triage.
Mvolz updated the task description.
Mvolz added a project: Citoid.
Mvolz moved this task to Site specific issues on the Citoid board.
Mvolz added subscribers: Mvolz, Josve05a.
Restricted Application added a subscriber: Aklapper. Jul 12 2015, 7:43 PM
Mvolz set Security to None. Jul 12 2015, 7:43 PM
Mvolz added a subscriber: mobrovac.
Mvolz added a comment. Jul 12 2015, 7:46 PM

@mobrovac maybe we should verify status at the location before sending to Zotero to avoid this sort of thing?

@Mvolz could you check that this is still an issue for us with the new, promisified version of Citoid? I don't think it should be.

Mvolz added a comment. Jul 13 2015, 1:25 PM

Yes, I checked before reporting. Zotero gives us a 200 response with a filled-in citation; we only reject the promise if Zotero gives us a 200 and an empty response. Otherwise we trust the 200.
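For illustration only, the acceptance rule described above could be sketched roughly as follows. This is not Citoid's actual code (Citoid is a Node.js service); `ZoteroResponse` and `accept_zotero_response` are hypothetical names, and the sketch only shows why a scraped error page slips through: a 200 with any non-empty citation is trusted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ZoteroResponse:
    """Hypothetical stand-in for what the Zotero service returns."""
    status: int
    items: List[dict] = field(default_factory=list)


def accept_zotero_response(resp: ZoteroResponse) -> List[dict]:
    """Reject only non-200 responses and 200 responses with an empty body.

    Any other 200 is trusted, even if the citation was scraped from an
    error page served by the target site.
    """
    if resp.status != 200:
        raise ValueError(f"Zotero returned status {resp.status}")
    if not resp.items:
        raise ValueError("Zotero returned 200 with an empty response")
    return resp.items


# A citation scraped from an error page is accepted, because the rule
# never looks at the status of the original URL.
error_page_citation = ZoteroResponse(200, [{"title": "Internal Server Error"}])
accepted = accept_zotero_response(error_page_citation)
```

Nothing in this rule inspects the status code of the resource itself, which is the gap discussed in this task.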

Probably the best course of action would be to request results from Zotero and the resource itself in parallel. That way, if Zotero's results are not good, we already have a starting point for native scraping. Doing it in parallel also speeds things up.
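A minimal sketch of the parallel approach, under stated assumptions: `cite`, `fetch_zotero`, `fetch_page` and `scrape_native` are hypothetical names, the fetchers are injected so the example stays self-contained, and the policy of distrusting Zotero when the page itself returns an error status is one possible choice rather than a decision made in this task.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical fetcher signatures: each takes a URL and returns (status, payload).
ZoteroFetcher = Callable[[str], Awaitable[Tuple[int, List[dict]]]]
PageFetcher = Callable[[str], Awaitable[Tuple[int, str]]]


async def cite(
    url: str,
    fetch_zotero: ZoteroFetcher,
    fetch_page: PageFetcher,
    scrape_native: Callable[[str], List[dict]],
) -> dict:
    # Start both requests at once; exceptions are returned, not raised,
    # so one failing request does not discard the other's result.
    zotero_result, page_result = await asyncio.gather(
        fetch_zotero(url), fetch_page(url), return_exceptions=True
    )

    # Assumption: a page status below 400 means the resource is healthy.
    page_ok = not isinstance(page_result, BaseException) and page_result[0] < 400
    zotero_ok = (
        not isinstance(zotero_result, BaseException)
        and zotero_result[0] == 200
        and bool(zotero_result[1])
    )

    if page_ok and zotero_ok:
        return {"source": "Zotero", "items": zotero_result[1]}
    if page_ok:
        # Zotero was not usable, but the page body is already in hand.
        return {"source": "native", "items": scrape_native(page_result[1])}
    raise LookupError(f"resource at {url} did not return a usable page")
```

Because the page has already been fetched by the time Zotero's answer is judged, falling back to native scraping costs no extra round trip.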

Josve05a updated the task description. Oct 15 2016, 11:40 PM
Restricted Application added a project: VisualEditor. Oct 15 2016, 11:40 PM
Mvolz moved this task from Site specific issues to Zotero on the Citoid board. Jan 11 2017, 4:30 PM
Jdforrester-WMF set the point value for this task to 0. Feb 9 2017, 6:16 PM
czar added a subscriber: czar. Mar 14 2017, 2:33 AM

@Mvolz, has this been resolved?

https://citoid.wikimedia.org/api?format=mediawiki&search=http://www.nature.com/ijo/journal/v38/n1/full/ijo201369a.html yields

[{"itemType":"journalArticle","notes":[],"tags":[],"title":"Perceived ‘healthiness’ of foods can influence consumers’ estimations of energy density and appropriate portion size","publicationTitle":"International Journal of Obesity","rights":"© 2013 Nature Publishing Group","volume":"38","issue":"1","pages":"106–112","date":"2014-01-01","DOI":"10.1038/ijo.2013.69","language":"en","url":"http://www.nature.com/ijo/journal/v38/n1/full/ijo201369a.html","abstractNote":"OBJECTIVE:\nMETHODS:\nRESULTS:\nCONCLUSIONS:","libraryCatalog":"www.nature.com","accessDate":"2017-03-14","author":[["G. P.","Faulkner"],["L. K.","Pourshahidi"],["J. M. W.","Wallace"],["M. A.","Kerr"],["T. A.","McCaffrey"],["M. B. E.","Livingstone"]],"source":["Zotero"]}]

https://github.com/zotero/translation-server/issues/15 is still open, though

Mvolz added a comment. Mar 14 2017, 2:14 PM
This comment was removed by Mvolz.
Mvolz added a comment. Apr 6 2017, 10:21 AM

A note about what other websites do: Google Plus won't scrape it either; you get the error message "this link is not valid." Facebook lets you attach it, though. Quora also does not. IME the user experience of FB is better than that of G+.

We started not scraping error pages because we were sometimes sending back 404 not found errors to users, but not scraping a page with valid metadata is not ideal either. We might consider scraping the error pages again... thoughts?

Mvolz moved this task from Zotero to Service on the Citoid board. Dec 11 2018, 6:01 PM