Nature articles internal server error page getting scraped by Zotero and returning 200 status from Zotero
Open, Needs Triage | Public | 0 Estimated Story Points

Description

Originally reported by @Josve05a here: T1380#1448434

Event Timeline

Mvolz raised the priority of this task from to Needs Triage.
Mvolz updated the task description.
Mvolz added a project: Citoid.
Mvolz moved this task to Site specific issues on the Citoid board.
Mvolz added subscribers: Mvolz, Josve05a.

@mobrovac maybe we should verify status at the location before sending to Zotero to avoid this sort of thing?
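For illustration only, the "verify status first" idea could look roughly like the sketch below. It is not Citoid code: `cite_if_reachable`, `fetch_status`, and `ask_zotero` are hypothetical names, and treating any non-2xx status as a failure is an assumption.

```python
from typing import Any, Awaitable, Callable, Dict, List

Citation = Dict[str, Any]


async def cite_if_reachable(
    url: str,
    # Hypothetical helper: resolves to the HTTP status of the page itself.
    fetch_status: Callable[[str], Awaitable[int]],
    # Hypothetical helper: asks the Zotero translation server for citations.
    ask_zotero: Callable[[str], Awaitable[List[Citation]]],
) -> List[Citation]:
    """Only forward the URL to Zotero if the page itself answers with 2xx."""
    status = await fetch_status(url)
    if not 200 <= status < 300:
        # Assumption: any non-2xx status means there is nothing worth citing.
        raise LookupError(f"{url} responded with HTTP {status}")
    return await ask_zotero(url)
```

The cost of this approach is an extra round trip before Zotero is contacted at all.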

@Mvolz could you check that this is still an issue for us with the new, promisified version of Citoid? I don't think it should be.

Yes, I checked before reporting. Zotero gives us a 200 response with a filled-in citation; we only reject the promise if Zotero gives us a 200 and an empty response. Otherwise we trust the 200.
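To make the rule concrete, here is a minimal sketch of that check. The function name `accept_zotero_response` is hypothetical, and rejecting non-200 responses is an assumption; only the "200 with an empty body is rejected, any other 200 is trusted" part comes from the behaviour described above.

```python
from typing import Any, Dict, List

Citation = Dict[str, Any]


def accept_zotero_response(status: int, items: List[Citation]) -> bool:
    """Return True if the Zotero response would be trusted."""
    if status != 200:
        # Assumption: anything other than 200 is treated as a failure.
        return False
    # A 200 with an empty result is rejected; any other 200 is trusted,
    # even if the citation was scraped from a publisher's error page.
    return len(items) > 0
```

Under this rule a citation scraped from an internal server error page still passes, because the check looks only at Zotero's own status and whether the result is empty.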

Probably the best course of action would be to request results from Zotero and the resource itself in parallel. That way, if Zotero's results are not good, we already have a starting point for native scraping. Doing it in parallel also speeds things up.
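A rough sketch of the parallel approach, assuming three hypothetical helpers (`ask_zotero`, `fetch_page`, `scrape_natively`) that are passed in rather than taken from Citoid:

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Citation = Dict[str, Any]


async def cite_in_parallel(
    url: str,
    # Hypothetical: resolves to (status, citations) from the translation server.
    ask_zotero: Callable[[str], Awaitable[Tuple[int, List[Citation]]]],
    # Hypothetical: resolves to (status, html) from the resource itself.
    fetch_page: Callable[[str], Awaitable[Tuple[int, str]]],
    # Hypothetical: builds a citation from already-fetched HTML.
    scrape_natively: Callable[[str, str], Citation],
) -> Tuple[List[Citation], int]:
    """Query Zotero and the resource concurrently; fall back to native scraping."""
    # Both requests are in flight at the same time.
    (z_status, z_items), (page_status, html) = await asyncio.gather(
        ask_zotero(url), fetch_page(url)
    )
    if z_status == 200 and z_items:
        citations = z_items
    else:
        # The page is already in hand, so native scraping needs no extra request.
        citations = [scrape_natively(url, html)]
    # The page's own status is returned so the caller can decide what to do
    # when Zotero reports 200 but the resource itself reported an error.
    return citations, page_status
```

Returning the page status alongside the citations leaves the policy question (trust or discard a citation scraped from an error page) to the caller.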

@Mvolz, has this been resolved?

https://citoid.wikimedia.org/api?format=mediawiki&search=http://www.nature.com/ijo/journal/v38/n1/full/ijo201369a.html yields

[{"itemType":"journalArticle","notes":[],"tags":[],"title":"Perceived ‘healthiness’ of foods can influence consumers’ estimations of energy density and appropriate portion size","publicationTitle":"International Journal of Obesity","rights":"© 2013 Nature Publishing Group","volume":"38","issue":"1","pages":"106–112","date":"2014-01-01","DOI":"10.1038/ijo.2013.69","language":"en","url":"http://www.nature.com/ijo/journal/v38/n1/full/ijo201369a.html","abstractNote":"OBJECTIVE:\nMETHODS:\nRESULTS:\nCONCLUSIONS:","libraryCatalog":"www.nature.com","accessDate":"2017-03-14","author":[["G. P.","Faulkner"],["L. K.","Pourshahidi"],["J. M. W.","Wallace"],["M. A.","Kerr"],["T. A.","McCaffrey"],["M. B. E.","Livingstone"]],"source":["Zotero"]}]

https://github.com/zotero/translation-server/issues/15 is still open, though

This comment was removed by Mvolz.

A note on what other websites do: Google Plus won't scrape it either; you get the error message "this link is not valid." Facebook lets you attach it, though. Quora does not. In my experience, the user experience on FB is better than on G+.

We stopped scraping error pages because we were sometimes sending back 404 Not Found errors to users, but not scraping a page with valid metadata is not ideal either. We might consider scraping the error pages again... thoughts?