Page MenuHomePhabricator

DOI redirect handler for iopscience is very slow/ potentially causes citoid to behave oddly.
Open, MediumPublic8 Estimated Story Points

Description

Reported here:

T106201#1466601

10.1088/0004-637X/802/1/65 gives inconsistent results; on localhost, it takes a ridiculously long time to find the redirect and when it does, the response given isn't the same given what would be expected if the final url was being used. (i.e. it is missing information that you would get if you directly scraped http://iopscience.iop.org/0004-637X/802/1/65/, even though the logs say it has successfully used the translator from Zotero)

When I directly curl the handler however I get a fast 301 response so I don't know why this is causing issues for citoid.

Here's a trace of a case where it never finds the redirect (in other instances, it does.)

{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":10,"req":{"url":"/api?format=mediawiki&search=10.1088%2F0004-637X%2F802%2F1%2F65","headers":{"user-agent":"curl/7.35.0","host":"localhost:1970","accept":"*/*","x-request-id":"2f70ad81-2f9e-11e5-9132-fdd3b13de15a"},"method":"GET","params":{},"query":{"format":"mediawiki","search":"10.1088/0004-637X/802/1/65"},"body":{},"remoteAddress":"127.0.0.1","remotePort":36738},"msg":"","levelPath":"trace/req","request_id":"2f70ad81-2f9e-11e5-9132-fdd3b13de15a","time":"2015-07-21T11:46:55.195Z","v":0}
{"name":"citoid.input.doi","hostname":"mariellesbox","pid":23095,"level":10,"message":"increment:citoid.input.doi:","method":"increment","levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:46:55.196Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"Resolved DOI 10.1088/0004-637x/802/1/65 to URL http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299; Sending to requestFromURL","time":"2015-07-21T11:46:55.752Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"Zotero request made for: http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299","time":"2015-07-21T11:46:55.763Z","v":0}
{"name":"citoid.zotero.req.5xx","hostname":"mariellesbox","pid":23095,"level":10,"message":"increment:citoid.zotero.req.5xx:","method":"increment","levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:46:55.763Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":10,"msg":"No Zotero translator found, looking for redirects","time":"2015-07-21T11:46:55.764Z","v":0}
{"name":"citoid.heap.rss","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.rss:77766656","method":"timing","value":77766656,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:47:45.547Z","v":0}
{"name":"citoid.heap.total","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.total:69563136","method":"timing","value":69563136,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:47:45.548Z","v":0}
{"name":"citoid.heap.used","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.used:38160600","method":"timing","value":38160600,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:47:45.548Z","v":0}
{"name":"citoid.heap.rss","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.rss:77766656","method":"timing","value":77766656,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:48:45.609Z","v":0}
{"name":"citoid.heap.total","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.total:69563136","method":"timing","value":69563136,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:48:45.610Z","v":0}
{"name":"citoid.heap.used","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.used:38194376","method":"timing","value":38194376,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:48:45.610Z","v":0}
{"name":"citoid.heap.rss","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.rss:77799424","method":"timing","value":77799424,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:49:45.659Z","v":0}
{"name":"citoid.heap.total","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.total:69563136","method":"timing","value":69563136,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:49:45.660Z","v":0}
{"name":"citoid.heap.used","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.used:38489328","method":"timing","value":38489328,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:49:45.660Z","v":0}
{"name":"citoid.heap.rss","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.rss:77799424","method":"timing","value":77799424,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:50:45.720Z","v":0}
{"name":"citoid.heap.total","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.total:69563136","method":"timing","value":69563136,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:50:45.721Z","v":0}
{"name":"citoid.heap.used","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.used:38517808","method":"timing","value":38517808,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:50:45.721Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"No redirect detected.","time":"2015-07-21T11:50:56.289Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"Using native scraper on http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299","time":"2015-07-21T11:50:56.290Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"Sucessfully scraped resource at http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299","time":"2015-07-21T11:50:58.509Z","v":0}
{"name":"citoid","hostname":"mariellesbox","pid":23095,"level":20,"msg":"PubMed query made for: http://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=citoid&email=citoid@mediawiki&format=json&ids=10.1088%2F0004-637x%2F802%2F1%2F65","time":"2015-07-21T11:50:58.835Z","v":0}
{"name":"citoid.heap.rss","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.rss:84406272","method":"timing","value":84406272,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:51:45.768Z","v":0}
{"name":"citoid.heap.total","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.total:73678848","method":"timing","value":73678848,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:51:45.768Z","v":0}
{"name":"citoid.heap.used","hostname":"mariellesbox","pid":23095,"level":10,"message":"timing:citoid.heap.used:36875864","method":"timing","value":36875864,"levelPath":"trace/metrics","msg":"","time":"2015-07-21T11:51:45.768Z","v":0}

Event Timeline

Mvolz raised the priority of this task from to Needs Triage.
Mvolz updated the task description. (Show Details)
Mvolz added projects: Citoid, Services.
Mvolz moved this task to Site specific issues on the Citoid board.
Mvolz added subscribers: Mvolz, Josve05a.

@mobrovac, can you replicate this on your localhost? Not sure why it's having so much trouble finding redirect.

I can, but rarely. It seems request is having trouble retrieving http://iopscience.iop.org/0004-637X/802/1/65/ and sometimes the request times out. From the request debug output:

REQUEST { url: 'http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299',
  encoding: null,
  method: 'get',
  retries: 1,
  timeout: 120000,
  pool: { maxSockets: Infinity },
  gzip: true,
  _encodingProvided: true,
  callback: [Function] }
REQUEST make request http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299
REQUEST onRequestResponse http://stacks.iop.org/0004-637X/802/i=1/a=65?key=crossref.0e4dc79673287d33895e683e51eb3299 301 { date: 'Tue, 21 Jul 2015 18:19:36 GMT',
  server: 'Apache/1.3.41 (Unix) mod_perl/1.30',
  'iop-request-id': 'e037c2f3-0cc7-4313-8905-e7ab6b73691f',
  location: 'http://iopscience.iop.org/0004-637X/802/1/65/',
  'content-type': 'text/html; charset=iso-8859-1',
  'set-cookie': 
   [ 'SaneID=93.137.176.1-1437502776422775; path=/; expires=Fri, 24-Jul-20 18:19:36 GMT',
     'IOP_session_live=%2F%2F1437502776270%7C43bbfb33-4e03-45ab-8116-a89e8c269bea%7C20150721-6013aa4dd1b3edb83ac91b13736526e7%7C%7C%7C%7C%7C%7C%7C%7C%7Cguest%2F68e3ffc117fea68490b93c218e172541; Path=/; Domain=.iop.org' ],
  'keep-alive': 'timeout=15, max=100',
  connection: 'Keep-Alive',
  'transfer-encoding': 'chunked' }
REQUEST redirect http://iopscience.iop.org/0004-637X/802/1/65/
REQUEST redirect to http://iopscience.iop.org/0004-637X/802/1/65/
REQUEST {}
REQUEST response end http://iopscience.iop.org/0004-637X/802/1/65/ 301 { date: 'Tue, 21 Jul 2015 18:19:36 GMT',
  server: 'Apache/1.3.41 (Unix) mod_perl/1.30',
  'iop-request-id': 'e037c2f3-0cc7-4313-8905-e7ab6b73691f',
  location: 'http://iopscience.iop.org/0004-637X/802/1/65/',
  'content-type': 'text/html; charset=iso-8859-1',
  'set-cookie': 
   [ 'SaneID=93.137.176.1-1437502776422775; path=/; expires=Fri, 24-Jul-20 18:19:36 GMT',
     'IOP_session_live=%2F%2F1437502776270%7C43bbfb33-4e03-45ab-8116-a89e8c269bea%7C20150721-6013aa4dd1b3edb83ac91b13736526e7%7C%7C%7C%7C%7C%7C%7C%7C%7Cguest%2F68e3ffc117fea68490b93c218e172541; Path=/; Domain=.iop.org' ],
  'keep-alive': 'timeout=15, max=100',
  connection: 'Keep-Alive',
  'transfer-encoding': 'chunked' }
REQUEST make request http://iopscience.iop.org/0004-637X/802/1/65/

It then stops here for 60+ seconds before continuing, or it times out. On the other hand, using curl for the same URL takes much less time:

$ time curl 'http://iopscience.iop.org/0004-637X/802/1/65/' > blah.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  155k    0  155k    0     0  62658      0 --:--:--  0:00:02 --:--:-- 62650

real	0m2.540s
user	0m0.009s
sys	0m0.009s

Aha! Found the culprit - we are not setting the UA in unshorten() which apparently creates problems for iop.org. Watch this with the UA set:

$ time curl 'localhost:1970/api?format=mediawiki&search=10.1088%2F0004-637X%2F802%2F1%2F65' > blah.txt ; echo
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5734  100  5734    0     0   1495      0  0:00:03  0:00:03 --:--:--  1495

real	0m3.846s
user	0m0.009s
sys	0m0.005s

I've got a hacked version, I'll clean it and submit a patch tonight/tomorrow.

Change 226250 had a related patch set uploaded (by Mobrovac):
Ensure the User-Agent header is set on every request

https://gerrit.wikimedia.org/r/226250

Change 226250 merged by jenkins-bot:
Ensure the User-Agent header is set on every request

https://gerrit.wikimedia.org/r/226250

Jdforrester-WMF edited projects, added VisualEditor; removed Patch-For-Review.
Jdforrester-WMF set Security to None.
Jdforrester-WMF edited a custom field.

Deployed, resolving.

Mvolz lowered the priority of this task from High to Medium.

Re-opened as we are no longer detecting redirects for this site now as of the last few weeks.