NIH db misbehaviour causing problems to Citoid
Closed, Resolved · Public · 1 Estimated Story Point

Assigned To

Authored By

	• mobrovac
	Apr 26 2016, 2:53 PM

Description

The NIH database seems to have a lot of problems lately. It takes a long time to connect to their servers, and it also takes a fair amount of time for them to answer requests. As a result, Citoid queues up incoming connections, and new connections then time out, as seen by our monitoring:

icinga-wm: PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

Note that this is a combination of both Zotero and NIH taking their time to respond, as well as the number of requests made by Citoid itself.

Because the NIH db is the prime source for PM(C) IDs, we need to keep it around, but find a way to limit its impact on Citoid. An obvious idea that comes to mind is to lower the TCP socket connection time-out, but AFAIK we can change that only system-wide, which could have unwanted consequences. Any ideas?

Related Objects

Mentioned In: T162886: Parallelise pubmed requests to get IDs earlier in the request chain and skip it when the DOIs are scraped from the page (rarer occurrence.)
T163986: Revamp spec.yaml in citoid
Mentioned Here: rGCIT6683ecd661fb: Update spec.yaml and remove deprecated aspects
T114907: Parallelize scraper and Zotero requests

Event Timeline

• mobrovac created this task. · Apr 26 2016, 2:53 PM

Restricted Application added a project: VisualEditor. · View Herald Transcript · Apr 26 2016, 2:53 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Jdforrester-WMF triaged this task as High priority. · Apr 26 2016, 7:07 PM

Jdforrester-WMF moved this task from To Triage to TR0: Interrupt on the VisualEditor board.

As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop doing that and that would reduce the number of requests by a lot.

For requests made for a pmc or pmid, we have no choice but to use nih website.
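To make the split above concrete, here is a minimal sketch of the decision, written in Python purely for illustration (it is not Citoid's actual code). The `CitationRequest` type, the `search_type` values and the `enrich_doi_with_pmid` flag are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CitationRequest:
    """Hypothetical stand-in for an incoming Citoid request."""
    search_type: str            # illustrative values: "pmid", "pmcid", "doi", "url"
    doi: Optional[str] = None   # DOI found by Zotero or the scraper, if any


def needs_nih_lookup(req: CitationRequest, enrich_doi_with_pmid: bool = False) -> bool:
    """Return True when the NIH service has to be contacted for this request."""
    # Requests made for a PMID or PMCID cannot be answered without NIH.
    if req.search_type in ("pmid", "pmcid"):
        return True
    # For everything else the lookup is optional enrichment: it only runs
    # when a DOI is available and the (hypothetical) flag is switched on.
    return enrich_doi_with_pmid and req.doi is not None
```

With the flag off, only requests that explicitly ask for a PMID or PMCID would reach NIH; turning it on restores the lookup for every citation that has a DOI.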

Is there any difference in load between urls with this root

http://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/

versus this one?

http://www.ncbi.nlm.nih.gov/pubmed/

or this one?

http://www.ncbi.nlm.nih.gov/pmc/articles/

It's possible there are different servers behind them. The first is their official API, but if their paper servers are considerably better we can move some of the work onto that service. If there is no difference then that wouldn't help.
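For reference, a request against the first root (the ID converter) could be built roughly as below. This is a Python sketch for illustration only: the `ids`, `format`, `tool` and `email` query parameters are my understanding of the converter's interface and should be checked against the NIH documentation, and the DOI, tool name and e-mail address are placeholders.

```python
from urllib.parse import urlencode

# Root taken from the comment above.
IDCONV_ROOT = "http://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"


def idconv_url(identifier: str,
               tool: str = "example-tool",
               email: str = "ops@example.org") -> str:
    """Build a converter URL for one DOI, PMID or PMCID (parameter names assumed)."""
    query = urlencode({
        "ids": identifier,   # the identifier to convert
        "format": "json",    # ask for a JSON response
        "tool": tool,        # placeholder client name
        "email": email,      # placeholder contact address
    })
    return f"{IDCONV_ROOT}?{query}"


print(idconv_url("10.1000/example"))
```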

In T133696#2241618, @Mvolz wrote:

As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop doing that and that would reduce the number of requests by a lot.

That would be ideal. @Mvolz, can you assess the community impact of such a change? You mentioned on IRC that the medical community wouldn't be happy. Am I correct to assume that to be true only in the case where we stop all requests to NIH?

For requests made for a pmc or pmid, we have no choice but to use nih website.

Yup, unfortunately :/

Is there any difference in load between urls with this root

While requests to the converter service do complete earlier than the other ones, the problem is caused by the connection to the server itself, which takes an absurd amount of time.

It's possible there are different servers behind them. The first is their official API, but if their paper servers are considerably better we can move some of the work onto that service. If there is no difference then that wouldn't help.

Could you do it, nevertheless? It's an overall win (if only a slight one).

Jdforrester-WMF set the point value for this task to 1. · Apr 28 2016, 1:25 AM

Mvolz moved this task from Backlog to IO Tasks on the Citoid board. · Jul 29 2016, 3:02 PM

Jdforrester-WMF moved this task from TR0: Interrupt to External and Administrivia on the VisualEditor board. · Aug 9 2016, 7:35 PM

What's the status on this one? We're still using the doi converter api on every request in order to fill in DOI, PMID, and PMC.

If we need to assess community impact we'd need to talk to community liaisons so if this is still an issue we could ask them.

Mvolz moved this task from IO Tasks to Service: Scraper & Validation on the Citoid board. · Oct 28 2016, 9:50 PM

Would maybe a config option for this be of use? Then maybe we should run benchmarks or something?

Another thing that might help is to parallelise some of the requests, as in T114907. We can't always do that, because often the identifier comes from Zotero itself. But in cases where we're given the identifier initially (as in requestfromPM or requestfromDOI, or when the DOI is detected via regex in the URL) we could potentially knock a little time off. At present we make the NIH request only after both the scraper and Zotero come back, on every request.
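The difference between the two cases can be sketched as follows. This is an illustrative Python/asyncio sketch, not Citoid's implementation: `fetch_zotero` and `fetch_ids` are placeholder coroutines that only sleep, and the returned title, DOI, PMID and PMCID are made-up values.

```python
import asyncio
from typing import Optional


async def fetch_zotero(url: str) -> dict:
    """Placeholder for the Zotero/scraper round trip."""
    await asyncio.sleep(0.05)
    return {"title": "Example", "DOI": "10.1000/example"}


async def fetch_ids(identifier: str) -> dict:
    """Placeholder for the NIH ID lookup (made-up values)."""
    await asyncio.sleep(0.05)
    return {"PMID": "0000000", "PMCID": "PMC0000000"}


async def build_citation(url: str, known_id: Optional[str] = None) -> dict:
    if known_id is not None:
        # Identifier supplied with the request: start both lookups at once.
        citation, ids = await asyncio.gather(fetch_zotero(url), fetch_ids(known_id))
    else:
        # Identifier only known once Zotero answers: the lookups run one after the other.
        citation = await fetch_zotero(url)
        ids = await fetch_ids(citation["DOI"])
    return {**citation, **ids}
```

Both paths return the same merged citation. The first path waits roughly as long as the slower of the two calls rather than their sum, which is where the saving would come from.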

Mvolz claimed this task. · Nov 4 2016, 9:13 AM

I think the base problem here is that Citoid strongly depends on the DB. As you point out, we use it throughout Citoid. What would be the direct consequence of not making requests to NIH?

An obvious idea that comes to mind is to lower the TCP socket connection time-out, but AFAIK we can change that only system-wide which could have unwanted consequences. Any ideas?

preq and request make it easy to set per-request timeouts, which don't touch global socket defaults. Perhaps we should just lower the timeout for this upstream service?
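As an illustration of the per-request idea (shown with Python's standard library rather than preq or request, so this is only an analogy to what Citoid would do), the timeout is passed to the individual call and the process-wide socket default stays untouched. The helper name `fetch_with_timeout` and the choice to return `None` on failure are assumptions made for this sketch.

```python
import urllib.request
from typing import Optional


def fetch_with_timeout(url: str, timeout_s: float) -> Optional[bytes]:
    """Fetch url, giving up after timeout_s seconds; None means no usable answer."""
    try:
        # The timeout applies to this call only; the process-wide default
        # (socket.setdefaulttimeout) is not modified.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # Covers timeouts, refused connections and HTTP error responses,
        # all of which are OSError subclasses in the standard library.
        return None
```

A caller could then use a short value for the NIH endpoints only, while other upstream requests keep their usual timeout.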

I had a case yesterday where I had two different responses for the same request, because in one of them, PubMed timed out. Happy to reduce the timeout just for pubmed, but I have no idea what to set the value to.

The immediate problem with production alerts was dealt with in PS 295678, so lowering the priority. As Citoid is basically an aggregation service, we will always have this problem.
We could lower the time-out for certain requests, but the fact of the matter remains that if a server is not responding, Citoid will not produce the desired result.
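The trade-off described here can be sketched as below: with a short timeout the caller gets an answer quickly, but the answer lacks the NIH-provided IDs whenever the upstream is slow. This is a hedged Python/asyncio illustration only; `slow_id_lookup` is a placeholder that simulates the upstream delay, and the PMID is a made-up value.

```python
import asyncio


async def slow_id_lookup(doi: str, delay_s: float) -> dict:
    """Placeholder for the NIH lookup; delay_s simulates a slow upstream."""
    await asyncio.sleep(delay_s)
    return {"PMID": "0000000"}


async def citation_with_optional_ids(base: dict, delay_s: float, timeout_s: float) -> dict:
    """Return base plus IDs if the lookup answers within timeout_s, else base alone."""
    try:
        ids = await asyncio.wait_for(slow_id_lookup(base["DOI"], delay_s), timeout_s)
    except asyncio.TimeoutError:
        # Upstream too slow: answer without the extra identifiers.
        ids = {}
    return {**base, **ids}
```

The same input can therefore produce two different responses depending on how the upstream behaves at that moment, which matches the PubMed timeout case reported earlier in the thread.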

Mvolz removed Mvolz as the assignee of this task. · Jan 12 2017, 9:43 AM

akosiaris mentioned this in T163986: Revamp spec.yaml in citoid. · Apr 28 2017, 8:51 AM

This has caused, for the N-th time, alerts that cannot be acted upon in #wikimedia-operations. As a result, and in the interest of keeping the service monitored instead of just silencing/ignoring all alerts regarding the service, I've submitted, merged and deployed https://gerrit.wikimedia.org/r/#/c/359111/. That allows us to continue monitoring the service for other problems while ignoring the failing part powered by nih.gov until someone sees this and provides a better solution.

I've cherry-picked my change on top of 6683ecd in src/ on tin in order not to push master, and haven't pushed the deploy repo update back to gerrit yet.

Mvolz mentioned this in T162886: Parallelise pubmed requests to get IDs earlier in the request chain and skip it when the DOIs are scraped from the page (rarer occurrence.). · Jun 29 2017, 10:24 AM

It sounds like there is nothing left to do here. NIH db outages will always affect citoid requests hitting that backend, and the monitoring issue has been addressed separately. Please reopen if there is anything left to do here.

Restricted Application added a project: User-Ryasmeen. · View Herald Transcript · Jul 12 2017, 5:57 PM

• mobrovac edited projects, added Services (done); removed Services. · Jul 18 2017, 7:53 PM
