NIH db misbehaviour causing problems to Citoid
Closed, Resolved · Public · 1 Estimated Story Points

Description

The NIH database seems to have a lot of problems lately. It takes a long time to connect to their servers, and it also takes a fair amount of time for it to answer requests. This causes Citoid to queue up incoming connections, which in turn causes new connections to time out, as seen by our monitoring:

icinga-wm: PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

Note that this is a combination of both Zotero and NIH taking their time to respond as well as the number of requests made by Citoid itself.

Because the NIH db is the prime source for PM(C) IDs, we need to keep it around, but find a way to limit its impact on Citoid. An obvious idea that comes to mind is to lower the TCP socket connection time-out, but AFAIK we can change that only system-wide, which could have unwanted consequences. Any ideas?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jdforrester-WMF moved this task from To Triage to TR0: Interrupt on the VisualEditor board.

As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop doing that and that would reduce the number of requests by a lot.

For requests made for a pmc or pmid, we have no choice but to use the NIH website.

Is there any difference in load between urls with this root

http://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/

versus this one?

http://www.ncbi.nlm.nih.gov/pubmed/

or this one?

http://www.ncbi.nlm.nih.gov/pmc/articles/

It's possible there are different servers behind them. The first is their official API, but if their paper servers are considerably better we can move some of the work onto that service. If there is no difference then that wouldn't help.

As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop doing that and that would reduce the number of requests by a lot.

That would be ideal. @Mvolz can you assess the community impact of such a change? You mentioned on IRC that the medical community wouldn't be happy. Am I correct to assume that to be true only in the case where we stop all requests to NIH?

For requests made for a pmc or pmid, we have no choice but to use nih website.

Yup, unfortunately :/

Is there any difference in load between urls with this root

While requests to the converter service do complete earlier than the other ones, the problem is caused by the connection to the server itself, which takes an absurd amount of time.

It's possible there are different servers behind them. The first is their official API, but if their paper servers are considerably better we can move some of the work onto that service. If there is no difference then that wouldn't help.

Could you do it, nevertheless? It's an overall win (if only a slight one).

What's the status on this one? We're still using the doi converter api on every request in order to fill in DOI, PMID, and PMC.

Mvolz changed the task status from Open to Stalled.Oct 28 2016, 9:49 PM

If we need to assess community impact we'd need to talk to community liaisons so if this is still an issue we could ask them.

Would maybe a config option for this be of use? Then maybe we should run benchmarks or something?

Another thing that might help is parallelising some of the requests, as in T114907. We can't always do that, because often the identifier comes from Zotero itself. But in cases where we're given the identifier initially (as in requestfromPM or requestfromDOI, or when the DOI is detected via regex in the URL) we could potentially knock a little time off. At present we make the NIH request only after both the scraper and Zotero come back, on every request.
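
A rough sketch of that idea, not Citoid's actual code: when the identifier is already known, the ID lookup is started alongside the Zotero/scraper request instead of after it. The names buildCitation, fetchCitation and lookupIds are hypothetical placeholders for whatever functions do the real work, and the input/citation field names are assumptions.

```javascript
'use strict';

// Hypothetical sketch: fetchCitation stands in for the Zotero/scraper call,
// lookupIds for the NIH ID converter call. Both are assumed to return Promises.
async function buildCitation(input, fetchCitation, lookupIds) {
    // A slow or failed ID lookup should not fail the whole citation.
    const safeLookup = (query) => lookupIds(query).catch(() => ({}));

    if (input.doi || input.pmid || input.pmcid) {
        // Identifier known up front: start both requests at once.
        const [citation, ids] = await Promise.all([
            fetchCitation(input),
            safeLookup(input)
        ]);
        return Object.assign({}, citation, ids);
    }

    // Identifier only known once Zotero/scraper has answered: sequential path.
    const citation = await fetchCitation(input);
    const ids = citation.DOI ? await safeLookup({ doi: citation.DOI }) : {};
    return Object.assign({}, citation, ids);
}
```

In the parallel branch the total wait is roughly the slower of the two requests rather than their sum; the sequential branch is unchanged from the current behaviour described above.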

I think the base problem here is that Citoid strongly depends on the DB. As you point out, we use it throughout Citoid. What would be the direct consequence of not making requests to NIH?

An obvious idea that comes to mind is to lower the TCP socket connection time-out, but AFAIK we can change that only system-wide which could have unwanted consequences. Any ideas?

preq and request make it easy to set per-request timeouts, which don't touch global socket defaults. Perhaps we should just lower the timeout for this upstream service?
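
To illustrate what a per-request timeout looks like without touching system-wide socket settings, here is a minimal sketch using only Node's built-in http module rather than preq or request. The helper names (timeoutFor, getWithTimeout) and the timeout values are assumptions for illustration, not tuned numbers, and the sketch assumes plain http:// URLs like the ones quoted in this task.

```javascript
'use strict';
const http = require('http');

// Hypothetical per-upstream timeouts in milliseconds (placeholder values).
const UPSTREAM_TIMEOUTS_MS = {
    'www.ncbi.nlm.nih.gov': 2000
};
const DEFAULT_TIMEOUT_MS = 10000;

// Pick the timeout based on the host of the upstream URL.
function timeoutFor(url) {
    const host = new URL(url).hostname;
    return Object.prototype.hasOwnProperty.call(UPSTREAM_TIMEOUTS_MS, host)
        ? UPSTREAM_TIMEOUTS_MS[host]
        : DEFAULT_TIMEOUT_MS;
}

// GET a URL with a timeout that applies to this request only.
function getWithTimeout(url, timeoutMs = timeoutFor(url)) {
    return new Promise((resolve, reject) => {
        const req = http.get(url, { timeout: timeoutMs }, (res) => {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve({ status: res.statusCode, body }));
            res.on('error', reject);
        });
        // 'timeout' only reports socket inactivity; the request has to be
        // destroyed explicitly, which then surfaces as an 'error' event.
        req.on('timeout', () => {
            req.destroy(new Error(`timed out after ${timeoutMs} ms`));
        });
        req.on('error', reject);
    });
}
```

Note that this is an inactivity timeout on the socket (covering the slow connect described above), not a cap on total request time. The request library exposes a comparable timeout option per call; the exact option handling in preq should be checked against its documentation.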

I had a case yesterday where I had two different responses for the same request, because in one of them, PubMed timed out. Happy to reduce the timeout just for pubmed, but I have no idea what to set the value to.

mobrovac lowered the priority of this task from High to Medium.Jan 9 2017, 6:34 PM

The immediate problem with production alerts was dealt with in PS 295678, so lowering the priority. As Citoid is basically an aggregation service, we will always have this problem.

We could lower the time-out for certain requests, but the fact of the matter remains that if a server is not responding, Citoid will not produce the desired result.

Mvolz removed Mvolz as the assignee of this task.Jan 12 2017, 9:43 AM

This has caused, for the N-th time, alerts that cannot be acted upon in #wikimedia-operations. As a result, and in the interest of keeping the service monitored instead of just silencing/ignoring all alerts regarding the service, I've submitted, merged and deployed https://gerrit.wikimedia.org/r/#/c/359111/. That allows us to continue monitoring the service for other problems while ignoring the failing part powered by nih.gov until someone sees this and provides a better solution.

I've cherry-picked my change on top of 6683ecd in src/ on tin in order not to push master, and haven't pushed the deploy repo update back to gerrit yet.

GWicke claimed this task.

It sounds like there is nothing left to do here. NIH db outages will always affect citoid requests hitting that backend, and the monitoring issue has been addressed separately. Please reopen if there is anything left to do here.