Page MenuHomePhabricator

Revamp spec.yaml in citoid
Closed, ResolvedPublic0 Estimated Story Points

Description

Citoid icinga alert /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200)

Apparently, http://www.pbs.org/newshour/making-sense/care-peoples-kids/ is being checked, and that returns now a 404.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Maybe this should be reconverted on "not check PBS's website updatime" :-) Note that I understand why it has to be external, but does it need to be a single thing going down?/cannot the check be changed so that there is a slightly better heuristic/different error code is sent when an external dependency is not met.

The entire service (citoid - and zotero that is powering many parts of it) is an external sources parser for providing citations. The problem is intrinsic to the service. We 've been having this problem with other sources as well, see T133696. All in all, I agree that alerting when an external service we absolutely don't control or can do anything about it fails is useless. But that poses a pretty good question which is

  • How do I know the service is working correctly when all of it's HTTP endpoints practically depend on a (arbitrary ?) number of external endpoints working correctly and a failure in any of those is effectively a degradation of service ?

My take would be that a proactive monitoring of all external endpoints and advertisement of those alongside status to citoid's users would be valid approach. That is inform the users (namely VE) that endpoints A,B,C have problems and don't work, so don't expect that functionality working either. That is externalize the problem and let the user handle it. That would also allow to implement an HTTP endpoint saying something like "CRITICAL: X/Y external endpoints are down", "WARNING: 1/Y external endpoints are malfunctioning", "OK: All external endpoints working fine" and greatly help the monitoring system that way.

That approach has 2 big caveats. First one is that is needs state and we tend to favor stateless services. The second is that it needs some re-architecturing of the service and development.

I agree that defining the term the service is working properly is rather tricky in the case of Citoid, given its strong dependency on external sources. While I like @akosiaris' idea, I would go as far as re-architecting the service for this. Perhaps there is a way to simply turn CRITICALs into WARNINGs in such a case? In the example listed in this ticket's title, e.g., having a 404 returned should be a warning, while a 5xx response (or no response at all) should be treated as a critical situation. I acknowledge the solution is not ideal, because we could get a 5xx from Citoid if, say, the external source returns some garbage instead of the actual content.

Alternatively, we might consider storing the content of the external sources somewhere on our servers and then Citoid request them. We would need to serve them from a public IP, though, as Citoid does not allow scraping content from internal IPs for security reasons.

Mvolz renamed this task from Citoid icinga alert /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) to Revamp spec.yaml in citoid.Jun 1 2017, 9:37 AM
Mvolz claimed this task.
Mvolz triaged this task as Medium priority.
Mvolz updated the task description. (Show Details)

Change 356559 had a related patch set uploaded (by Mvolz; owner: Marielle Volz):
[mediawiki/services/citoid@master] Update spec.yaml and remove deprecated aspects

https://gerrit.wikimedia.org/r/356559

Note this patch only replaces pbs.org with example.com. We could certainly try to scrape something we host ourselves... is our uptime better? :)

Another approach is to just try a sample of diverse urls and IFF they all give 5xx or 404s, return a critical warning then. But I have no idea how one would go about implementing that or where that code would live; presumably somewhere upstream?

I think @mobrovac's idea if marking a test as warn only might work; Is there a flag you can use to change it from a critical to a warn inside the spec.yaml? Are there any docs on this?

Change 356559 merged by Mobrovac:
[mediawiki/services/citoid@master] Update spec.yaml and remove deprecated aspects

https://gerrit.wikimedia.org/r/356559

Mentioned in SAL (#wikimedia-operations) [2017-06-01T20:32:00Z] <mobrovac@tin> Started deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986

Mentioned in SAL (#wikimedia-operations) [2017-06-01T20:37:20Z] <mobrovac@tin> Finished deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986 (duration: 05m 20s)

Mvolz removed Mvolz as the assignee of this task.Jun 19 2017, 1:24 PM
Mvolz removed a project: Patch-For-Review.
Mvolz added a subscriber: Mvolz.
Mvolz assigned this task to mobrovac.

Marking this as resolved as per the changes made a year ago... feel free to re-open if you disagree.

RLazarus reassigned this task from mobrovac to Mvolz.
RLazarus added subscribers: Joe, RLazarus.

Reopening this -- we had some alerts today from Citoid that turned out to be latency associated with only 404s (graph). We suspect the root cause was elevated latency from example.com.

We should probably remove that last external URL from the spec and replace it with something we host -- even if the uptime still isn't 100%, at least we'll be alerting on our own issues and not on example.com's.

Marielle, can I ask you to take a look?

Change 566743 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Comment out x-ample for http://www.example.com/thisurldoesntexist

https://gerrit.wikimedia.org/r/566743

Reopening this -- we had some alerts today from Citoid that turned out to be latency associated with only 404s (graph). We suspect the root cause was elevated latency from example.com.

We should probably remove that last external URL from the spec and replace it with something we host -- even if the uptime still isn't 100%, at least we'll be alerting on our own issues and not on example.com's.

Marielle, can I ask you to take a look?

I had a change already open to remove the 404s (which is on purpose) - is this what you meant, or is it for the 200s as well? I'm happy to just kill it straight away as a short term solution. (Although there was some disagreement about this approach - see change).

For things that are supposed to return 200, happy to switch to something we host.

Change 566743 merged by jenkins-bot:
[mediawiki/services/citoid@master] Only use sites we host for ichinga

https://gerrit.wikimedia.org/r/566743

Change 703833 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Temporarily silence alerting test

https://gerrit.wikimedia.org/r/703833