Revamp spec.yaml in citoid
Open, Normal · Public · 0 Story Points

Description

Citoid icinga alert /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200)

Apparently, http://www.pbs.org/newshour/making-sense/care-peoples-kids/ is being checked, and that now returns a 404.

jcrespo created this task. Apr 27 2017, 11:35 AM
Restricted Application added a project: VisualEditor. Apr 27 2017, 11:35 AM
Restricted Application added a subscriber: Aklapper.

And now it's fine again

Maybe this check should be reconfigured to "not check PBS's website uptime" :-) Note that I understand why it has to be external, but does it need to be a single thing going down? Can the check be changed so that it uses a slightly better heuristic, or sends a different error code, when an external dependency is not met?

The entire service (citoid, and Zotero, which powers many parts of it) is a parser of external sources for providing citations. The problem is intrinsic to the service. We've been having this problem with other sources as well, see T133696. All in all, I agree that alerting when an external service we absolutely don't control, and can't do anything about, fails is useless. But that poses a pretty good question, which is:

  • How do I know the service is working correctly when practically all of its HTTP endpoints depend on an (arbitrary?) number of external endpoints working correctly, and a failure in any of those is effectively a degradation of the service?

My take would be that proactive monitoring of all external endpoints, and advertising their status to citoid's users, would be a valid approach. That is, inform the users (namely VE) that endpoints A, B, C have problems and don't work, so don't expect that functionality to work either. In other words, externalize the problem and let the user handle it. That would also allow implementing an HTTP endpoint returning something like "CRITICAL: X/Y external endpoints are down", "WARNING: 1/Y external endpoints are malfunctioning", "OK: All external endpoints working fine", which would greatly help the monitoring system.

That approach has 2 big caveats. The first is that it needs state, and we tend to favor stateless services. The second is that it needs some re-architecting of the service and additional development.
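A minimal sketch of what such an aggregated status endpoint could look like, purely for illustration: the endpoint path, upstream list, thresholds and the use of Flask/requests are all assumptions, not anything that exists in citoid today.

```python
# Hypothetical sketch only: endpoint name, upstream list and thresholds are
# illustrative, not part of citoid. Requires Flask and requests.
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Assumed list of external sources the service depends on.
UPSTREAMS = {
    "pbs": "http://www.pbs.org/newshour/making-sense/care-peoples-kids/",
    "example": "http://example.com/",
}

def probe(url):
    """Return True if the upstream answers with a 2xx within 5 seconds."""
    try:
        return 200 <= requests.get(url, timeout=5).status_code < 300
    except requests.RequestException:
        return False

@app.route("/_health/upstreams")
def upstream_health():
    results = {name: probe(url) for name, url in UPSTREAMS.items()}
    down = [name for name, ok in results.items() if not ok]
    total = len(results)
    if not down:
        status, code = "OK: All external endpoints working fine", 200
    elif len(down) == 1:
        status, code = f"WARNING: 1/{total} external endpoints are malfunctioning", 200
    else:
        status, code = f"CRITICAL: {len(down)}/{total} external endpoints are down", 503
    return jsonify({"status": status, "down": down}), code
```

In practice the probe results would likely be cached or collected in the background rather than on every request, which is exactly the "needs state" caveat mentioned above.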

I agree that defining when the service is working properly is rather tricky in the case of Citoid, given its strong dependency on external sources. While I like @akosiaris' idea, I wouldn't go as far as re-architecting the service just for this. Perhaps there is a way to simply turn CRITICALs into WARNINGs in such a case? In the example given in this ticket, e.g., a 404 response should produce a warning, while a 5xx response (or no response at all) should be treated as a critical situation. I acknowledge the solution is not ideal, because we could get a 5xx from Citoid if, say, the external source returns some garbage instead of the actual content.
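A rough sketch of that severity mapping, written as a standalone Icinga-style check (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL); the check URL is hypothetical and this is not the actual production check.

```python
# Illustrative only: maps "404 -> WARNING, 5xx or no response -> CRITICAL".
import sys
import requests

# Hypothetical citoid request used as the probe; not the real check URL.
CHECK_URL = "https://citoid.example.org/api?format=mediawiki&search=http%3A%2F%2Fexample.com"

def main():
    try:
        status = requests.get(CHECK_URL, timeout=10).status_code
    except requests.RequestException as exc:
        print(f"CRITICAL: no response from citoid ({exc})")
        return 2
    if 200 <= status < 300:
        print(f"OK: citoid returned {status}")
        return 0
    if status == 404:
        # The external source probably disappeared; degrade to a warning.
        print(f"WARNING: citoid returned {status} (external source missing?)")
        return 1
    print(f"CRITICAL: citoid returned {status}")
    return 2

if __name__ == "__main__":
    sys.exit(main())
```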

Alternatively, we might consider storing the content of the external sources somewhere on our servers and having Citoid request them from there. We would need to serve them from a public IP, though, as Citoid does not allow scraping content from internal IPs for security reasons.

Jdforrester-WMF set the point value for this task to 0. May 2 2017, 7:18 PM
Mvolz moved this task from Backlog to Production on the Citoid board. May 25 2017, 11:13 AM
Mvolz claimed this task. Jun 1 2017, 9:37 AM
Mvolz renamed this task from Citoid icinga alert /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 404 (expecting: 200) to Revamp spec.yaml in citoid.
Mvolz triaged this task as Normal priority.
Mvolz updated the task description. (Show Details)

Change 356559 had a related patch set uploaded (by Mvolz; owner: Marielle Volz):
[mediawiki/services/citoid@master] Update spec.yaml and remove deprecated aspects

https://gerrit.wikimedia.org/r/356559

Mvolz added a comment. Jun 1 2017, 10:09 AM

Note this patch only replaces pbs.org with example.com. We could certainly try to scrape something we host ourselves... is our uptime better? :)

Another approach would be to try a sample of diverse URLs and, only if they all return 5xx or 404, raise a critical alert. But I have no idea how one would go about implementing that, or where that code would live; presumably somewhere upstream?
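A sketch of that "only alert when every sampled URL fails" idea, under stated assumptions: the citoid base URL, the sample list and the failure criteria (404 or 5xx) are all hypothetical, not existing citoid or monitoring code.

```python
# Sketch only: sample several external sources via citoid and escalate to
# CRITICAL only when all of them fail; a partial failure is just a WARNING.
import urllib.parse
import requests

CITOID = "https://citoid.example.org/api?format=mediawiki&search="  # hypothetical
SAMPLE_URLS = [  # hypothetical, deliberately diverse sources
    "http://example.com/",
    "https://www.pbs.org/newshour/",
    "https://doi.org/10.1000/182",
]

def citation_fails(url):
    """A sample counts as failed if citoid returns 404/5xx or nothing at all."""
    try:
        status = requests.get(CITOID + urllib.parse.quote(url, safe=""),
                              timeout=10).status_code
    except requests.RequestException:
        return True
    return status == 404 or status >= 500

def check_sample():
    failures = [u for u in SAMPLE_URLS if citation_fails(u)]
    if len(failures) == len(SAMPLE_URLS):
        return 2, "CRITICAL: all sampled sources failed via citoid"
    if failures:
        return 1, f"WARNING: {len(failures)}/{len(SAMPLE_URLS)} sampled sources failed"
    return 0, "OK: all sampled sources scraped successfully"

if __name__ == "__main__":
    code, message = check_sample()
    print(message)
    raise SystemExit(code)
```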

I think @mobrovac's idea of marking a test as warn-only might work. Is there a flag you can use to change it from a critical to a warning inside spec.yaml? Are there any docs on this?

Change 356559 merged by Mobrovac:
[mediawiki/services/citoid@master] Update spec.yaml and remove deprecated aspects

https://gerrit.wikimedia.org/r/356559

Mentioned in SAL (#wikimedia-operations) [2017-06-01T20:32:00Z] <mobrovac@tin> Started deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986

Mentioned in SAL (#wikimedia-operations) [2017-06-01T20:37:20Z] <mobrovac@tin> Finished deploy [citoid/deploy@ba0db9c]: Update spec to minimise alert noise - T163986 (duration: 05m 20s)

Mvolz removed Mvolz as the assignee of this task.
Mvolz added a subscriber: Mvolz.