Some Core availability Catchpoint tests might be more expensive than they need to be
Closed, Resolved · Public

Description

A lot of the availability tests use the Chrome monitor, which costs 1 point per run, whereas the Emulated monitor would be sufficient for testing availability.

While running JS with the Chrome monitor probably triggers more requests, it doesn't tell us whether the JS experience is broken, since we are merely testing for availability.
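
For context, an availability-only check of this kind reduces to an HTTP request plus a status-code assertion, with no JavaScript execution. A minimal sketch in Python using only the standard library; the URL and timeout are illustrative, not the actual Catchpoint configuration:

```
from urllib.request import urlopen
from urllib.error import URLError

def is_available(url, timeout=10.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        # urlopen raises HTTPError (a URLError subclass) for non-2xx statuses.
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except URLError:
        return False

# Example: probe one of the wikis' main pages.
print(is_available("https://en.wikipedia.org/wiki/Main_Page"))
```

As I understand it, the Emulated monitor does more than this (it also fetches the page's subresources, just without executing JS), but the availability pass/fail signal is the same.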

Was there any rationale for testing the "big wikis" with the Chrome monitor rather than the Emulated monitor?

Event Timeline

Gilles created this task.Apr 13 2017, 6:54 AM
Restricted Application added a subscriber: Aklapper.Apr 13 2017, 6:54 AM
Peter added a subscriber: Peter.Apr 18 2017, 8:13 AM

I did some checks: in *Core Services Availability* we do two payments checks with Chrome against the API and verify that it returns OK. I'm pretty sure they could use the "Object" monitor and just verify the result?
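
To illustrate what an Object-style check could look like: fetch the API endpoint directly and assert on the response body, with no browser involved. A rough Python sketch; the endpoint and the expected "status": "OK" field are hypothetical placeholders, not the real payments check:

```
import json
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical endpoint; stands in for the real payments API check.
API_URL = "https://payments.example.org/api/status"

def api_returns_ok(url, timeout=10.0):
    """Fetch the API directly and verify the response body reports OK."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (URLError, ValueError):  # connection/HTTP errors or invalid JSON
        return False
    # Hypothetical response shape: {"status": "OK", ...}
    return body.get("status") == "OK"
```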

For the other tests we need to discuss; I haven't used Catchpoint in a while. Some tests, I think, overlap with WebPageTest, but we don't have the same alerting there.

MoritzMuehlenhoff triaged this task as Medium priority.May 10 2017, 3:31 PM
Gilles moved this task from Inbox to Radar on the Performance-Team board.May 17 2017, 6:52 PM
Krinkle added a subscriber: Krinkle.

We created this task at a time when we were considering using Catchpoint for some of our web performance tests. At that time, a lot of credits were already taken up by questionable tests. As such, this task was created for Ops to re-evaluate some of those and potentially free up credits for us to use.

While that should still happen (in the interest of not wasting anything), I'm untagging Performance-Team, given we're no longer interested in using Catchpoint for anything related to web performance due to the instability of its metrics. I suppose it has different objectives (correctness/availability, not millisecond-precise, consistent timing metrics).

faidon assigned this task to Volans.Jan 29 2018, 2:54 AM
faidon raised the priority of this task from Medium to High.

Mentioned in SAL (#wikimedia-operations) [2018-01-31T01:51:25Z] <mutante> catchpoint: recycled gwicke's user and turned it into a user for volans, upgraded him to admin (T162857)

Change 410951 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Icinga: update allowed IPs for external monitoring

https://gerrit.wikimedia.org/r/410951

Change 410951 merged by Volans:
[operations/puppet@production] Icinga: update allowed IPs for external monitoring

https://gerrit.wikimedia.org/r/410951

Volans moved this task from Inbox to In progress on the observability board.
Volans closed this task as Resolved.Apr 19 2018, 9:10 AM
Volans added subscribers: chasemp, faidon.

To summarize the work done recently: I've audited the existing checks and fixed/improved those that had clear errors or needed to be updated. @chasemp has very kindly offered to review the WMCS-related checks, users, and groups.

A summary of the audit of the existing checks and their cost in points, users, and groups is available in this internal document.

I've also added myself to all the alerts to get a better view of which checks are less reliable or produce more false positives than others, in order to improve their reliability on the alerting side.
In addition to improving the API vs. Chrome checks, another thing worth investigating is the ability to use public cloud nodes instead of the backbone nodes in some cases, given their reduced cost.

There is a medium/long-term plan to review, improve, and optimize our external checks; it is out of scope for this specific task and will continue in the background. Given that we're not short on points at the moment, I'm resolving the task.