This is an umbrella task to aggregate all needed steps to get Maps to production status.
Some notes:
- varnish monitoring / alerting
- backend monitoring / alerting
- probably lots of other stuff that needs to be consolidated
Mobile wiki would like to use maps in their "nearby" feature, just as the Android app already does. I propose that we allow access to the maps tile service from all WMF servers. This does not mean that users will be able to insert maps into Wikipedia articles, only that the mobile team will be able to use it in their interface.
CC: @Jdlrobson @BBlack
We'd like to use this basically everywhere that maps make sense, obviously. But it's been locked down in a limited-beta-like status because it's not deployed in a production fashion yet. The whole point of this task is to get us past that hurdle, to where we no longer care how many other production things rely on this and don't have to lock down access or referrers at all. We're almost there...
@BBlack, I was hoping we could enable it so that devs can already play with it. Mobile has this limited beta mode specifically for this. But with the current referer check in place, they can only start developing after the check is removed, which means it won't reach production until a very long time from now. I think doing development and limited tests in parallel with making it production-quality would be better.
Re the flurry of ticket updates here and in related places:
@Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533 (and ditto for eqiad once it's alive), as well as the public-side checks on https://maps.wikimedia.org, both of which should alert/page on downtime.
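To make the "is this service alive?" idea concrete, here is a minimal sketch of such a check: one simple request per endpoint, counted as healthy only if it returns a 2xx status in time. This is not the actual Icinga or LVS check; the function name `is_alive` and the 5-second timeout are assumptions made for illustration, and only the two URLs come from the comment above.

```python
import http.client
import urllib.error
import urllib.request


def is_alive(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status within `timeout` seconds.

    Hypothetical helper: the name and the default timeout are illustrative assumptions.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP 4xx/5xx,
        # a broken HTTP response, or a malformed URL: report the service as down.
        return False
```

A caller could loop over `http://kartotherian.svc.codfw.wmnet:6533` and `https://maps.wikimedia.org`, and alert or page whenever `is_alive` returns `False`.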
@mobrovac could you comment on how you do that WRT other services? I thought you also use spec.yaml for that?
Once we're reasonably confident in the puppetization and monitoring, we should get someone who's more familiar with setting up internal services (in terms of proper puppetization, monitoring, etc.) to give this all a once-over and approve it. Maybe @akosiaris or @Joe?
This is covered by LVS checks. @Joe set that up, and this particular check for Kartotherian seems to be alive (I received an email alert for it last night). The check performs a full service check, just as it does on individual nodes.
> as well as the public-side checks on https://maps.wikimedia.org
I have no info on how public checks are set up. Catchpoint perhaps?
No, I think what you're referring to that Joe set up is the per-service-host checks. Before we started looking at this late yesterday, there were no overall checks on abstract service endpoints (kartotherian.svc and/or maps-lb). The one you got emailed for last night is one I set up last night ( https://gerrit.wikimedia.org/r/#/c/294396/ ) and then later removed in a partial revert ( https://gerrit.wikimedia.org/r/#/c/294406/ ) because it was bad cargo-cult. The one I left behind there is the general icinga LVS check for the internal LVS service, but doesn't alert outside of ops.
> as well as the public-side checks on https://maps.wikimedia.org
> I have no info on how public checks are set up. Catchpoint perhaps?
Catchpoint is something else we should set up, but icinga also monitors public services. That part is set up, from https://gerrit.wikimedia.org/r/#/c/294397/ last night.
I re-opened T137851 and uploaded https://gerrit.wikimedia.org/r/294454 for it. Basically, the path was wrong because that specific check expects the monitoring spec, not a binary blob.
> Catchpoint is something else we should set up, but icinga also monitors public services. That part is set up, from https://gerrit.wikimedia.org/r/#/c/294397/ last night.
Neat!
The referrer check has been removed (T137848), which allows more experimentation to happen. I'd still like to validate our automation for setting up new nodes and the related documentation before we actually call Maps production-ready.
Maps has been considered production for some time. There will always be things to improve, but those are outside the scope of this task.