
Epic: switch Maps to production status
Closed, ResolvedPublic

Description

This is an umbrella task to aggregate all needed steps to get Maps to production status.

Some notes:

  • varnish monitoring / alerting
  • backend monitoring / alerting
  • probably lots of other stuff that needs to be consolidated

Related Objects

Status    Assigned
Resolved  None
Resolved  MSantos
Resolved  Gehel
Resolved  BBlack
Resolved  BBlack
Resolved  ori
Resolved  Ottomata
Resolved  BBlack
Resolved  BBlack
Resolved  None
Resolved  mark
Resolved  RobH
Resolved  Gehel
Resolved  Gehel
Resolved  RobH
Resolved  None
Resolved  Gehel
Resolved  None
Resolved  None
Resolved  MaxSem
Resolved  BBlack
Resolved  Gehel
Resolved  Gehel
Open      None

Event Timeline

Gehel created this task. Apr 26 2016, 9:35 PM
Restricted Application added a subscriber: Aklapper. Apr 26 2016, 9:35 PM
Gehel added subtasks: Unknown Object (Task), Unknown Object (Task).
fgiunchedi triaged this task as Normal priority. Apr 27 2016, 3:06 PM
Gehel updated the task description. May 2 2016, 1:33 PM

Mobile wiki would like to use maps in their "nearby" feature, as is already done in the Android app. I propose that we allow access to the maps tile service from all WMF servers. This does not mean that users will be able to insert maps into Wikipedia articles, only that the mobile team will be able to use it in their interface.

CC: @Jdlrobson @BBlack

Yurik moved this task from All map-related tasks to Tracking on the Maps board. May 12 2016, 11:06 PM

We'd like to use this basically everywhere that maps make sense, obviously. But it's been locked down in a limited-beta-like status because it's not deployed in a production fashion yet. The whole point of this task is to get us past the hurdle where we care how many other production things rely on this and we don't have to lock down access or referrers at all. We're almost there...

@BBlack, I was hoping we could enable it so that devs can already play with it. Mobile has this limited beta mode thing specifically for this. But with the current referer check, they can only start developing after the referer check is removed, which means it won't become production for a very long time. I think doing development and limited tests in parallel with making it production-quality would be better.
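
For readers unfamiliar with what a referer check does, here is a minimal sketch in Python. It is an illustration only: the thread does not show the actual rule or where it is enforced, the function name `referer_allowed` is made up, the domain list is hypothetical, and rejecting requests that carry no Referer header is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; the real list is not given in this thread.
ALLOWED_SUFFIXES = ("wikimedia.org", "wikipedia.org")


def referer_allowed(referer, allowed=ALLOWED_SUFFIXES):
    """Return True if the Referer header's host is an allowed domain or a subdomain of one."""
    if not referer:
        # Assumption: a missing Referer header is rejected.
        return False
    host = (urlsplit(referer).hostname or "").lower()
    # Match the exact domain or a dot-separated subdomain, so that
    # "evilwikipedia.org" does not pass as "wikipedia.org".
    return any(host == suffix or host.endswith("." + suffix) for suffix in allowed)
```

With a check like this in front of the tile service, a page served from a domain outside the list cannot load tiles, which is why development elsewhere is blocked until the check is relaxed or removed.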

faidon closed subtask Unknown Object (Task) as Resolved. May 16 2016, 11:42 AM
faidon closed subtask Unknown Object (Task) as Resolved.
MaxSem renamed this task from Switch Maps to production status to Epic: switch Maps to production status. May 27 2016, 9:26 PM
MaxSem added a project: Epic.

Re the flurry of ticket updates here and in related places:

  1. Installing and setting up eqiad doesn't have to block this; it can go under some other meta-task for post-production cleanup/improvements, IMHO.
  2. The puppetization blocker is real, if it's true we're not fully puppetized for easy reinstalls.
  3. We're missing blockers here for monitoring/alerting of the internal + external service endpoints (kartotherian.svc and cache_maps).

@BBlack, T137617 was the monitoring one - it is now in icinga. Puppetization is a bit less cut and dried - I think everything has been scripted, but the db install is a 2 day task, so Puppet clearly does not work well for that.

BBlack added a comment. Edited Jun 14 2016, 10:24 PM

@Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533 (and ditto for eqiad once it's alive), as well as the public-side checks on https://maps.wikimedia.org, both of which should alert/page on downtime.
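
A minimal sketch of such an "is this service alive?" check, written as an Icinga/Nagios-style plugin in Python. This is not the check that was deployed: the function name `check_alive`, the 5-second timeout, and treating only 2xx responses as healthy are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_alive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Issue one simple GET and return (exit_code, message).

    `opener` is injectable so the logic can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return CRITICAL, f"CRITICAL: {url} returned HTTP {exc.code}"
    except OSError as exc:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return CRITICAL, f"CRITICAL: {url} unreachable ({exc})"
    if 200 <= status < 300:
        return OK, f"OK: {url} returned HTTP {status}"
    return CRITICAL, f"CRITICAL: {url} returned HTTP {status}"
```

The same function could be pointed at both the internal endpoint (http://kartotherian.svc.codfw.wmnet:6533) and the public one (https://maps.wikimedia.org); a wrapper would print the message and exit with the returned code so the monitoring system can alert on it.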

@mobrovac could you comment on how you do that WRT other services? I thought you also use spec.yaml for that?

Once we're reasonably confident in puppetization and monitoring, we should get someone who's more familiar with setting up internal services (in terms of proper puppetization and monitoring, etc) to give this all a once-over and approve. Maybe @akosiaris or @Joe?

> @Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533

This is covered by LVS checks. @Joe set that up, and this particular check for Kartotherian seems to be alive (I received an email alert for it last night). The check performs a full service check, just as it does on individual nodes.

> as well as the public-side checks on https://maps.wikimedia.org

I have no info on how public checks are set up. Catchpoint perhaps?

>> @Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533

> This is covered by LVS checks. @Joe set that up, and this particular check for Kartotherian seems to be alive (I received an email alert for it last night). The check performs a full service check, just as it does on individual nodes.

No, I think what you're referring to that Joe set up is the per-service-host checks. Before we started looking at this late yesterday, there were no overall checks on abstract service endpoints (kartotherian.svc and/or maps-lb). The one you got emailed for last night is one I set up last night ( https://gerrit.wikimedia.org/r/#/c/294396/ ) and then later removed in a partial revert ( https://gerrit.wikimedia.org/r/#/c/294406/ ) because it was bad cargo-cult. The one I left behind there is the general icinga LVS check for the internal LVS service, but doesn't alert outside of ops.

>> as well as the public-side checks on https://maps.wikimedia.org

> I have no info on how public checks are set up. Catchpoint perhaps?

Catchpoint is something else we should set up, but icinga also monitors public services. That part is set up, from https://gerrit.wikimedia.org/r/#/c/294397/ last night.

> No, I think what you're referring to that Joe set up is the per-service-host checks. Before we started looking at this late yesterday, there were no overall checks on abstract service endpoints (kartotherian.svc and/or maps-lb). The one you got emailed for last night is one I set up last night ( https://gerrit.wikimedia.org/r/#/c/294396/ ) and then later removed in a partial revert ( https://gerrit.wikimedia.org/r/#/c/294406/ ) because it was bad cargo-cult.

I re-opened T137851 and uploaded https://gerrit.wikimedia.org/r/294454 for it. Basically, the path was wrong because that specific check expects the monitoring spec, not a binary blob.
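
To illustrate the failure mode, here is a rough Python sketch of the difference between a spec document and a binary blob from a checker's point of view. It is not the actual checker code: the function name `looks_like_spec` is made up, and it assumes the spec is served as JSON with a top-level `paths` mapping, as in Swagger/OpenAPI documents.

```python
import json


def looks_like_spec(body: bytes) -> bool:
    """Return True if `body` parses as a JSON object with a `paths` mapping.

    Binary payloads fail at the decode or parse step and yield False.
    """
    try:
        doc = json.loads(body.decode("utf-8"))
    except ValueError:
        # Covers both UnicodeDecodeError and json.JSONDecodeError.
        return False
    return isinstance(doc, dict) and isinstance(doc.get("paths"), dict)
```

A check pointed at a path that returns binary data would fail this kind of test on every run, regardless of whether the service itself is healthy.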

> Catchpoint is something else we should set up, but icinga also monitors public services. That part is set up, from https://gerrit.wikimedia.org/r/#/c/294397/ last night.

Neat!

Gehel added a comment. Jun 17 2016, 8:36 AM

The referrer check has been removed (T137848), which allows more experimentation to happen. I'd still like to validate our automation of setting up new nodes and the related documentation before we actually call Maps production ready.

mxn added a subscriber: mxn. Oct 17 2016, 9:56 AM
Mholloway closed this task as Resolved. Jul 31 2018, 4:50 PM
Gehel added a comment. Jul 31 2018, 4:50 PM

Maps has been considered production for some time. There will always be things to improve, but outside of this task.