
migrate RT maint-announce into phabricator
Closed, Declined (Public)

Description

Final Plan:

We'll need to modify our workflow. Right now in RT, maint-announcements come in multiple times for a single event. We'll typically get an initial notification of maintenance, then any modifications between initial notification and the event. Then we also tend to get reminders. Some vendors send all three of these, some send only one.

RT allows tickets to be merged, so we merge the later tickets into the earlier ticket for the maintenance window. Merging is allowed in Phab as well but works slightly differently.

  • Add in the maint-announce project and have emails into it trigger creation of a new task tagged with the SRE and maint-announce projects, in the S6 space.
    • This includes adding the new #maint-announce project to the herald rule so that SRE always travels with it.
  • Vendors/Carriers/Datacenters/Peers email in notices.
  • Ops Clinic Duty person triages incoming notices.
  • If the notice is new they do the following:
    • Add to Operations tracking google calendar.
      • Include the circuit IDs and Task # in the entry.
    • Move notice from backlog/new to 'on calendar' on workboard; this move is shown on the task details. (This workboard doesn't yet exist.)
    • Stall task until end of the maint-window; then resolve task.
  • If the notice is a followup to an existing task:
    • The original task is merged into the new task as a duplicate.
    • If the date/time changed, the google calendar entry is updated.
      • Include the circuit IDs and Task # in the entry.
    • Move notice from backlog/new to 'on calendar' on workboard; this move is shown on the task details. (This workboard doesn't yet exist.)

In RT we went with merging the newer tasks into the original, but RT merged in content. Since Phabricator does not, keeping the new task open and merging in the older task gives us the more up-to-date information immediately without requiring anyone to manually copy over data from one task to another.
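The merge direction described above can be sketched roughly as follows. This is only an illustration of the intended triage outcome, not anything Phabricator itself runs: the `Task` class, the `event_key` field (standing in for the clinic person's judgment that two notices concern the same maintenance event) and the status strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: int
    event_key: str                     # hypothetical: identifies one maintenance event
    status: str = "open"
    merged_into: Optional[int] = None  # id of the task this one was merged into


def triage(new_task: Task, open_tasks: List[Task]) -> Task:
    """Keep the newest notice open and merge older ones for the same event into it."""
    for old in open_tasks:
        if old.event_key == new_task.event_key and old.status != "duplicate":
            # The older task is closed as a duplicate of the newer one, so the
            # surviving task always carries the most recent vendor text.
            old.status = "duplicate"
            old.merged_into = new_task.task_id
    # The surviving task is stalled until the maintenance window has passed.
    new_task.status = "stalled"
    return new_task
```

For example, triaging a reminder `Task(2, "equinix-chiller")` against an existing `Task(1, "equinix-chiller")` leaves task 2 stalled and task 1 marked as a duplicate merged into 2.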

Considerations:

  • we need to forward maint-announce@rt to maint-announce@phab
  • This does not include advance whitelisting of the domains of our maint-announcement vendors, since we are using a designated space for maint-announce items to contain them for search and security, and Phabricator natively provides the ability to specify a creation address per space. We could in the future do this with custom pre-Phabricator handling behavior, but that would be prone to far more breakage. We also do not currently try to whitelist senders in RT, and it seems prudent to migrate the existing restrictions rather than do it all at once. This ability did not exist when the initial conversations happened many months ago, so this is a slight rethinking.
  • we could later move this to phab's internal calendar
  • #acl*operations-team is being used to secure the space and also is applied to tasks incoming to the space by default

This task will detail the overall migration of the maint-announce queue in RT into phabricator.

The maint-announces come in via an alias, and then are piped into RT. Once in RT, the ops clinic person triages announcements and ensures they are placed on the operations tracking gcal.

We'll need to relocate this queue/project into phabricator, as it is the last remaining use of RT.

Currently, maintenance notifications are triaged by our ops clinic duty person for the week. Their steps are detailed on https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Responsibilities

They include:

  • Maintain the 'maint-announce' queue and calendar:
    • This is the ONLY RT queue left for Ops Clinic Duty coverage.
    • Modify ticket Subject to prepend dates of effect in big-endian order (ex: 2014-11-06 to 2014-11-09: Equinix chiller maintenance)
    • Merge follow-up tickets as needed so that there is one per maintenance event
    • There is [https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars a gcal shared with all WMF named 'Ops maintenance & contracts']. All maint-announce queue tickets should be entered into this calendar.
      • Include the circuit IDs and RT#s in the entry. (See entries on 2014-10-07 for examples.)
      • Update the ticket in RT from 'new' to 'open' and comment that it has been added to the ops tracking gcal.
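
The subject convention in the list above (dates of effect prepended in big-endian order) could be produced by a small helper like the one below. This is a sketch only; the function name and the handling of single-day windows are assumptions, not part of the documented procedure.

```python
from datetime import date
from typing import Optional


def prefix_subject(subject: str, start: date, end: Optional[date] = None) -> str:
    """Prepend the maintenance window to a subject in YYYY-MM-DD form."""
    if end is None or end == start:
        # Assumption: a single-day window shows only one date.
        window = start.isoformat()
    else:
        window = f"{start.isoformat()} to {end.isoformat()}"
    return f"{window}: {subject}"
```

Called with `date(2014, 11, 6)` and `date(2014, 11, 9)` for "Equinix chiller maintenance", it yields the example subject given in the list.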

Please note that when this is done, RT can be made read-only and all mail relays from RT killed (NOT FORWARDED).

The maint-requests go to aliases, not to RT, so they can be redirected to phabricator.

Event Timeline


I'll detail out how mail is routed and how we triage the requests shortly.

scfc renamed this task from migrate RT main-announce into phabricator to migrate RT maint-announce into phabricator. Nov 9 2015, 2:49 PM
scfc set Security to None.
This comment was removed by RobH.

We'll need to modify our workflow. Right now in RT, maint-announcements come in multiple times for a single event. We'll typically get an initial notification of maintenance, then any modifications between initial notification and the event. Then we also tend to get reminders. Some vendors send all three of these, some send only one.

RT allows tickets to be merged, so we merge in the later tickets to the earlier ticket for the maintenance window. One possible way to work this in phabricator is simply link every followup task as a blocker to the initial task and then resolve the followup tasks/emails. The initial task/email is not resolved until AFTER the maintenance window has passed.

Also in chatting with Chase, we should put maint-announcements into their own space. Since these announcements can contain info on when we may lose connectivity, they may allow for directed attacks on remaining links to saturate them. (Perhaps this is paranoid, but I think it's a good idea unless @mark or @faidon disagree.)

I've CC'd in @Dzahn. Daniel regularly patrols maint-announce and updates the tracking calendar, so I want to ensure we have him review our potential plan(s).

You can merge in phab also, but comments and content don't come along unless you manually copy.

I see what @greg means about how content is not copied in during merge. So it is not as ideal as how we handled it in RT, but I propose the following workflow in phabricator:

  • Add in the maint-announce project and have emails into it trigger creation of a new task with the SRE, maint-announce, and in the S4 (private operations space).
    • This includes adding the new #maint-announce project to the herald rule so that SRE always travels with it.
    • This includes advance whitelisting the domains of our maint-announcement vendors.
    • This expands the use of the S4 space from strictly procurement use, but that was expected (hence it's called Operations Vendors, not Procurement). I'll likely modify the space name to simply be called Operations Private and allow the #acl*operationsteam and #acl*procurement folks to access it. Anyone who isn't in ops and needs additional access will be added to the latter group.
      • We may want to rename #acl*procurement into something more on point, but that is 100% cosmetic at this point, as I (@RobH) maintain the membership of said group.
  • Vendors email in notices.
  • Ops Clinic Duty person triages incoming notices.
  • If the notice is new they do the following:
    • Modify ticket Subject to prepend dates of effect in start-end order (ex: 2014-11-06 to 2014-11-09: Equinix chiller maintenance)
    • Add to Operations tracking google calendar.
      • Include the circuit IDs and Task#s in the entry.
    • Move notice from backlog/new to 'on calendar' on the workboard and comment on task that it has been added.
    • Stall task until end of the maint-window; then resolve task.
  • If the notice is a followup to an existing task:
    • I'm uncertain as to which option to use:

A)

  • Any follow up tasks are reviewed to confirm no changes to initial window(s). If so, modify the ORIGINAL task to append in the new task as a duplicate.
    • The data of the duplicate task is not copied, this must be manually copied.
    • If the date/time changed, the google calendar entry is updated.
      • Include the circuit IDs and Task#s in the entry.
    • Updating Opsen comments about processed updates on task. (EG: date window changed, updated google calendar entry.)

B)

  • The original task is merged into the new task as a duplicate.
    • If the date/time changed, the google calendar entry is updated.
      • Include the circuit IDs and Task#s in the entry.
    • Updating Opsen comments about processed updates on task. (EG: date window changed, updated google calendar entry.)

In RT we went with option A, but RT merged in task content. Since Phabricator does not, option B would give us the more up to date information immediately without requiring anyone to manually copy over data from one task to another. I present both, as A is how we used to do it, but B seems better for the new workflow (to me.)

I also want to note on record that we should not disclose our maint-announcements by default. This is why I stated that they should be auto-generated in the S4 operations private space. Since these include when primary links or datacenter facilities may be compromised (due to downtime or scheduled maintenance), it seems like we should err on the side of paranoia. (Yes the info is not entirely non-public, but hopefully someone with the time and resources to track down that kind of information realizes we are a horrible target because Wikipedia!)

Ok, I've had an IRC discussion with @Chase and @Dzahn about this workflow, and I want to modify my suggestion to the following:

  • Add in the maint-announce project and have emails into it trigger creation of a new task tagged with the SRE and maint-announce projects, in a NEW space for maint-announcements.
    • This includes adding the new #maint-announce project to the herald rule so that SRE always travels with it.
    • This includes advance whitelisting the domains of our maint-announcement vendors.
    • This creates a new space for the announcements. This space will be private for #acl*operations-team only, as it will have potentially private link data.
      • If we want to allow non-operations members in, it will complicate auditing. I (@RobH) advise against it starting off; and even if expanded it would likely only include WMF staff. (It is unlikely that anyone else has a legitimate need to see these. If any planned maintenance is service affecting, notification is part of triage.)
  • Vendors/Carriers/Datacenters/Peers email in notices.
  • Ops Clinic Duty person triages incoming notices.
  • If the notice is new they do the following:
    • Add to Operations tracking google calendar.
      • Include the circuit IDs and Task # in the entry.
    • Move notice from backlog/new to 'on calendar' on workboard; this move is shown on the task details. (This workboard doesn't yet exist.)
    • Stall task until end of the maint-window; then resolve task.
  • If the notice is a followup to an existing task:
    • The original task is merged into the new task as a duplicate.
    • If the date/time changed, the google calendar entry is updated.
      • Include the circuit IDs and Task # in the entry.
    • Move notice from backlog/new to 'on calendar' on workboard; this move is shown on the task details. (This workboard doesn't yet exist.)

In RT we went with merging the newer tasks into the original, but RT merged in content. Since Phabricator does not, keeping the new task open and merging in the older task gives us the more up-to-date information immediately without requiring anyone to manually copy over data from one task to another.

IRC Update: @Dzahn proposes we evaluate using phabricator's calendar tracking for these in lieu of using google.

Benefits: Open source \o/, single interface
Drawbacks: We don't have much experience using phabricator calendar, and it's largely a beta feature at this time.

@RobH's Proposal: We don't migrate both into phabricator for announcements AND for calendar just yet, only the announcements. Once we have migrated that process and notifications successfully, we can do a more detailed comparison of phabricator versus google calendar. If we do start using phabricator calendar for this, we may want to move over ssl renewals and support contracts as well.

Ok, I've had an IRC discussion with @Chase and @Dzahn about this workflow

No you haven't, different person. I'm not the Chase you're looking for.

RobH mentioned this in Unknown Object (Task). Dec 3 2015, 4:59 PM
RobH added a parent task: Unknown Object (Task). Dec 3 2015, 5:11 PM

I've assigned this to the @chasemp for his review of the proposed workflow/email notifications of announcements.

chasemp triaged this task as Medium priority.
chasemp updated the task description.

I modified your plan slightly to reflect the technical possibilities of spaces and task creation. i.e. phab does almost all of this natively so we should try without a whitelist first as that kicks a lot of things out into custom behavior we should try to avoid maintaining.

I tried to put it all together in the header.

RobH added a subtask: Unknown Object (Task). Dec 9 2015, 9:17 PM

Please note that the projects needed for this were documented by @chasemp on T103700. We're linking in our testing tasks to this task as well.

As all his proposed changes to my initial suggestions make sense (and are modified to reflect phabricator's workflow), we've gone with that.

RobH added a subtask: Unknown Object (Task). Dec 9 2015, 9:33 PM
RobH closed subtask Unknown Object (Task) as Resolved.

So we've tested task creation by email to maint-announce@phabricator.wikimedia.org. As that is working, I've modified the maint-announce alias to send into both RT and Phabricator for now, so I can compare the RT queue against the Phabricator project to ensure nothing is going missing and everything is working as intended.

@Cmjohnson is on clinic duty this week, but has been notified that he should cease triaging the maint-announcements in both phabricator and RT for now. I'll also be sending an email to operations list for other opsen to leave them alone so I can handle them during the migration.

After I see a number of notices work without issue, we can turn off the RT forward and continue archiving the RT service.

I have to put too much info regarding aliases for this to remain in public domain.

RobH shifted this object from the S1 Public space to the Restricted Space space. Dec 10 2015, 9:42 PM
RobH removed subscribers: scfc, StudiesWorld.

Moved into the maint-announce space, since then it can only be seen by ops. I pulled the non-WMF staff off the task subscription.

So our initial summary pointed out that we use maint-announce@wikimedia.org for all these notices, but it also wasn't specific enough. We also get notices from cyrusone to cyrusone_alerts@wikimedia.org. They do not allow any other alias but cyrusone_alerts@customer domain.

Right now the alias file is live to send all maint-announce@wikimedia.org to both maint-announce@rt.wikimedia.org and maint-announce@phabricator.wikimedia.org.

The RT emails arrive, but the phabricator emails go nowhere. (I also did not get a bounce from phabricator.)

So we need phabricator to accept the emails sent to the maint-announce@wikimedia.org email address. It also will need to accept any emails in our alias file forwarded to that, which for now only include cyrusone_alerts@wikimedia.org.

This was a test ticket created by a direct mail -> T120944

We gotta check the mail logs to see what's going on here with the alias/redirection.

I sent a mail from external to maint-announce@wikimedia.org while watching log files in all 3 places, mx1001, magnesium (RT) and iridium (Phab).

The result was this:

mx1001: sends the mail to both places, @rt and @phabricator:

2016-01-16 00:50:48 1aKF4p-0007Io-76 => maint-announce@phabricator.wikimedia.org <maint-announce@wikimedia.org> R=phabricator T=remote_smtp S=2711 H=iridium.eqiad.wmnet [10.64.32.150] C="250 OK id=1aKF4q-0007Lt-2t" DT=0s
2016-01-16 00:50:48 1aKF4p-0007Io-76 => maint-announce@rt.wikimedia.org <maint-announce@wikimedia.org> R=rt T=remote_smtp S=2711 H=magnesium.wikimedia.org [208.80.154.5] C="250 OK id=1aKF4q-0006Q7-2r" DT=0s

magnesium (RT) receives the mail and creates ticket: https://rt.wikimedia.org/Ticket/Display.html?id=9920

2016-01-16 00:50:48 1aKF4q-0006Q7-2r => maint-announce <maint-announce@rt.wikimedia.org> R=rt T=rt_pipe S=2946 DT=0s
2016-01-16 00:50:48 1aKF4q-0006Q7-2r Completed

iridium (Phab): receives mail sent to @phabricator:

2016-01-16 00:50:48 1aKF4q-0007Lt-2t => general <maint-announce@phabricator.wikimedia.org> R=phab T=phab_pipe S=2951 DT=0s

but does not create ticket.
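
For anyone comparing the three logs by hand, the delivery lines quoted above can be picked apart with a short script. This is a rough sketch that only assumes the `=>` line layout seen in these particular excerpts (delivered address, original address in angle brackets, then `R=` and `T=`); it is not a general exim log parser, and the names `DELIVERY` and `parse_delivery` are made up for illustration.

```python
import re
from typing import Dict, Optional

# Matches delivery ("=>") lines shaped like the excerpts quoted in this comment.
DELIVERY = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<msgid>\S+) => "
    r"(?P<delivered>\S+) <(?P<original>[^>]+)> "
    r"R=(?P<router>\S+) T=(?P<transport>\S+)"
)


def parse_delivery(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a delivery line, or None for any other log line."""
    match = DELIVERY.match(line)
    return match.groupdict() if match else None
```

Run against the iridium line, this reports router `phab` and transport `phab_pipe`, which shows the mail did reach the Phabricator pipe even though no ticket was created.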

error mail is sent back to me with this content:

Your email to Phabricator was not processed, because an error occurred while
trying to handle it:

Phabricator can not process this mail because no application knows how to
handle it. Check that the address you sent it to is correct.

(No concrete, enabled subclass of PhabricatorMailReceiver can accept this
mail.)

original message header attached to that mail shows:

from: Daniel Zahn <dzahn@wikimedia.org>
to: maint-announce@wikimedia.org

and that is _not_ @phabricator.wm.org so phab does not want to handle it.

TLDR: we need to setup a rewrite rule in exim that happens _after_ the routing (alias matching) that actually rewrites the to: field in the mail.

http://www.exim.org/exim-html-current/doc/html/spec_html/ch-address_rewriting.html

I tested the rewrite rule and i can confirm it works.

phabricator created this ticket when i mailed maint-announce@

https://phabricator.wikimedia.org/T127246

The thing now is that this means they only get to phab and not also to RT. The rewrite rule is happening before the aliases.

If we consider that an issue and really want it in both systems, we need to come up with something additional.

@Dzahn and I discussed this in IRC yesterday, I'm merely listing it on task for documentation purposes.

I thought we should have it in both systems, as we know rt maint-announce is working, and having it in both would allow us to audit and ensure phabricator is properly working. I'm not sure I trust phabricator to handle these without testing, as we've had odd issues in the past when setting up things like this (reference: procurement). I admit that I may be overly paranoid about missing maint-announcements.

If possible, the rewrite rule should rewrite into two addresses, one into phabricator, and one into rt.

I amended https://gerrit.wikimedia.org/r/#/c/268851 so it gets applied on iridium itself, added to the exim config specific to phab, rather than the mx servers.

This works around the issue of having to rewrite to multiple recipients.

So the setup is now: maint-announce@ is an alias on mx*, sending it to maint-announce@phab AND maint-announce@rt but only changing the envelope-to, and then when the copy arrives on iridium for phab, exim there rewrites the "to" to maint-announce@phabricator and sends it to phab. phab is then happy to process it and create a ticket, unlike before when the "to" header did not have the phabricator part.
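
The difference between the envelope recipient and the "to" header can be illustrated with Python's standard `email` library. This is only a model of the behaviour described above, not the exim configuration itself: `deliver_via_alias`, `rewrite_to_header` and `phab_accepts` are hypothetical stand-ins, and the acceptance check is an assumption based on the error mail Phabricator sent back.

```python
from email.message import EmailMessage
from typing import Tuple

ALIAS = "maint-announce@wikimedia.org"
PHAB = "maint-announce@phabricator.wikimedia.org"


def deliver_via_alias(msg: EmailMessage) -> Tuple[str, EmailMessage]:
    """Model of the alias on mx*: the envelope recipient changes, headers do not."""
    return PHAB, msg


def rewrite_to_header(msg: EmailMessage, old: str = ALIAS, new: str = PHAB) -> EmailMessage:
    """Model of the rewrite applied on iridium: replace the To: header itself."""
    if str(msg["To"]) == old:
        msg.replace_header("To", new)
    return msg


def phab_accepts(msg: EmailMessage) -> bool:
    """Assumed stand-in for Phabricator's check: it looks at the To: header."""
    return str(msg["To"]).endswith("@phabricator.wikimedia.org")
```

In this model a message addressed to the alias arrives with the right envelope recipient but is still rejected by `phab_accepts` until `rewrite_to_header` has been applied, which matches what we saw before and after the change.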

Tested. mail to maint-announce now creates tickets in phabricator and also in RT

17:08 < mutante> robh: ok, applied. wanna send one mail to maint-announce@ for me?
17:09 < robh> sent
17:10 < mutante> https://phabricator.wikimedia.org/T129721
17:10 < mutante> https://rt.wikimedia.org/Ticket/Display.html?id=10080

We should now see incoming tickets in the phab queue and also in RT just like before. After confirming all works fine we can turn off the RT part.

Dzahn removed a parent task: Unknown Object (Task). Mar 12 2016, 1:20 AM

Next steps:

  • ensure the vendors emailing maint-announce have their alert_addresses/domains whitelisted for incoming mail
  • compare maint-announce messages in RT and phab until we're confident that phabricator has received a message from at least all our major providers.

We also need to whitelist senders in the "direct_comments_allowed" section of the phabricator puppet role.

  • ensure the vendors emailing maint-announce have their alert_addresses/domains whitelisted for incoming mail

https://gerrit.wikimedia.org/r/#/c/276923

Dzahn mentioned this in Unknown Object (Task). Mar 12 2016, 1:40 AM
Dzahn removed Dzahn as the assignee of this task. Mar 15 2016, 8:19 PM

Apparently that is not how it works. Reverted the whitelist change. So, TLDR: by adding a rewrite rule in exim I fixed the part where phabricator accepts tickets in this project in general, and I could show that it works when we send mail from @wikimedia.org. But somehow we still have to tell phabricator to accept mail from a whitelist of domains.

@RobH does this ticket have to stay limited to NDAed people? I don't really see any content from tickets here, and I was talking to Dereckson about it and wanted to describe the issue, but he can't see the ticket.

It's not viewable due to it being in the S6 space, not due to NDA stuff. (The S6 space is operations team only.) This is due to the maint-announcement possibly containing when routes/links are down and could open up possible abuse.

As for this task itself, I think it could just be moved into the S1 public space, and reference the S6 space as the destination of said maint-announcements. Since I'm the one who moved it, I'm moving it back now.

RobH shifted this object from the Restricted Space space to the S1 Public space. Mar 16 2016, 11:37 PM

This is due to the maint-announcement possibly containing when routes/links are down and could open up possible abuse.

Isn't an ordinary security task sufficient for that sort of thing?

13:46 < mutante> currently the RT ticket just tells us to manually check it and put it on calendar
13:46 < mutante> maybe we can just send it to a list
13:46 < mutante> and do the same thing and be done ?

13:51 < paravoid> it's not like we do much with maint-announces
13:51 < paravoid> and if you don't track the foreign ticket id, every single mail will be a separate task
13:52 < paravoid> just target maint-announce into a mailing list I'd say

I'd tend to decline this, make a ticket to create a new list and send stuff there instead.

Then kill RT

The mailing list will need to have archives so we can compare and ensure items are triaged. As it is, items are often ignored during one week, and have to be followed up on the following week.

A mailing list seems non-ideal for this.

The mailing list will need to have archives so we can compare and ensure items are triaged. As it is, items are often ignored during one week, and have to be followed up on the following week.

A mailing list seems non-ideal for this.

Agreed — but Phabricator is also non-ideal and between the two (or the third option of maintaining RT in the long-run) I'd prefer a list. Unless anyone has anything better to suggest or the time to write a tool to parse these notices and create an iCal feed ;)

Agreed.

Perhaps as the clinic person adds to the calendar, they can then also reply to that thread on the list. (We would set the list to never email back the senders to said list, so replies would only stick around on our archives.) That would save each clinic person the trouble of comparing every single open mailing list thread to the calendar, they can just check if they have been replied to.

The mailing list will need to have archives so we can compare and ensure items are triaged. As it is, items are often ignored during one week, and have to be followed up on the following week.

Yes, that was my thought as well, on T132968 i put the question whether it should be public or private archives, but yea, in either case the on-duty person would just check the archive link instead of RT and then add to calendar like before.

It should be private archives, as we don't want the notices public. (At least, we have not in the past, as it points out where our infrastructure may be degraded and so exposed to attack.)

OTRS can be used almost like a mailing list, if all members of a queue set up the notifications for it. Then you have archives and triaging. I'm not saying it's a good solution (it depends whether you care about replying to messages by email and other things), just adding to options.

I'm declining this since moving into phabricator seems off the table. T132968 is for the mailing list. Thanks for mentioning the OTRS option but i think we are fine with mailman for now. We don't usually reply to maintenance mails.

mailing list removed again by request. i don't know if moving this to phabricator is a thing that we want to happen in the future or never.