Mapping of servers to stakeholders
Open, NormalPublic

Description

Problem: When RAID tasks are created, the datacenter is added automatically but it is unclear who the service owners are to notify them of needed maintenance.

It may be valuable to create and make available a server-to-service-owner mapping tool or list.

Possible solutions:

  • Use Netbox to assign "ownership" to servers leveraging either:
    • Custom fields
    • Tenants and Tenant Groups
  • Update the Infrastructure Naming Conventions page to include "ownership" of naming groups.
  • Update Puppet and add a field so that the RAID Degraded tasks can have the appropriate tags applied.
  • Other?
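
As a rough illustration of the naming-convention option, the sketch below derives a naming group from a hostname and looks up an owner tag that a RAID task could carry. The table contents, the team names, and the function names are all hypothetical; the real mapping would live wherever the chosen solution stores it (Netbox, the naming conventions page, or Puppet).

```python
import re
from typing import Dict, Optional

# Hypothetical ownership table keyed by naming-convention prefix.
# The team names are placeholders, not real Phabricator tags.
OWNERS_BY_PREFIX: Dict[str, str] = {
    "elastic": "search-team",
    "an-master": "analytics",
    "bast": "sre",
}


def hostname_prefix(hostname: str) -> str:
    """Strip the domain and the trailing digits from a hostname."""
    short_name = hostname.split(".", 1)[0]
    return re.sub(r"\d+$", "", short_name)


def owner_tag(hostname: str,
              table: Dict[str, str] = OWNERS_BY_PREFIX) -> Optional[str]:
    """Return the owner tag for a host, or None if its prefix is unmapped."""
    return table.get(hostname_prefix(hostname))
```

Returning None for an unmapped prefix keeps missing ownership visible, so such tasks could fall back to the current behaviour of tagging only the datacenter.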

Event Timeline

colewhite triaged this task as Normal priority. Feb 13 2019, 11:28 PM
colewhite created this task.
Restricted Application added a subscriber: Aklapper. Feb 13 2019, 11:28 PM
Volans added a subscriber: Volans. Feb 14 2019, 3:32 PM

I went ahead and tried the naming convention approach, adding a column to that table and adding my Phabricator username there where relevant. I've actually added a link to the Phab profile; probably just the name is enough if we specify that those should be Phabricator account names.

jbond added a subscriber: jbond. Mar 6 2019, 5:05 PM
herron added a subscriber: herron. Mar 6 2019, 5:06 PM
ayounsi added a subscriber: ayounsi. Mar 6 2019, 5:29 PM

I got pointed to this task from T217686, where I suggest another option.

Puppet also seems a good location, but dunno how the implementation would go. Maybe leverage git history/blame to backfill existing classes, and a Jenkins check to ensure future additions.

About the "Infrastructure Naming Conventions" table, I think it's a good temporary solution, but we should use team names (and Phab team tags) instead of individuals. Then the teams can re-assign the task to the proper person.

One possibility would be some Puppet define that we add to the role definition, which creates a file somewhere on the system and which would also be available in Puppetdb, maybe.

faidon updated the task description.
faidon added subscribers: RobH, crusnov.
herron added a comment. Mar 7 2019, 5:42 PM

A pretty accurate list of stakeholders for a given host can be gleaned from the users, groups, and sudoers config deployed to it.

As an alternative to keeping a separate list of stakeholders in sync manually, we could populate a "stakeholders" fact with the names of groups matching a given gid range known to contain groups of humans who use the system. Say gids 700-900, or 500-999 (based on data.yaml and login.defs respectively).

This would give us the ability to easily map a group of stakeholders to a list of individuals, as well as populate puppetdb with the stakeholder list. It would also provide some incentive to align group names with organizational role/group, and maintain a strict unix group numbering/naming scheme over time.

To give a few real-world examples

bast1002:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
all-users
ops
elastic1017:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
ops
elasticsearch-roots
an-master1001:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
ops
analytics-privatedata-users
analytics-admins
analytics-search-users
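
A minimal sketch of how such a "stakeholders" fact could be computed, assuming the 500-999 range mentioned above and assuming it is shipped as a Facter external fact (an executable that prints key=value lines). The function names and the comma-separated output format are illustrative choices, not an agreed design.

```python
import grp
from typing import Iterable, List, Tuple

# Assumed bounds; 700-900 is the other candidate range discussed above.
HUMAN_GID_MIN = 500
HUMAN_GID_MAX = 999


def stakeholder_groups(entries: Iterable[Tuple[str, int]],
                       gid_min: int = HUMAN_GID_MIN,
                       gid_max: int = HUMAN_GID_MAX) -> List[str]:
    """Return the sorted names of groups whose gid is inside the range."""
    return sorted(name for name, gid in entries if gid_min <= gid <= gid_max)


def local_group_entries() -> List[Tuple[str, int]]:
    """Read (name, gid) pairs from the local group database."""
    return [(group.gr_name, group.gr_gid) for group in grp.getgrall()]


if __name__ == "__main__":
    # Print one key=value line, the format external facts are read in.
    names = stakeholder_groups(local_group_entries())
    print("stakeholders=" + ",".join(names))
```

Keeping the range filter separate from the group database lookup means the selection logic can be checked against a fixed list of groups without needing a real host.
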
faidon added a subscriber: faidon. Apr 18 2019, 11:11 AM

Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure!

From the various conversations here and on-task it does not appear that we have consensus on:

  • what purpose are we using this information for?
  • where are we storing it and how are we maintaining it going forward?

Even though the scope of the task started out much smaller and was clearly identified by the problem statement in its description (thanks for that!), there are a few overlapping problems that have been discussed in this context, I think:

  1. Who gets HW (e.g. RAID) alerts for a particular piece of hardware;
  2. Who does NetOps reach out to to coordinate networking changes such as e.g. a switch upgrade;
  3. Who does Infrastructure Foundations reach out to to coordinate e.g. a trusty->stretch distribution upgrade;
  4. Who does DC Ops reach out to to coordinate a hardware refresh;
  5. Especially in light of our new access policy model, which hosts a certain (self-managed) access group applies to;
  6. etc.

In addition to this, this is further complicated by:

  • The ambiguity of what a "service owner" means depending on the context. Given we have a tiered administration model, in most cases one of the SRE teams is the primary admin for a particular set of devices (and is thus responsible for e.g. the health of the underlying hardware), while working in close collaboration with another team that manages or co-manages the service itself. Our entire core platform (mw*) is under that mixed model.
  • Whether we're referring to or intend to document teams (e.g. Analytics), parts of a team (e.g. analytics-ops), leadership of a team (for access requests in particular), or specific individuals within a team (AndrewO/Luca).

Does this sound like an accurate description of the various angles here? I feel that we either need to grow the scope to cover all of these, or explicitly say that we're only going to address problems X, Y and Z, and will not attempt to address A, B, C at this time. Does that make sense as an approach?

bd808 awarded a token. Jun 5 2019, 7:14 PM