Mapping of servers to stakeholders
Open, Medium, Public

Description

Problem: When RAID tasks are created, the datacenter is added automatically but it is unclear who the service owners are to notify them of needed maintenance.

It may be valuable to create and make available a tool or list that maps servers to service owners.

Possible solutions:

  • Use Netbox to assign "ownership" to servers leveraging either:
    • Custom fields
    • Tenants and Tenant Groups
  • Update the Infrastructure Naming Conventions page to include "ownership" of naming groups.
  • Update Puppet and add a field so that the RAID Degraded tasks can have the appropriate tags applied.
  • Other?

Event Timeline

colewhite triaged this task as Medium priority.Feb 13 2019, 11:28 PM
colewhite created this task.

I went ahead and tried the naming convention approach, adding a column to that table and adding my Phabricator username there where relevant. I actually added a link to the Phab profile; just the name is probably enough if we specify that those should be Phabricator account names.

I got pointed to this task from T217686, where I suggest another option.

Puppet also seems a good location, but dunno how the implementation would go. Maybe leverage git history/blame to backfill existing classes, and a Jenkins check to ensure future additions.

About the "Infrastructure Naming Conventions" table, I think it's a good temporary solution, but we should use team names (and Phab team tags) instead of individuals. Then the teams can re-assign the task to the proper person.

Puppet also seems a good location, but dunno how the implementation would go. Maybe leverage git history/blame to backfill existing classes, and a Jenkins check to ensure future additions.

One possibility would be some Puppet define that we add to the role definition, which creates a file somewhere on the system and which would also be available in Puppetdb, maybe.

A pretty accurate list of stakeholders for a given host can be gleaned from the users, groups, and sudoers config deployed to it.

As an alternative to keeping a separate list of stakeholders in sync manually, we could populate a "stakeholders" fact with the names of groups matching a given gid range known to contain groups of humans who use the system. Say gids 700-900, or 500-999 (based on data.yaml and login.defs respectively).

This would give us the ability to easily map a group of stakeholders to a list of individuals, as well as populate puppetdb with the stakeholder list. It would also provide some incentive to align group names with organizational role/group, and maintain a strict unix group numbering/naming scheme over time.

To give a few real-world examples:

bast1002:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
all-users
ops
elastic1017:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
ops
elasticsearch-roots
an-master1001:~$ for i in `seq 501 998`; do getent group $i; done | cut -d : -f 1
ops
analytics-privatedata-users
analytics-admins
analytics-search-users
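
A minimal sketch of how such a "stakeholders" list could be computed, written in Python rather than as an actual Puppet/Facter fact. The gid range, the function name and the `groups` parameter are assumptions made for illustration only; the real range would have to follow whatever numbering scheme is agreed on.

```python
import grp

# Assumed range of gids reserved for groups of human users. It mirrors the
# `seq 501 998` loop in the examples above and is only a placeholder.
HUMAN_GID_RANGE = range(501, 999)


def stakeholder_groups(groups=None, gid_range=HUMAN_GID_RANGE):
    """Return the sorted names of groups whose gid falls inside gid_range.

    `groups` is an iterable of (name, gid) pairs. When omitted, the local
    group database is read through the standard grp module.
    """
    if groups is None:
        groups = [(g.gr_name, g.gr_gid) for g in grp.getgrall()]
    return sorted({name for name, gid in groups if gid in gid_range})


if __name__ == "__main__":
    for name in stakeholder_groups():
        print(name)
```

Passing the group list in explicitly keeps the filtering logic testable without depending on the groups present on a given host.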

Thanks @colewhite for raising (and re-raising!) this issue. This is a tricky but important problem to solve for sure!

From the various conversations here and on-task it does not appear that we have consensus on:

  • what purpose are we using this information for?
  • where are we storing it and how are we maintaining it going forward?

Even though the scope of the task started out much smaller and was distinctly identified with a problem statement in its description (thanks for that!), there are a few overlapping problems that have been discussed in this context, I think:

  1. Who gets HW (e.g. RAID) alerts for a particular piece of hardware;
  2. Who does NetOps reach out to to coordinate networking changes such as e.g. a switch upgrade;
  3. Who does Infrastructure Foundations reach out to to coordinate e.g. a trusty->stretch distribution upgrade;
  4. Who does DC Ops reach out to to coordinate a hardware refresh;
  5. Especially in light of our new access policy model, which hosts a certain (self-managed) access group applies to;
  6. etc.

In addition to this, this is further complicated by:

  • The ambiguity of what a "service owner" means depending on the context. Given we have a tiered administration model, in most cases one of the SRE teams is the primary admin for a particular set of devices (and thus e.g. ensures the health of the underlying hardware), while working in close collaboration with another team that manages or co-manages the service itself. Our entire core platform (mw*) is under that mixed model.
  • Whether we're referring to or intend to document teams (e.g. Analytics), parts of a team (e.g. analytics-ops), leadership of a team (for access requests in particular), or specific individuals within a team (AndrewO/Luca).

Does this sound like an accurate description of the various angles here? I feel that we either need to grow the scope to cover all these, or explicitly say that we're only going to address problems X, Y and Z, and will not attempt to address A, B, C at this time. Does that make sense as an approach?

Does this sound like an accurate description of the various angles here?

Accurate enough to stall the task for 2 years :)

Following John's progress on https://gerrit.wikimedia.org/r/c/operations/puppet/+/695230/, here is a proposal that sounds maintainable and not too convoluted (I hope!).

  • Each Puppet role has a (sub-)team contact (I/F, Service Ops, WMCS, etc.), occasionally an individual
  • Each relevant Puppet profile (e.g. ignoring base profiles applied everywhere) has an individual contact, occasionally a team
  • That will be back-filled manually and progressively
  • Ideally checked by CI, to enforce that profiles/roles are assigned to active teams/users
  • Ideally with a way to link it to a Phabricator tag (for teams) or individual
    • For example using the data.yaml shell name, where the Phabricator name is also present as an optional field
  • Q.: Are there cases where a profile owner isn't in data.yaml? Or doesn't have a Phabricator account?
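
The CI check mentioned above could look roughly like the following sketch. The team names, the `ACTIVE_TEAMS` set and the shape of `role_contacts` (role name mapped to a list of contacts) are hypothetical; a real check would load both from wherever the contact data and the list of active teams end up living.

```python
# Hypothetical team names; a real check would load these from a maintained list.
ACTIVE_TEAMS = {"Infrastructure Foundations", "ServiceOps", "WMCS"}


def check_role_contacts(role_contacts, active_teams=ACTIVE_TEAMS):
    """Return a list of human-readable problems, empty when all roles pass."""
    problems = []
    for role, contacts in sorted(role_contacts.items()):
        if not contacts:
            problems.append(f"{role}: no contact defined")
            continue
        for contact in contacts:
            if contact not in active_teams:
                problems.append(f"{role}: unknown or inactive contact '{contact}'")
    return problems
```

A CI job would fail the build whenever the returned list is non-empty and print the problems for the patch author.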

That list is written to disk in YAML (similar to what Moritz mentioned in an earlier comment and John implemented), so it can be exploited in multiple ways:

Who gets HW (e.g. RAID) alerts for a particular piece of hardware;

A Phabricator task is automatically opened and the relevant users or teams are CCed

Who does NetOps reach out to to coordinate networking changes such as e.g. a switch upgrade;
Who does Infrastructure Foundations reach out to to coordinate e.g. a trusty->stretch distribution upgrade;

Look at the on-disk file, or use cumin to get the list of all role/profile owners. Potentially open a task with all the relevant owners using a cookbook

Who does DC Ops reach out to to coordinate a hardware refresh;

Role owners (teams)

Especially in light of our new access policy model, which hosts a certain (self-managed) access group applies to;

Not covered by this use case, probably need to look at LDAP or unix groups

This allows us to avoid falling into the "service owner" rabbit hole while still tackling the most common use cases (the ones you listed).
It also allows someone to know who to reach out to about a given role/server without having to rely on git blame.
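
As a rough illustration of the maintenance use case, the sketch below inverts a host-to-contacts mapping into a per-contact host list, which is the shape needed to open one task per team. The function name and the input format are assumptions; in practice the mapping would be collected from the on-disk YAML files or via cumin, which is not shown here.

```python
from collections import defaultdict


def hosts_by_contact(host_contacts):
    """Invert a host -> contacts mapping into contact -> sorted host list."""
    grouped = defaultdict(list)
    for host, contacts in host_contacts.items():
        for contact in contacts:
            grouped[contact].append(host)
    return {contact: sorted(hosts) for contact, hosts in sorted(grouped.items())}
```

A host with several contacts appears under each of them, so dual ownership results in both teams being notified.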

Change 695230 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] profile::contacts: add a profile and define for adding contact metadata

https://gerrit.wikimedia.org/r/695230

Change 695236 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] (Test): Example PR demonstrating the contacts profile

https://gerrit.wikimedia.org/r/695236

Change 695230 merged by Jbond:

[operations/puppet@production] profile::contacts: add a profile and define for adding contact metadata

https://gerrit.wikimedia.org/r/695230

Change 695236 merged by Jbond:

[operations/puppet@production] (Test): Example PR demonstrating the contacts profile

https://gerrit.wikimedia.org/r/695236

Change 699178 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] profile::contacts: add a profile and define for adding contact metadata

https://gerrit.wikimedia.org/r/699178

Change 699179 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] ssretest: Add contacts to sretest

https://gerrit.wikimedia.org/r/699179

Change 699178 merged by Jbond:

[operations/puppet@production] profile::contacts: add a profile and define for adding contact metadata

https://gerrit.wikimedia.org/r/699178

Change 699179 merged by Jbond:

[operations/puppet@production] ssretest: Add contacts to sretest

https://gerrit.wikimedia.org/r/699179

Q. : Are there cases where a profile owner isn't in data.yaml?

Yes. I've written and maintain a bunch of profiles for WMCS use cases but don't have production shell access.

Change 711400 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Propose a format for profile contact data

https://gerrit.wikimedia.org/r/711400

Change 711400 merged by Jbond:

[operations/puppet@production] Propose a format for profile contact data

https://gerrit.wikimedia.org/r/711400

Change 731934 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for IF services

https://gerrit.wikimedia.org/r/731934

Change 731934 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for IF services

https://gerrit.wikimedia.org/r/731934

Change 732257 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for Data Engineering services

https://gerrit.wikimedia.org/r/732257

Change 732268 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add remaining ownership annotations for ML services

https://gerrit.wikimedia.org/r/732268

Change 732307 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for WMCS services

https://gerrit.wikimedia.org/r/732307

Change 732268 merged by Muehlenhoff:

[operations/puppet@production] Add remaining ownership annotations for ML services

https://gerrit.wikimedia.org/r/732268

Change 732257 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for Data Engineering services

https://gerrit.wikimedia.org/r/732257

Change 732307 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for WMCS services

https://gerrit.wikimedia.org/r/732307

Change 738262 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for additional Traffic services

https://gerrit.wikimedia.org/r/738262

Change 738265 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for additional Data Persistence services

https://gerrit.wikimedia.org/r/738265

Change 738416 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for more o11y services

https://gerrit.wikimedia.org/r/738416

Change 738426 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for more Service SRE services

https://gerrit.wikimedia.org/r/738426

Change 738432 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ownership annotations for more Search Platform services

https://gerrit.wikimedia.org/r/738432

Change 738416 merged by Cwhite:

[operations/puppet@production] Add ownership annotations for more o11y services

https://gerrit.wikimedia.org/r/738416

Change 738432 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for more Search Platform services

https://gerrit.wikimedia.org/r/738432

Change 738265 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for additional Data Persistence services

https://gerrit.wikimedia.org/r/738265

Change 738262 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for additional Traffic services

https://gerrit.wikimedia.org/r/738262

Change 738426 merged by Muehlenhoff:

[operations/puppet@production] Add ownership annotations for more Service SRE services

https://gerrit.wikimedia.org/r/738426

Some notes/thoughts from a chat with @jbond:

  • Based on P36282, and except for Data Engineering and Machine Learning, all servers have one clear team owner
  • role_contacts has been extremely useful in scheduling maintenance (upgrades, network, etc.)
  • In light of https://gerrit.wikimedia.org/r/c/operations/puppet/+/849020, there is a need to assign a team owner during or before provisioning

So if that dual team ownership can be solved, I'm thinking that it might be time to move the data to Netbox for better tracking and reporting.

  • Earlier documentation of the owner team
  • Advanced search/export of teams impacted by a maintenance
  • Data easier to manipulate (not linked to Puppet anymore)

A suggestion is to:

  1. Add a custom field for servers and VMs (exact name TBD, e.g. owner_team or requestor_team), as a drop-down with a limited set of choices
  2. Update the procurement request form or racking task to specify the team; this will be used to hand the server over to the requesting team once ready. Alternatively, the team could be looked up from the individual making the request.
  3. Do a one off migration from role_contacts to this new custom field
  4. Add alerting/reports (eg. empty field, inconsistency between server names,
  5. We could also pre-populate (or suggest) the relevant team from a maintained list (eg. migrate this list to Puppet or Netbox reports https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers ) or from the existing servers with the same names in Netbox
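
To make steps 3 and 4 more concrete, here is a sketch of the one-off migration planning, using plain dictionaries instead of the Netbox API. The field name `owner_team`, the allowed team list and the input shapes are all assumptions for illustration; nothing here reflects an agreed design.

```python
# Assumed drop-down choices for the hypothetical `owner_team` custom field.
ALLOWED_TEAMS = {"Data Persistence", "Traffic", "Observability"}


def plan_migration(servers, role_contacts, allowed_teams=ALLOWED_TEAMS):
    """Split servers into proposed owner assignments and ones needing review.

    servers: mapping of server name -> Puppet role
    role_contacts: mapping of Puppet role -> list of team contacts
    """
    assignments, needs_review = {}, []
    for name, role in sorted(servers.items()):
        teams = [t for t in role_contacts.get(role, []) if t in allowed_teams]
        if len(teams) == 1:
            assignments[name] = teams[0]
        else:
            # Zero owners (the field would stay empty) or several (dual ownership).
            needs_review.append(name)
    return assignments, needs_review
```

The `needs_review` list doubles as the input for the "empty field" report: anything that cannot be mapped to exactly one team is surfaced for a human decision instead of being guessed.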

The downside is that it adds a manual (or semi-manual) step (defining the owner team), whereas ownership is currently derived automatically from the roles, but so does the future insertup-$team.