
Request access to servers Dcops group
Closed, Resolved, Public

Description

Opening this task to address access requests so that the Dcops group can perform its day-to-day duties.

We would like to be able to ssh into each server to assist in troubleshooting, with access to the following tools:

dmidecode
nvme
smartctl
edac-util
mdadm
dmesg
ledctl

context:
IF has historically been working on reducing the number of people with global root-level access; see [T244840] and [T289779]. Additional considerations for added security controls for SRE edge cases are in [T299989].

Event Timeline

Jclark-ctr triaged this task as Medium priority. Mar 18 2024, 5:37 PM
jcrespo subscribed.

I am going to remove the SRE-Access-Requests because, while it is indeed an access request, it is not immediately actionable by people on clinic duty, but has to be discussed with the owners of the workflow (IF) + the rest of the SREs first on how exactly to provide it.

@Jclark-ctr is the main purpose of this to gather debug information on the host?
If that's the case, the simplest solution is to write a cookbook that gathers all that info for you and either prints it on screen or adds it to a task (sanitized of any sensitive information).
If this might work for you, let me know which command-line options we should run each tool with.
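
As a rough illustration of the idea above, the sketch below runs a list of diagnostic commands and prints their output as one report with a header per tool, so the result can be searched in a single place. It is plain shell, not an actual cookbook; the function name `gather_debug` and the header format are assumptions, and the options for each tool are still to be agreed in this task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each command given as an argument and print its
# output under a "===== <command> =====" header. Each command is passed as a
# single string and split on whitespace, so keep the commands simple.
gather_debug() {
    local cmd
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd"
        # Word-splitting of $cmd is intentional. A failing or missing tool is
        # reported, but the remaining commands still run.
        $cmd 2>&1 || printf '(command failed: %s)\n' "$cmd"
    done
}
```

Something like `gather_debug 'dmesg' 'uname -a' > report.txt` followed by `grep -n -i error report.txt` would then search all sections at once; tools that need elevated rights would still need whatever privileges they normally require.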

@Volans The main purpose is gathering debug information. I would prefer to grep dmesg/log files instead of searching through the entire output. Mdadm commands would allow us to one day rebuild failed software raids.

Mdadm commands would allow us to one day rebuild failed software raids

That should be covered by T364540 no?

But isn't it simpler to just grep the output of a single cookbook, as opposed to grepping the output of multiple tools?

@Volans I also see this as a learning opportunity; most of these are just logs. Some dcops members are very light on Linux, and by expanding our knowledge we could become more valuable members of the team. I do love cookbooks, but sometimes they fail, and it would be nice if we could continue to teach and train coworkers.

Sorry for the late reply, I had a chat with Willy and with the I/F team, this is our proposal:

  • We create a new POSIX group for dcops that gets deployed to all production nodes, granting ssh access. A minimal set of SUDO policies will be added to be able to read/inspect logs (like syslog, journalctl, dmesg, etc..).
  • Any write action, like rebuilding a RAID array etc.. will require an interaction between DCops and an SRE that owns the service. For example, let's say that DCops need to rebuild a raid array on db1234 after swapping a disk: they will ping any SRE in Data Persistence with a proposed plan of action, and both will sync about when/how to do it. The actual command will be run by the SRE owning the service, but the whole plan will be from DCops.
    • This workflow seems convoluted at first, but it will give us multiple benefits: DCops will start to experiment with and propose commands (while socializing the fix with a Service Owner SRE), and we'll also be able to avoid mistakes like operating on the wrong node. It may seem an unnecessary fence, but having root-like commands available is really dangerous, especially on systems that we don't touch every day. For example, I am terrified when I log on to a dbXXXX node since it is not my realm, and any wrong command can easily bring down production (say, if I execute the wrong action on a DB master node). I usually ask the SRE owning the DBs if the command is ok before proceeding, so it is not that different from the proposed plan in my opinion.
  • While we try this new way of doing things, I/F can prioritize cookbooks and new tools like sudo_pair (T299989). The latter would be one step closer to be able to execute the real command, since DCops would effectively log in and run the command on the target node, but a Service Owner SRE would need to give the green light (multiple eyeballs are better and more effective).

We can give it a try and see how it goes after 2/3 months, and change it accordingly as we go.

How does it sound @wiki_willy ?

Thanks so much @elukey for putting this proposal together, and for the chat during office hours today. I like the entire idea, and will run it by the rest of the team during our staff meeting next week. For the first bullet, around ssh access to all production nodes with a minimal list of read-only sudo commands, I think we can just go ahead and proceed with this part. It'll be really beneficial in helping the Dc-Ops engineers troubleshoot/diagnose issues. My only ask here is to see if it's possible to expand the list of read-only commands to include the following: dmesg, dmidecode, smartctl, nvme, edac-util, mdadm, ledctl, free, uptime, df, top, uname, ipmi-sensors, dhcp, ping, ifconfig. And if we're able to implement this part within a couple of weeks, that'll be terrific.

For the second and third bullets, I like the idea of pairing SREs with Dc-Ops engineers - more thorough training, extended collaboration, growth, etc. - along with still being able to run a cookbook, when there's not enough time to sync up together via the workflow of pairing up with a SRE. Let me run this portion by the team to see if there's anything that they would want to tweak to the workflow, and will get back to you soon. Much appreciated again for drafting this up and for all your help!

Thanks,
Willy

ifconfig

I think it's better we include "ip" and "bridge" rather than ifconfig, which is extremely outdated at this point.

If we're adding network tools, "traceroute" might also be worth including. It's not something that can modify state, so it's safe, but it needs sudo to operate in ICMP mode, which is generally needed for it to show all hops in our infra.
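
For reference, ICMP mode is selected with the `-I` option of the standard Linux traceroute. The sketch below only prints the command by default so it can be reviewed before anything runs; the wrapper name `icmp_trace`, the `DRY_RUN` variable, and the hostname in the usage note are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around traceroute's ICMP mode (-I), run via sudo.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
icmp_trace() {
    local target="${1:-}"
    if [ -z "$target" ]; then
        echo "usage: icmp_trace <host>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sudo traceroute -I $target"
    else
        sudo traceroute -I "$target"
    fi
}
```

Calling `icmp_trace example.org` prints the command; `DRY_RUN=0 icmp_trace example.org` would actually run it, assuming the sudo policy allows it.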

"lldpctl" might also be useful for finding issues with cabling mis-matches etc.

Thanks for the input @cmooney. All your suggestions sound good to me, so feel free to swap out ifconfig with ip, bridge, traceroute, and lldpctl. Thanks!

Please add 'lshw'; I use it constantly (since I have root) to determine the serials of any items installed in the host and to track hw failures.

Change #1054894 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] admin: add dcops to the system adm POSIX group

https://gerrit.wikimedia.org/r/1054894

Filed a proposal in https://gerrit.wikimedia.org/r/1054894

@wiki_willy I reviewed the list of commands: most of them were already available with no privileges, and I included the others as sudo capabilities. I'd avoid some commands like smartctl or mdadm for the moment since they may trigger side effects, and I'd include those (for the moment) in the list of commands to be executed paired with a service owner SRE. Lemme know :)

Thanks @elukey, that sounds good!

Update: we are still discussing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054894; a larger refactor was proposed and we are working on it. This is why we haven't merged yet, but it should happen soon!

Change #1057814 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ldap: fix add-ldap-group script

https://gerrit.wikimedia.org/r/1057814

Change #1057814 merged by Elukey:

[operations/puppet@production] ldap: fix add-ldap-group script

https://gerrit.wikimedia.org/r/1057814

elukey@ldap-maint1001:~$ sudo add-ldap-group --gid 724 ops-limited
successfully created group ops-limited, with gidNumber 724 and 0 members

elukey@ldap-maint1001:~$ sudo add-ldap-group --members tappof wpao jclark pt1979 robh jhancock vriley --gid 724 --ignore-existing ops-limited
successfully created group ops-limited, with gidNumber 724 and 7 members

Change #1058081 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:openldap::management: Add ops-limited to cross validation.

https://gerrit.wikimedia.org/r/1058081

Change #1054894 merged by Elukey:

[operations/puppet@production] admin: add dcops to the system adm POSIX group

https://gerrit.wikimedia.org/r/1054894

Finally the change is being rolled out!

elukey@an-worker1080:~$ id wpao
uid=21258(wpao) gid=500(wikidev) groups=500(wikidev),4(adm),724(ops-limited)

The above is @wiki_willy's account on a random data engineering node (note the ops-limited group, the new one that allows this).
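
A quick way to spot-check the rollout on other hosts could look like the sketch below, which reads the output of `id <user>` and tests whether a given group is listed. The function name `id_output_has_group` is an assumption, and the parsing relies on the usual Linux `id` output format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical check: read `id <user>` output on stdin and succeed if the
# group named in $1 appears in the "groups=" list.
id_output_has_group() {
    # Keep only the groups list, strip the numeric ids and parentheses,
    # put one group name per line, then match the name exactly.
    sed -e 's/.*groups=//' -e 's/[0-9]*(\([^)]*\))/\1/g' \
        | tr ',' '\n' \
        | grep -Fqx -- "$1"
}
```

For example, `id wpao | id_output_has_group ops-limited && echo ok` would print `ok` on a host where the group has been deployed.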

Next steps:

  • Remove sre-admins and update docs
  • Announce the new group and ideas behind it.

Change #1058101 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] admin: deprecate sre-admins

https://gerrit.wikimedia.org/r/1058101

Change #1058101 merged by Elukey:

[operations/puppet@production] admin: deprecate sre-admins

https://gerrit.wikimedia.org/r/1058101

Mentioned in SAL (#wikimedia-operations) [2024-07-30T13:30:42Z] <elukey> deprecate the sre-admins posix group fleetwide (replaced by ops-limited) - T360356

Everything seems done, I'll leave the task open to wait for questions/feedback/etc..

Change #1058081 merged by Slyngshede:

[operations/puppet@production] P:openldap::management: Add ops-limited to cross validation.

https://gerrit.wikimedia.org/r/1058081

elukey claimed this task.