
role_contacts (service owners) as a custom puppet fact / cumin aliases for owners
Closed, ResolvedPublic

Description

In the context of tickets that require coordination between multiple SRE subteams, such as reboots, this is an idea to improve effectiveness.

We already have "role_contacts" annotations in puppet/Hiera that indicate service ownership.

These come from T216088 (T217686).

examples:

hieradata/role/common/dbbackups/content.yaml:profile::contacts::role_contacts: ['Data Persistence SREs']

hieradata/role/common/rpkivalidator.yaml:profile::contacts::role_contacts: ['Infrastructure Foundations SREs']

Now if we could create a custom puppet fact containing this same information, then we could do things like:

  • use cumin to ask "what is the kernel version of all machines owned by $subteam" or "which hosts owned by $subteam are still on buster"

all in a single command.

This would be super convenient and would immediately answer "which of our hosts are left to do" among a long list of hostnames that require reboots, for example.

It would also mean we don't have to create additional spreadsheets for that kind of thing, as we do now.
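For reference, one way a custom fact could work: Facter treats executable scripts in /etc/facter/facts.d/ that print "key=value" lines as external facts. A minimal sketch, assuming profile::contacts were to write the role_contacts value to a local file first (the file path and function name below are hypothetical, not existing infrastructure):

```shell
#!/bin/sh
# Hypothetical external fact: Facter runs executable scripts in
# /etc/facter/facts.d/ and turns each "key=value" output line into a fact.
# The contacts file path is an assumption for illustration only.
emit_role_contacts_fact() {
    contacts_file="${1:-/etc/wikimedia/role_contacts}"
    if [ -r "$contacts_file" ]; then
        # Flatten a multi-line team list into one comma-separated fact value.
        printf 'role_contacts=%s\n' "$(paste -sd, "$contacts_file")"
    fi
}
emit_role_contacts_fact "$@"
```

A fact like this would then be queryable via Cumin's fact backend (F:role_contacts), though as the discussion below shows, a PuppetDB resource query already covers the use case without a new fact.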

Event Timeline

Dzahn triaged this task as Medium priority. Apr 25 2022, 11:02 PM

use cumin to ask "what is the kernel version of all machines owned by $subteam" or "which hosts owned by $subteam are still on buster"

As we pass this value as a parameter to profile::contacts, we can already use Cumin to perform these searches, e.g.:

$ sudo cumin 'P{P:contacts%role_contacts ~ "Data Persistence SREs"}' 'uname -r '          
337 hosts will be targeted:
backup[2001-2007].codfw.wmnet,backup[1002-1007].eqiad.wmnet,db[2071-2152].codfw.wmnet,db[1096,1098-1107,1109-1184].eqiad.wmnet,dborch1001.wikimedia.org,dbprov[2001-2003].codfw.wmnet,dbprov[1001-1003].eqiad.wmnet,dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1021].eqiad.wmnet,es[2020-2034].codfw.wmnet,es[1020-1034].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-backup[1001-1002].eqiad.wmnet,ms-be[2028-2069].codfw.wmnet,ms-be[1028-1033,1035-1071].eqiad.wmnet,ms-fe[2009-2012].codfw.wmnet,ms-fe[1009-1012].eqiad.wmnet,pc[2011-2014].codfw.wmnet,pc[1011-1014].eqiad.wmnet
Ok to proceed on 337 hosts? Enter the number of affected hosts to confirm or "q" to quit 337
===== NODE GROUP =====                                                                                              
(2) dbproxy[1018-1019].eqiad.wmnet                                                                                  
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-10-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(1) db1124.eqiad.wmnet                                                                                              
----- OUTPUT of 'uname -r ' -----                                                                                   
5.10.0-9-amd64                                                                                                      
===== NODE GROUP =====                                                                                              
(5) db[2112,2130].codfw.wmnet,db[1118,1163].eqiad.wmnet,dborch1001.wikimedia.org                                    
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-18-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(8) db[2071-2072,2092].codfw.wmnet,db[1106,1119,1135,1173,1184].eqiad.wmnet                                         
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-16-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(2) db2103.codfw.wmnet,db1103.eqiad.wmnet                                                                           
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-17-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(5) db[2116,2145-2146].codfw.wmnet,db[1134,1164].eqiad.wmnet                                                        
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-14-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(26) db2110.codfw.wmnet,db[1099-1100,1112,1121,1123,1125,1132,1141,1161,1166,1175,1179].eqiad.wmnet,dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1017,1020-1021].eqiad.wmnet,pc1014.eqiad.wmnet                                          
----- OUTPUT of 'uname -r ' -----                                                                                   
5.10.0-12-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(7) db[2078,2087,2089].codfw.wmnet,db[1098,1128].eqiad.wmnet,pc[1011-1012].eqiad.wmnet                              
----- OUTPUT of 'uname -r ' -----                                                                                   
5.10.0-10-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(5) ms-be2067.codfw.wmnet,ms-be[1068-1071].eqiad.wmnet                                                              
----- OUTPUT of 'uname -r ' -----                                                                                   
4.9.0-18-amd64                                                                                                      
===== NODE GROUP =====                                                                                              
(10) ms-be[2066,2068-2069].codfw.wmnet,ms-fe[2009-2012].codfw.wmnet,ms-fe[1009-1011].eqiad.wmnet                    
----- OUTPUT of 'uname -r ' -----                                                                                   
4.9.0-17-amd64                                                                                                      
===== NODE GROUP =====                                                                                              
(13) ms-be[2045,2058-2059,2062-2065].codfw.wmnet,ms-be[1028,1059,1064-1067].eqiad.wmnet                             
----- OUTPUT of 'uname -r ' -----                                                                                   
4.9.0-16-amd64                                                                                                      
===== NODE GROUP =====                                                                                              
(64) ms-be[2028-2044,2046-2057,2060-2061].codfw.wmnet,ms-be[1029-1033,1035-1058,1060-1063].eqiad.wmnet              
----- OUTPUT of 'uname -r ' -----                                                                                   
4.9.0-15-amd64                                                                                                      
===== NODE GROUP =====                                                                                              
(8) backup[2001-2003].codfw.wmnet,backup[1002-1003].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-backup[1001-1002].eqiad.wmnet                                                                                                              
----- OUTPUT of 'uname -r ' -----                                                                                   
4.19.0-20-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(159) backup[2004-2007].codfw.wmnet,backup[1004-1007].eqiad.wmnet,db[2076,2080,2086,2090,2093-2102,2104-2109,2111,2113-2115,2117-2129,2131-2144,2147-2152].codfw.wmnet,db[1102,1104,1109-1111,1113-1117,1120,1122,1126-1127,1129-1131,1133,1136-1140,1142-1158,1160,1162,1165,1167-1169,1171-1172,1174,1176-1178,1180-1182].eqiad.wmnet,dbprov[2001-2003].codfw.wmnet,dbprov[1001-1003].eqiad.wmnet,es[2020-2034].codfw.wmnet,es[1021-1022,1024-1034].eqiad.wmnet,ms-fe1012.eqiad.wmnet,pc[2011-2014].codfw.wmnet
----- OUTPUT of 'uname -r ' -----                                                                                   
5.10.0-13-amd64                                                                                                     
===== NODE GROUP =====                                                                                              
(22) db[2073-2075,2077,2079,2081-2085,2088,2091].codfw.wmnet,db[1096,1101,1105,1107,1159,1170,1183].eqiad.wmnet,es[1020,1023].eqiad.wmnet,pc1013.eqiad.wmnet                                                                            
----- OUTPUT of 'uname -r ' -----                                                                                   
5.10.0-11-amd64                                                                                                     
================                                                                                                    
PASS |██████████████████████████████████████████████████████████████████| 100% (337/337) [00:03<00:00, 89.68hosts/s]
FAIL |                                                                            |   0% (0/337) [00:03<?, ?hosts/s]
100.0% (337/337) success ratio (>= 100.0% threshold) for command: 'uname -r '.
100.0% (337/337) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
$ sudo cumin 'P{P:contacts%role_contacts ~ "Data Persistence SREs"} and P{F:lsbdistcodename = 'buster'}'
30 hosts will be targeted:
backup[2001-2003].codfw.wmnet,backup[1002-1003].eqiad.wmnet,db[2071-2072,2092,2103,2112,2116,2130,2145-2146].codfw.wmnet,db[1103,1106,1118-1119,1134-1135,1163-1164,1173,1184].eqiad.wmnet,dborch1001.wikimedia.org,dbproxy[1018-1019].eqiad.wmnet,ms-backup2001.codfw.wmnet,ms-backup[1001-1002].eqiad.wmnet
DRY-RUN mode enabled, aborting

And if that syntax is too cumbersome for day-to-day use, we could add a few Cumin aliases, like A:hosts-data-persistence and A:hosts-infrastructure-foundations or similar?
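For illustration, such aliases would just be named saved queries in Cumin's aliases file, roughly like this (alias names taken from the suggestion above; the exact file location and final names are assumptions):

```yaml
# Hypothetical cumin alias entries (illustrative only):
hosts-data-persistence: 'P{P:contacts%role_contacts ~ "Data Persistence SREs"}'
hosts-infrastructure-foundations: 'P{P:contacts%role_contacts ~ "Infrastructure Foundations SREs"}'
```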

use cumin to ask "what is the kernel version of all machines owned by $subteam" or "which hosts owned by $subteam are still on buster"

As we pass this value as a parameter to profile::contacts, we can already use Cumin to perform these searches, e.g.:

Oh! So it "already exists and works" :) Very nice. Well, thank you. Very useful.

And if that syntax is too cumbersome for day-to-day use, we could add a few Cumin aliases, like A:hosts-data-persistence and A:hosts-infrastructure-foundations or similar?

Good idea. Yea, let me upload a change for our team, actually.

One other idea that came up in this context, @MoritzMuehlenhoff: if we have to create tickets similar to T304938 in the future, we could use this to generate a list of server checkboxes already separated into subsections per team/role_contacts. Subteams want their own section to see at a glance what is left to do, and otherwise tend to create their own subtasks, sheets, or Etherpads just for that.
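As a sketch of that idea: the per-team host lists could be turned into checkbox markup with a small script. Everything here is hypothetical (function name, exact remarkup syntax); in practice the host list would come from a Cumin dry run (running a query with no command prints the matching hosts, as seen above):

```shell
#!/bin/sh
# Hypothetical helper: turn a newline-separated host list into a remarkup
# section with one checkbox per host. The input list could come from e.g.
# running:  sudo cumin 'A:owner-data-persistence'  with no command.
format_team_section() {
    team="$1"
    printf '== %s ==\n' "$team"
    while IFS= read -r host; do
        [ -n "$host" ] && printf '[ ] %s\n' "$host"
    done
}
# Example with a static host list:
printf 'db1096.eqiad.wmnet\ndb1098.eqiad.wmnet\n' | format_team_section 'Data Persistence'
```

Running this once per owner alias would produce one ready-made ticket subsection per subteam.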

Change 786430 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cumin: add "owner" aliases to get lists of hosts per SRE subteam

https://gerrit.wikimedia.org/r/786430

Change 786848 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update role contacts to reflect team name change

https://gerrit.wikimedia.org/r/786848

Change 786848 merged by Btullis:

[operations/puppet@production] Update role contacts to reflect team name change

https://gerrit.wikimedia.org/r/786848

Change 787436 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:cumin::master: Add documentation and fix minor lint issue

https://gerrit.wikimedia.org/r/787436

Change 787436 merged by Jbond:

[operations/puppet@production] P:cumin::master: Add documentation and fix minor lint issue

https://gerrit.wikimedia.org/r/787436

Change 786430 abandoned by Dzahn:

[operations/puppet@production] cumin: add "owner" aliases to get lists of hosts per SRE subteam

Reason:

replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/787440

https://gerrit.wikimedia.org/r/786430

Dzahn claimed this task.

This was resolved by John's final change https://gerrit.wikimedia.org/r/c/operations/puppet/+/787440 which I just deployed.

Works for me on cumin2002 now like this:

[cumin2002:~] $ sudo cumin 'A:owner-core-platform' 'uname -r ' 
66 hosts will be targeted:


[cumin2002:~] $ sudo cumin 'A:owner-serviceops' 'uname -r ' 
506 hosts will be targeted:

The full set of owner aliases added by the change:
+owner-traffic: P{P:contacts%role_contacts ~ "Traffic"}
+owner-observability: P{P:contacts%role_contacts ~ "Observability"}
+owner-data-engineering: P{P:contacts%role_contacts ~ "Data Engineering"}
+owner-infrastructure-foundations: P{P:contacts%role_contacts ~ "Infrastructure Foundations"}
+owner-data-persistence: P{P:contacts%role_contacts ~ "Data Persistence"}
+owner-serviceops: P{P:contacts%role_contacts ~ "ServiceOps"}
+owner-core-platform: P{P:contacts%role_contacts ~ "Core Platform"}
+owner-search-platform: P{P:contacts%role_contacts ~ "Search Platform"}
+owner-machine-learning: P{P:contacts%role_contacts ~ "Machine Learning"}
+owner-serviceops-sres: P{P:contacts%role_contacts ~ "ServiceOps SREs"}
+owner-wmcs: P{P:contacts%role_contacts ~ "WMCS"}
+owner-wmcs-sres: P{P:contacts%role_contacts ~ "WMCS SREs"}

Thanks, I will call this resolved. Not a custom fact, but same intention / result.

Dzahn renamed this task from role_contacts (service owners) as a custom puppet fact to role_contacts (service owners) as a custom puppet fact / cumin aliases for owners. Apr 28 2022, 6:04 PM